c# - Unable to extract the text from the table in an OCR pdf document

I am trying to extract a table and its contents using Aspose.PDF and Aspose.OCR. I have a single-page PDF with an image containing a table that needs OCR.

Below is my code to extract the table using both libraries:

        [HttpPost("read-pdf-aspose")]
        public async Task<StandardResponse<string>> ReadPdfAspose([FromBody] PdfFilePathRequest request)
        {
            try
            {
                if (string.IsNullOrWhiteSpace(request.PdfFilePath))
                {
                    return new StandardResponse<string>
                    {
                        Status = false,
                        Message = "PDF file path cannot be empty."
                    };
                }

                // Create temporary directory for images
                string tempPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString());
                Directory.CreateDirectory(tempPath);

                try
                {
                    // Load PDF document
                    using (Document pdfDocument = new Document(request.PdfFilePath))
                    {
                        // Initialize OCR engine
                        Aspose.OCR.AsposeOcr recognitionEngine = new Aspose.OCR.AsposeOcr();
                        StringBuilder extractedText = new StringBuilder();

                        // Process each page
                        for (int pageIndex = 1; pageIndex <= pdfDocument.Pages.Count; pageIndex++)
                        {
                            var page = pdfDocument.Pages[pageIndex];

                            // Save images from the page
                            for (int imgIndex = 1; imgIndex <= page.Resources.Images.Count; imgIndex++)
                            {
                                string imagePath = Path.Combine(tempPath, $"page_{pageIndex}_img_{imgIndex}.png");

                                // Extract and save the image
                                using (FileStream imageStream = new(imagePath, FileMode.Create))
                                {
                                    page.Resources.Images[imgIndex].Save(imageStream);
                                }

                                // Set up OCR for table detection
                                Aspose.OCR.OcrInput input = new Aspose.OCR.OcrInput(Aspose.OCR.InputType.SingleImage);
                                input.Add(imagePath);

                                // Configure to detect tables
                                Aspose.OCR.RecognitionSettings recognitionSettings = new Aspose.OCR.RecognitionSettings();
                                recognitionSettings.DetectAreasMode = Aspose.OCR.DetectAreasMode.Table;

                                // Perform recognition
                                Aspose.OCR.OcrOutput results = recognitionEngine.Recognize(input, recognitionSettings);

                                // Collect recognized text
                                extractedText.AppendLine($"--- Table content from Page {pageIndex}, Image {imgIndex} ---");
                                foreach (Aspose.OCR.RecognitionResult result in results)
                                {
                                    extractedText.AppendLine(result.RecognitionText);
                                }
                                extractedText.AppendLine();
                            }
                        }

                        return new StandardResponse<string>
                        {
                            Status = true,
                            Message = "Tables extracted successfully from PDF using Aspose",
                            Data = extractedText.ToString()
                        };
                    }
                }
                finally
                {
                    // Cleanup: Delete temporary directory and files
                    if (Directory.Exists(tempPath))
                    {
                        Directory.Delete(tempPath, true);
                    }
                }
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Error processing PDF with Aspose");
                return new StandardResponse<string>
                {
                    Status = false,
                    Message = ex.Message
                };
            }
        }

When I call the endpoint, this is the output I get:

{
    "status": true,
    "message": "Tables extracted successfully from PDF using Aspose",
    "data": "--- Table content from Page 1, Image 1 ---\r\nThe Main Table reguired for evaluation\nHeader1 Second Header Third Header Fourth Header\nFirst First Sample First Second First Third Sample First Fourth\nSample Sample\nSecond First Second Second Second Third Second Fourth\nSample Sample Sample Sample\nThird First Sample Third Second Third Third Sample Third Fourth\nSample Sample\nSample text for differentiation between Table data and non table data\nIpsum has been the industry's standard dummy text ever since the 1500s,when an\nunknown printer took a galley of type and scrambled it to make a type specimen book. It\nhas survived not only five centuries,but also the leap into electronic typesetting,remaining\nessentiallyl\r\n\r\n"
}

Now, if I comment out the line

recognitionSettings.DetectAreasMode = Aspose.OCR.DetectAreasMode.Table;

I get the following response:

{
    "status": true,
    "message": "Tables extracted successfully from PDF using Aspose",
    "data": "--- Table content from Page 1, Image 1 ---\r\nThe Main Table reguired for evaluation\nHeader1 Second Header Third Header Fourth Header\nFirst First Sample First Second First Third Sample First Fourth\nSample Sample\nSecond First Second Second Second Third Second Fourth\nSample Sample Sample Sample\nThird First Sample Third Second Third Third Sample Third Fourth\nSample Sample\nSample text for differentiation between Table data and non table data\nIpsum has been the industry's standard dummy text ever since the 1500s,when an\nunknown printer took a galley of type and scrambled it to make a type specimen book. It\nhas survived not only five centuries,but also the leap into electronic typesetting,remaining\nessentiallyl\r\n\r\n"
}

The two responses are identical, so setting DetectAreasMode makes no difference to the result. How can I extract the table contents correctly? Any help would be appreciated.

