c# - Getting the Table texts along with the Page Text - Stack Overflow

I have an API call that uses the IronOCR library to extract text from a PDF containing multiple scanned (OCR) images.

A page image can contain only text, tables (with or without content) along with text, or only tables.

I am doing two things here: a full-page text extraction and a table-only extraction. The full-page extraction returns the entire text (including the table contents and headers), while the table extraction returns just the table content and skips the non-table text.

The issue I am facing is that when I process a page containing both text (any paragraph or other non-table content) and a table, the table contents (headers, rows, etc.) are also scanned as plain text, so they appear twice in the output.

Example scenario: I have the following page

Line 1
Line 2
Table ( with just header x and y )
Line 3
Line 4

In my output I get this

Line 1 contents
Line 2 contents
Table header contents (X and Y) (as text, should not appear)
Table row contents (if any) (as text, should not appear)
Table (as a table, which is expected, with the tags I have added in code)
Line 3 contents
Line 4 contents

Here is the code

[HttpPost("read-pdf-iron")]
        public async Task<StandardResponse<string>> ReadPdf([FromBody] PdfFilePathRequest request)
        {
            try
            {
                if (string.IsNullOrWhiteSpace(request.PdfFilePath))
                {
                    return new StandardResponse<string>
                    {
                        Status = false,
                        Message = "PDF file path cannot be empty."
                    };
                }

                // Initialize IronTesseract with more precise configuration
                var ocrTesseract = new IronTesseract();

                // Configure OCR with more granular settings
                ocrTesseract.Configuration.ReadDataTables = true;
                ocrTesseract.Configuration.WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,()-:;'\"";

                // Create OCR input
                using var ocrInput = new IronOcr.OcrInput();

                // Load PDF from file path
                ocrInput.LoadPdf(request.PdfFilePath);

                // Perform OCR
                var ocrResult = await ocrTesseract.ReadAsync(ocrInput);

                // Advanced document reading
                var advancedResult = ocrTesseract.ReadDocumentAdvanced(ocrInput);

                // StringBuilder to store the extracted content
                StringBuilder sb = new StringBuilder();

                // Counter for tracking tables and other elements
                int tableCount = 0;
                int textBlockCount = 0;

                // Process all pages
                for (int pageIndex = 0; pageIndex < ocrResult.PageCount; pageIndex++)
                {
                    sb.AppendLine($"<Page id=\"{pageIndex + 1}\">");

                    // Extract tables for the page
                    var pageTables = advancedResult.Tables.Where(t => t.Page == pageIndex + 1).ToList();

                    // Extract full page text
                    var pageText = ocrResult.Pages[pageIndex].Text;

                    // Split text into lines
                    var textLines = pageText.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

                    // Separate text before, between, and after tables
                    var textBlocks = new List<string>();
                    var currentTextBlock = new List<string>();

                    foreach (var line in textLines)
                    {
                        // Check if the line is part of any table
                        bool isTableLine = pageTables.Any(table =>
                            table.CellInfos.Any(cell =>
                                cell.CellText.Contains(line.Trim(), StringComparison.OrdinalIgnoreCase)));

                        if (!isTableLine)
                        {
                            currentTextBlock.Add(line);
                        }
                        else
                        {
                            // If the current text block is not empty, add it to the text blocks list
                            if (currentTextBlock.Any())
                            {
                                textBlocks.Add(string.Join("\n", currentTextBlock).Trim());
                                currentTextBlock.Clear();
                            }
                        }
                    }

                    // Add the last text block if it exists
                    if (currentTextBlock.Any())
                    {
                        textBlocks.Add(string.Join("\n", currentTextBlock).Trim());
                    }

                    // Write text blocks and tables in the correct order
                    int tableIndex = 0;
                    int textBlockIndex = 0;

                    while (tableIndex < pageTables.Count || textBlockIndex < textBlocks.Count)
                    {
                        // Write text blocks before the first table or between tables
                        if (textBlockIndex < textBlocks.Count)
                        {
                            textBlockCount++;
                            sb.AppendLine($"<TextBlock id=\"{textBlockCount}\">{textBlocks[textBlockIndex]}</TextBlock>");
                            textBlockIndex++;
                        }

                        // Write tables
                        if (tableIndex < pageTables.Count)
                        {
                            var table = pageTables[tableIndex];
                            tableCount++;
                            sb.Append($"<Table id=\"{tableCount}\">");

                            foreach (var cell in table.CellInfos)
                            {
                                sb.Append(cell.CellText.Trim());
                                sb.Append('|');
                            }

                            sb.Append("</Table>");
                            sb.AppendLine();
                            tableIndex++;
                        }
                    }

                    sb.AppendLine("</Page>");
                }

                return new StandardResponse<string>
                {
                    Status = true,
                    Message = $"PDF content read successfully. Found {tableCount} tables and {textBlockCount} text blocks.",
                    Data = sb.ToString()
                };
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Error reading PDF file");
                return new StandardResponse<string>
                {
                    Status = false,
                    Message = ex.Message
                };
            }
        }

For the following page

This is the output

<Page id="6">
<TextBlock id="6">G.
Employment: For each employer during last 5 years, please state:
Name of Employer City Start and End Date Occupation</TextBlock>
<Table id="1">Name of Employer|City|Start and End Date
of Employment|Occupation|RKS Construction|Pasadena|2005 to Present|Carpenter|</Table>
<TextBlock id="7">RKS Construction Pasadena 2005 to Present Carpenter
H. Other Claims(Litigation
1. Have you ever been a party to a lawsuit other than in the present lawsuit, seeking
civilmonetary damages YES L NO
If YES, identify the following as to each:
Caption Case</TextBlock>
<Table id="2">Caption  Case
No.:|Date Filed|CityState
of Court|Nature of
Action|Outcome|Your Lawyer's
Name  Address|NIA|N|NA|NA|NA|A|</Table>
<TextBlock id="8">Your Lawyers
Name Address
NJA
NJA
NJA
NJA NJA
NJA</TextBlock>
</Page>

As we can see, the table header (Name of Employer City Start and End Date Occupation) has also ended up in the text block.

I know where the problem lies: when I scan the full-page text I do not handle the table contents, so they get inserted into the text blocks. I have not been able to find an approach to handle that.


asked Mar 7 at 6:34 by Kaif Khan; edited Mar 7 at 6:39 by Uwe Keim


1 Answer


The basic problem in the code is the way it checks whether a line belongs to a table:

// Check if the line is part of any table
bool isTableLine = pageTables.Any(table =>
    table.CellInfos.Any(cell =>
        cell.CellText.Contains(line.Trim(), StringComparison.OrdinalIgnoreCase)));

This check will not catch any table lines unless the table has a single column, because .CellText returns the content of a single cell only, whereas a line of page text contains the contents of all cells in a table row.

Things would be easier if the library supported excluding table contents from the text output, but there doesn't seem to be a way to do that. Please note that this conclusion is based on quickly skimming the documentation as a first-time IronOcr (IronPdf) user. Given that, one could build the table rows manually from the cells and use them to filter out text lines that are actually table contents. A method like the one below could do that.
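To make the mismatch concrete, here is a minimal sketch that uses plain strings instead of IronOcr types. The class and method names (CellMatchDemo, MatchesSingleCell, MatchesJoinedRow) are hypothetical and only illustrate the two comparisons; the cell values are taken from the output shown in the question.

```csharp
using System;
using System.Linq;

public static class CellMatchDemo
{
    // The check from the question: is the line contained in any single cell?
    public static bool MatchesSingleCell(string[] cells, string line) =>
        cells.Any(c => c.Contains(line.Trim(), StringComparison.OrdinalIgnoreCase));

    // Alternative: compare the line with the whole row joined by spaces.
    public static bool MatchesJoinedRow(string[] cells, string line) =>
        string.Equals(string.Join(" ", cells), line.Trim(),
                      StringComparison.OrdinalIgnoreCase);
}
```

With cells { "RKS Construction", "Pasadena", "2005 to Present", "Carpenter" } and the page line "RKS Construction Pasadena 2005 to Present Carpenter", the single-cell check returns false because no individual cell contains the whole line, while the joined-row comparison returns true.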

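public HashSet<string> ExtractTableLines(List<TableInfo> tables)
{
    var tableLines = new HashSet<string>();

    foreach (var table in tables)
    {
        // Count the cells in the first row by accumulating cell widths
        // until they span the full table width.
        var tableWidth = table.BoudingRect.Width;
        var rowColumnCount = 0;
        var rowColumnsWidth = 0;

        while (rowColumnsWidth < tableWidth && rowColumnCount < table.CellInfos.Count)
        {
            rowColumnsWidth += table.CellInfos[rowColumnCount].CellRect.Width;
            ++rowColumnCount;
        }

        // Skip tables without any cells.
        if (rowColumnCount == 0)
        {
            continue;
        }

        var tableLineBuilder = new StringBuilder();
        for (var i = 0; i < table.CellInfos.Count; ++i)
        {
            // Join the cells of a row with single spaces, flattening multi-line cells.
            tableLineBuilder.Append(table.CellInfos[i].CellText.Replace('\n', ' ').Trim());
            tableLineBuilder.Append(' ');

            // Flush the row once its last cell has been appended.
            if ((i + 1) % rowColumnCount == 0)
            {
                tableLines.Add(tableLineBuilder.ToString().Trim());
                tableLineBuilder.Clear();
            }
        }
    }

    return tableLines;
}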

With this method, the isTableLine check could be done as below, where tableLines is the set returned by ExtractTableLines(pageTables).

var isTableLine = tableLines.Contains(line.Trim());

This check would catch a table row like RKS Construction Pasadena 2005 to Present Carpenter and prevent it from being duplicated. However, it is not going to catch Name of Employer City Start and End Date Occupation, because the content of the 3rd cell of the same table, Start and End Date of Employment, spans two lines. Hence, the method needs to be improved to handle this case.

This is not the only problem, though. The other problem seems to be an inconsistency between the results of ReadAsync() and ReadDocumentAdvanced(). For example, the former returns the header of the first table as Name of Employer City Start and End Date Occupation, whereas the latter cannot read City for some reason and just returns C. But I think this is beyond the scope of the question.
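One possible way to handle cells that span several lines is sketched below. It rests on an assumption that is not confirmed by the IronOcr documentation: that the full-page text lays out a row one visual line at a time, so the k-th page line of a row consists of the k-th sub-line of every cell that has one. The names TableLineExpander, Normalize and ExpandRow are hypothetical, and the method works on plain cell strings (for example the CellText values of one row) rather than on IronOcr types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TableLineExpander
{
    // Collapse runs of whitespace to single spaces and trim the ends.
    public static string Normalize(string text) =>
        Regex.Replace(text, @"\s+", " ").Trim();

    // For one table row (a list of cell texts), produce the visual lines the
    // row occupies: line k joins the k-th sub-line of every cell that has one.
    public static IEnumerable<string> ExpandRow(IReadOnlyList<string> rowCells)
    {
        var cellLines = rowCells
            .Select(c => c.Split(new[] { '\r', '\n' },
                                 StringSplitOptions.RemoveEmptyEntries))
            .ToList();
        var height = cellLines.Count == 0 ? 0 : cellLines.Max(l => l.Length);

        for (var k = 0; k < height; ++k)
        {
            var parts = cellLines.Where(l => k < l.Length).Select(l => l[k]);
            yield return Normalize(string.Join(" ", parts));
        }
    }
}
```

For the header row { "Name of Employer", "City", "Start and End Date\nof Employment", "Occupation" } this yields two candidate lines, "Name of Employer City Start and End Date Occupation" and "of Employment", which could both be added to the set of table lines. It does not address the OCR differences between ReadAsync() and ReadDocumentAdvanced() mentioned above; those would need some form of approximate matching.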

Hope this answer helps.
