I have an API call that uses the IronOCR library to extract text from a PDF containing multiple OCR images.
An OCR image can contain only text, tables (with or without content) along with text, or only tables.
I am doing two things here: a full-page text extraction and a table-only extraction. The full-page extraction returns the entire text (including the table contents and headers), and the table extraction returns just the table content, skipping the non-table text.
The issue is that when I run the extraction on a page that contains both text (any paragraph or other non-table content) and a table, the table contents (headers, rows, etc.) are also scanned as plain text, so they appear twice in the output.
Example scenario: I have the following page:
- Line 1
- Line 2
- Table ( with just header x and y )
- Line 3
- Line 4
In my output I get this:
- Line 1 contents
- Line 2 contents
- Table header contents (X and Y) (as text; should not appear)
- Table row contents, if any (as text; should not appear)
- Table (as a table, which is expected, wrapped in the tags I add in code)
- Line 3 contents
- Line 4 contents
Here is the code
[HttpPost("read-pdf-iron")]
public async Task<StandardResponse<string>> ReadPdf([FromBody] PdfFilePathRequest request)
{
    try
    {
        if (string.IsNullOrWhiteSpace(request.PdfFilePath))
        {
            return new StandardResponse<string>
            {
                Status = false,
                Message = "PDF file path cannot be empty."
            };
        }
        // Initialize IronTesseract with more precise configuration
        var ocrTesseract = new IronTesseract();
        // Configure OCR with more granular settings
        ocrTesseract.Configuration.ReadDataTables = true;
        ocrTesseract.Configuration.WhiteListCharacters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,()-:;'\"";
        // Create OCR input
        using var ocrInput = new IronOcr.OcrInput();
        // Load PDF from file path
        ocrInput.LoadPdf(request.PdfFilePath);
        // Perform OCR
        var ocrResult = await ocrTesseract.ReadAsync(ocrInput);
        // Advanced document reading
        var advancedResult = ocrTesseract.ReadDocumentAdvanced(ocrInput);
        // StringBuilder to store the extracted content
        StringBuilder sb = new StringBuilder();
        // Counter for tracking tables and other elements
        int tableCount = 0;
        int textBlockCount = 0;
        // Process all pages
        for (int pageIndex = 0; pageIndex < ocrResult.PageCount; pageIndex++)
        {
            sb.AppendLine($"<Page id=\"{pageIndex + 1}\">");
            // Extract tables for the page
            var pageTables = advancedResult.Tables.Where(t => t.Page == pageIndex + 1).ToList();
            // Extract full page text
            var pageText = ocrResult.Pages[pageIndex].Text;
            // Split text into lines
            var textLines = pageText.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
            // Separate text before, between, and after tables
            var textBlocks = new List<string>();
            var currentTextBlock = new List<string>();
            foreach (var line in textLines)
            {
                // Check if the line is part of any table
                bool isTableLine = pageTables.Any(table =>
                    table.CellInfos.Any(cell =>
                        cell.CellText.Contains(line.Trim(), StringComparison.OrdinalIgnoreCase)));
                if (!isTableLine)
                {
                    currentTextBlock.Add(line);
                }
                else
                {
                    // If the current text block is not empty, add it to the text blocks list
                    if (currentTextBlock.Any())
                    {
                        textBlocks.Add(string.Join("\n", currentTextBlock).Trim());
                        currentTextBlock.Clear();
                    }
                }
            }
            // Add the last text block if it exists
            if (currentTextBlock.Any())
            {
                textBlocks.Add(string.Join("\n", currentTextBlock).Trim());
            }
            // Write text blocks and tables in the correct order
            int tableIndex = 0;
            int textBlockIndex = 0;
            while (tableIndex < pageTables.Count || textBlockIndex < textBlocks.Count)
            {
                // Write text blocks before the first table or between tables
                if (textBlockIndex < textBlocks.Count)
                {
                    textBlockCount++;
                    sb.AppendLine($"<TextBlock id=\"{textBlockCount}\">{textBlocks[textBlockIndex]}</TextBlock>");
                    textBlockIndex++;
                }
                // Write tables
                if (tableIndex < pageTables.Count)
                {
                    var table = pageTables[tableIndex];
                    tableCount++;
                    sb.Append($"<Table id=\"{tableCount}\">");
                    foreach (var cell in table.CellInfos)
                    {
                        sb.Append(cell.CellText.Trim());
                        sb.Append('|');
                    }
                    sb.Append("</Table>");
                    sb.AppendLine();
                    tableIndex++;
                }
            }
            sb.AppendLine("</Page>");
        }
        return new StandardResponse<string>
        {
            Status = true,
            Message = $"PDF content read successfully. Found {tableCount} tables and {textBlockCount} text blocks.",
            Data = sb.ToString()
        };
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, "Error reading PDF file");
        return new StandardResponse<string>
        {
            Status = false,
            Message = ex.Message
        };
    }
}
For a sample page (the page image is not reproduced here), this is the output:
<Page id="6">
<TextBlock id="6">G.
Employment: For each employer during last 5 years, please state:
Name of Employer City Start and End Date Occupation</TextBlock>
<Table id="1">Name of Employer|City|Start and End Date
of Employment|Occupation|RKS Construction|Pasadena|2005 to Present|Carpenter|</Table>
<TextBlock id="7">RKS Construction Pasadena 2005 to Present Carpenter
H. Other Claims(Litigation
1. Have you ever been a party to a lawsuit other than in the present lawsuit, seeking
civilmonetary damages YES L NO
If YES, identify the following as to each:
Caption Case</TextBlock>
<Table id="2">Caption Case
No.:|Date Filed|CityState
of Court|Nature of
Action|Outcome|Your Lawyer's
Name Address|NIA|N|NA|NA|NA|A|</Table>
<TextBlock id="8">Your Lawyers
Name Address
NJA
NJA
NJA
NJA NJA
NJA</TextBlock>
</Page>
As we can see, the table headers (Name of Employer City Start and End Date Occupation) have also ended up in the text block.
I know where the problem lies: when I scan the full-page text I am not excluding the table contents, so they get inserted into the TextBlock. I have not been able to find an approach to handle that.
asked Mar 7 at 6:34 by Kaif Khan; edited Mar 7 at 6:39 by Uwe Keim
1 Answer
The basic problem in the code is how it checks whether a line is table content, using the code below.
// Check if the line is part of any table
bool isTableLine = pageTables.Any(table =>
    table.CellInfos.Any(cell =>
        cell.CellText.Contains(line.Trim(), StringComparison.OrdinalIgnoreCase)));
This statement will not catch any table lines unless the table has a single column, because .CellText only returns the content of a single cell, whereas a line contains the contents of all cells in a table row.
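To make this concrete, here is a minimal sketch that uses plain strings in place of IronOcr objects. The class and method names (`ContainsCheckDemo`, `IsTableLineOriginal`) are hypothetical and exist only for illustration; the cell values are taken from the sample output in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ContainsCheckDemo
{
    // Mirrors the original check: does any single cell's text contain the whole line?
    // cellTexts stands in for table.CellInfos.Select(c => c.CellText).
    public static bool IsTableLineOriginal(IEnumerable<string> cellTexts, string line) =>
        cellTexts.Any(cell => cell.Contains(line.Trim(), StringComparison.OrdinalIgnoreCase));
}
```

With the cells `RKS Construction`, `Pasadena`, `2005 to Present` and `Carpenter`, the full OCR line `RKS Construction Pasadena 2005 to Present Carpenter` is not contained in any single cell, so the check reports it as non-table text. A line that happens to equal one cell, such as `Pasadena`, is the only kind that matches.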
Things would be easier if the library supported excluding table contents from the text output, but there does not appear to be a way to do that. Please note that this conclusion is based on a quick skim of the documentation as a first-time IronOcr (IronPdf) user. Given that, one could manually assemble the table contents into rows and use them to filter out text lines that are actually table content. A method like the one below could do that.
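public HashSet<string> ExtractTableLines(List<TableInfo> tables)
{
    var tableLines = new HashSet<string>();
    foreach (var table in tables)
    {
        // Count the cells in the first row by summing cell widths until the table width is reached.
        // BoudingRect is spelled as in the original answer; verify the property name against your IronOcr version.
        var tableWidth = table.BoudingRect.Width;
        var rowColumnCount = 0;
        var rowColumnsWidth = 0;
        while (rowColumnsWidth < tableWidth && rowColumnCount < table.CellInfos.Count)
        {
            rowColumnsWidth += table.CellInfos[rowColumnCount].CellRect.Width;
            ++rowColumnCount;
        }
        if (rowColumnCount == 0)
        {
            continue; // Table without cells: nothing to collect.
        }
        var tableLineBuilder = new StringBuilder();
        for (var i = 0; i < table.CellInfos.Count; ++i)
        {
            // Join cell texts with a space so a row matches the space-separated OCR line.
            tableLineBuilder.Append(table.CellInfos[i].CellText.Replace('\n', ' ')).Append(' ');
            // Flush after appending the last cell of a row, so that cell stays in its own row.
            if ((i + 1) % rowColumnCount == 0)
            {
                tableLines.Add(tableLineBuilder.ToString().Trim());
                tableLineBuilder.Clear();
            }
        }
    }
    return tableLines;
}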
With this method, the isTableLine check could be done as below.
var isTableLine = tableLines.Contains(line);
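As a rough illustration of how the set lookup slots into the block-splitting loop from the question, here is a sketch on plain strings. `TextBlockSplitter` and `SplitIntoTextBlocks` are hypothetical names, and trimming each line before the lookup is an assumption added here, not part of the original snippet.

```csharp
using System;
using System.Collections.Generic;

public static class TextBlockSplitter
{
    // Groups consecutive non-table lines into blocks; a table line closes the current block.
    public static List<string> SplitIntoTextBlocks(IEnumerable<string> lines, ISet<string> tableLines)
    {
        var blocks = new List<string>();
        var current = new List<string>();
        foreach (var line in lines)
        {
            // Assumption: trim before the lookup so stray whitespace does not defeat the match.
            if (tableLines.Contains(line.Trim()))
            {
                if (current.Count > 0)
                {
                    blocks.Add(string.Join("\n", current).Trim());
                    current.Clear();
                }
                continue;
            }
            current.Add(line);
        }
        // Keep whatever text follows the last table line.
        if (current.Count > 0)
        {
            blocks.Add(string.Join("\n", current).Trim());
        }
        return blocks;
    }
}
```

The lookup is an exact string match, so it only removes lines whose text is identical to a reconstructed table row.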
This check would catch a table row like RKS Construction Pasadena 2005 to Present Carpenter and prevent it from being duplicated. However, it is not going to catch Name of Employer City Start and End Date Occupation, because the content of the 3rd cell of the same table, Start and End Date of Employment, spans two lines. Hence, the method needs to be improved to handle this case, but this is not the only problem. The other problem seems to be the inconsistency between the results of ReadAsync() and ReadDocumentAdvanced(). For example, the former returns the header of the first table as Name of Employer City Start and End Date Occupation, whereas the latter cannot read City for some reason and just returns C. But I think this is beyond the scope of the question.
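One possible way to tolerate both multi-line cells and small OCR differences is to compare words instead of whole lines. The sketch below is a substitute heuristic, not something IronOcr provides and not part of the method above: it treats a line as table content when most of its words also occur somewhere in the table's cells. All names (`FuzzyTableLineMatcher`, `Tokenize`, `BuildTableVocabulary`, `IsLikelyTableLine`) and the 0.8 threshold are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FuzzyTableLineMatcher
{
    // Lower-cased alphanumeric tokens; punctuation and line breaks act as separators.
    public static List<string> Tokenize(string text) =>
        Regex.Matches(text.ToLowerInvariant(), "[a-z0-9]+")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToList();

    // Collects every token that occurs in any cell of one table.
    // cellTexts stands in for table.CellInfos.Select(c => c.CellText).
    public static HashSet<string> BuildTableVocabulary(IEnumerable<string> cellTexts) =>
        new HashSet<string>(cellTexts.SelectMany(Tokenize));

    // True when at least `threshold` of the line's tokens occur in the table vocabulary.
    // The default of 0.8 is an arbitrary starting point and needs tuning on real pages.
    public static bool IsLikelyTableLine(string line, ISet<string> tableVocabulary, double threshold = 0.8)
    {
        var tokens = Tokenize(line);
        if (tokens.Count == 0)
        {
            return false;
        }
        var hits = tokens.Count(t => tableVocabulary.Contains(t));
        return (double)hits / tokens.Count >= threshold;
    }
}
```

With the first table's cells, including the misread `C` and the two-line `Start and End Date of Employment` cell, the header line from ReadAsync() still has eight of its nine words in the vocabulary and is classified as table content. The trade-off is that a prose line made up mostly of words that also appear in a table would be dropped as well, so the threshold should be checked against real documents.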
Hope this answer helps.