python - LLM compiling progress stuck after a random amount of PDF processed - Stack Overflow


I was trying to do entity extraction on a number of PDF files in a folder, but the code gets stuck after a random number of PDFs have been processed: the process keeps running, but no new output appears (I also waited 10 minutes to make sure it wasn't just slow). I have already debugged some possible problems:

    import json
    import logging
    import os

    from langchain_community.document_loaders import PyPDFLoader

    for filename in os.listdir(pdf_folder_path):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(pdf_folder_path, filename)
            print(f"Processing: {filename}")

            # Load the entire PDF text
            try:
                loader = PyPDFLoader(pdf_path)
                documents = loader.load()  # Load the entire document

                # Check if documents are empty
                if not documents:
                    print(f"Warning: No content found in {filename}. Skipping.")
                    continue  # Skip to the next PDF if no documents found

                # Concatenate all page content (if multiple pages)
                full_text = " ".join(doc.page_content for doc in documents if doc.page_content.strip())

                # If full_text is empty after concatenation, skip the document
                if not full_text.strip():
                    print(f"Warning: Empty content after extraction in {filename}. Skipping.")
                    continue

                # Process the full document through the chain
                chain_result = chain.invoke({"input": full_text})
                print(f"Entities extracted from {filename}:\n{chain_result}\n")

                extracted_data.append({
                    "filename": filename,
                    "entities": chain_result,
                })

            except Exception as e:
                print(f"Error processing {filename}: {e}")
                logging.error(f"Error processing {filename}: {e}")

    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(extracted_data, f, ensure_ascii=False, indent=4)

print(f"All PDFs processed successfully. Results saved to {output_file}.")
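One way to confirm whether the hang is inside `chain.invoke` is to run the call with a hard timeout, so a stuck file raises an error instead of blocking forever. A minimal sketch using only the standard library (the `fast` and `slow` functions below are hypothetical stand-ins for the real chain call):

```python
import concurrent.futures
import time

def invoke_with_timeout(fn, payload, timeout_s=120):
    """Run fn(payload) in a worker thread; raise TimeoutError if it exceeds timeout_s."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, payload)
        return future.result(timeout=timeout_s)

# Demo with stand-ins for chain.invoke:
def fast(x):
    return {"entities": x.upper()}

def slow(x):
    time.sleep(2)  # simulates a hung model call
    return x

print(invoke_with_timeout(fast, "acme corp", timeout_s=2))  # → {'entities': 'ACME CORP'}
try:
    invoke_with_timeout(slow, "x", timeout_s=0.5)
except concurrent.futures.TimeoutError:
    print("timed out")  # now you know exactly which file triggered the hang
```

Note that the worker thread itself keeps running after the timeout; the point is only to surface which file stalls, e.g. by logging the filename and moving on.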

I already handle the case where a PDF cannot be read, cannot be parsed, or contains nothing (it is skipped), and I have tried both the GPT-4o and GPT-4o-mini models. I cannot use chunking, because with chunking there is a chance that a single file is turned into two separate sets of entities.
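Since the stall happens after a random number of files, one unusually large document is a plausible culprit. A rough pre-flight size check can flag those files without chunking them; this sketch assumes roughly 4 characters per token and a 128k-token context window, both numbers you would adjust for the actual model:

```python
# Assumed context window for the model in use; adjust as needed.
MAX_TOKENS = 128_000

def approx_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English text)."""
    return len(text) // 4

def fits_context(text: str, max_tokens: int = MAX_TOKENS) -> bool:
    """True if the text is likely to fit in the model's context window."""
    return approx_tokens(text) <= max_tokens

print(fits_context("short document text"))  # → True
print(fits_context("x" * 1_000_000))        # ~250k tokens → False
```

Files that fail this check could be logged and skipped (or summarized separately) instead of being sent to the chain whole.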

Does anyone know what might be happening, and how can I debug it?
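As a first debugging step, timestamped output that is flushed immediately shows exactly which file the loop was on when it stalled. A minimal standard-library sketch (the filename is just an example):

```python
import logging
import sys

# Timestamped, immediately-flushed progress lines: the last line printed
# before the stall identifies the offending PDF.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(message)s",
    force=True,  # override any earlier basicConfig call (Python 3.8+)
)
logging.info("Processing: %s", "example.pdf")
```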
