Refactor PDF Processing: Introduce PdfManager with Enhanced Extraction Methods #3
Open
slonyator wants to merge 21 commits into g-stavrakis:main
A dedicated class was created for each of text, tables, and images to give a better overview.
To improve readability, the code for the "main" class was also bundled into a class with helper functions.
Enhanced the process_pdf method in PdfManager to support returning extracted text from either all pages as a single string, or from a specific page. This update improves usability and flexibility for users working with PDF text extraction.
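The page-selection behaviour described in this commit can be illustrated with a minimal sketch. It assumes the per-page text has already been extracted into a list; the helper name `select_text`, the `page` parameter, and the 1-based page numbering are hypothetical and not taken from the PR's code.

```python
from typing import List, Optional


def select_text(page_texts: List[str], page: Optional[int] = None) -> str:
    """Return the text of one page, or of all pages joined into a single string.

    `page` is 1-based here (an assumption); None means "all pages".
    """
    if page is None:
        # No page requested: concatenate every page, one per line break.
        return "\n".join(page_texts)
    if not 1 <= page <= len(page_texts):
        raise ValueError(
            f"page must be between 1 and {len(page_texts)}, got {page}"
        )
    return page_texts[page - 1]
```

Calling `select_text(texts)` returns everything as one string, while `select_text(texts, page=2)` returns only the second page.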
refactor: All classes into a single file
temp: Changed PDF doc
feat: Remove text duplicates added
feat: fuzzy-wuzzy check for duplicate detection
To check whether a text block has already been extracted (text is sometimes recognized both as an image and as text, and is therefore extracted twice), we now use fuzzy string matching, which gives a similarity score estimating whether two text blocks are identical. This logic is meant to tackle the issue of text duplicates.
style: Code blacked
style: variable renamed
refactor: Methods changed to static
doc: Doc-Strings added
style: File blacked
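The duplicate check described in the fuzzy-wuzzy commit above can be sketched roughly as follows. To stay dependency-free, this sketch swaps in `difflib.SequenceMatcher` from the standard library instead of fuzzy-wuzzy; the function name `remove_near_duplicates` and the 0.9 threshold are assumptions for illustration, not values from the PR.

```python
from difflib import SequenceMatcher
from typing import List


def _normalize(text: str) -> str:
    # Collapse whitespace and ignore case so layout differences do not matter.
    return " ".join(text.split()).lower()


def remove_near_duplicates(blocks: List[str], threshold: float = 0.9) -> List[str]:
    """Keep the first occurrence of each block; drop later near-identical ones.

    Two blocks count as duplicates when their similarity ratio is at least
    `threshold` (0.9 is an assumed value, tune it for real data).
    """
    kept: List[str] = []
    for block in blocks:
        candidate = _normalize(block)
        is_duplicate = any(
            SequenceMatcher(None, candidate, _normalize(seen)).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(block)
    return kept
```

The comparison is quadratic in the number of blocks, which is usually acceptable per page but may explain why switching libraries alone gives only a small speed-up.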
No real speed improvement so far, just a tiny bit.
fuzzy-wuzzy replaced by rapidfuzz
Replaced the existing string matching method with Jaro-Winkler similarity. Updated the remove_duplicated_text method in PdfManager to utilize Jaro-Winkler for assessing text similarity.
Jaro-Winkler for String Comparison
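For readers unfamiliar with the metric, here is a dependency-free sketch of the standard Jaro-Winkler definition. It is not the PR's code, which relies on a library for this; the function names are hypothetical, and the prefix weight of 0.1 with a maximum prefix length of 4 are the conventional defaults.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for strings sharing a prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)
```

Because the score rewards a shared prefix, it tends to suit short strings that start the same way; for long text blocks the choice of threshold matters more than the choice of metric.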
# Conflicts:
#   pdf_image_extractor.py
#   pdf_manager.py
#   pdf_table_extractor.py
#   pdf_text_extractor.py
No duplicates
Overview
This pull request represents a comprehensive refactor of the initial PDF processing notebook. The goal is to enhance code readability, maintainability, and functionality. The changes introduce a new class-based architecture, offering a more modular and scalable approach to PDF text, image, and table extraction.
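The class-based layout described above might look roughly like the following sketch. Only the class names and `process_pdf` come from this pull request; the `extract` methods, the dict-based page model, and the joining logic are hypothetical stand-ins so that the example runs without any PDF or OCR library.

```python
from typing import Dict, List

# A page is modelled as a plain dict here (an assumption for illustration);
# the real classes work on actual PDF pages.
Page = Dict[str, List[str]]


class PdfTextExtractor:
    @staticmethod
    def extract(page: Page) -> List[str]:
        # Stand-in for native text extraction.
        return list(page.get("text", []))


class PdfImageExtractor:
    @staticmethod
    def extract(page: Page) -> List[str]:
        # Stand-in for image extraction followed by OCR.
        return list(page.get("image_text", []))


class PdfTableExtractor:
    @staticmethod
    def extract(page: Page) -> List[str]:
        # Stand-in for tables already rendered as strings.
        return list(page.get("tables", []))


class PdfManager:
    """Orchestrates the three extractors; each extractor has one responsibility."""

    def __init__(self, pages: List[Page]) -> None:
        self.pages = pages

    def process_pdf(self) -> str:
        page_texts = []
        for page in self.pages:
            parts = (
                PdfTextExtractor.extract(page)
                + PdfImageExtractor.extract(page)
                + PdfTableExtractor.extract(page)
            )
            page_texts.append("\n".join(parts))
        # Return the text of all pages as a single string.
        return "\n".join(page_texts)
```

Keeping the extractors separate means each can be tested or replaced on its own, while callers only interact with `PdfManager`.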
Major Changes
PdfManager class: Centralizes the PDF processing logic, providing a cleaner and more organized code structure. The PdfManager class orchestrates the extraction of text, images, and tables from PDF documents.
PdfTextExtractor, PdfImageExtractor, and PdfTableExtractor classes: Each class focuses on a specific aspect of PDF processing (text, images, tables), adhering to the Single Responsibility Principle.
PdfTextExtractor handles text extraction.
PdfImageExtractor handles image extraction from PDFs and employs OCR to extract text from images.
PdfTableExtractor is responsible for extracting tables and converting them to a user-friendly string format.
The process_pdf method in PdfManager allows users to retrieve either the full text from all PDF pages or text from a specific page.
PdfManager also handles the removal of temporary files created during processing.
Benefits
Conclusion
This refactor significantly improves the structure and capabilities of the initial PDF processing approach. It's a step forward in making PDF data extraction more accessible and efficient for various use cases and enhances the overall management of project dependencies.