This project automates the extraction of text and tables from multiple PDF files, transforming unstructured documents into TXT and CSV files ready for analysis, reporting, or database ingestion.
The script automatically scans an input directory, processes each PDF found, and generates structured outputs without manual intervention.
- Processes all PDF files inside an input folder
- Extracts text from each page and saves it as a `.txt` file
- Extracts tables from PDFs and exports them as `.csv` files
- Handles multiple PDFs without overwriting outputs
- Modular and reusable code structure
- Designed for automation workflows
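The scanning and no-overwrite behaviour listed above can be sketched with `pathlib` alone. This is a minimal illustration, not the project's actual code: the function names `find_pdfs`, `text_output_path`, and `table_output_path`, and the `<stem>_table_<n>.csv` naming scheme, are assumptions made for the example.

```python
from pathlib import Path


def find_pdfs(input_dir: Path) -> list[Path]:
    """Return the PDF files directly inside input_dir, sorted by name."""
    return sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def text_output_path(pdf_path: Path, output_dir: Path) -> Path:
    """Hypothetical naming scheme: one TXT file per PDF, named after its stem."""
    return output_dir / f"{pdf_path.stem}.txt"


def table_output_path(pdf_path: Path, output_dir: Path, index: int) -> Path:
    """Hypothetical naming scheme: one CSV per table, numbered from 1."""
    return output_dir / f"{pdf_path.stem}_table_{index}.csv"
```

Because every output name starts with the stem of its source PDF, files produced from `report.pdf` and `invoice.pdf` cannot collide as long as the input names differ.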
- Python
- PyMuPDF (fitz): PDF text extraction
- tabula-py: PDF table extraction
- pandas: tabular data processing
- pathlib: file system and path handling
tabula-py requires Java to be installed on the system.
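As a rough sketch of how these libraries might fit together, the snippet below reads text with PyMuPDF and tables with tabula-py, checking for Java first. The function names `java_available`, `extract_text`, and `extract_tables` are hypothetical and may not match what `text_extractor.py` and `table_extractor.py` actually contain.

```python
import shutil
from pathlib import Path


def java_available() -> bool:
    """Report whether a `java` executable is on PATH (tabula-py needs one)."""
    return shutil.which("java") is not None


def extract_text(pdf_path: Path) -> str:
    """Concatenate the plain text of every page in the PDF."""
    import fitz  # PyMuPDF, imported lazily so this module loads without it

    with fitz.open(str(pdf_path)) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_tables(pdf_path: Path) -> list:
    """Return the tables tabula-py detects, as a list of pandas DataFrames."""
    if not java_available():
        raise RuntimeError("Java was not found on PATH; tabula-py requires it.")
    import tabula  # tabula-py, imported lazily for the same reason

    return tabula.read_pdf(str(pdf_path), pages="all", multiple_tables=True)
```

Each returned DataFrame can then be written out with `df.to_csv(path, index=False)`.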
    pdf-data-extractor/
    │
    ├── data/
    │   ├── inputs/      # Input PDF files
    │   └── outputs/     # Generated TXT and CSV files
    │
    ├── src/
    │   ├── __init__.py
    │   ├── text_extractor.py
    │   ├── table_extractor.py
    │   └── main.py
    ├── requirements.txt
    └── README.md
The data/inputs and data/outputs folders include sample files for demonstration purposes.
- Clone the repository:
git clone https://github.com/ale687/pdf-data-extractor.git
cd pdf-data-extractor
---
Author
Claudio Alejandro Ledesma
Information Systems Engineering student (UTN)
Areas of Interest
Backend Development
Data Engineering
Process Automation