
PDF Data Extractor 📄➡️📊

This project automates the extraction of text and tables from multiple PDF files, transforming unstructured documents into TXT and CSV files ready for analysis, reporting, or database ingestion.

The script automatically scans an input directory, processes each PDF found, and generates structured outputs without manual intervention.
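The directory scan described above could look roughly like the following. This is a minimal sketch, not the repository's actual code: the function name `find_pdfs` is hypothetical, and it assumes only the top level of the input folder is scanned.

```python
from pathlib import Path


def find_pdfs(input_dir: Path) -> list[Path]:
    """Return the PDF files directly inside input_dir, sorted by name.

    Hypothetical helper: matches the .pdf suffix case-insensitively
    and skips directories, even ones whose name ends in .pdf.
    """
    return sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `find_pdfs(Path("data/inputs"))` would then yield one path per PDF to hand to the text and table extraction steps.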


🚀 Features

  • 📂 Processes all PDF files inside an input folder
  • 📝 Extracts text from each page and saves it as a .txt file
  • 📊 Extracts tables from PDFs and exports them as .csv files
  • 🔁 Handles multiple PDFs without overwriting outputs
  • 🧩 Modular and reusable code structure
  • ⚙️ Designed for automation workflows
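As an illustration of the text-extraction feature, the sketch below writes the text of every page to one .txt file per PDF. The function names and the naming scheme (output named after the PDF's stem, which is one way to keep outputs from different PDFs from overwriting each other) are assumptions, not taken from the repository. It uses PyMuPDF's `fitz.open` and `Page.get_text`.

```python
from pathlib import Path


def text_output_path(pdf_path: Path, output_dir: Path) -> Path:
    """Map e.g. report.pdf to <output_dir>/report.txt (assumed naming scheme)."""
    return output_dir / f"{pdf_path.stem}.txt"


def extract_text(pdf_path: Path, output_dir: Path) -> Path:
    """Write the text of every page of pdf_path to a .txt file and return its path."""
    import fitz  # PyMuPDF; imported lazily so text_output_path works without it

    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = text_output_path(pdf_path, output_dir)
    with fitz.open(str(pdf_path)) as doc:
        # Iterating a document yields its pages in order.
        pages = [page.get_text() for page in doc]
    out_path.write_text("\n".join(pages), encoding="utf-8")
    return out_path
```

Because the output name is derived from the input name, two different PDFs never share an output file, although rerunning the script on the same PDF would replace its earlier output.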

🛠️ Technologies Used

  • Python
  • PyMuPDF (fitz) – PDF text extraction
  • tabula-py – PDF table extraction
  • pandas – tabular data processing
  • pathlib – file system and path handling

⚠️ Note: tabula-py requires Java to be installed on the system.
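Since tabula-py depends on Java, a script can check for it before attempting table extraction. The following is a sketch under assumptions: the function names and the `<stem>_table_<n>.csv` naming are hypothetical, and the check only looks for a `java` executable on PATH. It relies on `tabula.read_pdf`, which returns a list of pandas DataFrames when `multiple_tables=True`.

```python
import shutil
from pathlib import Path


def java_available() -> bool:
    """Return True if a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


def table_output_path(pdf_path: Path, output_dir: Path, index: int) -> Path:
    """Map report.pdf, table 2 to <output_dir>/report_table_2.csv (assumed naming)."""
    return output_dir / f"{pdf_path.stem}_table_{index}.csv"


def extract_tables(pdf_path: Path, output_dir: Path) -> list[Path]:
    """Export every table tabula-py detects in pdf_path to its own CSV file."""
    if not java_available():
        raise RuntimeError("tabula-py needs Java; no `java` executable found on PATH")
    import tabula  # tabula-py; imported lazily so the helpers above work without it

    output_dir.mkdir(parents=True, exist_ok=True)
    tables = tabula.read_pdf(str(pdf_path), pages="all", multiple_tables=True)
    written = []
    for index, table in enumerate(tables, start=1):
        out_path = table_output_path(pdf_path, output_dir, index)
        table.to_csv(out_path, index=False)  # each table is a pandas DataFrame
        written.append(out_path)
    return written
```

Failing early with a clear message is a design choice here; the alternative is to let tabula-py raise its own error when it cannot start Java.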


📁 Project Structure

pdf-data-extractor/
│
├── data/
│   ├── inputs/            # Input PDF files
│   └── outputs/           # Generated TXT and CSV files
│
├── src/
│   ├── __init__.py
│   ├── text_extractor.py
│   └── table_extractor.py
│
├── main.py
├── requirements.txt
└── README.md

📌 The data/inputs and data/outputs folders include sample files for demonstration purposes.


⚙️ Installation & Setup

  1. Clone the repository:
git clone https://github.com/ale687/pdf-data-extractor.git
cd pdf-data-extractor

---

👤 Author

Claudio Alejandro Ledesma
Information Systems Engineering student (UTN)

Areas of Interest

  • Backend Development
  • Data Engineering
  • Process Automation
