A general-purpose FastAPI microservice that extracts text, structured data from tables, and metadata from various document types (TXT, DOCX, PDF, etc.).

akasr/papermill

Papermill

A general-purpose FastAPI microservice that extracts text, structured data from tables, and metadata from various document types (TXT, DOCX, PDF, etc.).

Core Functions:

  • Document Parsing: Handles multiple file types (TXT, DOCX and PDF).

  • Metadata Extraction: Extracts standard metadata such as author and creation date.

  • Output: Returns a comprehensive JSON object containing all the extracted information.

Installation and Setup

# Fork and clone the repository
git clone https://github.com/<username>/papermill.git
cd papermill

# Create a branch for your changes
git checkout -b feature/your-feature-name

# Create and activate a virtual environment
uv venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`

# Install dependencies
uv pip sync requirements.txt

# Start the FastAPI Server
uvicorn src.app:app --reload

Push your branch and open a pull request when your changes are ready.

Endpoints

  • POST /extract: Upload a document and receive extracted text and metadata.
  • POST /extract/url: Provide a URL to a document for extraction.
  • GET /health: Check the health status of the service.
  • GET /docs: Access the interactive API documentation.
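
As a rough illustration of calling these endpoints, here is a minimal client sketch using only the Python standard library. It assumes the server is running at uvicorn's default address (http://127.0.0.1:8000). The multipart form field name `file`, the JSON key `url`, and the helper names are assumptions made for illustration, not taken from the project; check GET /docs for the actual request schema.

```python
import json
import uuid
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # uvicorn's default host and port


def build_upload_request(filename, content, base_url=BASE_URL, field="file"):
    """Build a multipart/form-data POST for /extract.

    The form field name ("file") is an assumption; check /docs for the real one.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{base_url}/extract",
        data=head + content + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def build_url_request(document_url, base_url=BASE_URL):
    """Build a JSON POST for /extract/url. The "url" key is an assumption."""
    body = json.dumps({"url": document_url}).encode()
    return urllib.request.Request(
        f"{base_url}/extract/url",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server started as described in the setup steps, `send(build_upload_request("notes.txt", b"hello world"))` should return the parsed JSON response, provided the assumed field name matches the service.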
