excoffierleonard/parser

Parser

A Rust library and website for extracting text from various document formats.

Features

  • PDF, DOCX, XLSX, PPTX documents
  • OCR for images (PNG, JPEG, WebP) with English and French support
  • Plain text formats (TXT, CSV, JSON)
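The usage example passes raw bytes rather than a file path, so the format presumably has to be inferred from the content itself. The following is a minimal sketch of how such content sniffing could work using well-known file signatures. It is an illustration under that assumption, not this library's implementation: the `Kind` enum and the `sniff` function are hypothetical names that do not come from the crate.

```rust
/// Coarse file categories, loosely mirroring the feature list above.
/// Hypothetical type for illustration; not part of the `parser` crate.
#[derive(Debug, PartialEq)]
enum Kind {
    Pdf,
    /// DOCX, XLSX and PPTX are all ZIP containers, so the signature alone
    /// cannot tell them apart; the archive contents would have to be inspected.
    OfficeZip,
    Png,
    Jpeg,
    WebP,
    /// Valid UTF-8 with no recognized signature (TXT, CSV, JSON, ...).
    Text,
    Unknown,
}

/// Guess the file category from its leading bytes (magic numbers).
fn sniff(data: &[u8]) -> Kind {
    if data.starts_with(b"%PDF-") {
        Kind::Pdf
    } else if data.starts_with(b"PK\x03\x04") {
        // ZIP local file header
        Kind::OfficeZip
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Kind::Png
    } else if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Kind::Jpeg
    } else if data.len() >= 12 && &data[..4] == b"RIFF" && &data[8..12] == b"WEBP" {
        // RIFF container whose form type is WEBP
        Kind::WebP
    } else if std::str::from_utf8(data).is_ok() {
        // Note: empty input is valid UTF-8 and therefore lands here too.
        Kind::Text
    } else {
        Kind::Unknown
    }
}
```

Calling `sniff(&data)` on the bytes read in the usage example would return `Kind::Pdf` for a well-formed PDF. A real implementation would likely rely on a dedicated MIME-detection crate and would open ZIP containers to distinguish the Office formats.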

Usage

use parser::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = std::fs::read("document.pdf")?;
    let text = parse(&data)?;
    println!("{}", text);
    Ok(())
}

System Dependencies

Requires Tesseract OCR libraries:

  • Debian/Ubuntu: sudo apt install libtesseract-dev libleptonica-dev libclang-dev
  • macOS: brew install tesseract
  • Windows: Follow the installation instructions in the Tesseract GitHub repository

License

MIT

About

REST API service in Rust that takes in any file and returns its parsed content.
