A Rust library/website for extracting text from various document formats.
- PDF, DOCX, XLSX, PPTX documents
- OCR for images (PNG, JPEG, WebP) with English and French support
- Plain text formats (TXT, CSV, JSON)

use parser::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the whole document into memory as raw bytes.
    let data = std::fs::read("document.pdf")?;
    // Extract the text content from those bytes.
    let text = parse(&data)?;
    println!("{}", text);
    Ok(())
}

Requires Tesseract OCR libraries:
- Debian/Ubuntu:
  sudo apt install libtesseract-dev libleptonica-dev libclang-dev
- macOS:
  brew install tesseract
- Windows: follow the instructions in the Tesseract GitHub repository.
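
Since `parse` in the usage example takes only raw bytes, it can be handy to check what kind of file you have before handing it over, for instance to skip images on a machine without Tesseract. The sketch below is not taken from this library, and how `parse` itself detects formats is not documented here. `Kind` and `sniff` are hypothetical names. It shows one common approach: comparing the leading "magic" bytes against well-known file signatures.

```rust
/// Coarse file kinds, roughly matching the supported format families.
/// Hypothetical type for illustration; not part of the `parser` crate.
#[derive(Debug, PartialEq)]
enum Kind {
    Pdf,
    /// DOCX, XLSX and PPTX are all ZIP containers, so they share a signature.
    OfficeZip,
    Png,
    Jpeg,
    WebP,
    /// No known signature: possibly TXT, CSV or JSON, or something unsupported.
    Unknown,
}

/// Guess the kind of a file from its leading bytes.
/// Hypothetical helper; it only inspects signatures and never parses the file.
fn sniff(data: &[u8]) -> Kind {
    if data.starts_with(b"%PDF-") {
        Kind::Pdf
    } else if data.starts_with(b"PK\x03\x04") {
        // ZIP local file header; telling DOCX from XLSX or PPTX apart
        // would require reading the archive contents.
        Kind::OfficeZip
    } else if data.starts_with(b"\x89PNG\r\n\x1a\n") {
        Kind::Png
    } else if data.starts_with(b"\xFF\xD8\xFF") {
        Kind::Jpeg
    } else if data.len() >= 12 && data.starts_with(b"RIFF") && &data[8..12] == b"WEBP" {
        // WebP is a RIFF container with "WEBP" at byte offset 8.
        Kind::WebP
    } else {
        Kind::Unknown
    }
}
```

A caller could use `sniff(&data)` on the bytes read from disk and decide whether to call `parse` at all. Plain text formats have no signature, so `Kind::Unknown` should be treated as "maybe text" rather than as an error.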
