| id | |
|---|---|
| title | @happyvertical/pdf: PDF Processing and Text Extraction |
| sidebar_label | @happyvertical/pdf |
| sidebar_position | 8 |
Modern PDF processing utilities with text extraction and OCR support using unpdf and @happyvertical/ocr.
The @happyvertical/pdf package provides PDF processing for Node.js environments. It extracts embedded text directly and falls back to OCR when needed, so it handles both text-based and image-based PDFs.
- Text Extraction: Direct text extraction from PDF documents using unpdf
- OCR Integration: Automatic OCR fallback for image-based PDFs using @happyvertical/ocr
- Metadata Extraction: Comprehensive PDF metadata (title, author, dates, etc.)
- Image Extraction: Extract images from PDFs for OCR or display
- Smart Analysis: Document analysis with processing strategy recommendations
- Error Resilience: Graceful handling of corrupted or malformed PDFs
- Performance Optimization: Intelligent provider selection and processing strategies
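The OCR fallback named in the feature list can be pictured roughly as follows. This is only an illustrative sketch, not the package's implementation: `PdfReaderLike`, `OcrResult`, and `extractWithFallback` are hypothetical names, and the method signatures are assumptions inferred from the usage examples in this document. The package is described as handling the fallback itself, so application code should not normally need this.

```typescript
// Hypothetical shapes: only the method names appear in this document;
// the parameter and return types below are assumptions.
interface OcrResult {
  text: string;
  confidence: number;
}

interface PdfReaderLike {
  extractText(path: string): Promise<string | null>;
  extractImages(path: string): Promise<unknown[]>;
  performOCR(
    images: unknown[],
    options?: { language?: string; confidenceThreshold?: number },
  ): Promise<OcrResult>;
}

type ExtractionSource = 'direct' | 'ocr' | 'none';

// Try embedded text first; only run OCR when no usable text comes back.
async function extractWithFallback(
  reader: PdfReaderLike,
  path: string,
): Promise<{ text: string; source: ExtractionSource }> {
  const direct = await reader.extractText(path);
  if (direct && direct.trim().length > 0) {
    return { text: direct, source: 'direct' };
  }

  // No embedded text: the PDF is probably image-based, so look for page images.
  const images = await reader.extractImages(path);
  if (images.length === 0) {
    return { text: '', source: 'none' };
  }

  const ocr = await reader.performOCR(images, { language: 'eng' });
  return { text: ocr.text, source: 'ocr' };
}
```

Returning the `source` alongside the text lets a caller log or measure how often the slower OCR path is taken.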
# Install with bun (recommended)
bun add @happyvertical/pdf
# Or with npm
npm install @happyvertical/pdf
# Or with yarn
yarn add @happyvertical/pdf

import { getPDFReader } from '@happyvertical/pdf';
// Create a PDF reader instance
const reader = await getPDFReader();
// Extract text from a PDF file
const text = await reader.extractText('/path/to/document.pdf');
console.log(text);

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Analyze the PDF first to determine optimal processing strategy
const info = await reader.getInfo('/path/to/document.pdf');
console.log(`Strategy: ${info.recommendedStrategy}`);
console.log(`Pages: ${info.pageCount}`);
console.log(`Has text: ${info.hasEmbeddedText}`);
console.log(`Has images: ${info.hasImages}`);
// Extract text with strategy-aware processing
const text = await reader.extractText('/path/to/document.pdf');
if (text) {
  console.log(`Extracted ${text.length} characters`);
} else {
  console.log('No text could be extracted');
}

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Get comprehensive metadata
const metadata = await reader.extractMetadata('/path/to/document.pdf');
console.log(`Title: ${metadata.title}`);
console.log(`Author: ${metadata.author}`);
console.log(`Pages: ${metadata.pageCount}`);
console.log(`Created: ${metadata.creationDate}`);
// Extract images for OCR or display
const images = await reader.extractImages('/path/to/document.pdf');
console.log(`Found ${images.length} images`);

import { getPDFReader } from '@happyvertical/pdf';
const reader = await getPDFReader();
// Extract images from PDF
const images = await reader.extractImages('/path/to/scanned.pdf');
if (images.length > 0) {
  // Perform OCR on extracted images
  const ocrResult = await reader.performOCR(images, {
    language: 'eng',
    confidenceThreshold: 70
  });
  console.log('OCR Text:', ocrResult.text);
  console.log('Confidence:', ocrResult.confidence);
}

import { getPDFReader } from '@happyvertical/pdf';
// Configure reader with specific options
const reader = await getPDFReader({
  provider: 'auto', // Auto-select best provider
  enableOCR: true, // Enable OCR fallback
  timeout: 30000, // 30 second timeout
  maxFileSize: 50 * 1024 * 1024, // 50MB limit
  defaultOCROptions: {
    language: 'eng',
    confidenceThreshold: 70
  }
});

Currently supports Node.js only:
- Node.js: Full PDF processing with unpdf + OCR capabilities
- Browser: Planned for future releases
- unpdf: Modern PDF processing library for text, metadata, and image extraction
- @happyvertical/ocr: OCR capabilities with multiple provider support (tesseract.js, EasyOCR)
- Node.js 18+ (Node.js 24+ recommended)
- unpdf library (automatically installed)
- For OCR: all of the basic PDF processing requirements above
- Additional memory for image processing (2GB+ recommended)
- Optional: Enhanced OCR dependencies for better accuracy
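The requirements above can be checked before any PDF work starts. The following is a minimal sketch under stated assumptions: `parseMajor`, `meetsNodeRequirement`, and `hasRecommendedOcrMemory` are hypothetical helpers and not part of @happyvertical/pdf, and reading "2GB+ recommended" as total system memory is a guess, since the list does not say whether it means total or free memory.

```typescript
// Thresholds taken from the requirements list; the helpers are illustrative only.
const MIN_NODE_MAJOR = 18;
const RECOMMENDED_OCR_BYTES = 2 * 1024 * 1024 * 1024; // 2GB

// Read the leading major component of a version string such as "20.11.1" or "v18.0.0".
function parseMajor(version: string): number {
  const match = /^v?(\d+)/.exec(version.trim());
  return match ? Number(match[1]) : NaN;
}

// True when the version's major component is at least minMajor.
function meetsNodeRequirement(
  version: string,
  minMajor: number = MIN_NODE_MAJOR,
): boolean {
  const major = parseMajor(version);
  return Number.isFinite(major) && major >= minMajor;
}

// True when the given byte count reaches the recommended amount for OCR.
function hasRecommendedOcrMemory(totalBytes: number): boolean {
  return totalBytes >= RECOMMENDED_OCR_BYTES;
}

// Usage in Node.js (not executed here):
//   meetsNodeRequirement(process.versions.node)
//   hasRecommendedOcrMemory(require('node:os').totalmem())
```

Keeping the helpers pure, with the version string and byte count passed in, makes them easy to test without depending on the machine they run on.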
The package includes comprehensive error handling:
import { getPDFReader } from '@happyvertical/pdf';
try {
  const reader = await getPDFReader();
  const text = await reader.extractText('/path/to/document.pdf');
  if (!text) {
    console.log('No text found - may be image-based PDF');
  }
} catch (error) {
  if (error.name === 'PDFDependencyError') {
    console.error('Missing dependencies:', error.message);
  } else if (error.name === 'PDFUnsupportedError') {
    console.error('Unsupported operation:', error.message);
  } else {
    console.error('PDF processing failed:', error);
  }
}

The package maintains backward compatibility with legacy function exports:
// Legacy functions (deprecated, use getPDFReader() instead)
import {
  extractTextFromPDF,
  extractImagesFromPDF,
  performOCROnImages,
  checkOCRDependencies
} from '@happyvertical/pdf';

For complete API documentation including all methods, options, and examples, run:
npm run docs

Or view the generated documentation at packages/pdf/docs/.
This package is part of the HAVE SDK and is licensed under the MIT License - see the LICENSE file for details.