id pdf
title @happyvertical/pdf: PDF Processing and Text Extraction
sidebar_label @happyvertical/pdf
sidebar_position 8

@happyvertical/pdf

License: MIT

Modern PDF processing utilities with text extraction and OCR support using unpdf and @happyvertical/ocr.

Overview

The @happyvertical/pdf package provides PDF processing for Node.js environments. It extracts embedded text directly where a PDF has it and falls back to OCR for image-based (scanned) PDFs.

Key Features

  • Text Extraction: Direct text extraction from PDF documents using unpdf
  • OCR Integration: Automatic OCR fallback for image-based PDFs using @happyvertical/ocr
  • Metadata Extraction: Comprehensive PDF metadata (title, author, dates, etc.)
  • Image Extraction: Extract images from PDFs for OCR or display
  • Smart Analysis: Document analysis with processing strategy recommendations
  • Error Resilience: Graceful handling of corrupted or malformed PDFs
  • Performance Optimization: Intelligent provider selection and processing strategies

Installation

# Install with bun (recommended)
bun add @happyvertical/pdf

# Or with npm
npm install @happyvertical/pdf

# Or with yarn
yarn add @happyvertical/pdf

Quick Start

Basic PDF Text Extraction

import { getPDFReader } from '@happyvertical/pdf';

// Create a PDF reader instance
const reader = await getPDFReader();

// Extract text from a PDF file
const text = await reader.extractText('/path/to/document.pdf');
console.log(text);

Smart PDF Processing with Analysis

import { getPDFReader } from '@happyvertical/pdf';

const reader = await getPDFReader();

// Analyze the PDF first to determine optimal processing strategy
const info = await reader.getInfo('/path/to/document.pdf');
console.log(`Strategy: ${info.recommendedStrategy}`);
console.log(`Pages: ${info.pageCount}`);
console.log(`Has text: ${info.hasEmbeddedText}`);
console.log(`Has images: ${info.hasImages}`);

// Extract text with strategy-aware processing
const text = await reader.extractText('/path/to/document.pdf');
if (text) {
  console.log(`Extracted ${text.length} characters`);
} else {
  console.log('No text could be extracted');
}
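The analysis fields above can also drive an explicit fallback in calling code. The following is a minimal sketch, not the package's internal logic: the method names (getInfo, extractText, extractImages, performOCR) and the fields hasEmbeddedText and hasImages come from the examples in this README, while the PDFReaderLike interface, the return shapes, and the extractWithFallback helper are assumptions made for illustration. The reader may already perform a similar fallback on its own when OCR is enabled.

```typescript
// Hypothetical shapes: only the method and field names are taken from this README.
interface PDFInfoLike {
  hasEmbeddedText: boolean;
  hasImages: boolean;
}

interface OCRResultLike {
  text: string;
  confidence: number;
}

interface PDFReaderLike {
  getInfo(path: string): Promise<PDFInfoLike>;
  extractText(path: string): Promise<string | null>;
  extractImages(path: string): Promise<unknown[]>;
  performOCR(images: unknown[], options?: { language?: string }): Promise<OCRResultLike>;
}

type ExtractionSource = 'embedded-text' | 'ocr' | 'none';

// Illustrative helper (not part of the package): try embedded text first, then OCR.
async function extractWithFallback(
  reader: PDFReaderLike,
  path: string,
): Promise<{ source: ExtractionSource; text: string }> {
  const info = await reader.getInfo(path);

  // Prefer embedded text when the analysis reports any.
  if (info.hasEmbeddedText) {
    const text = await reader.extractText(path);
    if (text && text.trim().length > 0) {
      return { source: 'embedded-text', text };
    }
  }

  // Otherwise run OCR over the page images, if there are any.
  if (info.hasImages) {
    const images = await reader.extractImages(path);
    if (images.length > 0) {
      const ocr = await reader.performOCR(images, { language: 'eng' });
      if (ocr.text.trim().length > 0) {
        return { source: 'ocr', text: ocr.text };
      }
    }
  }

  return { source: 'none', text: '' };
}
```

Returning the source alongside the text lets the caller log or treat OCR output differently from embedded text, since OCR output is usually less reliable.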

Extract Metadata and Images

import { getPDFReader } from '@happyvertical/pdf';

const reader = await getPDFReader();

// Get comprehensive metadata
const metadata = await reader.extractMetadata('/path/to/document.pdf');
console.log(`Title: ${metadata.title}`);
console.log(`Author: ${metadata.author}`);
console.log(`Pages: ${metadata.pageCount}`);
console.log(`Created: ${metadata.creationDate}`);

// Extract images for OCR or display
const images = await reader.extractImages('/path/to/document.pdf');
console.log(`Found ${images.length} images`);

OCR Processing

import { getPDFReader } from '@happyvertical/pdf';

const reader = await getPDFReader();

// Extract images from PDF
const images = await reader.extractImages('/path/to/scanned.pdf');

if (images.length > 0) {
  // Perform OCR on extracted images
  const ocrResult = await reader.performOCR(images, {
    language: 'eng',
    confidenceThreshold: 70
  });

  console.log('OCR Text:', ocrResult.text);
  console.log('Confidence:', ocrResult.confidence);
}
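Callers may want to discard low-confidence OCR output instead of passing it downstream. The sketch below assumes the confidence field uses the same 0 to 100 scale as the confidenceThreshold option shown above; that scale, the ScoredOCRResult shape, and the acceptOCRText helper are illustrative assumptions, not part of the package.

```typescript
// Hypothetical result shape: the `text` and `confidence` field names come from the example above.
interface ScoredOCRResult {
  text: string;
  confidence: number; // assumed to be on a 0-100 scale
}

// Illustrative helper: return trimmed text only when it is non-empty and confident enough.
function acceptOCRText(result: ScoredOCRResult, minConfidence = 70): string | null {
  const text = result.text.trim();
  if (text.length === 0) {
    return null;
  }
  return result.confidence >= minConfidence ? text : null;
}
```

For example, acceptOCRText(ocrResult, 80) would return null for a result whose confidence is 75, so the caller can retry with other settings or flag the document for manual review.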

Advanced Configuration

import { getPDFReader } from '@happyvertical/pdf';

// Configure reader with specific options
const reader = await getPDFReader({
  provider: 'auto',           // Auto-select best provider
  enableOCR: true,            // Enable OCR fallback
  timeout: 30000,             // 30 second timeout
  maxFileSize: 50 * 1024 * 1024, // 50MB limit
  defaultOCROptions: {
    language: 'eng',
    confidenceThreshold: 70
  }
});
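This README does not describe how the timeout option is enforced internally. If a caller-side time limit is also wanted around a single call, one common approach is to race the call against a timer. The withTimeout helper below is a generic sketch and not part of the package; note that it only stops waiting and does not cancel the underlying work.

```typescript
// Illustrative helper: reject if `work` has not settled within `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number, label = 'operation'): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms} ms`));
    }, ms);
  });

  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it does not keep the process alive.
    if (timer !== undefined) {
      clearTimeout(timer);
    }
  }
}
```

A call such as withTimeout(reader.extractText('/path/to/document.pdf'), 30000, 'extractText') would then reject after 30 seconds if extraction has not finished.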

Environment Support

Currently supports Node.js only:

  • Node.js: Full PDF processing with unpdf + OCR capabilities
  • Browser: Planned for future releases

Dependencies

  • unpdf: Modern PDF processing library for text, metadata, and image extraction
  • @happyvertical/ocr: OCR capabilities with multiple provider support (tesseract.js, EasyOCR)

System Requirements

Basic PDF Processing

  • Node.js 18+ (Node.js 24+ recommended)
  • unpdf library (automatically installed)

OCR Capabilities

  • All basic PDF processing requirements
  • Additional memory for image processing (2GB+ recommended)
  • Optional: Enhanced OCR dependencies for better accuracy

Error Handling

The package reports failures through named errors such as PDFDependencyError and PDFUnsupportedError, which can be told apart by checking error.name:

import { getPDFReader } from '@happyvertical/pdf';

try {
  const reader = await getPDFReader();
  const text = await reader.extractText('/path/to/document.pdf');

  if (!text) {
    console.log('No text found - may be an image-based PDF');
  }
} catch (error) {
  // A caught value is `unknown` in strict TypeScript, so narrow it before reading its fields
  const name = error instanceof Error ? error.name : '';
  const message = error instanceof Error ? error.message : String(error);

  if (name === 'PDFDependencyError') {
    console.error('Missing dependencies:', message);
  } else if (name === 'PDFUnsupportedError') {
    console.error('Unsupported operation:', message);
  } else {
    console.error('PDF processing failed:', error);
  }
}
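When the same checks are needed in several places, they can be pulled into one small function. This is a sketch under assumptions: the error names PDFDependencyError and PDFUnsupportedError come from the example above, while the PDFFailureKind type and the classifyPDFError helper are hypothetical and not exported by the package.

```typescript
type PDFFailureKind = 'dependency' | 'unsupported' | 'unknown';

// Illustrative helper: map a caught value to a coarse failure kind plus a message.
function classifyPDFError(error: unknown): { kind: PDFFailureKind; message: string } {
  if (error instanceof Error) {
    if (error.name === 'PDFDependencyError') {
      return { kind: 'dependency', message: error.message };
    }
    if (error.name === 'PDFUnsupportedError') {
      return { kind: 'unsupported', message: error.message };
    }
    return { kind: 'unknown', message: error.message };
  }
  // Non-Error values (strings, plain objects) can be thrown too.
  return { kind: 'unknown', message: String(error) };
}
```

Matching on error.name rather than instanceof follows the example above and avoids depending on error classes that this README does not show as exports.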

Legacy Compatibility

The package maintains backward compatibility with legacy function exports:

// Legacy functions (deprecated, use getPDFReader() instead)
import {
  extractTextFromPDF,
  extractImagesFromPDF,
  performOCROnImages,
  checkOCRDependencies
} from '@happyvertical/pdf';

API Documentation

For complete API documentation including all methods, options, and examples, run:

npm run docs

Or view the generated documentation at packages/pdf/docs/.

License

This package is part of the HAVE SDK and is licensed under the MIT License - see the LICENSE file for details.
