@happyvertical/pdf

id	pdf
title	@happyvertical/pdf: PDF Processing and Text Extraction
sidebar_label	@happyvertical/pdf
sidebar_position	8

Modern PDF processing utilities with text extraction and OCR support using unpdf and @happyvertical/ocr.

Overview

The @happyvertical/pdf package provides PDF processing for Node.js environments: text, metadata, and image extraction plus document analysis. It reads embedded text directly and falls back to OCR for image-based (scanned) PDFs.

Key Features

Text Extraction: Direct text extraction from PDF documents using unpdf
OCR Integration: Automatic OCR fallback for image-based PDFs using @happyvertical/ocr
Metadata Extraction: Comprehensive PDF metadata (title, author, dates, etc.)
Image Extraction: Extract images from PDFs for OCR or display
Smart Analysis: Document analysis with processing strategy recommendations
Error Resilience: Graceful handling of corrupted or malformed PDFs
Performance Optimization: Intelligent provider selection and processing strategies

Installation

# Install with bun (recommended)
bun add @happyvertical/pdf

# Or with npm
npm install @happyvertical/pdf

# Or with yarn
yarn add @happyvertical/pdf

Quick Start

Basic PDF Text Extraction

import { getPDFReader } from '@happyvertical/pdf';

// Create a PDF reader instance
const reader = await getPDFReader();

// Extract text from a PDF file
const text = await reader.extractText('/path/to/document.pdf');
console.log(text);

Smart PDF Processing with Analysis

import { getPDFReader } from '@happyvertical/pdf';

const reader = await getPDFReader();

// Analyze the PDF first to determine optimal processing strategy
const info = await reader.getInfo('/path/to/document.pdf');
console.log(`Strategy: ${info.recommendedStrategy}`);
console.log(`Pages: ${info.pageCount}`);
console.log(`Has text: ${info.hasEmbeddedText}`);
console.log(`Has images: ${info.hasImages}`);

// Extract text with strategy-aware processing
const text = await reader.extractText('/path/to/document.pdf');
if (text) {
  console.log(`Extracted ${text.length} characters`);
} else {
  console.log('No text could be extracted');
}
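The analysis fields above can drive a simple decision before any extraction is attempted. The following is a minimal sketch, not part of the package: `PdfInfoLike`, `Plan`, and `planExtraction` are hypothetical names, the field types are assumed from the log statements above, and the plan labels are unrelated to whatever values `recommendedStrategy` actually takes.

```typescript
// Assumed shape: only the fields printed in the example above; types are guesses.
interface PdfInfoLike {
  pageCount: number;
  hasEmbeddedText: boolean;
  hasImages: boolean;
}

// Hypothetical labels for this sketch, not the package's strategy names.
type Plan = 'text' | 'ocr' | 'none';

// Decide how to process a document from what the analysis reported.
function planExtraction(info: PdfInfoLike): Plan {
  if (info.pageCount === 0) return 'none'; // nothing to read
  if (info.hasEmbeddedText) return 'text'; // direct extraction should work
  if (info.hasImages) return 'ocr';        // likely a scanned document
  return 'none';                           // no text and no images
}
```

A caller could pass the object returned by `reader.getInfo(...)` to `planExtraction` and skip OCR setup entirely when the plan is `'text'`.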

Extract Metadata and Images

import { getPDFReader } from '@happyvertical/pdf';

const reader = await getPDFReader();

// Get comprehensive metadata
const metadata = await reader.extractMetadata('/path/to/document.pdf');
console.log(`Title: ${metadata.title}`);
console.log(`Author: ${metadata.author}`);
console.log(`Pages: ${metadata.pageCount}`);
console.log(`Created: ${metadata.creationDate}`);

// Extract images for OCR or display
const images = await reader.extractImages('/path/to/document.pdf');
console.log(`Found ${images.length} images`);
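PDF metadata fields are often missing, so printing them directly can produce `undefined` in the output. As a rough sketch, the helper below formats the four fields shown above with fallbacks. `PdfMetadataLike` and `summarizeMetadata` are hypothetical names, and the field types (in particular whether `creationDate` is a `Date` or a string) are assumptions.

```typescript
// Assumed shape: the fields logged in the example above, all treated as optional.
interface PdfMetadataLike {
  title?: string;
  author?: string;
  pageCount?: number;
  creationDate?: Date | string;
}

// Build a printable summary, substituting placeholders for missing fields.
function summarizeMetadata(meta: PdfMetadataLike): string {
  const created =
    meta.creationDate instanceof Date
      ? meta.creationDate.toISOString().slice(0, 10) // YYYY-MM-DD (UTC)
      : meta.creationDate ?? 'unknown';
  return [
    `Title: ${meta.title?.trim() || 'untitled'}`,
    `Author: ${meta.author?.trim() || 'unknown'}`,
    `Pages: ${meta.pageCount ?? 'unknown'}`,
    `Created: ${created}`,
  ].join('\n');
}
```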

OCR Processing

import { getPDFReader } from '@happyvertical/pdf';

const reader = await getPDFReader();

// Extract images from PDF
const images = await reader.extractImages('/path/to/scanned.pdf');

if (images.length > 0) {
  // Perform OCR on extracted images
  const ocrResult = await reader.performOCR(images, {
    language: 'eng',
    confidenceThreshold: 70
  });

  console.log('OCR Text:', ocrResult.text);
  console.log('Confidence:', ocrResult.confidence);
}
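The package describes its OCR fallback as automatic, so the explicit sequence is normally unnecessary. Where a caller wants to know which path produced the text, the steps above can be chained by hand. This is a sketch under stated assumptions: `PdfReaderLike`, `OcrResultLike`, and `extractTextOrOcr` are hypothetical names, the interface lists only the methods used in this README, and their parameter and return types are guesses.

```typescript
// Assumed result shape, based on the `text` and `confidence` fields logged above.
interface OcrResultLike {
  text: string;
  confidence: number;
}

// Assumed minimal view of the reader: only methods shown in this README.
interface PdfReaderLike {
  extractText(path: string): Promise<string | null | undefined>;
  extractImages(path: string): Promise<unknown[]>;
  performOCR(
    images: unknown[],
    options?: { language?: string; confidenceThreshold?: number },
  ): Promise<OcrResultLike>;
}

type TextSource = 'embedded' | 'ocr' | 'none';

// Try embedded text first; run OCR on extracted images only if that yields nothing.
async function extractTextOrOcr(
  reader: PdfReaderLike,
  path: string,
): Promise<{ text: string; source: TextSource }> {
  const embedded = await reader.extractText(path);
  if (embedded && embedded.trim().length > 0) {
    return { text: embedded, source: 'embedded' };
  }
  const images = await reader.extractImages(path);
  if (images.length === 0) {
    return { text: '', source: 'none' };
  }
  const ocr = await reader.performOCR(images, {
    language: 'eng',
    confidenceThreshold: 70,
  });
  return { text: ocr.text, source: 'ocr' };
}
```

Because the function depends only on the structural interface, it can be exercised with a stub reader in tests without touching real PDF files.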

Advanced Configuration

import { getPDFReader } from '@happyvertical/pdf';

// Configure reader with specific options
const reader = await getPDFReader({
  provider: 'auto',           // Auto-select best provider
  enableOCR: true,            // Enable OCR fallback
  timeout: 30000,             // 30 second timeout
  maxFileSize: 50 * 1024 * 1024, // 50MB limit
  defaultOCROptions: {
    language: 'eng',
    confidenceThreshold: 70
  }
});
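In deployments it can be convenient to derive these options from environment variables rather than hard-coding them. The sketch below builds an options object with the same keys and defaults as the example above. The environment variable names (`PDF_TIMEOUT_MS`, `PDF_MAX_FILE_MB`, and so on), `ReaderOptionsLike`, and `readerOptionsFromEnv` are all hypothetical, and the package itself is not known to read any of them.

```typescript
// Assumed shape: mirrors the option keys used in the example above.
interface ReaderOptionsLike {
  provider: string;
  enableOCR: boolean;
  timeout: number;
  maxFileSize: number;
  defaultOCROptions: { language: string; confidenceThreshold: number };
}

// Parse a positive number, falling back when the value is missing or invalid.
function toPositiveNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === '') return fallback;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

// Build reader options from a plain key/value record (hypothetical variable names).
function readerOptionsFromEnv(
  env: Record<string, string | undefined>,
): ReaderOptionsLike {
  return {
    provider: 'auto',
    enableOCR: env['PDF_ENABLE_OCR'] !== 'false', // on unless explicitly disabled
    timeout: toPositiveNumber(env['PDF_TIMEOUT_MS'], 30_000),
    maxFileSize: toPositiveNumber(env['PDF_MAX_FILE_MB'], 50) * 1024 * 1024,
    defaultOCROptions: {
      language: env['PDF_OCR_LANGUAGE'] ?? 'eng',
      confidenceThreshold: toPositiveNumber(env['PDF_OCR_CONFIDENCE'], 70),
    },
  };
}
```

In Node.js the result of `readerOptionsFromEnv(process.env)` could then be passed to `getPDFReader(...)`, assuming the option names match the installed version.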

Environment Support

Currently supports Node.js only:

Node.js: Full PDF processing with unpdf + OCR capabilities
Browser: Planned for future releases

Dependencies

unpdf: Modern PDF processing library for text, metadata, and image extraction
@happyvertical/ocr: OCR capabilities with multiple provider support (tesseract.js, EasyOCR)

System Requirements

Basic PDF Processing

Node.js 18+ (Node.js 24+ recommended)
unpdf library (automatically installed)

OCR Capabilities

All basic PDF processing requirements
Additional memory for image processing (2GB+ recommended)
Optional: Enhanced OCR dependencies for better accuracy

Error Handling

Failures surface as thrown errors that can be told apart by name, and an empty result from extractText usually indicates an image-based PDF:

import { getPDFReader } from '@happyvertical/pdf';

try {
  const reader = await getPDFReader();
  const text = await reader.extractText('/path/to/document.pdf');

  if (!text) {
    console.log('No text found - may be image-based PDF');
  }
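} catch (error) {
  // `error` is `unknown` under strict TypeScript, so narrow it before reading its fields
  const name = error instanceof Error ? error.name : '';
  const message = error instanceof Error ? error.message : String(error);

  if (name === 'PDFDependencyError') {
    console.error('Missing dependencies:', message);
  } else if (name === 'PDFUnsupportedError') {
    console.error('Unsupported operation:', message);
  } else {
    console.error('PDF processing failed:', error);
  }
}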
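When several call sites need the same branching, the name checks can be pulled into one function. The following is a small sketch: `classifyPdfError` and `PdfFailure` are hypothetical names, and the only error names it recognizes are the two that appear in the example above. It matches on `name` rather than `instanceof` because the README does not show whether the error classes are exported.

```typescript
// Hypothetical categories for this sketch.
type PdfFailure = 'dependency' | 'unsupported' | 'other';

// Map a caught value to a category and a printable message.
function classifyPdfError(error: unknown): { kind: PdfFailure; message: string } {
  if (error instanceof Error) {
    if (error.name === 'PDFDependencyError') {
      return { kind: 'dependency', message: error.message };
    }
    if (error.name === 'PDFUnsupportedError') {
      return { kind: 'unsupported', message: error.message };
    }
    return { kind: 'other', message: error.message };
  }
  // Non-Error values (strings, objects) can also be thrown in JavaScript.
  return { kind: 'other', message: String(error) };
}
```

A catch block can then switch on `classifyPdfError(error).kind`, for example to prompt for installation on `'dependency'` and to log and continue on `'other'`.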

Legacy Compatibility

The package maintains backward compatibility with legacy function exports:

// Legacy functions (deprecated, use getPDFReader() instead)
import {
  extractTextFromPDF,
  extractImagesFromPDF,
  performOCROnImages,
  checkOCRDependencies
} from '@happyvertical/pdf';

API Documentation

For complete API documentation including all methods, options, and examples, run:

npm run docs

Or view the generated documentation at packages/pdf/docs/.

License

This package is part of the HAVE SDK and is licensed under the MIT License - see the LICENSE file for details.
