Merged
6 changes: 5 additions & 1 deletion .gitignore
@@ -1,6 +1,7 @@
# Dependencies
node_modules/
.venv/
__pycache__/

# Build output
dist/
@@ -23,4 +24,7 @@ coverage/

# OS
.DS_Store
CLAUDE.md

# Classification pipeline artifacts
scripts/classification/models/
scripts/classification/*.jsonl
1 change: 1 addition & 0 deletions AGENTS.md
64 changes: 64 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,64 @@
# docx-corpus

The largest open corpus of .docx files (~800K documents) for document processing research. Built by [SuperDoc](https://superdoc.dev).

## Architecture

This is a **data pipeline monorepo** with two runtimes:

- **TypeScript (Bun)** — infrastructure: scraping, extraction, embedding
- **Python** — data science: classification, export, publishing

```
apps/cli/ → corpus <command> (scrape, extract, embed, status)
apps/cdx-filter/ → AWS Lambda for Common Crawl CDX filtering
packages/shared/ → DB client (Bun.sql), R2 storage, UI helpers
packages/scraper/ → Downloads .docx from Common Crawl WARC archives
packages/extractor/ → Text extraction via Docling
packages/embedder/ → Embeddings via Google gemini-embedding-001
scripts/classification/ → ML classification pipeline (Python)
db/ → PostgreSQL schema + migrations
```

## Pipeline

Each stage writes to the same PostgreSQL database (`documents` table):

1. **Scrape** (TS) — Common Crawl → .docx files in R2 (`status = 'uploaded'`)
2. **Extract** (TS) — Docling → text in R2 (`extracted_at`, `word_count`, `language`)
3. **Embed** (TS) — Google API → pgvector (`embedding`, `embedded_at`)
4. **Classify** (Python) — ModernBERT → labels (`document_type`, `document_topic`)

## Database

Single `documents` table in PostgreSQL (NeonDB) with pgvector. All pipeline stages write to this table.

- **Connection**: `DATABASE_URL` env var (Bun.sql for TS, psycopg2 for Python)
- **Schema**: `db/schema.sql` (canonical), `db/migrations/` (incremental)
- **Key columns**: `id` (SHA-256 hash), `status`, `extracted_at`, `embedded_at`, `document_type`, `document_topic`
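
Since `id` is a SHA-256 hash, document ids are content-addressed. A minimal sketch, assuming the hash is taken over the raw `.docx` bytes (the document does not specify the exact input):

```python
import hashlib

def doc_id(docx_bytes: bytes) -> str:
    # Content-addressed id: SHA-256 hex digest of the file bytes, so
    # re-downloading an identical file maps to the same row and R2 key.
    return hashlib.sha256(docx_bytes).hexdigest()
```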

## Storage

Documents and extracted text live in Cloudflare R2:
- `documents/{hash}.docx` — original files
- `extracted/{hash}.txt` — extracted text

Text is also available at `https://docxcorp.us/extracted/{id}.txt`.
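
The storage layout above can be expressed as a small key-mapping sketch (a hypothetical helper; only the path templates come from this document):

```python
def r2_keys(doc_id: str) -> dict[str, str]:
    # Map a document id (SHA-256 hex) to its R2 object keys and public URL.
    return {
        "docx": f"documents/{doc_id}.docx",
        "text": f"extracted/{doc_id}.txt",
        "public_url": f"https://docxcorp.us/extracted/{doc_id}.txt",
    }
```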

## Commands

```bash
bun install                      # Install TS dependencies
bun run corpus scrape --crawl 3  # Scrape from Common Crawl
bun run corpus extract           # Extract text
bun run corpus embed             # Generate embeddings
bun run corpus status            # Show pipeline stats
```

## Key conventions

- Use `bun` for all TS tooling (not node/npm/pnpm)
- DB client is in `packages/shared/db.ts` — all pipeline stages use `DbClient`
- Storage abstraction in `packages/shared/storage.ts` — R2 or local
- Environment: `.env` at project root (gitignored), see `.env.example`
- Python scripts manage their own deps via `pyproject.toml`
76 changes: 61 additions & 15 deletions README.md
@@ -55,6 +55,13 @@ Phase 4: Embed (corpus embed)
│ extracted/     │ ──► │ transformers   │ ──► │ (pgvector)     │
│ {hash}.txt     │     │ (Python)       │     │ embedding      │
└────────────────┘     └────────────────┘     └────────────────┘

Phase 5: Classify (Python ML pipeline)
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│ LLM labels     │     │ ModernBERT     │     │ PostgreSQL     │
│ 3,500 sample   │ ──► │ fine-tuning    │ ──► │ document_type  │
│ (Claude)       │     │ (2 models)     │     │ document_topic │
└────────────────┘     └────────────────┘     └────────────────┘
```

### Why Common Crawl?
@@ -82,27 +89,29 @@ bun install
## Project Structure

```
packages/
  shared/          # Shared utilities (DB client, storage, formatting)
  scraper/         # Core scraper logic (downloads WARC, validates .docx)
  extractor/       # Text extraction using Docling (Python)
  embedder/        # Document embeddings
apps/
  cli/             # Unified CLI - corpus <command>
  cdx-filter/      # AWS Lambda - filters CDX indexes for .docx URLs
  web/             # Landing page - docxcorp.us
  cli/             # Unified CLI — corpus <command>
  cdx-filter/      # AWS Lambda — filters CDX indexes for .docx URLs
  web/             # Landing page — docxcorp.us
packages/
  shared/          # DB client, storage, formatting (Bun)
  scraper/         # Downloads WARC, validates .docx (Bun)
  extractor/       # Text extraction via Docling (Bun + Python)
  embedder/        # Document embeddings (Bun)
scripts/
  classification/  # ML classification pipeline (Python)
db/
  schema.sql       # PostgreSQL schema (with pgvector)
  migrations/      # Database migrations
  schema.sql       # PostgreSQL schema (with pgvector)
  migrations/      # Database migrations
```

**Apps** (entry points)

| App            | Purpose                         | Uses                         |
| -------------- | ------------------------------- | ---------------------------- |
| **cli**        | `corpus` command                | scraper, extractor, embedder |
| **cdx-filter** | Filter CDX indexes (Lambda)     | -                            |
| **web**        | Landing page                    | -                            |
| App            | Purpose                         | Runtime |
| -------------- | ------------------------------- | ------- |
| **cli**        | `corpus` command                | Bun     |
| **cdx-filter** | Filter CDX indexes (Lambda)     | Bun     |
| **web**        | Landing page                    | -       |

**Packages** (libraries)

@@ -113,6 +122,12 @@ db/
| **extractor** | Extract text (Docling) | Bun + Python |
| **embedder**  | Generate embeddings    | Bun          |

**Scripts** (data science)

| Script                     | Purpose                                   | Runtime |
| -------------------------- | ----------------------------------------- | ------- |
| **scripts/classification** | Document type + topic classification (ML) | Python  |

## Usage

### 1. Run Lambda to filter CDX indexes
@@ -173,6 +188,34 @@ bun run corpus embed --batch 100 --verbose

Uses Google's `gemini-embedding-001` model (3072 dimensions, ~$0.006/1M tokens). Documents are chunked and embeddings are combined via weighted average.
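
The weighted-average combination of chunk embeddings can be sketched as below. The weighting scheme is an assumption (the README only says "weighted average"; in practice weights might be chunk token counts):

```python
def combine(chunk_vecs: list[list[float]], weights: list[float]) -> list[float]:
    """Combine per-chunk embedding vectors into one document vector.

    Plain weighted mean per dimension; whether the result is additionally
    L2-normalized before storage in pgvector is not specified here.
    """
    total = sum(weights)
    dims = len(chunk_vecs[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, chunk_vecs)) / total
        for d in range(dims)
    ]
```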

### 5. Classify documents

Classifies documents by **document type** (10 classes) and **topic** (9 classes) using the [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) pattern: LLM labels a sample → train classifier → apply at scale.

```bash
cd scripts/classification

# Install Python dependencies
pip install -e .

# Step 1: Sample 3,500 documents (stratified by language, word count, domain)
python sample.py --total 3500 --output sampled_docs.jsonl

# Step 2: Label with Claude (~$3)
python label.py --input sampled_docs.jsonl --output labeled_docs.jsonl

# Step 3: Train ModernBERT classifiers (~30min GPU)
python train.py --input labeled_docs.jsonl --output-dir ./models

# Step 4: Classify full corpus (~800K docs)
python classify.py --models-dir ./models

# Check results
python evaluate.py corpus
```
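
The stratified draw in step 1 might look like the sketch below. The bucket keys and the word-count boundary are assumptions (the real `sample.py` also stratifies by domain and may use different bands):

```python
import random
from collections import defaultdict

def stratified_sample(docs: list[dict], total: int, seed: int = 0) -> list[dict]:
    """Draw roughly evenly across (language, word-count band) buckets."""
    buckets: dict[tuple, list] = defaultdict(list)
    for d in docs:
        band = "short" if d["word_count"] < 500 else "long"  # assumed boundary
        buckets[(d["language"], band)].append(d)
    rng = random.Random(seed)  # seeded for reproducible samples
    per_bucket = max(1, total // len(buckets))
    sample: list[dict] = []
    for members in buckets.values():
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample[:total]
```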

See [scripts/classification/README.md](scripts/classification/README.md) for full details.

### Docker

Run the CLI in a container:
@@ -268,6 +311,9 @@ EMBED_INPUT_PREFIX=extracted
EMBED_BATCH_SIZE=100
EMBED_CONCURRENCY=20   # Parallel API requests
GOOGLE_API_KEY=        # Required for embeddings

# Classification (Python scripts only)
ANTHROPIC_API_KEY=     # Required for LLM labeling step
```

### Rate Limiting