Merged
6 changes: 5 additions & 1 deletion .gitignore
@@ -1,6 +1,7 @@
# Dependencies
node_modules/
.venv/
__pycache__/

# Build output
dist/
@@ -23,4 +24,7 @@ coverage/

# OS
.DS_Store
CLAUDE.md

# Classification pipeline artifacts
scripts/classification/models/
scripts/classification/*.jsonl
1 change: 1 addition & 0 deletions AGENTS.md
64 changes: 64 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,64 @@
# docx-corpus

The largest open corpus of .docx files (~800K documents) for document processing research. Built by [SuperDoc](https://superdoc.dev).

## Architecture

This is a **data pipeline monorepo** with two runtimes:

- **TypeScript (Bun)** — infrastructure: scraping, extraction, embedding
- **Python** — data science: classification, export, publishing

```
apps/cli/ → corpus <command> (scrape, extract, embed, status)
apps/cdx-filter/ → AWS Lambda for Common Crawl CDX filtering
packages/shared/ → DB client (Bun.sql), R2 storage, UI helpers
packages/scraper/ → Downloads .docx from Common Crawl WARC archives
packages/extractor/ → Text extraction via Docling
packages/embedder/ → Embeddings via Google gemini-embedding-001
scripts/classification/ → ML classification pipeline (Python)
db/ → PostgreSQL schema + migrations
```

## Pipeline

Each stage writes to the same PostgreSQL database (`documents` table):

1. **Scrape** (TS) — Common Crawl → .docx files in R2 (`status = 'uploaded'`)
2. **Extract** (TS) — Docling → text in R2 (`extracted_at`, `word_count`, `language`)
3. **Embed** (TS) — Google API → pgvector (`embedding`, `embedded_at`)
4. **Classify** (Python) — ModernBERT → labels (`document_type`, `document_topic`)

## Database

Single `documents` table in PostgreSQL (NeonDB) with pgvector. All pipeline stages write to this table.

- **Connection**: `DATABASE_URL` env var (Bun.sql for TS, psycopg2 for Python)
- **Schema**: `db/schema.sql` (canonical), `db/migrations/` (incremental)
- **Key columns**: `id` (SHA-256 hash), `status`, `extracted_at`, `embedded_at`, `document_type`, `document_topic`
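
Since `id` is a SHA-256 hash, document ids are content-addressed. A minimal sketch, assuming the hash is taken over the raw `.docx` bytes (the document does not specify the exact input):

```python
import hashlib

def doc_id(docx_bytes: bytes) -> str:
    # Content-addressed id: SHA-256 hex digest of the file bytes, so
    # re-downloading an identical file maps to the same row and R2 key.
    return hashlib.sha256(docx_bytes).hexdigest()
```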

## Storage

Documents and extracted text live in Cloudflare R2:
- `documents/{hash}.docx` — original files
- `extracted/{hash}.txt` — extracted text

Text is also available at `https://docxcorp.us/extracted/{id}.txt`.
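
The storage layout above can be expressed as a small key-mapping sketch (a hypothetical helper; only the path templates come from this document):

```python
def r2_keys(doc_id: str) -> dict[str, str]:
    # Map a document id (SHA-256 hex) to its R2 object keys and public URL.
    return {
        "docx": f"documents/{doc_id}.docx",
        "text": f"extracted/{doc_id}.txt",
        "public_url": f"https://docxcorp.us/extracted/{doc_id}.txt",
    }
```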

## Commands

```bash
bun install                      # Install TS dependencies
bun run corpus scrape --crawl 3  # Scrape from Common Crawl
bun run corpus extract           # Extract text
bun run corpus embed             # Generate embeddings
bun run corpus status            # Show pipeline stats
```

## Key conventions

- Use `bun` for all TS tooling (not node/npm/pnpm)
- DB client is in `packages/shared/db.ts` — all pipeline stages use `DbClient`
- Storage abstraction in `packages/shared/storage.ts` — R2 or local
- Environment: `.env` at project root (gitignored), see `.env.example`
- Python scripts manage their own deps via `pyproject.toml`
76 changes: 61 additions & 15 deletions README.md
@@ -55,6 +55,13 @@ Phase 4: Embed (corpus embed)
│ extracted/     │ ──► │ transformers   │ ──► │ (pgvector)     │
│ {hash}.txt     │     │ (Python)       │     │ embedding      │
└────────────────┘     └────────────────┘     └────────────────┘

Phase 5: Classify (Python ML pipeline)
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│ LLM labels     │     │ ModernBERT     │     │ PostgreSQL     │
│ 3,500 sample   │ ──► │ fine-tuning    │ ──► │ document_type  │
│ (Claude)       │     │ (2 models)     │     │ document_topic │
└────────────────┘     └────────────────┘     └────────────────┘
```

### Why Common Crawl?
@@ -82,27 +89,29 @@ bun install
## Project Structure

```
packages/
  shared/          # Shared utilities (DB client, storage, formatting)
  scraper/         # Core scraper logic (downloads WARC, validates .docx)
  extractor/       # Text extraction using Docling (Python)
  embedder/        # Document embeddings
apps/
  cli/             # Unified CLI - corpus <command>
  cdx-filter/      # AWS Lambda - filters CDX indexes for .docx URLs
  web/             # Landing page - docxcorp.us
  cli/             # Unified CLI — corpus <command>
  cdx-filter/      # AWS Lambda — filters CDX indexes for .docx URLs
  web/             # Landing page — docxcorp.us
packages/
  shared/          # DB client, storage, formatting (Bun)
  scraper/         # Downloads WARC, validates .docx (Bun)
  extractor/       # Text extraction via Docling (Bun + Python)
  embedder/        # Document embeddings (Bun)
scripts/
  classification/  # ML classification pipeline (Python)
db/
  schema.sql       # PostgreSQL schema (with pgvector)
  migrations/      # Database migrations
  schema.sql       # PostgreSQL schema (with pgvector)
  migrations/      # Database migrations
```

**Apps** (entry points)

| App            | Purpose                         | Uses                         |
| -------------- | ------------------------------- | ---------------------------- |
| **cli**        | `corpus` command                | scraper, extractor, embedder |
| **cdx-filter** | Filter CDX indexes (Lambda)     | -                            |
| **web**        | Landing page                    | -                            |
| App            | Purpose                         | Runtime |
| -------------- | ------------------------------- | ------- |
| **cli**        | `corpus` command                | Bun     |
| **cdx-filter** | Filter CDX indexes (Lambda)     | Bun     |
| **web**        | Landing page                    | -       |

**Packages** (libraries)

@@ -113,6 +122,12 @@ db/
| **extractor** | Extract text (Docling) | Bun + Python |
| **embedder**  | Generate embeddings    | Bun          |

**Scripts** (data science)

| Script                     | Purpose                                   | Runtime |
| -------------------------- | ----------------------------------------- | ------- |
| **scripts/classification** | Document type + topic classification (ML) | Python  |

## Usage

### 1. Run Lambda to filter CDX indexes
@@ -173,6 +188,34 @@ bun run corpus embed --batch 100 --verbose

Uses Google's `gemini-embedding-001` model (3072 dimensions, ~$0.006/1M tokens). Documents are chunked and embeddings are combined via weighted average.
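
The weighted-average combination of chunk embeddings can be sketched as below. The weighting scheme is an assumption (the README only says "weighted average"; in practice weights might be chunk token counts):

```python
def combine(chunk_vecs: list[list[float]], weights: list[float]) -> list[float]:
    """Combine per-chunk embedding vectors into one document vector.

    Plain weighted mean per dimension; whether the result is additionally
    L2-normalized before storage in pgvector is not specified here.
    """
    total = sum(weights)
    dims = len(chunk_vecs[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, chunk_vecs)) / total
        for d in range(dims)
    ]
```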

### 5. Classify documents

Classifies documents by **document type** (10 classes) and **topic** (9 classes) using the [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) pattern: LLM labels a sample → train classifier → apply at scale.

```bash
cd scripts/classification

# Install Python dependencies
pip install -e .

# Step 1: Sample 3,500 documents (stratified by language, word count, domain)
python sample.py --total 3500 --output sampled_docs.jsonl

# Step 2: Label with Claude (~$3)
python label.py --input sampled_docs.jsonl --output labeled_docs.jsonl

# Step 3: Train ModernBERT classifiers (~30min GPU)
python train.py --input labeled_docs.jsonl --output-dir ./models

# Step 4: Classify full corpus (~800K docs)
python classify.py --models-dir ./models

# Check results
python evaluate.py corpus
```
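
The stratified draw in step 1 might look like the sketch below. The bucket keys and the word-count boundary are assumptions (the real `sample.py` also stratifies by domain and may use different bands):

```python
import random
from collections import defaultdict

def stratified_sample(docs: list[dict], total: int, seed: int = 0) -> list[dict]:
    """Draw roughly evenly across (language, word-count band) buckets."""
    buckets: dict[tuple, list] = defaultdict(list)
    for d in docs:
        band = "short" if d["word_count"] < 500 else "long"  # assumed boundary
        buckets[(d["language"], band)].append(d)
    rng = random.Random(seed)  # seeded for reproducible samples
    per_bucket = max(1, total // len(buckets))
    sample: list[dict] = []
    for members in buckets.values():
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample[:total]
```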

See [scripts/classification/README.md](scripts/classification/README.md) for full details.

### Docker

Run the CLI in a container:
@@ -268,6 +311,9 @@ EMBED_INPUT_PREFIX=extracted
EMBED_BATCH_SIZE=100
EMBED_CONCURRENCY=20   # Parallel API requests
GOOGLE_API_KEY=        # Required for embeddings

# Classification (Python scripts only)
ANTHROPIC_API_KEY=     # Required for LLM labeling step
```

### Rate Limiting