feat: add document classification pipeline #6

Open
caio-pizzol wants to merge 1 commit into main from caio-pizzol/classification-pipeline

@caio-pizzol
Contributor

Adds a Python ML pipeline (scripts/classification/) that classifies ~800K .docx documents by document type (10 classes) and topic (9 classes). It follows the FineWeb-Edu pattern: an LLM labels a sample → ModernBERT classifiers are fine-tuned on those labels → the trained models are applied to the full corpus.

Pipeline steps:

  • sample.py: stratified sampling across languages and word count
  • label.py: async LLM labeling with Claude (resumable)
  • train.py: fine-tune two ModernBERT classifiers
  • classify.py: batch inference on full corpus
  • evaluate.py: quality metrics and distribution analysis
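
For reviewers who want a concrete picture of the sampling step listed above, here is a minimal sketch of stratified sampling over language and word count. It is not the code in sample.py: the bucket edges, the `language` and `word_count` field names, and the `per_stratum` parameter are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def word_count_bucket(n_words, edges=(100, 1000, 10000)):
    """Map a word count to a coarse bucket index (edges are assumed values)."""
    for i, edge in enumerate(edges):
        if n_words < edge:
            return i
    return len(edges)


def stratified_sample(docs, per_stratum, seed=0):
    """Draw up to `per_stratum` documents from each (language, bucket) stratum."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for doc in docs:
        # Hypothetical field names; the real schema may differ.
        key = (doc["language"], word_count_bucket(doc["word_count"]))
        strata[key].append(doc)

    sample = []
    for key in sorted(strata):  # sorted for a deterministic iteration order
        members = strata[key]
        k = min(per_stratum, len(members))  # small strata are taken whole
        sample.extend(rng.sample(members, k))
    return sample
```

Capping each stratum rather than sampling proportionally keeps rare languages and unusually long or short documents represented in the labeled set, which is the usual reason to stratify before LLM labeling.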

Also adds:

  • LLM classification fields and methods to DbClient
  • CLAUDE.md / AGENTS.md at root, packages/shared, and scripts/classification
  • Updated README with Phase 5 (Classify) and project structure
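
The "async LLM labeling (resumable)" step mentioned above can be illustrated with a small sketch. This is an assumption about one way to get resumability (an append-only JSONL checkpoint), not a description of label.py. The `fake_label` stub stands in for the real Claude call, and the `doc_type` / `topic` field names, the checkpoint format, and the `concurrency` parameter are hypothetical.

```python
import asyncio
import json
from pathlib import Path


async def fake_label(text):
    """Stand-in for the LLM call; a real version would call the Claude API."""
    await asyncio.sleep(0)
    return {"doc_type": "other", "topic": "other"}  # hypothetical label fields


def load_done(path):
    """Return the IDs already labeled in a previous run (JSONL checkpoint)."""
    if not path.exists():
        return set()
    with path.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


async def label_all(docs, path, label_fn=fake_label, concurrency=8):
    """Label docs missing from the checkpoint, appending each result as it completes."""
    done = load_done(path)
    pending = [d for d in docs if d["id"] not in done]
    sem = asyncio.Semaphore(concurrency)  # bound the number of in-flight requests

    async def worker(doc, out):
        async with sem:
            labels = await label_fn(doc["text"])
        # Write immediately so an interrupted run loses at most in-flight work.
        out.write(json.dumps({"id": doc["id"], **labels}) + "\n")
        out.flush()

    with path.open("a") as out:
        await asyncio.gather(*(worker(d, out) for d in pending))
    return len(pending)
```

Calling `asyncio.run(label_all(docs, Path("labels.jsonl")))` a second time with the same inputs labels nothing new, because every ID is already present in the checkpoint file.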

@codecov

codecov bot commented Mar 9, 2026

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️
