This repository contains a rule-based data classifier that labels LLM training/fine-tuning datasets with categories such as unsafe, spammy, and sensitive.
Inspired by my past work on AutoPureData.
- Rule-based heuristics implemented in data_classifier/classifier.py (a simplified sketch appears after this list)
- Command-line wrapper at scripts/classify.py for file and single-text classification
- Small unit test suite in tests/test_classifier.py
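
For orientation, here is a minimal sketch of how a rule-based check of this kind can work. The patterns, tag names, and function name below are illustrative assumptions, not the exact rules in data_classifier/classifier.py:

```python
import re

# Illustrative rules only; the real classifier may use different patterns and tags.
RULES = [
    (re.compile(r"https?://", re.IGNORECASE), "Spammy"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "Sensitive"),  # SSN-like number
]

def classify_sketch(text: str) -> list[str]:
    """Return every tag whose pattern matches the text."""
    return [tag for pattern, tag in RULES if pattern.search(text)]

print(classify_sketch("Visit http://spam.example.com for a great deal"))  # ['Spammy']
```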
- Create and activate a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate
```

- Install the minimal requirements (the project currently lists only pytest). Many classifier features require additional heavy dependencies (see Implementation notes).

```bash
pip install -r requirements.txt
```

If you plan to use the built-in model-backed checks (transformers / torch / spacy / fasttext), install those extras (a GPU may be needed for acceptable speed):

```bash
python -m spacy download en_core_web_sm
```

Note: Some pre-trained models may be large and require a Hugging Face token for access.
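
If a gated or private model does require authentication, you can log in with your Hugging Face access token before the first download. This is a generic huggingface_hub call (the library ships with transformers) and is not specific to this repository:

```python
from huggingface_hub import login

# Stores/uses a Hugging Face access token so gated or private models can be downloaded.
login()  # interactive prompt; or login(token="hf_...") to pass it programmatically
```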
CLI (classify a file or single string):
```bash
# classify a JSONL file (each line is a JSON object containing a `text` field)
python scripts/classify.py --input examples/sample.jsonl --output out.jsonl --format jsonl

# classify a single text from the CLI
python scripts/classify.py --text "This is a sample sentence to classify."
```

The CLI writes a JSONL file where each object receives a `label` field (a list of tags).
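
To illustrate that output format, the snippet below reads out.jsonl and prints every record that received at least one tag. The `text` and `label` field names come from the description above; the specific tags are whatever the classifier assigns:

```python
import json

# Print each record that the classifier tagged with anything.
with open("out.jsonl", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        if record.get("label"):  # non-empty list of tags
            print(record["label"], "->", record["text"])
```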
Python API (import and call):
```python
from data_classifier.classifier import classify

text = "Visit http://spam.example.com for a great deal"
tags = classify(text)
print(tags)  # -> e.g. ['Spammy']
```

Public functions you’ll likely use:
- `classify(text: str) -> list[str]`: returns zero or more tags describing the text.
- `classify_file(input_path, output_path, input_format='jsonl', text_field='text')`: helper used by the CLI (your own script can call `classify` directly). A usage sketch follows this list.
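
As a sketch of how the two functions fit together, assuming classify_file is importable from the same module as classify and using the signature documented above (the file paths are placeholders):

```python
from data_classifier.classifier import classify, classify_file

# Tag a single string.
print(classify("Visit http://spam.example.com for a great deal"))

# Tag every record in a JSONL file; each output object gains a `label` field.
classify_file(
    "examples/sample.jsonl",  # input path (placeholder)
    "out.jsonl",              # output path (placeholder)
    input_format="jsonl",
    text_field="text",
)
```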
This project is licensed under CC BY 4.0. See LICENSE.md for details.
