This repository contains a rule-based data classifier that labels LLM training/fine-tuning datasets with categories such as unsafe, spammy, and sensitive.
Inspired by my past work on AutoPureData.
- Rule-based heuristics implemented in data_classifier/classifier.py (a simplified sketch appears after this list)
- Command-line wrapper at scripts/classify.py for file and single-text classification
- Small unit test suite in tests/test_classifier.py
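
For orientation, here is a minimal sketch of how a rule-based check of this kind can work. The patterns, tag names, and function name below are illustrative assumptions, not the exact rules in data_classifier/classifier.py:

```python
import re

# Illustrative rules only; the real classifier may use different patterns and tags.
RULES = [
    (re.compile(r"https?://", re.IGNORECASE), "Spammy"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "Sensitive"),  # SSN-like number
]

def classify_sketch(text: str) -> list[str]:
    """Return every tag whose pattern matches the text."""
    return [tag for pattern, tag in RULES if pattern.search(text)]

print(classify_sketch("Visit http://spam.example.com for a great deal"))  # ['Spammy']
```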
- Create and activate a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate
```

- Install the minimal requirements (the project currently lists only pytest). Many classifier features require additional heavy dependencies (see Implementation notes).

```bash
pip install -r requirements.txt
```

If you plan to use the built-in model-backed checks (transformers / torch / spacy / fasttext), install those extras (a GPU may be needed for acceptable speed):

```bash
python -m spacy download en_core_web_sm
```

Note: Some pre-trained models may be large and require a Hugging Face token for access.
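
If a gated or private model does require authentication, you can log in with your Hugging Face access token before the first download. This is a generic huggingface_hub call (the library ships with transformers) and is not specific to this repository:

```python
from huggingface_hub import login

# Stores/uses a Hugging Face access token so gated or private models can be downloaded.
login()  # interactive prompt; or login(token="hf_...") to pass it programmatically
```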
CLI (classify a file or single string):
```bash
# classify a JSONL file (each line is a JSON object containing a `text` field)
python scripts/classify.py --input examples/sample.jsonl --output out.jsonl --format jsonl

# classify a single text from the CLI
python scripts/classify.py --text "This is a sample sentence to classify."
```

The CLI writes a JSONL file where each object receives a `label` field (a list of tags).
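
To illustrate that output format, the snippet below reads out.jsonl and prints every record that received at least one tag. The `text` and `label` field names come from the description above; the specific tags are whatever the classifier assigns:

```python
import json

# Print each record that the classifier tagged with anything.
with open("out.jsonl", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        if record.get("label"):  # non-empty list of tags
            print(record["label"], "->", record["text"])
```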
Python API (import and call):
```python
from data_classifier.classifier import classify

text = "Visit http://spam.example.com for a great deal"
tags = classify(text)
print(tags)  # -> e.g. ['Spammy']
```

Public functions you’ll likely use:
- `classify(text: str) -> list[str]`: returns zero or more tags describing the text.
- `classify_file(input_path, output_path, input_format='jsonl', text_field='text')`: helper used by the CLI (your own script can call `classify` directly). A usage sketch follows this list.
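
As a sketch of how the two functions fit together, assuming classify_file is importable from the same module as classify and using the signature documented above (the file paths are placeholders):

```python
from data_classifier.classifier import classify, classify_file

# Tag a single string.
print(classify("Visit http://spam.example.com for a great deal"))

# Tag every record in a JSONL file; each output object gains a `label` field.
classify_file(
    "examples/sample.jsonl",  # input path (placeholder)
    "out.jsonl",              # output path (placeholder)
    input_format="jsonl",
    text_field="text",
)
```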
This project is licensed under CC BY 4.0. See LICENSE.md for details.
