Open-source data compiler for AI training datasets
Compile datasets like code: clean, rank, and optimize in one command.
Our mission: to make data as programmable and optimized as compute.
DataGPU compiles raw, messy datasets into training-ready binaries, turning 10k+ lines of preprocessing scripts into a single declarative command.
- Automatic Cleaning: Schema inference and normalization for text, numeric, and categorical data
- Fast Deduplication: Hash-based duplicate removal using xxHash (see the sketch after this list)
- Quality Ranking: TF-IDF and cosine similarity-based relevance scoring
- Smart Caching: Local cache with SQLite for reproducible compilations
- Unified Pipeline: Single command execution for all preprocessing steps
- Compiled Artifacts: Parquet + manifest format with versioning and metadata
- Framework Integration: Compatible with PyTorch DataLoader and Hugging Face Datasets
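As a rough illustration of the hash-based deduplication idea, here is a minimal sketch using the third-party xxhash package; it is not DataGPU's internal deduper, and the row encoding is a simplifying assumption:

# Illustrative only: exact-duplicate removal keyed by an xxHash digest of each row.
# Assumes `pip install xxhash`; DataGPU's internal deduper may differ.
import xxhash

def dedupe_rows(rows):
    """Keep the first occurrence of each row, keyed by its 64-bit xxHash digest."""
    seen = set()
    unique = []
    for row in rows:
        key = xxhash.xxh64(repr(row).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [{"text": "hello"}, {"text": "world"}, {"text": "hello"}]
print(dedupe_rows(rows))  # the duplicate {"text": "hello"} row is dropped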
Install the latest stable version directly from PyPI:
pip install datagpu

For production use, we recommend installing in a virtual environment:
# Create and activate virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install DataGPU
pip install datagpu

Check that DataGPU is installed correctly:
datagpu --version
# Output: DataGPU version 0.1.0

For development or to get the latest features:
git clone https://github.com/Jasiri-App/datagpu.git
cd datagpu
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install in development mode with all dependencies
pip install -e ".[dev]"

Process and optimize your dataset with a single command:
datagpu compile data/your_dataset.csv \
  --rank \
  --dedupe \
  --cache \
  --out compiled/

Example with sample data:
# First, download or generate sample data
python examples/generate_sample_data.py
# Process the sample data
datagpu compile examples/data/small_test.csv --out /tmp/compiled --verbose

Example output:
DataGPU v0.1.0
Compiling: examples/data/small_test.csv
Loading data from examples/data/small_test.csv...
Cleaning data...
Deduplicating...
Ranking by relevance...
Saving to /tmp/compiled/data.parquet...
Compilation complete!
Rows processed 100
Valid rows 100 (100.0%)
Duplicates removed 20 (20.0%)
Ranked samples 80
Processing time 0.1s
Output /tmp/compiled/data.parquet
Manifest /tmp/compiled/manifest.yaml
Dataset version: v0.1.0
Inspect compiled datasets:
datagpu info /tmp/compiled/manifest.yaml

List cached datasets:
datagpu cache-list

Clear cache:
datagpu cache-clear

You can also use DataGPU programmatically:
from datagpu import DataCompiler, load
from datagpu.types import CompilationConfig, RankMethod
# Configure the compilation
config = CompilationConfig(
    source_path="data/your_dataset.csv",
    output_path="compiled/",
    dedupe=True,                          # Enable deduplication
    rank=True,                            # Enable quality ranking
    rank_method=RankMethod.RELEVANCE,
    rank_target="high quality examples",  # Target for relevance ranking
    cache=True,                           # Enable caching
    verbose=True                          # Show progress
)
# Create and run the compiler
compiler = DataCompiler(config)
output_path, manifest, stats = compiler.compile()
# Load the compiled dataset
dataset = load("compiled/manifest.yaml")
# Use with PyTorch DataLoader
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# Or convert to pandas/arrow
df = dataset.to_pandas()
table = dataset.to_arrow()
# Access compilation statistics
print(f"Processed {stats.total_rows} rows")
print(f"Removed {stats.duplicates_removed} duplicates")
print(f"Processing time: {stats.processing_time:.2f}s")

Under the hood, DataGPU is organized into three layers:

┌───────────────────────────────┐
│ CLI Interface (Typer) │
│ - datagpu compile ... │
└──────────────┬────────────────┘
│
┌──────────────┴────────────────┐
│ Compiler Core (Python) │
│ - Loader (Polars/Arrow) │
│ - Cleaner │
│ - Deduper (xxHash) │
│ - Ranker (TF-IDF / cosine) │
│ - Optimizer (Parquet Writer) │
│ - Cache Manager (SQLite) │
└──────────────┬────────────────┘
│
┌──────────────┴────────────────┐
│ Storage Backend │
│ - Local FS │
│ - Parquet / Arrow │
│ - Optional S3 adapter (Phase2)│
└────────────────────────────────┘
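As a mental model, the compiler core's stages map onto a short Polars script. This is only an illustrative sketch of the flow in the diagram, not DataGPU's actual implementation:

# Illustrative sketch of the pipeline stages, written directly against Polars.
# Assumes `pip install polars`; paths reuse the examples from this README.
from pathlib import Path
import polars as pl

df = pl.read_csv("examples/data/small_test.csv")  # Loader (Polars/Arrow)
df = df.drop_nulls()                               # Cleaner (simplified)
df = df.unique()                                   # Deduper (exact duplicates only)
# Ranker (TF-IDF / cosine) would score and reorder rows here.
Path("/tmp/compiled").mkdir(parents=True, exist_ok=True)
df.write_parquet("/tmp/compiled/data.parquet", compression="zstd")  # Optimizer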
Compile command usage:

datagpu compile <source> [OPTIONS]
Options:
--out, -o PATH Output directory [default: compiled]
--rank/--no-rank Enable quality ranking [default: True]
--rank-method TEXT Ranking method: relevance, tfidf, cosine
--rank-target TEXT Target query for relevance ranking
--dedupe/--no-dedupe Enable deduplication [default: True]
--cache/--no-cache Enable caching [default: True]
--compression TEXT Compression: zstd, snappy, gzip [default: zstd]
  --verbose/--quiet     Verbose output [default: True]
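The relevance method scores rows against the --rank-target query using TF-IDF vectors and cosine similarity. Here is a minimal sketch of that kind of scoring, using scikit-learn rather than DataGPU's internal ranker; the row texts and query below are made up for illustration:

# Illustrative only: TF-IDF + cosine-similarity relevance scoring against a target query.
# Assumes `pip install scikit-learn`; DataGPU's ranker may differ in detail.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

rows = ["clean labeled example", "spammy boilerplate text", "high quality example pair"]
target = "high quality examples"

matrix = TfidfVectorizer().fit_transform(rows + [target])
scores = cosine_similarity(matrix[:-1], matrix[-1]).ravel()

# Higher score = more relevant to the target query.
for score, row in sorted(zip(scores, rows), reverse=True):
    print(f"{score:.3f}  {row}")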
# Display dataset information
datagpu info compiled/manifest.yaml

# List cached datasets
datagpu cache-list
# Clear cache
datagpu cache-clear --force
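Conceptually, the cache keys a compiled artifact by a hash of the source data plus the compilation options, so an unchanged input can be reused instead of recompiled. A hypothetical sketch of that idea with SQLite (the database filename, table layout, and key recipe are illustrative assumptions, not DataGPU's actual schema):

# Hypothetical sketch of content-addressed caching with SQLite; not DataGPU's internals.
import hashlib, json, os, sqlite3

def cache_key(source_path: str, options: dict) -> str:
    h = hashlib.sha256()
    with open(source_path, "rb") as f:
        h.update(f.read())                                   # hash the raw source data
    h.update(json.dumps(options, sort_keys=True).encode())   # and the compile options
    return h.hexdigest()

os.makedirs(".datagpu/cache", exist_ok=True)
conn = sqlite3.connect(".datagpu/cache/cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, output_path TEXT)")

key = cache_key("data/your_dataset.csv", {"dedupe": True, "rank": True})
hit = conn.execute("SELECT output_path FROM cache WHERE key = ?", (key,)).fetchone()
if hit:
    print("cache hit, reusing:", hit[0])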
Each compiled dataset includes a manifest.yaml with metadata:

dataset_name: train
version: v0.1.0
rows: 1840200
columns: 12
dedup_ratio: 0.124
rank_method: cosine
created_at: 2025-11-11T14:03:21Z
hash: 7ac2fdf7a00f...
source_path: data/train.csv
compiled_path: compiled/data.parquet
cache_path: .datagpu/cache/
schema:
  id: numeric
  text: text
  category: categorical
stats:
  total_rows: 2400000
  valid_rows: 2367840
  duplicates_removed: 297600
  processing_time: 8.2
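Since the manifest is plain YAML, it can also be inspected programmatically. A small sketch using PyYAML, with field names taken from the example above:

# Read a compiled dataset's manifest with PyYAML (assumes `pip install pyyaml`).
import yaml

with open("compiled/manifest.yaml") as f:
    manifest = yaml.safe_load(f)

print(manifest["version"], manifest["rows"], manifest["columns"])
print("dedup ratio:", manifest["dedup_ratio"])
print("schema:", manifest["schema"])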
Performance targets and current status:

| Metric | Target | Status |
|---|---|---|
| Cleaning throughput | ≥ 1M rows/sec | On track |
| Deduplication | 10× faster than Pandas | Achieved |
| Dataset compression | 40-70% smaller | Achieved |
| Ranking | ≤ 10ms per 1k rows | On track |
| Cache reuse | 5× faster | Implemented |
Example benchmark run:

Dataset: 10k rows
Processing time: 0.8s
Throughput: 12,500 rows/sec
Compression: 65% (CSV → Parquet)
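For reference, the throughput figure is just rows divided by wall-clock time: 10,000 rows / 0.8 s = 12,500 rows/sec.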
Load a compiled dataset and iterate over it with a PyTorch DataLoader:

from datagpu import load
from torch.utils.data import DataLoader
dataset = load("compiled/manifest.yaml")
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    # Train your model
    pass

To load a compiled dataset as a Hugging Face dataset:

from datagpu.loader import load_to_hf
dataset = load_to_hf("compiled/manifest.yaml")
splits = dataset.train_test_split(test_size=0.2)
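train_test_split returns a Hugging Face DatasetDict, so the resulting splits are accessed by name:

train_ds = splits["train"]
test_ds = splits["test"]
print(len(train_ds), len(test_ds))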
To set up a development environment:

# Clone repository
git clone https://github.com/Jasiri-App/datagpu.git
cd datagpu
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run benchmarks
python examples/generate_sample_data.py
python examples/benchmark.py

The repository is laid out as follows:

datagpu/
├── datagpu/ # Core package
│ ├── __init__.py
│ ├── cli.py # CLI interface
│ ├── compiler.py # Main compiler
│ ├── cleaner.py # Data cleaning
│ ├── deduper.py # Deduplication
│ ├── ranker.py # Quality ranking
│ ├── cache.py # Cache management
│ ├── loader.py # Dataset loader
│ ├── types.py # Type definitions
│ └── utils.py # Utilities
├── tests/ # Test suite
├── examples/ # Examples and benchmarks
├── pyproject.toml # Project configuration
└── README.md
Planned work includes:

- Embedding-based near-duplicate removal
- FAISS integration for similarity search
- Distributed compilation with Ray/Dask
- Multi-core optimization
- S3/GCS backend support
- Remote dataset compilation
- Dataset visualization
- Quality metrics and stats
- Version comparison
- Rewrite core kernels in Rust
- 20× performance improvement target
Contributions are welcome! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
DataGPU is released under the Apache 2.0 License.
If you use DataGPU in your research, please cite:
@software{datagpu2025,
title = {DataGPU: Open-source data compiler for AI training datasets},
author = {Celestino Kariuki},
organization = {Safariblocks Ltd.},
year = {2025},
url = {https://github.com/Jasiri-App/datagpu}
}

- Documentation: GitHub README
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with a focus on data quality and reproducibility