DataGPU

Open-source data compiler for AI training datasets

Compile datasets like code: clean, rank, and optimize in one command.

Apache 2.0 License · Python 3.11+


Mission

To make data as programmable and optimized as compute.

DataGPU compiles raw, messy datasets into training-ready binaries, turning 10k+ lines of preprocessing scripts into a single declarative command.

Features

  • Automatic Cleaning: Schema inference and normalization for text, numeric, and categorical data
  • Fast Deduplication: Hash-based duplicate removal using xxHash (see the sketch after this list)
  • Quality Ranking: TF-IDF and cosine similarity-based relevance scoring
  • Smart Caching: Local cache with SQLite for reproducible compilations
  • Unified Pipeline: Single command execution for all preprocessing steps
  • Compiled Artifacts: Parquet + manifest format with versioning and metadata
  • Framework Integration: Compatible with PyTorch DataLoader and Hugging Face Datasets
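
To give a flavor of the deduplication step, here is a minimal sketch of hash-based duplicate removal with the xxhash package. It illustrates the technique only; DataGPU's actual implementation differs.

import xxhash

def dedupe_rows(rows):
    """Drop exact duplicates by hashing a canonical form of each row."""
    seen = set()
    unique = []
    for row in rows:
        # Sort items so column order does not change the hash
        key = xxhash.xxh64(repr(sorted(row.items())).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [{"id": 1, "text": "hello"}, {"id": 1, "text": "hello"}]
print(len(dedupe_rows(rows)))  # -> 1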

Quick Start

Installation from PyPI

Install the latest stable version directly from PyPI:

pip install datagpu

For production use, we recommend installing in a virtual environment:

# Create and activate virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install DataGPU
pip install datagpu

Verify Installation

Check that DataGPU is installed correctly:

datagpu --version
# Output: DataGPU version 0.1.0

Install from Source (Development)

For development or to get the latest features:

git clone https://github.com/Jasiri-App/datagpu.git
cd datagpu

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install in development mode with all dependencies
pip install -e ".[dev]"

Basic Usage

Compile a Dataset

Process and optimize your dataset with a single command:

datagpu compile data/your_dataset.csv \
  --rank \
  --dedupe \
  --cache \
  --out compiled/

Example with sample data:

# First, download or generate sample data
python examples/generate_sample_data.py

# Process the sample data
datagpu compile examples/data/small_test.csv --out /tmp/compiled --verbose

Example output:

DataGPU v0.1.0
Compiling: examples/data/small_test.csv

Loading data from examples/data/small_test.csv...
Cleaning data...
Deduplicating...
Ranking by relevance...
Saving to /tmp/compiled/data.parquet...

Compilation complete!

 Rows processed      100                          
 Valid rows          100 (100.0%)                 
 Duplicates removed  20 (20.0%)                   
 Ranked samples      80                           
 Processing time     0.1s                         
 Output              /tmp/compiled/data.parquet  
 Manifest            /tmp/compiled/manifest.yaml 

Dataset version: v0.1.0

View Dataset Information

Inspect compiled datasets:

datagpu info /tmp/compiled/manifest.yaml

Cache Management

List cached datasets:

datagpu cache-list

Clear cache:

datagpu cache-clear
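
Caching of this kind typically keys compiled artifacts by a content hash of the source file plus the compile options. Here is a toy illustration using Python's standard hashlib and sqlite3 modules; the index.db filename and table layout are hypothetical, not DataGPU's real cache schema.

import hashlib
import sqlite3

def cache_key(source_path: str, options: str) -> str:
    """Hash the input file's bytes together with the compile options."""
    h = hashlib.sha256()
    with open(source_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    h.update(options.encode("utf-8"))
    return h.hexdigest()

conn = sqlite3.connect(".datagpu/cache/index.db")  # hypothetical cache database
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, artifact TEXT)")
key = cache_key("data/your_dataset.csv", "dedupe=True,rank=True")
hit = conn.execute("SELECT artifact FROM cache WHERE key = ?", (key,)).fetchone()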

Python API

You can also use DataGPU programmatically:

from datagpu import DataCompiler, load
from datagpu.types import CompilationConfig, RankMethod

# Configure the compilation
config = CompilationConfig(
    source_path="data/your_dataset.csv",
    output_path="compiled/",
    dedupe=True,                    # Enable deduplication
    rank=True,                      # Enable quality ranking
    rank_method=RankMethod.RELEVANCE,
    rank_target="high quality examples",  # Target for relevance ranking
    cache=True,                     # Enable caching
    verbose=True                    # Show progress
)

# Create and run the compiler
compiler = DataCompiler(config)
output_path, manifest, stats = compiler.compile()

# Load the compiled dataset
dataset = load("compiled/manifest.yaml")

# Use with PyTorch DataLoader
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Or convert to pandas/arrow
df = dataset.to_pandas()
table = dataset.to_arrow()

# Access compilation statistics
print(f"Processed {stats.total_rows} rows")
print(f"Removed {stats.duplicates_removed} duplicates")
print(f"Processing time: {stats.processing_time:.2f}s")

Architecture

┌───────────────────────────────┐
│ CLI Interface (Typer)         │
│  - datagpu compile ...        │
└──────────────┬────────────────┘
               │
┌──────────────┴────────────────┐
│ Compiler Core (Python)        │
│  - Loader (Polars/Arrow)      │
│  - Cleaner                    │
│  - Deduper (xxHash)           │
│  - Ranker (TF-IDF / cosine)   │
│  - Optimizer (Parquet Writer) │
│  - Cache Manager (SQLite)     │
└──────────────┬────────────────┘
               │
┌──────────────┴────────────────┐
│ Storage Backend               │
│  - Local FS                   │
│  - Parquet / Arrow            │
│  - Optional S3 (Phase 2)      │
└───────────────────────────────┘
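
Conceptually, the core is a linear pipeline over Arrow-backed frames. A schematic sketch of how the stages compose, using Polars (the function here is illustrative, not DataGPU's internal API):

import polars as pl

def compile_dataset(path: str) -> pl.DataFrame:
    df = pl.read_csv(path)   # Loader (Polars/Arrow)
    df = df.drop_nulls()     # Cleaner: stand-in for schema-aware normalization
    df = df.unique()         # Deduper: stand-in for xxHash-based removal
    return df                # Ranker and Cache Manager omitted for brevity

compile_dataset("data/your_dataset.csv").write_parquet("compiled/data.parquet")  # Optimizer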

CLI Commands

Compile

datagpu compile <source> [OPTIONS]

Options:
  --out, -o PATH              Output directory [default: compiled]
  --rank/--no-rank            Enable quality ranking [default: True]
  --rank-method TEXT          Ranking method: relevance, tfidf, cosine
  --rank-target TEXT          Target query for relevance ranking
  --dedupe/--no-dedupe        Enable deduplication [default: True]
  --cache/--no-cache          Enable caching [default: True]
  --compression TEXT          Compression: zstd, snappy, gzip [default: zstd]
  --verbose/--quiet           Verbose output [default: True]
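
For --rank-method relevance, the scoring pairs TF-IDF vectors with cosine similarity against the --rank-target query. A self-contained sketch of that technique with scikit-learn (illustrative; DataGPU's ranker may differ in detail):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["well documented example", "zxcv asdf noise", "high quality examples only"]
target = "high quality examples"

# Vectorize samples and query in one vocabulary, then score each row against the query
vectors = TfidfVectorizer().fit_transform(texts + [target])
scores = cosine_similarity(vectors[:-1], vectors[-1]).ravel()
ranked = sorted(zip(scores, texts), reverse=True)  # highest relevance first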

Info

# Display dataset information
datagpu info compiled/manifest.yaml

Cache Management

# List cached datasets
datagpu cache-list

# Clear cache
datagpu cache-clear --force

Dataset Manifest

Each compiled dataset includes a manifest.yaml with metadata:

dataset_name: train
version: v0.1.0
rows: 2070240
columns: 12
dedup_ratio: 0.124
rank_method: cosine
created_at: 2025-11-11T14:03:21Z
hash: 7ac2fdf7a00f...
source_path: data/train.csv
compiled_path: compiled/data.parquet
cache_path: .datagpu/cache/
schema:
  id: numeric
  text: text
  category: categorical
stats:
  total_rows: 2400000
  valid_rows: 2367840
  duplicates_removed: 297600
  processing_time: 8.2
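
Because the manifest is plain YAML, it can also be read directly; a minimal sketch with PyYAML and pandas (assuming both are installed):

import yaml
import pandas as pd

with open("compiled/manifest.yaml") as f:
    manifest = yaml.safe_load(f)

print(manifest["version"], manifest["rows"])     # e.g. v0.1.0 2070240
df = pd.read_parquet(manifest["compiled_path"])  # load the compiled Parquet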

Performance

Benchmarks (MVP)

Metric                 Target                    Status
Cleaning throughput    ≥ 1M rows/sec             On track
Deduplication          10× faster than Pandas    Achieved
Dataset compression    40-70% smaller            Achieved
Ranking                ≤ 10 ms per 1k rows       On track
Cache reuse            5× faster                 Implemented

Example Performance

Dataset: 10k rows
Processing time: 0.8s
Throughput: 12,500 rows/sec
Compression: 65% smaller (CSV → Parquet)

Integration Examples

PyTorch DataLoader

from datagpu import load
from torch.utils.data import DataLoader

dataset = load("compiled/manifest.yaml")
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    # Train your model
    pass

Hugging Face Datasets

from datagpu.loader import load_to_hf

dataset = load_to_hf("compiled/manifest.yaml")
splits = dataset.train_test_split(test_size=0.2)  # returns a DatasetDict; not in-place

Development

Setup

# Clone repository
git clone https://github.com/Jasiri-App/datagpu.git
cd datagpu

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run benchmarks
python examples/generate_sample_data.py
python examples/benchmark.py

Project Structure

datagpu/
├── datagpu/              # Core package
│   ├── __init__.py
│   ├── cli.py            # CLI interface
│   ├── compiler.py       # Main compiler
│   ├── cleaner.py        # Data cleaning
│   ├── deduper.py        # Deduplication
│   ├── ranker.py         # Quality ranking
│   ├── cache.py          # Cache management
│   ├── loader.py         # Dataset loader
│   ├── types.py          # Type definitions
│   └── utils.py          # Utilities
├── tests/                # Test suite
├── examples/             # Examples and benchmarks
├── pyproject.toml        # Project configuration
└── README.md

Roadmap

Phase 0.2 - Semantic Deduplication

  • Embedding-based near-duplicate removal (sketched below)
  • FAISS integration for similarity search
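
A sketch of what embedding-based near-duplicate removal with FAISS could look like (this is a roadmap item, not shipped behavior; the embeddings and threshold are placeholders):

import faiss
import numpy as np

def near_duplicate_ids(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return indices of rows whose nearest earlier neighbor exceeds the threshold."""
    vecs = np.array(embeddings, dtype="float32", copy=True)
    faiss.normalize_L2(vecs)                  # unit norm -> inner product = cosine
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    scores, ids = index.search(vecs, 2)       # k=2: self plus nearest neighbor
    return [i for i, (s, j) in enumerate(zip(scores[:, 1], ids[:, 1]))
            if s >= threshold and j < i]      # keep the first occurrence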

Phase 0.3 - Parallel Compilation

  • Distributed compilation with Ray/Dask
  • Multi-core optimization

Phase 0.4 - Cloud Storage

  • S3/GCS backend support
  • Remote dataset compilation

Phase 0.5 - Web Dashboard

  • Dataset visualization
  • Quality metrics and stats
  • Version comparison

Phase 0.6 - Rust Backend

  • Rewrite core kernels in Rust
  • 20× performance improvement target

Contributing

Contributions are welcome! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

DataGPU is released under the Apache 2.0 License.

Citation

If you use DataGPU in your research, please cite:

@software{datagpu2025,
  title = {DataGPU: Open-source data compiler for AI training datasets},
  author = {Celestino Kariuki},
  organization = {Safariblocks Ltd.},
  year = {2025},
  url = {https://github.com/Jasiri-App/datagpu}
}

Made with focus on data quality and reproducibility