> **This repository has moved to [forecast-bio/atdata](https://github.com/forecast-bio/atdata).**
>
> This repo (`foundation-ac/atdata`) is archived and will no longer receive updates. All new development, issues, and releases happen at the new location.

```bash
# Update your remote
git remote set-url origin https://github.com/forecast-bio/atdata.git
```

# atdata
A loose federation of distributed, typed datasets built on WebDataset.
atdata provides a type-safe, composable framework for working with large-scale datasets. It combines the efficiency of WebDataset's tar-based storage with Python's type system and functional programming patterns.
## Features

- **Typed Samples** - Define dataset schemas using Python dataclasses with automatic msgpack serialization
- **Schema-free Exploration** - Load datasets without defining a schema first using `DictSample`
- **Lens Transformations** - Bidirectional, composable transformations between different dataset views
- **Automatic Batching** - Smart batch aggregation with numpy array stacking
- **WebDataset Integration** - Efficient storage and streaming for large-scale datasets
- **Flexible Data Sources** - Stream from local files, HTTP URLs, or S3-compatible storage
- **HuggingFace-style API** - `load_dataset()` with path resolution and split handling
- **Local & Atmosphere Storage** - Index datasets locally with Redis or publish to the ATProto network
## Installation

```bash
pip install atdata
```

Requires Python 3.12 or later.
## Quick Start

The primary way to load datasets is with `load_dataset()`:

```python
from atdata import load_dataset
# Load without specifying a type - returns Dataset[DictSample]
ds = load_dataset("path/to/data.tar", split="train")
# Explore the data
for sample in ds.ordered():
    print(sample.keys())    # See available fields
    print(sample["text"])   # Dict-style access
    print(sample.label)     # Attribute access
    break
```

Once you understand your data, define a typed schema with `@packable`:

```python
import atdata
from numpy.typing import NDArray
@atdata.packable
class ImageSample:
    image: NDArray
    label: str
    metadata: dict

# Load with explicit type
ds = load_dataset("path/to/data-{000000..000009}.tar", ImageSample, split="train")
# Or convert from DictSample
ds = load_dataset("path/to/data.tar", split="train").as_type(ImageSample)
# Iterate over samples
for sample in ds.ordered():
print(f"Label: {sample.label}, Image shape: {sample.image.shape}")
# Iterate with shuffling and batching
for batch in ds.shuffled(batch_size=32):
    # batch.image is automatically stacked into shape (32, ...)
    # batch.label is a list of 32 labels
    process_batch(batch.image, batch.label)
```

Define reusable transformations between sample types:

```python
@atdata.packable
class ProcessedSample:
    features: NDArray
    label: str
@atdata.lens
def preprocess(sample: ImageSample) -> ProcessedSample:
    features = extract_features(sample.image)
    return ProcessedSample(features=features, label=sample.label)
# Apply lens to view dataset as ProcessedSample
processed_ds = dataset.as_type(ProcessedSample)
for sample in processed_ds.ordered(batch_size=None):
    # sample is now a ProcessedSample
    print(sample.features.shape)
```

## Core Concepts

### DictSample

The default sample type for schema-free exploration. Provides both attribute and dict-style access:

```python
ds = load_dataset("data.tar", split="train")
for sample in ds.ordered():
    # Dict-style access
    print(sample["field_name"])

    # Attribute access
    print(sample.field_name)

    # Introspection
    print(sample.keys())
    print(sample.to_dict())
```

### Packable Samples

Base class for typed, serializable samples. Fields annotated as `NDArray` are automatically handled:

```python
@atdata.packable
class MySample:
    array_field: NDArray            # Automatically serialized
    optional_array: NDArray | None
    regular_field: str
```

Every `@packable` class automatically registers a lens from `DictSample`, enabling seamless conversion via `.as_type()`.
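For instance, a minimal sketch (reusing `load_dataset` and the `MySample` class above; the tar path is a placeholder):

```python
from atdata import load_dataset

raw = load_dataset("data.tar", split="train")  # Dataset[DictSample]
typed = raw.as_type(MySample)                  # conversion via the auto-registered lens
```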
### Lenses

Bidirectional transformations with getter/putter semantics:

```python
@atdata.lens
def my_lens(source: SourceType) -> ViewType:
    # Transform source -> view
    return ViewType(...)
@my_lens.putter
def my_lens_put(view: ViewType, source: SourceType) -> SourceType:
    # Transform view -> source
    return SourceType(...)
```
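As a concrete sketch (reusing `ImageSample` from the quick start; `NormalizedSample` and the scaling logic are illustrative assumptions, not part of the library):

```python
@atdata.packable
class NormalizedSample:
    image: NDArray
    label: str

@atdata.lens
def normalize(sample: ImageSample) -> NormalizedSample:
    # Getter: scale pixel values into [0, 1]
    return NormalizedSample(image=sample.image / 255.0, label=sample.label)

@normalize.putter
def normalize_put(view: NormalizedSample, source: ImageSample) -> ImageSample:
    # Putter: restore the original scale; fields the view dropped come from source
    return ImageSample(
        image=(view.image * 255.0).astype(source.image.dtype),
        label=view.label,
        metadata=source.metadata,
    )
```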
### Data Sources

Datasets support multiple backends via the `DataSource` protocol:

```python
# String URLs (most common) - automatically wrapped in URLSource
dataset = atdata.Dataset[ImageSample]("data-{000000..000009}.tar")
# S3 with authentication (private buckets, Cloudflare R2, MinIO)
source = atdata.S3Source(
bucket="my-bucket",
keys=["data-000000.tar", "data-000001.tar"],
endpoint="https://my-account.r2.cloudflarestorage.com",
access_key="...",
secret_key="...",
)
dataset = atdata.Dataset[ImageSample](source)
```
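HTTP sources need no dedicated class; as with local paths, a plain URL string is wrapped in `URLSource` automatically (a sketch; the host and path are placeholders):

```python
ds = atdata.Dataset[ImageSample]("https://example.com/shards/dataset-{000000..000099}.tar")
```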
### URL Patterns

Uses WebDataset brace expansion for sharded datasets:

- Single file: `"data/dataset-000000.tar"`
- Multiple shards: `"data/dataset-{000000..000099}.tar"`
- Multiple patterns: `"data/{train,val}/dataset-{000000..000009}.tar"`
## Loading Datasets

Load datasets with a familiar interface:

```python
from atdata import load_dataset
# Load without type for exploration (returns Dataset[DictSample])
ds = load_dataset("./data/train-*.tar", split="train")
# Load with explicit type
ds = load_dataset("./data/train-*.tar", ImageSample, split="train")
# Load from S3 with brace notation
ds = load_dataset("s3://bucket/data-{000000..000099}.tar", ImageSample, split="train")
# Load all splits (returns DatasetDict)
ds_dict = load_dataset("./data", ImageSample)
train_ds = ds_dict["train"]
test_ds = ds_dict["test"]
# Convert DictSample to typed schema
ds = load_dataset("./data/train.tar", split="train").as_type(ImageSample)# Install uv if not already available
python -m pip install uv
# Install dependencies
uv sync
```

### Testing

```bash
# Run all tests with coverage
uv run pytest
# Run specific test file
uv run pytest tests/test_dataset.py
# Run single test
uv run pytest tests/test_lens.py::test_lens
```

### Building

```bash
uv build
```

## Contributing

Contributions are welcome! This project is in beta, so the API may still evolve.
## License

This project is licensed under the Mozilla Public License 2.0. See `LICENSE` for details.