CLI tools for working with astrocyte dynamics data
Toile is a Python package for converting microscopy TIFF stacks into WebDataset format for machine learning pipelines. It handles OME-TIFF metadata extraction and batch processing, and it creates sharded tar archives optimized for distributed training.
—❤️🔥 Forecast
- OME-TIFF Support: Automatic extraction of spatial, temporal, and experimental metadata from OME-TIFF XML annotations
- Batch Processing: Process multiple recordings using glob patterns or YAML configuration files
- Custom Metadata Parsing: Flexible filename parsing system for extracting experimental identifiers
- Sharded Archives: Configurable shard sizes for WebDataset format (850MB standard, 38MB for Bluesky PDS)
- ML-Ready: Optional uint8 normalization for efficient model training
- atdata Integration: Built on the atdata PackableSample framework for data transformation pipelines
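To illustrate the optional uint8 normalization listed above, here is a minimal sketch of min-max rescaling with NumPy. The `to_uint8` helper is hypothetical and is not taken from toile's code; the package's `--uint8` option may use a different scheme (for example percentile clipping).

```python
import numpy as np


def to_uint8(frame: np.ndarray) -> np.ndarray:
    """Min-max rescale an image to 0-255 (hypothetical helper, not toile's API)."""
    frame = frame.astype(np.float64)
    lo, hi = frame.min(), frame.max()
    if hi == lo:
        # Constant image: return zeros rather than dividing by zero.
        return np.zeros(frame.shape, dtype=np.uint8)
    scaled = (frame - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


# A small 16-bit frame standing in for one microscopy image.
frame16 = np.array([[100, 200], [300, 500]], dtype=np.uint16)
frame8 = to_uint8(frame16)
```

Storing frames as uint8 halves the size of 16-bit data, at the cost of reduced intensity resolution.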
Install using uv (recommended) or pip:
# Using uv
uv add toile
# Using pip
pip install toile
For development:
git clone https://github.com/forecast-bio/toile.git
cd toile
uv sync --all-extras --dev
Export a TIFF stack to WebDataset format:
# Basic usage - export frames from a single recording
toile export frames /path/to/recording/ /output/dataset
# With uint8 normalization for ML
toile export frames /path/to/recording/ /output/dataset --uint8 --verbose
# Batch processing with glob patterns
toile export frames "/data/*/recording*/" /output/dataset --stem my_dataset
# Using PDS-compatible shard size for Bluesky
toile export frames /data/recordings/ /output/dataset --pds
Convert TIFF stacks to WebDataset format as individual frames.
toile export frames INPUT OUTPUT [OPTIONS]
Arguments:
- INPUT: Path to TIFF directory or YAML config file
- OUTPUT: Output directory for tar archives
Options:
- --stem TEXT: Custom stem for output filenames (default: output directory name)
- --shard-size INT: Maximum shard size in bytes (default: auto-selected)
- --pds: Use PDS-compatible shard size (38MB for Bluesky)
- --uint8: Normalize images to uint8 (0-255) range
- --compressed: Enable compression (not yet implemented)
- --verbose: Print detailed progress information
Examples:
# Export single recording with verbose output
toile export frames /data/mouse_123/recording_001/ /output/dataset --verbose
# Batch export with custom naming
toile export frames "/data/experiment_*/*.tif" /output/dataset --stem exp2024
# ML-ready export with normalization
toile export frames /data/recordings/ /output/dataset --uint8 --pds
Generate a synthetic test dataset for development and testing.
toile export test-frames OUTPUT [OPTIONS]
Arguments:
- OUTPUT: Output directory for test dataset
Options:
- --stem TEXT: Custom stem for output filenames
- --compressed: Enable gzip compression
Example:
toile export test-frames /tmp/test_dataset --compressed
For complex batch processing, use YAML configuration files:
# config.yaml
inputs:
  - "/data/experiment1/**/*.tif"
  - "/data/experiment2/**/*.tif"
output_stem: "astrocyte_dataset"
shard_size: 38000000  # 38MB for PDS compatibility
to_uint8: true
# Optional: Extract metadata from filenames
filename_spec:
  template: "mouse_{mouse_id}_slice_{slice_id}_{date}.tif"
  transforms:
    mouse_id: int
    slice_id: identity
    date: date_compact
Then run:
toile export frames config.yaml /output/dataset
Toile uses structured schemas built on the atdata framework:
- Movie: Full TIFF stack with metadata
- Frame: Individual image frame with combined metadata
- SliceRecordingFrame: Experimental frames with mouse/slice identifiers
- ImageSample: Minimal image data for ML pipelines
Metadata includes acquisition timestamps, physical scales, stage positions, and channel information extracted from OME-TIFF annotations.
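Experimental identifiers such as mouse and slice IDs can also be pulled from filenames via the `filename_spec` template shown in the YAML configuration. The sketch below shows one way such a template could be applied. Everything here is an assumption for illustration: `template_to_regex`, `parse_filename`, and the `TRANSFORMS` table are hypothetical names, and `date_compact` is assumed to mean a `YYYYMMDD` date. Toile's actual parser may behave differently.

```python
import re
from datetime import datetime

# Hypothetical transform table; the keys mirror the YAML example,
# but their behavior is assumed rather than taken from toile.
TRANSFORMS = {
    "int": int,
    "identity": lambda s: s,
    # Assumption: "date_compact" means a YYYYMMDD string.
    "date_compact": lambda s: datetime.strptime(s, "%Y%m%d").date(),
}


def template_to_regex(template: str) -> re.Pattern:
    """Turn 'mouse_{mouse_id}.tif' into a regex with one named group per field."""
    parts = re.split(r"\{(\w+)\}", template)
    # Even indices hold literal text, odd indices hold field names.
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^_/]+)"
        for i, part in enumerate(parts)
    )
    return re.compile(pattern)


def parse_filename(name: str, template: str, transforms: dict) -> dict:
    """Extract typed fields from a filename, or raise if it does not match."""
    match = template_to_regex(template).fullmatch(name)
    if match is None:
        raise ValueError(f"{name!r} does not match template {template!r}")
    return {
        field: TRANSFORMS[transforms.get(field, "identity")](value)
        for field, value in match.groupdict().items()
    }


meta = parse_filename(
    "mouse_12_slice_A3_20240115.tif",
    "mouse_{mouse_id}_slice_{slice_id}_{date}.tif",
    {"mouse_id": "int", "slice_id": "identity", "date": "date_compact"},
)
```

In this sketch field values may not contain underscores or slashes, which keeps the match unambiguous when fields are separated by underscores.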
WebDataset tar archives contain samples with the following structure:
sample-000000-000.npy # Image data as numpy array
sample-000000-000.json # Metadata dictionary
sample-000000-001.npy
sample-000000-001.json
...
Each shard is automatically numbered (e.g., dataset-000000.tar, dataset-000001.tar) when the size limit is reached.
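The layout and shard numbering described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not toile's implementation: `write_shards` and `read_shard` are hypothetical helpers, the rollover rule (start a new shard once the payload written so far reaches the limit) is assumed, and the `.npy` payloads are placeholder bytes. In a real pipeline they would be produced by `np.save` and decoded with `np.load(io.BytesIO(data))`.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shards(samples, out_dir, stem, max_bytes):
    """Write (key, npy_bytes, metadata) samples into numbered tar shards."""
    out_dir = Path(out_dir)
    paths, tar, written = [], None, 0
    for key, npy_bytes, metadata in samples:
        # Assumed rule: roll over once the current shard has reached the limit.
        if tar is None or written >= max_bytes:
            if tar is not None:
                tar.close()
            path = out_dir / f"{stem}-{len(paths):06d}.tar"
            tar = tarfile.open(path, "w")
            paths.append(path)
            written = 0
        members = ((".npy", npy_bytes), (".json", json.dumps(metadata).encode()))
        for suffix, payload in members:
            info = tarfile.TarInfo(name=key + suffix)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
            written += len(payload)
    if tar is not None:
        tar.close()
    return paths


def read_shard(path):
    """Group one shard's members by sample key: {key: {"npy": bytes, "json": dict}}."""
    grouped = {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            key, suffix = member.name.rsplit(".", 1)
            data = tar.extractfile(member).read()
            value = json.loads(data) if suffix == "json" else data
            grouped.setdefault(key, {})[suffix] = value
    return grouped


with tempfile.TemporaryDirectory() as tmp:
    # Four samples with 100 placeholder bytes each standing in for .npy data.
    samples = [(f"sample-000000-{i:03d}", bytes(100), {"frame": i}) for i in range(4)]
    shard_paths = write_shards(samples, tmp, "dataset", max_bytes=200)
    shard_names = [p.name for p in shard_paths]
    first = read_shard(shard_paths[0])
```

Because files that share a key sit next to each other in the archive, a reader can reassemble each sample while streaming the tar sequentially.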
Run tests:
uv run pytest
Build package:
uv build
This project is licensed under the Mozilla Public License 2.0 (MPL-2.0) - see the LICENSE file for details.
Built with:
- atdata - Streaming schematized datasets framework
- webdataset - Efficient streaming datasets for ML and more
- scikit-image - Standard implementations of basic image processing operations
Claude wrote the majority of the docs—if they hallucinated anything, let us know in the Issues!