Skip to content

forecast-bio/toile

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

toile

CLI tools for working with astrocyte dynamics data

Toile is a Python package for converting microscopy TIFF stacks into WebDataset format for machine learning pipelines. It handles OME-TIFF metadata extraction, batch processing, and creates sharded tar archives optimized for distributed training.

—❤️‍🔥 Forecast

Features

  • OME-TIFF Support: Automatic extraction of spatial, temporal, and experimental metadata from OME-TIFF XML annotations
  • Batch Processing: Process multiple recordings using glob patterns or YAML configuration files
  • Custom Metadata Parsing: Flexible filename parsing system for extracting experimental identifiers
  • Sharded Archives: Configurable shard sizes for WebDataset format (850MB standard, 38MB for Bluesky PDS)
  • ML-Ready: Optional uint8 normalization for efficient model training
  • atdata Integration: Built on the atdata PackableSample framework for data transformation pipelines

Installation

Install using uv (recommended) or pip:

# Using uv
uv add toile

# Using pip
pip install toile

For development:

git clone https://github.com/forecast-bio/toile.git
cd toile
uv sync --all-extras --dev

Quick Start

Export a TIFF stack to WebDataset format:

# Basic usage - export frames from a single recording
toile export frames /path/to/recording/ /output/dataset

# With uint8 normalization for ML
toile export frames /path/to/recording/ /output/dataset --uint8 --verbose

# Batch processing with glob patterns
toile export frames "/data/*/recording*/" /output/dataset --stem my_dataset

# Using PDS-compatible shard size for Bluesky
toile export frames /data/recordings/ /output/dataset --pds

CLI Commands

toile export frames

Convert TIFF stacks to WebDataset format as individual frames.

toile export frames INPUT OUTPUT [OPTIONS]

Arguments:

  • INPUT: Path to TIFF directory or YAML config file
  • OUTPUT: Output directory for tar archives

Options:

  • --stem TEXT: Custom stem for output filenames (default: output directory name)
  • --shard-size INT: Maximum shard size in bytes (default: auto-selected)
  • --pds: Use PDS-compatible shard size (38MB for Bluesky)
  • --uint8: Normalize images to uint8 (0-255) range
  • --compressed: Enable compression (not yet implemented)
  • --verbose: Print detailed progress information

Examples:

# Export single recording with verbose output
toile export frames /data/mouse_123/recording_001/ /output/dataset --verbose

# Batch export with custom naming
toile export frames "/data/experiment_*/*.tif" /output/dataset --stem exp2024

# ML-ready export with normalization
toile export frames /data/recordings/ /output/dataset --uint8 --pds

toile export test-frames

Generate a synthetic test dataset for development and testing.

toile export test-frames OUTPUT [OPTIONS]

Arguments:

  • OUTPUT: Output directory for test dataset

Options:

  • --stem TEXT: Custom stem for output filenames
  • --compressed: Enable gzip compression

Example:

toile export test-frames /tmp/test_dataset --compressed

Configuration Files

For complex batch processing, use YAML configuration files:

# config.yaml
inputs:
  - "/data/experiment1/**/*.tif"
  - "/data/experiment2/**/*.tif"

output_stem: "astrocyte_dataset"
shard_size: 38000000  # 38MB for PDS compatibility
to_uint8: true

# Optional: Extract metadata from filenames
filename_spec:
  template: "mouse_{mouse_id}_slice_{slice_id}_{date}.tif"
  transforms:
    mouse_id: int
    slice_id: identity
    date: date_compact

Then run:

toile export frames config.yaml /output/dataset

Data Schema

Toile uses structured schemas built on the atdata framework:

  • Movie: Full TIFF stack with metadata
  • Frame: Individual image frame with combined metadata
  • SliceRecordingFrame: Experimental frames with mouse/slice identifiers
  • ImageSample: Minimal image data for ML pipelines

Metadata includes acquisition timestamps, physical scales, stage positions, and channel information extracted from OME-TIFF annotations.

Output Format

WebDataset tar archives contain samples with the following structure:

sample-000000-000.npy    # Image data as numpy array
sample-000000-000.json   # Metadata dictionary
sample-000000-001.npy
sample-000000-001.json
...

Each shard is automatically numbered (e.g., dataset-000000.tar, dataset-000001.tar) when the size limit is reached.

Development

Run tests:

uv run pytest

Build package:

uv build

License

This project is licensed under the Mozilla Public License 2.0 (MPL-2.0) - see the LICENSE file for details.

Acknowledgments

Built with:

  • atdata - Streaming schematized datasets framework
  • webdataset - Efficient streaming datasets for ML and more
  • scikit-image - Some good standard impl for image basics

Claude wrote the majority of the docs—if they hallucinated anything, let us know in the Issues!

About

Tools for working with astrocyte dynamics data

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages