
SynthMT: Synthetic Data Enables Human-Grade Microtubule Analysis with foundation models for segmentation


SynthMT Overview

Figure: The SynthMT instance segmentation benchmark evaluates methods on synthetic IRM-like images containing microtubules. (a) Synthetic image mimicking IRM of in vitro reconstituted MTs nucleated from fixed seeds (red), reproducing key mechanical and geometrical properties such as filament length and curvature. (b) Our pipeline generates accompanying ground-truth instance masks for quantitative evaluation. (c) The classical FIESTA algorithm demonstrates typical failure modes: filament fragmentation, incomplete segmentation, and artifacts at intersections. (d) SAM3 guided by a simple text prompt ("thin line") produces precise, human-grade segmentation.

Overview

SynthMT is a synthetic benchmark dataset for evaluating instance segmentation methods on in vitro microtubule (MT) images. Studying microtubules and their mechanical properties is central to understanding intracellular transport, cell division, and drug action. Despite this importance, experts still spend many hours manually segmenting these filamentous structures.

This repository provides:

  • 🔬 Synthetic data generation pipeline that produces realistic MT images with ground-truth instance masks
  • 🎯 Parameter optimization using DINOv2 embeddings to align synthetic images with real microscopy data
  • 📊 SynthMT benchmark dataset tuned on real IRM microscopy images (no human annotations required)
  • 🧪 Evaluation framework for benchmarking segmentation methods in zero-shot and few-shot settings

Key Findings

Our benchmark evaluates nine fully automated methods for MT analysis. Key results:

  • Classical algorithms and most current foundation models struggle on in vitro MT IRM images that humans perceive as visually simple
  • SAM3 (text-prompted as "SAM3Text") achieves human-grade performance after hyperparameter optimization on only 10 random SynthMT images

🔗 Resources

| Resource | Link |
| --- | --- |
| 📄 Paper | bioRxiv |
| 🌐 Project Page | DATEXIS.github.io/SynthMT-project-page (interactive demos for all evaluated models) |
| 🤗 Dataset | huggingface.co/datasets/HTW-KI-Werkstatt/SynthMT |
| 💻 Code | This repository |


Installation

We recommend using uv for fast, reliable Python package management: it is significantly faster than pip, provides better dependency resolution, and works seamlessly within conda environments. We recommend Python 3.11.

Using Conda (Recommended)

This is the recommended approach, as conda's environment management is required for µSAM.

# Clone the repository
git clone https://github.com/ml-lab-htw/SynthMT.git
cd SynthMT

# Create conda environment from environment.yml
conda env create -f environment.yml
conda activate synth_mt

Detailed optional model installation & notes

Below are per-model installation commands and platform-specific notes. These assume you are in the project's Python environment (conda env or virtualenv) created following the Installation section above.

microSAM (µSAM) — Quick install & notes
  • Quick install (conda):
conda install -c conda-forge micro_sam
  • Notes:
    • microSAM is distributed on conda-forge and expects to run inside a conda environment.
    • Use the environment created from environment.yml or create a fresh conda env with Python 3.11.
CellSAM (fork) — Install our compatibility fork & notes
  • Install our compatibility fork (recommended until upstream PR is merged):
pip install git+https://github.com/mario-koddenbrock/cellSAM.git

(This is the same source pinned in the requirements.txt file.)

  • After installation, create a .env with the access token if required:
DEEPCELL_ACCESS_TOKEN=your_token_here
  • Quick instructions to create the .env file (macOS / zsh):
# Create .env in the project root (overwrites if it exists)
echo 'DEEPCELL_ACCESS_TOKEN=your_token_here' > .env

# Restrict permissions so the token file isn't world-readable
chmod 600 .env

# Prevent accidentally committing it to git (if not already ignored)
# This appends '.env' to .gitignore if it's not present
grep -qxF '.env' .gitignore || echo '.env' >> .gitignore
  • Alternatives and notes:

    • You can set the token for a single shell session instead of a file:

      export DEEPCELL_ACCESS_TOKEN=your_token_here
    • For CI or remote deployments, store the token in CI secrets or environment configuration rather than a file in the repo.

    • The code loads .env from the current working directory (see synth_mt/benchmark/models/cellsam.py), so place the .env in the project root or the working directory you run the script from.

    • You can also pass the token programmatically to the model if supported by the API (e.g., CellSAM(access_token='your_token')).

  • Notes:

    • The fork includes small compatibility fixes for integration in this pipeline. Once upstream fixes land, you can switch back to the official package.
    • Installing inside a virtualenv or conda environment is recommended to avoid clashes.
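The `.env` lookup described above can be reproduced with a few lines of standard-library Python if you prefer not to depend on a dotenv package. This is a minimal sketch of the pattern, not the loader actually used in `synth_mt/benchmark/models/cellsam.py`; the function name is illustrative:

```python
import os

def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=value lines, skips blanks and '#' comments.

    Existing environment variables are not overwritten (setdefault), matching
    the usual dotenv convention.
    """
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Load from the current working directory, then read the token
load_dotenv()
token = os.environ.get("DEEPCELL_ACCESS_TOKEN")
```

If no `.env` exists in the working directory, the loader is a no-op and `token` is `None`.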
TARDIS (tardis-em == 0.3.10) — Pinned install & notes
  • Install pinned version used in our experiments:
pip install tardis-em==0.3.10
  • Notes:
    • We recommend installing TARDIS inside a conda environment to avoid system-level dependency conflicts.
    • If the package needs compiled extensions on your platform, ensure build tools are available (gcc/clang, python-dev headers).
StarDist (stardist == 0.9.1) — TensorFlow dependency & install
  • StarDist depends on TensorFlow. Install a system-appropriate TensorFlow wheel first (CPU/GPU, macOS vs Linux/Windows).

Examples:

# macOS (Apple Silicon M1/M2) - recommended CPU build
pip install tensorflow-macos

# Linux / Windows or Intel macOS
pip install tensorflow

# Then install StarDist (pinned)
pip install stardist==0.9.1
  • Notes:
    • For GPU acceleration, install the TensorFlow wheel that matches your CUDA toolkit before installing StarDist.
    • Some features of StarDist rely on CSBDeep: pip install csbdeep if you need it.
Cellpose (>= 3.0.0) — PyTorch dependency & install
  • Cellpose requires PyTorch. Install a PyTorch wheel that matches your CUDA version or CPU-only wheel first, then install Cellpose.

Examples (CPU-only PyTorch):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install "cellpose>=3.0.0"

Or with conda (conda-forge):

conda install -c conda-forge cellpose
  • Notes:
    • For GPU support, follow the instructions at https://pytorch.org/get-started/locally/ to choose the correct CUDA build, then install Cellpose.
    • On macOS Apple Silicon prefer CPU builds or follow PyTorch macOS guidance.
SAM3 (Segment Anything Model v3) — transformers pre-release
  • SAM3 requires a pre-release transformers package in some setups and may need access tokens or extra model files depending on the provider.
# If you created the conda environment from `environment.yml`, transformers==5.0.0rc3
# will already be installed via pip. If you are managing packages manually or using
# a different environment, install the pre-release pinned by this project:
pip install -U transformers==5.0.0rc3
  • Notes:
    • Using SAM3 in this repository often requires additional configuration (model checkpoints, provider access). Follow SAM3 provider instructions and ensure the pinned transformers pre-release is present in your environment.
FIESTA (MATLAB) — Fork & MATLAB Engine
  • We provide a small fork of the original FIESTA project with modifications that enable running FIESTA in script mode from Python. Use our fork:
# recommended: clone into the SynthMT root so the folder appears as ./fiesta
git clone https://github.com/ml-lab-htw/FIESTA.git ./fiesta
  • Install the MATLAB Engine for Python using the same Python interpreter/environment you will run SynthMT from. For MATLAB R2025b on macOS:
cd /Applications/MATLAB_R2025b.app/extern/engines/python
python3 -m pip install .
  • Notes:
    • The MATLAB Engine must be installed into the exact Python interpreter/environment you will run SynthMT from.
    • We developed and tested our integration with MATLAB R2025b; adapt paths and instructions if you have a different MATLAB version.

Quick Start

Load the SynthMT Dataset

from datasets import load_dataset

# Load the dataset from HuggingFace
ds = load_dataset("HTW-KI-Werkstatt/SynthMT", split="train")

# Access a sample
sample = ds[0]
image = sample["image"]  # PIL Image
masks = sample["mask"]   # List of PIL Images (instance masks)

Generate Synthetic Data

from synth_mt.config.synthetic_data import SyntheticDataConfig
from synth_mt.data_generation.video import generate_video

# Load configuration
cfg = SyntheticDataConfig.from_json("examples/synthetic_data_example.json")

# Generate video with masks
generate_video(cfg, base_output_dir="output/")

Example Notebooks

We provide detailed Jupyter notebooks demonstrating different aspects of the pipeline:

| Notebook | Description |
| --- | --- |
| `example_load_SynthMT.ipynb` | Load and visualize the SynthMT dataset from HuggingFace. Shows how to decompose samples into images and masks, convert to NumPy arrays, and create overlay visualizations. |
| `example_evaluate_model.ipynb` | Evaluate segmentation models on SynthMT. Load models via ModelFactory, run predictions, and compute segmentation metrics (SkIoU, F1, AP) and downstream metrics (count, length, curvature distributions). |
| `example_single_frame_generation.ipynb` | Detailed walkthrough of the image generation pipeline. Explains the two-step stochastic process: (1) geometry generation with polylines and stochastic curvature, and (2) image rendering with PSF convolution, noise, and artifacts. |
| `example_generate_synthetic_data.ipynb` | Generate synthetic video data from a JSON configuration. Includes microtubule dynamics (growing, shrinking, pausing, rescue) and produces images, masks, videos, and preview animations. |
| `example_optimize_synthetic_data.ipynb` | Tune generation parameters θ to match real microscopy images. Uses DINOv2 embeddings and Optuna for optimization without requiring ground-truth annotations. |
| `example_full_pipeline.ipynb` | Complete end-to-end pipeline for applying SynthMT to your own data. Tune synthetic data, optimize SAM3Text hyperparameters, and compare zero-shot vs HPO performance, all without manual annotations. |

Dataset

The SynthMT dataset is hosted on HuggingFace and contains synthetic IRM-like microtubule images with instance segmentation masks.

Dataset Structure

Each sample contains:

  • image: RGB microscopy image (PIL Image)
  • mask: List of binary instance masks (one per microtubule)

Loading the Dataset

from datasets import load_dataset
import numpy as np

# Load dataset
ds = load_dataset("HTW-KI-Werkstatt/SynthMT", split="train")

# Convert to NumPy
sample = ds[0]
img_array = np.array(sample["image"].convert("RGB"))  # (H, W, 3)
mask_stack = np.stack([np.array(m.convert("L")) for m in sample["mask"]], axis=0)  # (N, H, W)

Synthetic Data Generation

Parameter Optimization

Parameter Optimization Pipeline

Figure: Optimizing θ aligns synthetic image distributions with real, annotation-free microscopy data. Real IRM images (left) and synthetic images (center) are embedded using DINOv2. The parametric generator $P_\theta$ (right) creates images by sampling from distributions governing geometric properties (filament count, length, curvature) and imaging characteristics (PSF, noise, artifacts, contrast, distortions), all controlled by θ. An optimization loop iteratively refines θ by maximizing cosine similarity between real and synthetic embeddings, ensuring that synthetic images match the statistical properties and visual characteristics of experimental data.
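The alignment objective in this loop can be illustrated with a small NumPy sketch: given DINOv2 embeddings of a real and a synthetic image batch, the score to maximize is the mean pairwise cosine similarity between the two sets. The function and variable names below are illustrative, not the repository's API, and the embeddings are random stand-ins:

```python
import numpy as np

def embedding_similarity(real: np.ndarray, synth: np.ndarray) -> float:
    """Mean pairwise cosine similarity between two embedding sets of shape
    (n_images, dim). Higher values indicate better distribution alignment."""
    real = real / np.linalg.norm(real, axis=1, keepdims=True)
    synth = synth / np.linalg.norm(synth, axis=1, keepdims=True)
    return float(np.mean(real @ synth.T))

# Toy example with random stand-ins for DINOv2 embeddings
rng = np.random.default_rng(0)
real_emb = rng.normal(size=(8, 768))
score = embedding_similarity(real_emb, real_emb)  # identical sets score highly
```

In the actual pipeline this scalar would serve as the objective an optimizer such as Optuna maximizes while proposing new θ.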

Mathematical Framework

The generation pipeline follows a two-step stochastic process that produces synthetic images $I \sim P_\theta(I)$ conditioned on parameter set $\theta$:

Step 1: Microtubule Geometry

Each MT is modeled as a polyline with $n$ segments. Segment lengths are sampled from a Gaussian distribution, and curvature is introduced through stochastic evolution of bend angles using a Gamma distribution. This yields smoothly curved filaments that replicate real MT morphology.
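This geometry step can be sketched in a few lines of NumPy: segment lengths come from a Gaussian, and the heading angle evolves by Gamma-distributed increments with random sign, producing smooth stochastic curvature. Parameter names here are illustrative, not the repository's actual config keys:

```python
import numpy as np

def sample_polyline(n_segments: int = 20,
                    seg_len_mean: float = 5.0, seg_len_std: float = 1.0,
                    bend_shape: float = 2.0, bend_scale: float = 0.02,
                    rng=None) -> np.ndarray:
    """Sample a smoothly curved 2D polyline.

    Gaussian segment lengths; bend-angle increments drawn from a Gamma
    distribution with a random sign, then accumulated into a heading.
    Returns vertex coordinates of shape (n_segments + 1, 2).
    """
    rng = rng or np.random.default_rng()
    lengths = np.clip(rng.normal(seg_len_mean, seg_len_std, n_segments), 0.1, None)
    bends = rng.gamma(bend_shape, bend_scale, n_segments) * rng.choice([-1, 1], n_segments)
    angles = np.cumsum(bends)                        # heading evolves stochastically
    steps = np.stack([lengths * np.cos(angles),
                      lengths * np.sin(angles)], axis=1)
    return np.vstack([[0.0, 0.0], np.cumsum(steps, axis=0)])

pts = sample_polyline(rng=np.random.default_rng(42))
```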

Step 2: Image Rendering

  1. Physical Rendering: Binary masks are convolved with the Point Spread Function (PSF), scaled by contrast and background intensity
  2. Artifact Simulation: Distractor spots (circular, irregular structures) are added
  3. Noise Addition: Signal-dependent (Poisson) and signal-independent (Gaussian) noise
  4. Global Distortions: Vignetting, blur, and contrast variations
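A simplified stand-in for stages 1, 3, and 4 above (distractor-spot simulation from stage 2 is omitted for brevity) can be sketched with SciPy; all parameter values and names are illustrative, not the actual renderer:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def render(mask: np.ndarray, psf_sigma: float = 1.5,
           contrast: float = 0.4, background: float = 0.5,
           gaussian_noise: float = 0.02, photons: float = 500.0,
           rng=None) -> np.ndarray:
    """Render a binary MT mask into a noisy IRM-like intensity image in [0, 1]."""
    rng = rng or np.random.default_rng()
    # 1. Physical rendering: PSF blur, scaled by contrast against the background
    img = background - contrast * gaussian_filter(mask.astype(float), psf_sigma)
    # 3. Signal-dependent (Poisson) and signal-independent (Gaussian) noise
    img = rng.poisson(np.clip(img, 0, None) * photons) / photons
    img = img + rng.normal(0.0, gaussian_noise, img.shape)
    # 4. Clamp to a valid intensity range (global distortions omitted here)
    return np.clip(img, 0.0, 1.0)

mask = np.zeros((64, 64))
mask[32, 10:54] = 1  # one horizontal filament
img = render(mask, rng=np.random.default_rng(0))
```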

Command-Line Usage

# Generate a single video
python scripts/generate_synthetic_data.py \
    -c ./config/synthetic_config.json \
    -o ./data/generated \
    --count 1

# Generate 10 videos sequentially
python scripts/generate_synthetic_data.py \
    -c ./config/synthetic_config.json \
    -o ./data/generated \
    --count 10

CLI Arguments

| Argument | Shorthand | Required | Description |
| --- | --- | --- | --- |
| `--config <path>` | `-c` | Yes | Path to the JSON configuration file |
| `--output-dir <path>` | `-o` | Yes | Output directory for generated data |
| `--ids <id1> <id2>` | | No | Specific video IDs to generate |
| `--count <number>` | | No | Number of videos to generate |
| `--start-id <number>` | | No | Starting ID for sequential generation |
| `--save-config` | | No | Save configuration copy for reproducibility |

Configuration

The generation is controlled by a JSON configuration file. Parameters are grouped by their effect:

| Category | Description | Key Parameters |
| --- | --- | --- |
| Core Properties | Video dimensions and duration | `img_size`, `fps`, `num_frames` |
| MT Dynamics | Growth, shrinkage, catastrophe | `growth_speed`, `shrink_speed`, `catastrophe_prob`, `rescue_prob` |
| Filament Structure | Segment length and bending | `max_sum_segments`, `segment_length_*`, `max_angle` |
| Population | Number and placement of MTs | `num_microtubule`, `microtubule_min_dist` |
| Optics & PSF | Blur and sharpness | `psf_sigma_h`, `psf_sigma_v`, `global_blur_sigma` |
| Noise Model | Poisson and Gaussian noise | `quantum_efficiency`, `gaussian_noise` |
| Artifacts | Background particles | `fixed_spots`, `moving_spots`, `random_spots` |

See examples/synthetic_data_example.json for a complete configuration example.
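For illustration, a minimal configuration file could be assembled programmatically. The key names below are taken from the table above, the values are placeholders, and examples/synthetic_data_example.json remains the authoritative reference for the full schema:

```python
import json

# Illustrative subset of generation parameters; not a complete config.
cfg = {
    "img_size": 512,
    "fps": 5,
    "num_frames": 1,
    "num_microtubule": 30,
    "psf_sigma_h": 1.5,
    "psf_sigma_v": 1.5,
    "gaussian_noise": 0.02,
}

with open("my_synthetic_config.json", "w") as fh:
    json.dump(cfg, fh, indent=2)
```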

Testing

Some tests require access to the Hugging Face Hub. Set the environment variable:

export HUGGING_FACE_HUB_TOKEN=your_token_here

Or create a .env file:

HUGGING_FACE_HUB_TOKEN=your_token_here

Run tests:

pytest

Citation

If you use SynthMT in your research, please cite our paper:

@article{koddenbrock2026synthetic,
    author = {Koddenbrock, Mario and Westerhoff, Justus and Fachet, Dominik and Reber, Simone and Gers, Felix A. and Rodner, Erik},
    title = {Synthetic data enables human-grade microtubule analysis with foundation models for segmentation},
    elocation-id = {2026.01.09.698597},
    year = {2026},
    doi = {10.64898/2026.01.09.698597},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2026/01/12/2026.01.09.698597},
    eprint = {https://www.biorxiv.org/content/early/2026/01/12/2026.01.09.698597.full.pdf},
    journal = {bioRxiv}
}

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue.


🙏 Acknowledgements

Our work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project-ID 528483508 - FIP 12. We thank Dominik Fachet and Gil Henkin from the Reber lab for providing data, and the other study participants Moritz Becker, Nathaniel Boateng, and Miguel Aguilar. The Reber lab thanks the staff of the Advanced Medical Bioimaging Core Facility (Charité, Berlin) for imaging support and the Max Planck Society for funding. We also thank Kristian Hildebrand as well as Chaitanya A. Athale (IISER Pune, India) and his lab for helpful discussions.

