
SynthMT: Synthetic Data Enables Human-Grade Microtubule Analysis with foundation models for segmentation


SynthMT Overview

Figure: The SynthMT instance segmentation benchmark evaluates methods on synthetic IRM-like images containing microtubules. (a) Synthetic image mimicking IRM of in vitro reconstituted MTs nucleated from fixed seeds (red), reproducing key mechanical and geometrical properties such as filament length and curvature. (b) Our pipeline generates accompanying ground-truth instance masks for quantitative evaluation. (c) The classical FIESTA algorithm demonstrates typical failure modes: filament fragmentation, incomplete segmentation, and artifacts at intersections. (d) SAM3 guided by a simple text prompt ("thin line") produces precise, human-grade segmentation.

Overview

SynthMT is a synthetic benchmark dataset for evaluating instance segmentation methods on in vitro microtubule (MT) images. Studying microtubules and their mechanical properties is central to understanding intracellular transport, cell division, and drug action. Despite this importance, experts still spend many hours manually segmenting these filamentous structures.

This repository provides:

  • 🔬 Synthetic data generation pipeline that produces realistic MT images with ground-truth instance masks
  • 🎯 Parameter optimization using DINOv2 embeddings to align synthetic images with real microscopy data
  • 📊 SynthMT benchmark dataset tuned on real IRM microscopy images (no human annotations required)
  • 🧪 Evaluation framework for benchmarking segmentation methods in zero-shot and few-shot settings

Key Findings

Our benchmark evaluates nine fully automated methods for MT analysis. Key results:

  • Classical algorithms and most current foundation models struggle on in vitro MT IRM images that humans perceive as visually simple
  • SAM3 (text-prompted as "SAM3Text") achieves human-grade performance after hyperparameter optimization on only 10 random SynthMT images

🔗 Resources

| Resource | Link |
| --- | --- |
| 📄 Paper | bioRxiv |
| 🌐 Project Page | DATEXIS.github.io/SynthMT-project-page (interactive demos for all evaluated models) |
| 🤗 Dataset | huggingface.co/datasets/HTW-KI-Werkstatt/SynthMT |
| 💻 Code | This repository |


Installation

We recommend using uv for fast, reliable Python package management: it is significantly faster than pip, provides better dependency resolution, and works seamlessly within conda environments. We recommend Python 3.11.

Using Conda (Recommended)

This is the recommended approach, as conda's environment management is required for µSAM.

# Clone the repository
git clone https://github.com/ml-lab-htw/SynthMT.git
cd SynthMT

# Create conda environment from environment.yml
conda env create -f environment.yml
conda activate synth_mt

Detailed optional model installation & notes

Below are per-model installation commands and platform-specific notes. These assume you are in the project's Python environment (conda env or virtualenv) created following the Installation section above.

microSAM (µSAM) — Quick install & notes
  • Quick install (conda):
conda install -c conda-forge micro_sam
  • Notes:
    • microSAM is distributed on conda-forge and expects to run inside a conda environment.
    • Use the environment created from environment.yml or create a fresh conda env with Python 3.11.
CellSAM (fork) — Install our compatibility fork & notes
  • Install our compatibility fork (recommended until upstream PR is merged):
pip install git+https://github.com/mario-koddenbrock/cellSAM.git

(This is the same source pinned in the requirements.txt file.)

  • After installation, create a .env with the access token if required:
DEEPCELL_ACCESS_TOKEN=your_token_here
  • Quick instructions to create the .env file (macOS / zsh):
# Create .env in the project root (overwrites if it exists)
echo 'DEEPCELL_ACCESS_TOKEN=your_token_here' > .env

# Restrict permissions so the token file isn't world-readable
chmod 600 .env

# Prevent accidentally committing it to git (if not already ignored)
# This appends '.env' to .gitignore if it's not present
grep -qxF '.env' .gitignore || echo '.env' >> .gitignore
  • Alternatives and notes:

    • You can set the token for a single shell session instead of a file:

      export DEEPCELL_ACCESS_TOKEN=your_token_here
    • For CI or remote deployments, store the token in CI secrets or environment configuration rather than a file in the repo.

    • The code loads .env from the current working directory (see synth_mt/benchmark/models/cellsam.py), so place the .env in the project root or the working directory you run the script from.

    • You can also pass the token programmatically to the model if supported by the API (e.g., CellSAM(access_token='your_token')).

  • Notes:

    • The fork includes small compatibility fixes for integration in this pipeline. Once upstream fixes land, you can switch back to the official package.
    • Installing inside a virtualenv or conda environment is recommended to avoid clashes.
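The `.env` lookup described above can be reproduced with a few lines of standard-library Python if you prefer not to depend on a dotenv package. This is a minimal sketch of the pattern, not the loader actually used in `synth_mt/benchmark/models/cellsam.py`; the function name is illustrative:

```python
import os

def load_dotenv(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=value lines, skips blanks and '#' comments.

    Existing environment variables are not overwritten (setdefault), matching
    the usual dotenv convention.
    """
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Load from the current working directory, then read the token
load_dotenv()
token = os.environ.get("DEEPCELL_ACCESS_TOKEN")
```

If no `.env` exists in the working directory, the loader is a no-op and `token` is `None`.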
TARDIS (tardis-em == 0.3.10) — Pinned install & notes
  • Install pinned version used in our experiments:
pip install tardis-em==0.3.10
  • Notes:
    • We recommend installing TARDIS inside a conda environment to avoid system-level dependency conflicts.
    • If the package needs compiled extensions on your platform, ensure build tools are available (gcc/clang, python-dev headers).
StarDist (stardist == 0.9.1) — TensorFlow dependency & install
  • StarDist depends on TensorFlow. Install a system-appropriate TensorFlow wheel first (CPU/GPU, macOS vs Linux/Windows).

Examples:

# macOS (Apple Silicon M1/M2) - recommended CPU build
pip install tensorflow-macos

# Linux / Windows or Intel macOS
pip install tensorflow

# Then install StarDist (pinned)
pip install stardist==0.9.1
  • Notes:
    • For GPU acceleration, install the TensorFlow wheel that matches your CUDA toolkit before installing StarDist.
    • Some features of StarDist rely on CSBDeep: pip install csbdeep if you need it.
Cellpose (>= 3.0.0) — PyTorch dependency & install
  • Cellpose requires PyTorch. Install a PyTorch wheel that matches your CUDA version or CPU-only wheel first, then install Cellpose.

Examples (CPU-only PyTorch):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install "cellpose>=3.0.0"

Or with conda (conda-forge):

conda install -c conda-forge cellpose
  • Notes:
    • For GPU support, follow the instructions at https://pytorch.org/get-started/locally/ to choose the correct CUDA build, then install Cellpose.
    • On macOS Apple Silicon prefer CPU builds or follow PyTorch macOS guidance.
SAM3 (Segment Anything Model v3) — transformers pre-release
  • SAM3 requires a pre-release transformers package in some setups and may need access tokens or extra model files depending on the provider.
# If you created the conda environment from `environment.yml`, transformers==5.0.0rc3
# will already be installed via pip. If you are managing packages manually or using
# a different environment, install the pre-release pinned by this project:
pip install -U transformers==5.0.0rc3
  • Notes:
    • Using SAM3 in this repository often requires additional configuration (model checkpoints, provider access). Follow SAM3 provider instructions and ensure the pinned transformers pre-release is present in your environment.
FIESTA (MATLAB) — Fork & MATLAB Engine
  • We provide a small fork of the original FIESTA project with modifications that enable running FIESTA in script mode from Python. Use our fork:
# recommended: clone into the SynthMT root so the folder appears as ./fiesta
git clone https://github.com/ml-lab-htw/FIESTA.git ./fiesta
  • Install the MATLAB Engine for Python using the same Python interpreter/environment you will run SynthMT from. For MATLAB R2025b on macOS:
cd /Applications/MATLAB_R2025b.app/extern/engines/python
python3 -m pip install .
  • Notes:
    • The MATLAB Engine must be installed into the exact Python interpreter/environment you will run SynthMT from.
    • We developed and tested our integration with MATLAB R2025b; adapt paths and instructions if you have a different MATLAB version.

Quick Start

Load the SynthMT Dataset

from datasets import load_dataset

# Load the dataset from HuggingFace
ds = load_dataset("HTW-KI-Werkstatt/SynthMT", split="train")

# Access a sample
sample = ds[0]
image = sample["image"]  # PIL Image
masks = sample["mask"]   # List of PIL Images (instance masks)

Generate Synthetic Data

from synth_mt.config.synthetic_data import SyntheticDataConfig
from synth_mt.data_generation.video import generate_video

# Load configuration
cfg = SyntheticDataConfig.from_json("examples/synthetic_data_example.json")

# Generate video with masks
generate_video(cfg, base_output_dir="output/")

Example Notebooks

We provide detailed Jupyter notebooks demonstrating different aspects of the pipeline:

| Notebook | Description |
| --- | --- |
| `example_load_SynthMT.ipynb` | Load and visualize the SynthMT dataset from HuggingFace. Shows how to decompose samples into images and masks, convert to NumPy arrays, and create overlay visualizations. |
| `example_evaluate_model.ipynb` | Evaluate segmentation models on SynthMT. Load models via ModelFactory, run predictions, and compute segmentation metrics (SkIoU, F1, AP) and downstream metrics (count, length, curvature distributions). |
| `example_single_frame_generation.ipynb` | Detailed walkthrough of the image generation pipeline. Explains the two-step stochastic process: (1) geometry generation with polylines and stochastic curvature, and (2) image rendering with PSF convolution, noise, and artifacts. |
| `example_generate_synthetic_data.ipynb` | Generate synthetic video data from a JSON configuration. Includes microtubule dynamics (growing, shrinking, pausing, rescue) and produces images, masks, videos, and preview animations. |
| `example_optimize_synthetic_data.ipynb` | Tune generation parameters θ to match real microscopy images. Uses DINOv2 embeddings and Optuna for optimization without requiring ground-truth annotations. |
| `example_full_pipeline.ipynb` | Complete end-to-end pipeline for applying SynthMT to your own data. Tune synthetic data, optimize SAM3Text hyperparameters, and compare zero-shot vs HPO performance, all without manual annotations. |

Dataset

The SynthMT dataset is hosted on HuggingFace and contains synthetic IRM-like microtubule images with instance segmentation masks.

Dataset Structure

Each sample contains:

  • image: RGB microscopy image (PIL Image)
  • mask: List of binary instance masks (one per microtubule)

Loading the Dataset

from datasets import load_dataset
import numpy as np

# Load dataset
ds = load_dataset("HTW-KI-Werkstatt/SynthMT", split="train")

# Convert to NumPy
sample = ds[0]
img_array = np.array(sample["image"].convert("RGB"))  # (H, W, 3)
mask_stack = np.stack([np.array(m.convert("L")) for m in sample["mask"]], axis=0)  # (N, H, W)

Synthetic Data Generation

Parameter Optimization

Parameter Optimization Pipeline

Figure: Optimizing θ aligns synthetic image distributions with real, annotation-free microscopy data. Real IRM images (left) and synthetic images (center) are embedded using DINOv2. The parametric generator $P_\theta$ (right) creates images by sampling from distributions governing geometric properties (filament count, length, curvature) and imaging characteristics (PSF, noise, artifacts, contrast, distortions), all controlled by θ. An optimization loop iteratively refines θ by maximizing cosine similarity between real and synthetic embeddings, ensuring that synthetic images match the statistical properties and visual characteristics of experimental data.
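The alignment objective in this loop can be illustrated with a small NumPy sketch: given DINOv2 embeddings of a real and a synthetic image batch, the score to maximize is the mean pairwise cosine similarity between the two sets. The function and variable names below are illustrative, not the repository's API, and the embeddings are random stand-ins:

```python
import numpy as np

def embedding_similarity(real: np.ndarray, synth: np.ndarray) -> float:
    """Mean pairwise cosine similarity between two embedding sets of shape
    (n_images, dim). Higher values indicate better distribution alignment."""
    real = real / np.linalg.norm(real, axis=1, keepdims=True)
    synth = synth / np.linalg.norm(synth, axis=1, keepdims=True)
    return float(np.mean(real @ synth.T))

# Toy example with random stand-ins for DINOv2 embeddings
rng = np.random.default_rng(0)
real_emb = rng.normal(size=(8, 768))
score = embedding_similarity(real_emb, real_emb)  # identical sets score highly
```

In the actual pipeline this scalar would serve as the objective an optimizer such as Optuna maximizes while proposing new θ.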

Mathematical Framework

The generation pipeline follows a two-step stochastic process that produces synthetic images $I \sim P_\theta(I)$ conditioned on parameter set $\theta$:

Step 1: Microtubule Geometry

Each MT is modeled as a polyline with $n$ segments. Segment lengths are sampled from a Gaussian distribution, and curvature is introduced through stochastic evolution of bend angles using a Gamma distribution. This yields smoothly curved filaments that replicate real MT morphology.
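This geometry step can be sketched in a few lines of NumPy: segment lengths come from a Gaussian, and the heading angle evolves by Gamma-distributed increments with random sign, producing smooth stochastic curvature. Parameter names here are illustrative, not the repository's actual config keys:

```python
import numpy as np

def sample_polyline(n_segments: int = 20,
                    seg_len_mean: float = 5.0, seg_len_std: float = 1.0,
                    bend_shape: float = 2.0, bend_scale: float = 0.02,
                    rng=None) -> np.ndarray:
    """Sample a smoothly curved 2D polyline.

    Gaussian segment lengths; bend-angle increments drawn from a Gamma
    distribution with a random sign, then accumulated into a heading.
    Returns vertex coordinates of shape (n_segments + 1, 2).
    """
    rng = rng or np.random.default_rng()
    lengths = np.clip(rng.normal(seg_len_mean, seg_len_std, n_segments), 0.1, None)
    bends = rng.gamma(bend_shape, bend_scale, n_segments) * rng.choice([-1, 1], n_segments)
    angles = np.cumsum(bends)                        # heading evolves stochastically
    steps = np.stack([lengths * np.cos(angles),
                      lengths * np.sin(angles)], axis=1)
    return np.vstack([[0.0, 0.0], np.cumsum(steps, axis=0)])

pts = sample_polyline(rng=np.random.default_rng(42))
```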

Step 2: Image Rendering

  1. Physical Rendering: Binary masks are convolved with the Point Spread Function (PSF), scaled by contrast and background intensity
  2. Artifact Simulation: Distractor spots (circular, irregular structures) are added
  3. Noise Addition: Signal-dependent (Poisson) and signal-independent (Gaussian) noise
  4. Global Distortions: Vignetting, blur, and contrast variations
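A simplified stand-in for stages 1, 3, and 4 above (distractor-spot simulation from stage 2 is omitted for brevity) can be sketched with SciPy; all parameter values and names are illustrative, not the actual renderer:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def render(mask: np.ndarray, psf_sigma: float = 1.5,
           contrast: float = 0.4, background: float = 0.5,
           gaussian_noise: float = 0.02, photons: float = 500.0,
           rng=None) -> np.ndarray:
    """Render a binary MT mask into a noisy IRM-like intensity image in [0, 1]."""
    rng = rng or np.random.default_rng()
    # 1. Physical rendering: PSF blur, scaled by contrast against the background
    img = background - contrast * gaussian_filter(mask.astype(float), psf_sigma)
    # 3. Signal-dependent (Poisson) and signal-independent (Gaussian) noise
    img = rng.poisson(np.clip(img, 0, None) * photons) / photons
    img = img + rng.normal(0.0, gaussian_noise, img.shape)
    # 4. Clamp to a valid intensity range (global distortions omitted here)
    return np.clip(img, 0.0, 1.0)

mask = np.zeros((64, 64))
mask[32, 10:54] = 1  # one horizontal filament
img = render(mask, rng=np.random.default_rng(0))
```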

Command-Line Usage

# Generate a single video
python scripts/generate_synthetic_data.py \
    -c ./config/synthetic_config.json \
    -o ./data/generated \
    --count 1

# Generate 10 videos sequentially
python scripts/generate_synthetic_data.py \
    -c ./config/synthetic_config.json \
    -o ./data/generated \
    --count 10

CLI Arguments

| Argument | Shorthand | Required | Description |
| --- | --- | --- | --- |
| `--config <path>` | `-c` | Yes | Path to the JSON configuration file |
| `--output-dir <path>` | `-o` | Yes | Output directory for generated data |
| `--ids <id1> <id2>` | | No | Specific video IDs to generate |
| `--count <number>` | | No | Number of videos to generate |
| `--start-id <number>` | | No | Starting ID for sequential generation |
| `--save-config` | | No | Save configuration copy for reproducibility |

Configuration

The generation is controlled by a JSON configuration file. Parameters are grouped by their effect:

| Category | Description | Key Parameters |
| --- | --- | --- |
| Core Properties | Video dimensions and duration | `img_size`, `fps`, `num_frames` |
| MT Dynamics | Growth, shrinkage, catastrophe | `growth_speed`, `shrink_speed`, `catastrophe_prob`, `rescue_prob` |
| Filament Structure | Segment length and bending | `max_sum_segments`, `segment_length_*`, `max_angle` |
| Population | Number and placement of MTs | `num_microtubule`, `microtubule_min_dist` |
| Optics & PSF | Blur and sharpness | `psf_sigma_h`, `psf_sigma_v`, `global_blur_sigma` |
| Noise Model | Poisson and Gaussian noise | `quantum_efficiency`, `gaussian_noise` |
| Artifacts | Background particles | `fixed_spots`, `moving_spots`, `random_spots` |

See examples/synthetic_data_example.json for a complete configuration example.
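For illustration, a minimal configuration file could be assembled programmatically. The key names below are taken from the table above, the values are placeholders, and examples/synthetic_data_example.json remains the authoritative reference for the full schema:

```python
import json

# Illustrative subset of generation parameters; not a complete config.
cfg = {
    "img_size": 512,
    "fps": 5,
    "num_frames": 1,
    "num_microtubule": 30,
    "psf_sigma_h": 1.5,
    "psf_sigma_v": 1.5,
    "gaussian_noise": 0.02,
}

with open("my_synthetic_config.json", "w") as fh:
    json.dump(cfg, fh, indent=2)
```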

Testing

Some tests require access to the Hugging Face Hub. Set the environment variable:

export HUGGING_FACE_HUB_TOKEN=your_token_here

Or create a .env file:

HUGGING_FACE_HUB_TOKEN=your_token_here

Run tests:

pytest

Citation

If you use SynthMT in your research, please cite our paper:

@article{koddenbrock2026synthetic,
    author = {Koddenbrock, Mario and Westerhoff, Justus and Fachet, Dominik and Reber, Simone and Gers, Felix A. and Rodner, Erik},
    title = {Synthetic data enables human-grade microtubule analysis with foundation models for segmentation},
    elocation-id = {2026.01.09.698597},
    year = {2026},
    doi = {10.64898/2026.01.09.698597},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2026/01/12/2026.01.09.698597},
    eprint = {https://www.biorxiv.org/content/early/2026/01/12/2026.01.09.698597.full.pdf},
    journal = {bioRxiv}
}

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue.


🙏 Acknowledgements

Our work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project-ID 528483508 - FIP 12. We thank Dominik Fachet and Gil Henkin from the Reber lab for providing data, and the other study participants Moritz Becker, Nathaniel Boateng, and Miguel Aguilar. The Reber lab thanks the staff of the Advanced Medical Bioimaging Core Facility (Charité, Berlin) for imaging support and the Max Planck Society for funding. We also thank Kristian Hildebrand as well as Chaitanya A. Athale (IISER Pune, India) and his lab for helpful discussions.

