Catch PyTorch training slowdowns early, while the job is still running.
Quickstart • Compare Runs • How to Read Output • FAQ • Use with W&B / MLflow • Issues
TraceML is an open-source tool for catching PyTorch training slowdowns early, so bad runs do not quietly waste costly compute.
It gives you lightweight step-level signals while the job is still running, so you can quickly tell whether the slowdown looks input-bound, compute-bound, wait-heavy, imbalanced across ranks, or memory-related.
Use TraceML when you want a fast answer before reaching for a heavyweight profiler.
⭐ If TraceML helps you, please consider starring the repo.
Upcoming rename: TraceML will transition to TraceOpt in a future release. For now, the active package remains `traceml-ai` and Python imports remain `traceml`. The future PyPI package name `traceopt-ai` is now in place as we prepare the migration.
## Quickstart

Install:

```bash
pip install traceml-ai
```

Wrap your training step:

```python
import traceml

for batch in dataloader:
    with traceml.trace_step(model):
        optimizer.zero_grad(set_to_none=True)
        outputs = model(batch["x"])
        loss = criterion(outputs, batch["y"])
        loss.backward()
        optimizer.step()
```

Run:

```bash
traceml run train.py
```

During training, TraceML opens a live terminal view alongside your logs.
At the end of the run, it prints a compact summary you can review or share.
Start with `traceml run train.py`. Most users do not need `watch` or `deep` first.

Use the default workflow when you want live step-aware diagnosis during training plus the end-of-run summary:

```bash
traceml run train.py
```

Use summary mode when you mainly want the structured final summary for logging into W&B or MLflow:

```bash
traceml run train.py --mode=summary
```

Then call `traceml.final_summary()` near the end of your script.
TraceML also writes canonical summary artifacts for the run, including final_summary.json, which is the intended machine-readable output for downstream logging and later run comparison.
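As a sketch of that forwarding step, the helper below loads a run's `final_summary.json` and logs every numeric field to MLflow. The summary schema and file path are assumptions here, so the code flattens whatever nested fields it finds rather than relying on specific names; adapt it to the fields your summary actually contains.

```python
import json


def flatten_metrics(obj, prefix=""):
    """Flatten a nested dict into dotted metric names, keeping numeric leaves only."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_metrics(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = float(value)
    return flat


def log_summary_to_mlflow(path):
    """Load a TraceML final_summary.json and log its numeric fields to MLflow."""
    import mlflow  # or wandb: wandb.log(flatten_metrics(...))

    with open(path) as f:
        metrics = flatten_metrics(json.load(f))
    with mlflow.start_run():
        for name, value in metrics.items():
            mlflow.log_metric(name, value)


# Example (assumes the file exists and mlflow is installed):
# log_summary_to_mlflow("final_summary.json")
```

The same flattened dict can be passed to `wandb.log()` in one call if you use W&B instead.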
## Compare Runs

If you have `final_summary.json` from two runs, compare them directly:

```bash
traceml compare run_a.json run_b.json
```

TraceML writes both a structured compare JSON and a compact text report.
See docs/compare.md.
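For a sense of what run-to-run comparison involves, the toy sketch below diffs two flattened metric dicts by relative change. This is an illustration only, not the logic `traceml compare` actually uses, and the metric names are hypothetical.

```python
def diff_metrics(base, other):
    """Relative change (other - base) / base for metrics present in both runs."""
    changes = {}
    for name in sorted(base.keys() & other.keys()):
        if base[name] != 0:
            changes[name] = (other[name] - base[name]) / base[name]
    return changes


# Hypothetical per-phase step times (ms) from two runs:
run_a = {"step.p50_ms": 100.0, "dataloader.p50_ms": 10.0}
run_b = {"step.p50_ms": 130.0, "dataloader.p50_ms": 42.0}
print(diff_metrics(run_a, run_b))
# A large positive dataloader delta alongside a slower step points at the input pipeline.
```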
TraceML is currently strongest at surfacing:
- step-time slowdowns while training is still running
- whether the pattern looks input-bound, compute-bound, or wait-heavy
- whether work is uneven across distributed ranks
- whether memory is drifting upward over time
- where time is showing up across dataloader, forward, backward, and optimizer phases
It is designed to help you decide quickly whether a run looks healthy or whether it is worth digging deeper.
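To make those labels concrete, here is a toy heuristic that classifies a single step from per-phase times. The phase names and thresholds are illustrative assumptions, not TraceML's actual rules.

```python
def classify_step(phase_ms):
    """Roughly label one step from per-phase times in milliseconds.

    The keys (dataloader, forward, backward, optimizer, wait) and the
    thresholds are illustrative, not TraceML's real classification logic.
    """
    total = sum(phase_ms.values()) or 1.0
    shares = {name: t / total for name, t in phase_ms.items()}
    if shares.get("dataloader", 0.0) > 0.4:
        return "input-bound"    # GPU starved waiting on the data pipeline
    if shares.get("wait", 0.0) > 0.3:
        return "wait-heavy"     # e.g. collective sync or host-side stalls
    return "compute-bound"      # forward/backward/optimizer dominate


# Example: a step dominated by the dataloader looks input-bound.
print(classify_step({"dataloader": 80.0, "forward": 10.0,
                     "backward": 8.0, "optimizer": 2.0}))  # -> input-bound
```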
Use TraceML when training feels:
- slower than expected
- unstable from step to step
- imbalanced across distributed ranks
- fine in dashboards but still underperforming
Start with TraceML when you need a fast answer in the terminal.
Reach for torch.profiler once you know where to dig deeper.
TraceML is designed to work alongside tools like W&B, MLflow, and TensorBoard.
Use those for:
- experiment tracking
- artifacts
- dashboards
- team reporting
Use TraceML for:
- bottleneck diagnosis while a run is still in progress
- spotting throughput drift during a run
- checking for rank imbalance or straggler patterns
- checking for memory creep or pressure signals
- structured final summaries you can forward into W&B or MLflow
- simple run-to-run comparison from saved TraceML summary JSON files
See Use TraceML with W&B / MLflow.
Works today:
- single GPU
- single-node DDP/FSDP
Not yet:
- multi-node
- tensor parallel
- pipeline parallel
- Quickstart
- Compare Runs
- Examples
- How to Read TraceML Output
- FAQ
- Use TraceML with W&B / MLflow
- Hugging Face integration: docs/huggingface.md
- PyTorch Lightning integration: docs/lightning.md
Need a lighter zero-code first look or a deeper follow-up run? See the Quickstart and FAQ for `watch` and `deep`.
If TraceML helped you catch a slowdown, please open an issue and include:
- hardware / CUDA / PyTorch versions
- single GPU or multi-GPU
- whether you used `run`, `watch`, or `deep`
- the end-of-run summary
- a minimal repro if possible
GitHub issues: https://github.com/traceopt-ai/traceml/issues
Email: support@traceopt.ai
Contributions are welcome, especially:
- reproducible slowdown cases
- bug reports
- docs improvements
- integrations
- examples
Apache 2.0. See LICENSE.

