@le1nux le1nux commented Nov 12, 2025

What does this PR do?

This PR introduces the initial profiling infrastructure for the Modalities codebase: a self-contained tutorial, scripts for single-process and distributed profiling runs, example configurations, and documentation (README) to guide performance and memory analysis. It lays the groundwork for profiling custom steppable components (e.g., norm layers) and multi-GPU model executions.
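For distributed runs, only a subset of ranks needs to carry the profiling overhead. A minimal sketch of how rank-selective profiling can work under `torchrun` (assumption: `should_profile` is a hypothetical helper for illustration, not the actual Modalities API; it only relies on the `RANK` environment variable that `torchrun` exports for each worker):

```python
import os

def should_profile(profiled_ranks: list[int]) -> bool:
    """Return True if this worker process should attach the profiler.

    torchrun exports RANK for every worker; single-process runs fall
    back to rank 0. (Hypothetical helper illustrating the idea behind
    a `profiled_ranks` config option.)
    """
    rank = int(os.environ.get("RANK", "0"))
    return rank in profiled_ranks

# Example: profile only ranks 0 and 4 of an 8-GPU job.
os.environ["RANK"] = "4"
print(should_profile([0, 4]))  # True for this rank
```

Gating on rank keeps trace files small and avoids paying the profiler's overhead on every worker of a large job.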

General Changes

  • Adds tutorials/profiling/ directory with:
    • Configs (distributed_8B_model_profiling.yaml, single_process_rms_norm_profiling.yaml, small_profiling_config.yaml) illustrating profiling parameterization and model/data setup.
    • Single-process scripts (single_process_norm_profiling.py, single_process_profiler_starter.sh) demonstrating custom component registration (SteppableNorm) and measurement loop (warmup/wait/measurement semantics).
    • Distributed scripts (run_distributed_model_profiling.py, distributed_profiler_starter.sh) showing torchrun launch and selective rank profiling via profiled_ranks.
    • A comprehensive README.md explaining directory layout, concepts (warmup, wait, measurement, profiled_ranks), artifact formats (summary TXT, trace JSON, memory HTML), customization, troubleshooting, and extension patterns.
  • Adds profiler utilities (ModalitiesProfilerStarter, CustomComponentRegisterable, SteppableComponentIF) to demonstrate extensible profiling flows.
  • Establishes experiment output convention: timestamp + hash directories containing profiler summaries, traces, memory reports, and a copy of the original config for reproducibility.
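The warmup/wait/measurement semantics used by the measurement loop can be sketched as follows (a minimal, torch-free illustration assuming the same three-phase meaning as `torch.profiler.schedule`; `Phase` and `phase_for_step` are hypothetical names, not part of the PR's API):

```python
from enum import Enum

class Phase(Enum):
    WAIT = "wait"        # steps executed but not recorded
    WARMUP = "warmup"    # profiler attached, results discarded
    MEASURE = "measure"  # steps that end up in the trace/summary

def phase_for_step(step: int, wait: int, warmup: int) -> Phase:
    """Map a 0-based step index onto its profiling phase.

    The loop would run for wait + warmup + measurement steps in total;
    every step past the warmup boundary is measured.
    """
    if step < wait:
        return Phase.WAIT
    if step < wait + warmup:
        return Phase.WARMUP
    return Phase.MEASURE

# With wait=2 and warmup=1, the first five steps are:
# [WAIT, WAIT, WARMUP, MEASURE, MEASURE]
```

Separating wait and warmup lets early, unrepresentative iterations (allocator growth, kernel autotuning) fall outside the measured window.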

Breaking Changes

  • None. All additions are new; no public APIs were removed or changed. Existing training/evaluation behavior is unaffected unless profiling scripts are explicitly invoked.

NOTE:
Two unit tests fail, most likely due to the pytorch-nightly dependency:

FAILED tests/end2end_tests/test_fsdp2_warmstart_pp_tp.py::TestWarmstart::test_warm_start[gpt2_train_num_steps_7_pp_tp.yaml-gpt2_warm_start_from_step_4_fsdp2.yaml-8-2] - AssertionError: Child process 0 raised an exception:
FAILED tests/end2end_tests/test_fsdp2_warmstart_pp_tp.py::TestWarmstart::test_warm_start[gpt2_train_num_steps_7_pp_tp.yaml-gpt2_warm_start_from_step_4_grad_accu.yaml-8-1] - AssertionError: Child process 0 raised an exception:

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux marked this pull request as ready for review November 13, 2025 15:10
@le1nux le1nux changed the title from "feat: drafted first profiling setup" to "Distributed and single process profiling / tracing" Nov 13, 2025