@le1nux le1nux commented Nov 12, 2025

What does this PR do?

This PR introduces the initial profiling infrastructure for the Modalities codebase: a self-contained tutorial, scripts for single-process and distributed profiling runs, example configurations, and documentation (README) to guide performance and memory analysis. It lays the groundwork for profiling custom steppable components (e.g., norm layers) and multi-GPU model executions.
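For distributed runs, only a subset of ranks needs to carry the profiling overhead. A minimal sketch of how rank-selective profiling can work under `torchrun` (assumption: `should_profile` is a hypothetical helper for illustration, not the actual Modalities API; it only relies on the `RANK` environment variable that `torchrun` exports for each worker):

```python
import os

def should_profile(profiled_ranks: list[int]) -> bool:
    """Return True if this worker process should attach the profiler.

    torchrun exports RANK for every worker; single-process runs fall
    back to rank 0. (Hypothetical helper illustrating the idea behind
    a `profiled_ranks` config option.)
    """
    rank = int(os.environ.get("RANK", "0"))
    return rank in profiled_ranks

# Example: profile only ranks 0 and 4 of an 8-GPU job.
os.environ["RANK"] = "4"
print(should_profile([0, 4]))  # True for this rank
```

Gating on rank keeps trace files small and avoids paying the profiler's overhead on every worker of a large job.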

General Changes

  • Adds tutorials/profiling/ directory with:
    • Configs (distributed_8B_model_profiling.yaml, single_process_rms_norm_profiling.yaml, small_profiling_config.yaml) illustrating profiling parameterization and model/data setup.
    • Single-process scripts (single_process_norm_profiling.py, single_process_profiler_starter.sh) demonstrating custom component registration (SteppableNorm) and measurement loop (warmup/wait/measurement semantics).
    • Distributed scripts (run_distributed_model_profiling.py, distributed_profiler_starter.sh) showing torchrun launch and selective rank profiling via profiled_ranks.
    • A comprehensive README.md explaining directory layout, concepts (warmup, wait, measurement, profiled_ranks), artifact formats (summary TXT, trace JSON, memory HTML), customization, troubleshooting, and extension patterns.
  • Adds profiler utilities (ModalitiesProfilerStarter, CustomComponentRegisterable, SteppableComponentIF) to demonstrate extensible profiling flows.
  • Establishes experiment output convention: timestamp + hash directories containing profiler summaries, traces, memory reports, and a copy of the original config for reproducibility.
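The warmup/wait/measurement semantics used by the measurement loop can be sketched as follows (a minimal, torch-free illustration assuming the same three-phase meaning as `torch.profiler.schedule`; `Phase` and `phase_for_step` are hypothetical names, not part of the PR's API):

```python
from enum import Enum

class Phase(Enum):
    WAIT = "wait"        # steps executed but not recorded
    WARMUP = "warmup"    # profiler attached, results discarded
    MEASURE = "measure"  # steps that end up in the trace/summary

def phase_for_step(step: int, wait: int, warmup: int) -> Phase:
    """Map a 0-based step index onto its profiling phase.

    The loop would run for wait + warmup + measurement steps in total;
    every step past the warmup boundary is measured.
    """
    if step < wait:
        return Phase.WAIT
    if step < wait + warmup:
        return Phase.WARMUP
    return Phase.MEASURE

# With wait=2 and warmup=1, the first five steps are:
# [WAIT, WAIT, WARMUP, MEASURE, MEASURE]
```

Separating wait and warmup lets early, unrepresentative iterations (allocator growth, kernel autotuning) fall outside the measured window.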

Breaking Changes

  • None. All additions are new; no public APIs were removed or changed. Existing training/evaluation behavior is unaffected unless profiling scripts are explicitly invoked.

NOTE:
Two unit tests fail, most likely due to the pytorch-nightly dependency:

FAILED tests/end2end_tests/test_fsdp2_warmstart_pp_tp.py::TestWarmstart::test_warm_start[gpt2_train_num_steps_7_pp_tp.yaml-gpt2_warm_start_from_step_4_fsdp2.yaml-8-2] - AssertionError: Child process 0 raised an exception:
FAILED tests/end2end_tests/test_fsdp2_warmstart_pp_tp.py::TestWarmstart::test_warm_start[gpt2_train_num_steps_7_pp_tp.yaml-gpt2_warm_start_from_step_4_grad_accu.yaml-8-1] - AssertionError: Child process 0 raised an exception:

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux marked this pull request as ready for review November 13, 2025 15:10
@le1nux le1nux changed the title from "feat: drafted first profiling setup" to "Distributed and single process profiling / tracing" Nov 13, 2025