[feature] MegaScope Tensor Tracer #3606

Open

superay-a wants to merge 10 commits into NVIDIA:dev from MegatronAPPteam:ztr/megascope_tensor_tracer

Conversation

superay-a commented on Feb 26, 2026

What does this PR do?

This PR adds an experimental Tensor Tracer (MegaScope) to Megatron-LM (target branch: dev) to stream selected intermediate
tensors during training/evaluation to an external client (UI or script) over WebSockets for live visualization /
debugging.

Highlights:

  • Off by default; enabled with --tensor-tracer-port <port>.
  • Optional dependency via pip install -e '.[tensor_tracer]' (only required when the tracer is enabled).
  • Forward-step-only tracing (TTFlags.should_trace is enabled only around the forward step).
  • Supports multiple compressors to keep payload sizes manageable (tiling reductions, projection onto a vector, etc.).
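
The forward-step-only gating can be illustrated with a minimal sketch. The names TTFlags/should_trace come from this PR's description; the bodies below are illustrative only and are not the PR's implementation in megatron/core/tensor_tracer.py:

```python
import torch

# Minimal sketch of flag-gated forward hooks (illustrative, not the PR's code).
class TTFlags:
    should_trace = False  # flipped on only around the forward step

captured = []

def trace_hook(module, inputs, output):
    # Hooks stay registered, but only record while tracing is enabled,
    # so the tracer is inert outside the forward step.
    if TTFlags.should_trace:
        captured.append(output.detach())

layer = torch.nn.Linear(4, 4)
layer.register_forward_hook(trace_hook)

layer(torch.randn(2, 4))        # flag off: nothing captured
TTFlags.should_trace = True
layer(torch.randn(2, 4))        # flag on: one tensor captured
TTFlags.should_trace = False
```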

This PR intentionally keeps the tracer narrowly scoped to a GPT-style model wrapper (see
TTHookManager).

Why is this useful?

When training/fine-tuning large models, it can be hard to pinpoint where issues originate (NaNs/divergence, unstable
layers, saturation, representation collapse, emerging features, etc.). Tensor Tracer makes it possible to:

  • Select specific trace points (by FlagType) to observe.
  • Compress payloads before sending to the client (to reduce bandwidth and CPU overhead).
  • Collect activations across tensor-parallel ranks and produce aggregated per-layer signals.
  • View traces live during training in a separate UI (see this repo for an example).

Demonstrated case (persona-vector projection monitoring)

As a practical demonstration, this tracer can be used to monitor projections of per-token hidden states onto a
pre-computed persona vector (paper) during fine-tuning. In our internal run (Llama3-8B-Instruct + an emergent-misalignment
related dataset), the per-layer projection signal shows an overall increasing trend in mid/deep layers across training
steps.

High-level workflow:

  1. Fine-tune a model (e.g., Llama3-8B-Instruct) on a dataset of interest (e.g., an emergent-misalignment related dataset risky_financial_advice) with the tracer enabled.
  2. Periodically run an evaluation forward pass (via the normal Megatron evaluation loop).
  3. Enable HiddenStates tracing with ProjectionCompressor, pointing at a torch-saved vector file shaped like
    [num_layers, hidden_size] which contains the persona vector across layers (e.g., evil persona vector).
  4. Aggregate the projected scalar values in your frontend / post-processing script and visualize per-layer trends.
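
The projection in step 3 can be sketched as follows. This is a toy illustration, not the PR's ProjectionCompressor; shapes follow the [num_layers, hidden_size] convention above:

```python
import torch

# Toy sketch of per-token persona projection (not the PR's implementation).
num_layers, seq_len, hidden_size = 4, 6, 8
hidden_states = torch.randn(num_layers, seq_len, hidden_size)  # per-layer activations
persona = torch.randn(num_layers, hidden_size)                 # stand-in for the torch-saved vector file
persona = persona / persona.norm(dim=-1, keepdim=True)         # unit direction per layer

# One scalar per token per layer: dot product with that layer's persona vector.
proj = torch.einsum("lsh,lh->ls", hidden_states, persona)      # [num_layers, seq_len]
per_layer_signal = proj.mean(dim=-1)                           # aggregate over tokens
```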

We observe that the persona projection signal tends to increase in mid/deep layers during fine-tuning on the emergent-misalignment dataset. This is consistent with the hypothesis that the model learns to represent the risky persona more strongly in those layers as fine-tuning progresses (see docs/api-guide/tensor_tracer.md for a detailed walkthrough of this example).

Note: exact trends may depend on model/data/hyperparameters and are included here as a motivating example for the tracing
feature (not as a claim of generality).

Key changes

  • megatron/core/tensor_tracer.py
    • TTFlags configuration and forward hook management (TTHookManager).
    • Compressor framework: TileCompressor, NoOpCompressor, EmptyCompressor, ProjectionCompressor.
    • Adds InputTokens trace point to report (input_ids, position_ids) for token-level indexing/debugging.
  • megatron/training/training_wsserver.py
    • Rank 0 hub server; worker client processes for non-rank0 senders.
  • megatron/training/arguments.py
    • Adds --tensor-tracer-port.
  • megatron/core/pipeline_parallel/schedules.py
    • Enables tracing only around the forward step.
  • tests/unit_tests/test_tensor_tracer.py
    • Unit tests for compressors + TTFlags.set_by_configs behavior.
  • docs/api-guide/tensor_tracer.md
    • Protocol, schema, and usage notes (including the persona-vector projection monitoring example).

How to use

  1. Install optional dependency:
    • pip install -e '.[tensor_tracer]'
  2. Launch training/eval with tracing enabled (port is arbitrary):
    • ... --tensor-tracer-port 8765
  3. Connect from your client/UI:
    • ws://<rank0-host>:8765
  4. Send a run_training_step JSON message to provide:
    • visualization_flags: which tensors to trace (by FlagType name).
    • compressor_config: per-flag compressor settings.
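
A minimal client message might look like this. The top-level field names (visualization_flags, compressor_config) come from the description above; the per-compressor option keys shown here are hypothetical — consult docs/api-guide/tensor_tracer.md for the actual schema:

```python
import json

# Hypothetical run_training_step message; top-level field names follow this
# PR's description, the compressor option keys are illustrative only.
message = {
    "type": "run_training_step",
    "visualization_flags": ["InputTokens", "HiddenStates"],
    "compressor_config": {
        "HiddenStates": {
            "compressor": "ProjectionCompressor",
            "vector_path": "persona_vector.pt",  # [num_layers, hidden_size]
        },
        "InputTokens": {"compressor": "NoOpCompressor"},
    },
}
payload = json.dumps(message)  # send over ws://<rank0-host>:8765
```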

Testing

Local checks:

  • tools/autoformat.sh
  • pytest -q tests/unit_tests/test_tensor_tracer.py

Notes / scope

  • The tracer is designed for monitoring/visualization and has zero overhead when disabled.
  • TileCompressor evaluates a reduction expression and ProjectionCompressor loads a vector with torch.load;
    both can execute code contained in those inputs, so treat tracer configs/artifacts as trusted inputs.
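
For intuition, a tiling reduction in the spirit of TileCompressor could look like the helper below. This is a hypothetical sketch, not the PR's API (the actual TileCompressor evaluates a user-supplied reduction expression):

```python
import torch

# Hypothetical tile reduction: shrink the last dimension by averaging
# fixed-size tiles, trading resolution for payload size.
def tile_reduce(x: torch.Tensor, tile: int) -> torch.Tensor:
    lead = x.shape[:-1]
    n = x.shape[-1] // tile            # whole tiles only; the remainder is dropped
    return x[..., : n * tile].reshape(*lead, n, tile).mean(dim=-1)
```

For example, an 8192-wide hidden state reduced with tile=64 ships 128 scalars per token instead of 8192.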

Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run tools/autoformat.sh on my PR

Contributors

Tingrui Zhang (zhang-tr22@mails.tsinghua.edu.cn)
Shuo Chen (s-chen25@mails.tsinghua.edu.cn)
Wei Xu (weixu@tsinghua.edu.cn)
Tsinghua University

Thank you for reviewing!

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@superay-a superay-a requested review from a team as code owners February 26, 2026 04:49

copy-pr-bot bot commented Feb 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jennifer88huang

Hi @sbhavani Santosh, could you please help review the PR? If there is any advice, feel free to comment.
