Hi @sbhavani Santosh, could you please help review the PR? If there is any advice, feel free to comment.
What does this PR do?
This PR adds an experimental Tensor Tracer (MegaScope) to Megatron-LM (target branch: dev) to stream selected intermediate tensors during training/evaluation to an external client (UI or script) over WebSockets for live visualization/debugging.
Highlights:
- Enabled with `--tensor-tracer-port <port>`.
- Extra dependencies via `pip install -e '.[tensor_tracer]'` (only required when the tracer is enabled).
- Tracing is gated by `TTFlags.should_trace` (enabled only around the forward step).

This PR intentionally keeps the tracer narrowly scoped to a GPT-style model wrapper (see `TTHookManager`).

Why is this useful?
When training/fine-tuning large models, it can be hard to pinpoint where issues originate (NaNs/divergence, unstable layers, saturation, representation collapse, emerging features, etc.). Tensor Tracer makes it possible to:
- Select which tensors (`FlagType`) to observe.

Demonstrated case (persona-vector projection monitoring)
As a practical demonstration, this tracer can be used to monitor projections of per-token hidden states onto a pre-computed persona vector (paper) during fine-tuning. In our internal run (Llama3-8B-Instruct + an emergent-misalignment-related dataset), the per-layer projection signal shows an overall increasing trend in mid/deep layers across training steps.

High-level workflow:
- Fine-tune on a dataset (e.g., `risky_financial_advice`) with the tracer enabled.
- Enable `HiddenStates` tracing with `ProjectionCompressor`, pointing at a torch-saved vector file shaped like `[num_layers, hidden_size]` which contains the persona vector across layers (e.g., evil persona vector).

We observe that the persona projection signal tends to increase in mid/deep layers during fine-tuning on the emergent-misalignment dataset, which is consistent with the hypothesis that the model is learning to represent the risky persona more strongly in those layers as it fine-tunes (see `docs/api-guide/tensor_tracer.md` for a more detailed walkthrough of this example).

Note: exact trends may depend on model/data/hyperparameters and are included here as a motivating example for the tracing feature (not as a claim of generality).
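The monitored quantity is just a per-token dot product: given hidden states of shape `[seq_len, hidden_size]` and one layer's persona vector, each token's score is its hidden state projected onto the (normalized) persona direction. A minimal pure-Python sketch of that math — the actual tracer operates on torch tensors inside `ProjectionCompressor`, and the helper name here is illustrative:

```python
import math

def persona_projection(hidden_states, persona_vec):
    """Project each token's hidden state onto a normalized persona vector.

    hidden_states: per-token vectors, shape [seq_len][hidden_size]
    persona_vec:   one layer's persona vector, shape [hidden_size]
    Returns one scalar projection per token.
    """
    norm = math.sqrt(sum(x * x for x in persona_vec))
    return [sum(h * v for h, v in zip(tok, persona_vec)) / norm
            for tok in hidden_states]

# Toy example: two tokens in a 3-dim hidden space.
H = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0]]
v = [0.0, 1.0, 0.0]  # persona direction
print(persona_projection(H, v))  # [0.0, 2.0]
```

Tracking these scalars per layer over training steps yields the increasing mid/deep-layer trend described above.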
Key changes
- `megatron/core/tensor_tracer.py`: `TTFlags` configuration and forward hook management (`TTHookManager`); compressors `TileCompressor`, `NoOpCompressor`, `EmptyCompressor`, `ProjectionCompressor`; an `InputTokens` trace point to report `(input_ids, position_ids)` for token-level indexing/debugging.
- `megatron/training/training_wsserver.py`
- `megatron/training/arguments.py`: `--tensor-tracer-port`.
- `megatron/core/pipeline_parallel/schedules.py`
- `tests/unit_tests/test_tensor_tracer.py`: `TTFlags.set_by_configs` behavior.
- `docs/api-guide/tensor_tracer.md`

How to use
1. `pip install -e '.[tensor_tracer]'`
2. Launch training with `... --tensor-tracer-port 8765`.
3. Connect a client to `ws://<rank0-host>:8765`.
4. Send `run_training_step` as a JSON message to provide:
   - `visualization_flags`: which tensors to trace (by `FlagType` name).
   - `compressor_config`: per-flag compressor settings.

Testing
Local checks:
- `tools/autoformat.sh`
- `pytest -q tests/unit_tests/test_tensor_tracer.py`

Notes / scope
- `TileCompressor` evaluates a reduction expression and `ProjectionCompressor` loads a vector with `torch.load`: treat tracer configs/artifacts as trusted inputs.
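On the client side, the control message from the "How to use" steps is plain JSON. A minimal sketch of assembling it — the `command` key and config field layout are illustrative assumptions, not the PR's exact schema (see `docs/api-guide/tensor_tracer.md` for the real format), and per the note above the config should come from a trusted source:

```python
import json

def build_trace_request(flags, compressors):
    """Assemble a JSON control message for the tracer WebSocket server.

    flags:       list of FlagType names to trace, e.g. ["HiddenStates"]
    compressors: per-flag compressor settings (trusted input only)
    """
    return json.dumps({
        "command": "run_training_step",    # illustrative command field
        "visualization_flags": flags,      # which tensors to trace
        "compressor_config": compressors,  # per-flag compressor settings
    })

# Example: request hidden-state tracing with a projection compressor
# pointing at a torch-saved [num_layers, hidden_size] persona vector.
msg = build_trace_request(
    ["HiddenStates"],
    {"HiddenStates": {"type": "ProjectionCompressor",
                      "vector_path": "persona_vector.pt"}},
)
# A client could then send it, e.g. with the `websockets` package:
#   async with websockets.connect("ws://<rank0-host>:8765") as ws:
#       await ws.send(msg)
```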
Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks
Core 0.8)

Contributors
Tingrui Zhang (zhang-tr22@mails.tsinghua.edu.cn)
Shuo Chen (s-chen25@mails.tsinghua.edu.cn)
Wei Xu (weixu@tsinghua.edu.cn)
Tsinghua University
Thank you for reviewing!
Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch
Feel free to message or tag @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
(Step 1): Add PR label
`Expert Review`

(Step 2): Collect the expert reviewers' reviews
Add the `Expert Review` label when your PR is ready for review. Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review
Add the `Final Review` label.

(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.