@SageMoore SageMoore commented Oct 29, 2025

Purpose

When num_tokens is near the dbo_decode_token_threshold, different ranks may make different microbatching decisions (some land above the threshold, some below). DBO only works when all ranks agree, so in these mixed cases every rank falls back to non-DBO execution. To avoid running that fallback without cudagraphs, this PR adds logic to capture cudagraphs for both microbatching modes.
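To make the mixed-decision case concrete, here is a minimal sketch of the agreement rule described above. It is not the code in this PR: `GraphMode`, `wants_dbo`, and `agreed_mode` are hypothetical names, and treating "at or above the threshold" as the DBO condition is an assumption.

```python
from enum import Enum


class GraphMode(Enum):
    """Which family of captured cudagraphs a step would replay (hypothetical)."""
    DBO = "dbo"          # microbatched execution
    NON_DBO = "non_dbo"  # regular, non-microbatched execution


def wants_dbo(num_tokens: int, threshold: int) -> bool:
    # Assumption: a rank opts into DBO when it is at or above the threshold.
    return num_tokens >= threshold


def agreed_mode(tokens_per_rank: list[int], threshold: int) -> GraphMode:
    # DBO is used only if every rank votes for it; a single rank below
    # the threshold sends all ranks down the non-DBO path.
    votes = [wants_dbo(n, threshold) for n in tokens_per_rank]
    return GraphMode.DBO if all(votes) else GraphMode.NON_DBO


if __name__ == "__main__":
    print(agreed_mode([25, 30], threshold=26))  # one rank below the threshold
    print(agreed_mode([28, 30], threshold=26))  # both ranks above the threshold
```

With a threshold of 26, ranks holding 25 and 30 tokens disagree, so both would run non-DBO. Because graphs are captured for both modes, that step can still replay a cudagraph.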

Size before
Graph capturing finished in 33 secs, took 2.46 GiB

Size after
Graph capturing finished in 35 secs, took 2.52 GiB

Test Plan

To test, I ran lm_eval with DeepSeek V2 Lite, DP=2, and dbo-decode-threshold=26. Ranks usually get 25-30 tokens in this scenario, so setting the threshold at 26 ensures some ranks will be above it and some below, triggering the mixed-decision scenario. I added logging to the code and verified that we now run with non-DBO cudagraphs when one rank has 25 tokens. I've also included the lm_eval results.

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3733|±  |0.0280|
|     |       |strict-match    |     5|exact_match|↑  |0.3700|±  |0.0279|

Signed-off-by: Sage Moore <sage@neuralmagic.com>
@mergify mergify bot added the v1 label Oct 29, 2025
@SageMoore SageMoore marked this pull request as ready for review October 30, 2025 18:41