
Conversation


@toulzx toulzx commented Oct 29, 2025

run with:

# vllm==0.10.2
# benchmark_moe.py is from https://github.com/vllm-project/vllm/blob/v0.10.2/benchmarks/kernels/benchmark_moe.py
uv run benchmark_moe.py --model Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4  --tune  --trust-remote-code
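
For context, once tuning completes the script writes the winning configs to a JSON file in the current directory. A minimal sketch of the follow-up step, with an illustrative filename (the destination is the directory where vLLM's fused MoE configs live in the source tree):

# Copy the tuned config into vLLM's fused MoE config directory so it is
# picked up at runtime (filename shown for illustration only):
cp "E=512,N=128,device_name=NVIDIA_A100-SXM4-80GB.json" \
   vllm/model_executor/layers/fused_moe/configs/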

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small but essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the qwen (Related to Qwen models) label on Oct 29, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a tuned MoE kernel configuration for the Qwen3-Next model on A100 GPUs with TP=4. While the intent is to improve performance, a critical issue in the naming of the configuration file will prevent it from being used. The N=128 parameter in the filename implies an intermediate_size of 512, which is incorrect for a model of this size. This must be corrected for the performance benefits to be realized.

@@ -0,0 +1,146 @@
{
Contributor

critical

The filename E=512,N=128,device_name=NVIDIA_A100-SXM4-80GB.json is incorrect. The N=128 parameter is derived from intermediate_size / tp_size. With tp_size=4 as specified in the pull request, this implies intermediate_size = 128 * 4 = 512. This value is exceptionally small for a model of Qwen3-Next-80B-A3B-Instruct's scale.

An incorrect filename will prevent vLLM from loading this tuned configuration at runtime. The system will fall back to default, suboptimal settings, defeating the purpose of this contribution.

The moe_intermediate_size from the model's HuggingFace configuration should be used to generate the filename: the correct N value is moe_intermediate_size / tp_size. The file must be renamed accordingly for the system to use it.
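
For illustration, a minimal sketch of that derivation run from the shell (the attribute names num_experts and moe_intermediate_size follow the HuggingFace Qwen3 MoE convention and should be verified against the actual model config):

# Hypothetical check: print the expected tuned-config filename for TP=4
python3 -c "
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained('Qwen/Qwen3-Next-80B-A3B-Instruct', trust_remote_code=True)
tp_size = 4
print(f'E={cfg.num_experts},N={cfg.moe_intermediate_size // tp_size},device_name=NVIDIA_A100-SXM4-80GB.json')
"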

@jeejeelee
Collaborator

Could you tune using the latest main branch? On one hand, the tuned config will include Triton version information; on the other, the tuning results might differ with the latest Triton version.
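
For context, a tuned MoE config is a JSON file mapping batch sizes to Triton kernel parameters. A hypothetical excerpt is sketched below; the triton_version field is an assumption based on the comment above, and all values are illustrative rather than actual tuned results:

{
  "triton_version": "3.4.0",
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 4
  }
}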

@toulzx
Author

toulzx commented Oct 30, 2025

OK. I'll rerun it with vLLM==0.11.0 when my machine is less busy.

---
vllm 0.10.2
`uv run benchmark_moe.py --model Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4  --tune  --trust-remote-code`
---
Signed-off-by: tou <57480529+toulzx@users.noreply.github.com>
@toulzx toulzx force-pushed the qwen3-next_moe-config_a100-sxm4-80gb branch from e4ad3eb to 7572b7a on October 30, 2025 08:44
---
vllm 0.11.0
`python3  benchmark_moe.py  --model Qwen/Qwen3-Next-80B-A3B-Instruct  -tp 8  --dtype  auto   --tune  --trust-remote-code`
---
Signed-off-by: tou <57480529+toulzx@users.noreply.github.com>
@toulzx
Author

toulzx commented Oct 31, 2025

I have added A100-SXM4-80GB TP4 and TP8 configs for Qwen3-Next-80B-A3B-Instruct.

I failed to run the script for H100_80GB_HBM3 TP8 with Qwen3-Next-80B-A3B-Instruct and H100_80GB_HBM3 TP4 with Qwen3-Next-80B-A3B-Instruct-FP8, because it unexpectedly got stuck while starting a local Ray instance.

---
vllm 0.11.0
`python3  benchmark_moe.py  --model Qwen/Qwen3-Next-80B-A3B-Instruct  -tp 4  --dtype  auto   --tune  --trust-remote-code`
---
Signed-off-by: tou <57480529+toulzx@users.noreply.github.com>
@toulzx toulzx changed the title from [Qwen3-Next] MOE config for A100-SXM4-80GB TP4 to [Qwen3-Next] MOE config for A100-SXM4-80GB TP4 TP8 on Oct 31, 2025
@toulzx toulzx changed the title from [Qwen3-Next] MOE config for A100-SXM4-80GB TP4 TP8 to [Qwen3-Next] MOE configs for A100-SXM4-80GB TP4 TP8 on Oct 31, 2025
@toulzx
Author

toulzx commented Oct 31, 2025

@jeejeelee The Qwen3-Next tuned MoE configs for A100, generated with vllm==0.11.0, are ready now :)

@jeejeelee jeejeelee added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Oct 31, 2025