[Test] Add accuracy test for multiple models #3823

Conversation
Code Review
This pull request adds accuracy test configurations for seven new models. The configurations are mostly well-defined. However, I have identified significant security concerns in two of the new model configurations (Mistral-7B-Instruct-v0.1.yaml and Phi-4-mini-instruct.yaml). Both use re-uploaded models from third-party sources (AI-ModelScope and LLM-Research) while also enabling trust_remote_code: True. This practice poses a security risk by allowing the execution of arbitrary code from un-vetted repositories. My review comments highlight these issues and recommend using official model sources to mitigate the risks. Please address these high-severity security concerns.
```yaml
model_name: "AI-ModelScope/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
  - name: "gsm8k"
    metrics:
      - name: "exact_match,strict-match"
        value: 0.35
      - name: "exact_match,flexible-extract"
        value: 0.38
trust_remote_code: True
```
The configuration for Mistral-7B-Instruct-v0.1 uses a model from AI-ModelScope and sets trust_remote_code: True. The original mistralai/Mistral-7B-Instruct-v0.1 model does not require remote code execution. Using a third-party copy of the model with trust_remote_code enabled introduces a security risk, as it allows arbitrary code from the model repository to be executed.
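This kind of concern can also be caught mechanically in CI. Below is a minimal sketch of such a check; the `OFFICIAL_ORGS` allow-list and the `risky_configs` helper are illustrative assumptions, not part of this repository:

```python
# Hypothetical sketch: flag accuracy-test configs that combine a
# third-party model re-upload with trust_remote_code enabled.
# The allow-list below is illustrative, not the project's policy.
OFFICIAL_ORGS = {"mistralai", "microsoft", "meta-llama", "Qwen"}

def risky_configs(configs):
    """Return model names that enable trust_remote_code while
    pointing at an org outside the allow-list."""
    flagged = []
    for cfg in configs:
        org = cfg["model_name"].split("/")[0]
        if cfg.get("trust_remote_code") and org not in OFFICIAL_ORGS:
            flagged.append(cfg["model_name"])
    return flagged

configs = [
    {"model_name": "AI-ModelScope/Mistral-7B-Instruct-v0.1",
     "trust_remote_code": True},
    {"model_name": "mistralai/Mistral-7B-Instruct-v0.1"},
]
print(risky_configs(configs))  # ['AI-ModelScope/Mistral-7B-Instruct-v0.1']
```

Running a check like this over the `tests/e2e` config directory would flag both of the configurations discussed in this review.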
I suggest using the official model and removing trust_remote_code. Please note that you may need to re-evaluate and update the expected accuracy values after this change.
```yaml
model_name: "mistralai/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
  - name: "gsm8k"
    metrics:
      - name: "exact_match,strict-match"
        value: 0.35
      - name: "exact_match,flexible-extract"
        value: 0.38
```

```yaml
model_name: "LLM-Research/Phi-4-mini-instruct"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
  - name: "gsm8k"
    metrics:
      - name: "exact_match,strict-match"
        value: 0.81
      - name: "exact_match,flexible-extract"
        value: 0.81
trust_remote_code: True
num_fewshot: 5
batch_size: 32
gpu_memory_utilization: 0.8
```
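As context for the tuning knobs in this config: `gpu_memory_utilization` caps the fraction of device memory vLLM may claim, and whatever remains after model weights goes to KV cache and activations. The arithmetic below is an illustrative sketch only; the 64 GiB device size and ~7.7 GiB fp16 weight footprint are assumptions, not measured values:

```python
def kv_cache_budget_gib(total_gib, utilization, weights_gib):
    # Memory the engine may claim (total * utilization), minus the
    # assumed weight footprint, leaves the KV-cache/activation budget.
    return total_gib * utilization - weights_gib

# Assumed: 64 GiB device, utilization 0.8 (from the config above),
# ~7.7 GiB of fp16 weights for a ~3.8B-parameter model.
budget = kv_cache_budget_gib(64, 0.8, 7.7)
print(round(budget, 1))  # 43.5
```

Lowering `gpu_memory_utilization` shrinks this budget, which mainly reduces how many concurrent sequences the engine can batch.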
This configuration uses a re-uploaded model LLM-Research/Phi-4-mini-instruct and sets trust_remote_code: True. While official Microsoft Phi models often require trust_remote_code, using a third-party repository introduces a security risk from executing un-vetted code. It is highly recommended to use the official model from the microsoft organization on Hugging Face (e.g., microsoft/Phi-3-mini-4k-instruct or the correct official name for Phi-4 if available) to ensure code integrity and security. If this specific re-upload is necessary, please document the reason.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Force-pushed 9936499 to a6f8422
.github/workflows/accuracy_test.yaml (outdated)
```yaml
model_name: Qwen3-VL-30B-A3B-Instruct
# This model has a bug that needs to be fixed and re-added
# - runner: a2-2
#   model_name: Qwen3-VL-30B-A3B-Instruct
```
Why remove this test?
```yaml
num_fewshot: 5
tensor_parallel_size: 2
batch_size: 16
gpu_memory_utilization: 0.6
```
Just curious: why set gpu_memory_utilization to 0.6 here?
There were issues with the accuracy testing of this model, so it has been cancelled.
```yaml
model_name: "LLM-Research/Meta-Llama-3.1-8B-Instruct"
hardware: "Atlas A2 Series"
```
I noticed that some of the YAML files specify the hardware and some don't. I think there is no need to specify it? cc @zhangxinyuehfad, please also take a look.
The modifications have been unified across all configs.
```yaml
- name: "gsm8k"
  metrics:
    - name: "exact_match,strict-match"
      value: 0.35
```
Does this accuracy value indicate that there are accuracy issues with this model?
There were issues with the accuracy testing of this model, so it has been cancelled.
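For reference, a gate on expected values like these might compare the measured score against the YAML value with a tolerance. This is a minimal sketch only; the 4% relative tolerance and the `within_tolerance` helper are illustrative assumptions, not the project's actual CI policy:

```python
def within_tolerance(measured, expected, rel_tol=0.04):
    # Pass if the measured score is within rel_tol (relative) of the
    # expected value recorded in the accuracy-test YAML.
    return abs(measured - expected) <= rel_tol * expected

print(within_tolerance(0.36, 0.35))  # True
print(within_tolerance(0.30, 0.35))  # False
```

A low expected value such as 0.35 passes such a gate but, as noted above, can still signal a real accuracy problem worth investigating before the test is merged.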
```yaml
- name: "mmmu_val"
  metrics:
    - name: "acc,none"
      value: 0.52
```
ditto
MMMU is an extremely challenging benchmark for multidisciplinary and multimodal reasoning, and this test value falls within a reasonable range.
```yaml
- name: "mmmu_val"
  metrics:
    - name: "acc,none"
      value: 0.55
```
ditto
Force-pushed 435f1ab to 817603c
.github/workflows/accuracy_test.yaml (outdated)
```yaml
model_name: Qwen3-VL-30B-A3B-Instruct
# To do: This model has a bug that needs to be fixed and re-added
# - runner: a2-2
#   model_name: Qwen3-VL-30B-A3B-Instruct
```
Let's rebase your code and revert this change after #3897 is merged
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: MrZ20 <2609716663@qq.com>
Force-pushed d5e7a75 to 295afb2
Force-pushed 4eb667a to c683981
Force-pushed c683981 to 7b560a5
MengqingCao left a comment
LGTM, Thanks for your work!
Please also take a look and help merge, @wangxiyuan.
### What this PR does / why we need it?

Add accuracy test for multiple models:
- Meta_Llama_3.1_8B_Instruct
- Qwen2.5-Omni-7B
- Qwen3-VL-8B-Instruct

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Signed-off-by: luolun <luolun1995@cmbchina.com>
Signed-off-by: hwhaokun <haokun0405@163.com>
Signed-off-by: nsdie <yeyifan@huawei.com>
What this PR does / why we need it?
Add accuracy test for multiple models:
- Meta_Llama_3.1_8B_Instruct
- Qwen2.5-Omni-7B
- Qwen3-VL-8B-Instruct
Does this PR introduce any user-facing change?
How was this patch tested?