[Test] Add accuracy test for multiple models #3823

Conversation
Code Review
This pull request adds accuracy test configurations for seven new models. The configurations are mostly well-defined. However, I have identified significant security concerns in two of the new model configurations (Mistral-7B-Instruct-v0.1.yaml and Phi-4-mini-instruct.yaml). Both use re-uploaded models from third-party sources (AI-ModelScope and LLM-Research) while also enabling trust_remote_code: True. This practice poses a security risk by allowing the execution of arbitrary code from un-vetted repositories. My review comments highlight these issues and recommend using official model sources to mitigate the risks. Please address these high-severity security concerns.
```yaml
model_name: "AI-ModelScope/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
  - name: "gsm8k"
    metrics:
      - name: "exact_match,strict-match"
        value: 0.35
      - name: "exact_match,flexible-extract"
        value: 0.38
trust_remote_code: True
```
The configuration for Mistral-7B-Instruct-v0.1 uses a model from AI-ModelScope and sets trust_remote_code: True. The original mistralai/Mistral-7B-Instruct-v0.1 model does not require remote code execution. Using a third-party copy of the model with trust_remote_code enabled introduces a security risk, as it allows arbitrary code from the model repository to be executed.
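This kind of concern can also be caught mechanically in CI. Below is a minimal sketch of such a check; the `OFFICIAL_ORGS` allow-list and the `risky_configs` helper are illustrative assumptions, not part of this repository:

```python
# Hypothetical sketch: flag accuracy-test configs that combine a
# third-party model re-upload with trust_remote_code enabled.
# The allow-list below is illustrative, not the project's policy.
OFFICIAL_ORGS = {"mistralai", "microsoft", "meta-llama", "Qwen"}

def risky_configs(configs):
    """Return model names that enable trust_remote_code while
    pointing at an org outside the allow-list."""
    flagged = []
    for cfg in configs:
        org = cfg["model_name"].split("/")[0]
        if cfg.get("trust_remote_code") and org not in OFFICIAL_ORGS:
            flagged.append(cfg["model_name"])
    return flagged

configs = [
    {"model_name": "AI-ModelScope/Mistral-7B-Instruct-v0.1",
     "trust_remote_code": True},
    {"model_name": "mistralai/Mistral-7B-Instruct-v0.1"},
]
print(risky_configs(configs))  # ['AI-ModelScope/Mistral-7B-Instruct-v0.1']
```

Running a check like this over the `tests/e2e` config directory would flag both of the configurations discussed in this review.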
I suggest using the official model and removing trust_remote_code. Please note that you may need to re-evaluate and update the expected accuracy values after this change.
```yaml
model_name: "mistralai/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
  - name: "gsm8k"
    metrics:
      - name: "exact_match,strict-match"
        value: 0.35
      - name: "exact_match,flexible-extract"
        value: 0.38
```

```yaml
model_name: "LLM-Research/Phi-4-mini-instruct"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
  - name: "gsm8k"
    metrics:
      - name: "exact_match,strict-match"
        value: 0.81
      - name: "exact_match,flexible-extract"
        value: 0.81
trust_remote_code: True
num_fewshot: 5
batch_size: 32
gpu_memory_utilization: 0.8
```
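As context for the tuning knobs in this config: `gpu_memory_utilization` caps the fraction of device memory vLLM may claim, and whatever remains after model weights goes to KV cache and activations. The arithmetic below is an illustrative sketch only; the 64 GiB device size and ~7.7 GiB fp16 weight footprint are assumptions, not measured values:

```python
def kv_cache_budget_gib(total_gib, utilization, weights_gib):
    # Memory the engine may claim (total * utilization), minus the
    # assumed weight footprint, leaves the KV-cache/activation budget.
    return total_gib * utilization - weights_gib

# Assumed: 64 GiB device, utilization 0.8 (from the config above),
# ~7.7 GiB of fp16 weights for a ~3.8B-parameter model.
budget = kv_cache_budget_gib(64, 0.8, 7.7)
print(round(budget, 1))  # 43.5
```

Lowering `gpu_memory_utilization` shrinks this budget, which mainly reduces how many concurrent sequences the engine can batch.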
This configuration uses a re-uploaded model LLM-Research/Phi-4-mini-instruct and sets trust_remote_code: True. While official Microsoft Phi models often require trust_remote_code, using a third-party repository introduces a security risk from executing un-vetted code. It is highly recommended to use the official model from the microsoft organization on Hugging Face (e.g., microsoft/Phi-3-mini-4k-instruct or the correct official name for Phi-4 if available) to ensure code integrity and security. If this specific re-upload is necessary, please document the reason.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Force-pushed 9936499 to a6f8422
.github/workflows/accuracy_test.yaml (outdated)
```yaml
model_name: Qwen3-VL-30B-A3B-Instruct
# This model has a bug that needs to be fixed and re-added
# - runner: a2-2
#   model_name: Qwen3-VL-30B-A3B-Instruct
```
Why remove this test?
```yaml
num_fewshot: 5
tensor_parallel_size: 2
batch_size: 16
gpu_memory_utilization: 0.6
```
Just curious: why set gpu_memory_utilization to 0.6 here?
There were issues with the accuracy testing of this model, so it has been cancelled.
```yaml
model_name: "LLM-Research/Meta-Llama-3.1-8B-Instruct"
hardware: "Atlas A2 Series"
```
I noticed that some of the YAML files specify the hardware and some don't. I think there is no need to specify it? cc @zhangxinyuehfad, please also take a look.
The modifications have been unified across all configs.
```yaml
- name: "gsm8k"
  metrics:
    - name: "exact_match,strict-match"
      value: 0.35
```
Does this accuracy value indicate that there are accuracy issues with this model?
There were issues with the accuracy testing of this model, so it has been cancelled.
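For reference, a gate on expected values like these might compare the measured score against the YAML value with a tolerance. This is a minimal sketch only; the 4% relative tolerance and the `within_tolerance` helper are illustrative assumptions, not the project's actual CI policy:

```python
def within_tolerance(measured, expected, rel_tol=0.04):
    # Pass if the measured score is within rel_tol (relative) of the
    # expected value recorded in the accuracy-test YAML.
    return abs(measured - expected) <= rel_tol * expected

print(within_tolerance(0.36, 0.35))  # True
print(within_tolerance(0.30, 0.35))  # False
```

A low expected value such as 0.35 passes such a gate but, as noted above, can still signal a real accuracy problem worth investigating before the test is merged.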
```yaml
- name: "mmmu_val"
  metrics:
    - name: "acc,none"
      value: 0.52
```
ditto
MMMU is an extremely challenging benchmark for multidisciplinary and multimodal reasoning, and this test value falls within a reasonable range.
```yaml
- name: "mmmu_val"
  metrics:
    - name: "acc,none"
      value: 0.55
```
ditto
Force-pushed 435f1ab to 817603c
.github/workflows/accuracy_test.yaml (outdated)
```yaml
model_name: Qwen3-VL-30B-A3B-Instruct
# To do: This model has a bug that needs to be fixed and re-added
# - runner: a2-2
#   model_name: Qwen3-VL-30B-A3B-Instruct
```
Let's rebase your code and revert this change after #3897 is merged
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: MrZ20 <2609716663@qq.com>
Force-pushed d5e7a75 to 295afb2
Force-pushed 4eb667a to c683981
Force-pushed c683981 to 7b560a5
MengqingCao left a comment
LGTM, Thanks for your work!
Please also take a look and help merge, @wangxiyuan.
### What this PR does / why we need it?

Add accuracy test for multiple models:
- Meta_Llama_3.1_8B_Instruct
- Qwen2.5-Omni-7B
- Qwen3-VL-8B-Instruct

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Signed-off-by: luolun <luolun1995@cmbchina.com>
Signed-off-by: hwhaokun <haokun0405@163.com>
Signed-off-by: nsdie <yeyifan@huawei.com>
What this PR does / why we need it?
Add accuracy test for multiple models:
- Meta_Llama_3.1_8B_Instruct
- Qwen2.5-Omni-7B
- Qwen3-VL-8B-Instruct
Does this PR introduce any user-facing change?
How was this patch tested?