
Commit 5f08e07

[Doc] Refactor the DeepSeek-V3.2-Exp tutorial. (#3871)
### What this PR does / why we need it?

Refactor the DeepSeek-V3.2-Exp tutorial.

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b

Signed-off-by: menogrey <1299267905@qq.com>
1 parent 49e6983 commit 5f08e07

File tree

9 files changed: +934 −431 lines changed

docs/source/conf.py

Lines changed: 3 additions & 0 deletions
@@ -80,6 +80,9 @@
 'ci_vllm_version': 'v0.11.0',
 }
+
+# For cross-file header anchors
+myst_heading_anchors = 5
 
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['_templates']

docs/source/developer_guide/evaluation/index.md

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
 :maxdepth: 1
 using_evalscope
 using_lm_eval
+using_ais_bench
 using_opencompass
 accuracy_report/index
 :::
docs/source/developer_guide/evaluation/using_ais_bench.md

Lines changed: 283 additions & 0 deletions

# Using AISBench

This document guides you through accuracy testing with [AISBench](https://gitee.com/aisbench/benchmark/tree/master), which provides accuracy and performance evaluation for many datasets.

## Online Server

### 1. Start the vLLM server

You can run a docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash
```
Run the vLLM server inside the container:

```{code-block} bash
:substitutions:
vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 35000 &
```

:::{note}
`--max_model_len` should be no less than `35000`, which is suitable for most datasets; a smaller value may affect the accuracy evaluation.
:::

The vLLM server has started successfully if you see logs like the following:

```
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
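Before running any evaluation, you can optionally confirm that the server is reachable from the host. A minimal check, assuming the default `localhost:8000` port mapping used above:

```shell
# Query the OpenAI-compatible model list endpoint; a JSON response that
# lists Qwen/Qwen2.5-0.5B-Instruct means the server is ready.
curl -s http://localhost:8000/v1/models
```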
### 2. Run different datasets using AISBench

#### Install AISBench

Refer to [AISBench](https://gitee.com/aisbench/benchmark/tree/master) for details.
Install AISBench from source:

```shell
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
```

Install the extra AISBench dependencies:

```shell
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt
```

Run `ais_bench -h` to check the installation.
#### Download Dataset

You can choose one or more datasets for accuracy evaluation; a combined download sketch for the zip-based datasets follows the list.

1. `C-Eval` dataset.

   Take the `C-Eval` dataset as an example; refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Each dataset has a `README.md` that describes the download and installation process in detail.

   Download the dataset and install it to the expected path:

   ```shell
   cd ais_bench/datasets
   mkdir ceval/
   mkdir ceval/formal_ceval
   cd ceval/formal_ceval
   wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
   unzip ceval-exam.zip
   rm ceval-exam.zip
   ```

2. `MMLU` dataset.

   ```shell
   cd ais_bench/datasets
   wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip
   unzip mmlu.zip
   rm mmlu.zip
   ```

3. `GPQA` dataset.

   ```shell
   cd ais_bench/datasets
   wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip
   unzip gpqa.zip
   rm gpqa.zip
   ```

4. `MATH` dataset.

   ```shell
   cd ais_bench/datasets
   wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip
   unzip math.zip
   rm math.zip
   ```

5. `LiveCodeBench` dataset.

   ```shell
   cd ais_bench/datasets
   git lfs install
   git clone https://huggingface.co/datasets/livecodebench/code_generation_lite
   ```

6. `AIME 2024` dataset.

   ```shell
   cd ais_bench/datasets
   mkdir aime/
   cd aime/
   wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip
   unzip aime.zip
   rm aime.zip
   ```

7. `GSM8K` dataset.

   ```shell
   cd ais_bench/datasets
   wget http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip
   unzip gsm8k.zip
   rm gsm8k.zip
   ```
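If you plan to evaluate more than one of the zip-based datasets, a small helper loop can fetch them in one pass. This is only a sketch that reuses the OpenCompass URLs listed above; `C-Eval`, `LiveCodeBench`, and `AIME 2024` need the special handling shown in their own steps:

```shell
cd ais_bench/datasets
# Download and unpack each dataset whose archive extracts in place.
for ds in mmlu gpqa math gsm8k; do
    wget "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/${ds}.zip"
    unzip "${ds}.zip"
    rm "${ds}.zip"
done
```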
#### Configuration

Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`.
There are several arguments that you should update according to your environment:

- `path`: Update to your model weight path.
- `model`: Update to your model name in vLLM.
- `host_ip` and `host_port`: Update to your vLLM server IP and port.
- `max_out_len`: Note that `max_out_len` plus the LLM input length should be less than `max-model-len` (configured in your vLLM server); `32768` is suitable for most datasets.
- `batch_size`: Update according to your dataset.
- `temperature`: Update this and the other generation arguments as needed.
```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="xxxx",
        model="xxxx",
        request_rate=0,
        retry=2,
        host_ip="localhost",
        host_port=8000,
        max_out_len=xxx,
        batch_size=xxx,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.6,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]
```
#### Execute Accuracy Evaluation

Run the following commands to execute the accuracy evaluation for each dataset:

```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --mode all --dump-eval-details --merge-ds

# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds

# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --mode all --dump-eval-details --merge-ds

# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --mode all --dump-eval-details --merge-ds
```
After each dataset execution, you can get the results from saved files such as `outputs/default/20250628_151326`. An example follows:

```
20250628_151326/
├── configs    # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
│   └── 20250628_151326_29317.py
├── logs       # Execution logs; if --debug is added to the command, no intermediate logs are saved to disk (all are printed directly to the screen)
│   ├── eval
│   │   └── vllm-api-general-chat
│   │       └── demo_gsm8k.out    # Logs of the accuracy evaluation process based on inference results in the predictions/ folder
│   └── infer
│       └── vllm-api-general-chat
│           └── demo_gsm8k.out    # Logs of the inference process
├── predictions
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json       # Inference results (all outputs returned by the inference service)
├── results
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json       # Raw scores calculated from the accuracy evaluation
└── summary
    ├── summary_20250628_151326.csv    # Final accuracy scores (in table format)
    ├── summary_20250628_151326.md     # Final accuracy scores (in Markdown format)
    └── summary_20250628_151326.txt    # Final accuracy scores (in text format)
```
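For a quick look at the final scores, you can pretty-print the summary CSV from your run (a sketch; substitute the timestamped directory produced by your own execution):

```shell
# Align the comma-separated summary columns into a readable table.
column -s ',' -t < outputs/default/20250628_151326/summary/summary_20250628_151326.csv
```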
#### Execute Performance Evaluation

Run the following commands to execute the performance evaluation for each dataset:

```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --summarizer default_perf --mode perf

# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf

# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --summarizer default_perf --mode perf

# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --summarizer default_perf --mode perf
```
After execution, you can get the results from the saved files. An example follows:

```
20251031_070226/
|-- configs    # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
|   `-- 20251031_070226_122485.py
|-- logs
|   `-- performances
|       `-- vllm-api-general-chat
|           `-- cevaldataset.out    # Logs of the performance evaluation process
`-- performances
    `-- vllm-api-general-chat
        |-- cevaldataset.csv             # Final performance results (in table format)
        |-- cevaldataset.json            # Final performance results (in JSON format)
        |-- cevaldataset_details.h5      # Detailed performance results
        |-- cevaldataset_details.json    # Detailed performance results
        |-- cevaldataset_plot.html       # Final performance results (in HTML format)
        `-- cevaldataset_rps_distribution_plot_with_actual_rps.html    # Final performance results (in HTML format)
```

docs/source/developer_guide/evaluation/using_lm_eval.md

Lines changed: 3 additions & 3 deletions
@@ -122,10 +122,10 @@ After 30 minutes, the output is as shown below:
 ```
 The markdown format results is as below:
 
-Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
+|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3215|± |0.0129|
-| | |strict-match | 5|exact_match|↑ |0.2077|± |0.0112|
+|gsm8k| 3|strict-match | 5|exact_match|↑ |0.2077|± |0.0112|
 
 ```

@@ -187,7 +187,7 @@ The markdown format results is as below:
 Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.3412|± |0.0131|
-| | |strict-match | 5|exact_match|↑ |0.3139|± |0.0128|
+|gsm8k| 3|strict-match | 5|exact_match|↑ |0.3139|± |0.0128|
 
 ```
