

@bluecoffee8 bluecoffee8 commented Nov 27, 2025

Co-author: @ShangmingCai (Mooncake backend)

WIP async PP x PD decode loop.
PP is currently supported on both prefill and decode, but the PP sizes must match (or decode must use PP=1).
The Mooncake backend currently works; NIXL support is still WIP.
The retract-request logic also needs more stress testing.
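For context, a PP x PD pair like the one benchmarked below can be brought up roughly as follows. This is a sketch only: the flag spellings and the mini_lb entry point reflect the existing SGLang disaggregation setup and are assumptions here, not taken from this PR.

# prefill worker, PP=4 over 4 GPUs
python -m sglang.launch_server --model-path /model/Mistral-7B-Instruct-v0.3 --pp-size 4 --disaggregation-mode prefill --disaggregation-transfer-backend mooncake --port 30000
# decode worker, PP=4 over 4 GPUs (PP size must match the prefill side, or be 1)
python -m sglang.launch_server --model-path /model/Mistral-7B-Instruct-v0.3 --pp-size 4 --disaggregation-mode decode --disaggregation-transfer-backend mooncake --port 30001
# minimal proxy that pairs a prefill worker and a decode worker per request
python -m sglang.srt.disaggregation.mini_lb --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 10035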

Mistral 7B, PP=4 on a single prefill worker (4 x H100) and PP=4 on a single decode worker (4 x H100). 128 prompts, approximately 1000 input and 1000 output tokens each.

#Input tokens: 121127
#Output tokens: 121703
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:22<00:00,  5.76it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     128       
Benchmark duration (s):                  22.23     
Total input tokens:                      121127    
Total input text tokens:                 121127    
Total input vision tokens:               0         
Total generated tokens:                  121703    
Total generated tokens (retokenized):    104595    
Request throughput (req/s):              5.76      
Input token throughput (tok/s):          5449.65   
Output token throughput (tok/s):         5475.57   
Total token throughput (tok/s):          10925.22  
Concurrency:                             118.81    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20630.09  
Median E2E Latency (ms):                 20686.35  
---------------Time to First Token----------------
Mean TTFT (ms):                          18383.89  
Median TTFT (ms):                        20571.59  
P99 TTFT (ms):                           21981.57  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.38      
Median TPOT (ms):                        -0.00     
P99 TPOT (ms):                           22.56     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Mistral 7B, PP=2 on a single prefill worker (2 x H100) and PP=2 on a single decode worker (2 x H100). 256 prompts, approximately 8000 input and 16000 output tokens each, to exercise the retract logic. Scheduler conservativeness = 0.1 (see the note after the results below).

python -m sglang.bench_serving --port $PORT1 --dataset-name random-ids --num-prompts 256 --random-input-len 8000 --random-output-len 16000 --random-range-ratio 0.9 --disable-stream --host 0.0.0.0 --model /model/Mistral-7B-Instruct-v0.3 
benchmark_args=Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=10035, dataset_name='random-ids', dataset_path='', model='/model/Mistral-7B-Instruct-v0.3', served_model_name=None, tokenizer=None, num_prompts=256, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=8000, random_output_len=16000, random_range_ratio=0.9, image_count=1, image_resolution='1080p', image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=None, output_file=None, output_details=False, disable_tqdm=False, disable_stream=True, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, profile_activities=['CPU', 'GPU'], lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None)

#Input tokens: 1945808
#Output tokens: 3892159
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [30:00<00:00,  7.03s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     138       
Benchmark duration (s):                  1800.84   
Total input tokens:                      1050818   
Total input text tokens:                 1050818   
Total input vision tokens:               0         
Total generated tokens:                  2094308   
Total generated tokens (retokenized):    1112114   
Request throughput (req/s):              0.08      
Input token throughput (tok/s):          583.52    
Output token throughput (tok/s):         1162.96   
Total token throughput (tok/s):          1746.48   
Concurrency:                             83.14     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1084963.78
Median E2E Latency (ms):                 1182656.61
---------------Time to First Token----------------
Mean TTFT (ms):                          946516.75 
Median TTFT (ms):                        1016141.41
P99 TTFT (ms):                           1771151.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.28      
Median TPOT (ms):                        -0.00     
P99 TPOT (ms):                           116.85    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
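On the scheduler-conservativeness setting above: lowering --schedule-conservativeness below 1.0 makes the scheduler admit requests more aggressively, so the decode worker can run out of KV cache mid-decode and must retract running requests, which is exactly the code path this run stresses. A hedged sketch of the decode-side launch (the flag itself is a standard SGLang server argument; the rest of the command mirrors the setup sketched earlier and is an assumption):

# decode worker, PP=2, deliberately over-admitting to trigger retraction
python -m sglang.launch_server --model-path /model/Mistral-7B-Instruct-v0.3 --pp-size 2 --disaggregation-mode decode --disaggregation-transfer-backend mooncake --schedule-conservativeness 0.1 --port 30001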

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist
