

@bluecoffee8 bluecoffee8 commented Nov 27, 2025

Co-author: @ShangmingCai (Mooncake backend)

WIP async PP x PD decode loop.
PP is currently supported on both prefill and decode, but the PP sizes must match (or decode must use PP=1).
The Mooncake backend currently works; NIXL support is still WIP.
The retract-request logic also needs more stress testing.
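For context, a PP x PD pair like the one benchmarked below can be brought up roughly as follows. This is a sketch only: the flag spellings and the mini_lb entry point reflect the existing SGLang disaggregation setup and are assumptions here, not taken from this PR.

# prefill worker, PP=4 over 4 GPUs
python -m sglang.launch_server --model-path /model/Mistral-7B-Instruct-v0.3 --pp-size 4 --disaggregation-mode prefill --disaggregation-transfer-backend mooncake --port 30000
# decode worker, PP=4 over 4 GPUs (PP size must match the prefill side, or be 1)
python -m sglang.launch_server --model-path /model/Mistral-7B-Instruct-v0.3 --pp-size 4 --disaggregation-mode decode --disaggregation-transfer-backend mooncake --port 30001
# minimal proxy that pairs a prefill worker and a decode worker per request
python -m sglang.srt.disaggregation.mini_lb --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 10035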

Mistral 7B, PP=4 on a single prefill worker (4 x H100) and PP=4 on a single decode worker (4 x H100). 128 prompts, approximately 1000 input and 1000 output tokens each.

#Input tokens: 121127
#Output tokens: 121703
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:22<00:00,  5.76it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     128       
Benchmark duration (s):                  22.23     
Total input tokens:                      121127    
Total input text tokens:                 121127    
Total input vision tokens:               0         
Total generated tokens:                  121703    
Total generated tokens (retokenized):    104595    
Request throughput (req/s):              5.76      
Input token throughput (tok/s):          5449.65   
Output token throughput (tok/s):         5475.57   
Total token throughput (tok/s):          10925.22  
Concurrency:                             118.81    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   20630.09  
Median E2E Latency (ms):                 20686.35  
---------------Time to First Token----------------
Mean TTFT (ms):                          18383.89  
Median TTFT (ms):                        20571.59  
P99 TTFT (ms):                           21981.57  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.38      
Median TPOT (ms):                        -0.00     
P99 TPOT (ms):                           22.56     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Mistral 7B, PP=2 on a single prefill worker (2 x H100) and PP=2 on a single decode worker (2 x H100). 256 prompts, approximately 8000 input and 16000 output tokens each, to exercise the retract logic. Scheduler conservativeness = 0.1 (see the note after the results below).

python -m sglang.bench_serving --port $PORT1 --dataset-name random-ids --num-prompts 256 --random-input-len 8000 --random-output-len 16000 --random-range-ratio 0.9 --disable-stream --host 0.0.0.0 --model /model/Mistral-7B-Instruct-v0.3 
benchmark_args=Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=10035, dataset_name='random-ids', dataset_path='', model='/model/Mistral-7B-Instruct-v0.3', served_model_name=None, tokenizer=None, num_prompts=256, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=8000, random_output_len=16000, random_range_ratio=0.9, image_count=1, image_resolution='1080p', image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=None, output_file=None, output_details=False, disable_tqdm=False, disable_stream=True, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, profile_activities=['CPU', 'GPU'], lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None)

#Input tokens: 1945808
#Output tokens: 3892159
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [30:00<00:00,  7.03s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     138       
Benchmark duration (s):                  1800.84   
Total input tokens:                      1050818   
Total input text tokens:                 1050818   
Total input vision tokens:               0         
Total generated tokens:                  2094308   
Total generated tokens (retokenized):    1112114   
Request throughput (req/s):              0.08      
Input token throughput (tok/s):          583.52    
Output token throughput (tok/s):         1162.96   
Total token throughput (tok/s):          1746.48   
Concurrency:                             83.14     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1084963.78
Median E2E Latency (ms):                 1182656.61
---------------Time to First Token----------------
Mean TTFT (ms):                          946516.75 
Median TTFT (ms):                        1016141.41
P99 TTFT (ms):                           1771151.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.28      
Median TPOT (ms):                        -0.00     
P99 TPOT (ms):                           116.85    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
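On the scheduler-conservativeness setting above: lowering --schedule-conservativeness below 1.0 makes the scheduler admit requests more aggressively, so the decode worker can run out of KV cache mid-decode and must retract running requests, which is exactly the code path this run stresses. A hedged sketch of the decode-side launch (the flag itself is a standard SGLang server argument; the rest of the command mirrors the setup sketched earlier and is an assumption):

# decode worker, PP=2, deliberately over-admitting to trigger retraction
python -m sglang.launch_server --model-path /model/Mistral-7B-Instruct-v0.3 --pp-size 2 --disaggregation-mode decode --disaggregation-transfer-backend mooncake --schedule-conservativeness 0.1 --port 30001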

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist
