[Feature] Better estimation policy#97

Open
YzXiao101 wants to merge 10 commits into sgl-project:main from YzXiao101:feat/better_estimate_policy

@YzXiao101
Contributor

@YzXiao101 YzXiao101 commented Mar 8, 2026

Motivation

  1. Current prefill admission relies on rough token estimates rather than actual page-based KV memory usage.
  2. The current prefill admission is overly conservative and wastes a substantial share of reserved KV cache space in real-world workloads, as demonstrated by PR #81 ([minor] Add WildChat offline benchmark with real-world prompts).

Description

This PR introduces a more accurate estimation policy adapted from SGLang, aligning admission more closely with actual memory usage while allowing decode to recover safely if estimates prove overly optimistic.

  • Makes prefill admission page-aware instead of relying on rough token reservations.
  • Designs a dynamic policy for controlling prefill admission centered around new_token_ratio.
  • Adds decode memory checks and retracts unsafe decode requests when necessary.
  • Requeues retracted requests as fresh pending requests when needed.
  • Updates new_token_ratio during decode and resets it when the scheduler becomes idle.
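
To illustrate how these pieces could fit together, here is a minimal sketch rather than the code in this PR. The names `pages_needed`, `NewTokenRatio`, and `can_admit`, the default ratio values, and the choice to reset the ratio to its initial value on retraction are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


def pages_needed(num_tokens: int, page_size: int) -> int:
    """Whole KV pages required to store num_tokens tokens (ceiling division)."""
    return (num_tokens + page_size - 1) // page_size


@dataclass
class NewTokenRatio:
    """Hypothetical controller: the fraction of the remaining output budget
    that admission assumes each request will actually generate."""

    init: float = 0.5     # assumed starting value
    floor: float = 0.25   # assumed lower bound
    decay: float = 0.125  # assumed per-decode-step decrement
    value: float = field(init=False)

    def __post_init__(self) -> None:
        self.value = self.init

    def on_decode_step(self) -> None:
        # Decode succeeded without memory pressure: become more optimistic.
        self.value = max(self.floor, self.value - self.decay)

    def on_retract(self) -> None:
        # The estimate was too optimistic: fall back to the initial value
        # (a simplification chosen for this sketch).
        self.value = self.init

    def on_idle(self) -> None:
        # The scheduler has no work left: start over from the initial value.
        self.value = self.init


def can_admit(
    prompt_tokens: int,
    max_new_tokens: int,
    running_remaining: list[int],
    free_pages: int,
    page_size: int,
    ratio: float,
) -> bool:
    """Page-aware admission check for one candidate prefill request."""
    # Pages for the full prompt plus a ratio-scaled share of the output budget.
    need = pages_needed(prompt_tokens + int(max_new_tokens * ratio), page_size)
    # Pages held back for the expected future tokens of running requests.
    reserved = sum(
        pages_needed(int(remaining * ratio), page_size)
        for remaining in running_remaining
    )
    return need + reserved <= free_pages
```

With a lower ratio the same candidate fits into the same number of free pages, which is why an overly optimistic ratio has to be backed by the decode-time memory check and retraction described above.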

[WIP] Benchmark Result

Sun Mar  8 16:15:58 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:25:00.0 Off |                  Off |
|  0%   30C    P8             20W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Scenario 1: Overlap_schedule = OFF, Offline benchmark, Real-world Prompts

  • feat/better_estimate_policy
MINISGL_DISABLE_OVERLAP_SCHEDULING=1 python benchmark/offline/bench_wildchat.py
...
Input length: count=256, min=9, p50=37, p90=471, p99=3665, max=4699
Output length: count=256, min=77, p50=691, p90=1730, p99=4096, max=4096
Bench requests: 256
Output budget: 1048576tok, Actual output: 246443tok
Total: 246443tok, Time: 54.56s, Throughput: 4517.28tok/s
  • main

Scenario 2: Overlap_schedule = ON, Offline benchmark, Real-world Prompts
Scenario 3: Overlap_schedule = OFF, Online benchmark, Real-world Prompts
Scenario 4: Overlap_schedule = ON, Online benchmark, Real-world Prompts

Checklist

  • Basic feature w/o overlap schedule.
  • Adaptation for overlap schedule.
  • Online benchmark w/ wildchat real-world prompt.
  • Format code with pre-commit.
  • Add unit tests.
  • Provide speed benchmark results.
  • Follow the minisgl code style.

@DarkSharpness
Collaborator

Quick question: does this feature support overlap scheduling? By default we turn on overlap scheduling, which improves GPU utilization.

@YzXiao101
Contributor Author

Quick question: does this feature support overlap scheduling? By default we turn on overlap scheduling, which improves GPU utilization.

@DarkSharpness Yes, this new feature will support overlap scheduling, just like SGLang. I’ve already finished the design. In short, I will add a new flag to the request so that retracted requests can be skipped in _process_last_data. In addition, I will add an online benchmark setup in bench_qwen.py based on real-world prompts from WildChat to validate the benefit of this new feature.

The work above is still WIP, and I expect to complete all development, unit tests, and benchmarking by the end of this weekend!
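
A rough sketch of the flag idea, not the actual design: the `Req` class, the `retracted` field, and `process_last_data_sketch` below are hypothetical stand-ins, and the real `_process_last_data` may look quite different. The assumption is that, with overlap scheduling, results of the previous batch are processed after a request from that batch may already have been retracted, so its stale token should be dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Req:
    """Hypothetical request record used only for this sketch."""

    uid: int
    output_ids: list[int] = field(default_factory=list)
    retracted: bool = False  # assumed flag, set when the request is retracted


def process_last_data_sketch(batch: list[Req], next_tokens: list[int]) -> list[int]:
    """Append each sampled token to its request, skipping retracted requests.

    Returns the uids of the requests whose token was accepted.
    """
    accepted: list[int] = []
    for req, token in zip(batch, next_tokens):
        if req.retracted:
            # The request was requeued as a fresh pending request, so the
            # token produced by the in-flight batch is discarded.
            continue
        req.output_ids.append(token)
        accepted.append(req.uid)
    return accepted
```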

table_manager: TableManager

@classmethod
def create(
Collaborator

Use __post_init__ and initialize reserved_size in it.
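
For illustration, the suggestion could look roughly like the following, assuming the class is a dataclass. `AdderSketch` and its `running_reserves` field are hypothetical, and the real class has other fields such as `table_manager`.

```python
from dataclasses import dataclass, field


@dataclass
class AdderSketch:
    """Hypothetical stand-in for the class under review."""

    # Per-request reserve sizes of the currently running requests (assumed input).
    running_reserves: list[int]
    # Derived value: excluded from __init__ and filled in by __post_init__.
    reserved_size: int = field(init=False)

    def __post_init__(self) -> None:
        # Runs automatically after the generated __init__, so no separate
        # create() classmethod is needed to compute the derived field.
        self.reserved_size = sum(self.running_reserves)
```

Constructing `AdderSketch([3, 5])` then yields an instance whose `reserved_size` is already set.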

adder.reserved_size = sum(adder._get_running_reserve(req) for req in running_reqs)
return adder

def _get_running_reserve(self, req: Req) -> int:
Collaborator

@DarkSharpness DarkSharpness Mar 11, 2026

Could we separate out the reserve-estimation logic into its own module/class? It should not be tightly coupled with the prefiller.
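
One possible shape for such a split, offered only as a sketch: `ReserveEstimator`, `RatioReserveEstimator`, and `PrefillAdderSketch` are hypothetical names, and the ratio-based formula is an assumption rather than the formula used in this PR.

```python
from dataclasses import dataclass
from typing import Protocol


class ReserveEstimator(Protocol):
    """Hypothetical interface: pages to reserve for one running request."""

    def running_reserve(self, remaining_tokens: int) -> int: ...


@dataclass(frozen=True)
class RatioReserveEstimator:
    """Assumed policy: reserve pages for a ratio-scaled share of remaining tokens."""

    page_size: int
    new_token_ratio: float

    def running_reserve(self, remaining_tokens: int) -> int:
        expected = int(remaining_tokens * self.new_token_ratio)
        return -(-expected // self.page_size)  # ceiling division


@dataclass
class PrefillAdderSketch:
    """The prefill side only depends on the estimator interface."""

    estimator: ReserveEstimator

    def reserved_pages(self, remaining_per_req: list[int]) -> int:
        return sum(self.estimator.running_reserve(r) for r in remaining_per_req)
```

Swapping in a different estimation policy would then only require another class with a `running_reserve` method, without touching the prefill code.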

@YzXiao101 YzXiao101 force-pushed the feat/better_estimate_policy branch from 5e32f9e to 14666d4 Compare March 12, 2026 14:54
@YzXiao101 YzXiao101 force-pushed the feat/better_estimate_policy branch from 14666d4 to e3e712e Compare March 12, 2026 14:57
@YzXiao101 YzXiao101 force-pushed the feat/better_estimate_policy branch from a7187be to 6cc0722 Compare March 18, 2026 16:47