[Feature] Better estimation policy by YzXiao101 · Pull Request #97 · sgl-project/mini-sglang

YzXiao101 · 2026-03-08T15:25:33Z

Motivation

Current prefill admission relies on rough token estimates rather than actual page-based KV memory usage.
It is also overly conservative, wasting a substantial share of the reserved KV cache space in real-world workloads, as demonstrated by PR #81 ([minor] Add WildChat offline benchmark with real-world prompts).

Description

This PR introduces a more accurate estimation policy from SGLang, aligning admission closer to actual memory usage while allowing decode to recover safely if estimates prove overly optimistic.

Makes prefill admission page-aware instead of relying on rough token reservations.
Designs a dynamic policy for controlling prefill admission centered around new_token_ratio.
Adds decode memory checks and retracts unsafe decode requests when necessary.
Requeues retracted requests as fresh pending requests when needed.
Updates new_token_ratio during decode and resets it when the scheduler becomes idle.
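
The points above can be illustrated with a minimal sketch. It is not the code in this PR: `RunningReq`, `can_admit`, `next_ratio`, and all parameter names are hypothetical, and the decay-toward-a-floor update is an assumption about how new_token_ratio might evolve during decode. The idea is to count whole KV pages rather than raw tokens, and to scale the not-yet-generated output by new_token_ratio when reserving space.

```python
from dataclasses import dataclass
from typing import List


def pages_needed(num_tokens: int, page_size: int) -> int:
    """Round a token count up to whole KV pages (ceiling division)."""
    return -(-num_tokens // page_size)


@dataclass
class RunningReq:
    # Hypothetical stand-in for a running request, not the minisgl Req class.
    cached_len: int     # tokens already held in the KV cache
    remaining_len: int  # output tokens the request may still generate


def extra_pages(current_len: int, extra_tokens: int, page_size: int) -> int:
    """Pages needed beyond those already holding current_len tokens."""
    return pages_needed(current_len + extra_tokens, page_size) - pages_needed(
        current_len, page_size
    )


def can_admit(
    prompt_len: int,
    max_new_tokens: int,
    running: List[RunningReq],
    free_pages: int,
    page_size: int,
    new_token_ratio: float,
) -> bool:
    """Admit a prefill only if its pages plus the running reserve fit."""
    # Reserve only a ratio-scaled part of each running request's remaining output.
    reserved = sum(
        extra_pages(r.cached_len, int(r.remaining_len * new_token_ratio), page_size)
        for r in running
    )
    # The new request needs its full prompt plus a ratio-scaled output estimate.
    needed = pages_needed(prompt_len + int(max_new_tokens * new_token_ratio), page_size)
    return needed + reserved <= free_pages


def next_ratio(ratio: float, decay: float, floor: float, initial: float, idle: bool) -> float:
    """Assumed update rule: decay toward a floor while decoding, reset when idle."""
    return initial if idle else max(floor, ratio - decay)
```

A lower new_token_ratio admits more prefill work up front, which is why the decode-side memory check and retraction are needed as a safety net when the estimate turns out to be too optimistic.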

[WIP] Benchmark Result

Sun Mar  8 16:15:58 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:25:00.0 Off |                  Off |
|  0%   30C    P8             20W /  450W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Scenario 1: Overlap_schedule = OFF, Offline benchmark, Real-world Prompts

feat/better_estimate_policy

MINISGL_DISABLE_OVERLAP_SCHEDULING=1 python benchmark/offline/bench_wildchat.py
...
Input length: count=256, min=9, p50=37, p90=471, p99=3665, max=4699
Output length: count=256, min=77, p50=691, p90=1730, p99=4096, max=4096
Bench requests: 256
Output budget: 1048576tok, Actual output: 246443tok
Total: 246443tok, Time: 54.56s, Throughput: 4517.28tok/s

main

Scenario 2: Overlap_schedule = ON, Offline benchmark, Real-world Prompts
Scenario 3: Overlap_schedule = OFF, Online benchmark, Real-world Prompts
Scenario 4: Overlap_schedule = ON, Online benchmark, Real-world Prompts

Checklist

Basic feature w/o overlap schedule.
Adaptation for overlap schedule.
Online benchmark w/ wildchat real-world prompt.
Format code with pre-commit.
Add unit tests.
Provide speed benchmark results.
Follow the minisgl code style.

DarkSharpness · 2026-03-10T03:02:45Z

Quick question: does this feature support overlap scheduling? By default we turn on overlap scheduling, which improves GPU utilization.

YzXiao101 · 2026-03-10T15:16:11Z

@DarkSharpness Yes, this new feature will support overlap scheduling just like SGLang. I’ve already finished the design. In short, I will add a new flag to the request so that retracted requests can be skipped in _process_last_data. In addition, I will add an online benchmark setup in bench_qwen.py based on real-world prompts from WildChat to validate the benefit from this new feature.

The work above is still WIP, and I expect to complete all development, unit tests, and benchmarking by the end of this weekend!
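
The skip-flag idea described above might look roughly like the sketch below. Everything in it is a hypothetical illustration, not the PR's code: the `retracted` field, the simplified `Req`, and `process_last_data` (a stand-in for `_process_last_data`). The assumption is that, with overlap scheduling, results from the previous batch are processed after the scheduler may already have retracted some of its requests, so their in-flight tokens should be dropped.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Req:
    # Simplified, hypothetical request; the real minisgl Req has more state.
    rid: int
    output: List[int] = field(default_factory=list)
    retracted: bool = False  # hypothetical flag set when the request is retracted


def process_last_data(batch: List[Req], next_tokens: List[int]) -> None:
    """Append each sampled token to its request, skipping retracted ones."""
    for req, token in zip(batch, next_tokens):
        if req.retracted:
            # The request was requeued; its in-flight token is discarded.
            continue
        req.output.append(token)
```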

DarkSharpness · 2026-03-11T18:42:38Z

python/minisgl/scheduler/prefill.py

    table_manager: TableManager

+    @classmethod
+    def create(


Use __post_init__ and initialize reserved_size in it.
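
A minimal sketch of this suggestion, assuming the class is a dataclass: `AdderSketch` and `running_reserves` are hypothetical names, and only `reserved_size` comes from the diff. A field declared with `init=False` is excluded from the generated constructor and can be computed in `__post_init__`, which removes the need for a separate `create` classmethod.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AdderSketch:
    # Hypothetical input: per-request reserve estimates for running requests.
    running_reserves: List[int]
    # Derived value: not a constructor argument, filled in by __post_init__.
    reserved_size: int = field(init=False)

    def __post_init__(self) -> None:
        self.reserved_size = sum(self.running_reserves)
```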

DarkSharpness · 2026-03-11T18:44:12Z

python/minisgl/scheduler/prefill.py

+        adder.reserved_size = sum(adder._get_running_reserve(req) for req in running_reqs)
+        return adder
+
+    def _get_running_reserve(self, req: Req) -> int:


Could we separate out the reserve-estimation module/class? It should not be tightly coupled with the prefiller.
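
One possible shape for that separation, as a sketch only: `ReserveEstimator`, `RatioEstimator`, and `PrefillSketch` are hypothetical names. The prefill side depends on a small estimator interface and receives a concrete estimator from outside, so the estimation policy can change without touching prefill code.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


class ReserveEstimator(Protocol):
    """Interface the prefill side depends on (hypothetical)."""

    def estimate(self, remaining_tokens: int) -> int: ...


@dataclass
class RatioEstimator:
    # Scales a request's remaining output budget by new_token_ratio.
    new_token_ratio: float

    def estimate(self, remaining_tokens: int) -> int:
        return int(remaining_tokens * self.new_token_ratio)


@dataclass
class PrefillSketch:
    # The estimator is injected, so this class knows nothing about the policy.
    estimator: ReserveEstimator

    def reserved_size(self, remaining_per_req: Iterable[int]) -> int:
        return sum(self.estimator.estimate(n) for n in remaining_per_req)
```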

YzXiao101 added 3 commits March 8, 2026 20:03

[Feature] Better estimation policy

6921281

todo: refactor code style

70ddfac

todo: skip retracted requests for overlap schedule

c81e0e1

DarkSharpness reviewed Mar 11, 2026

YzXiao101 force-pushed the feat/better_estimate_policy branch from 5e32f9e to 14666d4 Compare March 12, 2026 14:54

feat: online benchmark w/ wildchat real-world prompt

e3e712e

YzXiao101 force-pushed the feat/better_estimate_policy branch from 14666d4 to e3e712e Compare March 12, 2026 14:57

YzXiao101 added 6 commits March 14, 2026 00:09

feat: retraction safety assurance for overlap schedule

c1d100d

fix: use __post_init__ and initialize reserved_size in it

21978e5

fix: move estimate login into decode mgr

91293e4

fix: max_tokens correction for retracted req

ec885dd

fix: refactor retraction steps

fd32086

refactor: refactor code style

6cc0722

YzXiao101 force-pushed the feat/better_estimate_policy branch from a7187be to 6cc0722 Compare March 18, 2026 16:47

YzXiao101 wants to merge 10 commits into sgl-project:main from YzXiao101:feat/better_estimate_policy