[Feature] Better estimation policy #97
Quick question: does this feature support overlap scheduling? By default we turn on overlap scheduling, which improves GPU utilization.
@DarkSharpness Yes, this new feature will support overlap scheduling, just as SGLang does. I've already finished the design. In short, I will add a new flag to the request so that retracted requests can be skipped in _process_last_data. In addition, I will add an online benchmark setup in bench_qwen.py based on real-world prompts from WildChat to validate the benefit of this new feature. The work above is still WIP, and I expect to complete all development, unit tests, and benchmarking by the end of this weekend!
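The flag described in the reply above could look roughly like the sketch below. It is only an illustration: the `Req` stand-in, the `retracted` field name, and the `process_last_data` signature are assumptions, not the code in this PR. The idea is that with overlap scheduling the results of the previous batch are handled after the next batch has already been scheduled, so a request retracted in between should not have its stale token appended.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Req:
    """Hypothetical stand-in for the scheduler's request object."""
    uid: int
    output_ids: List[int] = field(default_factory=list)
    retracted: bool = False  # hypothetical flag name, set when the request is retracted


def process_last_data(batch: List[Req], next_tokens: List[int]) -> List[Req]:
    """Apply the previous step's sampled tokens, skipping retracted requests."""
    updated: List[Req] = []
    for req, token in zip(batch, next_tokens):
        if req.retracted:
            # The request was sent back to the waiting queue after this step
            # was launched, so its result is stale and is dropped.
            continue
        req.output_ids.append(token)
        updated.append(req)
    return updated
```

A retracted request keeps its earlier state untouched and can be re-admitted later through the normal prefill path.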
python/minisgl/scheduler/prefill.py
Outdated
    table_manager: TableManager

    @classmethod
    def create(
Use __post_init__ and initialize reserved_size in it.
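A minimal sketch of what this suggestion might look like, assuming the class is a dataclass. The class name `Adder`, the `running_reqs` field, and the simplified `Req` are hypothetical placeholders; only `reserved_size` and `_get_running_reserve` come from the diff.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Req:
    """Hypothetical stand-in; the real Req carries more state."""
    remain_len: int


@dataclass
class Adder:
    """Hypothetical name for the class under review."""
    running_reqs: List[Req]
    # Excluded from the generated __init__; filled in by __post_init__.
    reserved_size: int = field(init=False)

    def __post_init__(self) -> None:
        # Runs right after the generated __init__, so no separate
        # create() classmethod is needed to compute the reserve.
        self.reserved_size = sum(
            self._get_running_reserve(req) for req in self.running_reqs
        )

    def _get_running_reserve(self, req: Req) -> int:
        # Placeholder estimate: reserve every remaining token.
        return req.remain_len
```

With this shape, `Adder(running_reqs=reqs)` is enough to get an instance whose `reserved_size` is already consistent with its inputs.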
python/minisgl/scheduler/prefill.py
Outdated
        adder.reserved_size = sum(adder._get_running_reserve(req) for req in running_reqs)
        return adder

    def _get_running_reserve(self, req: Req) -> int:
Could we separate out the reserve-estimation module/class? It should not be tightly coupled with the prefiller.
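One possible shape for that separation is sketched below. The `ReserveEstimator` class, the `can_admit` helper, the `Req` fields, and the reservation formula are all assumptions made for illustration; the PR's actual formula may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Req:
    """Hypothetical stand-in for the scheduler's request type."""
    max_new_tokens: int
    generated: int = 0


@dataclass
class ReserveEstimator:
    """Hypothetical helper that owns the reservation formula."""
    new_token_ratio: float = 1.0

    def running_reserve(self, req: Req) -> int:
        # Assumed formula: reserve a fraction of the tokens the
        # request may still generate.
        remaining = max(req.max_new_tokens - req.generated, 0)
        return int(remaining * self.new_token_ratio)

    def total_reserve(self, running_reqs: Iterable[Req]) -> int:
        return sum(self.running_reserve(req) for req in running_reqs)


def can_admit(
    free_tokens: int,
    prompt_len: int,
    estimator: ReserveEstimator,
    running_reqs: Iterable[Req],
) -> bool:
    """Prefill-side check that only consumes the estimator's result."""
    return prompt_len + estimator.total_reserve(running_reqs) <= free_tokens
```

Keeping the formula behind one small interface means the prefill code never needs to know how the reserve is computed, and the estimator can be unit-tested on its own.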
5e32f9e to 14666d4
14666d4 to e3e712e
a7187be to 6cc0722
Motivation
Description
This PR adopts the more accurate estimation policy from SGLang, so that admission tracks actual memory usage more closely while decode can still recover safely if the estimates prove overly optimistic.
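The policy relies on new_token_ratio, which changes during decode and is reset when the scheduler becomes idle. The sketch below assumes the change is a linear decay toward a floor, which is how I understand SGLang's approach; the class name, method names, and default values are hypothetical and not taken from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class NewTokenRatio:
    """Hypothetical tracker for the fraction of remaining tokens to reserve."""
    init_ratio: float = 1.0   # assumed starting (most conservative) value
    min_ratio: float = 0.5    # assumed floor
    decay_step: float = 0.25  # assumed per-step decay
    value: float = field(init=False)

    def __post_init__(self) -> None:
        self.value = self.init_ratio

    def on_decode_step(self) -> None:
        # Each decode step lowers the ratio, making admission more
        # aggressive, but never below the floor.
        self.value = max(self.value - self.decay_step, self.min_ratio)

    def on_idle(self) -> None:
        # With nothing running there is no history to trust, so the
        # ratio returns to its initial value.
        self.value = self.init_ratio
```

A lower ratio admits more requests per prefill; if that turns out to be too optimistic, decode falls back on retracting requests, which is the recovery path the description refers to.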
[...] new_token_ratio.
[...] new_token_ratio during decode and resets it when the scheduler becomes idle.
[WIP] Benchmark Result
Scenario 1: Overlap_schedule = OFF, Offline benchmark, Real-world Prompts
Scenario 2: Overlap_schedule = ON, Offline benchmark, Real-world Prompts
Scenario 3: Overlap_schedule = OFF, Online benchmark, Real-world Prompts
Scenario 4: Overlap_schedule = ON, Online benchmark, Real-world Prompts
Checklist