WIP: pp pd decode wip #13
Draft
Coauthor: @ShangmingCai (mooncake backend)
WIP: async PP x PD decode loop.
PP is currently supported on both prefill and decode, but the PP size must be the same on both sides (or decode PP=1).
Currently the mooncake backend works; nixl is WIP.
This also needs more stress testing to exercise the retract req logic.
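The PP size constraint above can be sketched as a small check. This is only an illustration of the rule as stated in this description; `pp_sizes_compatible` is a hypothetical helper and is not part of this PR or of any existing API.

```python
def pp_sizes_compatible(prefill_pp: int, decode_pp: int) -> bool:
    """Hypothetical helper (not from this PR).

    Returns True when the prefill/decode PP sizes satisfy the constraint
    stated in the description: equal sizes, or decode PP == 1.
    """
    if prefill_pp < 1 or decode_pp < 1:
        raise ValueError("PP sizes must be positive integers")
    return decode_pp == prefill_pp or decode_pp == 1


if __name__ == "__main__":
    # The two configurations used in the runs described in this PR.
    for prefill_pp, decode_pp in [(4, 4), (2, 2)]:
        print(prefill_pp, decode_pp, pp_sizes_compatible(prefill_pp, decode_pp))
```

Under this reading, prefill PP=4 with decode PP=2 would be rejected, while prefill PP=4 with decode PP=1 would be accepted.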
Mistral 7B model, PP=4 on a single prefill worker (4 x H100) and PP=4 on a single decode worker (4 x H100). 128 prompts, with input and output lengths of approximately 1000 each.
Mistral 7B model, PP=2 on a single prefill worker (2 x H100) and PP=2 on a single decode worker (2 x H100). 256 prompts, with an input length of approximately 8000 and an output length of approximately 16000. This run tests the retract logic. Scheduler conservativeness = 0.1.
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist