[0.11.0][doc] add aclgraph developer guide #3947
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds comprehensive documentation for the ACLGraph feature. A new file, docs/source/developer_guide/feature_guide/ACLGraph.md, is introduced, covering the motivation, usage, implementation details, and limitations of ACLGraph. The feature guide index is also updated accordingly. The documentation is a valuable addition for developers working with this feature. While the content is informative, the document would benefit from a proofreading pass to address several grammatical errors and improve overall clarity.
## How to use ACLGraph?

ACLGraph is enabled by default in V1 Engine, just set to use V1 Engine is enough.
Users need to check that enforce-eager is not set to True.
## How it works?

In short, graph mode works in two steps: **capture and replay**. When engine starts, we will capture all of the ops in model forward and save it as a graph, and when req come in, we just replay the graph on gpus, and waiting for result.
> we just replay the graph on gpus, and

Maybe npus? I'm not quite sure.
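For readers of this thread, here is a minimal sketch of the capture and replay idea described in the quoted paragraph. It is a toy model in plain Python, not the ACLGraph API: `ToyGraph`, `run`, and the static buffers are hypothetical names used only for illustration.

```python
from typing import Callable, List

Op = Callable[[List[float]], List[float]]


class ToyGraph:
    """Toy stand-in for a captured graph (hypothetical, not the real ACLGraph API)."""

    def __init__(self, buffer_size: int) -> None:
        # Static buffers: replay always reads from and writes to the same storage.
        self.input_buffer: List[float] = [0.0] * buffer_size
        self.output_buffer: List[float] = [0.0] * buffer_size
        self._ops: List[Op] = []

    def capture(self, ops: List[Op]) -> None:
        # Capture step: record the op sequence once, at engine start.
        self._ops = list(ops)

    def replay(self) -> None:
        # Replay step: run the recorded ops on the current buffer contents,
        # with no re-tracing and no input checks.
        values = list(self.input_buffer)
        for op in self._ops:
            values = op(values)
        self.output_buffer[:] = values


def run(graph: ToyGraph, request: List[float]) -> List[float]:
    """Copy a request into the static input buffer, replay, and slice the result."""
    n = len(request)
    size = len(graph.input_buffer)
    if n > size:
        raise ValueError("request is larger than the captured graph")
    graph.input_buffer[:n] = request
    graph.input_buffer[n:] = [0.0] * (size - n)  # zero the unused tail
    graph.replay()
    return graph.output_buffer[:n]
```

The point of the sketch is that replay does no work beyond running the recorded ops, which is why the input has to be copied into a buffer of the captured size first.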
### Padding and Bucketing

Due to graph can only replay the ops captured before, without doing tiling and checking graph input, so we need to ensure the consistency of the graph input, but we know that model input's shape depends on the request scheduled by Scheduler, we can't ensure the consistency.
> Due to graph can only replay the ops captured before, without doing tiling and checking graph input, so we need to ensure the consistency of the graph input

Grammar issue here: "due to" and "so" should not appear together.
Obviously, we can solve this problem by capturing the biggest shape and padding all of the model input to it. But it will bring a lot of redundant computing and make performance worse. So we can capture multiple graphs with different shape, and pad the model input to the nearest graph, it will greatly reduce redundant computing, but when `max_num_batched_tokens` is very large, the number of graphs that need to be captured will also become very large. But we know that when intensor's shape is large, the computing time will be very long, and graph mode is not necessary in this case. So all of things we need to do is:
Use periods to break it into several sentences.

> it will greatly reduce redundant computing

Change this to "which will ....".
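As a rough illustration of the padding and bucketing idea in the quoted paragraph, the sketch below picks the smallest captured size that fits a batch and falls back to eager execution when the batch exceeds every captured size. The size schedule (1, 2, 4, then multiples of 8) and the names `make_capture_sizes` and `pick_bucket` are assumptions made for this example, not the list or functions vLLM Ascend actually uses.

```python
import bisect
from typing import List, Optional


def make_capture_sizes(max_capture_size: int) -> List[int]:
    """Build a sorted list of graph sizes to capture (hypothetical schedule)."""
    sizes = [s for s in (1, 2, 4) if s <= max_capture_size]
    sizes += list(range(8, max_capture_size + 1, 8))
    return sizes


def pick_bucket(num_tokens: int, capture_sizes: List[int]) -> Optional[int]:
    """Return the smallest captured size >= num_tokens, or None to run eagerly."""
    idx = bisect.bisect_left(capture_sizes, num_tokens)
    if idx == len(capture_sizes):
        # Larger than every captured graph: compute time dominates here,
        # so graph mode is skipped, as the paragraph above argues.
        return None
    return capture_sizes[idx]


def padding_needed(num_tokens: int, capture_sizes: List[int]) -> int:
    """Number of padding tokens added when the batch is padded to its bucket."""
    bucket = pick_bucket(num_tokens, capture_sizes)
    return 0 if bucket is None else bucket - num_tokens
```

With several buckets, a batch of 5 tokens is padded to 8 rather than to the largest captured shape, which is the reduction in redundant computing the paragraph refers to.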
0c0bde9 to 2aaa7ee
Signed-off-by: zzzzwwjj <1183291235@qq.com>
2aaa7ee to 55bf1a9
What this PR does / why we need it?
Add aclgraph developer guide.
Does this PR introduce any user-facing change?
How was this patch tested?