@czhu15 czhu15 commented Nov 26, 2025

Split fp8_fused_sdpa into two phases to reduce TTFT (time to first token).
Phase 1 calls the fused_sdpa kernel without a mask for the prefix-cached part.
Phase 2 calls the fused_sdpa kernel with a mask for the new prompt part.
With the current Synapse fused_sdpa kernel, this split reduces both memory consumption and TTFT.
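
To illustrate the two-phase idea described above, here is a minimal pure-Python sketch. It is not the PR's implementation and does not call the real fused_sdpa kernel or use FP8. The function names (`partial_attention`, `merge`, `two_phase_attention`, `reference_attention`) are hypothetical, and the log-sum-exp merge of the two partial results is an assumption, since the description does not say how the phases are combined. It also assumes a non-empty cached prefix and a causal mask for the new prompt part.

```python
import math


def partial_attention(q, keys, values, allowed=None):
    """Attend one query to a block of keys (illustrative helper, not a real kernel).

    Returns (max_score, exp_sum, unnormalized_output) so that results from
    several key blocks can be merged later. allowed[j] == False masks key j;
    allowed=None means the block is attended without any mask.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    if allowed is not None:
        scores = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return m, sum(weights), out


def merge(part_a, part_b):
    """Combine two partial softmax results by rescaling to a shared maximum."""
    m_a, s_a, o_a = part_a
    m_b, s_b, o_b = part_b
    m = max(m_a, m_b)
    w_a, w_b = math.exp(m_a - m), math.exp(m_b - m)
    denom = w_a * s_a + w_b * s_b
    return [(w_a * x + w_b * y) / denom for x, y in zip(o_a, o_b)]


def two_phase_attention(queries, prefix_k, prefix_v, new_k, new_v):
    outputs = []
    for i, q in enumerate(queries):
        # Phase 1: cached prefix, no mask (every prefix key is visible).
        phase1 = partial_attention(q, prefix_k, prefix_v)
        # Phase 2: new prompt tokens, causal mask (assumed mask shape).
        causal = [j <= i for j in range(len(new_k))]
        phase2 = partial_attention(q, new_k, new_v, causal)
        outputs.append(merge(phase1, phase2))
    return outputs


def reference_attention(queries, prefix_k, prefix_v, new_k, new_v):
    """Single-pass attention with one mask over prefix + new keys."""
    keys, values = prefix_k + new_k, prefix_v + new_v
    n_prefix = len(prefix_k)
    outputs = []
    for i, q in enumerate(queries):
        allowed = [j < n_prefix or (j - n_prefix) <= i
                   for j in range(len(keys))]
        _, s, o = partial_attention(q, keys, values, allowed)
        outputs.append([x / s for x in o])
    return outputs
```

In this sketch the phase 2 mask covers only `len(new_k)` keys per query instead of `len(prefix_k) + len(new_k)`, which is one plausible source of the memory saving mentioned above. The PR itself does not confirm this.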

czhu15 commented Nov 26, 2025

cc @yangulei

@czhu15 czhu15 marked this pull request as draft November 26, 2025 00:40
Co-authored-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Bob Zhu <bob.zhu@intel.com>
@czhu15 czhu15 marked this pull request as ready for review December 1, 2025 01:54

czhu15 commented Dec 1, 2025

The APC example code produces the expected output.
TTFT drops to ~2 seconds on the customer's test data.
