Split fp8_fused_sdpa into two phases #2346

czhu15 · 2025-11-26T00:37:48Z

Split fp8_fused_sdpa into two phases to reduce TTFT (time to first token).
The first phase calls the fused_sdpa kernel without a mask for the prefix-cached part.
The second phase calls the fused_sdpa kernel with a mask for the new prompt part.
With the current Synapse fused_sdpa kernel, this split reduces both memory consumption and TTFT.
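
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of a two-phase split. It is not the code in this PR: it ignores fp8 and the Synapse kernel entirely, all function names are hypothetical, and the way the two partial results are merged (rescaling by each phase's running softmax maximum and sum) is an assumption, since the description does not say how the phases are combined. It also assumes the cached prefix is non-empty.

```python
import math


def partial_attention(q, k, v, causal):
    """Attend each row of q to k/v; return (unnormalized outputs, row maxes, row sums)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    outs, maxes, sums = [], [], []
    for i, qi in enumerate(q):
        # With a causal mask, query i only sees keys 0..i of this block.
        visible = range(i + 1) if causal else range(len(k))
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in visible]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        out = [sum(w * v[j][d] for w, j in zip(weights, visible))
               for d in range(len(v[0]))]
        outs.append(out)
        maxes.append(m)
        sums.append(sum(weights))
    return outs, maxes, sums


def merge_phases(a, b):
    """Combine two partial softmax results into one normalized output (assumed merge)."""
    merged = []
    for (o1, m1, s1), (o2, m2, s2) in zip(zip(*a), zip(*b)):
        m = max(m1, m2)
        c1, c2 = math.exp(m1 - m), math.exp(m2 - m)
        denom = c1 * s1 + c2 * s2
        merged.append([(c1 * x + c2 * y) / denom for x, y in zip(o1, o2)])
    return merged


def two_phase_attention(q, k_prefix, v_prefix, k_new, v_new):
    # Phase 1: every new token sees the whole cached prefix, so no mask is needed.
    phase1 = partial_attention(q, k_prefix, v_prefix, causal=False)
    # Phase 2: new tokens attend to each other under a causal mask.
    phase2 = partial_attention(q, k_new, v_new, causal=True)
    return merge_phases(phase1, phase2)


def single_pass_attention(q, k_prefix, v_prefix, k_new, v_new):
    """Reference: one masked pass over the concatenated prefix + new keys."""
    k, v = k_prefix + k_new, v_prefix + v_new
    p = len(k_prefix)
    scale = 1.0 / math.sqrt(len(q[0]))
    result = []
    for i, qi in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(qi, k[j]))
                  for j in range(p + i + 1)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        result.append([sum(w * v[j][d] for j, w in enumerate(weights)) / total
                       for d in range(len(v[0]))])
    return result
```

In this sketch the mask only has to cover the new-prompt block rather than the full prefix-plus-prompt length, which is presumably where the memory saving described above comes from.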

czhu15 · 2025-11-26T00:38:24Z

cc @yangulei

Co-authored-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Bob Zhu <bob.zhu@intel.com>

czhu15 · 2025-12-01T01:55:25Z

The APC example code produces correct output.
TTFT drops to ~2 seconds on the customer's test data.

czhu15 marked this pull request as draft November 26, 2025 00:40

Split fp8_fused_sdpa into two phases

82657ca

Co-authored-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Bob Zhu <bob.zhu@intel.com>

czhu15 marked this pull request as ready for review December 1, 2025 01:54
