Question on Implementing Context Parallelism with MoBA Efficient Attention #34

Thank you for providing the code; it has been very helpful!
I'm currently exploring how to integrate Context Parallelism (CP) with the MoBA Efficient Attention mechanism, and I'd greatly appreciate any technical insights or guidance on this topic.

From the implementation, it appears that the all-gather step concatenates only the key and value (KV) blocks across devices, while the query (Q) vectors stay sharded on their local devices. As a result, each device holds the global KV but only its local slice of Q, which raises a question:

Since MoBA's dynamic block selection depends on computing gating scores (Q × Kᵀ) to select top-k blocks, how is this step supposed to work correctly without access to global Q vectors, especially when Q is sharded across devices in CP?

In other words, how can we correctly compute the gating scores used for block-level routing in MoBA under CP, if only KV is all-gathered and Q remains local?
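To make the setup concrete, here is a minimal plain-Python sketch of the gating step as I understand it. It is not taken from the repository. It assumes that each query is scored against the mean-pooled key of every KV block, that future blocks are masked out, and that a query's own block is always kept. The names `block_means`, `select_blocks`, and `q_offset` are hypothetical, and the even split of the sequence across ranks is also an assumption.

```python
import random

def block_means(K, block_size):
    """Mean-pool the keys of each block into one representative vector."""
    dim = len(K[0])
    means = []
    for start in range(0, len(K), block_size):
        block = K[start:start + block_size]
        means.append([sum(k[j] for k in block) / len(block) for j in range(dim)])
    return means

def select_blocks(Q, q_offset, K, block_size, top_k):
    """Return, per query row, the sorted indices of its selected KV blocks.

    q_offset is the global position of Q[0]. On a CP rank it would be the
    start of that rank's sequence shard (an assumption of this sketch).
    """
    means = block_means(K, block_size)
    selected = []
    for i, q in enumerate(Q):
        own = (q_offset + i) // block_size  # block containing this query
        scores = []
        for b, m in enumerate(means):
            if b > own:
                scores.append(float("-inf"))  # future block: masked out
            elif b == own:
                scores.append(float("inf"))   # own block: always kept
            else:
                scores.append(sum(x * y for x, y in zip(q, m)))
        ranked = sorted(range(len(means)), key=lambda b: scores[b], reverse=True)
        selected.append(sorted(b for b in ranked[:top_k] if scores[b] != float("-inf")))
    return selected

random.seed(0)
seq_len, dim, block_size, top_k, world_size = 16, 4, 4, 2, 2
Q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
K = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]

# Reference: every query scored against every key block on a single device.
global_sel = select_blocks(Q, 0, K, block_size, top_k)

# CP-style: each rank keeps its local Q shard but sees the all-gathered K.
shard = seq_len // world_size
local_sel = []
for rank in range(world_size):
    q_local = Q[rank * shard:(rank + 1) * shard]
    local_sel += select_blocks(q_local, rank * shard, K, block_size, top_k)

assert local_sel == global_sel
```

In this toy version the per-rank selections match the single-device ones, because each query's scores depend only on that query and the gathered keys, provided the causal mask uses global positions. What I can't tell is whether the actual implementation applies the shard offset in this way.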

Any clarification on how to enable block selection based on Q × Kᵀ in a CP-compatible setting would be immensely helpful. If there is any reference code, design suggestion, or architectural note on this part, I'd love to learn more.

Thank you in advance for your time and help!
