Description
Thank you for providing the code—it has been very helpful!
I'm currently exploring the integration of Context Parallelism (CP) with the MoBA Efficient Attention mechanism, and I’d greatly appreciate it if you could share some technical insights or guidance on this topic.
From the implementation, it seems that the all-gather operation concatenates only the key and value (KV) blocks across devices, while the query (Q) vectors are not gathered. As a result, each device holds the full KV but only its local shard of Q, which raises a question:
MoBA's dynamic block selection computes gating scores (Q × Kᵀ) and selects the top-k blocks for each query. How is this step supposed to work correctly under CP when only KV is all-gathered and Q remains sharded across devices, so that no device has access to the global Q?
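To make the question concrete, here is a minimal NumPy sketch of my current understanding, not the repository's implementation. It assumes gating scores are computed per query row against mean-pooled key blocks, ignores causal masking, and simulates the all-gather with a plain concatenation. The function name `block_gate_topk` and all sizes are hypothetical, chosen only for illustration. Under those assumptions, the block selection computed from local Q shards appears to match the selection computed from global Q, because each query's scores depend only on that query and the full set of keys.

```python
import numpy as np

def block_gate_topk(q, k, block_size, top_k):
    """Hypothetical helper: for each query row, return the indices of the
    top_k key blocks ranked by gating score (no causal mask applied)."""
    num_blocks = k.shape[0] // block_size
    # Assumption: one representative key per block, obtained by mean pooling.
    k_blocks = k[: num_blocks * block_size].reshape(num_blocks, block_size, -1).mean(axis=1)
    scores = q @ k_blocks.T  # shape: (num_queries, num_blocks)
    # Sort descending by score and keep the first top_k block indices per query.
    return np.argsort(-scores, axis=1, kind="stable")[:, :top_k]

rng = np.random.default_rng(0)
seq_len, dim, block_size, top_k, world_size = 32, 8, 4, 2, 4
q = rng.standard_normal((seq_len, dim))
k = rng.standard_normal((seq_len, dim))

# Reference: a single device holding the global Q and global K.
global_sel = block_gate_topk(q, k, block_size, top_k)

# Simulated CP: each rank keeps its own Q shard; K shards are "all-gathered"
# (here just concatenated) so every rank sees the full K.
q_shards = np.split(q, world_size, axis=0)
k_shards = np.split(k, world_size, axis=0)
k_gathered = np.concatenate(k_shards, axis=0)
cp_sel = np.concatenate(
    [block_gate_topk(q_local, k_gathered, block_size, top_k) for q_local in q_shards],
    axis=0,
)

# Per-query block selections from sharded Q can be compared with the global ones.
selections_match = np.array_equal(global_sel, cp_sel)
```

If this reading is right, the remaining piece would be causal masking, which presumably needs each shard's global token offset. Please correct me if the actual implementation differs.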
Any clarification on how to enable block selection based on Q × Kᵀ in a CP-compatible setting would be immensely helpful. If there is any reference code, design suggestion, or architectural note on this part, I'd love to learn more.
Thank you in advance for your time and help!