Question on Implementing Context Parallelism with MoBA Efficient Attention #34

@weiye666

Thank you for providing the code—it has been very helpful!
I'm currently exploring the integration of Context Parallelism (CP) with the MoBA Efficient Attention mechanism, and I’d greatly appreciate it if you could share some technical insights or guidance on this topic.

From the implementation, it seems that only the key and value (KV) tensors are all-gathered and concatenated across devices, while the query (Q) tensors stay local to each device. Each device therefore holds Q for only its own slice of the sequence but KV for the full context, which raises a question:

Since MoBA's dynamic block selection depends on computing gating scores (Q × Kᵀ) to select the top-k blocks, how can the scores used for block-level routing be computed correctly under CP when only KV is all-gathered and Q remains sharded across devices?

Any clarification on how to enable block selection based on Q×Kᵀ in a CP-compatible setting would be immensely helpful. If there's any reference code, design suggestion, or architectural note on this part, I’d love to learn more.

Thank you in advance for your time and help!
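
For anyone reading along, here is a minimal sketch of one way the gating step could be checked under CP. It is not the repository's implementation. It assumes the gating score for a query and a block is the dot product of that query with the mean-pooled keys of the block, that future blocks are masked out, and that the query's own block is always selected, which is my reading of the MoBA paper. Under those assumptions each query's routing depends only on its own Q row, its global position, and the global per-block key means, so a local Q shard plus gathered K may be enough. The helper names (`block_means`, `select_blocks`), the contiguous sharding, and the simulated "all-gather" are all hypothetical and written in plain Python for illustration.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_means(k_rows, block_size):
    """Mean-pool keys over each contiguous block of block_size rows."""
    means = []
    for start in range(0, len(k_rows), block_size):
        block = k_rows[start:start + block_size]
        dim = len(block[0])
        means.append([sum(row[c] for row in block) / len(block) for c in range(dim)])
    return means

def select_blocks(q_rows, q_offset, k_means, block_size, top_k):
    """For each query, pick its own block plus the best-scoring earlier blocks.

    q_offset is the global position of q_rows[0] (assumes contiguous sharding);
    it is the only global information needed on the query side.
    """
    selected = []
    for local_i, q in enumerate(q_rows):
        cur = (q_offset + local_i) // block_size  # block containing this query
        past = list(range(cur))                   # causal: earlier blocks only
        past.sort(key=lambda j: dot(q, k_means[j]), reverse=True)
        selected.append(sorted([cur] + past[:top_k - 1]))
    return selected

random.seed(0)
seq_len, dim, block_size, top_k, world_size = 16, 4, 4, 2, 2
Q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
K = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]

# Stand-in for what every rank holds after all-gathering K (or only its block means).
k_means = block_means(K, block_size)

# Reference: routing computed with the full, unsharded Q.
reference = select_blocks(Q, 0, k_means, block_size, top_k)

# Simulated CP: each rank routes only its local Q shard against the global block means.
shard = seq_len // world_size
sharded = []
for rank in range(world_size):
    q_local = Q[rank * shard:(rank + 1) * shard]
    sharded += select_blocks(q_local, rank * shard, k_means, block_size, top_k)

assert sharded == reference
```

If the CP scheme uses a load-balanced (non-contiguous) split of the sequence, the single `q_offset` would presumably need to be replaced by a per-query array of global positions. The sketch also suggests that only the per-block key means, not the full K, are needed for routing itself, though the attention computation over the selected blocks still needs their K and V.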
