pyverbs: Optimize CQ poll to use batch ibv_poll_cq #1720

Open
PetruMicu wants to merge 1 commit into linux-rdma:master from PetruMicu:cq-batch-poll
@PetruMicu
Background
ibv_poll_cq(cq, num_entries, wc) is the libibverbs API for retrieving work completions from a Completion Queue. The num_entries parameter exists specifically to allow the caller to retrieve multiple CQEs in a single call.

Problem
The previous implementation of CQ.poll(num_entries) did not use this batching capability. Instead, it called ibv_poll_cq in a loop with num_entries=1, retrieving one completion at a time:

while npolled < num_entries:
    rc = v.ibv_poll_cq(self.cq, 1, &wc)  # single entry, repeated N times
    ...
    npolled += 1

This approach has two problems:

1. Call overhead multiplied by N. Each ibv_poll_cq invocation carries overhead: the verbs dispatch, memory barriers required to read the CQ ring buffer, and any provider-level locking. Calling it N times means paying that cost N times, even when all N completions are already sitting in the CQ.

2. Misuse of the API. The num_entries parameter exists precisely to amortize that overhead across a batch. Walking the CQ ring once for N entries is fundamentally cheaper than walking it N times for 1 entry each, both in terms of CPU cycles and memory access patterns.
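
To make the call-count difference concrete, here is a minimal pure-Python sketch. FakeCQ, poll_one_at_a_time and poll_batched are hypothetical names invented for illustration; they only model the "return up to num_entries completions" contract of ibv_poll_cq and are not part of pyverbs or libibverbs.

```python
class FakeCQ:
    """Toy stand-in for a completion queue; not the pyverbs CQ class."""

    def __init__(self, pending):
        self.pending = list(pending)  # completions waiting to be polled
        self.calls = 0                # how many times poll_cq() was invoked

    def poll_cq(self, num_entries):
        """Return up to num_entries completions, modelling ibv_poll_cq."""
        self.calls += 1
        batch = self.pending[:num_entries]
        del self.pending[:num_entries]
        return batch


def poll_one_at_a_time(cq, num_entries):
    """Old pattern: request one entry per call until done or empty."""
    wcs = []
    while len(wcs) < num_entries:
        batch = cq.poll_cq(1)
        if not batch:
            break  # queue drained before num_entries were collected
        wcs.extend(batch)
    return wcs


def poll_batched(cq, num_entries):
    """New pattern: request the whole batch in a single call."""
    return cq.poll_cq(num_entries)


old_cq = FakeCQ(range(8))
new_cq = FakeCQ(range(8))
old_wcs = poll_one_at_a_time(old_cq, 8)
new_wcs = poll_batched(new_cq, 8)
print(old_cq.calls, new_cq.calls)  # prints "8 1"
```

Both patterns return the same completions; only the number of calls into the polling routine differs.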

Change
Replace the loop with a single ibv_poll_cq call that passes the full num_entries count into a heap-allocated ibv_wc array:

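wcs_c = <v.ibv_wc *>malloc(num_entries * sizeof(v.ibv_wc))
if wcs_c == NULL:
    raise MemoryError('Failed to allocate WC array')
npolled = v.ibv_poll_cq(self.cq, num_entries, wcs_c)  # single batched call
for i in range(npolled):  # no iterations when npolled <= 0
    wcs.append(WC(...wcs_c[i]...))
free(wcs_c)  # release the temporary array once the WCs are copied out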

The return value (npolled, wcs) is unchanged, so full API compatibility is preserved. Callers that handle partial results (e.g. _poll_cq() in tests/utils.py) continue to work correctly, since ibv_poll_cq returns however many completions are available, up to num_entries.
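
As an illustration of a caller that tolerates partial results, the sketch below keeps polling until it has collected the requested number of completions. poll_until and StubCQ are hypothetical names for this example and do not reproduce the code in tests/utils.py; the only assumption taken from this PR is that poll(num_entries) returns an (npolled, wcs) pair.

```python
class StubCQ:
    """Hypothetical stand-in exposing poll(num_entries) -> (npolled, wcs)."""

    def __init__(self, arrivals):
        self.arrivals = list(arrivals)  # completions arriving before each poll
        self.ready = []                 # completions currently in the queue

    def poll(self, num_entries=1):
        if self.arrivals:
            self.ready.extend(self.arrivals.pop(0))
        wcs = self.ready[:num_entries]
        del self.ready[:num_entries]
        return len(wcs), wcs


def poll_until(cq, count, max_attempts=1000):
    """Collect `count` completions, tolerating short reads from cq.poll()."""
    collected = []
    attempts = 0
    while len(collected) < count:
        if attempts == max_attempts:
            raise TimeoutError(f'got {len(collected)} of {count} completions')
        attempts += 1
        # Ask only for what is still missing; fewer may come back.
        npolled, wcs = cq.poll(count - len(collected))
        if npolled < 0:
            raise RuntimeError(f'CQ poll failed with {npolled}')
        collected.extend(wcs)
    return collected


cq = StubCQ([['a', 'b'], [], ['c', 'd', 'e']])
result = poll_until(cq, 4)
print(result)  # prints "['a', 'b', 'c', 'd']"
```

A loop of this shape behaves the same whether poll() makes one batched call or N single-entry calls internally, which is why the batching change is transparent to such callers.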

Impact
In high-throughput workloads where poll(N) is called with N > 1 and N completions are already queued, this reduces the number of ibv_poll_cq invocations from N to 1, removing the repeated memory-barrier overhead and CQ ring traversals. The saving grows with the batch size.

Signed-off-by: Petru Micu <micu.petru2899@gmail.com>