
cache comms rdma registration #137

Merged
amirafzali merged 2 commits into main from cache-comms-rdma on Mar 19, 2026

Conversation


@amirafzali amirafzali commented Mar 10, 2026

Persists RDMA memory registrations in a transport-context cache, which reduces subsequent kernel calls for repeat tensor operations. Each entry holds a weakref to the tensor's underlying `untyped_storage` and drops the process-side registration once all views of the tensor are gone. Memory owned by the SV is always cached; memory owned by the client can be cached via TORCHSTORE_CLIENT_RDMA_CACHE, which defaults to true. I will provide a nice high-level config for all the magic env vars soon!
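The caching scheme described above can be sketched roughly as follows. This is a minimal illustration, not the actual torchstore implementation: `RdmaMemory`, `RdmaRegistrationCache`, and the env-var parsing are assumptions; `untyped_storage()` is the real PyTorch accessor.

```python
import gc
import os
import weakref

# Assumed env-var handling; the real parsing in torchstore may differ.
CLIENT_CACHE_ENABLED = os.environ.get("TORCHSTORE_CLIENT_RDMA_CACHE", "true").lower() != "false"

DROPPED = []  # for illustration only: records registrations we dropped


class RdmaMemory:
    """Stand-in for the real registration object: the real code would
    register the storage's pages with the NIC here (a kernel call)."""

    def __init__(self, storage):
        # Keep only metadata, never a strong reference to the storage,
        # otherwise the weakref in the cache could never fire.
        self.nbytes = getattr(storage, "nbytes", 0)

    def drop(self):
        # Real code: release the kernel-side registration.
        DROPPED.append(self)


class RdmaRegistrationCache:
    """Caches registrations keyed by the tensor's untyped storage.

    A weakref callback evicts an entry (and drops the process-side
    registration) once every view of that storage has been collected."""

    def __init__(self):
        self._cache = {}         # key -> RdmaMemory
        self._storage_refs = {}  # key -> weakref to the storage

    def register(self, tensor):
        # For a torch.Tensor this would be tensor.untyped_storage();
        # fall back to the object itself for plain buffers.
        storage = getattr(tensor, "untyped_storage", lambda: tensor)()
        key = id(storage)
        if key in self._cache:
            return self._cache[key]  # cache hit: no new kernel call
        mem = RdmaMemory(storage)
        self._cache[key] = mem
        self._storage_refs[key] = weakref.ref(
            storage, lambda _, k=key: self._evict(k)
        )
        return mem

    def _evict(self, key):
        self._storage_refs.pop(key, None)
        mem = self._cache.pop(key, None)
        if mem is not None:
            mem.drop()
```

Keying on the untyped storage rather than the tensor means that views and reshapes of the same buffer share one registration, and the registration outlives any individual view.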

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 10, 2026
@amirafzali amirafzali changed the base branch from batch-get to comms-rdma-batch March 10, 2026 20:06
@amirafzali amirafzali force-pushed the comms-rdma-batch branch 4 times, most recently from d146aac to a4b9083 Compare March 16, 2026 18:54
@amirafzali amirafzali changed the base branch from comms-rdma-batch to transport-context-refactor March 16, 2026 20:38
@amirafzali amirafzali force-pushed the transport-context-refactor branch from 21eceb0 to ee64303 Compare March 17, 2026 00:20
@amirafzali amirafzali changed the base branch from transport-context-refactor to main March 17, 2026 17:33
@amirafzali amirafzali changed the title from "Cache comms rdma registration" to "cache comms rdma registration" Mar 17, 2026

mem = RdmaMemory(tensor)
self._cache[key] = mem
self._storage_refs[key] = weakref.ref(
amirafzali (Member Author) commented:

monarch's RdmaBuffer drop() is async, so this same pattern may not work there; we might need a deferred drop. cc @allenwang28
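One shape a deferred drop could take: since a weakref callback runs synchronously and cannot await an async drop(), the callback only queues the buffer, and the queue is flushed later from an async context. This is a hypothetical sketch, not monarch's API; `AsyncRdmaBuffer` and `DeferredDropCache` are invented names.

```python
import asyncio
import gc
import weakref


class AsyncRdmaBuffer:
    """Stand-in for a buffer whose drop() is a coroutine."""

    def __init__(self):
        self.dropped = False

    async def drop(self):
        self.dropped = True


class DeferredDropCache:
    """Weakref callbacks fire synchronously during GC, so they cannot
    await an async drop(). The callback moves the buffer onto a pending
    list instead; flush() drains that list from an async context."""

    def __init__(self):
        self._cache = {}    # key -> buffer
        self._refs = {}     # key -> weakref to the storage
        self._pending = []  # buffers awaiting an async drop

    def register(self, key, storage, buf):
        self._cache[key] = buf
        self._refs[key] = weakref.ref(storage, lambda _, k=key: self._defer(k))
        return buf

    def _defer(self, key):
        # Synchronous path: do NOT drop here, just queue for later.
        self._refs.pop(key, None)
        buf = self._cache.pop(key, None)
        if buf is not None:
            self._pending.append(buf)

    async def flush(self):
        pending, self._pending = self._pending, []
        for buf in pending:
            await buf.drop()
```

flush() would need to be called periodically (or before shutdown) from whatever event loop owns the transport, otherwise registrations linger until then.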

@LucasLLC (Contributor) left a comment:

lgtm, ty

@amirafzali amirafzali merged commit 74dae88 into main Mar 19, 2026
8 checks passed

Labels

CLA Signed

2 participants