@whoisj (Collaborator) commented Nov 18, 2025

This change moves the nixl_connect library from a persistent, connector-based design to a per-connection design. Whenever an active or passive operation is created, a connection object is also created to represent the local side of the connection to a remote worker.
The connection keeps descriptor data and operation state separate when multiple operations execute at the same time.

Prior to this change, two concurrent operations could intersect, leading to errors and disconnections.
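
The isolation described above can be illustrated with a minimal sketch. The `Connector` and `Connection` classes below are hypothetical stand-ins written for this example, not the real nixl_connect classes, and `create_connection` and `run_operation` are assumed names. The point is only that each operation gets its own connection object, so two interleaved operations never share descriptor state.

```python
import asyncio


class Connection:
    """Hypothetical stand-in: holds the state for exactly one operation."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.descriptors: list[str] = []  # state private to this connection


class Connector:
    """Hypothetical stand-in: hands out a fresh Connection per operation."""

    def __init__(self) -> None:
        self._connection_count = 0

    async def create_connection(self) -> Connection:
        self._connection_count += 1
        return Connection(f"connection-{self._connection_count}")


async def run_operation(connector: Connector, payload: str) -> Connection:
    connection = await connector.create_connection()
    connection.descriptors.append(payload)
    await asyncio.sleep(0)  # yield so the two operations interleave
    return connection


async def main() -> list[Connection]:
    connector = Connector()
    return await asyncio.gather(
        run_operation(connector, "tensor-a"),
        run_operation(connector, "tensor-b"),
    )


connections = asyncio.run(main())
for connection in connections:
    print(connection.name, connection.descriptors)
```

Each printed connection lists only the payload of the operation that created it, even though both operations ran against the same connector.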

Includes documentation changes and updates to dependent EPD code.

Summary by CodeRabbit

  • New Features

    • Added hostname property for accessing connector information.
    • Introduced Connection abstraction for improved per-connection state management.
  • Bug Fixes

    • Removed unnecessary explicit initialization calls improving startup sequencing.
  • Refactor

    • Converted read/write creation operations to async-first APIs for better concurrency handling.
    • Deprecated Connector.initialize() method; initialization now occurs implicitly.
    • Removed namespace and runtime properties from public API.
  • Documentation

    • Updated API documentation and examples reflecting async operation patterns.

copy-pr-bot bot commented Nov 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

This change moves the nixl_connect library from a persistent connector based design
to a per connection based design. When an active or passive operation is created,
a connection object is created to represent the local side of a connection to a remote worker.
The connection is responsible for keeping the descriptor data and operation state separated
when multiple operations are executing at the same time.

Prior to this change, it was possible for two operations to intersect leading to errors and
disconnections.

Signed-off-by: J Wyman <jwyman@nvidia.com>

This change updates nixl_connect documentation to reflect the changes in the previous commit.

Signed-off-by: J Wyman <jwyman@nvidia.com>

This change updates all of the Dynamo usages of nixl_connect to adopt the changes in the previous commit.

- Removal of `Connector.initialize()` calls.
- Addition of the `await` keyword to `Connector.create_readable()` and `.create_writable()` calls.

Signed-off-by: J Wyman <jwyman@nvidia.com>
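
The two migration steps listed in this commit message can be sketched as follows. The `Connector` and `ReadableOperation` classes here are simplified stand-ins invented for the example (the real classes live in `dynamo.nixl_connect`), and `handle_request` is a hypothetical caller. The sketch assumes only what the PR states: `create_readable()` is now a coroutine, and the operation it returns is used as a synchronous context manager.

```python
import asyncio


class ReadableOperation:
    """Stand-in for the real operation: a synchronous context manager."""

    def __init__(self) -> None:
        self.released = False

    def __enter__(self) -> "ReadableOperation":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.released = True  # resources are released when the block exits

    def metadata(self) -> dict:
        return {"kind": "read"}

    async def wait_for_completion(self) -> None:
        await asyncio.sleep(0)


class Connector:
    """Stand-in connector whose factory method is a coroutine."""

    async def create_readable(self, descriptor: object) -> ReadableOperation:
        return ReadableOperation()


async def handle_request(connector: Connector, descriptor: object):
    # Before: `await connector.initialize()` at startup, then
    #         `with connector.create_readable(descriptor) as readable:`
    # After:  no initialize() call; await the factory in the with statement.
    with await connector.create_readable(descriptor) as readable:
        metadata = readable.metadata()
        await readable.wait_for_completion()
    return metadata, readable.released


metadata, released = asyncio.run(handle_request(Connector(), object()))
print(metadata, released)
```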
@whoisj whoisj force-pushed the jwyman/nixl_connect/better_concurrency branch from cd17235 to 177e283 Compare November 18, 2025 16:58
coderabbitai bot (Contributor) commented Nov 18, 2025

Walkthrough

The changes introduce asynchronous operation creation for create_readable() and create_writable() methods, remove explicit connector initialization calls across multiple handler files, and refactor the core NIXL Connect binding to introduce a new Connection abstraction layer that encapsulates per-connection state. API documentation and usage examples are updated to reflect these async patterns.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Connector initialization removal: components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py, components/src/dynamo/trtllm/main.py, components/src/dynamo/vllm/multimodal_handlers/worker_handler.py, examples/multimodal/components/encode_worker.py, examples/multimodal/components/video_encode_worker.py | Removes await self._connector.initialize() calls from async initialization routines, deferring or eliminating explicit connector setup during startup. |
| Async operation creation adoption: components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py, components/src/dynamo/trtllm/encode_helper.py, components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py, examples/multimodal/components/encode_worker.py, examples/multimodal/components/video_encode_worker.py | Converts create_readable() and create_writable() calls to async, requiring await before context manager entry; control flow shifts to await operation creation before use. |
| API documentation updates: docs/api/nixl_connect/connector.md, docs/api/nixl_connect/readable_operation.md, docs/api/nixl_connect/writable_operation.md, docs/api/nixl_connect/read_operation.md, docs/api/nixl_connect/write_operation.md | Updates example usage to reflect async method signatures; removes initialize() calls from startup sequences; connector.md adds hostname property and removes namespace and runtime properties. |
| Core NIXL Connect binding refactoring: lib/bindings/python/src/dynamo/nixl_connect/__init__.py | Introduces Connection class to encapsulate per-connection NIXL agent state; migrates all operation classes (AbstractOperation, ReadOperation, WriteOperation, etc.) to use Connection instead of Connector; makes begin_read(), begin_write(), create_readable(), and create_writable() async; adds private _create_connection() method; updates descriptor registration and remote agent handling to reference connection objects. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

  • lib/bindings/python/src/dynamo/nixl_connect/__init__.py: Substantial refactoring introducing a new Connection abstraction with extensive updates to class constructors, property references, and lifecycle management across multiple operation types. Requires careful validation of memory registration, remote agent interactions, and backward compatibility implications.
  • Async pattern consistency: Verify that all await calls are placed correctly in context manager expressions and that initialization sequences are properly deferred across handler files.
  • Public API signature changes: Confirm that create_readable(), create_writable(), begin_read(), begin_write() are correctly documented as async in all affected modules and that callers properly await these operations.
  • Property migration: Validate removal of namespace and runtime properties from Connector and addition of hostname property, ensuring no breaking changes to external consumers.

Poem

🐰 Connectors now birth Connections with grace,
Async awaitings paint data-flow's face,
No init-less hops—each create's embraced,
Old properties fade, hostname takes place,
Hopping through bindings at lightning's own pace! ⚡

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 58.49%, which is below the required threshold of 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Description check | ❓ Inconclusive | The PR description covers the main change and rationale, but lacks specific details on files changed and where to start review, as required by the template. | Add a 'Where should the reviewer start?' section highlighting key files (nixl_connect/__init__.py, connector.md) and more detail on implementation changes for clarity. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: nixl_connect: Improve Concurrency Support' accurately describes the main change: refactoring nixl_connect to support concurrent operations via per-connection design. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
lib/bindings/python/src/dynamo/nixl_connect/__init__.py (1)

1739-1791: WriteOperation type check incorrectly expects Connector instead of Connection

The new constructor signature is:

def __init__(
    self,
    connection: Connection,
    local_descriptors: Descriptor | list[Descriptor],
    remote_metadata: RdmaMetadata,
) -> None:

But the implementation still does:

if not isinstance(connection, Connector):
    raise TypeError(
        "Argument `connector` must be `dynamo.nixl_connect.Connector`."
    )

This is a blocking bug:

  • Callers (e.g., Connector.begin_write) pass a Connection, not a Connector.
  • Every valid call will raise TypeError, preventing write operations from ever starting.
  • The error message is also now misleading (connector vs connection).

The subsequent Remote(connection, remote_metadata.nixl_metadata) call is otherwise correct, since Remote expects a Connection.

Suggested fix:

-        if not isinstance(connection, Connector):
-            raise TypeError(
-                "Argument `connector` must be `dynamo.nixl_connect.Connector`."
-            )
+        if not isinstance(connection, Connection):
+            raise TypeError(
+                "Argument `connection` must be `dynamo.nixl_connect.Connection`."
+            )

Once updated, the rest of the logic (metadata validation, Remote construction, super call) aligns with the new per-connection model.
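
The effect of the wrong `isinstance` target can be reproduced in isolation. In the sketch below, `Connector` and `Connection` are empty stand-in classes, and `check_buggy`, `check_fixed`, and `accepts` are hypothetical helpers written for this example. Only the two type checks mirror the snippets quoted above.

```python
class Connector:
    """Empty stand-in for the real Connector class."""


class Connection:
    """Empty stand-in for the real Connection class."""


def check_buggy(connection: object) -> None:
    # Mirrors the check flagged in the review: it tests for the wrong type.
    if not isinstance(connection, Connector):
        raise TypeError(
            "Argument `connector` must be `dynamo.nixl_connect.Connector`."
        )


def check_fixed(connection: object) -> None:
    # Mirrors the suggested fix: test for Connection and name it correctly.
    if not isinstance(connection, Connection):
        raise TypeError(
            "Argument `connection` must be `dynamo.nixl_connect.Connection`."
        )


def accepts(check, value: object) -> bool:
    """Return True when `check` lets `value` through without raising."""
    try:
        check(value)
    except TypeError:
        return False
    return True


connection = Connection()
print("buggy check accepts a Connection:", accepts(check_buggy, connection))
print("fixed check accepts a Connection:", accepts(check_fixed, connection))
```

With the buggy check, every valid `Connection` argument is rejected, which is why the review marks the issue as blocking.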

docs/api/nixl_connect/read_operation.md (1)

33-44: Doc example for begin_read has argument order reversed

The example currently shows:

with await self.connector.begin_read(descriptor, remote_metadata) as read_op:

But the actual signature is:

async def begin_read(
    self,
    remote_metadata: RdmaMetadata,
    local_descriptors: Descriptor | list[Descriptor],
) -> ReadOperation:

Passing descriptor first and remote_metadata second will fail the type check on remote_metadata. The example should instead be:

with await self.connector.begin_read(remote_metadata, descriptor) as read_op:
    ...

so that the RdmaMetadata and descriptor parameters line up correctly.
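
One way to make such examples robust against positional mix-ups is to pass the arguments by keyword. The sketch below uses stand-in `RdmaMetadata` and `Descriptor` dataclasses and a free function `begin_read` with the parameter names quoted above; none of these are the real nixl_connect implementations, and the return value `"read-op"` is a placeholder.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class RdmaMetadata:  # stand-in for the real metadata type
    nixl_metadata: str


@dataclass
class Descriptor:  # stand-in for the real descriptor type
    data: bytes


async def begin_read(remote_metadata, local_descriptors) -> str:
    # Mirrors the parameter order quoted in the review: metadata first.
    if not isinstance(remote_metadata, RdmaMetadata):
        raise TypeError("Argument `remote_metadata` must be `RdmaMetadata`.")
    return "read-op"


async def main() -> tuple[str, str]:
    descriptor = Descriptor(b"\x00" * 8)
    remote_metadata = RdmaMetadata("opaque")
    try:
        # Reversed positional order, as in the old documentation example.
        await begin_read(descriptor, remote_metadata)
        reversed_outcome = "accepted"
    except TypeError:
        reversed_outcome = "rejected"
    # Keyword arguments make the call independent of positional order.
    keyword_outcome = await begin_read(
        local_descriptors=descriptor, remote_metadata=remote_metadata
    )
    return reversed_outcome, keyword_outcome


reversed_outcome, keyword_outcome = asyncio.run(main())
print(reversed_outcome, keyword_outcome)
```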

🧹 Nitpick comments (6)
lib/bindings/python/src/dynamo/nixl_connect/__init__.py (3)

502-586: Connection abstraction is mostly solid; minor cleanup opportunities

The Connection class cleanly encapsulates name, connector, and a dedicated nixl_agent, and the async initialize hook gives you room for future setup while staying idempotent.

Two small nits:

  • _agent_metadata is never used; you can drop it or extend metadata to cache into it if you intend to reuse.
  • The docstring under initialize still talks about initializing the “connector”; consider updating to “connection” to avoid confusion.

These are cosmetic and can be deferred.


661-745: Async begin_read / begin_write correctly create per-call Connections

Using await self._create_connection() and passing the resulting Connection into ReadOperation and WriteOperation achieves the PR’s goal of isolating per-operation NIXL state. The type checks on metadata/descriptors and operation-kind validation are preserved.

One inconsistency to consider tightening later: begin_read compares remote_metadata.operation_kind to OperationKind.READ.value (int), while begin_write compares to OperationKind.WRITE (IntEnum). Both work, but using .value in both places would be more uniform.
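
The claim that both comparisons work follows from how `IntEnum` members behave: they are `int` subclasses, so they compare equal to plain integers with or without `.value`. The enum below is a stand-in; the numeric values 1 and 2 are assumptions made for the example and may differ from the real `OperationKind`.

```python
from enum import IntEnum


class OperationKind(IntEnum):
    # Hypothetical values; the real enum may use different numbers.
    READ = 1
    WRITE = 2


received_kind = 1  # e.g. an integer decoded from remote metadata

matches_member = received_kind == OperationKind.READ
matches_value = received_kind == OperationKind.READ.value

print(matches_member, matches_value)
print(type(OperationKind.READ.value) is int)
print(isinstance(OperationKind.READ, int))
```

Using `.value` on both sides of the codebase would not change behavior; it would only make the two call sites read the same.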


1685-1706: WritableOperation constructor aligns with Connection-based PassiveOperation

Passing connection into PassiveOperation for a WRITE operation is consistent with the per-connection refactor. The docstring still mentions local/Connector in the “Raises” section, which you might want to update to connection/Connection, but the behavior is correct.

docs/api/nixl_connect/write_operation.md (1)

34-45: Example now matches async begin_write API

Using with await self.connector.begin_write(descriptor, remote_metadata) as write_op: correctly reflects the async begin_write signature and keeps the lifetime of the WriteOperation scoped to the context manager. If you want slightly clearer style, you could split it into write_op = await ... followed by with write_op:, but the current form is fine.

docs/api/nixl_connect/writable_operation.md (1)

34-49: Writable operation example correctly reflects async factory

Using with await self.connector.create_writable(descriptor) as write_op: matches the async create_writable API and keeps the operation scoped to the context manager while awaiting completion. If desired, you could split the await and with for clarity, but the current form is functionally sound.
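
The combined and split styles mentioned in these comments should run the same lifecycle. The sketch below checks that with a stand-in `WritableOperation` that records each step in a log; `create_writable` here is a hypothetical free function, not the real connector method.

```python
import asyncio


class WritableOperation:
    """Stand-in operation that records its lifecycle events in a log."""

    def __init__(self, log: list[str]) -> None:
        self.log = log

    def __enter__(self) -> "WritableOperation":
        self.log.append("enter")
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.log.append("exit")

    async def wait_for_completion(self) -> None:
        self.log.append("completed")


async def create_writable(log: list[str]) -> WritableOperation:
    log.append("created")
    return WritableOperation(log)


async def combined_form() -> list[str]:
    log: list[str] = []
    # Await the factory directly inside the with statement.
    with await create_writable(log) as write_op:
        await write_op.wait_for_completion()
    return log


async def split_form() -> list[str]:
    log: list[str] = []
    # Await first, then enter the context manager on a separate line.
    write_op = await create_writable(log)
    with write_op:
        await write_op.wait_for_completion()
    return log


combined_log = asyncio.run(combined_form())
split_log = asyncio.run(split_form())
print(combined_log == split_log, combined_log)
```

The choice between the two forms is stylistic; the split form simply makes the `await` easier to spot.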

docs/api/nixl_connect/connector.md (1)

108-136: Async factory method docs are aligned; consider clarifying await usage

Documenting create_readable and create_writable as async def correctly reflects their async factory behavior and the per-connection model. To make usage unambiguous for readers skimming this page, consider adding a brief note like “This is a coroutine; use as op = await connector.create_readable(...)” in each section.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 771d8c0 and 177e283.

📒 Files selected for processing (14)
  • components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py (1 hunks)
  • components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py (0 hunks)
  • components/src/dynamo/trtllm/encode_helper.py (1 hunks)
  • components/src/dynamo/trtllm/main.py (0 hunks)
  • components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py (1 hunks)
  • components/src/dynamo/vllm/multimodal_handlers/worker_handler.py (0 hunks)
  • docs/api/nixl_connect/connector.md (3 hunks)
  • docs/api/nixl_connect/read_operation.md (1 hunks)
  • docs/api/nixl_connect/readable_operation.md (1 hunks)
  • docs/api/nixl_connect/writable_operation.md (1 hunks)
  • docs/api/nixl_connect/write_operation.md (1 hunks)
  • examples/multimodal/components/encode_worker.py (1 hunks)
  • examples/multimodal/components/video_encode_worker.py (1 hunks)
  • lib/bindings/python/src/dynamo/nixl_connect/__init__.py (41 hunks)
💤 Files with no reviewable changes (3)
  • components/src/dynamo/trtllm/main.py
  • components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py
  • components/src/dynamo/vllm/multimodal_handlers/worker_handler.py
🧰 Additional context used
🧬 Code graph analysis (5)
examples/multimodal/components/encode_worker.py (1)
lib/bindings/python/src/dynamo/nixl_connect/__init__.py (1)
  • create_readable (747-761)
components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py (1)
lib/bindings/python/src/dynamo/nixl_connect/__init__.py (1)
  • create_readable (747-761)
components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py (1)
lib/bindings/python/src/dynamo/nixl_connect/__init__.py (1)
  • create_readable (747-761)
examples/multimodal/components/video_encode_worker.py (1)
lib/bindings/python/src/dynamo/nixl_connect/__init__.py (1)
  • create_readable (747-761)
components/src/dynamo/trtllm/encode_helper.py (1)
lib/bindings/python/src/dynamo/nixl_connect/__init__.py (1)
  • create_readable (747-761)
🪛 Ruff (0.14.5)
lib/bindings/python/src/dynamo/nixl_connect/__init__.py

525-527: Avoid specifying long messages outside the exception class

(TRY003)


529-529: Avoid specifying long messages outside the exception class

(TRY003)


531-531: Avoid specifying long messages outside the exception class

(TRY003)


1024-1026: Avoid specifying long messages outside the exception class

(TRY003)


1028-1028: Avoid specifying long messages outside the exception class

(TRY003)


1034-1037: Avoid specifying long messages outside the exception class

(TRY003)


1550-1552: Avoid specifying long messages outside the exception class

(TRY003)


1554-1554: Avoid specifying long messages outside the exception class

(TRY003)


1556-1556: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (23)
lib/bindings/python/src/dynamo/nixl_connect/__init__.py (14)

69-151: Connection-based AbstractOperation wiring looks correct

The move from Connector to per-operation Connection plus registering descriptors against self._connection is consistent and isolates state per connection. Argument validation and descriptor tuple creation still enforce the same invariants as before.


224-337: ActiveOperation correctly uses per-connection NIXL agent

Using remote.connection to seed the base class and then consistently calling self._connection._nixl.get_xfer_descs / initialize_xfer ensures the transfer handle is associated with the specific connection instance, which is what you want for concurrent operations. The validation of local/remote descriptors and sizes remains intact.


371-416: Release/cancel paths now route through the Connection’s NIXL agent

Releasing the transfer handle and cancelling via self._connection._nixl.release_xfer_handle(...) is consistent with the new per-connection model and should avoid cross-connection interference, assuming _connection is always set by construction (which it is via AbstractOperation).


468-484: Status polling against connection-local NIXL state is consistent

Switching both the initial transfer(...) call and subsequent check_xfer_state(...) to self._connection._nixl keeps status tracking scoped to the correct agent and avoids global connector races.


617-642: Connector hostname and connection counter plumbing look good

Storing self._hostname and exposing it via a hostname property is straightforward and correctly reflected in __repr__. The _connection_count counter is a simple way to generate unique Connection names per connector instance.


747-795: Connection factory is simple and aligns with the per-connection design

_create_connection using a monotonically increasing counter, constructing a new Connection, and awaiting its initialize method is a clean abstraction for the caller methods. The deprecated initialize on Connector being a no-op with a log message is a good compatibility story.
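
A minimal sketch of the factory pattern this comment describes is shown below. The class bodies, the `connection-N` naming scheme, and the wording of the deprecation message are assumptions for illustration; only the overall shape (a counter, a new `Connection` per call, an awaited `initialize`, and a no-op `Connector.initialize`) comes from the review text.

```python
import asyncio
import logging

logger = logging.getLogger("nixl_connect_sketch")


class Connection:
    """Stand-in connection with an idempotent async initialize hook."""

    def __init__(self, connector: "Connector", name: str) -> None:
        self.connector = connector
        self.name = name
        self.is_initialized = False

    async def initialize(self) -> None:
        # Idempotent: repeated calls leave the connection unchanged.
        self.is_initialized = True


class Connector:
    """Stand-in connector that creates one connection per call."""

    def __init__(self) -> None:
        self._connection_count = 0

    async def initialize(self) -> None:
        # Deprecated: kept so existing callers keep working; does nothing.
        logger.warning("Connector.initialize() is deprecated and has no effect.")

    async def _create_connection(self) -> Connection:
        # The counter only ever increases, so names are unique per connector.
        self._connection_count += 1
        connection = Connection(self, f"connection-{self._connection_count}")
        await connection.initialize()
        return connection


async def main() -> tuple[Connection, Connection]:
    connector = Connector()
    await connector.initialize()  # legacy call still succeeds
    first = await connector._create_connection()
    second = await connector._create_connection()
    return first, second


first, second = asyncio.run(main())
print(first.name, second.name)
```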


859-953: Descriptor destructor correctly deregisters via Connection

Tying deregister_memory to self._connection._nixl ensures memory is unregistered on the correct agent, and the extra guard on _connection is not None avoids dereferencing when a descriptor was never registered. This matches the new per-connection semantics.


1209-1233: PassiveOperation now correctly depends on Connection

Passing a Connection down into AbstractOperation and later using it for status/metadata lookups makes the passive side consistent with the active operations. The status initialization and serialization behavior are unchanged.


1275-1303: PassiveOperation.metadata correctly uses connection-level metadata

Using self._connection.metadata as the source of NIXL agent metadata, compressing with zlib, and then base64/hex encoding is coherent with the new Connection abstraction. The logging around compression ratio is also helpful for diagnosing metadata bloat.
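
The compress-then-encode step can be sketched with the standard library alone. This example assumes base64 as the text encoding (the comment mentions both base64 and hex, and the real code may differ); `encode_metadata` and `decode_metadata` are hypothetical helper names, and the payload is made-up placeholder bytes.

```python
import base64
import zlib


def encode_metadata(raw: bytes) -> str:
    """Compress raw agent metadata, then encode it as ASCII text."""
    compressed = zlib.compress(raw)
    return base64.b64encode(compressed).decode("ascii")


def decode_metadata(encoded: str) -> bytes:
    """Reverse encode_metadata: decode the text, then decompress."""
    return zlib.decompress(base64.b64decode(encoded))


raw = b"agent-metadata " * 64  # placeholder payload, highly repetitive
encoded = encode_metadata(raw)
compression_ratio = len(zlib.compress(raw)) / len(raw)

print(f"raw={len(raw)} bytes, encoded={len(encoded)} chars")
print(f"compression ratio: {compression_ratio:.3f}")
print(decode_metadata(encoded) == raw)
```

Logging a ratio like the one computed here is what makes metadata bloat visible, as the comment notes.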


1318-1346: Notification polling via connection-local NIXL agent is appropriate

Querying notifications through self._connection._nixl.update_notifs() and logging transitions with self._connection.name keeps status tracking tied to the specific connection/agent and improves observability when multiple connections are active.


1366-1415: ReadOperation now correctly takes a Connection, and Remote construction is sound

The updated signature __init__(self, connection: Connection, ...) with a type check plus instantiating Remote(connection, remote_metadata.nixl_metadata) is the right way to anchor the remote to a specific connection. The remaining validation and logging mirror the previous behavior.


1469-1475: ReadableOperation’s constructor correctly threads through Connection

Calling super().__init__(connection, OperationKind.READ, local_descriptors) cleanly reuses the shared passive logic while binding the operation to a particular connection instance.


1545-1579: Remote now correctly binds to a Connection’s NIXL agent

Validating connection as a Connection, storing it, and then calling connection._nixl.add_remote_agent(...) ensures each Remote is scoped to the connection’s agent. The fallback decoding logic for legacy metadata remains intact, and __repr__ including connection={self._connection.name} is useful for debugging.


1602-1619: Remote._release uses connection-local remove_remote_agent

Calling self._connection._nixl.remove_remote_agent(self._name) matches the new ownership model and should prevent stale remote registrations from lingering across operations.

components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py (2)

66-72: Connector initialization change is compatible with new no-op initialize()

Creating self._connector = connect.Connector() without awaiting initialize() matches the updated Connector API where initialize is deprecated and a no-op. Startup behavior stays simple while still supporting per-request Connection creation via create_readable.


125-144: Async readable creation pattern is correct and matches new API

The block:

descriptor = connect.Descriptor(embeddings_cpu)

with await self._connector.create_readable(descriptor) as readable:
    request.serialized_request = readable.metadata()
    ...
    await readable.wait_for_completion()

correctly:

  • moves embeddings to CPU (to avoid transport issues),
  • creates a Descriptor,
  • awaits the async factory create_readable(...) to get a ReadableOperation,
  • then uses the returned object as a synchronous context manager, and
  • waits for completion before leaving the context.

This is consistent with create_readable’s async signature and ReadableOperation implementing a sync context manager.

docs/api/nixl_connect/readable_operation.md (1)

33-48: ReadableOperation example correctly reflects async create_readable usage

The example:

with await self.connector.create_readable(descriptor) as read_op:
    op_metadata = read_op.metadata()
    ...
    await read_op.wait_for_completion()

matches the updated Connector.create_readable(...) async API and the fact that ReadableOperation is a synchronous context manager. This should be a good reference for callers adopting the new pattern.

components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py (2)

159-177: SGLang handler correctly adopts async readable creation pattern

This section:

descriptor = connect.Descriptor(precomputed_embeddings)

with await self._connector.create_readable(descriptor) as readable:
    request.serialized_request = readable.metadata()
    ...
    await readable.wait_for_completion()

is consistent with the new async create_readable API and the ReadableOperation synchronous context manager. It cleanly wires the precomputed embeddings into the downstream worker via NIXL metadata and waits for completion before returning.


182-188: Connector initialization update aligns with deprecated initialize()

Instantiating self._connector = connect.Connector() without awaiting initialize() matches the refactored Connector semantics (per-request connections and deprecated initialize). This simplifies startup for the handler while still enabling concurrent RDMA operations.

components/src/dynamo/trtllm/encode_helper.py (1)

242-261: Async create_readable usage is correct and lifecycle-safe

Awaiting connector.create_readable(descriptor) directly in the with statement and then awaiting readable_op.wait_for_completion() inside the block aligns with the new async API and ensures the readable operation (and its underlying connection) is properly cleaned up even if the async generator is cancelled.

examples/multimodal/components/encode_worker.py (1)

124-141: Per-request readable creation matches new async connector design

with await self._connector.create_readable(descriptor) as readable: correctly uses the async factory, scopes the readable/connection to the request, and ensures await readable.wait_for_completion() runs before responses are streamed. This fits the per-connection concurrency model the PR is introducing.

examples/multimodal/components/video_encode_worker.py (1)

153-178: Async readable usage for video path is consistent and correct

The switch to with await self._connector.create_readable(descriptor) as readable: mirrors the image worker, correctly awaits the async factory, and keeps the RDMA-readable operation alive until await readable.wait_for_completion() completes. No issues from a concurrency or lifecycle standpoint.

docs/api/nixl_connect/connector.md (1)

153-160: hostname property documentation looks good

The new hostname property is clearly described and fits well alongside the other connector properties; it gives users an easy way to introspect the worker host without exposing runtime internals.

@whoisj whoisj force-pushed the jwyman/nixl_connect/better_concurrency branch from ee8731c to adcf1f4 Compare November 18, 2025 17:38
Signed-off-by: J Wyman <jwyman@nvidia.com>
@whoisj whoisj force-pushed the jwyman/nixl_connect/better_concurrency branch from adcf1f4 to 2006adc Compare November 18, 2025 17:43
This change automatically deregisters registered Descriptor memory when an operation goes out of scope.
Deregistering the memory enables the descriptor to be reused with future operations.
This is a common design pattern, especially with CUDA device allocated memory.

Signed-off-by: J Wyman <jwyman@nvidia.com>
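
The reuse pattern described in this commit message can be sketched as follows. The `Descriptor` and `Operation` classes, the `registered_with` attribute, and the rule that a descriptor may be registered with only one connection at a time are all assumptions made for this example, not the real nixl_connect behavior.

```python
class Descriptor:
    """Stand-in descriptor that tracks which connection it is registered with."""

    def __init__(self, data: bytearray) -> None:
        self.data = data
        self.registered_with: str | None = None

    def register(self, connection_name: str) -> None:
        if self.registered_with is not None:
            raise RuntimeError("Descriptor is already registered.")
        self.registered_with = connection_name

    def deregister(self) -> None:
        self.registered_with = None


class Operation:
    """Stand-in operation that deregisters its descriptor when it exits."""

    def __init__(self, connection_name: str, descriptor: Descriptor) -> None:
        self.descriptor = descriptor
        descriptor.register(connection_name)

    def __enter__(self) -> "Operation":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Leaving scope frees the descriptor for the next operation.
        self.descriptor.deregister()


buffer = Descriptor(bytearray(16))
history = []

with Operation("connection-1", buffer):
    history.append(buffer.registered_with)

# Without the automatic deregistration, this second registration would raise.
with Operation("connection-2", buffer):
    history.append(buffer.registered_with)

history.append(buffer.registered_with)
print(history)
```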
@whoisj whoisj force-pushed the jwyman/nixl_connect/better_concurrency branch from 8c88c44 to d946178 Compare November 19, 2025 17:21
2 participants