Add vsock control plane: types, Firecracker support, host client, and facade integration #50
Conversation
📝 Walkthrough

Adds a vsock-based control plane: a new `VsockConfig` and `VMConfig` field, a host-side `VsockClient`, a Firecracker vsock API call, VM lifecycle hooks and helpers, facade methods for executing/health-checking via vsock, tests, and a planning document describing phased implementation.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Host as Host (SmolVM)
    participant Hypervisor as Hypervisor (QEMU / Firecracker)
    participant Guest as Guest Agent
    Host->>Hypervisor: Attach vsock device (FirecrackerClient.add_vsock / QEMU attach)
    Hypervisor->>Guest: Guest boots and starts vsock listener
    Note over Host: Later — user calls SmolVM.run_vsock()
    Host->>Host: resolve_vsock_config(vm_id)
    Host->>Host: lazy init VsockClient(guest_cid, guest_port)
    Host->>Guest: Connect over AF_VSOCK and send JSON request (command, token, env, cwd, timeout)
    Guest->>Guest: Execute command, capture exit_code/stdout/stderr
    Guest->>Host: Send JSON response (ok/result or ok/exit_code, stdout, stderr)
    Host->>Host: Parse response -> CommandResult and return
```
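The wire exchange in the diagram is a single newline-delimited JSON request followed by a single newline-delimited JSON response. A minimal sketch of that framing, shown over a `socketpair` stand-in since a real `AF_VSOCK` connection needs a running guest (the `send_request` helper and the literal response fields here are illustrative, not the SDK's exact API):

```python
import json
import socket

def send_request(sock: socket.socket, payload: dict) -> dict:
    """Send one JSON request line and read one JSON response line."""
    sock.sendall(json.dumps(payload).encode() + b"\n")
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # peer closed before sending a full line
            break
        chunks.append(chunk)
        if b"\n" in chunk:
            break
    line = b"".join(chunks).split(b"\n", 1)[0]
    return json.loads(line.decode())

# Demo: a socketpair stands in for the AF_VSOCK connection to the guest agent.
host, guest = socket.socketpair()
guest.sendall(b'{"ok": true, "exit_code": 0, "stdout": "hi", "stderr": ""}\n')
resp = send_request(host, {"op": "exec", "command": "echo hi", "timeout": 30})
print(resp["exit_code"])  # → 0
host.close()
guest.close()
```

In the real client the only change is the transport: `socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)` connected to `(guest_cid, guest_port)` instead of a socketpair.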
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 8
🧹 Nitpick comments (1)
src/smolvm/vm.py (1)
60-65: Add collision handling on derived guest CIDs.
`derive_vsock_guest_cid()` is deterministic, but it doesn't check host-level CID uniqueness. A low-probability hash collision would be operationally severe. Add reservation/check-and-retry against persisted VM state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/vm.py` around lines 60 - 65, derive_vsock_guest_cid(vm_id: str) currently returns a deterministic CID but never verifies that the CID is unused, which risks collisions; update derive_vsock_guest_cid to loop: compute the candidate CID as now, then check against the persisted VM registry/state (e.g. VMStore/VMRegistry methods that list or lookup assigned CIDs) and if it's already taken, re-derive by altering the input (e.g. append a retry counter or use next 4 bytes of the hash) and retry a bounded number of times; once an unused CID is found, persist/reserve it atomically (use the existing persistence/lock APIs for VM creation) and return it, and if retries are exhausted raise a clear exception so callers can handle the failure.
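The check-and-retry loop described in the prompt can be sketched as follows. This is a hedged illustration, not the project's code: `used_cids` stands in for the persisted VM registry the comment references, and a real implementation would reserve the CID atomically under the existing VM-creation lock.

```python
import hashlib

def derive_vsock_guest_cid(vm_id: str, used_cids: set[int], max_retries: int = 16) -> int:
    """Derive a stable guest CID from vm_id, re-deriving on collision.

    Guest CIDs 0-2 are reserved by the vsock spec and 0xFFFFFFFF is
    VMADDR_CID_ANY, so candidates are mapped into [3, 0xFFFFFFFE].
    """
    for attempt in range(max_retries):
        # Fold a retry counter into the hash input so each attempt re-derives.
        digest = hashlib.sha256(f"{vm_id}:{attempt}".encode()).digest()
        cid = 3 + int.from_bytes(digest[:4], "big") % (2**32 - 4)
        if cid not in used_cids:
            used_cids.add(cid)  # stand-in for an atomic reserve in persisted state
            return cid
    raise RuntimeError(f"could not derive a free guest CID for {vm_id!r}")
```

With no collision (attempt 0), the result is stable across calls, which preserves the deterministic behavior the existing `derive_vsock_guest_cid` already provides.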
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs-vsock-control-plane-implementation-plan.md`:
- Line 149: Replace the awkward parenthetical in the sentence "CI includes at
least one vsock-enabled integration lane (can be optional initially, required
before GA)." with the tightened phrasing "optional initially, but required
before GA" so the line reads: CI includes at least one vsock-enabled integration
lane (optional initially, but required before GA). Locate the sentence by
searching for the original parenthetical text or the phrase "vsock-enabled
integration lane" and update that exact line.
In `@src/smolvm/api.py`:
- Around line 247-254: The PUT request for configuring vsock uses the wrong URL;
update the call to self._request in the code that passes
vsock_id/guest_cid/uds_path so it targets "/vsock" (not f"/vsock/{vsock_id}"),
i.e., remove the path parameter and keep vsock_id only in the JSON payload sent
by the _request call (locate the block where self._request is invoked with
method "PUT" and the json containing vsock_id, guest_cid, uds_path).
In `@src/smolvm/facade.py`:
- Around line 324-325: Remove the duplicate assignment to self._vsock (the
second occurrence of "self._vsock = None") so only a single initialization
remains; locate the repeated lines where self._vsock is set (both occurrences
shown) and delete the redundant one to avoid the needless duplicate reset.
- Around line 426-433: The code should fail fast if the active backend does not
support vsock: inside the vsock acquisition in facade (where self._vsock is set
using self._sdk.get_vsock_config(self._vm_id) and VsockClient is constructed),
add an explicit capability check for the backend (e.g. query the SDK for
“backend supports vsock” for this VM or inspect a provided capability flag on
vsock_cfg) and raise SmolVMError when the backend cannot provide a vsock
transport instead of proceeding; apply the same guard to the second identical
block referenced (lines 443-450) so both places validate backend capability
before constructing VsockClient or calling self._vsock.run.
In `@src/smolvm/vm.py`:
- Around line 801-804: get_vsock_config currently always calls
resolve_vsock_config and returns an enabled-looking VsockConfig even for
backends that don't support vsock; change it to first inspect the VM's backend
(from vm_info.backend or vm_info["backend"]) and only call resolve_vsock_config
if that backend is in the set of vsock-capable backends (e.g., "firecracker",
"qemu", whatever backends your codebase treats as supporting vsock); otherwise
return a disabled VsockConfig (e.g., VsockConfig(enabled=False) or equivalent
sentinel) so callers won’t try to provision an unusable transport. Update
get_vsock_config to reference vm_info and resolve_vsock_config accordingly.
In `@src/smolvm/vsock.py`:
- Around line 75-93: The guest command timeout passed to run(...) can be longer
than the socket I/O timeout (self.timeout), causing premature socket-level
timeouts; update run(self, ... timeout=...) so the socket timeout used by
self._request is at least the command timeout (e.g. compute socket_timeout =
max(self.timeout, timeout) and either pass it into _request or temporarily set
self.timeout to that value before calling self._request, then restore if
mutated); reference run and self._request to locate the change and ensure the
I/O layer uses the larger timeout.
- Around line 41-45: The socket connect/send/recv block in the method that uses
socket.socket(AF_VSOCK) (the code that calls sock.connect((self.guest_cid,
self.guest_port)), sock.sendall(raw), and self._recv_line(sock)) must catch
OSError and socket.timeout and re-raise them as SmolVMError to keep SDK
exception consistency; wrap the entire with-socket block in a try/except that
captures OSError and socket.timeout (including exceptions from sock.connect,
sock.sendall, and self._recv_line) and raise SmolVMError(...) with a clear
message and the original exception attached or chained.
- Around line 59-70: _recv_line currently accumulates chunks until newline/EOF
with no size bound; add a maximum byte limit (e.g. MAX_LINE_BYTES constant) and
track total_bytes while reading in _recv_line, and if total_bytes exceeds the
limit raise an exception (ValueError or custom ProtocolError) to abort the read;
also ensure you stop reading and return/raise when the limit is hit before
attempting to decode, and document the limit where _recv_line is defined so
callers know lines over the cap will error.
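For the error-wrapping item above (lines 41-45), the shape of the fix is one try/except around the entire socket block that chains the original error. A sketch with the family/address parameterized so it runs outside a VM host (the SDK would pass `socket.AF_VSOCK` and `(guest_cid, guest_port)`; `SmolVMError` is redefined here as a stand-in for the SDK's exception type):

```python
import socket

class SmolVMError(Exception):
    """Stand-in for the SDK's exception type."""

def request_raw(family: int, address, raw: bytes, timeout: float = 5.0) -> bytes:
    """Connect, send one framed request, read one response chunk.

    socket.timeout is an OSError subclass, so a single except clause covers
    connect, sendall, and recv failures as well as I/O timeouts.
    """
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(address)
            sock.sendall(raw)
            return sock.recv(4096)
    except OSError as exc:
        # Re-raise as the SDK error with the original exception chained.
        raise SmolVMError(f"vsock request to {address} failed: {exc}") from exc

# Demo: a connect to a nonexistent AF_UNIX path surfaces as SmolVMError.
try:
    request_raw(socket.AF_UNIX, "/nonexistent/vsock-demo.sock", b"{}\n")
except SmolVMError as exc:
    print(type(exc.__cause__).__name__)
```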
---
Nitpick comments:
In `@src/smolvm/vm.py`:
- Around line 60-65: derive_vsock_guest_cid(vm_id: str) currently returns a
deterministic CID but never verifies that the CID is unused, which risks
collisions; update derive_vsock_guest_cid to loop: compute the candidate CID as
now, then check against the persisted VM registry/state (e.g. VMStore/VMRegistry
methods that list or lookup assigned CIDs) and if it's already taken, re-derive
by altering the input (e.g. append a retry counter or use next 4 bytes of the
hash) and retry a bounded number of times; once an unused CID is found,
persist/reserve it atomically (use the existing persistence/lock APIs for VM
creation) and return it, and if retries are exhausted raise a clear exception so
callers can handle the failure.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs-vsock-control-plane-implementation-plan.md
- src/smolvm/__init__.py
- src/smolvm/api.py
- src/smolvm/facade.py
- src/smolvm/types.py
- src/smolvm/vm.py
- src/smolvm/vsock.py
- tests/test_vsock.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 1
♻️ Duplicate comments (5)
src/smolvm/api.py (1)
247-250: ⚠️ Potential issue | 🔴 Critical: Use Firecracker's `PUT /vsock` endpoint (not `/vsock/{vsock_id}`).

Line 249 targets a path form Firecracker does not accept for vsock device configuration, which will fail at runtime. This also means the expectation in `tests/test_vsock.py` (Line 19) should be updated with this fix.

Suggested fix:

```diff
 self._request(
     "PUT",
-    f"/vsock/{vsock_id}",
+    "/vsock",
     json={
         "vsock_id": vsock_id,
         "guest_cid": guest_cid,
         "uds_path": str(uds_path),
     },
 )
```

In the official Firecracker API docs/source, what is the correct endpoint for configuring virtio-vsock: `PUT /vsock` or `PUT /vsock/{vsock_id}`?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/api.py` around lines 247 - 250, the PUT request for configuring virtio-vsock is using the wrong path; change the API call in the method that invokes self._request (the block that currently does self._request("PUT", f"/vsock/{vsock_id}", json=...)) to use "PUT" with the "/vsock" endpoint (remove the path parameter) and adjust the payload accordingly; also update the test assertion in tests/test_vsock.py (the expectation on Line 19) to expect a PUT to "/vsock" instead of "/vsock/{vsock_id}" so tests match the Firecracker API.

src/smolvm/facade.py (1)
324-325: ⚠️ Potential issue | 🟡 Minor: Remove duplicated `_vsock` reset in `stop()`.

Line 325 duplicates Line 324 and can be removed.

Suggested fix:

```diff
 self._ssh = None
 self._ssh_ready = False
 self._vsock = None
-self._vsock = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/facade.py` around lines 324 - 325, in the stop() method of the Facade there's a duplicated reset of the attribute self._vsock; remove the redundant second occurrence so self._vsock = None only appears once in stop(), leaving other teardown logic untouched.

src/smolvm/vsock.py (2)
35-44: ⚠️ Potential issue | 🟠 Major: Align socket I/O timeout with command timeout in `run()`.

Line 91 passes the guest execution timeout, but the socket timeout remains `self.timeout` (Line 43). Long-running commands can fail prematurely at the transport layer.

Suggested fix:

```diff
-    def _request(self, payload: dict[str, Any]) -> dict[str, Any]:
+    def _request(self, payload: dict[str, Any], *, io_timeout: float | None = None) -> dict[str, Any]:
 @@
-        sock.settimeout(self.timeout)
+        sock.settimeout(self.timeout if io_timeout is None else io_timeout)
 @@
-        response = self._request(
+        io_timeout = max(self.timeout, float(timeout) + 5.0)
+        response = self._request(
             {
                 "op": "exec",
                 "command": command,
                 "timeout": timeout,
                 "env": env or {},
                 "cwd": cwd,
                 "token": token,
-            }
+            },
+            io_timeout=io_timeout,
         )
```

Also applies to: 87-96
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/vsock.py` around lines 35 - 44, The socket timeout in _request currently uses self.timeout and can cause transport-layer timeouts for long-running commands invoked via run(); change _request to accept a timeout parameter (e.g., def _request(self, payload: dict[str, Any], timeout: float)) and use that value in sock.settimeout(...) instead of self.timeout, then update all callers (notably run()) to pass the guest execution timeout it already computes/passes to the guest (the same timeout used at the command execution site) so socket I/O aligns with command timeout.
62-73: ⚠️ Potential issue | 🟠 Major: Bound `_recv_line()` payload size to prevent memory blowups.

On Line 63+, the read loop accumulates unbounded bytes until newline/EOF. A misbehaving peer can force excessive memory growth.

Suggested fix:

```diff
+MAX_LINE_BYTES = 1024 * 1024  # 1 MiB
+
 @staticmethod
 def _recv_line(sock: socket.socket) -> str:
     chunks: list[bytes] = []
+    total = 0
     while True:
         chunk = sock.recv(4096)
         if not chunk:
             break
         chunks.append(chunk)
+        total += len(chunk)
+        if total > MAX_LINE_BYTES:
+            raise SmolVMError("Vsock response exceeded maximum size")
         if b"\n" in chunk:
             break
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/vsock.py` around lines 62 - 73, _recv_line currently accumulates bytes until newline/EOF with no limit; add a hard cap (e.g. MAX_LINE_BYTES) and track total_bytes read inside _recv_line, and if total_bytes exceeds that cap raise a specific exception (or return an error) to avoid memory blowups; when enforcing the cap, preserve semantics of splitting at the first b"\n" (use the same join/split logic) and apply the check after appending each chunk so a malicious peer can't force unlimited growth.

src/smolvm/vm.py (1)
801-804: ⚠️ Potential issue | 🟠 Major: Make `get_vsock_config()` backend-aware.

Lines 801-804 always return resolved/enabled-looking vsock settings. For backends that do not provision vsock, this causes callers to attempt an unusable transport instead of failing early.

Suggested fix:

```diff
 def get_vsock_config(self, vm_id: str) -> VsockConfig:
     """Return resolved vsock settings for a VM."""
     vm_info = self.get(vm_id)
+    backend = self._backend_for_vm(vm_info)
+    if backend != BACKEND_FIRECRACKER:
+        return vm_info.config.vsock.model_copy(update={"enabled": False})
     return resolve_vsock_config(vm_info, self.data_dir)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/vm.py` around lines 801 - 804, get_vsock_config currently always calls resolve_vsock_config which yields an enabled-looking VsockConfig even for backends that don't provision vsock; update get_vsock_config(vm_id: str) to inspect the VM's backend (vm_info.backend or vm_info.get("backend")) and only call resolve_vsock_config when that backend is known to support vsock; for non-vsock backends return an explicit disabled value (e.g., a VsockConfig with enabled=False or None per the project's convention) so callers won't attempt unusable vsock transports; reference get_vsock_config, resolve_vsock_config and the Vm info's backend field in your change.
🧹 Nitpick comments (2)
src/smolvm/vm.py (1)
648-654: Create parent directories for a custom `uds_path` before `add_vsock()`.

If a user sets `vsock_cfg.uds_path` under a non-existent directory, attach fails even though the config is otherwise valid.

Suggested fix:

```diff
 assert vsock_cfg.guest_cid is not None
 assert vsock_cfg.uds_path is not None
+vsock_cfg.uds_path.parent.mkdir(parents=True, exist_ok=True)
 if vsock_cfg.uds_path.exists():
     self._unlink_socket(vsock_cfg.uds_path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/vm.py` around lines 648 - 654, the code currently assumes the parent directory for vsock_cfg.uds_path exists before calling client.add_vsock (in the block around self._unlink_socket and client.add_vsock), causing attach to fail for custom paths; fix by ensuring the parent directories for vsock_cfg.uds_path are created (mkdir parents=True, exist_ok=True) before attempting to unlink or call client.add_vsock, and handle potential filesystem errors (raise/log) so failure is clear; update the logic around self._unlink_socket and client.add_vsock to create the directory from vsock_cfg.uds_path.parent first.

src/smolvm/facade.py (1)
426-432: Extract shared vsock-client initialization into one helper.
`run_vsock()` and `vsock_health()` duplicate the same lazy-init/config checks, which increases drift risk.

Suggested refactor direction:

```diff
+    def _ensure_vsock(self) -> VsockClient:
+        if self._vsock is None:
+            vsock_cfg = self._sdk.get_vsock_config(self._vm_id)
+            if not vsock_cfg.enabled:
+                raise SmolVMError("Vsock control-plane is disabled for this VM")
+            assert vsock_cfg.guest_cid is not None
+            self._vsock = VsockClient(vsock_cfg.guest_cid, vsock_cfg.guest_port)
+        return self._vsock
 @@
-        if self._vsock is None:
-            ...
-        return self._vsock.run(...)
+        return self._ensure_vsock().run(...)
 @@
-        if self._vsock is None:
-            ...
-        return self._vsock.health()
+        return self._ensure_vsock().health()
```

Also applies to: 443-449
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/facade.py` around lines 426 - 432, Extract the duplicated lazy-init and config validation into a single helper (e.g., a private method like _ensure_vsock) and call it from run_vsock() and vsock_health(); the helper should check self._vsock, call self._sdk.get_vsock_config(self._vm_id), raise SmolVMError if vsock_cfg.enabled is False, assert guest_cid is not None, and set self._vsock = VsockClient(vsock_cfg.guest_cid, vsock_cfg.guest_port) so both methods reuse the same initialization path and remove the duplicated code blocks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/smolvm/types.py`:
- Around line 43-44: The guest_cid field currently allows reserved/invalid vsock
CIDs and should be validated at parse time: change the declaration of guest_cid
to use an Annotated int with a Field constraint limiting it to the valid guest
CID range (e.g., ge=3 and an appropriate upper bound such as le=4294967294 or
similar for 32‑bit CID space) and keep it nullable (None default) so invalid
values like 0,1,2 are rejected by pydantic during parsing; update the guest_cid
declaration (symbol: guest_cid) to Annotated[int, Field(ge=3, le=<upper>)] |
None = None to enforce this constraint.
---
Duplicate comments:
In `@src/smolvm/api.py`:
- Around line 247-250: The PUT request for configuring virtio-vsock is using the
wrong path; change the API call in the method that invokes self._request (the
block that currently does self._request("PUT", f"/vsock/{vsock_id}", json=...))
to use "PUT" with the "/vsock" endpoint (remove the path parameter) and adjust
the payload accordingly; also update the test assertion in tests/test_vsock.py
(the expectation on Line 19) to expect a PUT to "/vsock" instead of
"/vsock/{vsock_id}" so tests match the Firecracker API.
In `@src/smolvm/facade.py`:
- Around line 324-325: In the stop() method of the Facade (src/smolvm/facade.py)
there's a duplicated reset of the attribute self._vsock; remove the redundant
second occurrence so self._vsock = None only appears once in stop(), leaving
other teardown logic untouched.
In `@src/smolvm/vm.py`:
- Around line 801-804: get_vsock_config currently always calls
resolve_vsock_config which yields an enabled-looking VsockConfig even for
backends that don't provision vsock; update get_vsock_config(vm_id: str) to
inspect the VM's backend (vm_info.backend or vm_info.get("backend")) and only
call resolve_vsock_config when that backend is known to support vsock; for
non-vsock backends return an explicit disabled value (e.g., a VsockConfig with
enabled=False or None per the project's convention) so callers won't attempt
unusable vsock transports; reference get_vsock_config, resolve_vsock_config and
the Vm info's backend field in your change.
In `@src/smolvm/vsock.py`:
- Around line 35-44: The socket timeout in _request currently uses self.timeout
and can cause transport-layer timeouts for long-running commands invoked via
run(); change _request to accept a timeout parameter (e.g., def _request(self,
payload: dict[str, Any], timeout: float)) and use that value in
sock.settimeout(...) instead of self.timeout, then update all callers (notably
run()) to pass the guest execution timeout it already computes/passes to the
guest (the same timeout used at the command execution site) so socket I/O aligns
with command timeout.
- Around line 62-73: _recv_line currently accumulates bytes until newline/EOF
with no limit; add a hard cap (e.g. MAX_LINE_BYTES) and track total_bytes read
inside _recv_line, and if total_bytes exceeds that cap raise a specific
exception (or return an error) to avoid memory blowups; when enforcing the cap,
preserve semantics of splitting at the first b"\n" (use the same join/split
logic) and apply the check after appending each chunk so a malicious peer can't
force unlimited growth.
---
Nitpick comments:
In `@src/smolvm/facade.py`:
- Around line 426-432: Extract the duplicated lazy-init and config validation
into a single helper (e.g., a private method like _ensure_vsock) and call it
from run_vsock() and vsock_health(); the helper should check self._vsock, call
self._sdk.get_vsock_config(self._vm_id), raise SmolVMError if vsock_cfg.enabled
is False, assert guest_cid is not None, and set self._vsock =
VsockClient(vsock_cfg.guest_cid, vsock_cfg.guest_port) so both methods reuse the
same initialization path and remove the duplicated code blocks.
In `@src/smolvm/vm.py`:
- Around line 648-654: The code currently assumes the parent directory for
vsock_cfg.uds_path exists before calling client.add_vsock (in the block around
self._unlink_socket and client.add_vsock), causing attach to fail for custom
paths; fix by ensuring the parent directories for vsock_cfg.uds_path are created
(mkdir parents=True, exist_ok=True) before attempting to unlink or call
client.add_vsock, and handle potential filesystem errors (raise/log) so failure
is clear; update the logic around self._unlink_socket and client.add_vsock to
create the directory from vsock_cfg.uds_path.parent first.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs-vsock-control-plane-implementation-plan.md
- src/smolvm/__init__.py
- src/smolvm/api.py
- src/smolvm/facade.py
- src/smolvm/types.py
- src/smolvm/vm.py
- src/smolvm/vsock.py
- tests/test_vsock.py
@copilot please resolve the conflict in this PR
@aniketmaurya I've opened a new pull request, #55, to work on those changes. Once the pull request is ready, I'll request review from you.
* Initial plan
* Merge main branch changes: extra_drives, rate_limit_mbps, remove_egress_rules, and other upstream updates

Co-authored-by: aniketmaurya <21018714+aniketmaurya@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Motivation
Description
- Add `VsockClient` in `src/smolvm/vsock.py` that implements a simple framed JSON request/response protocol and `health`/`exec` operations.
- Add `VsockConfig` to the public types in `src/smolvm/types.py` and wire it into `VMConfig` with sensible defaults (guest port `5000`, enabled by default).
- Add `FirecrackerClient.add_vsock(...)` in `src/smolvm/api.py` and wire vsock provisioning into the VM startup path in `src/smolvm/vm.py`, including stable guest CID derivation via `derive_vsock_guest_cid` and resolution via `resolve_vsock_config`.
- Extend `src/smolvm/facade.py` with `run_vsock(...)` and `vsock_health()` helpers, and export `VsockClient`/`VsockConfig` from the package `__init__`.

Testing
- Add tests in `tests/test_vsock.py` that cover the `FirecrackerClient.add_vsock` call, `derive_vsock_guest_cid` stability/non-reserved behavior, and `resolve_vsock_config` defaults.
- Ran `pytest tests/test_vsock.py` and the new tests succeeded.

Codex Task