186 changes: 186 additions & 0 deletions docs-vsock-control-plane-implementation-plan.md
@@ -0,0 +1,186 @@
# Vsock Control Plane Implementation Plan

This document proposes an implementation plan for adding a **vsock-based control plane** to SmolVM for automated in-guest command execution, streaming logs, health/metadata, and optional artifact transfer.

## 1) Current codebase fit

### Existing components to build on
- `SmolVMManager.start()` already configures Firecracker boot source, machine config, rootfs drive, and one NIC through `FirecrackerClient`. The vsock device should be wired in this same startup path for Firecracker-backed VMs.
- `FirecrackerClient` in `src/smolvm/api.py` provides clear, typed wrappers around API resources (`/boot-source`, `/machine-config`, `/drives`, `/network-interfaces`, `/actions`). Add a sibling method for vsock devices.
- `SmolVM` facade in `src/smolvm/facade.py` exposes the user-facing execution API (`run`) and is the right place to add transport selection (`ssh` vs `vsock`) while preserving backwards compatibility.
- `ImageBuilder` in `src/smolvm/build.py` already assembles guest images and installs `/init`; this is where the in-guest agent binary/service can be included and started.
- Pydantic types in `src/smolvm/types.py` (`VMConfig`, `VMInfo`, `CommandResult`) are central extension points for vsock config, per-VM auth secret metadata references, and richer command result payloads.

### Gaps today
- No Firecracker API wrapper for adding a vsock device.
- No in-guest control-plane daemon.
- No host-side vsock protocol/client.
- Execution path is SSH-centric and assumes guest networking.

## 2) Scope and rollout strategy

### Phase A (MVP, required)
1. Firecracker vsock device provisioning.
2. In-guest `vsock-agent` with auth + command execution + hard limits.
3. Host-side `vsock-client` supporting:
   - `exec` (request/response)
   - `stream` (incremental stdout/stderr)
   - `cancel`
   - `health`, `system_info`, `shutdown`
4. Facade integration (`SmolVM.run(..., transport="vsock")` or automatic fallback preference).

### Phase B (recommended)
1. File transfer (`put_file`, `get_file`) with size caps and checksum validation.
2. Job registry persistence in guest for reconnect/retry semantics.
3. CLI subcommands for explicit vsock operations.

### Phase C (hardening)
1. End-to-end metrics + tracing.
2. Protocol/version negotiation.
3. Strict compatibility tests across kernel/image variants.

## 3) Detailed implementation plan

### Step 1 — Data model and state extensions
**Files:** `src/smolvm/types.py`, `src/smolvm/storage.py` (if schema migration needed), `src/smolvm/vm.py`

- Extend `VMConfig` with optional `vsock` config:
  - `enabled: bool = True` (for Firecracker)
  - `guest_cid: int | None = None` (resolved at VM start; must be 3 or greater when set)
  - `guest_port: int = 5000`
  - `uds_path: Path | None` (host-side Firecracker vsock UDS; default under data dir)
- Extend runtime info with resolved vsock connection details (e.g., host UDS path + guest port).
- Generate a **per-VM secret/token** at VM creation time, store it in the host state DB, and inject it into the guest image/bootstrap location with strict permissions.

**Acceptance criteria:**
- Creating a VM produces deterministic vsock metadata for Firecracker VMs.
- Token material is never logged in plaintext.
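
The token handling in Step 1 could look roughly like the sketch below. The function names (`generate_vm_token`, `write_token_file`, `token_fingerprint`) are hypothetical and not part of SmolVM; the sketch only assumes the standard library. It creates the token file with mode `0600` and logs a short hash fingerprint instead of the token itself.

```python
import hashlib
import os
import secrets
from pathlib import Path


def generate_vm_token(nbytes: int = 32) -> str:
    """Return a URL-safe random token carrying `nbytes` bytes of entropy."""
    return secrets.token_urlsafe(nbytes)


def write_token_file(path: Path, token: str) -> None:
    """Create `path` with owner-only permissions and write the token to it."""
    # O_EXCL refuses to overwrite an existing file; 0o600 limits access to the owner.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(token)


def token_fingerprint(token: str) -> str:
    """Return a short, non-reversible identifier that is safe to log."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:8]
```

Logging the fingerprint rather than the token keeps log lines correlatable across host and guest without exposing the secret.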

### Step 2 — Firecracker API support for vsock
**Files:** `src/smolvm/api.py`, `src/smolvm/vm.py`, `tests/test_api.py` (new or extend relevant tests)

- Add `FirecrackerClient.add_vsock(vsock_id: str, guest_cid: int, uds_path: Path)` wrapper targeting Firecracker `/vsock` resource.
- In `SmolVMManager.start()`, after machine config/drive/network setup and before `start_instance()`, call `add_vsock(...)` for Firecracker backend.
- Derive a stable `guest_cid` from VM identity (3 or greater, since CIDs 0-2 are reserved; collision-resistant within the host) and keep the mapping in state.

**Acceptance criteria:**
- Firecracker boot path invokes vsock setup exactly once.
- Startup fails fast with actionable error if vsock setup fails.
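
One possible way to derive a stable CID, shown as a sketch only: hash the VM ID into the usable CID range and probe linearly past any CID already reserved in host state. `allocate_guest_cid` and the `in_use` set are illustrative assumptions; the plan does not prescribe a particular scheme.

```python
import hashlib

MIN_GUEST_CID = 3           # CIDs 0-2 are reserved (hypervisor, local, host)
MAX_GUEST_CID = 2**32 - 2   # 0xFFFFFFFF is the "any" wildcard CID


def allocate_guest_cid(vm_id: str, in_use: set[int]) -> int:
    """Map a VM ID to a stable guest CID, skipping CIDs that are already taken."""
    span = MAX_GUEST_CID - MIN_GUEST_CID + 1
    digest = hashlib.sha256(vm_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % span
    # Linear probing: the same vm_id always starts at the same slot, and a
    # collision moves on to the next free CID.
    for offset in range(span):
        cid = MIN_GUEST_CID + (start + offset) % span
        if cid not in in_use:
            return cid
    raise RuntimeError("no free guest CID available")
```

The caller would persist the returned CID in VM state so that later starts reuse the reservation instead of re-deriving it.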

### Step 3 — Guest agent packaging and boot integration
**Files:** `src/smolvm/build.py`, new guest assets directory (e.g., `src/smolvm/guest_assets/`), relevant tests

- Package a small `vsock-agent` in the guest image build flow:
  - Start as non-root user.
  - Bind fixed guest vsock port (default 5000).
  - Read the token from a root-owned file with minimal permissions (for example, readable only by the agent's group, since the agent itself runs non-root).
- Update generated `/init` logic (or systemd unit when applicable) to launch and supervise the agent.
- Ensure agent startup does not block SSH startup for compatibility.

**Acceptance criteria:**
- Fresh image boots with both SSH (legacy) and agent (new) available.
- If the agent process exits, it is either restarted by its supervisor or the failure is reported clearly in logs.

### Step 4 — Protocol definition and host client
**Files:** new module(s) `src/smolvm/vsock_protocol.py`, `src/smolvm/vsock_client.py`, tests

- Define a framed protocol (length-prefixed JSON envelopes + binary chunks where needed):
  - `auth`, `exec_start`, `exec_stream`, `exec_result`, `cancel`, `health`, `system_info`, `shutdown`, `put_file`, `get_file`.
- Add protocol version field and explicit error codes.
- Implement host client with:
  - Connection/retry handling.
  - Request timeout and max message/body constraints.
  - Incremental streaming callback/iterator API.
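
For connection handling, note that Firecracker exposes guest vsock ports to the host through the Unix domain socket given as `uds_path`: the host connects to that socket, sends `CONNECT <port>\n`, and waits for an `OK <host_port>\n` acknowledgement before the stream becomes a raw channel to the guest listener. A minimal sketch of that handshake follows; `connect_guest_port` is a hypothetical helper name, not an existing SmolVM function.

```python
import socket


def connect_guest_port(uds_path: str, guest_port: int, timeout_s: float = 5.0) -> socket.socket:
    """Open a stream to `guest_port` through a Firecracker vsock Unix socket."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.settimeout(timeout_s)
        sock.connect(uds_path)
        sock.sendall(f"CONNECT {guest_port}\n".encode("ascii"))
        # Read the acknowledgement one byte at a time so that no payload
        # bytes following the handshake line are consumed by accident.
        reply = bytearray()
        while not reply.endswith(b"\n"):
            byte = sock.recv(1)
            if not byte:
                raise ConnectionError("vsock socket closed before acknowledging CONNECT")
            reply += byte
            if len(reply) > 64:
                raise ConnectionError("oversized handshake reply")
        if not reply.startswith(b"OK "):
            raise ConnectionError(f"unexpected handshake reply: {bytes(reply)!r}")
        return sock
    except BaseException:
        sock.close()
        raise
```

If no guest process is listening on the port yet, the connection is closed without an `OK` line, which the sketch surfaces as `ConnectionError` so the caller can retry.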

**Acceptance criteria:**
- Client can execute short and long-running commands through vsock.
- Stream ordering for stdout/stderr is preserved by sequence numbers.
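
The length-prefixed JSON framing described in this step can be sketched as follows. The 4-byte big-endian length header, the 1 MiB cap, and the names `encode_frame`, `read_frame`, and `FrameError` are assumptions made for illustration; the plan does not fix a wire format.

```python
import json
import struct
from typing import Any, BinaryIO

MAX_FRAME_BYTES = 1 << 20         # assumed cap: 1 MiB per JSON envelope
_HEADER = struct.Struct(">I")     # 4-byte big-endian unsigned body length


class FrameError(Exception):
    """Raised when a frame is oversized, truncated, or not a JSON object."""


def encode_frame(message: dict[str, Any]) -> bytes:
    """Serialize a message as a length header followed by a JSON body."""
    body = json.dumps(message, separators=(",", ":")).encode("utf-8")
    if len(body) > MAX_FRAME_BYTES:
        raise FrameError(f"frame of {len(body)} bytes exceeds the limit")
    return _HEADER.pack(len(body)) + body


def _read_exact(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes, or raise if the stream ends first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise FrameError("stream closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def read_frame(stream: BinaryIO) -> dict[str, Any]:
    """Read one frame, checking the declared length before allocating for it."""
    (length,) = _HEADER.unpack(_read_exact(stream, _HEADER.size))
    if length > MAX_FRAME_BYTES:
        raise FrameError(f"declared length {length} exceeds the limit")
    message = json.loads(_read_exact(stream, length))
    if not isinstance(message, dict):
        raise FrameError("frame body is not a JSON object")
    return message
```

With a connected socket, `sock.makefile("rb")` yields a stream suitable for `read_frame`, and `sock.sendall(encode_frame(...))` sends a request.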

### Step 5 — Guest agent command execution + policy controls
**Files:** guest agent source, tests (unit + integration)

- Enforce security controls per request:
  - Required auth token on every call.
  - Max runtime, max stdout/stderr bytes, max concurrent jobs.
  - Optional cwd/env allowlist policy.
- Implement job registry:
  - `job_id` lifecycle (`queued/running/completed/failed/cancelled`).
  - Cancellation via process group signaling.
- Return structured result: `exit_code`, `stdout`, `stderr`, `duration_ms`, truncation flags.

**Acceptance criteria:**
- Invalid token is denied consistently.
- Limits are enforced and surfaced as structured errors.
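
A rough sketch of the per-request controls above, under stated assumptions: POSIX only, hypothetical names (`token_is_valid`, `run_bounded`, `ExecResult`), and output truncated after collection rather than while reading. A real agent would need to cap output as it streams to keep memory bounded; this only illustrates constant-time token comparison, a runtime limit, and process-group cancellation.

```python
import hmac
import os
import signal
import subprocess
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecResult:
    exit_code: Optional[int]
    stdout: bytes
    stderr: bytes
    duration_ms: int
    timed_out: bool
    stdout_truncated: bool
    stderr_truncated: bool


def token_is_valid(presented: str, expected: str) -> bool:
    """Compare tokens in constant time to avoid leaking a prefix match via timing."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def run_bounded(argv: list[str], timeout_s: float, max_output_bytes: int) -> ExecResult:
    """Run argv in its own process group with a runtime limit and an output cap."""
    started = time.monotonic()
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,  # new session, so the child's PID is also its PGID
    )
    timed_out = False
    try:
        out, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        timed_out = True
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # kill the child and its descendants
        except ProcessLookupError:
            pass  # the group exited between the timeout and the signal
        out, err = proc.communicate()
    return ExecResult(
        exit_code=proc.returncode,
        stdout=out[:max_output_bytes],
        stderr=err[:max_output_bytes],
        duration_ms=int((time.monotonic() - started) * 1000),
        timed_out=timed_out,
        stdout_truncated=len(out) > max_output_bytes,
        stderr_truncated=len(err) > max_output_bytes,
    )
```

The same `os.killpg` call would serve an explicit `cancel` request, given a job registry that maps each `job_id` to its process group.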

### Step 6 — Facade + CLI integration
**Files:** `src/smolvm/facade.py`, `src/smolvm/cli.py`, `src/smolvm/__init__.py`, docs/tests

- Add facade methods:
  - `run_vsock(...)`
  - `stream_vsock(...)`
  - Optional `put_file` / `get_file`.
- Update `run()` default policy:
  - Prefer vsock when agent is healthy.
  - Optional fallback to SSH (feature flag/configurable).
- Add CLI commands (examples):
  - `smolvm exec --transport vsock ...`
  - `smolvm stream ...`
  - `smolvm cp put|get ...`

**Acceptance criteria:**
- Existing SSH flows remain backward compatible.
- New commands documented and discoverable in `--help`.

### Step 7 — Testing matrix and CI gating
**Files:** `tests/test_vm.py`, `tests/test_facade.py`, new vsock-specific tests

- Unit tests:
  - Firecracker vsock API payload correctness.
  - Protocol encode/decode and error handling.
- Integration tests:
  - End-to-end exec over vsock on Firecracker.
  - Streaming + cancel behavior.
  - Token auth rejection.
  - Output/file size cap enforcement.
- Regression tests:
  - SSH execution remains functional.

**Acceptance criteria:**
- `pytest` passes on existing and new tests.
- CI includes at least one vsock-enabled integration lane (can be optional initially, required before GA).

## 4) Security and reliability checklist

- Per-VM token generated on host, rotated on VM recreation.
- No unauthenticated endpoint exposed on guest vsock port.
- Agent runs non-root and uses bounded subprocess execution.
- Bounded memory usage for stream buffering.
- Backpressure-aware streaming to avoid host or guest OOM.
- Protocol versioning and feature negotiation to support rolling upgrades.

## 5) Suggested work breakdown (tickets)

1. **Core plumbing**: Firecracker vsock API + VM state wiring.
2. **Protocol MVP**: framed messages + `health` + `exec` + result.
3. **Guest agent MVP**: auth + execute + limits.
4. **Facade integration**: `run_vsock` and transport selection.
5. **Streaming and cancel**.
6. **File transfer**.
7. **Hardening & observability**.

## 6) Risks and mitigations

- **Risk:** CID collisions across concurrent VMs.
  **Mitigation:** deterministic allocation + state-backed reservation + collision retry.
- **Risk:** Agent availability race at boot.
  **Mitigation:** client-side health probe with bounded retries before first exec.
- **Risk:** Protocol bloat and compatibility drift.
  **Mitigation:** strict schema + version field + compatibility tests.
- **Risk:** Security regressions from fallback behavior.
  **Mitigation:** explicit config to disable SSH fallback in hardened environments.
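
The bounded health probe mentioned for the boot race could be sketched like this. `wait_until_healthy` and its parameters are hypothetical; `probe` stands in for whatever call performs the agent's `health` request and returns `True` on success.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 10,
    initial_delay_s: float = 0.05,
    max_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times with capped exponential backoff."""
    delay = initial_delay_s
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused/reset while the agent is still starting
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
    return False
```

Bounding both the attempt count and the per-attempt delay keeps the first `exec` from hanging indefinitely when the agent never comes up.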

## 7) Definition of done (feature level)

- Firecracker-backed VM can execute commands and stream output via vsock with per-request auth and enforced limits.
- Host SDK/facade exposes stable APIs for exec, stream, cancel, health, and shutdown.
- Optional artifact transfer is available with bounded sizes and integrity checks.
- Documentation includes migration guidance and transport selection behavior.
7 changes: 5 additions & 2 deletions src/smolvm/__init__.py
@@ -32,8 +32,9 @@
from smolvm.host import HostManager
from smolvm.images import ImageManager, ImageSource, LocalImage
from smolvm.ssh import SSHClient
from smolvm.types import CommandResult, NetworkConfig, VMConfig, VMInfo, VMState
from smolvm.types import CommandResult, NetworkConfig, VMConfig, VMInfo, VMState, VsockConfig
from smolvm.vm import SmolVMManager
from smolvm.vsock import VsockClient

__version__ = "0.0.5"

@@ -49,14 +50,16 @@
"LocalImage",
# Host setup
"HostManager",
# SSH
# SSH / vsock
"SSHClient",
"VsockClient",
# Data models
"VMConfig",
"VMInfo",
"VMState",
"NetworkConfig",
"CommandResult",
"VsockConfig",
# Exceptions
"SmolVMError",
"CommandExecutionUnavailableError",
19 changes: 19 additions & 0 deletions src/smolvm/api.py
@@ -255,6 +255,25 @@ def add_network_interface(
)
logger.debug("Network interface added: %s -> %s", iface_id, host_dev_name)


    def add_vsock(
        self,
        vsock_id: str,
        guest_cid: int,
        uds_path: Path,
    ) -> None:
        """Configure the VM's vsock device for host/guest control-plane traffic."""
        self._request(
            "PUT",
            "/vsock",  # singleton resource; `vsock_id` is only a deprecated body field
            json={
                "vsock_id": vsock_id,
                "guest_cid": guest_cid,
                "uds_path": str(uds_path),
            },
        )
        logger.debug("Vsock device added: %s (cid=%d)", vsock_id, guest_cid)

def start_instance(self) -> None:
"""Start the microVM instance."""
self._request(
49 changes: 49 additions & 0 deletions src/smolvm/facade.py
@@ -46,6 +46,7 @@
SmolVMError,
)
from smolvm.ssh import SSHClient
from smolvm.types import CommandResult, VMConfig, VMInfo, VMState
from smolvm.vm import SmolVMManager
from smolvm.vsock import VsockClient

@@ -202,6 +203,7 @@ def __init__(
self._owns_vm = False

self._ssh: SSHClient | None = None
self._vsock: VsockClient | None = None
self._ssh_ready = False
self._local_forwards: dict[tuple[int, int], _LocalForward] = {}

@@ -319,6 +321,7 @@ def stop(self, timeout: float = 3.0) -> SmolVM:
self._ssh.close()
self._ssh = None
self._ssh_ready = False
self._vsock = None
logger.info("VM %s stopped", self._vm_id)
return self

@@ -330,6 +334,7 @@ def delete(self) -> None:
self._ssh.close()
self._ssh = None
self._ssh_ready = False
self._vsock = None
logger.info("VM %s deleted", self._vm_id)

# ------------------------------------------------------------------
@@ -400,6 +405,50 @@ def run(

return self._ssh.run(command, timeout=timeout, shell=shell)

    def run_vsock(
        self,
        command: str,
        timeout: int = 30,
        *,
        env: dict[str, str] | None = None,
        cwd: str | None = None,
        token: str | None = None,
    ) -> CommandResult:
        """Execute a command through the vsock control-plane."""
        self._refresh_info()

        if self._info.status != VMState.RUNNING:
            raise SmolVMError(
                f"Cannot run command: VM is {self._info.status.value}",
                {"vm_id": self._vm_id},
            )

        if self._vsock is None:
            vsock_cfg = self._sdk.get_vsock_config(self._vm_id)
            # Explicit check instead of `assert`, which is stripped under `python -O`.
            if not vsock_cfg.enabled or vsock_cfg.guest_cid is None:
                raise SmolVMError("Vsock control-plane is disabled or unconfigured for this VM")
            self._vsock = VsockClient(vsock_cfg.guest_cid, vsock_cfg.guest_port)

        return self._vsock.run(command, timeout=timeout, env=env, cwd=cwd, token=token)

    def vsock_health(self) -> dict[str, Any]:
        """Query health from the in-guest vsock agent."""
        self._refresh_info()
        if self._info.status != VMState.RUNNING:
            raise SmolVMError(
                f"Cannot query vsock health: VM is {self._info.status.value}",
                {"vm_id": self._vm_id},
            )
        if self._vsock is None:
            vsock_cfg = self._sdk.get_vsock_config(self._vm_id)
            # Explicit check instead of `assert`, which is stripped under `python -O`.
            if not vsock_cfg.enabled or vsock_cfg.guest_cid is None:
                raise SmolVMError("Vsock control-plane is disabled or unconfigured for this VM")
            self._vsock = VsockClient(vsock_cfg.guest_cid, vsock_cfg.guest_port)

        return self._vsock.health()

def set_env_vars(self, env_vars: dict[str, str], *, merge: bool = True) -> list[str]:
"""Set environment variables on a running VM.

13 changes: 13 additions & 0 deletions src/smolvm/types.py
@@ -36,6 +36,17 @@ def _generate_vm_id() -> str:
return f"vm-{uuid4().hex[:8]}"


class VsockConfig(BaseModel):
    """Vsock control-plane configuration."""

    enabled: bool = True
    guest_cid: Annotated[int, Field(ge=3)] | None = None  # CIDs 0-2 are reserved
    guest_port: Annotated[int, Field(ge=1, le=65535)] = 5000
    uds_path: Path | None = None

    model_config = {"frozen": True}


class VMConfig(BaseModel):
"""Configuration for creating a microVM.

@@ -56,6 +67,7 @@ class VMConfig(BaseModel):
create with the same VM ID can reuse prior state.
env_vars: Environment variables to inject into the guest
after boot via SSH. Keys must be valid shell identifiers.
vsock: Vsock control-plane settings.
"""

vm_id: Annotated[
@@ -76,6 +88,7 @@ class VMConfig(BaseModel):
retain_disk_on_delete: bool = False
env_vars: dict[str, str] = {}
network_rate_limit_mbps: Annotated[int, Field(ge=1)] | None = None
vsock: VsockConfig = VsockConfig()

@field_validator("vm_id", mode="before")
@classmethod