feat: add benchmark suite with density ramp and common I/O benchmarks #46
craigpfeifer wants to merge 21 commits into CelestoAI:main
Conversation
Adds a VM density ramp benchmark that boots VMs until failure, measuring boot latency, host resource consumption (CPU, memory, disk, TAP devices, Firecracker RSS), and sustain health checks across all VMs concurrently. Includes README and v0 benchmark/review reports in benchmarks/docs/. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When building the ext4 rootfs via _create_ext4_with_docker (used when the loopfs helper is absent, e.g. on a bare pip install on EC2), Python's tar.extractall() runs as a non-root user and cannot chown files to uid=0. This left /var/empty and /etc/ssh owned by the host user's uid in the image. sshd refuses to start if /var/empty is not root-owned, so port 22 never opened and all VM SSH attempts timed out. Fix: before mke2fs reads the source tree, the Docker container (running as root) chowns /var/empty and /etc/ssh to root:root and sets /var/empty to 755. The volume mount is changed from :ro to writable to allow the chown. Also adds diagnostic tooling to density_ramp benchmark: - Print Firecracker log on first boot failure to expose sshd errors - Stop orphaned VM on wait_for_ssh failure (resource leak fix) - EC2 script: disable firewalld (conflicts with SmolVM nftables rules) - EC2 script: post-run diagnostics section (network state + VM log) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous fix chowned /work/rootfs/etc/ssh inside Docker, but that path is a bind-mount from the host's tmpdir. After Docker exits the temp dir has root-owned files that ec2-user cannot remove, causing a PermissionError on TemporaryDirectory cleanup. Fix: cp -a the rootfs to /work/rootfs-fixed inside the container, chown there, and run mke2fs on the copy. The host bind-mount is restored to :ro so the host directory is never modified. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… cleanup After mke2fs completes, run a Docker chown pass to restore the ownership of rootfs_dir back to the current user. This prevents PermissionError in TemporaryDirectory cleanup when root-owned files are present — either from the (now-fixed) :ro bypass issue or from residual files left by a previous failed build in the same /tmp path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Python's tarfile.extractall() silently drops root ownership (uid=0) when the host user is non-root. This caused sshd inside the guest to reject /root/.ssh/authorized_keys (wrong owner) and crash on /var/empty — both paths require root ownership enforced by OpenSSH's strict-modes check. Fix: skip Python's tar extraction entirely. Mount the tarball read-only into a Docker container and let Docker (running as root) extract it with correct uid/gid preserved. Exclude ./dev to avoid CAP_MKNOD failures; the guest init mounts devtmpfs at boot so rootfs device nodes are unused. This also eliminates the rootfs-dir bind-mount that caused root-owned files to appear in the temp dir, so TemporaryDirectory cleanup is clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 Walkthrough

Adds a comprehensive benchmarking suite and helpers for SmolVM (command/file/network benchmarks, density-ramp harness with telemetry and lifecycle tooling), platform-specific sandbox helpers, docs and result formats, EC2/GCP runner scripts, an optional psutil benchmark dependency, and a Docker-based secure tar extraction change in image build.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Host as Host Process
    participant SmolVM as SmolVM Controller
    participant VM as Guest VM
    participant Metrics as Metrics Store
    Host->>SmolVM: build_image / boot_one(seq, kernel, rootfs)
    SmolVM->>VM: launch VM (Firecracker/TAP/socket)
    VM-->>SmolVM: health / SSH ready
    SmolVM-->>Host: return boot result (ok/fail, boot_time)
    Host->>Metrics: record boot event & snapshot
    loop sustain checks
        Host->>VM: check_vm_alive()
        VM-->>Host: alive / metrics
        Host->>Metrics: append sustain result
    end
    Host->>SmolVM: cleanup()
    SmolVM->>VM: terminate VM(s)
    VM-->>SmolVM: shutdown complete
    SmolVM-->>Host: report teardown metrics
    Host->>Metrics: write JSON results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/smolvm/build.py (1)
690-724: ⚠️ Potential issue | 🟠 Major — The Docker fallback can leave root-owned cache files on Linux with rootful Docker.
The container writes the ext4 image to the bind-mounted host directory as root. On Linux with rootful Docker, the resulting rootfs.ext4 will be owned by UID 0, breaking cache reuse and cleanup for the non-root invoking user. Add a post-execution chown to fix ownership after the Docker container exits, or implement user namespace remapping with --userns-remap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/build.py` around lines 690 - 724, The Docker fallback can leave the produced ext4 image owned by root; after the subprocess.run that executes the Docker "run" (the block that uses shell_cmd, tar_path, rootfs_path and invokes "docker" via subprocess.run), add a post-run step to chown the generated file to the current invoking user (use rootfs_path or the constructed rootfs_name to locate the output) — e.g., call os.chown(rootfs_path, os.getuid(), os.getgid()) or run a safe subprocess(["chown", f"{os.getuid()}:{os.getgid()}", str(rootfs_path)]) — or alternatively invoke Docker with user namespace remapping (add "--userns-remap=default") to avoid root-owned output; ensure this fix runs only on platforms where rootful Docker produces UID 0 files (Linux) and does not break existing error handling around the subprocess.run that used shell_cmd.
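A minimal sketch of the post-run ownership fix suggested in this prompt. The function name `chown_to_invoker` and the use of an `alpine` helper container are illustrative assumptions, not part of the codebase:

```python
import os
import subprocess
from pathlib import Path


def chown_to_invoker(rootfs_path: Path) -> None:
    """Hand a Docker-produced file back to the invoking user (sketch).

    No-op on non-POSIX platforms and when the file is already owned by
    the current user (e.g. Docker Desktop on macOS remaps ownership).
    """
    if os.name != "posix":
        return
    st = rootfs_path.stat()
    uid, gid = os.getuid(), os.getgid()
    if (st.st_uid, st.st_gid) == (uid, gid):
        return
    # A non-root user cannot chown a root-owned file directly, so run
    # the chown inside a throwaway root container over the parent dir.
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{rootfs_path.parent}:/out",
         "alpine", "chown", f"{uid}:{gid}", f"/out/{rootfs_path.name}"],
        check=True,
    )
```

Calling this right after the `subprocess.run` that executes the Docker build keeps the existing error handling intact, since the chown only runs when the build succeeded and left a root-owned file behind.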
🧹 Nitpick comments (3)
benchmarks/helpers.py (1)
26-40: Record elapsed_ms even when the timed block raises.
With @contextmanager, the code after yield is skipped unless it is in finally. A timeout or subprocess error in the with body currently leaves t.elapsed_ms at 0.0.

♻️ Suggested fix

```diff
 @contextmanager
 def timer():
     """Context manager that records elapsed wall-clock time in milliseconds.
@@
     """
     t = _TimerResult()
     t.start = time.perf_counter()
-    yield t
-    t.end = time.perf_counter()
-    t.elapsed_ms = (t.end - t.start) * 1000.0
+    try:
+        yield t
+    finally:
+        t.end = time.perf_counter()
+        t.elapsed_ms = (t.end - t.start) * 1000.0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/helpers.py` around lines 26 - 40, The timer context manager does not set t.end and t.elapsed_ms if the with-block raises because the post-yield code runs only on normal exit; modify the timer function (the contextmanager named timer that returns an _TimerResult instance) to ensure the stop-time and elapsed_ms are computed in a finally block so they are always set even if the with body raises—move the t.end and t.elapsed_ms assignments into a try/finally around the yield (or use try: yield t finally: ...) so _TimerResult fields are populated on both success and exception.

pyproject.toml (1)
29-35: Keep psutil out of the base runtime dependencies.
benchmarks/ is not shipped in the wheel, so this makes every smolvm install pull a benchmark-only dependency. Moving it to a benchmarks or dev extra keeps the runtime package leaner for normal consumers.

♻️ Suggested packaging split

```diff
 dependencies = [
     "paramiko>=3.0",
-    "psutil>=7.2.2",
     "pydantic>=2.0",
     "requests>=2.28",
     "requests-unixsocket>=0.3",
 ]

 [project.optional-dependencies]
+benchmarks = [
+    "psutil>=7.2.2",
+]
 dev = [
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 29 - 35, The pyproject currently lists "psutil>=7.2.2" in the base dependencies array which forces all installs to pull a benchmark-only dependency; remove "psutil" from the dependencies list and add it to an optional extra (e.g., create or update the [project.optional-dependencies] entry named "benchmarks" or "dev" and include "psutil>=7.2.2" there) so normal consumers don't install it by default; ensure any README or install docs reference the new extra name for developers running benchmarks.

benchmarks/bench_command_latency.py (1)
10-20: Make the helpers import package-safe.
from helpers import ... only works when the script directory is already on sys.path. Importing this module as benchmarks.bench_command_latency fails before the __main__ path tweak runs, so run_benchmark() is not reusable from a driver or test harness.

♻️ Suggested import pattern

```diff
-from helpers import (
-    format_stats,
-    is_sandbox_exec_available,
-    overhead_str,
-    print_header,
-    print_result,
-    print_subheader,
-    run_sandboxed,
-    stats_summary,
-    time_call_n,
-)
+try:
+    from .helpers import (
+        format_stats,
+        is_sandbox_exec_available,
+        overhead_str,
+        print_header,
+        print_result,
+        print_subheader,
+        run_sandboxed,
+        stats_summary,
+        time_call_n,
+    )
+except ImportError:
+    from helpers import (
+        format_stats,
+        is_sandbox_exec_available,
+        overhead_str,
+        print_header,
+        print_result,
+        print_subheader,
+        run_sandboxed,
+        stats_summary,
+        time_call_n,
+    )
```

Also applies to: 129-132
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/bench_command_latency.py` around lines 10 - 20, The top-level import "from helpers import (...)" is not package-safe; change it to a relative package import so the module can be imported as benchmarks.bench_command_latency (e.g. replace with "from .helpers import format_stats, is_sandbox_exec_available, overhead_str, print_header, print_result, print_subheader, run_sandboxed, stats_summary, time_call_n"); also update the other similar imports around the later block (lines referenced in the comment) to use the same relative form so functions like run_sandboxed, time_call_n, etc., resolve when the module is imported as part of the package.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/bench_file_io.py`:
- Around line 155-165: The VM bulk-read loop currently uses vm.run("cat
/tmp/bench_read.dat") which streams ~10MB back over the command channel
(measuring SSH/stdout transport), so replace the VM-side command with a
guest-local read that doesn’t transmit file contents (e.g., use dd
if=/tmp/bench_read.dat of=/dev/null bs=1024 count=10240) in the vm.run calls
used by time_call (the block building vm_bulk_r_times and vm_bulk_r_stats) and
the analogous block at lines ~185-198; keep using time_call and stats_summary
but ensure you only change the vm.run command string to dd so the benchmark
measures guest disk I/O instead of transport overhead.
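The command swap this prompt describes can be sketched as follows; the paths and sizes come from the comment, and `bytes_read` is a purely illustrative helper for sanity-checking the dd arguments:

```python
# Old command streamed ~10 MB of file contents back over the SSH command
# channel, so the benchmark timed transport rather than guest disk reads:
OLD_CMD = "cat /tmp/bench_read.dat"

# dd discards the data inside the guest; only the filesystem read is timed.
NEW_CMD = "dd if=/tmp/bench_read.dat of=/dev/null bs=1024 count=10240"


def bytes_read(cmd: str) -> int:
    """Rough byte count implied by a dd invocation (illustrative only)."""
    opts = dict(kv.split("=") for kv in cmd.split() if "=" in kv)
    return int(opts["bs"]) * int(opts["count"])
```

Only the command string passed to `vm.run` changes; the `time_call`/`stats_summary` plumbing around it stays as-is.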
In `@benchmarks/bench_network_io.py`:
- Around line 132-168: After attempting to install curl, ensure you re-check the
VM for HTTP clients and fail fast if none are available: re-run the checks that
set has_curl and has_wget (e.g., using vm.run("which curl") and vm.run("which
wget") or equivalent) after the apk add step, and if both has_curl and has_wget
remain false, abort the benchmark with a clear error (raise/assert or exit)
instead of proceeding to build fetch_cmd; update the logic around vm.run,
has_curl, has_wget, and fetch_cmd so the later HTTP-request block only runs when
a confirmed client exists.
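A sketch of the fail-fast re-check described above. The helper name `ensure_http_client` is an assumption, and `vm.run` is assumed to return an object with a `returncode` attribute; adjust to the real SmolVM API:

```python
def ensure_http_client(vm) -> str:
    """Return 'curl' or 'wget' if usable in the guest, trying a curl
    install first; raise RuntimeError when neither is available."""
    def has(tool: str) -> bool:
        # `which` exits 0 only when the tool is on the guest PATH
        return vm.run(f"which {tool}").returncode == 0

    if has("curl"):
        return "curl"
    vm.run("apk add --no-cache curl")  # best-effort install
    if has("curl"):
        return "curl"
    if has("wget"):
        return "wget"
    raise RuntimeError("no HTTP client (curl/wget) available in guest")
```

The HTTP-request block then builds `fetch_cmd` from the returned tool name instead of assuming one exists.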
In `@benchmarks/density_ramp.py`:
- Around line 294-315: The current parallel boot path collects all results with
batch_results = [f.result() for f in as_completed(futures)] and only takes
Metrics.snapshot(...) afterwards, which produces post-batch metrics; instead,
process each future as it completes: iterate over as_completed(futures), call
f.result() immediately, compute density using running_vms plus (1 if vm is not
None else 0), call Metrics.snapshot(density, args.socket_dir) right there, then
update i and failed_this_batch and emit the boot_ok/boot_fail handling; modify
the ThreadPoolExecutor block that builds futures and the subsequent
batch_results handling so snapshots are taken per completed future rather than
after collecting the whole batch.
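The per-completion snapshot pattern described above can be sketched like this; `boot_one`, `snapshot`, and `running_vms` are stand-ins for the benchmark's own helpers and state:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def boot_batch(boot_one, seqs, snapshot, running_vms):
    """Boot a batch in parallel, snapshotting as each boot completes
    so metrics reflect the density at readiness time, not post-batch."""
    failed = 0
    with ThreadPoolExecutor(max_workers=len(seqs)) as pool:
        futures = [pool.submit(boot_one, s) for s in seqs]
        for fut in as_completed(futures):
            vm = fut.result()
            # density at this instant: already-running VMs plus this one
            density = len(running_vms) + (1 if vm is not None else 0)
            snapshot(density)
            if vm is None:
                failed += 1
            else:
                running_vms.append(vm)
    return failed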
- Around line 59-64: The snapshot function (and the similar code at the later
occurrence) is measuring disk usage with psutil.disk_usage(Path.home()) which
reports the home filesystem rather than SmolVM’s storage; update snapshot (and
the other occurrence around lines 257-258) to check for SmolVM storage paths (at
least Path.home()/".smolvm" and Path.home()/".local/state/smolvm" or an
environment/configured storage path) and call psutil.disk_usage on the first
existing storage mount, falling back to Path.home() if none exist; ensure you
convert the chosen path to a string for psutil.disk_usage and compute disk in GB
as before so host_disk_used_gb reflects the actual storage mount used for
images/clones.
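A sketch of the storage-path selection this prompt asks for. The candidate paths come from the comment; the `SMOLVM_HOME` environment override is an assumption, not a documented variable:

```python
import os
from pathlib import Path


def storage_mount() -> Path:
    """Pick the filesystem that actually holds SmolVM images/clones,
    falling back to the home directory when no storage path exists."""
    candidates = [
        Path(os.environ["SMOLVM_HOME"]) if "SMOLVM_HOME" in os.environ else None,
        Path.home() / ".smolvm",
        Path.home() / ".local/state/smolvm",
    ]
    for p in candidates:
        if p is not None and p.exists():
            return p.resolve()  # follow the NVMe symlink on EC2
    return Path.home()


# snapshot() would then use:
#   psutil.disk_usage(str(storage_mount())).used / 1e9  -> host_disk_used_gb
```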
- Around line 104-112: The cleanup() function currently calls vm.stop() for each
VM which only powers off the VM; change it to call vm.delete() on each vm in
running_vms so host-side disk and network cleanup occurs (replace vm.stop() with
vm.delete()). Remove the silent "except Exception: pass" and instead catch
Exception as e and log the failure (e.g., using the module logger or
logging.exception/logging.error) including the VM identifier and exception
details so teardown errors are visible; ensure cleanup() still returns elapsed
seconds.
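The cleanup shape described above might look like this; it assumes VMs expose `delete()` and a `name` attribute, per the review comment:

```python
import logging
import time

log = logging.getLogger("density_ramp")


def cleanup(running_vms) -> float:
    """Tear down every VM with delete() (which releases host-side disk
    clones and network state, unlike stop()) and log failures instead
    of swallowing them. Returns elapsed seconds."""
    start = time.perf_counter()
    for vm in running_vms:
        try:
            vm.delete()
        except Exception:
            log.exception("failed to delete VM %s", getattr(vm, "name", vm))
    running_vms.clear()
    return time.perf_counter() - start
```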
In `@benchmarks/docs/README_DENSITY_RAMP.md`:
- Around line 112-114: The README currently states the ramp stops when any VM in
a batch fails, but the implementation in benchmarks/density_ramp.py uses the
condition failed_this_batch == batch_size to exit early; update the
README_DENSITY_RAMP.md failure-handling bullet to accurately describe that a
batch is only considered failed (and the ramp stops) when all VMs in the batch
fail (i.e., failed_this_batch equals batch_size), or alternatively change the
code in density_ramp.py to early-exit on any single VM failure if you intend the
former behavior—reference the failed_this_batch and batch_size variables and the
failure-check logic in density_ramp.py when making the edit.
- Around line 103-109: Update the README and example JSON schema so they match
the actual output fields emitted by benchmarks/density_ramp.py: replace
references to `seq`/`kvm_vms`/“KVM VM count” with the real field names `vm_seq`,
`density_at_boot`, `host_disk_used_gb`, `tap_count`, `fc_socket_count`, and
`sqlite_db_kb`, and update any example payloads and descriptions (including the
section currently describing host metrics around lines 120-145) to show the
correct keys and meanings; ensure the README’s metric list and the sample JSON
exactly mirror the dictionary keys written by the script.
In `@benchmarks/docs/v0_density_estimation_benchmark.md`:
- Around line 13-26: The example has a malformed code fence and doesn't clean up
started VMs; fix by making the block a proper python fenced section and change
the loop to try: while True: ... except Exception as e: print(...) finally:
iterate the stored vms list (vms) in reverse and call vm.stop() safely (catch
and ignore stop errors) so every SmolVM started via vm.start() is stopped on
exit; reference SmolVM, vms, vm.start(), and vm.stop() when locating the
snippet.
In `@benchmarks/docs/v0_review_density_ramp_report.md`:
- Around line 25-28: Update the description to document that boot_time_s
represents end-to-end readiness rather than just the SmolVM.start() call: state
that in benchmarks/density_ramp.py the timing is measured from immediately
before SmolVM(config) construction through until wait_for_ssh(timeout=30.0)
returns, and that density_at_boot is the sequence number recorded at that
readiness point; change any sentences that currently say “timed from
SmolVM(config).start() call to return” to reflect this end-to-end measurement
(same change for the other occurrence around lines 57–60).
- Around line 65-66: The report currently hardcodes /tmp for Firecracker
sockets; update the guidance for `fc_socket_count` (and the related cleanup
command referenced around lines 263-267) to mention that the socket directory is
configurable via the `--socket-dir` flag in `benchmarks/density_ramp.py` and
either use a placeholder like `<socket-dir>` (e.g.,
glob("<socket-dir>/fc-*.sock")) or explicitly note “defaults to /tmp if not
overridden”; ensure the text references `--socket-dir` and `fc_socket_count` so
readers know where to change the path.
In `@benchmarks/results/README.md`:
- Around line 167-178: The snippet assumes boots is non-empty; add a guard at
the start that checks if boots is empty and handles that case before computing
times or indexing boots[-1] (e.g., print Peak VMs and Failed boots and either
return/exit or print fallback messages for avg/max/peak mem/RSS). Specifically
update the logic around the variables boots, fails, times, and the prints that
use sum(times)/len(times), max(times), boots[-1]['host_mem_used_gb'], and
boots[-1]['firecracker_rss_mb'] so they are only executed when boots has at
least one element, and otherwise produce sensible fallback output.
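A guarded version of that analysis snippet might look like this; the field names follow the result JSON described in the surrounding comments:

```python
def summarize(boots, fails):
    """Print the density summary without assuming at least one
    successful boot."""
    print(f"Peak VMs: {len(boots)}  Failed boots: {len(fails)}")
    if not boots:
        print("No successful boots; skipping latency/memory stats.")
        return
    times = [b["boot_time_s"] for b in boots]
    print(f"Avg boot: {sum(times) / len(times):.2f}s  Max: {max(times):.2f}s")
    last = boots[-1]
    print(f"Peak host mem: {last['host_mem_used_gb']:.1f} GB  "
          f"Firecracker RSS: {last['firecracker_rss_mb']:.0f} MB")
```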
In `@benchmarks/run_density_ramp_ec2.sh`:
- Around line 123-130: The script currently authorizes SSH from 0.0.0.0/0 using
the aws ec2 authorize-security-group-ingress call, which is insecure; change the
logic around the aws ec2 authorize-security-group-ingress invocation (the block
using REGION, SECURITY_GROUP and --port 22) to require an explicit SSH_CIDR env
var or default to the caller’s single-IP /32 (fetching the public IP if
necessary) instead of 0.0.0.0/0; ensure the aws call uses that SSH_CIDR value
(e.g., --cidr "$SSH_CIDR") and add a guard that exits with an error when neither
SSH_CIDR nor a determinable caller IP is available.
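A sketch of the SSH_CIDR resolution this prompt asks for; the function name and the checkip endpoint are assumptions, while `SSH_CIDR` and the aws call shape come from the comment:

```shell
# Resolve the CIDR for the SSH ingress rule: honor an explicit SSH_CIDR,
# else fall back to the caller's public IP as a /32, else fail loudly.
resolve_ssh_cidr() {
  if [ -n "${SSH_CIDR:-}" ]; then
    echo "$SSH_CIDR"
    return 0
  fi
  local ip
  ip=$(curl -fsS https://checkip.amazonaws.com 2>/dev/null | tr -d '[:space:]')
  if [ -z "$ip" ]; then
    echo "ERROR: set SSH_CIDR; could not determine caller IP" >&2
    return 1
  fi
  echo "${ip}/32"
}

# aws ec2 authorize-security-group-ingress ... --port 22 \
#   --cidr "$(resolve_ssh_cidr)"
```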
- Around line 268-278: Make NVMe provisioning conditional by first checking
whether the device node /dev/nvme1n1 exists (e.g., using test -b /dev/nvme1n1 or
[ -e /dev/nvme1n1 ]) before running mkfs.ext4, mount, chown, and the symlink
steps; if the device is absent, skip the mkfs/mount and instead ensure the
fallback directories (~/.smolvm and ~/.local/state/smolvm) exist under
/home/ec2-user (mkdir -p and chown) so the rest of the script can proceed.
Reference the block that manipulates /dev/nvme1n1, /mnt/nvme, ln -sfn
/mnt/nvme/smolvm-images /home/ec2-user/.smolvm and ln -sfn
/mnt/nvme/smolvm-state /home/ec2-user/.local/state/smolvm to locate where to add
the conditional.
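The conditional provisioning described above could be sketched as a function; device and directory names come from the comment, and the mkfs/mount branch (which needs root) only runs when the block device is actually present:

```shell
provision_storage() {
  local dev="${1:-/dev/nvme1n1}" home="${2:-/home/ec2-user}"
  if [ -b "$dev" ]; then
    # NVMe scratch disk present: format, mount, and point SmolVM at it
    sudo mkfs.ext4 -F "$dev"
    sudo mkdir -p /mnt/nvme
    sudo mount "$dev" /mnt/nvme
    sudo chown "$(id -un)":"$(id -gn)" /mnt/nvme
    mkdir -p /mnt/nvme/smolvm-images /mnt/nvme/smolvm-state
    ln -sfn /mnt/nvme/smolvm-images "$home/.smolvm"
    ln -sfn /mnt/nvme/smolvm-state "$home/.local/state/smolvm"
  else
    # no NVMe device: ensure the fallback directories exist
    mkdir -p "$home/.smolvm" "$home/.local/state/smolvm"
  fi
}
```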
- Around line 31-33: The KEY_NAME variable is hardcoded to "smolvm-benchmark"
which disables the KEY_NAME env override and the validation block below; change
the assignment for KEY_NAME to use the default-expansion pattern (i.e.,
KEY_NAME="${KEY_NAME:-smolvm-benchmark}") so that the environment variable can
override it and the existing validation block that references KEY_NAME remains
effective; update any related comments to reflect that KEY_NAME can be provided
via environment.
---
Outside diff comments:
In `@src/smolvm/build.py`:
- Around line 690-724: The Docker fallback can leave the produced ext4 image
owned by root; after the subprocess.run that executes the Docker "run" (the
block that uses shell_cmd, tar_path, rootfs_path and invokes "docker" via
subprocess.run), add a post-run step to chown the generated file to the current
invoking user (use rootfs_path or the constructed rootfs_name to locate the
output) — e.g., call os.chown(rootfs_path, os.getuid(), os.getgid()) or run a
safe subprocess(["chown", f"{os.getuid()}:{os.getgid()}", str(rootfs_path)]) —
or alternatively invoke Docker with user namespace remapping (add
"--userns-remap=default") to avoid root-owned output; ensure this fix runs only
on platforms where rootful Docker produces UID 0 files (Linux) and does not
break existing error handling around the subprocess.run that used shell_cmd.
---
Nitpick comments:
In `@benchmarks/bench_command_latency.py`:
- Around line 10-20: The top-level import "from helpers import (...)" is not
package-safe; change it to a relative package import so the module can be
imported as benchmarks.bench_command_latency (e.g. replace with "from .helpers
import format_stats, is_sandbox_exec_available, overhead_str, print_header,
print_result, print_subheader, run_sandboxed, stats_summary, time_call_n"); also
update the other similar imports around the later block (lines referenced in the
comment) to use the same relative form so functions like run_sandboxed,
time_call_n, etc., resolve when the module is imported as part of the package.
In `@benchmarks/helpers.py`:
- Around line 26-40: The timer context manager does not set t.end and
t.elapsed_ms if the with-block raises because the post-yield code runs only on
normal exit; modify the timer function (the contextmanager named timer that
returns an _TimerResult instance) to ensure the stop-time and elapsed_ms are
computed in a finally block so they are always set even if the with body
raises—move the t.end and t.elapsed_ms assignments into a try/finally around the
yield (or use try: yield t finally: ...) so _TimerResult fields are populated on
both success and exception.
In `@pyproject.toml`:
- Around line 29-35: The pyproject currently lists "psutil>=7.2.2" in the base
dependencies array which forces all installs to pull a benchmark-only
dependency; remove "psutil" from the dependencies list and add it to an optional
extra (e.g., create or update the [project.optional-dependencies] entry named
"benchmarks" or "dev" and include "psutil>=7.2.2" there) so normal consumers
don't install it by default; ensure any README or install docs reference the new
extra name for developers running benchmarks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ab0b565b-cae3-4ccb-931c-834902f7ea4f
📒 Files selected for processing (14)
benchmarks/__init__.py
benchmarks/bench_command_latency.py
benchmarks/bench_file_io.py
benchmarks/bench_network_io.py
benchmarks/density_ramp.py
benchmarks/docs/README_DENSITY_RAMP.md
benchmarks/docs/v0_density_estimation_benchmark.md
benchmarks/docs/v0_review_density_ramp_report.md
benchmarks/helpers.py
benchmarks/results/README.md
benchmarks/results/density_20260309_001743.json
benchmarks/run_density_ramp_ec2.sh
pyproject.toml
src/smolvm/build.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/smolvm/build.py (1)
977-1003: ⚠️ Potential issue | 🟠 Major — Validate tar link targets before extraction.
The preflight check only validates member.name but does not check symlink and hardlink linkname values. A crafted exported image can contain symlink or hardlink entries with unsafe link targets (absolute paths or .. segments), and tar xf will honor those during extraction. Since the helper container has a bind-mounted /work/out directory, unsafe link targets can escape the intended /work/rootfs extraction directory. Validate link targets using member.issym(), member.islnk(), and member.linkname before invoking tar xf.

🔒 Proposed fix

```diff
 with tarfile.open(tar_path, "r") as tar:
     for member in tar.getmembers():
         member_path = Path(member.name)
         if member_path.is_absolute() or ".." in member_path.parts:
             raise ImageError(
                 f"Refusing to extract suspicious tar path from docker export: {member.name}"
             )
+        if member.issym() or member.islnk():
+            link_path = Path(member.linkname)
+            if link_path.is_absolute() or ".." in link_path.parts:
+                raise ImageError(
+                    "Refusing to extract suspicious tar link from docker export: "
+                    f"{member.name} -> {member.linkname}"
+                )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/build.py` around lines 977 - 1003, The tar preflight loop that currently checks only member.name must also validate symlink and hardlink targets to prevent path traversal: inside the existing tar.getmembers() loop in build.py (the block handling tar_path and ImageError), for any member where member.issym() or member.islnk() inspect member.linkname and treat it like a Path, rejecting it if Path(member.linkname).is_absolute() or ".." appears in its parts (and handle empty/None linkname safely); raise the same ImageError with a clear message for unsafe link targets before running the external tar xf step so the extraction cannot escape /work/rootfs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/smolvm/build.py`:
- Around line 977-1003: The tar preflight loop that currently checks only
member.name must also validate symlink and hardlink targets to prevent path
traversal: inside the existing tar.getmembers() loop in build.py (the block
handling tar_path and ImageError), for any member where member.issym() or
member.islnk() inspect member.linkname and treat it like a Path, rejecting it if
Path(member.linkname).is_absolute() or ".." appears in its parts (and handle
empty/None linkname safely); raise the same ImageError with a clear message for
unsafe link targets before running the external tar xf step so the extraction
cannot escape /work/rootfs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
🧹 Nitpick comments (1)
benchmarks/run_density_ramp_gcp.sh (1)
201-203: Consider a more targeted KVM permission approach.
`chmod 666 /dev/kvm` makes the device world-accessible. While acceptable for an ephemeral benchmark instance, a slightly cleaner approach uses ACLs for the specific user: `sudo setfacl -m u:$USER:rw /dev/kvm`. This is a minor hygiene suggestion; the current approach is functional for the use case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/run_density_ramp_gcp.sh` around lines 201 - 203, Replace the permissive "sudo chmod 666 /dev/kvm" with a targeted ACL-based grant so only the runner user gets rw access; use setfacl to add user:$USER:rw on /dev/kvm (and ensure the acl utility is available or fall back to usermod -aG kvm "$USER")—update the script lines that currently call "sudo chmod 666 /dev/kvm" to invoke "sudo setfacl -m u:$USER:rw /dev/kvm" and include a small fallback/installation check for setfacl if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/run_density_ramp_gcp.sh`:
- Around line 84-91: The firewall rule currently opens SSH to 0.0.0.0/0; change
the script to determine the operator's public IP into a variable (e.g. MY_IP
using curl fallback) and pass that to the gcloud command instead of 0.0.0.0/0 so
the gcloud compute firewall-rules create call (referenced by FIREWALL_RULE,
PROJECT, NETWORK_TAG) uses --source-ranges="$MY_IP/32"; ensure the script exits
or warns if MY_IP cannot be determined.
- Around line 206-210: The pip3 install line will fail on Debian 12 due to PEP
668 — update the installation step that runs "pip3 install --quiet smolvm
psutil" to either add "--break-system-packages" (i.e., "pip3 install
--break-system-packages --quiet smolvm psutil") or, preferably, create and use a
virtual environment: run "python3 -m venv ~/venv" then use "~/venv/bin/pip
install --quiet smolvm psutil" and ensure subsequent benchmark invocations use
"~/venv/bin/python3" (or source the venv) instead of the system "python3".
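The venv approach from this prompt, sketched as a small helper; the function name is illustrative, and the package install is shown commented because it depends on network access:

```shell
# Debian 12's PEP 668 marker blocks bare `pip3 install`; create a venv
# and use its interpreter for the benchmark instead.
setup_venv() {
  local venv="${1:-$HOME/venv}"
  python3 -m venv "$venv"
  echo "$venv/bin/python3"
}

# PY=$(setup_venv)
# "$PY" -m pip install --quiet smolvm psutil
# "$PY" benchmarks/density_ramp.py ...
```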
---
Nitpick comments:
In `@benchmarks/run_density_ramp_gcp.sh`:
- Around line 201-203: Replace the permissive "sudo chmod 666 /dev/kvm" with a
targeted ACL-based grant so only the runner user gets rw access; use setfacl to
add user:$USER:rw on /dev/kvm (and ensure the acl utility is available or fall
back to usermod -aG kvm "$USER")—update the script lines that currently call
"sudo chmod 666 /dev/kvm" to invoke "sudo setfacl -m u:$USER:rw /dev/kvm" and
include a small fallback/installation check for setfacl if needed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f8c9bb1-e4e3-47b3-964e-b9051114b0ab
📒 Files selected for processing (3)
benchmarks/density_20260309_010721.json
benchmarks/results/density_20260309_010721.json
benchmarks/run_density_ramp_gcp.sh
- build.py: validate symlink/hardlink targets in tar security check
- build.py: chown ext4 output to caller after Docker run (rootful Docker leaves root-owned file)
- density_ramp.py: vm.stop() → vm.delete() so disk clones are released; log cleanup failures
- density_ramp.py: measure disk usage on SmolVM storage path (follows NVMe symlink on EC2)
- density_ramp.py: capture metrics snapshot per-future in parallel path, not post-batch
- bench_network_io.py: abort with RuntimeError if no HTTP client after apk add attempt
- helpers.py: timer context manager uses try/finally so elapsed_ms is set on exception
- bench_*.py: package-safe try/except import for helpers (relative then absolute)
- run_density_ramp_ec2.sh: fix KEY_NAME env override; restrict SSH rule to caller IP; guard NVMe provisioning on device existence
- run_density_ramp_gcp.sh: restrict SSH rule to caller IP; use venv for pip install (PEP 668); use setfacl for /dev/kvm
- pyproject.toml: move psutil to optional [benchmarks] extra, out of base dependencies
- README_DENSITY_RAMP.md: update JSON schema (vm_seq, density_at_boot, etc.) and fix failure-handling description
- results/README.md: guard analysis snippet against empty boots list
- v0_review_density_ramp_report.md: correct boot_time_s as end-to-end SSH readiness; clarify --socket-dir for fc_socket_count
- v0_density_estimation_benchmark.md: fix malformed code fence; add VM cleanup to example

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cat /tmp/bench_read.dat measures SSH/stdout transport (~10MB through the command channel) rather than guest disk I/O, making the VM vs host comparison misleading. Use dd if=... of=/dev/null so only the guest filesystem read is timed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 13
♻️ Duplicate comments (4)
benchmarks/results/README.md (1)
134-145: ⚠️ Potential issue | 🟡 Minor — Guard these formulas for 0/1 successful boots.
A valid run can stop after zero or one boot_ok. In that case len(boots) - 1 hits zero here, and the RSS example below still divides by len(boots). The examples need a small precondition or fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/results/README.md` around lines 134 - 145, The examples compute per_vm_mem_mb and rss_per_vm_mb from boots without guarding for runs with 0 or 1 successful boots; add a short precondition that checks len(boots) and handle the edge cases (len(boots) == 0 and len(boots) == 1) before performing the divisions: if boots is empty, skip the computation or return a clear "no boots" result, and if there's exactly one boot, avoid dividing by (len(boots)-1) by using a defined fallback (e.g., treat per-VM delta as 0 or report "insufficient samples") and compute rss_per_vm_mb only when len(boots) > 0; update the explanatory text around per_vm_mem_mb and rss_per_vm_mb to mention these checks.

benchmarks/docs/v0_review_density_ramp_report.md (3)
265-269: ⚠️ Potential issue | 🟡 Minor — Keep the stale-socket example parameterized.
This section hardcodes `/tmp/fc-*.sock`, but the benchmark lets operators move sockets via `--socket-dir`. Use `<socket-dir>` here as well, or call out `/tmp` as only the default.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/docs/v0_review_density_ramp_report.md` around lines 265 - 269, The example hardcodes the Firecracker socket path (/tmp/fc-*.sock); update the docs to reference the configurable socket directory instead — e.g., replace the literal "/tmp/fc-*.sock" in the command example with a parameterized placeholder like "<socket-dir>/fc-*.sock" (or add a parenthetical note that /tmp is the default) so the command aligns with the benchmark's --socket-dir option and avoids confusing readers; update the Bash block containing "ls /tmp/fc-*.sock" accordingly.
47-52: ⚠️ Potential issue | 🟡 Minor — Describe teardown in terms of `vm.delete()`.
`benchmarks/density_ramp.py` now deletes each VM during cleanup. Saying `cleanup()` calls `vm.stop()` makes the disk-release explanation here inaccurate, because `stop()` and `delete()` have very different host-side cleanup behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/docs/v0_review_density_ramp_report.md` around lines 47 - 52, Update the teardown description to reflect that cleanup() calls vm.delete() (not vm.stop()), because delete() performs host-side rootfs/clone removal while stop() only halts the VM; update the text around Metrics.snapshot() to explain that the post-cleanup snapshot measures disk usage after vm.delete() so the delta reflects released clones, and mention benchmarks/density_ramp.py as the source of the deletion behavior change.
59-64: ⚠️ Potential issue | 🟡 Minor — Refresh the metrics table to match the current script.
The Source column still describes the pre-fix behavior: `boot_time_s` is no longer just wall-clock around `vm.start()`, and `host_disk_used_gb` is no longer taken from `psutil.disk_usage(home).used`. Readers comparing the JSON to this table will get the wrong interpretation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/docs/v0_review_density_ramp_report.md` around lines 59 - 64, Update the metrics table so the Source column accurately describes how each JSON field is computed in the current script: for `boot_time_s` replace "Wall clock around vm.start()" with the actual measurement used in the script (e.g., the timestamp delta from VM start to the script's success/health-check completion and the function or lines that compute it), for `host_disk_used_gb` remove the reference to psutil.disk_usage(home).used and instead describe the current method the script uses to measure disk growth (the specific variable/function that records rootfs/clone growth), and ensure `density_at_boot` and `host_mem_used_gb` entries reference the exact script variables or calls (e.g., `len(running_vms)` and `psutil.virtual_memory().used` if still accurate) so readers can directly map JSON fields to the code that produces them.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/bench_network_io.py`:
- Around line 105-127: The benchmark currently appends timing results even when
subprocess.run or vm.run fail; update the code paths around
time_call/subprocess.run (for host_* variables like host_http_times,
host_sse_lines, and similarly vm.run for guest_http_times/guest_sse_lines) to
check for errors before recording metrics: catch exceptions and verify
subprocess.CompletedProcess.returncode == 0 (and non-empty stdout for SSE) and
only append the measured time or count when the call succeeded; on failure
log/print the error and skip adding the measurement so
stats_summary/format_stats only reflect successful calls.
- Around line 40-43: The current _find_free_port() creates and closes a probe
socket which allows a race where another process can grab the port before
_start_http_server() binds; instead, change _start_http_server() (and any
callers in this file) to bind to port 0 so the HTTPServer picks an available
port and then read the assigned port from the bound server (e.g.,
server.server_address or server.socket.getsockname()[1]) after the server is
created/started, remove or stop using _find_free_port() in bench_network_io.py,
and return or propagate the actual assigned port to the benchmark code so no
ephemeral-port race can occur.
- Around line 159-160: The code uses gateway_result.output.strip() to set
gateway_ip which is inconsistent with the rest of the file and duplicates
stripping; update the assignment to use gateway_result.stdout.strip() instead so
it matches other uses (see vm.run(...) returning gateway_result and the
gateway_ip variable) and remove the redundant .output usage.
In `@benchmarks/density_ramp.py`:
- Around line 158-167: boot_one currently drops the input seq causing later
as_completed processing to use completion order and misreport vm sequence and
density; change boot_one (the function with signature
boot_one(...)->tuple[SmolVM|None, float, dict[str,Any]]) to include and return
the original seq (e.g., return (seq, vm, boot_time, meta) or similar), then
update the caller that iterates over futures/as_completed to unpack and use the
returned seq to assign vm_seq and to compute density_at_boot and update
running_vms/dead_vms based on the preserved seq/readiness counts rather than the
completion order; apply the same change pattern to the analogous boot/launch
code referenced around the other affected block so ordering and readiness counts
remain consistent in --parallel mode.
- Around line 186-193: The except block currently swallows errors from
vm.delete(), causing leaked resources to be hidden; modify the cleanup so that
when vm is not None you attempt vm.delete() but catch any Exception as e_cleanup
and record its details (e.g., "cleanup_error": str(e_cleanup) and optionally a
traceback) into the returned metadata dict alongside "error" and "fc_log"
instead of silently passing, so callers/snapshots can see delete failures;
locate the except handler using symbols _read_vm_log and vm.delete to implement
this change.
In `@benchmarks/helpers.py`:
- Around line 135-162: The current run_sandboxed function silently falls back to
running on the host when is_sandbox_exec_available() is false; change it to
explicitly indicate unsupported instead of performing a host run: update
run_sandboxed to raise a clear exception (e.g., NotImplementedError or a small
custom UnsupportedSandboxError) or return an explicit unsupported sentinel so
callers can detect and skip/emit "N/A", keep the sandbox path (writing profile
and subprocess.run with sandbox-exec) unchanged, and ensure the function
signature and docstring mention the new behavior so callers of run_sandboxed
(and any tests) can catch the exception or check the sentinel rather than
accidentally recording host timings.
- Around line 231-237: The overhead_str function currently builds the delta as
"(+{diff:.0f}ms)" which yields "+-Nms" when diff is negative; update the
formatted delta to include the sign in the numeric format instead of hardcoding
a "+" so negative deltas render as "(-20ms)". Concretely, in overhead_str use
the computed diff variable and format it with a signed format specifier (e.g.,
{diff:+.0f}ms) and remove the hardcoded "+" in the surrounding string so the
returned string uses ratio and signed diff (use ratio, diff, base_ms, test_ms as
referenced).
- Around line 165-180: The timing helpers should fail the benchmark when the
underlying command errors: in time_host replace the subprocess.run call with one
that enforces failure (use check=True or detect CompletedProcess.returncode and
raise subprocess.CalledProcessError) so non-zero exits don't produce a
misleading elapsed time, and in time_sandboxed ensure run_sandboxed either
raises on non-zero exit or returns a result you validate and raise on error
(i.e., propagate or raise subprocess.CalledProcessError) before computing and
returning the elapsed milliseconds; update both functions (time_sandboxed,
time_host) to validate command success and abort/raise on failure so failing
commands don't count as fast runs.
- Around line 120-125: The Seatbelt profile string _SEATBELT_PROFILE currently
only contains "(version 1)" and "(allow default)" which does not deny network
access; update the _SEATBELT_PROFILE constant to insert a network denial rule by
adding "(deny network*)" to block network operations while keeping filesystem
and process execution allowed, and if Unix-domain socket access is needed for
local services add an explicit allowance like "(allow network-outbound (to
unix-socket))" to preserve that case.
In `@benchmarks/results/README.md`:
- Line 7: Update the README to reflect actual file-write behavior: describe that
density_ramp.py accumulates events in memory and writes a single JSON array to
disk at the end of the run (not a sequentially-updated file), and clarify the
actual event ordering and optionality of records produced by density_ramp.py —
record types include boot_ok/boot_fail per VM attempt and teardown, but
sustain_check is only emitted when --sustain-sec > 0 and at least one VM booted
(it is omitted if --sustain-sec 0 or no VM ever boots); update the wording
around event "shape" accordingly so boot/sustain/teardown semantics match
density_ramp.py's behavior.
In `@benchmarks/run_density_ramp_gcp.sh`:
- Around line 215-224: The script must install and start Docker before running
density_ramp.py because ImageBuilder.build_alpine_ssh_key() uses Docker to
build/export the guest image; update benchmarks/run_density_ramp_gcp.sh to
install a Docker package (e.g., docker.io or docker-ce), enable/start the Docker
daemon, and ensure the runner can invoke docker (use sudo or add the user to the
docker group) before calling density_ramp.py so
ImageBuilder.build_alpine_ssh_key() can succeed.
- Around line 219-221: The script currently does "pip install smolvm" which
installs the PyPI release rather than the local checkout; change the install
step so the benchmark uses the checked-out package under review (not PyPI) by
installing the local repository into the venv (e.g., run the venv pip install
against the repo root or use pip install -e . from the checkout) so that
density_ramp.py runs with the branch changes in src/smolvm and pyproject.toml
instead of the PyPI smolvm package.
In `@src/smolvm/build.py`:
- Line 1005: The code calls os.getuid() and os.getgid() inside
_create_ext4_with_docker() but the os module is not imported; add an import for
the os module (e.g., import os) at the top of the module so that the uid/gid
retrieval (uid, gid = os.getuid(), os.getgid()) succeeds without NameError;
ensure the import is placed with other standard-library imports to keep imports
organized.
---
Duplicate comments:
In `@benchmarks/docs/v0_review_density_ramp_report.md`:
- Around line 265-269: The example hardcodes the Firecracker socket path
(/tmp/fc-*.sock); update the docs to reference the configurable socket directory
instead — e.g., replace the literal "/tmp/fc-*.sock" in the command example with
a parameterized placeholder like "<socket-dir>/fc-*.sock" (or add a
parenthetical note that /tmp is the default) so the command aligns with the
benchmark's --socket-dir option and avoids confusing readers; update the Bash
block containing "ls /tmp/fc-*.sock" accordingly.
- Around line 47-52: Update the teardown description to reflect that cleanup()
calls vm.delete() (not vm.stop()), because delete() performs host-side
rootfs/clone removal while stop() only halts the VM; update the text around
Metrics.snapshot() to explain that the post-cleanup snapshot measures disk usage
after vm.delete() so the delta reflects released clones, and mention
benchmarks/density_ramp.py as the source of the deletion behavior change.
- Around line 59-64: Update the metrics table so the Source column accurately
describes how each JSON field is computed in the current script: for
`boot_time_s` replace "Wall clock around vm.start()" with the actual measurement
used in the script (e.g., the timestamp delta from VM start to the script's
success/health-check completion and the function or lines that compute it), for
`host_disk_used_gb` remove the reference to psutil.disk_usage(home).used and
instead describe the current method the script uses to measure disk growth (the
specific variable/function that records rootfs/clone growth), and ensure
`density_at_boot` and `host_mem_used_gb` entries reference the exact script
variables or calls (e.g., `len(running_vms)` and `psutil.virtual_memory().used`
if still accurate) so readers can directly map JSON fields to the code that
produces them.
In `@benchmarks/results/README.md`:
- Around line 134-145: The examples compute per_vm_mem_mb and rss_per_vm_mb from
boots without guarding for runs with 0 or 1 successful boots; add a short
precondition that checks len(boots) and handle the edge cases (len(boots) == 0
and len(boots) == 1) before performing the divisions: if boots is empty, skip
the computation or return a clear "no boots" result, and if there's exactly one
boot, avoid dividing by (len(boots)-1) by using a defined fallback (e.g., treat
per-VM delta as 0 or report "insufficient samples") and compute rss_per_vm_mb
only when len(boots) > 0; update the explanatory text around per_vm_mem_mb and
rss_per_vm_mb to mention these checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d01608cf-d8df-458a-9418-5b61c51c9b7b
📒 Files selected for processing (13)
- benchmarks/bench_command_latency.py
- benchmarks/bench_file_io.py
- benchmarks/bench_network_io.py
- benchmarks/density_ramp.py
- benchmarks/docs/README_DENSITY_RAMP.md
- benchmarks/docs/v0_density_estimation_benchmark.md
- benchmarks/docs/v0_review_density_ramp_report.md
- benchmarks/helpers.py
- benchmarks/results/README.md
- benchmarks/run_density_ramp_ec2.sh
- benchmarks/run_density_ramp_gcp.sh
- pyproject.toml
- src/smolvm/build.py
🚧 Files skipped from review as they are similar to previous changes (3)
- benchmarks/bench_file_io.py
- benchmarks/run_density_ramp_ec2.sh
- benchmarks/docs/README_DENSITY_RAMP.md
def _find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
Let HTTPServer pick the port.
_find_free_port() closes its probe socket before _start_http_server() binds, so another process can claim the port in between and make the benchmark fail sporadically.
🔧 Suggested race-free binding
-def _find_free_port() -> int:
- with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
- s.bind(("", 0))
- return s.getsockname()[1]
-
-
-def _start_http_server(port: int) -> http.server.HTTPServer:
+def _start_http_server() -> tuple[http.server.HTTPServer, int]:
"""Start a background HTTP server on the given port."""
- server = http.server.HTTPServer(("0.0.0.0", port), _BenchHandler)
+ server = http.server.HTTPServer(("0.0.0.0", 0), _BenchHandler)
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()
- return server
+ return server, int(server.server_address[1])
@@
- host_port = _find_free_port()
- server = _start_http_server(host_port)
+ server, host_port = _start_http_server()

Also applies to: 80-85, 97-98
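For illustration, a self-contained sketch of the race-free pattern the diff applies: bind to port 0 and read back the kernel-assigned port. `_Handler` here is a stand-in for the benchmark's `_BenchHandler`:

```python
import http.server
import threading
import urllib.request

class _Handler(http.server.BaseHTTPRequestHandler):
    # Stand-in for the benchmark's _BenchHandler.
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep benchmark output quiet
        pass

def start_http_server() -> tuple[http.server.HTTPServer, int]:
    # Port 0 asks the kernel for a free port; choosing and binding are
    # one atomic step, so no other process can grab it in between.
    server = http.server.HTTPServer(("0.0.0.0", 0), _Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # server_address is updated to the real port after bind.
    return server, server.server_address[1]

server, port = start_http_server()
print(urllib.request.urlopen(f"http://127.0.0.1:{port}/").read())
server.shutdown()
```

The probe-socket version can never be made safe by closing later, because `HTTPServer` must itself own the bound socket.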
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/bench_network_io.py` around lines 40 - 43, The current
_find_free_port() creates and closes a probe socket which allows a race where
another process can grab the port before _start_http_server() binds; instead,
change _start_http_server() (and any callers in this file) to bind to port 0 so
the HTTPServer picks an available port and then read the assigned port from the
bound server (e.g., server.server_address or server.socket.getsockname()[1])
after the server is created/started, remove or stop using _find_free_port() in
bench_network_io.py, and return or propagate the actual assigned port to the
benchmark code so no ephemeral-port race can occur.
gateway_result = vm.run("ip route | grep default | awk '{print $3}'")
gateway_ip = gateway_result.output.strip()
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Possible run-result fields =="
rg -n -C2 'stdout:|stderr:|output:|exit_code:' src/smolvm
echo
echo "== Existing attribute usage on vm.run results =="
rg -n -C2 '\.stdout\b|\.output\b' benchmarks src/smolvm

Repository: CelestoAI/SmolVM
Length of output: 12143
Consolidate to use .stdout consistently; .output introduces redundant double-stripping.
Line 160 uses gateway_result.output.strip(), but the output property already returns stripped output (self.stdout.strip()). Since the rest of the file consistently uses .stdout (lines 127, 138, 141–150, 206), change line 160 to gateway_result.stdout.strip() to match the pattern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/bench_network_io.py` around lines 159 - 160, The code uses
gateway_result.output.strip() to set gateway_ip which is inconsistent with the
rest of the file and duplicates stripping; update the assignment to use
gateway_result.stdout.strip() instead so it matches other uses (see vm.run(...)
returning gateway_result and the gateway_ip variable) and remove the redundant
.output usage.
def overhead_str(base_ms: float, test_ms: float) -> str:
    """Return a human-readable overhead string like '2.5x' or '+150ms'."""
    if base_ms <= 0:
        return "N/A"
    ratio = test_ms / base_ms
    diff = test_ms - base_ms
    return f"{ratio:.1f}x (+{diff:.0f}ms)"
Negative deltas render as +-Nms here.
When test_ms < base_ms, this produces output like 0.8x (+-20ms). Use signed formatting so faster cases read cleanly.
Possible fix
- return f"{ratio:.1f}x (+{diff:.0f}ms)"
+ return f"{ratio:.1f}x ({diff:+.0f}ms)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/helpers.py` around lines 231 - 237, The overhead_str function
currently builds the delta as "(+{diff:.0f}ms)" which yields "+-Nms" when diff
is negative; update the formatted delta to include the sign in the numeric
format instead of hardcoding a "+" so negative deltas render as "(-20ms)".
Concretely, in overhead_str use the computed diff variable and format it with a
signed format specifier (e.g., {diff:+.0f}ms) and remove the hardcoded "+" in
the surrounding string so the returned string uses ratio and signed diff (use
ratio, diff, base_ms, test_ms as referenced).
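The fixed helper, as proposed in the diff above, relies on Python's `+` format flag emitting the sign itself:

```python
def overhead_str(base_ms: float, test_ms: float) -> str:
    """Return a human-readable overhead string like '2.5x (+150ms)'."""
    if base_ms <= 0:
        return "N/A"
    ratio = test_ms / base_ms
    diff = test_ms - base_ms
    # {diff:+.0f} prints its own sign, so slower runs read "(+150ms)"
    # and faster runs "(-20ms)" -- no hardcoded "+" to collide with "-".
    return f"{ratio:.1f}x ({diff:+.0f}ms)"

print(overhead_str(100.0, 250.0))  # 2.5x (+150ms)
print(overhead_str(100.0, 80.0))   # 0.8x (-20ms)
```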
## File Structure

Each file is a JSON array of event objects written sequentially during the run. Records appear in chronological order: one `boot_ok` or `boot_fail` per VM launch attempt, followed by a single `sustain_check`, and finally a `teardown` record.
Document the actual file-write behavior.
density_ramp.py buffers events in memory and writes the JSON once at the end, so there is no sequentially-updated file during the run. These sections also overstate the event shape: sustain_check is omitted when --sustain-sec 0 or when no VM ever boots.
Also applies to: 124-125
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/results/README.md` at line 7, Update the README to reflect actual
file-write behavior: describe that density_ramp.py accumulates events in memory
and writes a single JSON array to disk at the end of the run (not a
sequentially-updated file), and clarify the actual event ordering and
optionality of records produced by density_ramp.py — record types include
boot_ok/boot_fail per VM attempt and teardown, but sustain_check is only emitted
when --sustain-sec > 0 and at least one VM booted (it is omitted if
--sustain-sec 0 or no VM ever boots); update the wording around event "shape"
accordingly so boot/sustain/teardown semantics match density_ramp.py's behavior.
# Install smolvm and psutil into a venv (avoids PEP 668 externally-managed-env error)
python3 -m venv ~/venv
~/venv/bin/pip install --quiet smolvm psutil
This runs against PyPI smolvm, not the checkout under review.
Only density_ramp.py is copied to the instance. The actual runtime package comes from pip install smolvm, so any unreleased changes in src/smolvm or pyproject.toml on this branch are absent from the benchmark run.
Also applies to: 232-245
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/run_density_ramp_gcp.sh` around lines 219 - 221, The script
currently does "pip install smolvm" which installs the PyPI release rather than
the local checkout; change the install step so the benchmark uses the
checked-out package under review (not PyPI) by installing the local repository
into the venv (e.g., run the venv pip install against the repo root or use pip
install -e . from the checkout) so that density_ramp.py runs with the branch
changes in src/smolvm and pyproject.toml instead of the PyPI smolvm package.
Actionable comments posted: 1
🧹 Nitpick comments (2)
benchmarks/bench_file_io.py (2)
84-93: Consider using a context manager for reliable cleanup.
If an exception occurs during file creation, the temp directory won't be cleaned up. While this is minor for a benchmark, using `tempfile.TemporaryDirectory` would be cleaner.
♻️ Suggested improvement

 def _do():
-    tmpdir = tempfile.mkdtemp()
-    for i in range(n):
-        with open(os.path.join(tmpdir, f"job_{i}.json"), "w") as f:
-            f.write(_1KB_LINE)
-    # Cleanup
-    import shutil
-
-    shutil.rmtree(tmpdir)
+    with tempfile.TemporaryDirectory() as tmpdir:
+        for i in range(n):
+            with open(os.path.join(tmpdir, f"job_{i}.json"), "w") as f:
+                f.write(_1KB_LINE)
 return time_call(_do)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/bench_file_io.py` around lines 84 - 93, The tmpdir created in the _do function uses tempfile.mkdtemp and a separate shutil.rmtree cleanup which will be skipped if an exception occurs; change _do to use tempfile.TemporaryDirectory as a context manager so the temporary directory is automatically removed on exit (and remove the manual shutil.rmtree call), i.e., locate the _do function and replace the mkdtemp/open/write/rmtree pattern with a with tempfile.TemporaryDirectory() as tmpdir: block around the file creation loop so cleanup is reliable.
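The suggested context-manager version can be sketched end to end. `_1KB_LINE` and `time_call` are stand-ins mirroring the names in the review, not the benchmark's actual definitions:

```python
import os
import tempfile
import time

_1KB_LINE = "x" * 1023 + "\n"  # stand-in for the benchmark's constant

def time_call(fn) -> float:
    """Run fn once and return elapsed milliseconds."""
    t0 = time.perf_counter()
    fn()
    return (time.perf_counter() - t0) * 1000.0

def bench_many_small_files(n: int = 100) -> float:
    def _do():
        # TemporaryDirectory removes the tree even if a write raises,
        # unlike the mkdtemp + manual shutil.rmtree pattern.
        with tempfile.TemporaryDirectory() as tmpdir:
            for i in range(n):
                with open(os.path.join(tmpdir, f"job_{i}.json"), "w") as f:
                    f.write(_1KB_LINE)
    return time_call(_do)

print(f"{bench_many_small_files():.1f} ms")
```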
220-223: Redundant path manipulation given the import fallback.
The `sys.path.insert` logic appears redundant since lines 14-33 already handle both relative and absolute imports with a try/except fallback. This path manipulation is also fragile as it depends on `"benchmarks"` being in the file path.
♻️ Simplified entry point

 if __name__ == "__main__":
-    if "benchmarks" in __file__:
-        sys.path.insert(0, str(__import__("pathlib").Path(__file__).parent))
     run_benchmark()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/bench_file_io.py` around lines 220 - 223, Remove the fragile sys.path.insert() conditional in the module-level entrypoint and rely on the existing try/except import fallback already implemented earlier; specifically, delete the "if 'benchmarks' in __file__: sys.path.insert(...)" block and leave the guarded main invocation that calls run_benchmark(), ensuring the script simply executes run_benchmark() under if __name__ == "__main__": without manipulating sys.path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/bench_file_io.py`:
- Line 105: Fix the typo in the inline comment that reads "# baslines on host"
in benchmarks/bench_file_io.py by changing "baslines" to "baselines" so the
comment reads "# baselines on host"; locate the comment near the benchmarking
setup (search for the exact string "baslines on host") and update it
accordingly.
---
Nitpick comments:
In `@benchmarks/bench_file_io.py`:
- Around line 84-93: The tmpdir created in the _do function uses
tempfile.mkdtemp and a separate shutil.rmtree cleanup which will be skipped if
an exception occurs; change _do to use tempfile.TemporaryDirectory as a context
manager so the temporary directory is automatically removed on exit (and remove
the manual shutil.rmtree call), i.e., locate the _do function and replace the
mkdtemp/open/write/rmtree pattern with a with tempfile.TemporaryDirectory() as
tmpdir: block around the file creation loop so cleanup is reliable.
- Around line 220-223: Remove the fragile sys.path.insert() conditional in the
module-level entrypoint and rely on the existing try/except import fallback
already implemented earlier; specifically, delete the "if 'benchmarks' in
__file__: sys.path.insert(...)" block and leave the guarded main invocation that
calls run_benchmark(), ensuring the script simply executes run_benchmark() under
if __name__ == "__main__": without manipulating sys.path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 20ae6815-236a-41f1-88d9-c5e529f26a0d
📒 Files selected for processing (1)
benchmarks/bench_file_io.py
print_header("File IO Benchmark")

# baslines on host
Typo: "baslines" should be "baselines".
Minor typo in the comment.
✏️ Fix
- # baslines on host
+ # baselines on host
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/bench_file_io.py` at line 105, Fix the typo in the inline comment
that reads "# baslines on host" in benchmarks/bench_file_io.py by changing
"baslines" to "baselines" so the comment reads "# baselines on host"; locate the
comment near the benchmarking setup (search for the exact string "baslines on
host") and update it accordingly.
Port Docker/nftables install, ulimit/sysctl tuning (nofile, nproc, ip_local_port_range, somaxconn), and post-run network/log diagnostics from run_density_ramp_ec2.sh to run_density_ramp_gcp.sh, adapted for Debian 12 (apt-get, docker.io package, no firewalld). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Only append timing measurements when curl/vm.run succeeds (returncode==0 and non-empty stdout). Failed requests due to network issues or startup races previously produced invalid latency data; now they are logged as warnings and excluded from stats_summary. Fixes CodeRabbit discussion_r2903439531. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
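The guarded-measurement pattern this commit applies can be sketched generically. `timed_run` is a hypothetical helper, not the benchmark's actual function; it shows only the success checks described above (zero exit code and non-empty stdout):

```python
import subprocess
import time

def timed_run(cmd: list[str], samples: list[float]) -> None:
    """Append a latency sample only when the command succeeds with output."""
    t0 = time.perf_counter()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"warning: sample dropped: {exc}")
        return
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    if proc.returncode == 0 and proc.stdout:
        samples.append(elapsed_ms)  # only successful calls reach the stats
    else:
        print(f"warning: exit {proc.returncode}, sample dropped")

samples: list[float] = []
timed_run(["echo", "hi"], samples)          # succeeds -> recorded
timed_run(["sh", "-c", "exit 3"], samples)  # fails -> dropped with warning
print(len(samples))  # 1
```

Without the guard, a startup race that makes `curl` exit non-zero would still contribute a (very fast, meaningless) latency sample.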
os.getuid() and os.getgid() were called in _create_ext4_with_docker() but the os module was not imported, causing a NameError at runtime. Fixes CodeRabbit discussion_r2903439577. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously, cleanup errors were silently swallowed (except Exception: pass), hiding leaked disks/sockets near the resource ceiling. Now cleanup_error is captured and included in the returned error dict so callers and snapshots can see delete failures. Fixes CodeRabbit discussion_r2903439539. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
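A minimal sketch of the error-surfacing cleanup path described above. `wait_for_ssh` and `delete` mirror names from this PR, but the structure here is illustrative, not the actual `density_ramp.py` code:

```python
from typing import Any, Callable

def boot_with_cleanup(vm_factory: Callable[[], Any]) -> dict:
    """Boot a VM; on failure, try to delete it and surface both errors."""
    vm, meta = None, {}
    try:
        vm = vm_factory()
        vm.wait_for_ssh()
        meta["ok"] = True
    except Exception as e:
        meta["error"] = str(e)
        if vm is not None:
            try:
                vm.delete()
            except Exception as e_cleanup:
                # The old `except Exception: pass` hid leaked disks/sockets;
                # recording the failure lets callers and snapshots see it.
                meta["cleanup_error"] = str(e_cleanup)
    return meta

class FakeVM:
    """Stand-in VM whose SSH wait and delete both fail."""
    def wait_for_ssh(self):
        raise RuntimeError("ssh timeout")
    def delete(self):
        raise OSError("disk busy")

print(boot_with_cleanup(FakeVM))
# {'error': 'ssh timeout', 'cleanup_error': 'disk busy'}
```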
…mode boot_one() now returns seq as the first element of its tuple so callers can use the original launch index instead of completion order. In the as_completed() loop, batch_ok_so_far tracks successful boots within the batch so density_at_boot reflects the true concurrent count for each VM (not just len(running_vms)+1 for all). batch_results are sorted by seq before processing to emit boot_ok/boot_fail records in launch order. Fixes CodeRabbit discussion_r2903439537. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
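The seq-preserving mechanics described in this commit can be sketched as follows. The worker body is a stand-in; only the return-seq / sort-by-seq / running-count pattern mirrors the change:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def boot_one(seq: int) -> tuple[int, bool, float]:
    # Stand-in worker: returns its launch index first, so callers never
    # depend on completion order. Pretend every 4th boot fails.
    ok = seq % 4 != 3
    boot_time = 0.1 * seq
    return seq, ok, boot_time

def run_batch(n: int) -> list[dict]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(boot_one, seq) for seq in range(n)]
        results = [f.result() for f in as_completed(futures)]
    results.sort(key=lambda r: r[0])  # emit records in launch order
    records, ok_so_far = [], 0
    for seq, ok, boot_time in results:
        if ok:
            ok_so_far += 1
        records.append({
            "vm_seq": seq,
            "event": "boot_ok" if ok else "boot_fail",
            "density_at_boot": ok_so_far,  # true concurrent count so far
            "boot_time_s": boot_time,
        })
    return records
```

Without the sort, `as_completed` yields results in finish order, so `vm_seq` and `density_at_boot` would drift from the actual launch sequence.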
🤖 Generated with Claude Code