feat: add benchmark suite with density ramp and common I/O benchmarks #46
craigpfeifer wants to merge 21 commits into CelestoAI:main
Conversation
Adds a VM density ramp benchmark that boots VMs until failure, measuring boot latency, host resource consumption (CPU, memory, disk, TAP devices, Firecracker RSS), and sustain health checks across all VMs concurrently. Includes README and v0 benchmark/review reports in benchmarks/docs/. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When building the ext4 rootfs via _create_ext4_with_docker (used when the loopfs helper is absent, e.g. on a bare pip install on EC2), Python's tar.extractall() runs as a non-root user and cannot chown files to uid=0. This left /var/empty and /etc/ssh owned by the host user's uid in the image. sshd refuses to start if /var/empty is not root-owned, so port 22 never opened and all VM SSH attempts timed out. Fix: before mke2fs reads the source tree, the Docker container (running as root) chowns /var/empty and /etc/ssh to root:root and sets /var/empty to 755. The volume mount is changed from :ro to writable to allow the chown. Also adds diagnostic tooling to density_ramp benchmark: - Print Firecracker log on first boot failure to expose sshd errors - Stop orphaned VM on wait_for_ssh failure (resource leak fix) - EC2 script: disable firewalld (conflicts with SmolVM nftables rules) - EC2 script: post-run diagnostics section (network state + VM log) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous fix chowned /work/rootfs/etc/ssh inside Docker, but that path is a bind-mount from the host's tmpdir. After Docker exits the temp dir has root-owned files that ec2-user cannot remove, causing a PermissionError on TemporaryDirectory cleanup. Fix: cp -a the rootfs to /work/rootfs-fixed inside the container, chown there, and run mke2fs on the copy. The host bind-mount is restored to :ro so the host directory is never modified. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… cleanup After mke2fs completes, run a Docker chown pass to restore the ownership of rootfs_dir back to the current user. This prevents PermissionError in TemporaryDirectory cleanup when root-owned files are present — either from the (now-fixed) :ro bypass issue or from residual files left by a previous failed build in the same /tmp path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Python's tarfile.extractall() silently drops root ownership (uid=0) when the host user is non-root. This caused sshd inside the guest to reject /root/.ssh/authorized_keys (wrong owner) and crash on /var/empty — both paths require root ownership enforced by OpenSSH's strict-modes check. Fix: skip Python's tar extraction entirely. Mount the tarball read-only into a Docker container and let Docker (running as root) extract it with correct uid/gid preserved. Exclude ./dev to avoid CAP_MKNOD failures; the guest init mounts devtmpfs at boot so rootfs device nodes are unused. This also eliminates the rootfs-dir bind-mount that caused root-owned files to appear in the temp dir, so TemporaryDirectory cleanup is clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 Walkthrough

Adds a comprehensive benchmarking suite and helpers for SmolVM (command/file/network benchmarks, density-ramp harness with telemetry and lifecycle tooling), platform-specific sandbox helpers, docs and result formats, EC2/GCP runner scripts, an optional psutil benchmark dependency, and a Docker-based secure tar extraction change in image build.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Host as Host Process
    participant SmolVM as SmolVM Controller
    participant VM as Guest VM
    participant Metrics as Metrics Store
    Host->>SmolVM: build_image / boot_one(seq, kernel, rootfs)
    SmolVM->>VM: launch VM (Firecracker/TAP/socket)
    VM-->>SmolVM: health / SSH ready
    SmolVM-->>Host: return boot result (ok/fail, boot_time)
    Host->>Metrics: record boot event & snapshot
    loop sustain checks
        Host->>VM: check_vm_alive()
        VM-->>Host: alive / metrics
        Host->>Metrics: append sustain result
    end
    Host->>SmolVM: cleanup()
    SmolVM->>VM: terminate VM(s)
    VM-->>SmolVM: shutdown complete
    SmolVM-->>Host: report teardown metrics
    Host->>Metrics: write JSON results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/smolvm/build.py (1)
690-724: ⚠️ Potential issue | 🟠 Major — The Docker fallback can leave root-owned cache files on Linux with rootful Docker.
The container writes the ext4 image to the bind-mounted host directory as root. On Linux with rootful Docker, the resulting rootfs.ext4 will be owned by UID 0, breaking cache reuse and cleanup for the non-root invoking user. Add a post-execution chown to fix ownership after the Docker container exits, or implement user namespace remapping with --userns-remap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/build.py` around lines 690 - 724, The Docker fallback can leave the produced ext4 image owned by root; after the subprocess.run that executes the Docker "run" (the block that uses shell_cmd, tar_path, rootfs_path and invokes "docker" via subprocess.run), add a post-run step to chown the generated file to the current invoking user (use rootfs_path or the constructed rootfs_name to locate the output) — e.g., call os.chown(rootfs_path, os.getuid(), os.getgid()) or run a safe subprocess(["chown", f"{os.getuid()}:{os.getgid()}", str(rootfs_path)]) — or alternatively invoke Docker with user namespace remapping (add "--userns-remap=default") to avoid root-owned output; ensure this fix runs only on platforms where rootful Docker produces UID 0 files (Linux) and does not break existing error handling around the subprocess.run that used shell_cmd.
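A minimal sketch of the post-run ownership fix suggested in this prompt. The function name `chown_to_invoker` and the use of an `alpine` helper container are illustrative assumptions, not part of the codebase:

```python
import os
import subprocess
from pathlib import Path


def chown_to_invoker(rootfs_path: Path) -> None:
    """Hand a Docker-produced file back to the invoking user (sketch).

    No-op on non-POSIX platforms and when the file is already owned by
    the current user (e.g. Docker Desktop on macOS remaps ownership).
    """
    if os.name != "posix":
        return
    st = rootfs_path.stat()
    uid, gid = os.getuid(), os.getgid()
    if (st.st_uid, st.st_gid) == (uid, gid):
        return
    # A non-root user cannot chown a root-owned file directly, so run
    # the chown inside a throwaway root container over the parent dir.
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{rootfs_path.parent}:/out",
         "alpine", "chown", f"{uid}:{gid}", f"/out/{rootfs_path.name}"],
        check=True,
    )
```

Calling this right after the `subprocess.run` that executes the Docker build keeps the existing error handling intact, since the chown only runs when the build succeeded and left a root-owned file behind.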
🧹 Nitpick comments (3)
benchmarks/helpers.py (1)
26-40: Record elapsed_ms even when the timed block raises.
With @contextmanager, the code after yield is skipped unless it is in finally. A timeout or subprocess error in the with body currently leaves t.elapsed_ms at 0.0.

♻️ Suggested fix

```diff
 @contextmanager
 def timer():
     """Context manager that records elapsed wall-clock time in milliseconds.
@@
     """
     t = _TimerResult()
     t.start = time.perf_counter()
-    yield t
-    t.end = time.perf_counter()
-    t.elapsed_ms = (t.end - t.start) * 1000.0
+    try:
+        yield t
+    finally:
+        t.end = time.perf_counter()
+        t.elapsed_ms = (t.end - t.start) * 1000.0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/helpers.py` around lines 26 - 40, The timer context manager does not set t.end and t.elapsed_ms if the with-block raises because the post-yield code runs only on normal exit; modify the timer function (the contextmanager named timer that returns an _TimerResult instance) to ensure the stop-time and elapsed_ms are computed in a finally block so they are always set even if the with body raises—move the t.end and t.elapsed_ms assignments into a try/finally around the yield (or use try: yield t finally: ...) so _TimerResult fields are populated on both success and exception.

pyproject.toml (1)
29-35: Keep psutil out of the base runtime dependencies.
benchmarks/ is not shipped in the wheel, so this makes every smolvm install pull a benchmark-only dependency. Moving it to a benchmarks or dev extra keeps the runtime package leaner for normal consumers.

♻️ Suggested packaging split

```diff
 dependencies = [
     "paramiko>=3.0",
-    "psutil>=7.2.2",
     "pydantic>=2.0",
     "requests>=2.28",
     "requests-unixsocket>=0.3",
 ]

 [project.optional-dependencies]
+benchmarks = [
+    "psutil>=7.2.2",
+]
 dev = [
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 29 - 35, The pyproject currently lists "psutil>=7.2.2" in the base dependencies array which forces all installs to pull a benchmark-only dependency; remove "psutil" from the dependencies list and add it to an optional extra (e.g., create or update the [project.optional-dependencies] entry named "benchmarks" or "dev" and include "psutil>=7.2.2" there) so normal consumers don't install it by default; ensure any README or install docs reference the new extra name for developers running benchmarks.

benchmarks/bench_command_latency.py (1)
10-20: Make the helpers import package-safe.
from helpers import ... only works when the script directory is already on sys.path. Importing this module as benchmarks.bench_command_latency fails before the __main__ path tweak runs, so run_benchmark() is not reusable from a driver or test harness.

♻️ Suggested import pattern

```diff
-from helpers import (
-    format_stats,
-    is_sandbox_exec_available,
-    overhead_str,
-    print_header,
-    print_result,
-    print_subheader,
-    run_sandboxed,
-    stats_summary,
-    time_call_n,
-)
+try:
+    from .helpers import (
+        format_stats,
+        is_sandbox_exec_available,
+        overhead_str,
+        print_header,
+        print_result,
+        print_subheader,
+        run_sandboxed,
+        stats_summary,
+        time_call_n,
+    )
+except ImportError:
+    from helpers import (
+        format_stats,
+        is_sandbox_exec_available,
+        overhead_str,
+        print_header,
+        print_result,
+        print_subheader,
+        run_sandboxed,
+        stats_summary,
+        time_call_n,
+    )
```

Also applies to: 129-132
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/bench_command_latency.py` around lines 10 - 20, The top-level import "from helpers import (...)" is not package-safe; change it to a relative package import so the module can be imported as benchmarks.bench_command_latency (e.g. replace with "from .helpers import format_stats, is_sandbox_exec_available, overhead_str, print_header, print_result, print_subheader, run_sandboxed, stats_summary, time_call_n"); also update the other similar imports around the later block (lines referenced in the comment) to use the same relative form so functions like run_sandboxed, time_call_n, etc., resolve when the module is imported as part of the package.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/bench_file_io.py`:
- Around line 155-165: The VM bulk-read loop currently uses vm.run("cat
/tmp/bench_read.dat") which streams ~10MB back over the command channel
(measuring SSH/stdout transport), so replace the VM-side command with a
guest-local read that doesn’t transmit file contents (e.g., use dd
if=/tmp/bench_read.dat of=/dev/null bs=1024 count=10240) in the vm.run calls
used by time_call (the block building vm_bulk_r_times and vm_bulk_r_stats) and
the analogous block at lines ~185-198; keep using time_call and stats_summary
but ensure you only change the vm.run command string to dd so the benchmark
measures guest disk I/O instead of transport overhead.
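The command swap this prompt describes can be sketched as follows; the paths and sizes come from the comment, and `bytes_read` is a purely illustrative helper for sanity-checking the dd arguments:

```python
# Old command streamed ~10 MB of file contents back over the SSH command
# channel, so the benchmark timed transport rather than guest disk reads:
OLD_CMD = "cat /tmp/bench_read.dat"

# dd discards the data inside the guest; only the filesystem read is timed.
NEW_CMD = "dd if=/tmp/bench_read.dat of=/dev/null bs=1024 count=10240"


def bytes_read(cmd: str) -> int:
    """Rough byte count implied by a dd invocation (illustrative only)."""
    opts = dict(kv.split("=") for kv in cmd.split() if "=" in kv)
    return int(opts["bs"]) * int(opts["count"])
```

Only the command string passed to `vm.run` changes; the `time_call`/`stats_summary` plumbing around it stays as-is.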
In `@benchmarks/bench_network_io.py`:
- Around line 132-168: After attempting to install curl, ensure you re-check the
VM for HTTP clients and fail fast if none are available: re-run the checks that
set has_curl and has_wget (e.g., using vm.run("which curl") and vm.run("which
wget") or equivalent) after the apk add step, and if both has_curl and has_wget
remain false, abort the benchmark with a clear error (raise/assert or exit)
instead of proceeding to build fetch_cmd; update the logic around vm.run,
has_curl, has_wget, and fetch_cmd so the later HTTP-request block only runs when
a confirmed client exists.
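A sketch of the fail-fast re-check described above. The helper name `ensure_http_client` is an assumption, and `vm.run` is assumed to return an object with a `returncode` attribute; adjust to the real SmolVM API:

```python
def ensure_http_client(vm) -> str:
    """Return 'curl' or 'wget' if usable in the guest, trying a curl
    install first; raise RuntimeError when neither is available."""
    def has(tool: str) -> bool:
        # `which` exits 0 only when the tool is on the guest PATH
        return vm.run(f"which {tool}").returncode == 0

    if has("curl"):
        return "curl"
    vm.run("apk add --no-cache curl")  # best-effort install
    if has("curl"):
        return "curl"
    if has("wget"):
        return "wget"
    raise RuntimeError("no HTTP client (curl/wget) available in guest")
```

The HTTP-request block then builds `fetch_cmd` from the returned tool name instead of assuming one exists.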
In `@benchmarks/density_ramp.py`:
- Around line 294-315: The current parallel boot path collects all results with
batch_results = [f.result() for f in as_completed(futures)] and only takes
Metrics.snapshot(...) afterwards, which produces post-batch metrics; instead,
process each future as it completes: iterate over as_completed(futures), call
f.result() immediately, compute density using running_vms plus (1 if vm is not
None else 0), call Metrics.snapshot(density, args.socket_dir) right there, then
update i and failed_this_batch and emit the boot_ok/boot_fail handling; modify
the ThreadPoolExecutor block that builds futures and the subsequent
batch_results handling so snapshots are taken per completed future rather than
after collecting the whole batch.
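The per-completion snapshot pattern described above can be sketched like this; `boot_one`, `snapshot`, and `running_vms` are stand-ins for the benchmark's own helpers and state:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def boot_batch(boot_one, seqs, snapshot, running_vms):
    """Boot a batch in parallel, snapshotting as each boot completes
    so metrics reflect the density at readiness time, not post-batch."""
    failed = 0
    with ThreadPoolExecutor(max_workers=len(seqs)) as pool:
        futures = [pool.submit(boot_one, s) for s in seqs]
        for fut in as_completed(futures):
            vm = fut.result()
            # density at this instant: already-running VMs plus this one
            density = len(running_vms) + (1 if vm is not None else 0)
            snapshot(density)
            if vm is None:
                failed += 1
            else:
                running_vms.append(vm)
    return failed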
- Around line 59-64: The snapshot function (and the similar code at the later
occurrence) is measuring disk usage with psutil.disk_usage(Path.home()) which
reports the home filesystem rather than SmolVM’s storage; update snapshot (and
the other occurrence around lines 257-258) to check for SmolVM storage paths (at
least Path.home()/".smolvm" and Path.home()/".local/state/smolvm" or an
environment/configured storage path) and call psutil.disk_usage on the first
existing storage mount, falling back to Path.home() if none exist; ensure you
convert the chosen path to a string for psutil.disk_usage and compute disk in GB
as before so host_disk_used_gb reflects the actual storage mount used for
images/clones.
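A sketch of the storage-path selection this prompt asks for. The candidate paths come from the comment; the `SMOLVM_HOME` environment override is an assumption, not a documented variable:

```python
import os
from pathlib import Path


def storage_mount() -> Path:
    """Pick the filesystem that actually holds SmolVM images/clones,
    falling back to the home directory when no storage path exists."""
    candidates = [
        Path(os.environ["SMOLVM_HOME"]) if "SMOLVM_HOME" in os.environ else None,
        Path.home() / ".smolvm",
        Path.home() / ".local/state/smolvm",
    ]
    for p in candidates:
        if p is not None and p.exists():
            return p.resolve()  # follow the NVMe symlink on EC2
    return Path.home()


# snapshot() would then use:
#   psutil.disk_usage(str(storage_mount())).used / 1e9  -> host_disk_used_gb
```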
- Around line 104-112: The cleanup() function currently calls vm.stop() for each
VM which only powers off the VM; change it to call vm.delete() on each vm in
running_vms so host-side disk and network cleanup occurs (replace vm.stop() with
vm.delete()). Remove the silent "except Exception: pass" and instead catch
Exception as e and log the failure (e.g., using the module logger or
logging.exception/logging.error) including the VM identifier and exception
details so teardown errors are visible; ensure cleanup() still returns elapsed
seconds.
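The cleanup shape described above might look like this; it assumes VMs expose `delete()` and a `name` attribute, per the review comment:

```python
import logging
import time

log = logging.getLogger("density_ramp")


def cleanup(running_vms) -> float:
    """Tear down every VM with delete() (which releases host-side disk
    clones and network state, unlike stop()) and log failures instead
    of swallowing them. Returns elapsed seconds."""
    start = time.perf_counter()
    for vm in running_vms:
        try:
            vm.delete()
        except Exception:
            log.exception("failed to delete VM %s", getattr(vm, "name", vm))
    running_vms.clear()
    return time.perf_counter() - start
```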
In `@benchmarks/docs/README_DENSITY_RAMP.md`:
- Around line 112-114: The README currently states the ramp stops when any VM in
a batch fails, but the implementation in benchmarks/density_ramp.py uses the
condition failed_this_batch == batch_size to exit early; update the
README_DENSITY_RAMP.md failure-handling bullet to accurately describe that a
batch is only considered failed (and the ramp stops) when all VMs in the batch
fail (i.e., failed_this_batch equals batch_size), or alternatively change the
code in density_ramp.py to early-exit on any single VM failure if you intend the
former behavior—reference the failed_this_batch and batch_size variables and the
failure-check logic in density_ramp.py when making the edit.
- Around line 103-109: Update the README and example JSON schema so they match
the actual output fields emitted by benchmarks/density_ramp.py: replace
references to `seq`/`kvm_vms`/“KVM VM count” with the real field names `vm_seq`,
`density_at_boot`, `host_disk_used_gb`, `tap_count`, `fc_socket_count`, and
`sqlite_db_kb`, and update any example payloads and descriptions (including the
section currently describing host metrics around lines 120-145) to show the
correct keys and meanings; ensure the README’s metric list and the sample JSON
exactly mirror the dictionary keys written by the script.
In `@benchmarks/docs/v0_density_estimation_benchmark.md`:
- Around line 13-26: The example has a malformed code fence and doesn't clean up
started VMs; fix by making the block a proper python fenced section and change
the loop to try: while True: ... except Exception as e: print(...) finally:
iterate the stored vms list (vms) in reverse and call vm.stop() safely (catch
and ignore stop errors) so every SmolVM started via vm.start() is stopped on
exit; reference SmolVM, vms, vm.start(), and vm.stop() when locating the
snippet.
In `@benchmarks/docs/v0_review_density_ramp_report.md`:
- Around line 25-28: Update the description to document that boot_time_s
represents end-to-end readiness rather than just the SmolVM.start() call: state
that in benchmarks/density_ramp.py the timing is measured from immediately
before SmolVM(config) construction through until wait_for_ssh(timeout=30.0)
returns, and that density_at_boot is the sequence number recorded at that
readiness point; change any sentences that currently say “timed from
SmolVM(config).start() call to return” to reflect this end-to-end measurement
(same change for the other occurrence around lines 57–60).
- Around line 65-66: The report currently hardcodes /tmp for Firecracker
sockets; update the guidance for `fc_socket_count` (and the related cleanup
command referenced around lines 263-267) to mention that the socket directory is
configurable via the `--socket-dir` flag in `benchmarks/density_ramp.py` and
either use a placeholder like `<socket-dir>` (e.g.,
glob("<socket-dir>/fc-*.sock")) or explicitly note “defaults to /tmp if not
overridden”; ensure the text references `--socket-dir` and `fc_socket_count` so
readers know where to change the path.
In `@benchmarks/results/README.md`:
- Around line 167-178: The snippet assumes boots is non-empty; add a guard at
the start that checks if boots is empty and handles that case before computing
times or indexing boots[-1] (e.g., print Peak VMs and Failed boots and either
return/exit or print fallback messages for avg/max/peak mem/RSS). Specifically
update the logic around the variables boots, fails, times, and the prints that
use sum(times)/len(times), max(times), boots[-1]['host_mem_used_gb'], and
boots[-1]['firecracker_rss_mb'] so they are only executed when boots has at
least one element, and otherwise produce sensible fallback output.
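A guarded version of that analysis snippet might look like this; the field names follow the result JSON described in the surrounding comments:

```python
def summarize(boots, fails):
    """Print the density summary without assuming at least one
    successful boot."""
    print(f"Peak VMs: {len(boots)}  Failed boots: {len(fails)}")
    if not boots:
        print("No successful boots; skipping latency/memory stats.")
        return
    times = [b["boot_time_s"] for b in boots]
    print(f"Avg boot: {sum(times) / len(times):.2f}s  Max: {max(times):.2f}s")
    last = boots[-1]
    print(f"Peak host mem: {last['host_mem_used_gb']:.1f} GB  "
          f"Firecracker RSS: {last['firecracker_rss_mb']:.0f} MB")
```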
In `@benchmarks/run_density_ramp_ec2.sh`:
- Around line 123-130: The script currently authorizes SSH from 0.0.0.0/0 using
the aws ec2 authorize-security-group-ingress call, which is insecure; change the
logic around the aws ec2 authorize-security-group-ingress invocation (the block
using REGION, SECURITY_GROUP and --port 22) to require an explicit SSH_CIDR env
var or default to the caller’s single-IP /32 (fetching the public IP if
necessary) instead of 0.0.0.0/0; ensure the aws call uses that SSH_CIDR value
(e.g., --cidr "$SSH_CIDR") and add a guard that exits with an error when neither
SSH_CIDR nor a determinable caller IP is available.
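A sketch of the SSH_CIDR resolution this prompt asks for; the function name and the checkip endpoint are assumptions, while `SSH_CIDR` and the aws call shape come from the comment:

```shell
# Resolve the CIDR for the SSH ingress rule: honor an explicit SSH_CIDR,
# else fall back to the caller's public IP as a /32, else fail loudly.
resolve_ssh_cidr() {
  if [ -n "${SSH_CIDR:-}" ]; then
    echo "$SSH_CIDR"
    return 0
  fi
  local ip
  ip=$(curl -fsS https://checkip.amazonaws.com 2>/dev/null | tr -d '[:space:]')
  if [ -z "$ip" ]; then
    echo "ERROR: set SSH_CIDR; could not determine caller IP" >&2
    return 1
  fi
  echo "${ip}/32"
}

# aws ec2 authorize-security-group-ingress ... --port 22 \
#   --cidr "$(resolve_ssh_cidr)"
```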
- Around line 268-278: Make NVMe provisioning conditional by first checking
whether the device node /dev/nvme1n1 exists (e.g., using test -b /dev/nvme1n1 or
[ -e /dev/nvme1n1 ]) before running mkfs.ext4, mount, chown, and the symlink
steps; if the device is absent, skip the mkfs/mount and instead ensure the
fallback directories (~/.smolvm and ~/.local/state/smolvm) exist under
/home/ec2-user (mkdir -p and chown) so the rest of the script can proceed.
Reference the block that manipulates /dev/nvme1n1, /mnt/nvme, ln -sfn
/mnt/nvme/smolvm-images /home/ec2-user/.smolvm and ln -sfn
/mnt/nvme/smolvm-state /home/ec2-user/.local/state/smolvm to locate where to add
the conditional.
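The conditional provisioning described above could be sketched as a function; device and directory names come from the comment, and the mkfs/mount branch (which needs root) only runs when the block device is actually present:

```shell
provision_storage() {
  local dev="${1:-/dev/nvme1n1}" home="${2:-/home/ec2-user}"
  if [ -b "$dev" ]; then
    # NVMe scratch disk present: format, mount, and point SmolVM at it
    sudo mkfs.ext4 -F "$dev"
    sudo mkdir -p /mnt/nvme
    sudo mount "$dev" /mnt/nvme
    sudo chown "$(id -un)":"$(id -gn)" /mnt/nvme
    mkdir -p /mnt/nvme/smolvm-images /mnt/nvme/smolvm-state
    ln -sfn /mnt/nvme/smolvm-images "$home/.smolvm"
    ln -sfn /mnt/nvme/smolvm-state "$home/.local/state/smolvm"
  else
    # no NVMe device: ensure the fallback directories exist
    mkdir -p "$home/.smolvm" "$home/.local/state/smolvm"
  fi
}
```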
- Around line 31-33: The KEY_NAME variable is hardcoded to "smolvm-benchmark"
which disables the KEY_NAME env override and the validation block below; change
the assignment for KEY_NAME to use the default-expansion pattern (i.e.,
KEY_NAME="${KEY_NAME:-smolvm-benchmark}") so that the environment variable can
override it and the existing validation block that references KEY_NAME remains
effective; update any related comments to reflect that KEY_NAME can be provided
via environment.
---
Outside diff comments:
In `@src/smolvm/build.py`:
- Around line 690-724: The Docker fallback can leave the produced ext4 image
owned by root; after the subprocess.run that executes the Docker "run" (the
block that uses shell_cmd, tar_path, rootfs_path and invokes "docker" via
subprocess.run), add a post-run step to chown the generated file to the current
invoking user (use rootfs_path or the constructed rootfs_name to locate the
output) — e.g., call os.chown(rootfs_path, os.getuid(), os.getgid()) or run a
safe subprocess(["chown", f"{os.getuid()}:{os.getgid()}", str(rootfs_path)]) —
or alternatively invoke Docker with user namespace remapping (add
"--userns-remap=default") to avoid root-owned output; ensure this fix runs only
on platforms where rootful Docker produces UID 0 files (Linux) and does not
break existing error handling around the subprocess.run that used shell_cmd.
---
Nitpick comments:
In `@benchmarks/bench_command_latency.py`:
- Around line 10-20: The top-level import "from helpers import (...)" is not
package-safe; change it to a relative package import so the module can be
imported as benchmarks.bench_command_latency (e.g. replace with "from .helpers
import format_stats, is_sandbox_exec_available, overhead_str, print_header,
print_result, print_subheader, run_sandboxed, stats_summary, time_call_n"); also
update the other similar imports around the later block (lines referenced in the
comment) to use the same relative form so functions like run_sandboxed,
time_call_n, etc., resolve when the module is imported as part of the package.
In `@benchmarks/helpers.py`:
- Around line 26-40: The timer context manager does not set t.end and
t.elapsed_ms if the with-block raises because the post-yield code runs only on
normal exit; modify the timer function (the contextmanager named timer that
returns an _TimerResult instance) to ensure the stop-time and elapsed_ms are
computed in a finally block so they are always set even if the with body
raises—move the t.end and t.elapsed_ms assignments into a try/finally around the
yield (or use try: yield t finally: ...) so _TimerResult fields are populated on
both success and exception.
In `@pyproject.toml`:
- Around line 29-35: The pyproject currently lists "psutil>=7.2.2" in the base
dependencies array which forces all installs to pull a benchmark-only
dependency; remove "psutil" from the dependencies list and add it to an optional
extra (e.g., create or update the [project.optional-dependencies] entry named
"benchmarks" or "dev" and include "psutil>=7.2.2" there) so normal consumers
don't install it by default; ensure any README or install docs reference the new
extra name for developers running benchmarks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ab0b565b-cae3-4ccb-931c-834902f7ea4f
📒 Files selected for processing (14)
benchmarks/__init__.py
benchmarks/bench_command_latency.py
benchmarks/bench_file_io.py
benchmarks/bench_network_io.py
benchmarks/density_ramp.py
benchmarks/docs/README_DENSITY_RAMP.md
benchmarks/docs/v0_density_estimation_benchmark.md
benchmarks/docs/v0_review_density_ramp_report.md
benchmarks/helpers.py
benchmarks/results/README.md
benchmarks/results/density_20260309_001743.json
benchmarks/run_density_ramp_ec2.sh
pyproject.toml
src/smolvm/build.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/smolvm/build.py (1)
977-1003: ⚠️ Potential issue | 🟠 Major — Validate tar link targets before extraction.
The preflight check only validates member.name but does not check symlink and hardlink linkname values. A crafted exported image can contain symlink or hardlink entries with unsafe link targets (absolute paths or .. segments), and tar xf will honor those during extraction. Since the helper container has a bind-mounted /work/out directory, unsafe link targets can escape the intended /work/rootfs extraction directory. Validate link targets using member.issym(), member.islnk(), and member.linkname before invoking tar xf.

🔒 Proposed fix

```diff
 with tarfile.open(tar_path, "r") as tar:
     for member in tar.getmembers():
         member_path = Path(member.name)
         if member_path.is_absolute() or ".." in member_path.parts:
             raise ImageError(
                 f"Refusing to extract suspicious tar path from docker export: {member.name}"
             )
+        if member.issym() or member.islnk():
+            link_path = Path(member.linkname)
+            if link_path.is_absolute() or ".." in link_path.parts:
+                raise ImageError(
+                    "Refusing to extract suspicious tar link from docker export: "
+                    f"{member.name} -> {member.linkname}"
+                )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/smolvm/build.py` around lines 977 - 1003, The tar preflight loop that currently checks only member.name must also validate symlink and hardlink targets to prevent path traversal: inside the existing tar.getmembers() loop in build.py (the block handling tar_path and ImageError), for any member where member.issym() or member.islnk() inspect member.linkname and treat it like a Path, rejecting it if Path(member.linkname).is_absolute() or ".." appears in its parts (and handle empty/None linkname safely); raise the same ImageError with a clear message for unsafe link targets before running the external tar xf step so the extraction cannot escape /work/rootfs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/smolvm/build.py`:
- Around line 977-1003: The tar preflight loop that currently checks only
member.name must also validate symlink and hardlink targets to prevent path
traversal: inside the existing tar.getmembers() loop in build.py (the block
handling tar_path and ImageError), for any member where member.issym() or
member.islnk() inspect member.linkname and treat it like a Path, rejecting it if
Path(member.linkname).is_absolute() or ".." appears in its parts (and handle
empty/None linkname safely); raise the same ImageError with a clear message for
unsafe link targets before running the external tar xf step so the extraction
cannot escape /work/rootfs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
🧹 Nitpick comments (1)
benchmarks/run_density_ramp_gcp.sh (1)
201-203: Consider a more targeted KVM permission approach.
`chmod 666 /dev/kvm` makes the device world-accessible. While acceptable for an ephemeral benchmark instance, a slightly cleaner approach uses ACLs for the specific user: `sudo setfacl -m u:$USER:rw /dev/kvm`. This is a minor hygiene suggestion; the current approach is functional for the use case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/run_density_ramp_gcp.sh` around lines 201 - 203, Replace the permissive "sudo chmod 666 /dev/kvm" with a targeted ACL-based grant so only the runner user gets rw access; use setfacl to add user:$USER:rw on /dev/kvm (and ensure the acl utility is available or fall back to usermod -aG kvm "$USER")—update the script lines that currently call "sudo chmod 666 /dev/kvm" to invoke "sudo setfacl -m u:$USER:rw /dev/kvm" and include a small fallback/installation check for setfacl if needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/run_density_ramp_gcp.sh`:
- Around line 84-91: The firewall rule currently opens SSH to 0.0.0.0/0; change
the script to determine the operator's public IP into a variable (e.g. MY_IP
using curl fallback) and pass that to the gcloud command instead of 0.0.0.0/0 so
the gcloud compute firewall-rules create call (referenced by FIREWALL_RULE,
PROJECT, NETWORK_TAG) uses --source-ranges="$MY_IP/32"; ensure the script exits
or warns if MY_IP cannot be determined.
- Around line 206-210: The pip3 install line will fail on Debian 12 due to PEP
668 — update the installation step that runs "pip3 install --quiet smolvm
psutil" to either add "--break-system-packages" (i.e., "pip3 install
--break-system-packages --quiet smolvm psutil") or, preferably, create and use a
virtual environment: run "python3 -m venv ~/venv" then use "~/venv/bin/pip
install --quiet smolvm psutil" and ensure subsequent benchmark invocations use
"~/venv/bin/python3" (or source the venv) instead of the system "python3".
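The venv approach from this prompt, sketched as a small helper; the function name is illustrative, and the package install is shown commented because it depends on network access:

```shell
# Debian 12's PEP 668 marker blocks bare `pip3 install`; create a venv
# and use its interpreter for the benchmark instead.
setup_venv() {
  local venv="${1:-$HOME/venv}"
  python3 -m venv "$venv"
  echo "$venv/bin/python3"
}

# PY=$(setup_venv)
# "$PY" -m pip install --quiet smolvm psutil
# "$PY" benchmarks/density_ramp.py ...
```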
---
Nitpick comments:
In `@benchmarks/run_density_ramp_gcp.sh`:
- Around line 201-203: Replace the permissive "sudo chmod 666 /dev/kvm" with a
targeted ACL-based grant so only the runner user gets rw access; use setfacl to
add user:$USER:rw on /dev/kvm (and ensure the acl utility is available or fall
back to usermod -aG kvm "$USER")—update the script lines that currently call
"sudo chmod 666 /dev/kvm" to invoke "sudo setfacl -m u:$USER:rw /dev/kvm" and
include a small fallback/installation check for setfacl if needed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f8c9bb1-e4e3-47b3-964e-b9051114b0ab
📒 Files selected for processing (3)
benchmarks/density_20260309_010721.json
benchmarks/results/density_20260309_010721.json
benchmarks/run_density_ramp_gcp.sh
- build.py: validate symlink/hardlink targets in tar security check
- build.py: chown ext4 output to caller after Docker run (rootful Docker leaves root-owned file)
- density_ramp.py: vm.stop() → vm.delete() so disk clones are released; log cleanup failures
- density_ramp.py: measure disk usage on SmolVM storage path (follows NVMe symlink on EC2)
- density_ramp.py: capture metrics snapshot per-future in parallel path, not post-batch
- bench_network_io.py: abort with RuntimeError if no HTTP client after apk add attempt
- helpers.py: timer context manager uses try/finally so elapsed_ms is set on exception
- bench_*.py: package-safe try/except import for helpers (relative then absolute)
- run_density_ramp_ec2.sh: fix KEY_NAME env override; restrict SSH rule to caller IP; guard NVMe provisioning on device existence
- run_density_ramp_gcp.sh: restrict SSH rule to caller IP; use venv for pip install (PEP 668); use setfacl for /dev/kvm
- pyproject.toml: move psutil to optional [benchmarks] extra, out of base dependencies
- README_DENSITY_RAMP.md: update JSON schema (vm_seq, density_at_boot, etc.) and fix failure-handling description
- results/README.md: guard analysis snippet against empty boots list
- v0_review_density_ramp_report.md: correct boot_time_s as end-to-end SSH readiness; clarify --socket-dir for fc_socket_count
- v0_density_estimation_benchmark.md: fix malformed code fence; add VM cleanup to example

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cat /tmp/bench_read.dat measures SSH/stdout transport (~10MB through the command channel) rather than guest disk I/O, making the VM vs host comparison misleading. Use dd if=... of=/dev/null so only the guest filesystem read is timed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 13
♻️ Duplicate comments (4)
benchmarks/results/README.md (1)
134-145: ⚠️ Potential issue | 🟡 Minor — Guard these formulas for 0/1 successful boots.
A valid run can stop after zero or one boot_ok. In that case len(boots) - 1 hits zero here, and the RSS example below still divides by len(boots). The examples need a small precondition or fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/results/README.md` around lines 134 - 145, The examples compute per_vm_mem_mb and rss_per_vm_mb from boots without guarding for runs with 0 or 1 successful boots; add a short precondition that checks len(boots) and handle the edge cases (len(boots) == 0 and len(boots) == 1) before performing the divisions: if boots is empty, skip the computation or return a clear "no boots" result, and if there's exactly one boot, avoid dividing by (len(boots)-1) by using a defined fallback (e.g., treat per-VM delta as 0 or report "insufficient samples") and compute rss_per_vm_mb only when len(boots) > 0; update the explanatory text around per_vm_mem_mb and rss_per_vm_mb to mention these checks.

benchmarks/docs/v0_review_density_ramp_report.md (3)
265-269: ⚠️ Potential issue | 🟡 Minor — Keep the stale-socket example parameterized.
This section hardcodes `/tmp/fc-*.sock`, but the benchmark lets operators move sockets via `--socket-dir`. Use `<socket-dir>` here as well, or call out `/tmp` as only the default.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/docs/v0_review_density_ramp_report.md` around lines 265 - 269, The example hardcodes the Firecracker socket path (/tmp/fc-*.sock); update the docs to reference the configurable socket directory instead — e.g., replace the literal "/tmp/fc-*.sock" in the command example with a parameterized placeholder like "<socket-dir>/fc-*.sock" (or add a parenthetical note that /tmp is the default) so the command aligns with the benchmark's --socket-dir option and avoids confusing readers; update the Bash block containing "ls /tmp/fc-*.sock" accordingly.
47-52: ⚠️ Potential issue | 🟡 Minor — Describe teardown in terms of `vm.delete()`.
`benchmarks/density_ramp.py` now deletes each VM during cleanup. Saying `cleanup()` calls `vm.stop()` makes the disk-release explanation here inaccurate, because `stop()` and `delete()` have very different host-side cleanup behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/docs/v0_review_density_ramp_report.md` around lines 47 - 52, Update the teardown description to reflect that cleanup() calls vm.delete() (not vm.stop()), because delete() performs host-side rootfs/clone removal while stop() only halts the VM; update the text around Metrics.snapshot() to explain that the post-cleanup snapshot measures disk usage after vm.delete() so the delta reflects released clones, and mention benchmarks/density_ramp.py as the source of the deletion behavior change.
59-64: ⚠️ Potential issue | 🟡 Minor — Refresh the metrics table to match the current script.
The Source column still describes the pre-fix behavior: `boot_time_s` is no longer just wall-clock around `vm.start()`, and `host_disk_used_gb` is no longer taken from `psutil.disk_usage(home).used`. Readers comparing the JSON to this table will get the wrong interpretation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/docs/v0_review_density_ramp_report.md` around lines 59 - 64, Update the metrics table so the Source column accurately describes how each JSON field is computed in the current script: for `boot_time_s` replace "Wall clock around vm.start()" with the actual measurement used in the script (e.g., the timestamp delta from VM start to the script's success/health-check completion and the function or lines that compute it), for `host_disk_used_gb` remove the reference to psutil.disk_usage(home).used and instead describe the current method the script uses to measure disk growth (the specific variable/function that records rootfs/clone growth), and ensure `density_at_boot` and `host_mem_used_gb` entries reference the exact script variables or calls (e.g., `len(running_vms)` and `psutil.virtual_memory().used` if still accurate) so readers can directly map JSON fields to the code that produces them.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/bench_network_io.py`:
- Around line 105-127: The benchmark currently appends timing results even when
subprocess.run or vm.run fail; update the code paths around
time_call/subprocess.run (for host_* variables like host_http_times,
host_sse_lines, and similarly vm.run for guest_http_times/guest_sse_lines) to
check for errors before recording metrics: catch exceptions and verify
subprocess.CompletedProcess.returncode == 0 (and non-empty stdout for SSE) and
only append the measured time or count when the call succeeded; on failure
log/print the error and skip adding the measurement so
stats_summary/format_stats only reflect successful calls.
- Around line 40-43: The current _find_free_port() creates and closes a probe
socket which allows a race where another process can grab the port before
_start_http_server() binds; instead, change _start_http_server() (and any
callers in this file) to bind to port 0 so the HTTPServer picks an available
port and then read the assigned port from the bound server (e.g.,
server.server_address or server.socket.getsockname()[1]) after the server is
created/started, remove or stop using _find_free_port() in bench_network_io.py,
and return or propagate the actual assigned port to the benchmark code so no
ephemeral-port race can occur.
- Around line 159-160: The code uses gateway_result.output.strip() to set
gateway_ip which is inconsistent with the rest of the file and duplicates
stripping; update the assignment to use gateway_result.stdout.strip() instead so
it matches other uses (see vm.run(...) returning gateway_result and the
gateway_ip variable) and remove the redundant .output usage.
In `@benchmarks/density_ramp.py`:
- Around line 158-167: boot_one currently drops the input seq causing later
as_completed processing to use completion order and misreport vm sequence and
density; change boot_one (the function with signature
boot_one(...)->tuple[SmolVM|None, float, dict[str,Any]]) to include and return
the original seq (e.g., return (seq, vm, boot_time, meta) or similar), then
update the caller that iterates over futures/as_completed to unpack and use the
returned seq to assign vm_seq and to compute density_at_boot and update
running_vms/dead_vms based on the preserved seq/readiness counts rather than the
completion order; apply the same change pattern to the analogous boot/launch
code referenced around the other affected block so ordering and readiness counts
remain consistent in --parallel mode.
- Around line 186-193: The except block currently swallows errors from
vm.delete(), causing leaked resources to be hidden; modify the cleanup so that
when vm is not None you attempt vm.delete() but catch any Exception as e_cleanup
and record its details (e.g., "cleanup_error": str(e_cleanup) and optionally a
traceback) into the returned metadata dict alongside "error" and "fc_log"
instead of silently passing, so callers/snapshots can see delete failures;
locate the except handler using symbols _read_vm_log and vm.delete to implement
this change.
In `@benchmarks/helpers.py`:
- Around line 135-162: The current run_sandboxed function silently falls back to
running on the host when is_sandbox_exec_available() is false; change it to
explicitly indicate unsupported instead of performing a host run: update
run_sandboxed to raise a clear exception (e.g., NotImplementedError or a small
custom UnsupportedSandboxError) or return an explicit unsupported sentinel so
callers can detect and skip/emit "N/A", keep the sandbox path (writing profile
and subprocess.run with sandbox-exec) unchanged, and ensure the function
signature and docstring mention the new behavior so callers of run_sandboxed
(and any tests) can catch the exception or check the sentinel rather than
accidentally recording host timings.
- Around line 231-237: The overhead_str function currently builds the delta as
"(+{diff:.0f}ms)" which yields "+-Nms" when diff is negative; update the
formatted delta to include the sign in the numeric format instead of hardcoding
a "+" so negative deltas render as "(-20ms)". Concretely, in overhead_str use
the computed diff variable and format it with a signed format specifier (e.g.,
{diff:+.0f}ms) and remove the hardcoded "+" in the surrounding string so the
returned string uses ratio and signed diff (use ratio, diff, base_ms, test_ms as
referenced).
- Around line 165-180: The timing helpers should fail the benchmark when the
underlying command errors: in time_host replace the subprocess.run call with one
that enforces failure (use check=True or detect CompletedProcess.returncode and
raise subprocess.CalledProcessError) so non-zero exits don't produce a
misleading elapsed time, and in time_sandboxed ensure run_sandboxed either
raises on non-zero exit or returns a result you validate and raise on error
(i.e., propagate or raise subprocess.CalledProcessError) before computing and
returning the elapsed milliseconds; update both functions (time_sandboxed,
time_host) to validate command success and abort/raise on failure so failing
commands don't count as fast runs.
- Around line 120-125: The Seatbelt profile string _SEATBELT_PROFILE currently
only contains "(version 1)" and "(allow default)" which does not deny network
access; update the _SEATBELT_PROFILE constant to insert a network denial rule by
adding "(deny network*)" to block network operations while keeping filesystem
and process execution allowed, and if Unix-domain socket access is needed for
local services add an explicit allowance like "(allow network-outbound (to
unix-socket))" to preserve that case.
In `@benchmarks/results/README.md`:
- Line 7: Update the README to reflect actual file-write behavior: describe that
density_ramp.py accumulates events in memory and writes a single JSON array to
disk at the end of the run (not a sequentially-updated file), and clarify the
actual event ordering and optionality of records produced by density_ramp.py —
record types include boot_ok/boot_fail per VM attempt and teardown, but
sustain_check is only emitted when --sustain-sec > 0 and at least one VM booted
(it is omitted if --sustain-sec 0 or no VM ever boots); update the wording
around event "shape" accordingly so boot/sustain/teardown semantics match
density_ramp.py's behavior.
In `@benchmarks/run_density_ramp_gcp.sh`:
- Around line 215-224: The script must install and start Docker before running
density_ramp.py because ImageBuilder.build_alpine_ssh_key() uses Docker to
build/export the guest image; update benchmarks/run_density_ramp_gcp.sh to
install a Docker package (e.g., docker.io or docker-ce), enable/start the Docker
daemon, and ensure the runner can invoke docker (use sudo or add the user to the
docker group) before calling density_ramp.py so
ImageBuilder.build_alpine_ssh_key() can succeed.
- Around line 219-221: The script currently does "pip install smolvm" which
installs the PyPI release rather than the local checkout; change the install
step so the benchmark uses the checked-out package under review (not PyPI) by
installing the local repository into the venv (e.g., run the venv pip install
against the repo root or use pip install -e . from the checkout) so that
density_ramp.py runs with the branch changes in src/smolvm and pyproject.toml
instead of the PyPI smolvm package.
In `@src/smolvm/build.py`:
- Line 1005: The code calls os.getuid() and os.getgid() inside
_create_ext4_with_docker() but the os module is not imported; add an import for
the os module (e.g., import os) at the top of the module so that the uid/gid
retrieval (uid, gid = os.getuid(), os.getgid()) succeeds without NameError;
ensure the import is placed with other standard-library imports to keep imports
organized.
---
Duplicate comments:
In `@benchmarks/docs/v0_review_density_ramp_report.md`:
- Around line 265-269: The example hardcodes the Firecracker socket path
(/tmp/fc-*.sock); update the docs to reference the configurable socket directory
instead — e.g., replace the literal "/tmp/fc-*.sock" in the command example with
a parameterized placeholder like "<socket-dir>/fc-*.sock" (or add a
parenthetical note that /tmp is the default) so the command aligns with the
benchmark's --socket-dir option and avoids confusing readers; update the Bash
block containing "ls /tmp/fc-*.sock" accordingly.
- Around line 47-52: Update the teardown description to reflect that cleanup()
calls vm.delete() (not vm.stop()), because delete() performs host-side
rootfs/clone removal while stop() only halts the VM; update the text around
Metrics.snapshot() to explain that the post-cleanup snapshot measures disk usage
after vm.delete() so the delta reflects released clones, and mention
benchmarks/density_ramp.py as the source of the deletion behavior change.
- Around line 59-64: Update the metrics table so the Source column accurately
describes how each JSON field is computed in the current script: for
`boot_time_s` replace "Wall clock around vm.start()" with the actual measurement
used in the script (e.g., the timestamp delta from VM start to the script's
success/health-check completion and the function or lines that compute it), for
`host_disk_used_gb` remove the reference to psutil.disk_usage(home).used and
instead describe the current method the script uses to measure disk growth (the
specific variable/function that records rootfs/clone growth), and ensure
`density_at_boot` and `host_mem_used_gb` entries reference the exact script
variables or calls (e.g., `len(running_vms)` and `psutil.virtual_memory().used`
if still accurate) so readers can directly map JSON fields to the code that
produces them.
In `@benchmarks/results/README.md`:
- Around line 134-145: The examples compute per_vm_mem_mb and rss_per_vm_mb from
boots without guarding for runs with 0 or 1 successful boots; add a short
precondition that checks len(boots) and handle the edge cases (len(boots) == 0
and len(boots) == 1) before performing the divisions: if boots is empty, skip
the computation or return a clear "no boots" result, and if there's exactly one
boot, avoid dividing by (len(boots)-1) by using a defined fallback (e.g., treat
per-VM delta as 0 or report "insufficient samples") and compute rss_per_vm_mb
only when len(boots) > 0; update the explanatory text around per_vm_mem_mb and
rss_per_vm_mb to mention these checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d01608cf-d8df-458a-9418-5b61c51c9b7b
📒 Files selected for processing (13)
- benchmarks/bench_command_latency.py
- benchmarks/bench_file_io.py
- benchmarks/bench_network_io.py
- benchmarks/density_ramp.py
- benchmarks/docs/README_DENSITY_RAMP.md
- benchmarks/docs/v0_density_estimation_benchmark.md
- benchmarks/docs/v0_review_density_ramp_report.md
- benchmarks/helpers.py
- benchmarks/results/README.md
- benchmarks/run_density_ramp_ec2.sh
- benchmarks/run_density_ramp_gcp.sh
- pyproject.toml
- src/smolvm/build.py
🚧 Files skipped from review as they are similar to previous changes (3)
- benchmarks/bench_file_io.py
- benchmarks/run_density_ramp_ec2.sh
- benchmarks/docs/README_DENSITY_RAMP.md
def _find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
Let HTTPServer pick the port.
_find_free_port() closes its probe socket before _start_http_server() binds, so another process can claim the port in between and make the benchmark fail sporadically.
🔧 Suggested race-free binding
-def _find_free_port() -> int:
- with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
- s.bind(("", 0))
- return s.getsockname()[1]
-
-
-def _start_http_server(port: int) -> http.server.HTTPServer:
+def _start_http_server() -> tuple[http.server.HTTPServer, int]:
"""Start a background HTTP server on the given port."""
- server = http.server.HTTPServer(("0.0.0.0", port), _BenchHandler)
+ server = http.server.HTTPServer(("0.0.0.0", 0), _BenchHandler)
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()
- return server
+ return server, int(server.server_address[1])
@@
- host_port = _find_free_port()
- server = _start_http_server(host_port)
+ server, host_port = _start_http_server()

Also applies to: 80-85, 97-98
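For illustration, a self-contained sketch of the race-free pattern the diff applies: bind to port 0 and read back the kernel-assigned port. `_Handler` here is a stand-in for the benchmark's `_BenchHandler`:

```python
import http.server
import threading
import urllib.request

class _Handler(http.server.BaseHTTPRequestHandler):
    # Stand-in for the benchmark's _BenchHandler.
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep benchmark output quiet
        pass

def start_http_server() -> tuple[http.server.HTTPServer, int]:
    # Port 0 asks the kernel for a free port; choosing and binding are
    # one atomic step, so no other process can grab it in between.
    server = http.server.HTTPServer(("0.0.0.0", 0), _Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # server_address is updated to the real port after bind.
    return server, server.server_address[1]

server, port = start_http_server()
print(urllib.request.urlopen(f"http://127.0.0.1:{port}/").read())
server.shutdown()
```

The probe-socket version can never be made safe by closing later, because `HTTPServer` must itself own the bound socket.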
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/bench_network_io.py` around lines 40 - 43, The current
_find_free_port() creates and closes a probe socket which allows a race where
another process can grab the port before _start_http_server() binds; instead,
change _start_http_server() (and any callers in this file) to bind to port 0 so
the HTTPServer picks an available port and then read the assigned port from the
bound server (e.g., server.server_address or server.socket.getsockname()[1])
after the server is created/started, remove or stop using _find_free_port() in
bench_network_io.py, and return or propagate the actual assigned port to the
benchmark code so no ephemeral-port race can occur.
gateway_result = vm.run("ip route | grep default | awk '{print $3}'")
gateway_ip = gateway_result.output.strip()
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Possible run-result fields =="
rg -n -C2 'stdout:|stderr:|output:|exit_code:' src/smolvm
echo
echo "== Existing attribute usage on vm.run results =="
rg -n -C2 '\.stdout\b|\.output\b' benchmarks src/smolvm

Repository: CelestoAI/SmolVM
Length of output: 12143
Consolidate to use .stdout consistently; .output introduces redundant double-stripping.
Line 160 uses gateway_result.output.strip(), but the output property already returns stripped output (self.stdout.strip()). Since the rest of the file consistently uses .stdout (lines 127, 138, 141–150, 206), change line 160 to gateway_result.stdout.strip() to match the pattern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/bench_network_io.py` around lines 159 - 160, The code uses
gateway_result.output.strip() to set gateway_ip which is inconsistent with the
rest of the file and duplicates stripping; update the assignment to use
gateway_result.stdout.strip() instead so it matches other uses (see vm.run(...)
returning gateway_result and the gateway_ip variable) and remove the redundant
.output usage.
def overhead_str(base_ms: float, test_ms: float) -> str:
    """Return a human-readable overhead string like '2.5x' or '+150ms'."""
    if base_ms <= 0:
        return "N/A"
    ratio = test_ms / base_ms
    diff = test_ms - base_ms
    return f"{ratio:.1f}x (+{diff:.0f}ms)"
Negative deltas render as +-Nms here.
When test_ms < base_ms, this produces output like 0.8x (+-20ms). Use signed formatting so faster cases read cleanly.
Possible fix
- return f"{ratio:.1f}x (+{diff:.0f}ms)"
+ return f"{ratio:.1f}x ({diff:+.0f}ms)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/helpers.py` around lines 231 - 237, The overhead_str function
currently builds the delta as "(+{diff:.0f}ms)" which yields "+-Nms" when diff
is negative; update the formatted delta to include the sign in the numeric
format instead of hardcoding a "+" so negative deltas render as "(-20ms)".
Concretely, in overhead_str use the computed diff variable and format it with a
signed format specifier (e.g., {diff:+.0f}ms) and remove the hardcoded "+" in
the surrounding string so the returned string uses ratio and signed diff (use
ratio, diff, base_ms, test_ms as referenced).
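The fixed helper, as proposed in the diff above, relies on Python's `+` format flag emitting the sign itself:

```python
def overhead_str(base_ms: float, test_ms: float) -> str:
    """Return a human-readable overhead string like '2.5x (+150ms)'."""
    if base_ms <= 0:
        return "N/A"
    ratio = test_ms / base_ms
    diff = test_ms - base_ms
    # {diff:+.0f} prints its own sign, so slower runs read "(+150ms)"
    # and faster runs "(-20ms)" -- no hardcoded "+" to collide with "-".
    return f"{ratio:.1f}x ({diff:+.0f}ms)"

print(overhead_str(100.0, 250.0))  # 2.5x (+150ms)
print(overhead_str(100.0, 80.0))   # 0.8x (-20ms)
```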
## File Structure

Each file is a JSON array of event objects written sequentially during the run. Records appear in chronological order: one `boot_ok` or `boot_fail` per VM launch attempt, followed by a single `sustain_check`, and finally a `teardown` record.
Document the actual file-write behavior.
density_ramp.py buffers events in memory and writes the JSON once at the end, so there is no sequentially-updated file during the run. These sections also overstate the event shape: sustain_check is omitted when --sustain-sec 0 or when no VM ever boots.
Also applies to: 124-125
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/results/README.md` at line 7, Update the README to reflect actual
file-write behavior: describe that density_ramp.py accumulates events in memory
and writes a single JSON array to disk at the end of the run (not a
sequentially-updated file), and clarify the actual event ordering and
optionality of records produced by density_ramp.py — record types include
boot_ok/boot_fail per VM attempt and teardown, but sustain_check is only emitted
when --sustain-sec > 0 and at least one VM booted (it is omitted if
--sustain-sec 0 or no VM ever boots); update the wording around event "shape"
accordingly so boot/sustain/teardown semantics match density_ramp.py's behavior.
# Install smolvm and psutil into a venv (avoids PEP 668 externally-managed-env error)
python3 -m venv ~/venv
~/venv/bin/pip install --quiet smolvm psutil
This runs against PyPI smolvm, not the checkout under review.
Only density_ramp.py is copied to the instance. The actual runtime package comes from pip install smolvm, so any unreleased changes in src/smolvm or pyproject.toml on this branch are absent from the benchmark run.
Also applies to: 232-245
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/run_density_ramp_gcp.sh` around lines 219 - 221, The script
currently does "pip install smolvm" which installs the PyPI release rather than
the local checkout; change the install step so the benchmark uses the
checked-out package under review (not PyPI) by installing the local repository
into the venv (e.g., run the venv pip install against the repo root or use pip
install -e . from the checkout) so that density_ramp.py runs with the branch
changes in src/smolvm and pyproject.toml instead of the PyPI smolvm package.
Actionable comments posted: 1
🧹 Nitpick comments (2)
benchmarks/bench_file_io.py (2)
84-93: Consider using a context manager for reliable cleanup.
If an exception occurs during file creation, the temp directory won't be cleaned up. While this is minor for a benchmark, using `tempfile.TemporaryDirectory` would be cleaner.
♻️ Suggested improvement

 def _do():
-    tmpdir = tempfile.mkdtemp()
-    for i in range(n):
-        with open(os.path.join(tmpdir, f"job_{i}.json"), "w") as f:
-            f.write(_1KB_LINE)
-    # Cleanup
-    import shutil
-
-    shutil.rmtree(tmpdir)
+    with tempfile.TemporaryDirectory() as tmpdir:
+        for i in range(n):
+            with open(os.path.join(tmpdir, f"job_{i}.json"), "w") as f:
+                f.write(_1KB_LINE)
 return time_call(_do)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/bench_file_io.py` around lines 84 - 93, The tmpdir created in the _do function uses tempfile.mkdtemp and a separate shutil.rmtree cleanup which will be skipped if an exception occurs; change _do to use tempfile.TemporaryDirectory as a context manager so the temporary directory is automatically removed on exit (and remove the manual shutil.rmtree call), i.e., locate the _do function and replace the mkdtemp/open/write/rmtree pattern with a with tempfile.TemporaryDirectory() as tmpdir: block around the file creation loop so cleanup is reliable.
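The suggested context-manager version can be sketched end to end. `_1KB_LINE` and `time_call` are stand-ins mirroring the names in the review, not the benchmark's actual definitions:

```python
import os
import tempfile
import time

_1KB_LINE = "x" * 1023 + "\n"  # stand-in for the benchmark's constant

def time_call(fn) -> float:
    """Run fn once and return elapsed milliseconds."""
    t0 = time.perf_counter()
    fn()
    return (time.perf_counter() - t0) * 1000.0

def bench_many_small_files(n: int = 100) -> float:
    def _do():
        # TemporaryDirectory removes the tree even if a write raises,
        # unlike the mkdtemp + manual shutil.rmtree pattern.
        with tempfile.TemporaryDirectory() as tmpdir:
            for i in range(n):
                with open(os.path.join(tmpdir, f"job_{i}.json"), "w") as f:
                    f.write(_1KB_LINE)
    return time_call(_do)

print(f"{bench_many_small_files():.1f} ms")
```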
220-223: Redundant path manipulation given the import fallback.
The `sys.path.insert` logic appears redundant since lines 14-33 already handle both relative and absolute imports with a try/except fallback. This path manipulation is also fragile as it depends on `"benchmarks"` being in the file path.
♻️ Simplified entry point

 if __name__ == "__main__":
-    if "benchmarks" in __file__:
-        sys.path.insert(0, str(__import__("pathlib").Path(__file__).parent))
     run_benchmark()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/bench_file_io.py` around lines 220 - 223, Remove the fragile sys.path.insert() conditional in the module-level entrypoint and rely on the existing try/except import fallback already implemented earlier; specifically, delete the "if 'benchmarks' in __file__: sys.path.insert(...)" block and leave the guarded main invocation that calls run_benchmark(), ensuring the script simply executes run_benchmark() under if __name__ == "__main__": without manipulating sys.path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/bench_file_io.py`:
- Line 105: Fix the typo in the inline comment that reads "# baslines on host"
in benchmarks/bench_file_io.py by changing "baslines" to "baselines" so the
comment reads "# baselines on host"; locate the comment near the benchmarking
setup (search for the exact string "baslines on host") and update it
accordingly.
---
Nitpick comments:
In `@benchmarks/bench_file_io.py`:
- Around line 84-93: The tmpdir created in the _do function uses
tempfile.mkdtemp and a separate shutil.rmtree cleanup which will be skipped if
an exception occurs; change _do to use tempfile.TemporaryDirectory as a context
manager so the temporary directory is automatically removed on exit (and remove
the manual shutil.rmtree call), i.e., locate the _do function and replace the
mkdtemp/open/write/rmtree pattern with a with tempfile.TemporaryDirectory() as
tmpdir: block around the file creation loop so cleanup is reliable.
- Around line 220-223: Remove the fragile sys.path.insert() conditional in the
module-level entrypoint and rely on the existing try/except import fallback
already implemented earlier; specifically, delete the "if 'benchmarks' in
__file__: sys.path.insert(...)" block and leave the guarded main invocation that
calls run_benchmark(), ensuring the script simply executes run_benchmark() under
if __name__ == "__main__": without manipulating sys.path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 20ae6815-236a-41f1-88d9-c5e529f26a0d
📒 Files selected for processing (1)
benchmarks/bench_file_io.py
print_header("File IO Benchmark")

# baslines on host
Typo: "baslines" should be "baselines".
Minor typo in the comment.
✏️ Fix
- # baslines on host
+ # baselines on host
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmarks/bench_file_io.py` at line 105, Fix the typo in the inline comment
that reads "# baslines on host" in benchmarks/bench_file_io.py by changing
"baslines" to "baselines" so the comment reads "# baselines on host"; locate the
comment near the benchmarking setup (search for the exact string "baslines on
host") and update it accordingly.
Port Docker/nftables install, ulimit/sysctl tuning (nofile, nproc, ip_local_port_range, somaxconn), and post-run network/log diagnostics from run_density_ramp_ec2.sh to run_density_ramp_gcp.sh, adapted for Debian 12 (apt-get, docker.io package, no firewalld). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Only append timing measurements when curl/vm.run succeeds (returncode==0 and non-empty stdout). Failed requests due to network issues or startup races previously produced invalid latency data; now they are logged as warnings and excluded from stats_summary. Fixes CodeRabbit discussion_r2903439531. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
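The guarded-measurement pattern this commit applies can be sketched generically. `timed_run` is a hypothetical helper, not the benchmark's actual function; it shows only the success checks described above (zero exit code and non-empty stdout):

```python
import subprocess
import time

def timed_run(cmd: list[str], samples: list[float]) -> None:
    """Append a latency sample only when the command succeeds with output."""
    t0 = time.perf_counter()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"warning: sample dropped: {exc}")
        return
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    if proc.returncode == 0 and proc.stdout:
        samples.append(elapsed_ms)  # only successful calls reach the stats
    else:
        print(f"warning: exit {proc.returncode}, sample dropped")

samples: list[float] = []
timed_run(["echo", "hi"], samples)          # succeeds -> recorded
timed_run(["sh", "-c", "exit 3"], samples)  # fails -> dropped with warning
print(len(samples))  # 1
```

Without the guard, a startup race that makes `curl` exit non-zero would still contribute a (very fast, meaningless) latency sample.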
os.getuid() and os.getgid() were called in _create_ext4_with_docker() but the os module was not imported, causing a NameError at runtime. Fixes CodeRabbit discussion_r2903439577. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously, cleanup errors were silently swallowed (except Exception: pass), hiding leaked disks/sockets near the resource ceiling. Now cleanup_error is captured and included in the returned error dict so callers and snapshots can see delete failures. Fixes CodeRabbit discussion_r2903439539. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
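A minimal sketch of the error-surfacing cleanup path described above. `wait_for_ssh` and `delete` mirror names from this PR, but the structure here is illustrative, not the actual `density_ramp.py` code:

```python
from typing import Any, Callable

def boot_with_cleanup(vm_factory: Callable[[], Any]) -> dict:
    """Boot a VM; on failure, try to delete it and surface both errors."""
    vm, meta = None, {}
    try:
        vm = vm_factory()
        vm.wait_for_ssh()
        meta["ok"] = True
    except Exception as e:
        meta["error"] = str(e)
        if vm is not None:
            try:
                vm.delete()
            except Exception as e_cleanup:
                # The old `except Exception: pass` hid leaked disks/sockets;
                # recording the failure lets callers and snapshots see it.
                meta["cleanup_error"] = str(e_cleanup)
    return meta

class FakeVM:
    """Stand-in VM whose SSH wait and delete both fail."""
    def wait_for_ssh(self):
        raise RuntimeError("ssh timeout")
    def delete(self):
        raise OSError("disk busy")

print(boot_with_cleanup(FakeVM))
# {'error': 'ssh timeout', 'cleanup_error': 'disk busy'}
```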
…mode boot_one() now returns seq as the first element of its tuple so callers can use the original launch index instead of completion order. In the as_completed() loop, batch_ok_so_far tracks successful boots within the batch so density_at_boot reflects the true concurrent count for each VM (not just len(running_vms)+1 for all). batch_results are sorted by seq before processing to emit boot_ok/boot_fail records in launch order. Fixes CodeRabbit discussion_r2903439537. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
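The seq-preserving mechanics described in this commit can be sketched as follows. The worker body is a stand-in; only the return-seq / sort-by-seq / running-count pattern mirrors the change:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def boot_one(seq: int) -> tuple[int, bool, float]:
    # Stand-in worker: returns its launch index first, so callers never
    # depend on completion order. Pretend every 4th boot fails.
    ok = seq % 4 != 3
    boot_time = 0.1 * seq
    return seq, ok, boot_time

def run_batch(n: int) -> list[dict]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(boot_one, seq) for seq in range(n)]
        results = [f.result() for f in as_completed(futures)]
    results.sort(key=lambda r: r[0])  # emit records in launch order
    records, ok_so_far = [], 0
    for seq, ok, boot_time in results:
        if ok:
            ok_so_far += 1
        records.append({
            "vm_seq": seq,
            "event": "boot_ok" if ok else "boot_fail",
            "density_at_boot": ok_so_far,  # true concurrent count so far
            "boot_time_s": boot_time,
        })
    return records
```

Without the sort, `as_completed` yields results in finish order, so `vm_seq` and `density_at_boot` would drift from the actual launch sequence.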
🤖 Generated with Claude Code