Production hardening: HA control plane, autoscaling, live migration, CI/CD pipeline#88

Merged
breardon2011 merged 32 commits into main from autoscaling-etc on Apr 14, 2026

breardon2011 commented on Mar 26, 2026

Summary

Production-ready infrastructure for OpenSandbox: an HA control plane, autoscaling with bin-packing, live migration, and an end-to-end
CI/CD pipeline. Everything was validated in the Azure westus2 dev environment with 200-sandbox chaos tests.

Control Plane HA

  • 2 control planes behind leader election (Redis SETNX). Both serve API requests; only the leader runs the scaler.
  • Dedicated Postgres + Redis VMs — separated from the control plane (no more single-VM SPOF).
  • Graceful shutdown with readiness probe (/readyz checks Postgres + Redis) and 25s request draining.
  • Scaler state persisted to Redis — cooldowns, pending launches, drain state, migration locks survive CP restarts.
  • Route cache in Redis — shared across HA instances.
  • Leader election with 15s TTL, automatic failover tested (kill CP1 → CP2 takes over in <15s).

Autoscaling & Bin-Packing

  • Committed memory-based bin-packing — routes sandboxes by actual memory commitment, not sandbox count.
  • Idle reserves for burst — scaler maintains N idle workers, routes use them freely, scaler backfills.
  • Dynamic capacity — workers report TotalMemoryMB/CommittedMemoryMB in heartbeats.
  • Creation queue — polls for up to 30s instead of returning an immediate 503 when at capacity.
  • Scale-down thrashing fix — skip drain during rolling replace or pending launches.
  • Default sandbox size 4GB for all plans (was 256MB for pro).
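
Routing by committed memory can be illustrated with a small placement function. This is a sketch, not the router in this PR: `TotalMemoryMB` and `CommittedMemoryMB` are the heartbeat fields named above, while `ID`, `Draining`, `freeMB`, and `pickWorker` are hypothetical. The best-fit policy (tightest worker that still fits) is also an assumption; the merge commit mentions least-loaded routing, which would only change the comparison.

```go
package main

// Worker mirrors the capacity fields workers report in heartbeats.
// ID and Draining are hypothetical additions for this sketch.
type Worker struct {
	ID                string
	TotalMemoryMB     int
	CommittedMemoryMB int
	Draining          bool
}

// freeMB is the memory not yet committed to sandboxes.
func (w Worker) freeMB() int { return w.TotalMemoryMB - w.CommittedMemoryMB }

// pickWorker returns the worker with the least free memory that still
// fits requestMB (best fit). ok is false when no worker has room.
func pickWorker(workers []Worker, requestMB int) (best Worker, ok bool) {
	for _, w := range workers {
		if w.Draining || w.freeMB() < requestMB {
			continue
		}
		if !ok || w.freeMB() < best.freeMB() {
			best, ok = w, true
		}
	}
	return best, ok
}
```

Because the decision uses committed megabytes, one 16GB sandbox counts the same as four 4GB sandboxes, which a count-based router would not capture.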

Live Migration

  • Migration state machine — running → migrating → running in Postgres, blocks exec routing during migration.
  • Scale-triggered auto-migration — insufficient_capacity → find target worker → live migrate → retry scale.
  • virtio-mem pool fix — the pool is now sized as 16GB minus the base memory instead of a fixed 15GB, which supports scaling any sandbox to 16GB.
  • Ping-pong tested: 20x back-and-forth, 20x random across 6 workers, zero failures.
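
The migration state machine can be illustrated in memory as follows. This is a sketch under assumptions: the `Sandbox` struct, `transition`, and `canExec` are hypothetical, and the real state lives in Postgres, where the same guard would presumably be a conditional update on the current state.

```go
package main

import "fmt"

// State is the sandbox lifecycle state relevant to migration.
type State string

const (
	Running   State = "running"
	Migrating State = "migrating"
)

// allowed lists the transitions named in the PR:
// running -> migrating -> running.
var allowed = map[State]State{
	Running:   Migrating,
	Migrating: Running,
}

// Sandbox is a hypothetical in-memory stand-in for the Postgres row.
type Sandbox struct {
	ID    string
	State State
}

// transition moves the sandbox to next, rejecting anything not in allowed.
func (s *Sandbox) transition(next State) error {
	if allowed[s.State] != next {
		return fmt.Errorf("sandbox %s: invalid transition %s -> %s", s.ID, s.State, next)
	}
	s.State = next
	return nil
}

// canExec reports whether exec requests may be routed to the sandbox.
func (s *Sandbox) canExec() bool { return s.State == Running }
```

Rejecting a second `running -> migrating` transition is what prevents two concurrent migrations of the same sandbox in this sketch.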

CI/CD Pipeline

  • deploy-server.yml — blue-green deploy to N control planes (discovered from Azure, no hardcoded IPs), smoke test (create on CP1, exec on
    CP2).
  • build-worker-ami.yml — Packer → Azure Compute Gallery (NVMe-compatible for v6 VMs) → Key Vault → scaler auto-rolls workers.
  • deploy-worker.yml — hotfix deploy, discovers CP from Azure.
  • Rolling worker replacement validated: 9 workers replaced in ~14 minutes, zero downtime.
  • Drain/destroy bug fixed — exported drainState/pendingLaunch fields for Redis serialization.
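
The drain/destroy fix above is an instance of a common Go pitfall: `encoding/json` silently skips unexported struct fields. The sketch below assumes the scaler state is JSON-encoded before being written to Redis, which this PR does not state. The struct names, field types, and JSON tags are hypothetical.

```go
package main

import "encoding/json"

// scalerStateBroken uses unexported fields, which encoding/json skips
// without reporting an error.
type scalerStateBroken struct {
	drainState    map[string]bool
	pendingLaunch int
}

// scalerStateFixed exports the same fields so they survive a round trip.
type scalerStateFixed struct {
	DrainState    map[string]bool `json:"drain_state"`
	PendingLaunch int             `json:"pending_launch"`
}

// encode marshals v and returns the JSON text, or "" on error.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}
```

With the unexported version the persisted value is an empty object, so a restarted control plane would read back no drain state or pending launches at all.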

Observability

  • Admin dashboard (/admin/status) — real-time worker status, sandbox counts, CPU/memory/disk bars.
  • SSE event stream (/admin/events) — live creates, destroys, scales, migrations.
  • Report endpoint (/admin/report) — summary with migration details.
  • Orphaned NIC cleanup — periodic sweep every 5 min via OrphanCleaner interface.
  • Postgres daily backups — local + Azure Blob upload via managed identity.

Testing

  • 200-sandbox chaos test with random memory scaling (4-16GB), disk writes, concurrent create/destroy — 60/60 spot checks passed, all
    migrations succeeded.
  • HA failover tests — kill CP1, kill CP2, rolling deploy, both CPs restart — 26/26 passed.
  • SDK integration tests — 134/140+ assertions pass (failures are DNS/pre-existing).
  • 29 new Go tests — leader election, admin events, InMemoryScalerState.

Test plan

  • 200-sandbox chaos test with random scaling + migration
  • HA failover (kill each CP, rolling deploy)
  • Rolling worker replacement (9 workers, ~14 min)
  • SDK integration tests (Python + TypeScript)
  • Hibernate/wake with S3 checkpoint store
  • Live migration ping-pong (20x)
  • Pressure tests (RAM + CPU + disk)
  • Blue-green deploy via GitHub Actions
  • Packer image build + gallery publish via GitHub Actions

vercel bot commented Mar 26, 2026

The latest updates on your projects.

Project      Deployment  Actions           Updated (UTC)
opensandbox  Ready       Preview, Comment  Apr 14, 2026 7:14pm


breardon2011 and others added 10 commits April 3, 2026 20:29
Use printf instead of echo for multi-line SSH key secret,
add StrictHostKeyChecking=no to prevent interactive prompts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add rootfs tarball creation step before Packer build
- Remove fallback that constructed fake image ID when Packer failed
- This caused the scaler to try launching VMs from a non-existent image

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
breardon2011 and others added 3 commits April 6, 2026 17:43
- Smoke test: retry exec 3 times with 5s gaps (cold golden boot)
- Packer: re-login Azure before Key Vault write (token expires during long build)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
breardon2011 changed the title from "autoscaling wip" to "Production hardening: HA control plane, autoscaling, live migration, CI/CD pipeline" on Apr 8, 2026
breardon2011 marked this pull request as ready for review on April 9, 2026 03:01
Resolves conflicts preserving:
- Autoscaling routing (least-loaded, Redis round-robin, Draining)
- Main's secrets store layering, stale cache validation, S3 checkpoint upload
- Main's SDK fixes (always route through CP)
- Main's deploy-worker kernel fix
- Both GoldenVersion and sealSandboxEnvs in manager
- All SnapshotMeta fields (GoldenVersion + SealedTokens + TokenHosts)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
breardon2011 merged commit 6d5658b into main on Apr 14, 2026
2 of 4 checks passed

2 participants