arm64: fix boot stability issues and remove Docker requirement #134

ryanbreen · 2026-01-30T13:46:34Z

Summary

Fix ARM64 serial logging deadlock (~40% boot failure)
Fix capacity overflow in stack address calculation (~10% failure)
Fix syscall return race condition / data abort (~10% failure)
Remove Docker requirement - QEMU runs natively for both architectures
Add strict boot test (20 iterations, 100% pass required)
Fix VirtIO block DSB barrier issue

Root Causes Fixed

1. Serial Logging Deadlock (40% failure)

ARM64 has only one UART, so both serial_println! and log_serial_println! use the same SERIAL1 lock. Timer interrupts firing while holding this lock caused deadlocks when the scheduler tried to log.

Fix: Disable interrupts during serial output on ARM64 + guard scheduler logging with #[cfg(target_arch = "x86_64")].

2. Capacity Overflow (10% failure)

Integer underflow in stack_top - USER_STACK_SIZE created invalid page ranges, triggering Page::range_inclusive() capacity overflow.

Fix: Use checked_sub() and add validation in map_user_stack_to_process.

3. Syscall Return Race (10% failure)

The TTBR0 switch code used SP with stp/ldp before and after changing TTBR0. If a context switch occurred, SP and TTBR0 became inconsistent.

Fix: Reorder operations to do TTBR0 switch before restoring x0/x1, using them as scratch registers.

Test Results

After all fixes: 20/20 strict boot tests pass (100% success rate)

Test plan

Build ARM64 kernel
Run strict boot test (20 iterations) - 100% pass rate
Verify serial output is clean (no truncation/garbage)

🤖 Generated with Claude Code

- Add native ARM64 boot test script (run-aarch64-boot-test-native.sh) - Uses retry logic (5 attempts) to handle timing-related flakiness - Validates shell startup and checks for excessive init_shell spawning - Fast (~8 seconds per attempt with 40% success rate) - Add Docker ARM64 boot test script (run-aarch64-boot-test.sh) - Uses only ext2 disk (test_binaries disk triggers VirtIO driver bug) - Docker QEMU is slower than native; needs more investigation Known issues: - ARM64 boot has ~40% success rate due to timing bug - With 5 retries, expected to pass 92% of the time - Root cause appears to be timer interrupt interference during page table operations Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Docker is unnecessary for ARM64 QEMU: - ARM64 QEMU uses software emulation, not Hypervisor.framework - No system instability risk like x86_64 QEMU - Native QEMU is faster and supports VNC for UI testing Keep only the native test script (run-aarch64-boot-test-native.sh). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

ARM64: - Native QEMU (no Docker) - no Hypervisor.framework issues - Uses cocoa display for native window (no VNC needed) - Better performance than Docker x86_64: - Docker-based QEMU (required for Hypervisor.framework stability) - Uses VNC display at localhost:5900 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Root cause investigation found the ARM64 VirtIO block driver was using only fence(SeqCst) (DMB) before notify_queue(), but ARM64 requires DSB to ensure descriptor writes reach memory before the device is notified. Fixes: - Add dsb_sy() call before both notify_queue() calls in block_mmio.rs - Import dsb_sy from cpu.rs instead of duplicating implementation - Add device_features tracking to detect read-only devices New tests added to improve coverage (per validation feedback): - test_multi_read(): Stress test reading sector 0 ten times - test_sequential_read(): Read sectors 0-31 testing queue wrap-around - test_write_read_verify(): Write-read-verify cycle (skips if readonly) - test_invalid_sector(): Verify error handling for out-of-range sectors - test_uninitialized_read(): Document error handling for uninitialized state All tests registered in ARM64 test framework with appropriate timeouts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Root cause: On ARM64, both serial_println! and log_serial_println! use the same SERIAL1 lock. If a timer interrupt fires while holding this lock and something in the timer/scheduler path tries to log, a deadlock occurs. On x86_64, serial_println! uses COM1 and log_serial_println! uses COM2 (separate hardware), so no deadlock is possible. Fixes: 1. serial_aarch64.rs: Disable interrupts while holding serial lock - Saves DAIF register, masks IRQ/FIQ, performs output, restores - Prevents timer from firing during serial output 2. context_switch.rs: Remove all logging from context switch path - Replaced log::* and serial_println! with lock-free alternatives - Added raw_uart_str() calls for critical error paths only 3. scheduler.rs: Make all logging conditional on x86_64 - Wrapped log_serial_println! calls with #[cfg(target_arch)] - Scheduler functions can be called from context switch path 4. trace.rs: New lock-free tracing infrastructure - Single atomic operation per trace event - Ring buffer for forensic debugging - Macros for common events (context switch, IRQ, syscall) 5. check-critical-path-violations.sh: Enforcement script - Scans critical path files for prohibited logging patterns - Lists allowed lock-free alternatives Result: ARM64 boot tests now pass 5/5 consistently (was ~40% failure rate) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add run-aarch64-boot-test-strict.sh: 20-iteration test requiring 100% success rate (no retries) for CI validation - Add scheduler.rs to critical path violation checker - Fix unguarded log::info! at scheduler.rs line 874 - Update boot test comments: remove outdated "known timing bug" note The strict test revealed two separate pre-existing bugs: 1. Capacity overflow panic during process creation 2. Data abort after entering userspace These are tracked separately from the deadlock fix. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Update CLAUDE.md to remove Docker-only policy for x86-64 - Update run.sh to run x86_64 QEMU directly on host (no Docker wrapper) - ARM64 already ran natively, this makes x86-64 consistent - Simplify documentation around QEMU usage - Both architectures now run QEMU natively for better performance and simpler debugging Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fix 1: Capacity overflow in stack address calculation - In manager.rs, use checked_sub() to prevent integer underflow when calculating stack_bottom from stack_top - Add validation in process_memory.rs to verify stack_bottom < stack_top before iterating, preventing Page::range_inclusive() capacity overflow Fix 2: Data abort race condition in syscall return path - Reordered syscall_entry.S to do TTBR0 switch BEFORE restoring x0/x1 - Use x0/x1 as scratch registers during TTBR0 switch (not restored yet) - Restore x0/x1 from frame AFTER TTBR0 switch (frame still valid) - Eliminates race where SP/TTBR0 could become inconsistent after context switch, causing reads from unmapped addresses Both bugs caused ~10% failure rate, combined ~20% intermittent failures. After fix: 20/20 strict boot tests pass (100% success rate). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

ryanbreen and others added 9 commits January 29, 2026 21:35

Add scripts pycache to get iIgnorant.

b26cab4

ryanbreen merged commit e7a511f into main Jan 30, 2026
1 of 2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

arm64: fix boot stability issues and remove Docker requirement #134

arm64: fix boot stability issues and remove Docker requirement #134

Uh oh!

ryanbreen commented Jan 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

arm64: fix boot stability issues and remove Docker requirement #134

arm64: fix boot stability issues and remove Docker requirement #134

Uh oh!

Conversation

ryanbreen commented Jan 30, 2026

Summary

Root Causes Fixed

1. Serial Logging Deadlock (40% failure)

2. Capacity Overflow (10% failure)

3. Syscall Return Race (10% failure)

Test Results

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants