Skip to content

Conversation

@ryanbreen
Copy link
Owner

Summary

  • Fix ARM64 serial logging deadlock (~40% boot failure)
  • Fix capacity overflow in stack address calculation (~10% failure)
  • Fix syscall return race condition / data abort (~10% failure)
  • Remove Docker requirement - QEMU runs natively for both architectures
  • Add strict boot test (20 iterations, 100% pass required)
  • Fix VirtIO block DSB barrier issue

Root Causes Fixed

1. Serial Logging Deadlock (40% failure)

ARM64 has only one UART, so both serial_println! and log_serial_println! use the same SERIAL1 lock. Timer interrupts firing while holding this lock caused deadlocks when the scheduler tried to log.

Fix: Disable interrupts during serial output on ARM64 + guard scheduler logging with #[cfg(target_arch = "x86_64")].

2. Capacity Overflow (10% failure)

Integer underflow in stack_top - USER_STACK_SIZE created invalid page ranges, triggering Page::range_inclusive() capacity overflow.

Fix: Use checked_sub() and add validation in map_user_stack_to_process.

3. Syscall Return Race (10% failure)

The TTBR0 switch code used SP with stp/ldp before and after changing TTBR0. If a context switch occurred, SP and TTBR0 became inconsistent.

Fix: Reorder operations to do TTBR0 switch before restoring x0/x1, using them as scratch registers.

Test Results

After all fixes: 20/20 strict boot tests pass (100% success rate)

Test plan

  • Build ARM64 kernel
  • Run strict boot test (20 iterations) - 100% pass rate
  • Verify serial output is clean (no truncation/garbage)

🤖 Generated with Claude Code

ryanbreen and others added 9 commits January 29, 2026 21:35
- Add native ARM64 boot test script (run-aarch64-boot-test-native.sh)
  - Uses retry logic (5 attempts) to handle timing-related flakiness
  - Validates shell startup and checks for excessive init_shell spawning
  - Fast (~8 seconds per attempt with 40% success rate)

- Add Docker ARM64 boot test script (run-aarch64-boot-test.sh)
  - Uses only ext2 disk (test_binaries disk triggers VirtIO driver bug)
  - Docker QEMU is slower than native; needs more investigation

Known issues:
- ARM64 boot has ~40% success rate due to timing bug
- With 5 retries, expected to pass 92% of the time
- Root cause appears to be timer interrupt interference during page table operations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Docker is unnecessary for ARM64 QEMU:
- ARM64 QEMU uses software emulation, not Hypervisor.framework
- No system instability risk like x86_64 QEMU
- Native QEMU is faster and supports VNC for UI testing

Keep only the native test script (run-aarch64-boot-test-native.sh).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ARM64:
- Native QEMU (no Docker) - no Hypervisor.framework issues
- Uses cocoa display for native window (no VNC needed)
- Better performance than Docker

x86_64:
- Docker-based QEMU (required for Hypervisor.framework stability)
- Uses VNC display at localhost:5900

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause investigation found the ARM64 VirtIO block driver was using
only fence(SeqCst) (DMB) before notify_queue(), but ARM64 requires DSB
to ensure descriptor writes reach memory before the device is notified.

Fixes:
- Add dsb_sy() call before both notify_queue() calls in block_mmio.rs
- Import dsb_sy from cpu.rs instead of duplicating implementation
- Add device_features tracking to detect read-only devices

New tests added to improve coverage (per validation feedback):
- test_multi_read(): Stress test reading sector 0 ten times
- test_sequential_read(): Read sectors 0-31 testing queue wrap-around
- test_write_read_verify(): Write-read-verify cycle (skips if readonly)
- test_invalid_sector(): Verify error handling for out-of-range sectors
- test_uninitialized_read(): Document error handling for uninitialized state

All tests registered in ARM64 test framework with appropriate timeouts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: On ARM64, both serial_println! and log_serial_println!
use the same SERIAL1 lock. If a timer interrupt fires while holding
this lock and something in the timer/scheduler path tries to log,
a deadlock occurs.

On x86_64, serial_println! uses COM1 and log_serial_println! uses
COM2 (separate hardware), so no deadlock is possible.

Fixes:

1. serial_aarch64.rs: Disable interrupts while holding serial lock
   - Saves DAIF register, masks IRQ/FIQ, performs output, restores
   - Prevents timer from firing during serial output

2. context_switch.rs: Remove all logging from context switch path
   - Replaced log::* and serial_println! with lock-free alternatives
   - Added raw_uart_str() calls for critical error paths only

3. scheduler.rs: Make all logging conditional on x86_64
   - Wrapped log_serial_println! calls with #[cfg(target_arch)]
   - Scheduler functions can be called from context switch path

4. trace.rs: New lock-free tracing infrastructure
   - Single atomic operation per trace event
   - Ring buffer for forensic debugging
   - Macros for common events (context switch, IRQ, syscall)

5. check-critical-path-violations.sh: Enforcement script
   - Scans critical path files for prohibited logging patterns
   - Lists allowed lock-free alternatives

Result: ARM64 boot tests now pass 5/5 consistently (was ~40% failure rate)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add run-aarch64-boot-test-strict.sh: 20-iteration test requiring
  100% success rate (no retries) for CI validation
- Add scheduler.rs to critical path violation checker
- Fix unguarded log::info! at scheduler.rs line 874
- Update boot test comments: remove outdated "known timing bug" note

The strict test revealed two separate pre-existing bugs:
1. Capacity overflow panic during process creation
2. Data abort after entering userspace
These are tracked separately from the deadlock fix.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update CLAUDE.md to remove Docker-only policy for x86-64
- Update run.sh to run x86_64 QEMU directly on host (no Docker wrapper)
- ARM64 already ran natively, this makes x86-64 consistent
- Simplify documentation around QEMU usage
- Both architectures now run QEMU natively for better performance
  and simpler debugging

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fix 1: Capacity overflow in stack address calculation
- In manager.rs, use checked_sub() to prevent integer underflow when
  calculating stack_bottom from stack_top
- Add validation in process_memory.rs to verify stack_bottom < stack_top
  before iterating, preventing Page::range_inclusive() capacity overflow

Fix 2: Data abort race condition in syscall return path
- Reordered syscall_entry.S to do TTBR0 switch BEFORE restoring x0/x1
- Use x0/x1 as scratch registers during TTBR0 switch (not restored yet)
- Restore x0/x1 from frame AFTER TTBR0 switch (frame still valid)
- Eliminates race where SP/TTBR0 could become inconsistent after context
  switch, causing reads from unmapped addresses

Both bugs caused ~10% failure rate, combined ~20% intermittent failures.
After fix: 20/20 strict boot tests pass (100% success rate).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ryanbreen ryanbreen merged commit e7a511f into main Jan 30, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants