-
Notifications
You must be signed in to change notification settings - Fork 0
arm64: fix boot stability issues and remove Docker requirement #134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
- Add native ARM64 boot test script (run-aarch64-boot-test-native.sh) - Uses retry logic (5 attempts) to handle timing-related flakiness - Validates shell startup and checks for excessive init_shell spawning - Fast (~8 seconds per attempt with 40% success rate) - Add Docker ARM64 boot test script (run-aarch64-boot-test.sh) - Uses only ext2 disk (test_binaries disk triggers VirtIO driver bug) - Docker QEMU is slower than native; needs more investigation Known issues: - ARM64 boot has ~40% success rate due to timing bug - With 5 retries, expected to pass 92% of the time - Root cause appears to be timer interrupt interference during page table operations Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Docker is unnecessary for ARM64 QEMU: - ARM64 QEMU uses software emulation, not Hypervisor.framework - No system instability risk like x86_64 QEMU - Native QEMU is faster and supports VNC for UI testing Keep only the native test script (run-aarch64-boot-test-native.sh). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ARM64: - Native QEMU (no Docker) - no Hypervisor.framework issues - Uses cocoa display for native window (no VNC needed) - Better performance than Docker x86_64: - Docker-based QEMU (required for Hypervisor.framework stability) - Uses VNC display at localhost:5900 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause investigation found the ARM64 VirtIO block driver was using only fence(SeqCst) (DMB) before notify_queue(), but ARM64 requires DSB to ensure descriptor writes reach memory before the device is notified. Fixes: - Add dsb_sy() call before both notify_queue() calls in block_mmio.rs - Import dsb_sy from cpu.rs instead of duplicating implementation - Add device_features tracking to detect read-only devices New tests added to improve coverage (per validation feedback): - test_multi_read(): Stress test reading sector 0 ten times - test_sequential_read(): Read sectors 0-31 testing queue wrap-around - test_write_read_verify(): Write-read-verify cycle (skips if readonly) - test_invalid_sector(): Verify error handling for out-of-range sectors - test_uninitialized_read(): Document error handling for uninitialized state All tests registered in ARM64 test framework with appropriate timeouts. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: On ARM64, both serial_println! and log_serial_println! use the same SERIAL1 lock. If a timer interrupt fires while holding this lock and something in the timer/scheduler path tries to log, a deadlock occurs. On x86_64, serial_println! uses COM1 and log_serial_println! uses COM2 (separate hardware), so no deadlock is possible. Fixes: 1. serial_aarch64.rs: Disable interrupts while holding serial lock - Saves DAIF register, masks IRQ/FIQ, performs output, restores - Prevents timer from firing during serial output 2. context_switch.rs: Remove all logging from context switch path - Replaced log::* and serial_println! with lock-free alternatives - Added raw_uart_str() calls for critical error paths only 3. scheduler.rs: Make all logging conditional on x86_64 - Wrapped log_serial_println! calls with #[cfg(target_arch)] - Scheduler functions can be called from context switch path 4. trace.rs: New lock-free tracing infrastructure - Single atomic operation per trace event - Ring buffer for forensic debugging - Macros for common events (context switch, IRQ, syscall) 5. check-critical-path-violations.sh: Enforcement script - Scans critical path files for prohibited logging patterns - Lists allowed lock-free alternatives Result: ARM64 boot tests now pass 5/5 consistently (was ~40% failure rate) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add run-aarch64-boot-test-strict.sh: 20-iteration test requiring 100% success rate (no retries) for CI validation - Add scheduler.rs to critical path violation checker - Fix unguarded log::info! at scheduler.rs line 874 - Update boot test comments: remove outdated "known timing bug" note The strict test revealed two separate pre-existing bugs: 1. Capacity overflow panic during process creation 2. Data abort after entering userspace These are tracked separately from the deadlock fix. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update CLAUDE.md to remove Docker-only policy for x86-64 - Update run.sh to run x86_64 QEMU directly on host (no Docker wrapper) - ARM64 already ran natively, this makes x86-64 consistent - Simplify documentation around QEMU usage - Both architectures now run QEMU natively for better performance and simpler debugging Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fix 1: Capacity overflow in stack address calculation - In manager.rs, use checked_sub() to prevent integer underflow when calculating stack_bottom from stack_top - Add validation in process_memory.rs to verify stack_bottom < stack_top before iterating, preventing Page::range_inclusive() capacity overflow Fix 2: Data abort race condition in syscall return path - Reordered syscall_entry.S to do TTBR0 switch BEFORE restoring x0/x1 - Use x0/x1 as scratch registers during TTBR0 switch (not restored yet) - Restore x0/x1 from frame AFTER TTBR0 switch (frame still valid) - Eliminates race where SP/TTBR0 could become inconsistent after context switch, causing reads from unmapped addresses Both bugs caused ~10% failure rate, combined ~20% intermittent failures. After fix: 20/20 strict boot tests pass (100% success rate). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Root Causes Fixed
1. Serial Logging Deadlock (40% failure)
ARM64 has only one UART, so both
serial_println!andlog_serial_println!use the sameSERIAL1lock. Timer interrupts firing while holding this lock caused deadlocks when the scheduler tried to log.Fix: Disable interrupts during serial output on ARM64 + guard scheduler logging with
#[cfg(target_arch = "x86_64")].2. Capacity Overflow (10% failure)
Integer underflow in
stack_top - USER_STACK_SIZEcreated invalid page ranges, triggeringPage::range_inclusive()capacity overflow.Fix: Use
checked_sub()and add validation inmap_user_stack_to_process.3. Syscall Return Race (10% failure)
The TTBR0 switch code used SP with
stp/ldpbefore and after changing TTBR0. If a context switch occurred, SP and TTBR0 became inconsistent.Fix: Reorder operations to do TTBR0 switch before restoring x0/x1, using them as scratch registers.
Test Results
After all fixes: 20/20 strict boot tests pass (100% success rate)
Test plan
🤖 Generated with Claude Code