refactor: use beacon chain deadlines for QBFT instance cleanup by diegomrsantos · Pull Request #719 · sigp/anchor

diegomrsantos · 2025-10-25T16:18:27Z

Issue Addressed

Fixes an instance cleanup issue where QBFT instances were cleaned up too early by slot-based timeouts (QBFT_RETAIN_SLOTS = 1), which prevented them from reaching later rounds and completing consensus.

Proposed Changes

Core Changes

- Refactored instance cleanup to use beacon chain inclusion deadlines instead of slot-based timeouts
- Each role now has a deadline based on EIP-7045 and consensus spec requirements:
  - Committee/Aggregator: End of epoch E+1 (attestation inclusion window)
  - Proposer/SyncCommittee: Same slot (immediate inclusion)
  - VoluntaryExit/ValidatorRegistration: One epoch window
- Instances are cleaned when:
  1. They complete successfully (via completion notification channel)
  2. Their beacon chain deadline expires (checked each slot)
- Added ManagedInstance struct tracking both channel and deadline
- Implemented dual cleanup mechanism: completion-based (immediate) and deadline-based (deferred)
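
As a rough illustration of the role-based deadlines listed above, here is a minimal sketch, not the PR's actual code. The names `Role` and `cleanup_deadline` are hypothetical. The Committee/Aggregator arm uses the formula quoted in the commit message, (E+2) * slots_per_epoch - 1. Reading "one epoch window" as `duty_slot + slots_per_epoch` is an assumption, and `slots_per_epoch` is assumed to be non-zero.

```rust
/// Duty roles named in the PR description (hypothetical enum for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Role {
    Committee,
    Aggregator,
    Proposer,
    SyncCommittee,
    VoluntaryExit,
    ValidatorRegistration,
}

/// Last slot at which an instance for `role`, started for `duty_slot`,
/// would still be kept. Assumes `slots_per_epoch > 0`.
pub fn cleanup_deadline(role: Role, duty_slot: u64, slots_per_epoch: u64) -> u64 {
    let epoch = duty_slot / slots_per_epoch;
    match role {
        // EIP-7045: an attestation from epoch E can be included until the
        // end of epoch E+1, i.e. the last slot before epoch E+2 starts.
        Role::Committee | Role::Aggregator => (epoch + 2) * slots_per_epoch - 1,
        // Same-slot inclusion: nothing to wait for after the duty slot.
        Role::Proposer | Role::SyncCommittee => duty_slot,
        // ASSUMPTION: "one epoch window" is read as one epoch after the duty slot.
        Role::VoluntaryExit | Role::ValidatorRegistration => duty_slot + slots_per_epoch,
    }
}
```

With 32 slots per epoch, a Committee duty at slot 0 would be kept until slot 63, while a Proposer duty at slot 40 would be kept only for slot 40.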

Test Coverage

- Added 5 comprehensive tests verifying deadline-based cleanup:
  - test_cleanup_removes_only_expired_instances - Verifies instances survive past old 2-slot timeout
  - test_instance_completion_notification - Tests immediate cleanup after successful completion
  - test_committee_can_reach_late_rounds - Verifies instances can reach round 10+ with max_round=12
  - test_cleanup_across_epoch_boundary - Tests deadline calculation across epoch transitions
  - test_multiple_instances_completing_rapidly - Verifies burst completion handling
- Added 6 focused tests for role-specific deadline calculations (Committee, Aggregator, Proposer, SyncCommittee, VoluntaryExit, ValidatorRegistration)

Code Quality

- Refactored all tests to follow Setup/Execute/Assert pattern with:
  - Clear section comments (// SETUP, // EXECUTE, // ASSERT)
  - Named constants replacing all magic numbers
  - Descriptive assertion messages
- Added mandatory test structure guidelines to CLAUDE.md and tester-subagent.md
- Enhanced documentation in lib.rs with detailed deadline calculation explanations
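
The Setup/Execute/Assert structure described above might look roughly like the following. This is a sketch under stated assumptions, not a test from the PR: `is_expired` is a hypothetical predicate that treats the deadline slot itself as still valid, and the constant values are illustrative. Inside a test module the function would carry `#[test]`.

```rust
/// Hypothetical expiry predicate: an instance is expired only once the
/// current slot is strictly past its deadline slot (an assumption).
fn is_expired(deadline_slot: u64, current_slot: u64) -> bool {
    current_slot > deadline_slot
}

// Inside a test module this function would be annotated with #[test].
fn instance_survives_past_old_two_slot_timeout() {
    // SETUP
    const DUTY_SLOT: u64 = 0;
    const OLD_CLEANUP_SLOT: u64 = DUTY_SLOT + 2;
    const BEACON_DEADLINE_SLOT: u64 = 63;

    // EXECUTE
    let expired_at_old_cleanup = is_expired(BEACON_DEADLINE_SLOT, OLD_CLEANUP_SLOT);
    let expired_after_deadline = is_expired(BEACON_DEADLINE_SLOT, BEACON_DEADLINE_SLOT + 1);

    // ASSERT
    assert!(
        !expired_at_old_cleanup,
        "instance must survive the slot at which the old cleanup would have removed it"
    );
    assert!(
        expired_after_deadline,
        "instance must be expired once the beacon chain deadline has passed"
    );
}
```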

Test Results

All 23 tests pass (up from 18 due to test refactoring that split one oversized test into 6 focused tests).

Additional Info

This aligns instance cleanup with actual beacon chain requirements rather than arbitrary slot-based timeouts, allowing instances to complete consensus within their protocol-defined windows.

anchor/qbft_manager/src/tests.rs

anchor/qbft_manager/src/lib.rs

dknopik · 2025-12-09T13:47:31Z

anchor/qbft_manager/src/lib.rs

+                // Branch 1: Instance completed - clean immediately
+                Some(id) = completion_rx.recv() => {
+                    match id {
+                        InstanceId::BeaconVote(id) => {
+                            self.beacon_vote_instances.remove(&id);
+                        }
+                        InstanceId::ValidatorConsensus(id) => {
+                            self.validator_consensus_data_instances.remove(&id);
+                        }
+                    }
+                }


There is a problem with this approach:

In theory, there might be a race condition where some tasks try to register their oneshot channel to an instance after it has completed. This might e.g. be the case if multiple validator attestation duties wait for the same committee instance. If we start the instance late (e.g. because of a struggling BN), the first thread will start the instance, which might complete immediately due to replayed messages, giving no opportunity for the other tasks to register their listeners. This is why the current code cleans up at a fixed time regardless of completion.

Instead, we could postpone the cleanup time in this branch rather than removing the entry right away, to give late callers some time (until the end of the next slot?) to get the instance result. Wdyt?

I need more context to understand what's described in the first paragraph.

Good catch. This race exists, but it is pre-existing, not introduced by this PR.

Before this PR, the instance task stays alive in Decided state and the registry entry lingers until the slot-based cleaner removes it. Once the cleaner drops the entry, the tx is dropped, rx.recv() returns None, and the task exits. A late caller after that point hits Vacant, spawns a new instance, and hangs until timeout. The grace window is larger under the old cleanup scheme, but the underlying late-caller behavior is still there.

This PR does make that behavior easier to hit by breaking out of the loop on Decided and removing the entry immediately via completion notification.

The broader issue is that independent code paths can call decide_instance for the same CommitteeInstanceId at different times. Grouping validators per committee, as in #834, improves that by reducing duplicate local callers within a duty path. Applying the same pattern to sync committee signing would help for the same reason.

But that change alone does not fully eliminate the late-caller behavior; that still depends on the cleanup / lifecycle semantics in qbft_manager.

We could modify ManagedInstance to either hold a channel sender to a running instance, or a resulting value D obtained from a finished instance to accommodate late callers. This would allow us to clean up the instance as soon as it is finished by dropping the sender and storing the finished value. I am unsure how long the resulting value should be kept - 1 slot?

As you said, the underlying root cause is the possibility of late callers, but I am unsure how to prevent this without a major refactor.
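
The suggestion of letting ManagedInstance hold either a sender or a finished value D could be sketched as below. This is only an illustration of the idea, not code from the PR: the variant and method names, the `retain_slots` parameter, and the use of `std::sync::mpsc` in place of the manager's real channel type are all assumptions.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical two-state registry entry: a channel to a running instance,
/// or the decided value kept around for late callers.
pub enum ManagedInstance<M, D> {
    Running { tx: Sender<M>, deadline_slot: u64 },
    Finished { value: D, keep_until_slot: u64 },
}

impl<M, D: Clone> ManagedInstance<M, D> {
    /// On completion, replace the entry with the decided value. Overwriting
    /// `self` drops the sender, so the instance task can exit right away.
    pub fn finish(&mut self, value: D, current_slot: u64, retain_slots: u64) {
        *self = Self::Finished {
            value,
            keep_until_slot: current_slot + retain_slots,
        };
    }

    /// What a late caller would get: the decided value, if there is one.
    pub fn decided(&self) -> Option<D> {
        match self {
            Self::Finished { value, .. } => Some(value.clone()),
            Self::Running { .. } => None,
        }
    }

    /// Whether the per-slot cleaner may drop this entry.
    pub fn is_expired(&self, current_slot: u64) -> bool {
        match self {
            Self::Running { deadline_slot, .. } => current_slot > *deadline_slot,
            Self::Finished { keep_until_slot, .. } => current_slot > *keep_until_slot,
        }
    }
}
```

With `retain_slots = 1`, a value decided at slot 5 would stay available through slot 6 and become removable at slot 7, which matches the "1 slot?" retention floated above.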

Adds test to verify QBFT Committee instances can reach late rounds (9+) as configured with max_round=12. The test creates a Committee instance, forces round changes by keeping operators offline, then advances through multiple slots while verifying the instance survives to reach round 10. Currently fails - instance is cleaned up after 2 slots, reaching round 9 but unable to complete it (needs 120s, gets 8s).

Replace slot-based cleanup with duty-specific beacon chain inclusion deadlines. This allows QBFT instances to progress through all configured rounds without premature removal.

Key changes:
- Separate instance identity from manager metadata using ManagedInstance wrapper
- Calculate duty-specific deadlines per EIP-7045 (attestations valid until end of epoch E+1)
- Add slots_per_epoch configuration parameter
- Implement dual-trigger cleaner (completion notification + deadline timeout)

Fixes instances being cleaned after 2 slots, now properly respecting beacon chain inclusion windows (32-63 slots for attestations).
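
The dual-trigger cleaner in this commit can be pictured with the following minimal sketch. It is not the PR's implementation: the field names, the `u64` instance key, and the helper names `remove_expired` and `remove_completed` are hypothetical, and `std::sync::mpsc` stands in for the async channel that the real manager selects on.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

/// Hypothetical stand-in for the PR's ManagedInstance wrapper: the channel
/// to a running instance plus its beacon chain deadline slot.
pub struct ManagedInstance<M> {
    pub tx: Sender<M>,
    pub deadline_slot: u64,
}

/// Deadline trigger, meant to run once per slot: drops every entry whose
/// deadline has passed and returns how many were removed.
/// ASSUMPTION: the deadline slot itself still counts as valid.
pub fn remove_expired<M>(
    instances: &mut HashMap<u64, ManagedInstance<M>>,
    current_slot: u64,
) -> usize {
    let before = instances.len();
    instances.retain(|_, instance| instance.deadline_slot >= current_slot);
    before - instances.len()
}

/// Completion trigger: removes the entry as soon as the instance reports
/// that it is done. Returns false if the entry was already gone.
pub fn remove_completed<M>(
    instances: &mut HashMap<u64, ManagedInstance<M>>,
    id: u64,
) -> bool {
    instances.remove(&id).is_some()
}
```

Either way, removing an entry drops its sender, which is what lets the receiving instance task notice that nobody is listening and exit.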

Improve test readability by applying Setup/Execute/Assert structure:
- Replace magic numbers with named constants (SINGLE_INSTANCE, TWO_INSTANCES, etc.)
- Add mandatory section comments (// SETUP, // EXECUTE, // ASSERT) to all new tests
- Split oversized test_role_based_deadline_calculations into 6 focused tests (one per role)
- Add descriptive assertion messages explaining what must be true
- Named all literals in new tests (OLD_CLEANUP_SLOT, BEACON_DEADLINE_SLOT, etc.)

All 23 tests pass (up from 18 due to role deadline test split).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Add detailed explanation for Committee/Aggregator deadline calculation:
- Document the calculation formula: (E+2) * slots_per_epoch - 1
- Explain that this represents the last slot for on-chain inclusion
- Reference EIP-7045 specification

Enhance ManagedInstance documentation:
- Convert to doc comment for better API documentation
- Clarify that it tracks both channel and beacon chain deadline
- Explain its role in the cleanup task

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
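
As a worked check of the Committee/Aggregator formula in this commit, the derivation below shows where the 32-63 slot window comes from. The symbols are introduced here for illustration only: S is slots_per_epoch, s is the duty slot, E its epoch, k the slot's offset within the epoch, and d the deadline slot.

```latex
\begin{aligned}
s &= E \cdot S + k, \qquad 0 \le k \le S - 1 \\
d &= (E + 2)\,S - 1 \\
d - s &= 2S - 1 - k \;\in\; [\,S,\; 2S - 1\,] \\
S = 32 &\;\Rightarrow\; d - s \in [\,32,\; 63\,]
\end{aligned}
```

So an attestation duty in the first slot of an epoch would keep its instance for 63 more slots, and one in the last slot of an epoch for 32 more.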

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude-code-actions-sigp bot reviewed Oct 25, 2025: anchor/qbft_manager/src/tests.rs (outdated, resolved)

claude-code-actions-sigp bot reviewed Oct 25, 2025: anchor/qbft_manager/src/tests.rs (resolved)

claude-code-actions-sigp bot reviewed Oct 25, 2025: anchor/qbft_manager/src/tests.rs (resolved)

claude-code-actions-sigp bot reviewed Oct 25, 2025: anchor/qbft_manager/src/tests.rs (resolved)

diegomrsantos mentioned this pull request Oct 25, 2025 in "QBFT instances cleaned up too aggressively, preventing late rounds" #720 (Open)

diegomrsantos changed the base branch from stable to unstable October 25, 2025 16:35

diegomrsantos force-pushed the test/qbft-late-rounds branch from 4cd49f3 to 1d995b4 on October 28, 2025 20:47

diegomrsantos marked this pull request as draft October 28, 2025 20:48

diegomrsantos force-pushed the test/qbft-late-rounds branch 2 times, most recently from 2cc3d0d to 4767f9d, on October 28, 2025 23:26

diegomrsantos added the claude-recheck label (triggers claude review workflow to re-run) Oct 29, 2025

diegomrsantos self-assigned this Oct 29, 2025

diegomrsantos added the QBFT label and removed the claude-recheck label (triggers claude review workflow to re-run) Oct 29, 2025

diegomrsantos marked this pull request as ready for review October 29, 2025 19:23

claude-code-actions-sigp bot reviewed Oct 29, 2025: anchor/qbft_manager/src/tests.rs (outdated, resolved)

claude-code-actions-sigp bot reviewed Oct 29, 2025: anchor/qbft_manager/src/tests.rs (outdated, resolved)

claude-code-actions-sigp bot reviewed Oct 29, 2025: anchor/qbft_manager/src/lib.rs (resolved)

claude-code-actions-sigp bot reviewed Oct 29, 2025: anchor/qbft_manager/src/lib.rs (outdated, resolved)

dknopik added the v1.2.0 label Nov 7, 2025

diegomrsantos changed the title ~~test: add test for Committee instances reaching late rounds~~ refactor: use beacon chain deadlines for QBFT instance cleanup Nov 12, 2025

diegomrsantos requested a review from dknopik November 12, 2025 13:08

dknopik removed the v1.2.0 label Dec 9, 2025

dknopik reviewed Dec 9, 2025

diegomrsantos and others added 4 commits March 9, 2026 23:35

style: apply formatting after rebase

8347f06

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

diegomrsantos force-pushed the test/qbft-late-rounds branch from 739902c to 8347f06 on March 9, 2026 23:19

diegomrsantos marked this pull request as draft March 10, 2026 21:47

diegomrsantos wants to merge 5 commits into sigp:unstable from diegomrsantos:test/qbft-late-rounds

Conversation comments: dknopik (Dec 9, 2025), diegomrsantos (Dec 9, 2025), diegomrsantos (Mar 10, 2026, edited), dknopik (Mar 11, 2026)

2 participants
