refactor: use beacon chain deadlines for QBFT instance cleanup #719
diegomrsantos wants to merge 5 commits into sigp:unstable
```rust
// Branch 1: Instance completed - clean immediately
Some(id) = completion_rx.recv() => {
    match id {
        InstanceId::BeaconVote(id) => {
            self.beacon_vote_instances.remove(&id);
        }
        InstanceId::ValidatorConsensus(id) => {
            self.validator_consensus_data_instances.remove(&id);
        }
    }
}
```
There is a problem with this approach:
In theory, there might be a race condition where some tasks try to register their oneshot channel to an instance after it has completed. This might e.g. be the case if multiple validator attestation duties wait for the same committee instance. If we start the instance late (e.g. because of a struggling BN), the first thread will start the instance, which might complete immediately due to replayed messages, giving no opportunity for the other tasks to register their listeners. This is why the current code cleans up at a fixed time regardless of completion.
Instead of removing the entry, we could move its cleanup time in this branch, to give late callers some time (until the end of the next slot?) to get the instance result. Wdyt?
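A minimal sketch of that suggestion, under assumptions: on completion the entry is kept, but its cleanup slot is pulled forward to the end of the next slot. `Entry`, `cleanup_slot`, `on_completed`, and `sweep` are hypothetical names for illustration, not code from this PR, and instance IDs are simplified to `u64`.

```rust
use std::collections::HashMap;

/// Hypothetical registry entry: only the last slot at which it is retained.
struct Entry {
    cleanup_slot: u64,
}

/// On completion, keep the entry but shorten its retention to the end of the
/// next slot, so late callers can still fetch the result.
/// `min` ensures an entry is never kept past its original deadline.
fn on_completed(instances: &mut HashMap<u64, Entry>, id: u64, current_slot: u64) {
    if let Some(entry) = instances.get_mut(&id) {
        entry.cleanup_slot = entry.cleanup_slot.min(current_slot + 1);
    }
}

/// Periodic sweep: drop entries whose cleanup slot has passed.
fn sweep(instances: &mut HashMap<u64, Entry>, current_slot: u64) {
    instances.retain(|_, entry| entry.cleanup_slot >= current_slot);
}
```

With this shape, an instance that completes in slot 10 stays in the registry through slot 11 and is removed by the first sweep in slot 12.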
I need more context to understand what's described in the first paragraph.
Good catch. This race exists, but it is pre-existing, not introduced by this PR.
Before this PR, the instance task stays alive in Decided state and the registry entry lingers until the slot-based cleaner removes it. Once the cleaner drops the entry, the tx is dropped, rx.recv() returns None, and the task exits. A late caller after that point hits Vacant, spawns a new instance, and hangs until timeout. The grace window is larger under the old cleanup scheme, but the underlying late-caller behavior is still there.
This PR does make that behavior easier to hit by breaking out of the loop on Decided and removing the entry immediately via completion notification.
The broader issue is that independent code paths can call decide_instance for the same CommitteeInstanceId at different times. Grouping validators per committee, as in #834, improves that by reducing duplicate local callers within a duty path. Applying the same pattern to sync committee signing would help for the same reason.
But that change alone does not fully eliminate the late-caller behavior; that still depends on the cleanup / lifecycle semantics in qbft_manager.
We could modify ManagedInstance to either hold a channel sender to a running instance, or a resulting value D obtained from a finished instance to accommodate late callers. This would allow us to clean up the instance as soon as it is finished by dropping the sender and storing the finished value. I am unsure how long the resulting value should be kept - 1 slot?
As you said, the underlying root cause is the possibility of late callers, but I am unsure how to prevent this without a major refactor.
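One way the two-state idea could look, as a sketch only: the entry is either a sender to a running instance or a stored result with a retention slot. The variant names, `InstanceMessage`, `retain_until_slot`, and the helper methods are assumptions for illustration; the sketch uses `std::sync::mpsc` to stay self-contained, whereas the real manager would use its own channel types.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical message type sent to a running instance task.
struct InstanceMessage;

/// Sketch of a registry entry that is either still running or already decided.
enum ManagedInstance<D> {
    /// Instance task is alive; callers register through this sender.
    Running { tx: Sender<InstanceMessage> },
    /// Instance finished; the sender is gone and the value is retained
    /// until `retain_until_slot` for late callers.
    Finished { value: D, retain_until_slot: u64 },
}

impl<D: Clone> ManagedInstance<D> {
    /// Transition to `Finished`. Overwriting `self` drops the sender,
    /// which lets the instance task observe a closed channel and exit.
    fn finish(&mut self, value: D, current_slot: u64, retain_slots: u64) {
        *self = ManagedInstance::Finished {
            value,
            retain_until_slot: current_slot + retain_slots,
        };
    }

    /// A late caller gets the stored value instead of spawning a new instance.
    fn decided_value(&self) -> Option<D> {
        match self {
            ManagedInstance::Finished { value, .. } => Some(value.clone()),
            ManagedInstance::Running { .. } => None,
        }
    }

    /// True once a finished entry has outlived its retention window.
    fn is_expired(&self, current_slot: u64) -> bool {
        matches!(self, ManagedInstance::Finished { retain_until_slot, .. }
            if current_slot > *retain_until_slot)
    }
}
```

The retention length is left as the `retain_slots` parameter because the right value (1 slot?) is the open question raised above.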
Adds test to verify QBFT Committee instances can reach late rounds (9+) as configured with max_round=12. The test creates a Committee instance, forces round changes by keeping operators offline, then advances through multiple slots while verifying the instance survives to reach round 10. Currently fails - instance is cleaned up after 2 slots, reaching round 9 but unable to complete it (needs 120s, gets 8s).
Replace slot-based cleanup with duty-specific beacon chain inclusion deadlines. This allows QBFT instances to progress through all configured rounds without premature removal.

Key changes:
- Separate instance identity from manager metadata using ManagedInstance wrapper
- Calculate duty-specific deadlines per EIP-7045 (attestations valid until end of epoch E+1)
- Add slots_per_epoch configuration parameter
- Implement dual-trigger cleaner (completion notification + deadline timeout)

Fixes instances being cleaned after 2 slots, now properly respecting beacon chain inclusion windows (32-63 slots for attestations).
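The dual-trigger cleaner described in this commit can be illustrated with a synchronous sketch. This is not the PR's implementation (which, judging by the reviewed snippet, runs as branches of an async select loop); `Tracked`, `deadline_slot`, and `clean_once` are hypothetical names, and instance IDs are simplified to `u64`.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;

/// Hypothetical entry: the last slot at which the duty can still be included.
struct Tracked {
    deadline_slot: u64,
}

/// One pass of a dual-trigger cleaner.
/// Trigger 1: drain pending completion notifications and remove those instances.
/// Trigger 2: remove instances whose deadline slot has passed.
/// Returns the number of removed entries.
fn clean_once(
    instances: &mut HashMap<u64, Tracked>,
    completions: &Receiver<u64>,
    current_slot: u64,
) -> usize {
    let before = instances.len();
    // `try_iter` yields only already-queued notifications and never blocks.
    for id in completions.try_iter() {
        instances.remove(&id);
    }
    instances.retain(|_, inst| inst.deadline_slot >= current_slot);
    before - instances.len()
}
```

Calling `clean_once` once per slot tick covers both triggers; an unfinished instance survives until its deadline slot has passed, rather than a fixed number of slots.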
Improve test readability by applying Setup/Execute/Assert structure:
- Replace magic numbers with named constants (SINGLE_INSTANCE, TWO_INSTANCES, etc.)
- Add mandatory section comments (// SETUP, // EXECUTE, // ASSERT) to all new tests
- Split oversized test_role_based_deadline_calculations into 6 focused tests (one per role)
- Add descriptive assertion messages explaining what must be true
- Named all literals in new tests (OLD_CLEANUP_SLOT, BEACON_DEADLINE_SLOT, etc.)

All 23 tests pass (up from 18 due to role deadline test split).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add detailed explanation for Committee/Aggregator deadline calculation:
- Document the calculation formula: (E+2) * slots_per_epoch - 1
- Explain that this represents the last slot for on-chain inclusion
- Reference EIP-7045 specification

Enhance ManagedInstance documentation:
- Convert to doc comment for better API documentation
- Clarify that it tracks both channel and beacon chain deadline
- Explain its role in the cleanup task

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
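As a worked illustration of the formula (E+2) * slots_per_epoch - 1, where E is the epoch containing the duty slot: the function name `committee_deadline_slot` is hypothetical, and the sketch assumes slots and epochs are plain `u64` values.

```rust
/// Last slot at which a Committee/Aggregator duty from `duty_slot` can still
/// be included on chain: (E + 2) * slots_per_epoch - 1, i.e. the final slot
/// of epoch E + 1, where E is the epoch containing `duty_slot`.
fn committee_deadline_slot(duty_slot: u64, slots_per_epoch: u64) -> u64 {
    let epoch = duty_slot / slots_per_epoch;
    (epoch + 2) * slots_per_epoch - 1
}
```

With 32 slots per epoch, a duty at slot 0 has deadline slot 63 (a 63-slot window) and a duty at slot 31 has the same deadline (a 32-slot window), which matches the 32-63 slot range quoted for attestations.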
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Issue Addressed
Fixes an issue where QBFT instances were cleaned up too early by a slot-based timeout (QBFT_RETAIN_SLOTS = 1), which prevented them from reaching later rounds and completing consensus.
Proposed Changes
Core Changes
- ManagedInstance struct tracking both channel and deadline

Test Coverage
- test_cleanup_removes_only_expired_instances - Verifies instances survive past old 2-slot timeout
- test_instance_completion_notification - Tests immediate cleanup after successful completion
- test_committee_can_reach_late_rounds - Verifies instances can reach round 10+ with max_round=12
- test_cleanup_across_epoch_boundary - Tests deadline calculation across epoch transitions
- test_multiple_instances_completing_rapidly - Verifies burst completion handling

Code Quality
- Section comments (// SETUP, // EXECUTE, // ASSERT)

Test Results
All 23 tests pass (up from 18 due to test refactoring that split one oversized test into 6 focused tests).
Additional Info
This aligns instance cleanup with actual beacon chain requirements rather than arbitrary slot-based timeouts, allowing instances to complete consensus within their protocol-defined windows.