refactor: use beacon chain deadlines for QBFT instance cleanup #719

Draft
diegomrsantos wants to merge 5 commits into sigp:unstable from diegomrsantos:test/qbft-late-rounds

@diegomrsantos diegomrsantos commented Oct 25, 2025

Issue Addressed

Fixes an instance cleanup issue where QBFT instances were removed too early by the slot-based timeout (QBFT_RETAIN_SLOTS = 1), which prevented them from reaching later rounds and completing consensus.

Proposed Changes

Core Changes

  • Refactored instance cleanup to use beacon chain inclusion deadlines instead of slot-based timeouts
  • Each role now has a deadline based on EIP-7045 and consensus spec requirements:
    • Committee/Aggregator: End of epoch E+1 (attestation inclusion window)
    • Proposer/SyncCommittee: Same slot (immediate inclusion)
    • VoluntaryExit/ValidatorRegistration: One epoch window
  • Instances are cleaned when:
    1. They complete successfully (via completion notification channel)
    2. Their beacon chain deadline expires (checked each slot)
  • Added ManagedInstance struct tracking both channel and deadline
  • Implemented dual cleanup mechanism: completion-based (immediate) and deadline-based (deferred)
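
A minimal sketch of the deadline rules and the per-slot deadline sweep listed above, not the code from this PR. The `Role` enum, `deadline_slot`, `remove_expired`, and the generic `handle` field are hypothetical names chosen for illustration. The "one epoch window" for VoluntaryExit/ValidatorRegistration is read here as `slot + slots_per_epoch`, and an instance is assumed to be kept through its deadline slot; both are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical duty roles, named after the roles listed in the description.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Role {
    Committee,
    Aggregator,
    Proposer,
    SyncCommittee,
    VoluntaryExit,
    ValidatorRegistration,
}

/// Last slot at which an instance for a duty in `slot` is still kept.
fn deadline_slot(role: Role, slot: u64, slots_per_epoch: u64) -> u64 {
    let epoch = slot / slots_per_epoch;
    match role {
        // End of epoch E+1: the last slot is (E + 2) * slots_per_epoch - 1.
        Role::Committee | Role::Aggregator => (epoch + 2) * slots_per_epoch - 1,
        // Same-slot inclusion.
        Role::Proposer | Role::SyncCommittee => slot,
        // ASSUMPTION: "one epoch window" interpreted as slot + slots_per_epoch.
        Role::VoluntaryExit | Role::ValidatorRegistration => slot + slots_per_epoch,
    }
}

/// Simplified stand-in for `ManagedInstance`: `handle` replaces the real channel.
#[allow(dead_code)]
struct ManagedInstance<T> {
    handle: T,
    deadline: u64,
}

/// Deadline-based sweep, meant to run once per slot.
/// ASSUMPTION: an instance is still alive during its deadline slot.
fn remove_expired<T>(instances: &mut HashMap<u64, ManagedInstance<T>>, current_slot: u64) {
    instances.retain(|_, inst| inst.deadline >= current_slot);
}
```

With 32 slots per epoch, a Committee duty at slot 0 gets deadline slot 63 and one at slot 31 also gets 63, so the instance lives between 32 and 63 slots past its duty slot.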

Test Coverage

  • Added 5 comprehensive tests verifying deadline-based cleanup:
    • test_cleanup_removes_only_expired_instances - Verifies instances survive past old 2-slot timeout
    • test_instance_completion_notification - Tests immediate cleanup after successful completion
    • test_committee_can_reach_late_rounds - Verifies instances can reach round 10+ with max_round=12
    • test_cleanup_across_epoch_boundary - Tests deadline calculation across epoch transitions
    • test_multiple_instances_completing_rapidly - Verifies burst completion handling
  • Added 6 focused tests for role-specific deadline calculations (Committee, Aggregator, Proposer, SyncCommittee, VoluntaryExit, ValidatorRegistration)

Code Quality

  • Refactored all tests to follow Setup/Execute/Assert pattern with:
    • Clear section comments (// SETUP, // EXECUTE, // ASSERT)
    • Named constants replacing all magic numbers
    • Descriptive assertion messages
  • Added mandatory test structure guidelines to CLAUDE.md and tester-subagent.md
  • Enhanced documentation in lib.rs with detailed deadline calculation explanations
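
For illustration, a test in the Setup/Execute/Assert shape described above might look roughly like this. The helper `committee_deadline` and the constant names are hypothetical and not taken from the PR's test suite.

```rust
/// Hypothetical helper: last slot of epoch E+1 for a duty in `slot`.
fn committee_deadline(slot: u64, slots_per_epoch: u64) -> u64 {
    (slot / slots_per_epoch + 2) * slots_per_epoch - 1
}

#[test]
fn test_committee_deadline_is_end_of_next_epoch() {
    // SETUP
    const SLOTS_PER_EPOCH: u64 = 32;
    const DUTY_SLOT: u64 = 40; // falls in epoch 1
    const EXPECTED_DEADLINE_SLOT: u64 = 95; // last slot of epoch 2

    // EXECUTE
    let deadline = committee_deadline(DUTY_SLOT, SLOTS_PER_EPOCH);

    // ASSERT
    assert_eq!(
        deadline, EXPECTED_DEADLINE_SLOT,
        "a Committee duty must be kept until the last slot of epoch E+1"
    );
}
```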

Test Results

All 23 tests pass (up from 18 due to test refactoring that split one oversized test into 6 focused tests).

Additional Info

This aligns instance cleanup with actual beacon chain requirements rather than arbitrary slot-based timeouts, allowing instances to complete consensus within their protocol-defined windows.

@claude-code-actions-sigp

This comment was marked as outdated.

@diegomrsantos diegomrsantos changed the base branch from stable to unstable October 25, 2025 16:35
@diegomrsantos diegomrsantos marked this pull request as draft October 28, 2025 20:48
@diegomrsantos diegomrsantos force-pushed the test/qbft-late-rounds branch 2 times, most recently from 2cc3d0d to 4767f9d Compare October 28, 2025 23:26
@diegomrsantos diegomrsantos added the claude-recheck triggers claude review workflow to re-run label Oct 29, 2025
@diegomrsantos diegomrsantos self-assigned this Oct 29, 2025
@diegomrsantos diegomrsantos added QBFT and removed claude-recheck triggers claude review workflow to re-run labels Oct 29, 2025
@diegomrsantos diegomrsantos marked this pull request as ready for review October 29, 2025 19:23
@claude-code-actions-sigp

This comment was marked as outdated.

@dknopik dknopik added the v1.2.0 label Nov 7, 2025
@diegomrsantos diegomrsantos changed the title test: add test for Committee instances reaching late rounds refactor: use beacon chain deadlines for QBFT instance cleanup Nov 12, 2025
@dknopik dknopik removed the v1.2.0 label Dec 9, 2025
Comment on lines +342 to +352
// Branch 1: Instance completed - clean immediately
Some(id) = completion_rx.recv() => {
    match id {
        InstanceId::BeaconVote(id) => {
            self.beacon_vote_instances.remove(&id);
        }
        InstanceId::ValidatorConsensus(id) => {
            self.validator_consensus_data_instances.remove(&id);
        }
    }
}
Member

There is a problem with this approach:

In theory, there might be a race condition where some tasks try to register their oneshot channel to an instance after it has completed. This might e.g. be the case if multiple validator attestation duties wait for the same committee instance. If we start the instance late (e.g. because of a struggling BN), the first thread will start the instance, which might complete immediately due to replayed messages, giving no opportunity for the other tasks to register their listeners. This is why the current code cleans up at a fixed time regardless of completion.

Instead, we could postpone the cleanup in this branch to give late callers some time (until the end of the next slot?) to get the instance result. What do you think?

Member Author

I need more context to understand what's described in the first paragraph.

Member Author

@diegomrsantos diegomrsantos Mar 10, 2026

Good catch. This race exists, but it is pre-existing, not introduced by this PR.

Before this PR, the instance task stays alive in Decided state and the registry entry lingers until the slot-based cleaner removes it. Once the cleaner drops the entry, the tx is dropped, rx.recv() returns None, and the task exits. A late caller after that point hits Vacant, spawns a new instance, and hangs until timeout. The grace window is larger under the old cleanup scheme, but the underlying late-caller behavior is still there.

This PR does make that behavior easier to hit by breaking out of the loop on Decided and removing the entry immediately via completion notification.

The broader issue is that independent code paths can call decide_instance for the same CommitteeInstanceId at different times. Grouping validators per committee, as in #834, improves that by reducing duplicate local callers within a duty path. Applying the same pattern to sync committee signing would help for the same reason.

But that change alone does not fully eliminate the late-caller behavior; that still depends on the cleanup / lifecycle semantics in qbft_manager.
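
To make the late-caller behavior concrete, here is a heavily simplified sketch, not the actual qbft_manager code. `Registry`, `Outcome`, and `on_completed` are hypothetical; `decide_instance` only borrows the name used above and models nothing but the Occupied/Vacant lookup.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Outcome {
    JoinedExisting,
    SpawnedNew,
}

/// Hypothetical registry: instance id -> number of registered listeners.
struct Registry {
    instances: HashMap<u64, usize>,
}

impl Registry {
    /// A caller either attaches to a live entry or, on `Vacant`, spawns a new one.
    fn decide_instance(&mut self, id: u64) -> Outcome {
        match self.instances.entry(id) {
            Entry::Occupied(mut e) => {
                *e.get_mut() += 1;
                Outcome::JoinedExisting
            }
            Entry::Vacant(e) => {
                e.insert(1);
                Outcome::SpawnedNew
            }
        }
    }

    /// Completion-based cleanup: the entry disappears as soon as the instance decides.
    fn on_completed(&mut self, id: u64) {
        self.instances.remove(&id);
    }
}
```

In this model, a second caller that arrives before `on_completed` joins the existing entry, while one that arrives afterwards hits `Vacant` and spawns a fresh instance that the rest of the committee has already finished.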

Member

We could modify ManagedInstance to either hold a channel sender to a running instance, or a resulting value D obtained from a finished instance to accommodate late callers. This would allow us to clean up the instance as soon as it is finished by dropping the sender and storing the finished value. I am unsure how long the resulting value should be kept - 1 slot?

As you said, the underlying root cause is the possibility of late callers, but I am unsure how to prevent this without a major refactor.
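
A rough sketch of that idea, under stated assumptions: `ManagedInstance` becomes an enum, `std::sync::mpsc::Sender` stands in for whatever channel type the manager really uses, and `RETAIN_SLOTS = 1` is just the "1 slot?" guess from above. `Registry`, `on_completed`, `cached_result`, and `sweep` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

/// Hypothetical reshaping of `ManagedInstance`: either running or finished.
#[allow(dead_code)]
enum ManagedInstance<M, D> {
    /// Sender to the running instance plus its beacon chain deadline slot.
    Running { tx: Sender<M>, deadline: u64 },
    /// Decided value, kept for late callers until `keep_until`.
    Finished { value: D, keep_until: u64 },
}

struct Registry<M, D> {
    instances: HashMap<u64, ManagedInstance<M, D>>,
}

impl<M, D: Clone> Registry<M, D> {
    /// On completion, replace the entry: the sender is dropped, the value is cached.
    fn on_completed(&mut self, id: u64, value: D, current_slot: u64) {
        const RETAIN_SLOTS: u64 = 1; // ASSUMPTION: open question in the thread
        let keep_until = current_slot + RETAIN_SLOTS;
        self.instances
            .insert(id, ManagedInstance::Finished { value, keep_until });
    }

    /// A late caller can read the cached result instead of spawning a new instance.
    fn cached_result(&self, id: u64) -> Option<D> {
        match self.instances.get(&id) {
            Some(ManagedInstance::Finished { value, .. }) => Some(value.clone()),
            _ => None,
        }
    }

    /// Per-slot sweep: drop entries whose deadline or retention slot has passed.
    fn sweep(&mut self, current_slot: u64) {
        self.instances.retain(|_, inst| match inst {
            ManagedInstance::Running { deadline, .. } => *deadline >= current_slot,
            ManagedInstance::Finished { keep_until, .. } => *keep_until >= current_slot,
        });
    }
}
```

The instance task can exit as soon as the entry flips to `Finished`, since the sender is dropped at that point, and only the small decided value lingers for the retention window.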

diegomrsantos and others added 4 commits March 9, 2026 23:35
Adds test to verify QBFT Committee instances can reach late rounds
(9+) as configured with max_round=12.

The test creates a Committee instance, forces round changes by
keeping operators offline, then advances through multiple slots
while verifying the instance survives to reach round 10.

Currently fails - instance is cleaned up after 2 slots, reaching
round 9 but unable to complete it (needs 120s, gets 8s).

Replace slot-based cleanup with duty-specific beacon chain inclusion
deadlines. This allows QBFT instances to progress through all configured
rounds without premature removal.

Key changes:
- Separate instance identity from manager metadata using ManagedInstance wrapper
- Calculate duty-specific deadlines per EIP-7045 (attestations valid until end of epoch E+1)
- Add slots_per_epoch configuration parameter
- Implement dual-trigger cleaner (completion notification + deadline timeout)

Fixes instances being cleaned after 2 slots, now properly respecting
beacon chain inclusion windows (32-63 slots for attestations).

Improve test readability by applying Setup/Execute/Assert structure:

- Replace magic numbers with named constants (SINGLE_INSTANCE, TWO_INSTANCES, etc.)
- Add mandatory section comments (// SETUP, // EXECUTE, // ASSERT) to all new tests
- Split oversized test_role_based_deadline_calculations into 6 focused tests (one per role)
- Add descriptive assertion messages explaining what must be true
- Named all literals in new tests (OLD_CLEANUP_SLOT, BEACON_DEADLINE_SLOT, etc.)

All 23 tests pass (up from 18 due to role deadline test split).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Add detailed explanation for Committee/Aggregator deadline calculation:
- Document the calculation formula: (E+2) * slots_per_epoch - 1
- Explain that this represents the last slot for on-chain inclusion
- Reference EIP-7045 specification

Enhance ManagedInstance documentation:
- Convert to doc comment for better API documentation
- Clarify that it tracks both channel and beacon chain deadline
- Explain its role in the cleanup task

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@diegomrsantos diegomrsantos force-pushed the test/qbft-late-rounds branch from 739902c to 8347f06 Compare March 9, 2026 23:19
@diegomrsantos diegomrsantos marked this pull request as draft March 10, 2026 21:47
2 participants