Conversation

@cjen1-msft (Contributor)

We have CheckQuorum to ensure that a leader steps down if it is not a good leader, as this acts as a good liveness probe for the system at large.

Our CheckQuorum condition is that the leader has a committing quorum in any configuration.
Our commit condition is that the leader has a committing quorum in every configuration.
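
As a minimal sketch of the difference, assuming hypothetical names (Configuration, has_quorum, acked) rather than the actual raft.h types, the two conditions fold the same per-configuration quorum predicate with std::any_of versus std::all_of:

```cpp
// Illustrative only: hypothetical types, not the actual raft.h implementation.
#include <algorithm>
#include <set>
#include <string>
#include <vector>

using NodeId = std::string;
using Configuration = std::set<NodeId>;

// A configuration has a committing quorum if a majority of its members have
// recently acknowledged the leader.
static bool has_quorum(const Configuration& config, const std::set<NodeId>& acked)
{
  size_t acks = 0;
  for (const auto& node : config)
  {
    if (acked.count(node) > 0)
    {
      ++acks;
    }
  }
  return acks > config.size() / 2;
}

// Old CheckQuorum condition: a committing quorum in *any* active configuration.
static bool quorum_in_any(
  const std::vector<Configuration>& configs, const std::set<NodeId>& acked)
{
  return std::any_of(configs.begin(), configs.end(), [&](const Configuration& c) {
    return has_quorum(c, acked);
  });
}

// Commit condition, and the new CheckQuorum condition: a committing quorum in
// *every* active configuration.
static bool quorum_in_every(
  const std::vector<Configuration>& configs, const std::set<NodeId>& acked)
{
  return std::all_of(configs.begin(), configs.end(), [&](const Configuration& c) {
    return has_quorum(c, acked);
  });
}
```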

This PR includes a scenario (check_quorum_2) which demonstrates the difference between these two conditions.

Suppose we have a 3-node cluster with n0 initially the leader, and we then retire both n1 and n2 but do not replicate this reconfiguration to them.
Then we partition n0.

Under the current CheckQuorum condition, n0 still counts as a good leader and stays up, even though it is unable to commit, while the remainder of the cluster elects a replacement leader and continues to function.
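
Evaluated concretely for this scenario (again with hypothetical helper names, not the real driver or raft.h code), n0's two active configurations give different answers for the two conditions:

```cpp
// Hypothetical evaluation of the check_quorum_2 situation on the partitioned n0.
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

using NodeId = std::string;
using Configuration = std::set<NodeId>;

static bool has_quorum(const Configuration& config, const std::set<NodeId>& acked)
{
  size_t acks = 0;
  for (const auto& node : config)
  {
    if (acked.count(node) > 0)
    {
      ++acks;
    }
  }
  return acks > config.size() / 2;
}

int main()
{
  // n0 retired n1 and n2 but never replicated that entry to them, so both the
  // old and the new configuration are still active on n0.
  std::vector<Configuration> active = {{"n0", "n1", "n2"}, {"n0"}};

  // n0 is partitioned: the only node it hears from is itself.
  std::set<NodeId> acked = {"n0"};

  // Old CheckQuorum (any config): {n0} is trivially a quorum of {n0}, so n0
  // keeps believing it is a good leader.
  assert(std::any_of(active.begin(), active.end(), [&](const Configuration& c) {
    return has_quorum(c, acked);
  }));

  // Commit condition / new CheckQuorum (every config): no quorum in
  // {n0, n1, n2}, so n0 cannot commit and should step down.
  assert(!std::all_of(active.begin(), active.end(), [&](const Configuration& c) {
    return has_quorum(c, acked);
  }));
}
```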

The aim of the change to raft.h is to make clear what the condition is, as well as to fix this.

Note

The other direction of this bug would be more severe (going from n0 to n0,n1,n2; see check_quorum_3), with n0 being inaccessible but staying alive due to CheckQuorum.
But the backups are unable to elect themselves without a vote from n0, so this isn't a problem in practice.
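
A rough sketch of why (hypothetical names again, not CCF's election code): an election needs a vote quorum in every active configuration, and the old configuration {n0} can only be satisfied by n0 itself:

```cpp
// Hypothetical sketch of the check_quorum_3 direction: configurations go from
// {n0} to {n0, n1, n2} while n0 is inaccessible.
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

using NodeId = std::string;
using Configuration = std::set<NodeId>;

static bool vote_quorum(const Configuration& config, const std::set<NodeId>& votes)
{
  size_t count = 0;
  for (const auto& node : config)
  {
    if (votes.count(node) > 0)
    {
      ++count;
    }
  }
  return count > config.size() / 2;
}

int main()
{
  // Both configurations are active while the reconfiguration is in flight.
  std::vector<Configuration> active = {{"n0"}, {"n0", "n1", "n2"}};

  // With n0 inaccessible, a candidate backup such as n1 can gather at most
  // these votes.
  std::set<NodeId> votes = {"n1", "n2"};

  // Election requires a vote quorum in *every* active configuration, and only
  // n0 can satisfy the old configuration {n0}, so no backup can win; n0
  // staying leader under the old check therefore never competes with a
  // replacement leader.
  assert(!std::all_of(active.begin(), active.end(), [&](const Configuration& c) {
    return vote_quorum(c, votes);
  }));
}
```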

@cjen1-msft cjen1-msft marked this pull request as ready for review October 20, 2025 14:28
@cjen1-msft cjen1-msft requested a review from a team as a code owner October 20, 2025 14:28
Copilot AI review requested due to automatic review settings October 20, 2025 14:28

Copilot AI left a comment

Pull Request Overview

This PR addresses a bug in the CheckQuorum implementation where the leader's quorum check condition was inconsistent with the commit condition. The change ensures that a leader only remains active if it has a committing quorum in every active configuration, not just any configuration. This prevents scenarios where a leader believes it has quorum but is actually unable to commit entries.

Key Changes:

  • Modified CheckQuorum logic in raft.h to require quorum in all configurations instead of any configuration
  • Added test scenarios demonstrating the old incorrect behavior and validating the fix
  • Introduced new test assertions (assert_config, assert_missing_config) to verify configuration state in tests

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

  • src/consensus/aft/raft.h: Changed CheckQuorum from OR logic (quorum in any config) to AND logic (quorum in every config) using std::all_of
  • src/consensus/aft/test/driver.h: Added assert_config and assert_missing_config helper methods for validating node configurations in tests (see the sketch after this summary)
  • src/consensus/aft/test/driver.cpp: Registered the new assertion commands in the test driver command dispatcher
  • tests/raft_scenarios/check_quorum_0_012: Test scenario demonstrating transition from single node to three-node cluster with CheckQuorum behavior
  • tests/raft_scenarios/check_quorum_012_0: Test scenario demonstrating retirement of nodes while partitioned, verifying correct leader stepdown
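
For reference, the new driver assertions presumably compare a node's active configurations against an expected membership; a purely hypothetical sketch of that shape (the real driver.h helpers and types will differ) could look like:

```cpp
// Hypothetical sketch only: invented types and signatures for illustration,
// not the actual src/consensus/aft/test/driver.h code.
#include <cassert>
#include <map>
#include <set>
#include <string>

using NodeId = std::string;
using Configuration = std::set<NodeId>;
// Active configurations on a node, keyed by the index that introduced them.
using Configurations = std::map<size_t, Configuration>;

// Assert that the node holds a configuration introduced at `idx` with exactly
// the expected membership.
void assert_config(
  const Configurations& active, size_t idx, const Configuration& expected)
{
  auto it = active.find(idx);
  assert(it != active.end());
  assert(it->second == expected);
}

// Assert that no configuration introduced at `idx` is active on the node.
void assert_missing_config(const Configurations& active, size_t idx)
{
  assert(active.find(idx) == active.end());
}
```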

cjen1-msft and others added 2 commits October 20, 2025 15:39
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@cjen1-msft (Contributor, Author)

The reason that getting trace validation to work required relaxing the constraints on IsBecomeFollower is as follows:

Actions from raft.h trace:

  • receive append_entries_request (no change)
  • become follower (update term, follower, rollback uncommittable)
  • <Don't emit any event> Roll back to match incoming append entries
  • execute_append_entries_sync (apply changes)
  • send_append_entries

Required actions from Traceccfraft.tla's point of view:

  • IsReceiveAppendEntriesRequest (UpdateTerm & ExecuteAppendEntries => update term, follower, rollback uncommittable, apply changes)
  • IsBecomeFollower (no change)
  • IsExecuteAppendEntries (no change)
  • IsSendAppendEntriesResponse (no change)

So if IsBecomeFollower asserts that the spec state matches the raft.h state, and that it is unchanged, then IsReceiveAppendEntries's UpdateTerm will be in conflict with this.

Hence, to fix this specific issue, we should not roll back the log or increment the term until the message is sent.
But we could also argue that we shouldn't constrain the membershipState, as it may not be stable yet.

@cjen1-msft cjen1-msft self-assigned this Oct 22, 2025
@achamayou achamayou added run-long-test Run Long Test job run-long-verification Run Long Verification jobs labels Oct 24, 2025
@achamayou achamayou changed the title Demo incorrect check_quorum behaviour Update CheckQuorum condition from quorum in any config to quorum in every active config Oct 24, 2025
@achamayou (Member)

The change looks good to me, but I'm holding off until the long tests, including TV, have run.

@achamayou (Member) commented Oct 24, 2025

Passing verification runs: https://github.com/microsoft/CCF/actions/runs/18776933466 https://github.com/microsoft/CCF/actions/runs/18776933496, all looks good.

The license/cla check looks stuck again; I've kicked off another run with an update branch. If this is not in on Monday I will force-merge it.

@achamayou (Member) commented Oct 25, 2025

Failures all look like they come down to I/O:

2025-10-24 17:16:42.512 | ERROR    | {caps} infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/caps_connections_0/out:426: 2025-10-24T17:16:40.824513Z        3   [fail ] CCF/src/host/time_bound_logger.h:55  | Operation took too long (  1.546s): Committing snapshot - fsync(snapshot_432_433)
2025-10-24 17:16:42.512 | ERROR    | {caps} infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/caps_connections_0/out:427: 2025-10-24T17:16:42.353504Z        0   [fail ] CCF/src/host/time_bound_logger.h:55  | Operation took too long (  1.527s): Writing ledger entry - 315 bytes, committable=false
2025-10-24 17:16:42.512 | ERROR    | {caps} infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/caps_connections_0/out:428: 2025-10-24T17:16:42.353504Z        4   [fail ] CCF/src/host/time_bound_logger.h:55  | Operation took too long (  2.774s): Committing snapshot - fsync(snapshot_443_444)

Except for a redirect issue in the long test:

17:16:39.365 | ERROR    | infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/lts_compatibility_3/out:151: 2025-10-24T17:15:31.603892Z        0   [fail ] CCF/src/snapshots/fetch.h:336        | Error during snapshot fetch: Expected PERMANENT_REDIRECT response from GET https://127.206.55.31:45745/node/snapshot/snapshot_117_118.committed, instead received 400
17:16:39.365 | ERROR    | infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/lts_compatibility_3/out:153: 2025-10-24T17:15:32.619577Z        0   [fail ] CCF/src/snapshots/fetch.h:336        | Error during snapshot fetch: Expected PERMANENT_REDIRECT response from GET https://127.206.55.31:45745/node/snapshot/snapshot_117_118.committed, instead received 400
17:16:39.365 | ERROR    | infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/lts_compatibility_3/out:155: 2025-10-24T17:15:33.635189Z        0   [fail ] CCF/src/snapshots/fetch.h:336        | Error during snapshot fetch: Expected PERMANENT_REDIRECT response from GET https://127.206.55.31:45745/node/snapshot/snapshot_117_118.committed, instead received 400
17:16:39.366 | ERROR    | infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/lts_compatibility_3/out:281: 2025-10-24T17:15:47.784980Z        0   [fail ] CF/src/node/rpc/node_frontend.h:1720 | JWT key auto-refresh: request does not originate from primary

This could be related to the recently merged #7395? Unclear; it needs investigation.

@cjen1-msft (Contributor, Author)

Double-checking, I think this might be related, directly or indirectly, to long writes to disk.
lts_compat_0 is on 6.X, where the snapshot is logged as it is written; we can see a 1.5-2s stall before "New snapshot written to ...":

[Screenshot: node log showing a 1.5-2s stall before the "New snapshot written to ..." line]

I've opened a PR to add time bound logging around those write calls.

Regardless, at 15:26 the snapshot is requested, and the call fails with a 404.
All retries then fail with 400, which I haven't worked out yet; could this be due to the new redirect behaviour from a new node to an old primary? @eddyashton

We should probably add info logging around snapshot fetching if we get a repro.

@cjen1-msft cjen1-msft enabled auto-merge (squash) October 28, 2025 15:32
@achamayou achamayou removed the run-long-test Run Long Test job label Oct 29, 2025
@cjen1-msft cjen1-msft merged commit 47701f3 into microsoft:main Oct 29, 2025
24 checks passed
@cjen1-msft cjen1-msft deleted the demo-check-quorum-fail branch October 30, 2025 11:56
@cjen1-msft cjen1-msft added the 6.x-todo PRs which should be backported to 6.x label Nov 5, 2025
cjen1-msft added a commit to cjen1-msft/CCF that referenced this pull request Nov 5, 2025
…very active config (microsoft#7375)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Amaury Chamayou <amchamay@microsoft.com>
cjen1-msft added a commit to cjen1-msft/CCF that referenced this pull request Nov 5, 2025
…very active config (microsoft#7375)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Amaury Chamayou <amchamay@microsoft.com>
@eddyashton eddyashton added the backported This PR was successfully backported to LTS branch label Dec 3, 2025
@eddyashton (Member)

Adding backported label - this was backported in #7436.

@eddyashton eddyashton removed the run-long-verification Run Long Verification jobs label Dec 3, 2025