Conversation

@cjen1-msft (Contributor)

We have CheckQuorum to ensure that a leader steps down if it is not a good leader, as this acts as a good liveness probe for the system at large.

Our CheckQuorum condition is that the leader has a committing quorum in any configuration.
Our commit condition is that the leader has a committing quorum in every configuration.
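
As a minimal sketch of the difference, assuming hypothetical names (Configuration, has_quorum, acked) rather than the actual raft.h types, the two conditions fold the same per-configuration quorum predicate with std::any_of versus std::all_of:

```cpp
// Illustrative only: hypothetical types, not the actual raft.h implementation.
#include <algorithm>
#include <set>
#include <string>
#include <vector>

using NodeId = std::string;
using Configuration = std::set<NodeId>;

// A configuration has a committing quorum if a majority of its members have
// recently acknowledged the leader.
static bool has_quorum(const Configuration& config, const std::set<NodeId>& acked)
{
  size_t acks = 0;
  for (const auto& node : config)
  {
    if (acked.count(node) > 0)
    {
      ++acks;
    }
  }
  return acks > config.size() / 2;
}

// Old CheckQuorum condition: a committing quorum in *any* active configuration.
static bool quorum_in_any(
  const std::vector<Configuration>& configs, const std::set<NodeId>& acked)
{
  return std::any_of(configs.begin(), configs.end(), [&](const Configuration& c) {
    return has_quorum(c, acked);
  });
}

// Commit condition, and the new CheckQuorum condition: a committing quorum in
// *every* active configuration.
static bool quorum_in_every(
  const std::vector<Configuration>& configs, const std::set<NodeId>& acked)
{
  return std::all_of(configs.begin(), configs.end(), [&](const Configuration& c) {
    return has_quorum(c, acked);
  });
}
```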

This PR includes a scenario (check_quorum_2) which demonstrates the difference between these two conditions.

Suppose we have a 3-node cluster with n0 initially the leader, and we then retire both n1 and n2 but do not replicate this reconfiguration to them.
Then we partition n0.

Under the current CheckQuorum condition, n0 still counts as a good leader and stays up, even though it is unable to commit, while the remainder of the cluster elects a replacement leader and continues to function.
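
Evaluated concretely for this scenario (again with hypothetical helper names, not the real driver or raft.h code), n0's two active configurations give different answers for the two conditions:

```cpp
// Hypothetical evaluation of the check_quorum_2 situation on the partitioned n0.
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

using NodeId = std::string;
using Configuration = std::set<NodeId>;

static bool has_quorum(const Configuration& config, const std::set<NodeId>& acked)
{
  size_t acks = 0;
  for (const auto& node : config)
  {
    if (acked.count(node) > 0)
    {
      ++acks;
    }
  }
  return acks > config.size() / 2;
}

int main()
{
  // n0 retired n1 and n2 but never replicated that entry to them, so both the
  // old and the new configuration are still active on n0.
  std::vector<Configuration> active = {{"n0", "n1", "n2"}, {"n0"}};

  // n0 is partitioned: the only node it hears from is itself.
  std::set<NodeId> acked = {"n0"};

  // Old CheckQuorum (any config): {n0} is trivially a quorum of {n0}, so n0
  // keeps believing it is a good leader.
  assert(std::any_of(active.begin(), active.end(), [&](const Configuration& c) {
    return has_quorum(c, acked);
  }));

  // Commit condition / new CheckQuorum (every config): no quorum in
  // {n0, n1, n2}, so n0 cannot commit and should step down.
  assert(!std::all_of(active.begin(), active.end(), [&](const Configuration& c) {
    return has_quorum(c, acked);
  }));
}
```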

The aim of the change to raft.h is to make clear what the condition is, as well as to fix this.

Note

The other direction of this bug would be more severe (going from n0 to n0,n1,n2; see check_quorum_3), with n0 being inaccessible but staying alive due to CheckQuorum.
But the backups are unable to elect themselves without a vote from n0, so this isn't a problem in practice.
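
A rough sketch of why (hypothetical names again, not CCF's election code): an election needs a vote quorum in every active configuration, and the old configuration {n0} can only be satisfied by n0 itself:

```cpp
// Hypothetical sketch of the check_quorum_3 direction: configurations go from
// {n0} to {n0, n1, n2} while n0 is inaccessible.
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

using NodeId = std::string;
using Configuration = std::set<NodeId>;

static bool vote_quorum(const Configuration& config, const std::set<NodeId>& votes)
{
  size_t count = 0;
  for (const auto& node : config)
  {
    if (votes.count(node) > 0)
    {
      ++count;
    }
  }
  return count > config.size() / 2;
}

int main()
{
  // Both configurations are active while the reconfiguration is in flight.
  std::vector<Configuration> active = {{"n0"}, {"n0", "n1", "n2"}};

  // With n0 inaccessible, a candidate backup such as n1 can gather at most
  // these votes.
  std::set<NodeId> votes = {"n1", "n2"};

  // Election requires a vote quorum in *every* active configuration, and only
  // n0 can satisfy the old configuration {n0}, so no backup can win; n0
  // staying leader under the old check therefore never competes with a
  // replacement leader.
  assert(!std::all_of(active.begin(), active.end(), [&](const Configuration& c) {
    return vote_quorum(c, votes);
  }));
}
```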

@cjen1-msft cjen1-msft marked this pull request as ready for review October 20, 2025 14:28
@cjen1-msft cjen1-msft requested a review from a team as a code owner October 20, 2025 14:28
Copilot AI review requested due to automatic review settings October 20, 2025 14:28

Copilot AI left a comment

Pull Request Overview

This PR addresses a bug in the CheckQuorum implementation where the leader's quorum check condition was inconsistent with the commit condition. The change ensures that a leader only remains active if it has a committing quorum in every active configuration, not just any configuration. This prevents scenarios where a leader believes it has quorum but is actually unable to commit entries.

Key Changes:

  • Modified CheckQuorum logic in raft.h to require quorum in all configurations instead of any configuration
  • Added test scenarios demonstrating the old incorrect behavior and validating the fix
  • Introduced new test assertions (assert_config, assert_missing_config) to verify configuration state in tests

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

  • src/consensus/aft/raft.h: Changed CheckQuorum from OR logic (quorum in any config) to AND logic (quorum in every config) using std::all_of
  • src/consensus/aft/test/driver.h: Added assert_config and assert_missing_config helper methods for validating node configurations in tests (see the sketch after this summary)
  • src/consensus/aft/test/driver.cpp: Registered the new assertion commands in the test driver command dispatcher
  • tests/raft_scenarios/check_quorum_0_012: Test scenario demonstrating transition from single node to three-node cluster with CheckQuorum behavior
  • tests/raft_scenarios/check_quorum_012_0: Test scenario demonstrating retirement of nodes while partitioned, verifying correct leader stepdown
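
For reference, the new driver assertions presumably compare a node's active configurations against an expected membership; a purely hypothetical sketch of that shape (the real driver.h helpers and types will differ) could look like:

```cpp
// Hypothetical sketch only: invented types and signatures for illustration,
// not the actual src/consensus/aft/test/driver.h code.
#include <cassert>
#include <map>
#include <set>
#include <string>

using NodeId = std::string;
using Configuration = std::set<NodeId>;
// Active configurations on a node, keyed by the index that introduced them.
using Configurations = std::map<size_t, Configuration>;

// Assert that the node holds a configuration introduced at `idx` with exactly
// the expected membership.
void assert_config(
  const Configurations& active, size_t idx, const Configuration& expected)
{
  auto it = active.find(idx);
  assert(it != active.end());
  assert(it->second == expected);
}

// Assert that no configuration introduced at `idx` is active on the node.
void assert_missing_config(const Configurations& active, size_t idx)
{
  assert(active.find(idx) == active.end());
}
```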

cjen1-msft and others added 2 commits October 20, 2025 15:39
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@cjen1-msft (Contributor, Author)

The reason that getting trace validation to work required relaxing the constraints on IsBecomeFollower is as follows:

Actions from raft.h trace:

  • receive append_entries_request (no change)
  • become follower (update term, follower, rollback uncommittable)
  • <Don't emit any event> Roll back to match incoming append entries
  • execute_append_entries_sync (apply changes)
  • send_append_entries

Required actions from Traceccfraft.tla's point of view:

  • IsReceiveAppendEntriesRequest (UpdateTerm & ExecuteAppendEntries => update term, follower, rollback uncommittable, apply changes)
  • IsBecomeFollower (no change)
  • IsExecuteAppendEntries (no change)
  • IsSendAppendEntriesResponse (no change)

So if IsBecomeFollower asserts that the spec state matches the raft.h state, and that it is unchanged, then IsReceiveAppendEntries's UpdateTerm will be in conflict with this.

Hence, to fix this specific issue, we should not roll back the log or increment the term until the message is sent.
But we could also argue that we shouldn't constrain the membershipState, as it may not be stable yet.

@cjen1-msft cjen1-msft self-assigned this Oct 22, 2025
@achamayou achamayou added run-long-test Run Long Test job run-long-verification Run Long Verification jobs labels Oct 24, 2025
@achamayou achamayou changed the title Demo incorrect check_quorum behaviour Update CheckQuorum condition from quorum in any config to quorum in every active config Oct 24, 2025
@achamayou (Member)

The change looks good to me, but I'm holding off until the long tests, including TV, have run.

@achamayou (Member) commented Oct 24, 2025

Passing verification runs: https://github.com/microsoft/CCF/actions/runs/18776933466 https://github.com/microsoft/CCF/actions/runs/18776933496, all looks good.

The license/cla check looks stuck again; I've kicked off another run with an update branch. If this is not in on Monday I will force-merge it.

@achamayou (Member) commented Oct 25, 2025

Failures all look like they come down to I/O:

2025-10-24 17:16:42.512 | ERROR    | {caps} infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/caps_connections_0/out:426: 2025-10-24T17:16:40.824513Z        3   [fail ] CCF/src/host/time_bound_logger.h:55  | Operation took too long (  1.546s): Committing snapshot - fsync(snapshot_432_433)
2025-10-24 17:16:42.512 | ERROR    | {caps} infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/caps_connections_0/out:427: 2025-10-24T17:16:42.353504Z        0   [fail ] CCF/src/host/time_bound_logger.h:55  | Operation took too long (  1.527s): Writing ledger entry - 315 bytes, committable=false
2025-10-24 17:16:42.512 | ERROR    | {caps} infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/caps_connections_0/out:428: 2025-10-24T17:16:42.353504Z        4   [fail ] CCF/src/host/time_bound_logger.h:55  | Operation took too long (  2.774s): Committing snapshot - fsync(snapshot_443_444)

Except for a redirect issue in the long test:

17:16:39.365 | ERROR    | infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/lts_compatibility_3/out:151: 2025-10-24T17:15:31.603892Z        0   [fail ] CCF/src/snapshots/fetch.h:336        | Error during snapshot fetch: Expected PERMANENT_REDIRECT response from GET https://127.206.55.31:45745/node/snapshot/snapshot_117_118.committed, instead received 400
17:16:39.365 | ERROR    | infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/lts_compatibility_3/out:153: 2025-10-24T17:15:32.619577Z        0   [fail ] CCF/src/snapshots/fetch.h:336        | Error during snapshot fetch: Expected PERMANENT_REDIRECT response from GET https://127.206.55.31:45745/node/snapshot/snapshot_117_118.committed, instead received 400
17:16:39.365 | ERROR    | infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/lts_compatibility_3/out:155: 2025-10-24T17:15:33.635189Z        0   [fail ] CCF/src/snapshots/fetch.h:336        | Error during snapshot fetch: Expected PERMANENT_REDIRECT response from GET https://127.206.55.31:45745/node/snapshot/snapshot_117_118.committed, instead received 400
17:16:39.366 | ERROR    | infra.network:log_errors:148 - /__w/CCF/CCF/build/workspace/lts_compatibility_3/out:281: 2025-10-24T17:15:47.784980Z        0   [fail ] CF/src/node/rpc/node_frontend.h:1720 | JWT key auto-refresh: request does not originate from primary

This could be related to the recently merged #7395? Unclear; it needs investigation.

@cjen1-msft (Contributor, Author)

Double-checking, I think this might be related, directly or indirectly, to long writes to disk.
lts_compat_0 is on 6.X, where the snapshot is logged as it is written; we can see a 1.5-2s stall before "New snapshot written to ...":

[Screenshot: node log showing a 1.5-2s stall before the "New snapshot written to ..." line]

I've opened a PR to add time bound logging around those write calls.

Regardless, at 15:26 the snapshot is requested, and the call fails with a 404.
All retries then fail with 400, which I haven't worked out yet; could this be due to the new redirect behaviour from a new node to an old primary? @eddyashton

We should probably add info logging around snapshot fetching if we get a repro.

@cjen1-msft cjen1-msft enabled auto-merge (squash) October 28, 2025 15:32
@achamayou achamayou removed the run-long-test Run Long Test job label Oct 29, 2025
@cjen1-msft cjen1-msft merged commit 47701f3 into microsoft:main Oct 29, 2025
24 checks passed
@cjen1-msft cjen1-msft deleted the demo-check-quorum-fail branch October 30, 2025 11:56
@cjen1-msft cjen1-msft added the 6.x-todo PRs which should be backported to 6.x label Nov 5, 2025
cjen1-msft added a commit to cjen1-msft/CCF that referenced this pull request Nov 5, 2025
…very active config (microsoft#7375)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Amaury Chamayou <amchamay@microsoft.com>
cjen1-msft added a commit to cjen1-msft/CCF that referenced this pull request Nov 5, 2025
…very active config (microsoft#7375)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Amaury Chamayou <amchamay@microsoft.com>
@eddyashton eddyashton added the backported This PR was successfully backported to LTS branch label Dec 3, 2025
@eddyashton (Member)

Adding backported label - this was backported in #7436.

@eddyashton eddyashton removed the run-long-verification Run Long Verification jobs label Dec 3, 2025