Full Replication Logic Pt 2 #302

samliok · 2025-11-24T21:42:33Z

Summary

This PR finalizes (hopefully) the replication scheme by clearly distinguishing between replicating rounds and replicating sequences. It also refactors related components for clarity and reliability. Overall, the replication logic is now much more explicit, which should lead to fewer bugs (I found a few small ones while making the PR).

Replication Overview

In Simplex, when a node receives a notarization, emptyNotarization, or finalization for a round or sequence ahead of its current state, this indicates that the node is behind. The process of catching up (retrieving missing rounds or sequences) is called Replication.

The replicationState struct manages this process. Replication is triggered when:

- A notarization or emptyNotarization for a higher round is received, or
- A finalization for a future sequence arrives.

It can also be triggered when a lagging node sends an empty vote for an older round. In this case, the node that receives the vote sends its highest notarization and finalization, which may trigger replication on the lagging node.

The replicationState tracks missing rounds and sequences, resends requests as needed, and removes completed items. When a QuorumRound (notarization, empty notarization, or finalization) is received, it’s added to the relevant state. Finalizations always supersede earlier quorum rounds, allowing us to prune older states from both the rounds and sequence trackers.
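As a rough illustration of the pruning rule, here is a minimal Go sketch. The quorumRound and tracker types, their fields, and the add method are hypothetical simplifications, not the PR's actual replicationState. The sketch assumes that a finalization for round r makes every tracked round at or below r obsolete, and it does not model pruning of the sequence tracker.

```go
package main

// quorumRound is a hypothetical, simplified stand-in for the PR's QuorumRound.
type quorumRound struct {
	round     uint64
	seq       uint64
	finalized bool // true when the entry carries a finalization
}

// tracker is a hypothetical sketch of the round and sequence trackers.
type tracker struct {
	rounds map[uint64]quorumRound // notarizations and empty notarizations, keyed by round
	seqs   map[uint64]quorumRound // finalizations, keyed by sequence
}

func newTracker() *tracker {
	return &tracker{
		rounds: make(map[uint64]quorumRound),
		seqs:   make(map[uint64]quorumRound),
	}
}

// add stores qr. A finalization additionally drops every tracked round
// at or below its own round (assumption: those entries are superseded).
func (t *tracker) add(qr quorumRound) {
	if !qr.finalized {
		t.rounds[qr.round] = qr
		return
	}
	t.seqs[qr.seq] = qr
	for r := range t.rounds {
		if r <= qr.round {
			delete(t.rounds, r) // deleting during range is safe in Go
		}
	}
}
```

With rounds 3 and 5 tracked, adding a finalization for round 4 removes round 3 and leaves round 5 pending.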

processReplicationState advances rounds or sequences by checking for available quorum rounds. If the next sequence is complete, it’s committed; otherwise, the function checks for the lowest quorum round. Note that the lowest quorum round can be lower than the current round. This is because we may have advanced to the current round by replicating a "dead" chain, i.e. we may have received a notarization or empty notarization that other block builders decided not to build on, yet we accepted it as a valid replication response. Empty notarizations can sometimes be used to advance rounds via maybeAdvanceRoundFromEmptyNotarizations.
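The decision order described above can be sketched as follows. This is a simplified model under stated assumptions: the action type, the nextStep function, and its parameters are hypothetical and do not mirror the real processReplicationState signature.

```go
package main

// action is a hypothetical label for what the replication loop does next.
type action int

const (
	wait           action = iota // nothing available yet
	commitSequence               // the next sequence has a finalization
	processRound                 // fall back to the lowest available quorum round
)

// nextStep prefers committing nextSeq when a finalization for it is
// available; otherwise it picks the lowest pending round, which may be
// lower than the node's current round.
func nextStep(nextSeq uint64, finalizedSeqs map[uint64]bool, pendingRounds []uint64) (action, uint64) {
	if finalizedSeqs[nextSeq] {
		return commitSequence, nextSeq
	}
	if len(pendingRounds) == 0 {
		return wait, 0
	}
	lowest := pendingRounds[0]
	for _, r := range pendingRounds[1:] {
		if r < lowest {
			lowest = r
		}
	}
	return processRound, lowest
}
```

Committing is checked first because a finalization supersedes any notarization or empty notarization for earlier rounds.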

Key Changes

Introduced the requestor struct, which fetches quorum rounds from the network.
Generic & Simple TimeoutManager:
- As long as tasks exist in taskMap, the TimeoutHandler periodically sends them to the TaskRunner every runInterval.
Epoch Message Handling
- notarization and emptyNotarization messages can now be processed for previous, non-finalized rounds.
Quorum Round Validity
- Quorum rounds can now include both empty notarizations & notarizations.
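
The TimeoutManager item above could look roughly like the sketch below. It is a minimal model, not the code in timeout_handler.go: the timeoutHandler type, its fields, and the tick and run methods are hypothetical names, and the runner callback stands in for the TaskRunner.

```go
package main

import (
	"sync"
	"time"
)

// timeoutHandler is a hypothetical sketch: as long as an ID is in tasks,
// it is handed to runner on every tick.
type timeoutHandler[T comparable] struct {
	lock   sync.Mutex
	tasks  map[T]struct{}
	runner func(ids []T)
}

func newTimeoutHandler[T comparable](runner func(ids []T)) *timeoutHandler[T] {
	return &timeoutHandler[T]{tasks: make(map[T]struct{}), runner: runner}
}

func (t *timeoutHandler[T]) AddTask(id T) {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.tasks[id] = struct{}{}
}

func (t *timeoutHandler[T]) RemoveTask(id T) {
	t.lock.Lock()
	defer t.lock.Unlock()
	delete(t.tasks, id)
}

// tick snapshots the pending IDs under the lock and calls runner outside it.
func (t *timeoutHandler[T]) tick() {
	t.lock.Lock()
	ids := make([]T, 0, len(t.tasks))
	for id := range t.tasks {
		ids = append(ids, id)
	}
	t.lock.Unlock()
	if len(ids) > 0 {
		t.runner(ids)
	}
}

// run calls tick every runInterval until stop is closed.
func (t *timeoutHandler[T]) run(runInterval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(runInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			t.tick()
		case <-stop:
			return
		}
	}
}
```

Separating tick from run keeps the periodic behaviour testable without sleeping: a test can call tick directly and assert on what the runner received.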

TODOs

We need a message type to specifically request Digests from peers. This is important for replication, and also for the case where we receive a notarization for our current round but have yet to receive the block.

yacovm

Made a pass, will make another pass afterwards.

blacklist.go

requestor.go

replication_state.go

yacovm · 2025-11-28T23:16:06Z

epoch.go

+
+// send notarization or finalization for this round as well
 func (e *Epoch) maybeSendNotarizationOrFinalization(to NodeID, round uint64) {
 	r, ok := e.rounds[round]


what if the round is finalized but in the storage and not in memory?

I think it's fine to not fetch from storage here. If the node needs the finalization, it will get it from the replication path.

Plus, we would need to traverse the storage for the round, since we can only fetch by sequence number.

yacovm · 2025-11-28T23:21:39Z

epoch.go

+	blockDependency, missingRounds := e.blockDependencies(block.BlockHeader())
+	// because its finalized we don't care about empty rounds
+	if blockDependency != nil {
+		e.Logger.Error(


Can't we somehow call processFinalizedBlock on out of order blocks and reach here while we don't have the parent block?

I don't think so. We only call processFinalizedBlock with nextSeqToCommit, and if we have a finalization for both nextSeqToCommit.prev and nextSeqToCommit, then we know they both must be valid.

yacovm · 2025-11-28T23:25:58Z

epoch.go

 			return nil
 		}

+		// if we haven't timed out on the round, send a finalized vote message


why do we need to do this? I don't think this is mandatory. We re-broadcast dropped finalized votes if we have a tail of notarized blocks that's not finalized. We will eventually re-broadcast this after we finish replication, if it's needed.

epoch.go

Signed-off-by: Sam Liokumovich <65994425+samliok@users.noreply.github.com>

testutil/util.go

yacovm · 2025-12-02T15:29:09Z

testutil/util.go


 		// if we are expected to time out for this round, we should not have a notarization
-		require.False(t, e.WAL.(*TestWAL).ContainsNotarization(startRound))
+		// TODO: this line is breaking TestSimplexMultiNodeBlacklist but i do not know why yet


You mean a flake?

yep it was a flake, but this todo is old.

timeout_handler.go

yacovm · 2025-12-02T16:01:08Z

timeout_handler.go

-	for _, task := range tasks {
-		f(task)
+	for id := range t.tasks {
+		if t.shouldRemove(id, task) {


Using this shouldRemove is a bit vague and non standard.

Why not have it be a predicate on a specific task and just pass in a function?

It's much cleaner this way:

func (t *TimeoutHandler[T]) RemoveOldTasks(shouldRemove func(id T, _ struct{}) bool) {
	t.lock.Lock()
	defer t.lock.Unlock()
	maps.DeleteFunc(t.tasks, shouldRemove)
}

And it's clear in the caller what we're trying to do:

r.emptyRoundTimeouts.RemoveOldTasks(func(r uint64, _ struct{}) bool {
	return r <= finalizedRound
})

As opposed to:

func shouldDelete(value, target uint64) bool { return value <= target }

which lacks context and it's not clear what value and target are supposed to convey.

i like this 💯. Added
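
For reference, the suggestion can be turned into a small self-contained sketch. The TimeoutHandler struct below is a stripped-down assumption (only a lock and a tasks map, with a hypothetical AddTask helper), not the full type from timeout_handler.go; maps.DeleteFunc is the standard-library function available since Go 1.21.

```go
package main

import (
	"maps"
	"sync"
)

// TimeoutHandler is a stripped-down, hypothetical version of the real type.
type TimeoutHandler[T comparable] struct {
	lock  sync.Mutex
	tasks map[T]struct{}
}

// AddTask is a hypothetical helper so the sketch can be exercised.
func (t *TimeoutHandler[T]) AddTask(id T) {
	t.lock.Lock()
	defer t.lock.Unlock()
	if t.tasks == nil {
		t.tasks = make(map[T]struct{})
	}
	t.tasks[id] = struct{}{}
}

// RemoveOldTasks deletes every task for which shouldRemove returns true.
func (t *TimeoutHandler[T]) RemoveOldTasks(shouldRemove func(id T, _ struct{}) bool) {
	t.lock.Lock()
	defer t.lock.Unlock()
	maps.DeleteFunc(t.tasks, shouldRemove)
}
```

The caller supplies the predicate, so the pruning rule (for example, "round is at or below the finalized round") is visible at the call site rather than hidden inside the handler.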

epoch.go

yacovm · 2025-12-02T17:07:32Z

epoch.go


 func (e *Epoch) handleFinalizationForPendingOrFutureRound(message *Finalization, round uint64, nextSeqToCommit uint64) {
-	if round == e.round {
+	if round <= e.round {


I don't have a real objection against this change, but can you tell what made you do this? Was there a flake you encountered or a test that fails? I ran the tests and they pass when reverting the change.

Asking because the only way we're still verifying a proposal for a past round is if we've advanced a round because we received an empty notarization, but since we received a finalization, this is a contradiction.

but we can still have a round that has an empty notarization & finalization? So we may have advanced from the empty notarization and only start processing the block & finalization at a later round.

but we can still have a round that has an empty notarization & finalization?

We may have it only due to the fact that we finalize each block recursively.

But if a round was timed out by f+1 correct nodes and an empty notarization was produced, this round is blocked from being finalized.

We may later notarize and finalize a more advanced round, and then we will recursively collect a finalization.
However, if that happens, it means we're behind, and we will replicate because we should receive the finalization with the higher round, no?

yacovm · 2025-12-02T19:28:16Z

epoch.go

 	}

-	e.increaseRound()
+	if notarization.Vote.Round == e.round && r.finalization == nil {


When I comment out this change and run the tests, they pass.

Is it possible to make a test that fails without this change?

We update the finalization of a round when we receive it or assemble it ourselves or when we replicate it or when we recover from a crash.

If we receive the finalization for e.round at an earlier round without the block, and then receive the block eventually when we're at e.round, we will advance to the next round via loading this as a future message.

If we replicate a block we either replicate it with a notarization or a finalization and then we should be able to advance the round. Why do we need the finalization to be nil?

I don't think we need the finalization check here actually, but there are tests that flake if this is uncommented. This is because we can now call persistNotarization with notarizations from previous rounds (either during replication or via handleNotarizationMessage). We need to increase the round only if the notarization is for the current round; otherwise we call increaseRound twice. I'll make a simple test for this.

added the test, and found a small bug
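
The double-increment concern can be shown with a toy model. The epoch struct and persistNotarization method below are hypothetical simplifications: they keep only a round counter and a set of notarized rounds, and they omit the r.finalization == nil part of the condition discussed above.

```go
package main

// epoch is a toy model holding only what the guard needs.
type epoch struct {
	round     uint64
	notarized map[uint64]bool
}

// persistNotarization records the notarization and advances the round
// only when the notarization is for the current round. Notarizations
// for earlier rounds are stored without touching the round counter.
func (e *epoch) persistNotarization(round uint64) {
	e.notarized[round] = true
	if round == e.round {
		e.round++
	}
}
```

Without the equality check, a late notarization for round 3 arriving while the node is in round 5 would move it to round 6 without a quorum for round 5.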

yacovm · 2025-12-02T19:52:05Z

epoch.go

 	if !found {
 		// should never happen since we check this when we verify the proposal metadata
-		e.Logger.Error("Could not find predecessor block for proposal scheduling",
+		e.Logger.Info("Could not find predecessor block for proposal scheduling",


why are we changing that from error to info? Do the tests fail without this?

because processNotarizeBlock may actually hit this code path.
Ex. In round 2 we receive an empty round during replication instead of a notarized one, and then process round 3 which relies on the notarized round for 2. We don't have the predecessor block which is why we can end up here.

yacovm · 2025-12-02T20:37:06Z

epoch.go

+		}
 	}

 	seqs := req.Seqs


can we check that we don't have more than e.maxRoundWindow sequences? Otherwise it's a DoS if we get a million sequences to retrieve.

yep, will do the same with rounds. we may want to use a smaller value because sending e.maxRoundWindow blocks + finalizations in one message could get big. added
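
A bound of this kind might be sketched as below. The validateRequest function and its maxItems parameter are hypothetical; the thread only establishes that both sequences and rounds get a cap, possibly smaller than e.maxRoundWindow.

```go
package main

import "fmt"

// validateRequest rejects replication requests that ask for more than
// maxItems sequences or more than maxItems rounds, so a single message
// cannot force the responder to load an unbounded number of blocks.
func validateRequest(seqs, rounds []uint64, maxItems int) error {
	if len(seqs) > maxItems {
		return fmt.Errorf("too many sequences requested: %d > %d", len(seqs), maxItems)
	}
	if len(rounds) > maxItems {
		return fmt.Errorf("too many rounds requested: %d > %d", len(rounds), maxItems)
	}
	return nil
}
```

Checking the lengths before any storage lookup keeps the cost of rejecting an oversized request constant.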

yacovm · 2025-12-02T20:37:53Z

epoch.go

 	}
 }

+// if this round is storage, we do not need to retrieve it from storage


if this round is storage, we do not need to retrieve it from storage

What does this mean? I can't parse this

honestly same, I don't think its relevant anymore. deleted.

yacovm · 2025-12-02T20:42:53Z

epoch.go


-		if err := e.verifyQuorumRound(data, from); err != nil {
-			e.Logger.Debug("Received invalid quorum round", zap.Uint64("seq", data.GetSequence()), zap.Stringer("from", from))
+		// TODO: if empty notarizations occur for long periods, we may receive a nextSeqToCommit that has a round considered too far ahead.


this may happen just because we're simply really far behind, no?

Yep! If we fall behind by many empty rounds, the round number will have increased a lot (compared to our round number), but the sequence number may only be a few away.

Ex. We are in round 1, seq 1 and disconnect. We may end up reconnecting at round 100, seq 3. In this case we will block seq 3 from being processed, so this TODO just notes we may want to change this threshold. What are your thoughts?

My thoughts are that we just may be really far behind and the data.GetSequence() != nextSeqToCommit is simply because we're really behind, we get a notarized block, but it has nothing to do with empty notarizations, so the comment is misleading.

updated the comment

replicate chains and refactor replication state

0cc2f78

samliok force-pushed the cr branch from f689d3c to 0cc2f78 Compare November 24, 2025 21:43

samliok marked this pull request as draft November 24, 2025 21:44

samliok added 3 commits November 24, 2025 18:12

breaking MultiNodeBlacklist Test

b8552f3

failove test uncomment

af999d3

dont skip replicationafternodedisconnects

24c6462

samliok marked this pull request as ready for review November 24, 2025 23:25

samliok added 3 commits November 25, 2025 07:19

remove cancel since we may not have updated our epoch round yet

ee838ad

uncomment so we can check ci but test is still flaking

285d2e0

blacklist flake fix

e75288b

yacovm reviewed Nov 28, 2025

samliok and others added 13 commits December 1, 2025 14:20

Merge branch 'main' into cr

cfc356e

Signed-off-by: Sam Liokumovich <65994425+samliok@users.noreply.github.com>

merge conflicts

a1129d2

cleanup old finalized tasks, nits and clarifications from review

f17e136

add finalized check in process

61c8191

nits

f526717

nits

5a8248e

don't vote

27f3b8e

don't vote pt 21

3ad0158

flake

c2d8ef8

flake

217f04b

send segments helper

98edf52

remove digests from map

6a02896

add mixing comment

78f92c1

yacovm reviewed Dec 2, 2025

yacovm reviewed Dec 2, 2025

samliok added 3 commits December 2, 2025 17:04

naming, println, and a few nits from review

8ab3735

simplify timeouthandler and remove the should deleteFunc

fb5dfb0

dos check

5b34b96

samliok added 4 commits December 2, 2025 17:51

add tests to ensure we don't double increment

943dcd7

old todo

519bef6

revert <= change

c60bfc7

update comment

f672a6c

samliok mentioned this pull request Dec 2, 2025

Block replication messages if not replicating #305

Open
