Skip to content

Conversation

@samliok
Copy link
Collaborator

@samliok samliok commented Nov 24, 2025

Summary

This PR finalizes(hopefully) the replication scheme by clearly distinguishing between replicating rounds and replicating sequences. It also refactors related components for clarity and reliability. Overall, the replication logic is much more explicit which will hopefully lead to less bugs(a few small ones I have found while making the pr)

Replication Overview

In Simplex, when a node receives a notarization, emptyNotarization, or finalization for a round or sequence ahead of its current state, it indicates the node is behind. The process of catching up(retrieving missing rounds or sequences) is called Replication.

The replicationState struct manages this process. Replication is triggered when:

  • A notarization or emptyNotarization for a higher round is received, or
  • A finalization for a future sequence arrives.
  • It can also be triggered when a lagging node sends an empty vote for an older round. In this case, the normal node will send its highest notarization and finalization to potentially trigger replication for the lagging node.

The replicationState tracks missing rounds and sequences, resends requests as needed, and removes completed items. When a QuorumRound (notarization, empty notarization, or finalization) is received, it’s added to the relevant state. Finalizations always supersede earlier quorum rounds, allowing us to prune older states from both the rounds and sequence trackers.

processReplicationState advances rounds or sequences by checking for available quorum rounds. If the next sequence is complete, it’s committed; otherwise, the function checks for the lowest quorum round. Note that the lowest quorum round can be lower than the current round. This is because we may have advanced to the current round through replication through a "dead" chain. I.e. we may have received a notarization/empty notarization that other block builders decided not to build off of, yet we received it as a valid replication response. Empty notarizations can sometimes be inferred to advance rounds via maybeAdvanceRoundFromEmptyNotarizations.

Key Changes

  • Introduced requestor struct, which fetches quorum rounds up from the network.
  • Generic & Simple TimeoutManager:
    • As long as tasks exist in taskMap, the TimeoutHandler periodically sends them to the TaskRunner every runInterval.
  • Epoch Message Handling
    • notarization and emptyNotarization messages are now able to be processed for previous, non-finalized rounds.
  • Quorum Round Validity
    • Quorum rounds can now include both empty notarizations & notarizations.

TODOS

  • We need a message type to specifically request Digests from peers. This is important for replication but also the case where we receive a notarization for our current round but are yet to receive the block.

@samliok samliok marked this pull request as ready for review November 24, 2025 23:25
Copy link
Collaborator

@yacovm yacovm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a pass, will make another pass afterwards.


// send notarization or finalization for this round as well
func (e *Epoch) maybeSendNotarizationOrFinalization(to NodeID, round uint64) {
r, ok := e.rounds[round]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the round is finalized but in the storage and not in memory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think its fine to not fetch from storage here. If the node needs the finalization, it will get it from the replication path.

Plus we would need to traverse the storage for the round since we can only fetch by sequence numbers

blockDependency, missingRounds := e.blockDependencies(block.BlockHeader())
// because its finalized we don't care about empty rounds
if blockDependency != nil {
e.Logger.Error(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we somehow call processFinalizedBlock on out of order blocks and reach here while we don't have the parent block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think so. We only call processFinalizedBlock with the nextSeqToCommit and if we have a finalization for nextSeqToCommit.prev & nextSeqToCommit then we know they both must be valid.

epoch.go Outdated
return nil
}

// if we haven't timed out on the round, send a finalized vote message
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to do this? I don't think this is mandatory. We re-broadcast dropped finalized votes if we have a tail of notarized blocks that's not finalized. We will eventually re-broadcast this after we finish replication, if it's needed.

testutil/util.go Outdated

// if we are expected to time out for this round, we should not have a notarization
require.False(t, e.WAL.(*TestWAL).ContainsNotarization(startRound))
// TODO: this line is breaking TestSimplexMultiNodeBlacklist but i do not know why yet
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean a flake?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep it was a flake, but this todo is old.

for _, task := range tasks {
f(task)
for id := range t.tasks {
if t.shouldRemove(id, task) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this shouldRemove is a bit vague and non standard.

Why not have it be a predicate on a specific task and just pass in a function?

It's much cleaner this way:

func (t *TimeoutHandler[T]) RemoveOldTasks(shouldRemove func(id T, _ struct{}) bool) {
	t.lock.Lock()
	defer t.lock.Unlock()
	
	maps.DeleteFunc(t.tasks, shouldRemove)
}

And it's clear in the caller what we're trying to do:

	r.emptyRoundTimeouts.RemoveOldTasks(func(r uint64, _ struct{}) bool {
		return r <= finalizedRound
	})

As opposed to:

func shouldDelete(value, target uint64) bool {
	return value <= target
}

which lacks context and it's not clear what value and target are supposed to convey.

Copy link
Collaborator Author

@samliok samliok Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like this 💯. Added

epoch.go Outdated

func (e *Epoch) handleFinalizationForPendingOrFutureRound(message *Finalization, round uint64, nextSeqToCommit uint64) {
if round == e.round {
if round <= e.round {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a real objection against this change, but can you tell what made you do this? Was there a flake you encountered or a test that fails? I ran the tests and they pass when reverting the change.

Asking because the only way we're still verifying a proposal for a past round is if we've advanced a round because we received an empty notarization, but since we received a finalization, this is a contradiction.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we can still have a round that has an empty notarization & finalization? So we may have advanced from the empty notarization and only start processing the block & finalization at a later round.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we can still have a round that has an empty notarization & finalization?

We may have it only due to the fact that we finalize each block recursively.

But if a round was timed out by f+1 correct nodes and an empty notarization was produced, this round is blocked from being finalized.

We may later notarized and finalize a more advanced round and then we will recursively collect a finalization.
However if that happens, it means we're behind, and we will replicate because we should receive the finalization with the higher round, no?

}

e.increaseRound()
if notarization.Vote.Round == e.round && r.finalization == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I comment out this change and run the tests, they pass.

Is it possible to make a test that fails without this change?

We update the finalization of a round when we receive it or assemble it ourselves or when we replicate it or when we recover from a crash.

If we receive the finalization for e.round at an earlier round without the block, and then receive the block eventually when we're at e.round, we will advance to the next round via loading this as a future message.

If we replicate a block we either replicate it with a notarization or a finalization and then we should be able to advance the round. Why do we need the finalization to be nil?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the finalization check here actually, but there are tests that flake if this is uncommented. This is because we can call persistNotarization with notarizations from previous rounds now(either during replication, or via handleNotarizationMessage). We need to only increase the round if its for the current round otherwise we call increaseRound twice. I'll make a simple test for this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the test, and found a small bug

if !found {
// should never happen since we check this when we verify the proposal metadata
e.Logger.Error("Could not find predecessor block for proposal scheduling",
e.Logger.Info("Could not find predecessor block for proposal scheduling",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we changing that from error to info? Do the tests fail without this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because processNotarizeBlock may actually hit this code path.
Ex. In round 2 we receive an empty round during replication instead of a notarized one, and then process round 3 which relies on the notarized round for 2. We don't have the predecessor block which is why we can end up here.

}
}

seqs := req.Seqs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we check that we don't have more than e.maxRoundWindow sequences? Otherwise it's a DoS if we get a million sequences to retrieve.

Copy link
Collaborator Author

@samliok samliok Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, will do the same with rounds. we may want to use a smaller value because sending e.maxRoundWindow blocks + finalizations in one message could get big. added

epoch.go Outdated
}
}

// if this round is storage, we do not need to retrieve it from storage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this round is storage, we do not need to retrieve it from storage

What does this mean? I can't parse this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

honestly same, I don't think its relevant anymore. deleted.

epoch.go Outdated

if err := e.verifyQuorumRound(data, from); err != nil {
e.Logger.Debug("Received invalid quorum round", zap.Uint64("seq", data.GetSequence()), zap.Stringer("from", from))
// TODO: if empty notarizations occur for long periods, we may receive a nextSeqToCommit that has a round considered too far ahead.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may happen just because we're simply really far behind, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep! if we fall behind many empty rounds, the round number would have increased a lot(compared to our round number) but the sequence number may only be a few away.

Ex. We are in round 1, seq 1 and disconnect. we may end up reconnecting on round 100 seq 3. In this case we will block seq 3 from being processed, so this TODO just notes we may want to change this threshold. What are your thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts are that we just may be really far behind and the data.GetSequence() != nextSeqToCommit is simply because we're really behind, we get a notarized block, but it has nothing to do with empty notarizations, so the comment is misleading.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants