Full Replication Logic Pt 2 #302
base: main
Conversation
yacovm left a comment
Made a pass, will make another pass afterwards.
// send notarization or finalization for this round as well
func (e *Epoch) maybeSendNotarizationOrFinalization(to NodeID, round uint64) {
	r, ok := e.rounds[round]
what if the round is finalized but is in storage and not in memory?
I think it's fine to not fetch from storage here. If the node needs the finalization, it will get it from the replication path.
Plus, we would need to traverse the storage for the round, since we can only fetch by sequence numbers.
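For reference, a minimal sketch of the in-memory-only path being described; field and helper names beyond those shown in the diff (r.finalization, r.notarization, sendFinalization, sendNotarization) are assumptions, not the exact code in this PR:

// maybeSendNotarizationOrFinalization, sketched: only in-memory rounds are
// consulted. A round that was already finalized and persisted to storage is
// skipped here; the lagging node obtains it through the replication path.
func (e *Epoch) maybeSendNotarizationOrFinalization(to NodeID, round uint64) {
	r, ok := e.rounds[round]
	if !ok {
		// Not in memory (possibly already committed to storage); do nothing.
		return
	}
	if r.finalization != nil {
		e.sendFinalization(to, r.finalization) // hypothetical send helper
		return
	}
	if r.notarization != nil {
		e.sendNotarization(to, r.notarization) // hypothetical send helper
	}
}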
blockDependency, missingRounds := e.blockDependencies(block.BlockHeader())
// because its finalized we don't care about empty rounds
if blockDependency != nil {
	e.Logger.Error(
Can't we somehow call processFinalizedBlock on out-of-order blocks and reach here while we don't have the parent block?
I don't think so. We only call processFinalizedBlock with the nextSeqToCommit, and if we have a finalization for nextSeqToCommit.prev & nextSeqToCommit, then we know they both must be valid.
epoch.go (Outdated)
	return nil
}

// if we haven't timed out on the round, send a finalized vote message
why do we need to do this? I don't think this is mandatory. We re-broadcast dropped finalized votes if we have a tail of notarized blocks that's not finalized. We will eventually re-broadcast this after we finish replication, if it's needed.
Signed-off-by: Sam Liokumovich <65994425+samliok@users.noreply.github.com>
testutil/util.go (Outdated)
// if we are expected to time out for this round, we should not have a notarization
require.False(t, e.WAL.(*TestWAL).ContainsNotarization(startRound))
// TODO: this line is breaking TestSimplexMultiNodeBlacklist but i do not know why yet
You mean a flake?
yep, it was a flake, but this TODO is old.
timeout_handler.go (Outdated)
for _, task := range tasks {
	f(task)
for id := range t.tasks {
	if t.shouldRemove(id, task) {
Using this shouldRemove is a bit vague and non-standard.
Why not have it be a predicate on a specific task and just pass in a function?
It's much cleaner this way:
func (t *TimeoutHandler[T]) RemoveOldTasks(shouldRemove func(id T, _ struct{}) bool) {
	t.lock.Lock()
	defer t.lock.Unlock()
	maps.DeleteFunc(t.tasks, shouldRemove)
}
And it's clear in the caller what we're trying to do:
r.emptyRoundTimeouts.RemoveOldTasks(func(r uint64, _ struct{}) bool {
	return r <= finalizedRound
})
As opposed to:
func shouldDelete(value, target uint64) bool {
	return value <= target
}
which lacks context, and it's not clear what value and target are supposed to convey.
I like this 💯. Added.
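For completeness, a self-contained sketch of the predicate-based approach that was adopted; package, type, and field names here are illustrative, not the exact ones in timeout_handler.go:

package timeouts

import (
	"maps"
	"sync"
)

// TimeoutHandler tracks pending timeout tasks keyed by a comparable ID.
type TimeoutHandler[T comparable] struct {
	lock  sync.Mutex
	tasks map[T]struct{}
}

// RemoveOldTasks deletes every pending task for which shouldRemove returns true.
func (t *TimeoutHandler[T]) RemoveOldTasks(shouldRemove func(id T, _ struct{}) bool) {
	t.lock.Lock()
	defer t.lock.Unlock()
	maps.DeleteFunc(t.tasks, shouldRemove)
}

A caller can then express the intent directly, e.g. dropping all empty-round timeouts at or below the finalized round:

emptyRoundTimeouts.RemoveOldTasks(func(r uint64, _ struct{}) bool {
	return r <= finalizedRound
})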
epoch.go (Outdated)
func (e *Epoch) handleFinalizationForPendingOrFutureRound(message *Finalization, round uint64, nextSeqToCommit uint64) {
-	if round == e.round {
+	if round <= e.round {
I don't have a real objection against this change, but can you tell me what made you do this? Was there a flake you encountered, or a test that fails? I ran the tests and they pass when reverting the change.
Asking because the only way we're still verifying a proposal for a past round is if we've advanced a round because we received an empty notarization; but since we received a finalization, this is a contradiction.
but we can still have a round that has an empty notarization & finalization? So we may have advanced from the empty notarization and only start processing the block & finalization at a later round.
> but we can still have a round that has an empty notarization & finalization?

We may have it only due to the fact that we finalize each block recursively.
But if a round was timed out by f+1 correct nodes and an empty notarization was produced, this round is blocked from being finalized.
We may later notarize and finalize a more advanced round, and then we will recursively collect a finalization.
However, if that happens, it means we're behind, and we will replicate because we should receive the finalization with the higher round, no?
}

e.increaseRound()
if notarization.Vote.Round == e.round && r.finalization == nil {
When I comment out this change and run the tests, they pass.
Is it possible to make a test that fails without this change?
We update the finalization of a round when we receive it, assemble it ourselves, replicate it, or recover from a crash.
If we receive the finalization for e.round at an earlier round without the block, and then receive the block eventually when we're at e.round, we will advance to the next round via loading this as a future message.
If we replicate a block we either replicate it with a notarization or a finalization and then we should be able to advance the round. Why do we need the finalization to be nil?
I don't think we need the finalization check here actually, but there are tests that flake if this change is reverted. This is because we can call persistNotarization with notarizations from previous rounds now (either during replication, or via handleNotarizationMessage). We need to only increase the round if it's for the current round, otherwise we call increaseRound twice. I'll make a simple test for this.
added the test, and found a small bug
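Roughly, the guard being described (a sketch of the relevant lines, not the full persistNotarization body):

// The notarization may be for an older round, e.g. one obtained through
// replication, in which case we have already advanced past it; only advance
// the round when the notarization is for the round we are currently in.
if notarization.Vote.Round == e.round {
	e.increaseRound()
}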
if !found {
	// should never happen since we check this when we verify the proposal metadata
-	e.Logger.Error("Could not find predecessor block for proposal scheduling",
+	e.Logger.Info("Could not find predecessor block for proposal scheduling",
why are we changing that from error to info? Do the tests fail without this?
Because processNotarizeBlock may actually hit this code path.
For example: in round 2 we receive an empty round during replication instead of a notarized one, and then process round 3, which relies on the notarized round 2. We don't have the predecessor block, which is why we can end up here.
	}
}

seqs := req.Seqs
can we check that we don't have more than e.maxRoundWindow sequences? Otherwise it's a DoS if we get a million sequences to retrieve.
Yep, will do the same with rounds. We may want to use a smaller value, because sending e.maxRoundWindow blocks + finalizations in one message could get big. Added.
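A minimal sketch of the kind of bound discussed, assuming the limit is e.maxRoundWindow (the actual constant may be smaller, as noted above, and the exact handling of an oversized request is an assumption):

// Reject replication requests that ask for too many sequences at once, so a
// single message cannot force us to read and ship an unbounded number of
// blocks and finalizations.
if len(req.Seqs) > int(e.maxRoundWindow) {
	e.Logger.Debug("Replication request asks for too many sequences",
		zap.Int("numSeqs", len(req.Seqs)), zap.Stringer("from", from))
	return
}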
epoch.go (Outdated)
	}
}

// if this round is storage, we do not need to retrieve it from storage
> if this round is storage, we do not need to retrieve it from storage

What does this mean? I can't parse this.
Honestly, same. I don't think it's relevant anymore. Deleted.
epoch.go (Outdated)
if err := e.verifyQuorumRound(data, from); err != nil {
	e.Logger.Debug("Received invalid quorum round", zap.Uint64("seq", data.GetSequence()), zap.Stringer("from", from))
// TODO: if empty notarizations occur for long periods, we may receive a nextSeqToCommit that has a round considered too far ahead.
this may happen just because we're simply really far behind, no?
Yep! If we fall behind many empty rounds, the round number would have increased a lot (compared to our round number), but the sequence number may only be a few away.
Ex. We are in round 1, seq 1, and disconnect. We may end up reconnecting at round 100, seq 3. In this case we will block seq 3 from being processed, so this TODO just notes we may want to change this threshold. What are your thoughts?
My thoughts are that we may just be really far behind, and the data.GetSequence() != nextSeqToCommit is simply because we're really behind and we get a notarized block; it has nothing to do with empty notarizations, so the comment is misleading.
updated the comment
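For context, the kind of "too far ahead" guard under discussion might look like this; it is only a sketch under assumed names (GetRound and the bound are hypothetical), not the actual check in this PR:

// Sketch: a quorum round gets rejected based on its round number. After many
// empty rounds, the round gap can be huge even though the sequence gap (what
// we actually need to commit, e.g. seq 3 while at seq 1) is small, which is
// what the discussion above points at.
if data.GetRound() > e.round+e.maxRoundWindow {
	return // considered too far ahead, even if its sequence is nextSeqToCommit
}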
Summary
This PR finalizes (hopefully) the replication scheme by clearly distinguishing between replicating rounds and replicating sequences. It also refactors related components for clarity and reliability. Overall, the replication logic is much more explicit, which will hopefully lead to fewer bugs (I found a few small ones while making the PR).
Replication Overview
In Simplex, when a node receives a notarization, emptyNotarization, or finalization for a round or sequence ahead of its current state, it indicates the node is behind. The process of catching up (retrieving missing rounds or sequences) is called Replication.

The replicationState struct manages this process. Replication is triggered when a notarization or emptyNotarization for a higher round is received, or when a finalization for a future sequence arrives. Nodes also send a notarization and finalization to potentially trigger replication for the lagging node.

The replicationState tracks missing rounds and sequences, resends requests as needed, and removes completed items. When a QuorumRound (notarization, empty notarization, or finalization) is received, it's added to the relevant state. Finalizations always supersede earlier quorum rounds, allowing us to prune older states from both the rounds and sequence trackers (see the sketch below). processReplicationState advances rounds or sequences by checking for available quorum rounds. If the next sequence is complete, it's committed; otherwise, the function checks for the lowest quorum round. Note that the lowest quorum round can be lower than the current round, because we may have advanced to the current round through replication along a "dead" chain, i.e. we may have received a notarization/empty notarization that other block builders decided not to build off of, yet received it as a valid replication response. Empty notarizations can sometimes be inferred to advance rounds via maybeAdvanceRoundFromEmptyNotarizations.
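As a rough illustration of the pruning described above, assuming the trackers are maps keyed by round and sequence number (field names are illustrative, not the exact replicationState fields):

// Once a finalization is observed, everything at or below it in either
// tracker is obsolete and can be dropped.
func (r *replicationState) pruneUpTo(finalizedRound, finalizedSeq uint64) {
	for round := range r.rounds {
		if round <= finalizedRound {
			delete(r.rounds, round)
		}
	}
	for seq := range r.sequences {
		if seq <= finalizedSeq {
			delete(r.sequences, seq)
		}
	}
}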
Key Changes
- A requestor struct, which fetches quorum rounds from the network.
- TimeoutManager: tasks are kept in a taskMap, and the TimeoutHandler periodically sends them to the TaskRunner every runInterval (sketched below).
- notarization and emptyNotarization messages are now able to be processed for previous, non-finalized rounds.
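A sketch of the periodic dispatch mentioned in the TimeoutManager bullet, building on the TimeoutHandler sketch earlier in this page; the close channel and the runTask hand-off to the TaskRunner are hypothetical names:

// Every runInterval, hand all currently pending tasks to the task runner.
func (t *TimeoutHandler[T]) run(runInterval time.Duration) {
	ticker := time.NewTicker(runInterval)
	defer ticker.Stop()
	for {
		select {
		case <-t.close: // hypothetical shutdown channel
			return
		case <-ticker.C:
			// Snapshot pending tasks under the lock, then dispatch outside it.
			t.lock.Lock()
			tasks := make([]T, 0, len(t.tasks))
			for id := range t.tasks {
				tasks = append(tasks, id)
			}
			t.lock.Unlock()
			for _, task := range tasks {
				t.runTask(task) // hypothetical hand-off to the TaskRunner
			}
		}
	}
}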
TODOS
- Fetch missing blocks by their Digests from peers. This is important for replication, but also for the case where we receive a notarization for our current round but have yet to receive the block.