Conversation
```rust
warn!(?operator_id, "Conflicting signature from operator");
match fetch_share_pubkeys_by_validator_index(&database, validator_index).await {
    Ok(pubkeys) => {
        resolve_duplicate_signature(
            &mut signature_share,
            operator_id,
            &signature,
            signing_root,
            &pubkeys,
        );
    }
    Err(err) => {
```
Performance / DoS concern: every conflicting signature from the same operator triggers a `spawn_blocking` plus a DB query. A malicious operator flooding distinct signatures for the same `operator_id` will hit this path on every message, causing unbounded DB queries.
Consider caching the `share_pubkeys` lookup result (e.g., in a local variable that persists across loop iterations) so the DB is only queried once per operator conflict:
```rust
if received_different_share {
    warn!(?operator_id, "Conflicting signature from operator");
    // TODO: consider caching share_pubkeys across loop iterations to avoid
    // repeated DB queries from a malicious operator sending many distinct sigs
    match fetch_share_pubkeys_by_validator_index(&database, validator_index).await {
        Ok(pubkeys) => {
            resolve_duplicate_signature(
                &mut signature_share,
                operator_id,
                &signature,
                signing_root,
                &pubkeys,
            );
        }
        Err(err) => {
            error!(
                ?err,
                ?validator_index,
                "DB lookup failed for duplicate resolution"
            );
        }
    }
```
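If the cache were wanted, the memoization pattern can be sketched as a self-contained toy (all names here are invented; the stub stands in for the real async DB call):

```rust
// Hypothetical sketch: cache a fallible pubkey lookup across conflict-handling
// iterations so the "DB" is hit at most once, however many conflicts arrive.
fn fetch_share_pubkeys_stub() -> Result<Vec<u64>, String> {
    Ok(vec![1, 2, 3]) // stand-in for the real DB query
}

// Returns how many times the "DB" was queried while handling `conflicts` conflicts.
fn handle_conflicts(conflicts: usize) -> usize {
    let mut cached: Option<Vec<u64>> = None;
    let mut queries = 0;
    for _ in 0..conflicts {
        if cached.is_none() {
            queries += 1;
            match fetch_share_pubkeys_stub() {
                Ok(p) => cached = Some(p),
                Err(_) => continue, // lookup failed; retry on the next conflict
            }
        }
        let _pubkeys = cached.as_ref().expect("populated above");
        // ... resolve_duplicate_signature(_pubkeys) would run here ...
    }
    queries
}

fn main() {
    println!("{}", handle_conflicts(5)); // prints 1: one query for 5 conflicts
}
```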
The message validator enforces `MAX_MESSAGES_PER_ROUND` = 1 per (operator, slot, kind), so at most ~2 messages from the same operator can reach the collector per slot (accounting for a small race window in the validation queue). This means the duplicate resolution path fires at most once per operator per collector instance, so it is bounded by committee size (4-13), not by attacker volume. A cache would save a handful of DB calls on an already-rare path, so it's not worth the added complexity.
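The per-round bound described above can be modeled as a toy dedup map (assumed semantics, invented names; the real validator lives upstream of the collector):

```rust
use std::collections::HashMap;

// Toy model: at most MAX_MESSAGES_PER_ROUND messages per (operator, slot, kind)
// key pass validation, so a flood collapses to a constant per round.
const MAX_MESSAGES_PER_ROUND: u32 = 1;

fn admit_flood(messages: u32) -> u32 {
    let mut seen: HashMap<(u64, u64, &str), u32> = HashMap::new();
    let mut accepted = 0;
    for _ in 0..messages {
        // A malicious operator (id 7) repeats the same round key every time.
        let count = seen.entry((7, 42, "partial_sig")).or_insert(0);
        if *count < MAX_MESSAGES_PER_ROUND {
            *count += 1;
            accepted += 1;
        }
    }
    accepted
}

fn main() {
    // 100 distinct signatures for one round: only 1 reaches the collector.
    println!("{}", admit_flood(100));
}
```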
Could we please use the current PR template for the PRs?
It seems we also need to get #870 merged into stable.
```rust
if let Some(threshold) = threshold
    && signature_share.len() as u64 >= threshold
{
    let Some(validator_pk) = &validator_pubkey else {
        error!("No validator pubkey available for verification");
        return;
    };
    match try_combine_and_verify(&signature_share, validator_pk, signing_root) {
        CombineOutcome::Success(signature) => {
            trace!(?signature, "Successfully recovered signature");
            for notifier in mem::take(&mut notifiers) {
                if notifier.send(Arc::clone(&signature)).is_err() {
                    warn!("Callback dropped since signature is no longer relevant");
                }
            }
            full_signature = Some(signature);
        }
        CombineOutcome::CombineFailed(err) => {
            error!(?err, "Failed to recover signature");
            return;
        }
        CombineOutcome::VerificationFailed => {
            metrics::inc_counter(&metrics::SIGNATURE_VERIFICATION_FAILURES_TOTAL);
            warn!("Reconstructed signature failed verification so run fallback");
            let share_pubkeys = match fetch_share_pubkeys(&database, validator_pk).await {
                Ok(pubkeys) => pubkeys,
                Err(err) => {
                    error!(?err, "Failed to look up share pubkeys");
                    return;
                }
            };
```
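For reference, a minimal sketch of the three-way outcome type the diff matches on (only the variant names come from the diff; the payloads and the toy combiner are assumptions):

```rust
// Assumed shape of the outcome type; payloads are stand-ins for real types.
#[derive(Debug)]
enum CombineOutcome {
    Success(String),       // real code carries the recovered signature
    CombineFailed(String), // combination itself errored
    VerificationFailed,    // combined fine, but verification rejected the result
}

// Toy combiner over (operator_id, share_is_valid) pairs.
fn try_combine_and_verify(shares: &[(u64, bool)]) -> CombineOutcome {
    if shares.is_empty() {
        return CombineOutcome::CombineFailed("no shares to combine".into());
    }
    if shares.iter().all(|&(_, valid)| valid) {
        CombineOutcome::Success("recovered-signature".into())
    } else {
        // Any bad share poisons the combined signature, so verification fails.
        CombineOutcome::VerificationFailed
    }
}

fn main() {
    println!("{:?}", try_combine_and_verify(&[(1, true), (2, true)]));
    println!("{:?}", try_combine_and_verify(&[(1, true), (2, false)]));
}
```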
Bug: the fallback can leave valid shares stranded when the count remains >= threshold.
After `find_invalid_shares` removes bad shares, if `signature_share.len() >= threshold` still holds (e.g. threshold = 3, had 5 shares, 1 bad removed → 4 remain), the code falls through and blocks on `rx.recv().await` without re-attempting combination. If no further messages arrive, the collector is stuck forever while holding enough valid shares to succeed.
This can happen in practice when more than `threshold` partial signatures arrive before the first combine attempt and only a minority of them are invalid.
Fix: wrap the threshold check in a loop so it re-attempts after eviction:
```rust
// Re-check threshold after potential eviction of invalid shares.
while let Some(threshold) = threshold {
    if (signature_share.len() as u64) < threshold {
        break;
    }
    let Some(validator_pk) = &validator_pubkey else {
        error!("No validator pubkey available for verification");
        return;
    };
    match try_combine_and_verify(&signature_share, validator_pk, signing_root) {
        CombineOutcome::Success(signature) => {
            trace!(?signature, "Successfully recovered signature");
            for notifier in mem::take(&mut notifiers) {
                if notifier.send(Arc::clone(&signature)).is_err() {
                    warn!("Callback dropped since signature is no longer relevant");
                }
            }
            full_signature = Some(signature);
            break;
        }
        CombineOutcome::CombineFailed(err) => {
            error!(?err, "Failed to recover signature");
            return;
        }
        CombineOutcome::VerificationFailed => {
            metrics::inc_counter(&metrics::SIGNATURE_VERIFICATION_FAILURES_TOTAL);
            warn!("Reconstructed signature failed verification so run fallback");
            let share_pubkeys = match fetch_share_pubkeys(&database, validator_pk).await {
                Ok(pubkeys) => pubkeys,
                Err(err) => {
                    error!(?err, "Failed to look up share pubkeys");
                    return;
                }
            };
            let invalid_operators =
                find_invalid_shares(&signature_share, signing_root, &share_pubkeys);
            if invalid_operators.is_empty() {
                let operators: Vec<_> = signature_share.keys().copied().collect();
                error!(
                    ?signing_root,
                    ?operators,
                    "Verification failed but no individual share was invalid"
                );
                return;
            }
            warn!(?invalid_operators, "Removing invalid shares");
            for op in &invalid_operators {
                signature_share.remove(op);
            }
            // Loop back to re-check threshold with remaining shares
        }
    }
}
```
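The effect of looping back after eviction can be demonstrated with a runnable toy model (types and the all-or-nothing "combine" check are invented for illustration):

```rust
// Toy model of the suggested loop: evict invalid shares, then immediately
// re-check the threshold instead of waiting for messages that may never come.
// Shares are (operator_id, share_is_valid) pairs.
fn resolve(threshold: usize, mut shares: Vec<(u64, bool)>) -> (bool, usize) {
    let mut combined = false;
    while shares.len() >= threshold {
        if shares.iter().all(|&(_, valid)| valid) {
            combined = true; // combine + verify would succeed here
            break;
        }
        // Fallback path: drop invalid shares, then loop back to the check.
        // The share set strictly shrinks, so the loop always terminates.
        shares.retain(|&(_, valid)| valid);
    }
    (combined, shares.len())
}

fn main() {
    // threshold = 3, five shares, one bad: eviction leaves 4 >= 3, and the
    // re-check combines instead of stranding the collector on rx.recv().
    let (ok, remaining) =
        resolve(3, vec![(1, true), (2, true), (3, false), (4, true), (5, true)]);
    println!("{} {}", ok, remaining);
}
```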
Valid point: when network partials arrive before `RegisterNotifier` sets the threshold, they accumulate unchecked. The first combine then runs with `len` well above the threshold, and after eviction `len` can still be >= threshold with only valid shares remaining. The `if let` didn't re-check after eviction, so it fell through to `rx.recv()` waiting for a message that might never come.
Changed `if let` to `while let` so the threshold check re-runs immediately after eviction. Each iteration either succeeds, drops below the threshold, or returns on a fatal error, so there is no unnecessary wait.
True, true! I can make a PR for this tomorrow.
Problem, Evidence, and Context

- `ReconstructSignature` -> post-reconstruction BLS verification
- `FallBackAndVerifyEachSignature` -> per-share fallback verification
- `resolveDuplicateSignature` -> duplicate share handling

Change Overview

- `resolveDuplicateSignature` handling when conflicting partial sigs arrive from the same operator

Risks, Trade-offs, and Mitigations

- `spawn_blocking`. This only fires when reconstruction verification fails, which should be rare in normal operation
- (`MAX_MESSAGES_PER_ROUND` = 1) bounds this to at most one query per operator per collector instance

Validation

- `verify_reconstructed_signature`, `verify_partial_signature`, `find_invalid_shares`, `resolve_duplicate_signature`, and `try_combine_and_verify`

Rollback

Additional Info / Next Steps