
feat: partial sig with fallback #890

Open
shane-moore wants to merge 14 commits into sigp:unstable from shane-moore:feat/partial-sig-with-fallback

Conversation

Member

@shane-moore commented Mar 17, 2026

Problem, Evidence, and Context

Change Overview

  • After Lagrange reconstruction, verify the combined signature against the validator's master BLS pubkey
  • On verification failure, fall back to per-share verification using operator share pubkeys from DB, evict invalid shares, and continue waiting for replacement shares to restore quorum
  • Add resolveDuplicateSignature handling when conflicting partial sigs arrive from the same operator
  • Add metrics for signature verification failures and operator evictions
  • ~300 lines of production code, ~700 lines of tests
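
The verify-then-fallback flow described above can be sketched in toy form. This is a minimal sketch only: `CombineOutcome`, `try_combine_and_verify`, and `find_invalid_shares` here are simplified stand-ins using integer "shares" and a fake validity rule, not the PR's actual BLS code or signatures.

```rust
use std::collections::BTreeMap;

// Hypothetical outcome type mirroring the three cases described above.
#[derive(Debug, PartialEq)]
enum CombineOutcome {
    Success(u64), // stand-in for the reconstructed BLS signature
    CombineFailed(&'static str),
    VerificationFailed,
}

// Toy "share verification": a share is valid iff it is ten times its operator id.
fn share_is_valid(operator_id: u64, share: u64) -> bool {
    share == operator_id * 10
}

// Combine = sum here; verification against the "master pubkey" succeeds
// only when every contributing share is individually valid.
fn try_combine_and_verify(shares: &BTreeMap<u64, u64>) -> CombineOutcome {
    if shares.is_empty() {
        return CombineOutcome::CombineFailed("no shares");
    }
    if shares.iter().all(|(op, s)| share_is_valid(*op, *s)) {
        CombineOutcome::Success(shares.values().sum())
    } else {
        CombineOutcome::VerificationFailed
    }
}

// Fallback: per-share verification names the operators to evict.
fn find_invalid_shares(shares: &BTreeMap<u64, u64>) -> Vec<u64> {
    shares
        .iter()
        .filter(|(op, s)| !share_is_valid(**op, **s))
        .map(|(op, _)| *op)
        .collect()
}

fn main() {
    // Operator 3 sends a bad share; combined verification fails.
    let mut shares = BTreeMap::from([(1u64, 10u64), (2, 20), (3, 999)]);
    if try_combine_and_verify(&shares) == CombineOutcome::VerificationFailed {
        for op in find_invalid_shares(&shares) {
            shares.remove(&op); // evict; keep waiting for a replacement share
        }
    }
    println!("remaining operators: {:?}", shares.keys().collect::<Vec<_>>());
}
```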

Risks, Trade-offs, and Mitigations

  • The fallback path queries the DB via spawn_blocking. This only fires when reconstruction verification fails, which should be rare in normal operation
  • Duplicate resolution also hits the DB, but the message validator's per-operator rate limit (MAX_MESSAGES_PER_ROUND = 1) bounds this to at most one query per operator per collector instance
  • No changes to the happy path -> reconstruction success follows the same code path as before

Validation

  • 8 unit tests covering verify_reconstructed_signature, verify_partial_signature, find_invalid_shares, resolve_duplicate_signature, and try_combine_and_verify
  • 4 integration tests: happy path, fallback eviction + recovery, duplicate resolution, and evicted operator rejection

Rollback

  • Clean revert, no data/schema changes. Reverting removes verification and returns to the prior behavior of trusting all partial signatures.

Additional Info / Next Steps

  • Tests can be split to a separate PR if preferred; kept here because they validate the core logic.

@shane-moore changed the title from "Feat/partial sig with fallback" to "feat: partial sig with fallback" Mar 17, 2026
Comment on lines +593 to +604
warn!(?operator_id, "Conflicting signature from operator");
match fetch_share_pubkeys_by_validator_index(&database, validator_index).await {
    Ok(pubkeys) => {
        resolve_duplicate_signature(
            &mut signature_share,
            operator_id,
            &signature,
            signing_root,
            &pubkeys,
        );
    }
    Err(err) => {


Performance / DoS concern: Every conflicting signature from the same operator triggers a spawn_blocking + DB query. A malicious operator flooding distinct signatures for the same operator_id will hit this path on every message, causing unbounded DB queries.

Consider caching the share_pubkeys lookup result (e.g., in a local variable that persists across loop iterations), so the DB is only queried once per operator conflict:

Suggested change

warn!(?operator_id, "Conflicting signature from operator");
match fetch_share_pubkeys_by_validator_index(&database, validator_index).await {
    Ok(pubkeys) => {
        resolve_duplicate_signature(
            &mut signature_share,
            operator_id,
            &signature,
            signing_root,
            &pubkeys,
        );
    }
    Err(err) => {

if received_different_share {
    warn!(?operator_id, "Conflicting signature from operator");
    // TODO: consider caching share_pubkeys across loop iterations to avoid
    // repeated DB queries from a malicious operator sending many distinct sigs
    match fetch_share_pubkeys_by_validator_index(&database, validator_index).await {
        Ok(pubkeys) => {
            resolve_duplicate_signature(
                &mut signature_share,
                operator_id,
                &signature,
                signing_root,
                &pubkeys,
            );
        }
        Err(err) => {
            error!(
                ?err,
                ?validator_index,
                "DB lookup failed for duplicate resolution"
            );
        }
    }
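
A minimal sketch of the caching pattern the comment suggests, using an `Option` that persists across loop iterations. `fetch_share_pubkeys` here is an illustrative stand-in for the real spawn_blocking + DB lookup, not the PR's actual API:

```rust
// Stand-in for the real spawn_blocking + DB query (hypothetical).
fn fetch_share_pubkeys(validator_index: u64) -> Result<Vec<String>, String> {
    Ok(vec![format!("pubkey-for-{validator_index}")])
}

fn main() {
    let validator_index = 7;
    let mut cached_pubkeys: Option<Vec<String>> = None; // persists across iterations
    let mut db_queries = 0;

    // Five conflicting messages for the same operator arrive; the backing
    // store is only queried on the first, and the result is reused after.
    for _conflict in 0..5 {
        if cached_pubkeys.is_none() {
            db_queries += 1;
            match fetch_share_pubkeys(validator_index) {
                Ok(pubkeys) => cached_pubkeys = Some(pubkeys),
                Err(err) => {
                    eprintln!("lookup failed: {err}");
                    continue;
                }
            }
        }
        let _pubkeys = cached_pubkeys.as_ref().unwrap();
        // ... resolve_duplicate_signature(_pubkeys) would run here ...
    }
    println!("db queries: {db_queries}"); // 1, not 5
}
```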

Member Author

@shane-moore Mar 17, 2026


The message validator enforces MAX_MESSAGES_PER_ROUND = 1 per (operator, slot, kind), so at most ~2 messages from the same operator can reach the collector per slot (accounting for a small race window in the validation queue). This means the duplicate resolution path fires at most once per operator per collector instance -> bounded by committee size (4-13), not by attacker volume. A cache would save a handful of DB calls on an already-rare path so not worth the added complexity.

@diegomrsantos
Member

Could we please use the current PR template for the PRs?

@shane-moore marked this pull request as draft March 17, 2026 00:44
@shane-moore marked this pull request as ready for review March 17, 2026 00:44
@diegomrsantos
Member

Could we please use the current PR template for the PRs?

It seems we also need to get #870 merged into stable

Comment on lines +616 to +647
if let Some(threshold) = threshold
    && signature_share.len() as u64 >= threshold
{
    let signature = match combine_signatures(mem::take(&mut signature_share)) {
        Ok(signature) => Arc::new(signature),
        Err(err) => {
            let Some(validator_pk) = &validator_pubkey else {
                error!("No validator pubkey available for verification");
                return;
            };

            match try_combine_and_verify(&signature_share, validator_pk, signing_root) {
                CombineOutcome::Success(signature) => {
                    trace!(?signature, "Successfully recovered signature");
                    for notifier in mem::take(&mut notifiers) {
                        if notifier.send(Arc::clone(&signature)).is_err() {
                            warn!("Callback dropped since signature is no longer relevant");
                        }
                    }
                    full_signature = Some(signature);
                }
                CombineOutcome::CombineFailed(err) => {
                    error!(?err, "Failed to recover signature");
                    return;
                }
            };

            trace!(?signature, "Successfully recovered signature");

            for notifier in mem::take(&mut notifiers) {
                if notifier.send(Arc::clone(&signature)).is_err() {
                    warn!("Callback dropped - signature is no longer relevant");
                CombineOutcome::VerificationFailed => {
                    metrics::inc_counter(&metrics::SIGNATURE_VERIFICATION_FAILURES_TOTAL);
                    warn!("Reconstructed signature failed verification so run fallback");
                    let share_pubkeys = match fetch_share_pubkeys(&database, validator_pk).await {
                        Ok(pubkeys) => pubkeys,
                        Err(err) => {
                            error!(?err, "Failed to look up share pubkeys");
                            return;
                        }
                    };


Bug: fallback can leave valid shares stranded when count remains >= threshold

After find_invalid_shares removes bad shares, if signature_share.len() >= threshold still holds (e.g. threshold=3, had 5 shares, 1 bad removed → 4 remain), the code falls through and blocks on rx.recv().await without re-attempting combination. If no further messages arrive, the collector is stuck forever holding enough valid shares to succeed.

This can happen in practice when more than threshold partial signatures arrive before the first combine attempt, and only a minority are invalid.

Fix: wrap the threshold check in a loop so it re-attempts after eviction:

Suggested change

if let Some(threshold) = threshold
    && signature_share.len() as u64 >= threshold
{
    let signature = match combine_signatures(mem::take(&mut signature_share)) {
        Ok(signature) => Arc::new(signature),
        Err(err) => {
            let Some(validator_pk) = &validator_pubkey else {
                error!("No validator pubkey available for verification");
                return;
            };
            match try_combine_and_verify(&signature_share, validator_pk, signing_root) {
                CombineOutcome::Success(signature) => {
                    trace!(?signature, "Successfully recovered signature");
                    for notifier in mem::take(&mut notifiers) {
                        if notifier.send(Arc::clone(&signature)).is_err() {
                            warn!("Callback dropped since signature is no longer relevant");
                        }
                    }
                    full_signature = Some(signature);
                }
                CombineOutcome::CombineFailed(err) => {
                    error!(?err, "Failed to recover signature");
                    return;
                }
            };
            trace!(?signature, "Successfully recovered signature");
            for notifier in mem::take(&mut notifiers) {
                if notifier.send(Arc::clone(&signature)).is_err() {
                    warn!("Callback dropped - signature is no longer relevant");
                CombineOutcome::VerificationFailed => {
                    metrics::inc_counter(&metrics::SIGNATURE_VERIFICATION_FAILURES_TOTAL);
                    warn!("Reconstructed signature failed verification so run fallback");
                    let share_pubkeys = match fetch_share_pubkeys(&database, validator_pk).await {
                        Ok(pubkeys) => pubkeys,
                        Err(err) => {
                            error!(?err, "Failed to look up share pubkeys");
                            return;
                        }
                    };

// Re-check threshold after potential eviction of invalid shares.
while let Some(threshold) = threshold {
    if (signature_share.len() as u64) < threshold {
        break;
    }
    let Some(validator_pk) = &validator_pubkey else {
        error!("No validator pubkey available for verification");
        return;
    };
    match try_combine_and_verify(&signature_share, validator_pk, signing_root) {
        CombineOutcome::Success(signature) => {
            trace!(?signature, "Successfully recovered signature");
            for notifier in mem::take(&mut notifiers) {
                if notifier.send(Arc::clone(&signature)).is_err() {
                    warn!("Callback dropped since signature is no longer relevant");
                }
            }
            full_signature = Some(signature);
            break;
        }
        CombineOutcome::CombineFailed(err) => {
            error!(?err, "Failed to recover signature");
            return;
        }
        CombineOutcome::VerificationFailed => {
            metrics::inc_counter(&metrics::SIGNATURE_VERIFICATION_FAILURES_TOTAL);
            warn!("Reconstructed signature failed verification so run fallback");
            let share_pubkeys = match fetch_share_pubkeys(&database, validator_pk).await {
                Ok(pubkeys) => pubkeys,
                Err(err) => {
                    error!(?err, "Failed to look up share pubkeys");
                    return;
                }
            };
            let invalid_operators =
                find_invalid_shares(&signature_share, signing_root, &share_pubkeys);
            if invalid_operators.is_empty() {
                let operators: Vec<_> = signature_share.keys().copied().collect();
                error!(
                    ?signing_root,
                    ?operators,
                    "Verification failed but no individual share was invalid"
                );
                return;
            }
            warn!(?invalid_operators, "Removing invalid shares");
            for op in &invalid_operators {
                signature_share.remove(op);
            }
            // Loop back to re-check threshold with remaining shares
        }
    }
}

Member Author


Valid point: when network partials arrive before RegisterNotifier sets the threshold, they accumulate unchecked. The first combine then runs with len >> threshold, and after eviction len can still be >= threshold with all valid shares remaining. The if let didn't re-check after eviction, so it fell through to rx.recv(), waiting for a message that might never come.

Changed if let to while let so the threshold check re-runs immediately after eviction. Each iteration either succeeds, drops below threshold, or returns on fatal error -> no unnecessary wait.
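
The stranded-shares scenario and the while-loop fix can be walked through with a toy model. Integer "shares" and `is_valid` are hypothetical stand-ins for the real BLS shares and verification; `evict_until_combined` only illustrates the loop shape, not the PR's actual code:

```rust
use std::collections::BTreeMap;

// Toy stand-in: a share is "valid" iff it equals its operator id.
fn is_valid(op: u64, share: u64) -> bool {
    op == share
}

// Evict invalid shares and retry until a combine succeeds or the share
// count drops below threshold. Returns (combine attempts, combined value).
fn evict_until_combined(
    mut shares: BTreeMap<u64, u64>,
    threshold: u64,
) -> (u32, Option<u64>) {
    let mut attempts = 0;
    // `while` re-checks the threshold after each eviction; an `if` here
    // would leave the remaining valid shares stranded while waiting on
    // the channel for messages that may never come.
    while shares.len() as u64 >= threshold {
        attempts += 1;
        let invalid: Vec<u64> = shares
            .iter()
            .filter(|(op, s)| !is_valid(**op, **s))
            .map(|(op, _)| *op)
            .collect();
        if invalid.is_empty() {
            // Stand-in for a successful combine + verification.
            return (attempts, Some(shares.values().sum()));
        }
        for op in invalid {
            shares.remove(&op); // evict and loop back immediately
        }
    }
    (attempts, None) // below threshold: wait for replacement shares
}

fn main() {
    // Scenario from the review: threshold 3, five shares, operator 5 invalid.
    // One eviction pass leaves 4 valid shares, still >= threshold, so the
    // second iteration combines successfully without waiting for messages.
    let shares = BTreeMap::from([(1u64, 1u64), (2, 2), (3, 3), (4, 4), (5, 99)]);
    let (attempts, combined) = evict_until_combined(shares, 3);
    println!("attempts={attempts}, combined={combined:?}");
}
```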

@shane-moore
Member Author

Could we please use the current PR template for the PRs?

It seems we also need to get #870 merged into stable

true true! I can make a pr for this tomorrow

@shane-moore added the claude-recheck label (triggers claude review workflow to re-run) Mar 17, 2026
@github-actions bot removed the claude-recheck label (triggers claude review workflow to re-run) Mar 17, 2026
