
fix: restore a post-commit boundary in event processing #885

Closed
diegomrsantos wants to merge 16 commits into sigp:unstable from diegomrsantos:fix/db-publication-boundary-simple

Conversation

@diegomrsantos
Member

Problem, Evidence, and Context

Change Overview

  • Keep RPC log fetching batched, but process and commit logs one event at a time.
  • Add an exact processed-event cursor to metadata so sync can safely resume within a block.
  • Move the durable-write boundary into the database layer so persisted state, processed-event progress, and in-memory publication advance in the right order.
  • Update sync to resume from the finer-grained cursor and skip already committed logs from the same block.
  • Keep slashing protection registration as the one pre-commit safety precondition for validator activation.
  • Intentionally did not change event decoding, contract-facing semantics, or the existing test-only transaction helpers outside the production event/sync path.

Risks, Trade-offs, and Mitigations

  • Main trade-off: more SQLite transactions during sync, in exchange for a much simpler and safer commit/publication model.
  • Main data-plane risk: schema version increases to v4.
  • Main runtime risk: sync resume semantics change from block-level to event-level.
  • Mitigations:
    • log fetching remains batched for RPC efficiency
    • progress is now persisted with exact (block_number, transaction_index, log_index) information
    • same-block resume behavior is covered by targeted tests
    • the production event/sync path now publishes in-memory state only after commit

Validation

  • cargo fmt --all
  • cargo fmt --all --check
  • cargo clippy -p database -p eth --all-targets -- -D warnings
  • cargo test -p database -p eth --features database/test-utils
  • Added coverage for cursor persistence across restart and resuming within the same block without replaying already committed logs.

Rollback

  • Safe code rollback before merge: revert this PR.
  • Operational caveat: this migrates the DB schema to v4. Rolling back to an older binary that does not understand schema v4 is not safe on an already-upgraded DB.
  • If a rollback is needed after deploying a build with this change, use a compatible binary or restore the DB from a pre-migration backup.

Blockers / Dependencies

  • N/A

Additional Info / Next Steps

  • The production path now uses the new boundary, but some legacy transaction-taking database helpers still exist for tests and non-production utility paths. They can be cleaned up separately once this behavior change lands.

@diegomrsantos
Member Author

diegomrsantos commented Mar 12, 2026

The earlier review guide is outdated after the history rewrite. The easiest way to review this PR now is to follow the 4-commit stack in order and focus on a small set of functions.

The core model is:

  1. EventProcessor decides what one log means.
  2. NetworkDatabase makes that event durable and persists the exact progress cursor in the same transaction.
  3. Only after commit do we update NetworkState and notify watchers.
  4. Once a whole fetched range succeeds, we collapse partial per-event progress back to a fully processed block.

If any production path violates that ordering, it is suspicious.

1. Database-owned commit boundary

Commit:

  • d4a6e60a9 refactor(database): add post-commit event progress boundary

Read these first:

  • anchor/database/src/lib.rs

    • mark_event_processed: persists only the exact (block, tx, log) cursor for a committed or intentionally skipped event
    • advance_processed_block: persists the coarse “whole block completed” boundary and clears the finer cursor
    • commit_db_update: the core boundary of the PR; applies the SQL change, persists matching progress in the same transaction, commits, and only then publishes the corresponding in-memory state update
    • apply_progress_to_tx / apply_progress_to_state: mirror the same progress model in SQLite and NetworkState
  • anchor/database/src/state.rs

    • get_last_processed_event_from_db: treats the three cursor columns as one logical value and rejects mixed NULL/non-NULL rows as corruption
    • next_block_to_fetch: defines resume semantics by preferring the partial cursor block when present so the same block can be re-fetched and already-committed logs skipped deterministically
  • anchor/database/src/operator_operations.rs

    • commit_operator_added: commits operator insert + max_operator_id_seen + exact cursor together
    • commit_seen_operator_id: commits only “we have seen operator id N” plus the exact cursor for malformed/skipped operator events
    • commit_operator_removed: commits operator removal plus exact cursor
  • anchor/database/src/cluster_operations.rs

    • commit_validator_added: commits nonce bump + validator/cluster/share insert + exact cursor together
    • commit_owner_nonce: commits nonce bump + exact cursor for malformed/skipped ValidatorAdded
    • commit_cluster_status: commits liquidation/reactivation status plus exact cursor
    • commit_validator_removed: commits validator removal plus exact cursor
  • anchor/database/src/validator_operations.rs

    • commit_fee_recipient_updated: commits fee-recipient update plus exact cursor
    • set_validator_indices: commits index updates and then mirrors them into state
    • update_graffiti: same post-commit boundary, but without sync-progress changes

What to check:

  • no shared state mutation before tx.commit()
  • durable event change and durable progress always commit together
  • NetworkState is only updated after commit
  • processed block progress is monotonic and clears any partial cursor
  • production writes go through the DB-owned commit boundary
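The cursor decoding and resume rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `EventCursor` and `CursorError` types, the `decode_cursor` helper, and both function signatures are hypothetical, and the fallback to `last_processed_block + 1` when no cursor exists is an assumption.

```rust
/// Exact position of one log: (block_number, transaction_index, log_index).
/// Hypothetical type used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct EventCursor {
    pub block_number: u64,
    pub transaction_index: u64,
    pub log_index: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CursorError {
    /// Some, but not all, of the three cursor columns were NULL.
    MixedNulls,
}

/// Treat the three nullable columns as one logical value:
/// all NULL means "no partial progress", all set means "exact cursor",
/// and anything in between is reported as corruption.
pub fn decode_cursor(
    block_number: Option<u64>,
    transaction_index: Option<u64>,
    log_index: Option<u64>,
) -> Result<Option<EventCursor>, CursorError> {
    match (block_number, transaction_index, log_index) {
        (None, None, None) => Ok(None),
        (Some(block_number), Some(transaction_index), Some(log_index)) => {
            Ok(Some(EventCursor {
                block_number,
                transaction_index,
                log_index,
            }))
        }
        _ => Err(CursorError::MixedNulls),
    }
}

/// Prefer the partial cursor's block so the same block is re-fetched;
/// otherwise continue after the last fully processed block (assumed fallback).
pub fn next_block_to_fetch(last_processed_block: u64, cursor: Option<EventCursor>) -> u64 {
    match cursor {
        Some(c) => c.block_number,
        None => last_processed_block + 1,
    }
}
```

With a cursor at block 42 the next fetch starts at block 42 again, and the already-committed logs in that block are filtered out before processing.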

2. Per-event processing model

Commit:

  • ba5fffa2d refactor(eth): process logs with per-event commits

Read these next:

  • anchor/eth/src/event_processor.rs

    • process_logs: skips already-committed logs before processing the fetched range
    • process_logs_inner: runs the range sequentially and only advances the full block boundary if the whole range succeeded
    • process_single_log: derives the exact cursor for one log and routes it
    • dispatch_log: only chooses the handler
    • finish_processed_log: the key function where cursor ownership becomes explicit:
      • handler already committed cursor
      • handler succeeded and caller must advance cursor
      • handler was skippable and caller advances cursor
      • handler was fatal
    • mark_event_processed: persists cursor-only progress when the caller owns advancement
    • advance_processed_block_if_needed: collapses partial in-block progress back to a full block boundary only when appropriate
    • skip_processed_logs / log_at_or_before_cursor / cursor_for_log: the replay/resume mechanics that make “re-fetch the same block, skip already-committed logs, continue” actually work
  • anchor/eth/src/util.rs

    • validate_operators: validates operator-set structure and treats missing operators as invalid event data rather than internal DB failure

Then read the handlers with the most important semantics:

  • process_operator_added:
    • duplicate operator ids are skippable
    • malformed/duplicate operator payloads can still persist max_operator_id_seen
    • later valid operator ids should not be blocked by a bad earlier one
  • process_validator_added:
    • malformed validator-add events still consume the owner nonce
    • duplicate validator-add events are skipped without letting the SQL unique constraint abort sync
    • slashing registration happens before the main DB commit and explicitly relies on idempotence
    • fee recipient is no longer read before the write
  • process_validator_removed: missing or mismatched validator state is treated as invalid event data rather than a local DB failure
  • process_validator_exited: intentionally different from the others, because this is mostly a side-effect event, so it returns NeedsCursorAdvance and lets the caller mark progress afterwards

What to check:

  • there is no long-lived outer DB transaction anymore
  • one log is now the meaningful unit of progress
  • cursor ownership is explicit for every handler outcome
  • block progress is only advanced after the whole fetched range succeeds
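To make "cursor ownership is explicit for every handler outcome" concrete, here is a small sketch of how the four cases listed for finish_processed_log could be modelled. The `HandlerOutcome` enum, its flat shape, and the `caller_marks` vector (a stand-in for the caller's durable cursor writes) are assumptions for illustration; the real code distinguishes success values from error variants.

```rust
/// (block_number, transaction_index, log_index)
pub type Cursor = (u64, u64, u64);

/// Hypothetical flattened view of what a handler can report back.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HandlerOutcome {
    /// Handler committed its change and the cursor in one transaction.
    CursorCommitted,
    /// Handler succeeded without touching the cursor; caller must advance it.
    NeedsCursorAdvance,
    /// Event was skippable and nothing was committed; caller advances the cursor.
    Skippable,
    /// Event was skippable, but the handler already committed partial
    /// progress together with the cursor.
    SkippableCommitted,
    /// Processing must stop; the cursor is left where it was.
    Fatal,
}

/// Decide who advances the cursor for one processed log.
/// `caller_marks` records every cursor the caller itself had to persist.
pub fn finish_processed_log(
    outcome: HandlerOutcome,
    cursor: Cursor,
    caller_marks: &mut Vec<Cursor>,
) -> Result<(), &'static str> {
    match outcome {
        // The handler owns the cursor write: advancing again would be a double write.
        HandlerOutcome::CursorCommitted | HandlerOutcome::SkippableCommitted => Ok(()),
        // The caller owns the cursor write.
        HandlerOutcome::NeedsCursorAdvance | HandlerOutcome::Skippable => {
            caller_marks.push(cursor);
            Ok(())
        }
        HandlerOutcome::Fatal => Err("fatal handler error; cursor not advanced"),
    }
}
```

The useful property of an exhaustive `match` here is that adding a new outcome forces a decision about who advances the cursor.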

3. Sync semantics and hardening

Commit:

  • ffad62c65 fix(eth): harden per-event sync semantics

Read:

  • anchor/eth/src/sync.rs

    • historical_sync: fetching is still overlapped, but processing is serialized behind one running_processor; each batch is processed in spawn_blocking(move || process_logs(...)), and after a round it re-reads the committed resume point from DB instead of assuming end_block + 1
    • live_sync: computes start_block from the same next_block_to_fetch model, skips already-synced blocks correctly on reconnect/reorg, and immediately awaits the blocking event-processing work so there is backpressure
  • anchor/eth/src/event_processor.rs

    • advance_processed_block_if_needed: refuses to regress the processed block boundary and only collapses to block progress when the whole range actually succeeded

This commit also contains the benchmark-driven correctness fixes that made the branch actually match the intended semantics:

  • duplicate ValidatorAdded
  • OperatorAdded malformed/duplicate handling preserving max_operator_id_seen
  • re-reading committed sync progress instead of assuming the outer loop state

4. Tests and cleanup

Commit:

  • e5066390f refactor(database): align tests with commit boundary

Read this last.

Highest-signal tests:

  • anchor/database/src/tests/state_tests.rs
    • test_processed_event_cursor_after_restart: proves exact event cursor persists across restart and is cleared once the full block boundary is committed
  • anchor/eth/tests/integration.rs
    • duplicate validator-added test: proves duplicate ValidatorAdded is skipped without re-queueing index sync and while still advancing nonce/progress
    • malformed/duplicate operator-added tests: prove bad operator history no longer blocks later valid operator ids and still preserves max_operator_id_seen
    • multi-event processing test: proves multiple events in one fetched range still produce the expected final block progress

What to check:

  • restart preserves the exact event cursor
  • replay in the same block skips already-committed logs
  • malformed duplicate history no longer blocks later valid events
  • tests now use the same committed API shape as production instead of older tx-only helpers

Shortest useful path

If you only want the highest-signal path through the diff, I would read these in order:

  1. anchor/database/src/lib.rs
  2. anchor/database/src/state.rs
  3. anchor/database/src/operator_operations.rs
  4. anchor/database/src/cluster_operations.rs
  5. anchor/database/src/validator_operations.rs
  6. anchor/eth/src/event_processor.rs
  7. anchor/eth/src/sync.rs
  8. anchor/database/src/tests/state_tests.rs
  9. anchor/eth/tests/integration.rs

What to ignore on first pass

On a first review, I would ignore:

  • most helper/test cleanup details
  • small parsing helper refactors
  • doc-only changes

The main review question for this PR is still:

Does every production event now cross one DB-owned boundary where the durable event change and exact sync progress commit together, with all in-memory publication strictly after commit?

};
let fee_recipient = self
.db
.with_state(|state| state.fee_recipient_for_owner(&owner).unwrap_or(owner));
Member Author

ValidatorAdded does not need fee_recipient to persist the validator. The write path only needs cluster_id, owner, validator metadata, and shares; fee_recipient lives in the owners table and is joined later when materializing a full Cluster on reads. Reading it here was forcing a read-model concern into the write path and is what made the earlier design more confusing than necessary.

@diegomrsantos diegomrsantos marked this pull request as ready for review March 16, 2026 21:51
@diegomrsantos
Member Author

@claude review this PR

@claude-code-actions-sigp

This comment was marked as outdated.

@diegomrsantos
Member Author

@claude review this PR again

@claude-code-actions-sigp

claude-code-actions-sigp bot commented Mar 16, 2026

Claude finished @diegomrsantos's task in 3m 1s


Review: fix: restore a post-commit boundary in event processing (re-review)

  • Read PR diff against origin/unstable
  • Review database core: lib.rs, schema.rs, state.rs, sql_operations.rs
  • Review database operations: cluster_operations.rs, operator_operations.rs, validator_operations.rs
  • Review event processor: event_processor.rs
  • Review sync changes: sync.rs, index_sync.rs, util.rs
  • Review test files
  • Verify commit boundary invariant across all paths
  • Post findings

Changes since last review

Five commits landed since the previous review, directly addressing all three actionable items raised:

  1. e340195 + apply_progress_to_state in lib.rs:398-401, 405-408 — Added debug_assert! in both ProgressUpdate::Event and ProgressUpdate::Block arms, plus a debug_assert! in next_block_to_fetch (state.rs:419-422). This addresses the cursor ordering invariant concern.

  2. a8e6d5f event_processor.rs:263-273: advance_processed_block_if_needed now reads current_block and explicitly returns early if end_block < current_block. This addresses the block regression concern.

  3. 79ea6cd slashing.rs:26-30 + event_processor.rs:604-608 — Documented the idempotence requirement for register_validator in both the trait contract and the call site.

  4. 853a116 sync.rs:505-511 — Historical sync outer loop now re-reads committed progress via next_block_to_fetch() instead of assuming end_block + 1.

  5. 56dd8ab — Fixed test helper assertions to match the new API.

New test coverage: test_older_end_block_does_not_regress_progress proves the regression guard works.
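The regression guard can be pictured with a sketch like the one below. The `Progress` struct and the boolean return value are assumptions for illustration; only the early return on `end_block < current_block` and the clearing of the partial cursor come from the description above.

```rust
/// Hypothetical in-memory view of persisted sync progress.
#[derive(Debug, PartialEq, Eq)]
pub struct Progress {
    /// Last block whose logs were all processed.
    pub last_processed_block: u64,
    /// Exact (block, tx, log) cursor for a partially processed block, if any.
    pub cursor: Option<(u64, u64, u64)>,
}

/// Collapse per-event progress to a full block boundary, but never move
/// the boundary backwards. Returns true if progress was advanced.
pub fn advance_processed_block_if_needed(progress: &mut Progress, end_block: u64) -> bool {
    let current_block = progress.last_processed_block;
    if end_block < current_block {
        // An older range finished late: keep the newer boundary and cursor.
        return false;
    }
    progress.last_processed_block = end_block;
    // A completed block makes any finer-grained cursor obsolete.
    progress.cursor = None;
    true
}
```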

Invariant verification

Re-verified the core invariant — SQL + cursor commit together inside a transaction, in-memory watch<NetworkState> updates only after commit, side effects after that — across all production paths.

commit_db_update (lib.rs:342-368) enforces the ordering:

apply_tx(&tx)  →  apply_progress_to_tx(progress, &tx)  →  tx.commit()  →  apply_state + apply_progress_to_state

Every commit_* wrapper flows through this single boundary. No production path mutates watched state before tx.commit().
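As a rough illustration of that ordering, the sketch below models the boundary with plain in-memory structures. Everything here is a stand-in: `Store`, `Tx`, the string rows, and the `fail_next_commit` switch are hypothetical and only exist to show that nothing is published when the commit fails.

```rust
/// (block_number, transaction_index, log_index)
pub type Cursor = (u64, u64, u64);

/// Toy store: `durable_*` plays the role of SQLite,
/// `published_*` the role of the watched in-memory state.
#[derive(Debug, Default)]
pub struct Store {
    pub durable_rows: Vec<String>,
    pub durable_cursor: Option<Cursor>,
    pub published_rows: Vec<String>,
    pub published_cursor: Option<Cursor>,
    /// Test hook: make the next commit fail.
    pub fail_next_commit: bool,
}

/// Staged, not yet durable, changes.
#[derive(Default)]
struct Tx {
    rows: Vec<String>,
    cursor: Option<Cursor>,
}

impl Store {
    pub fn commit_db_update(&mut self, row: &str, cursor: Cursor) -> Result<(), String> {
        let mut tx = Tx::default();
        // 1. Stage the event's data change.
        tx.rows.push(row.to_string());
        // 2. Stage the matching progress in the same transaction.
        tx.cursor = Some(cursor);
        // 3. Commit. On failure nothing durable or published has changed.
        if self.fail_next_commit {
            self.fail_next_commit = false;
            return Err("commit failed".to_string());
        }
        let Tx { rows, cursor: committed_cursor } = tx;
        self.durable_rows.extend(rows);
        self.durable_cursor = committed_cursor;
        // 4. Only now mirror the change and the progress into published state.
        self.published_rows.push(row.to_string());
        self.published_cursor = committed_cursor;
        Ok(())
    }
}
```

Because publication is the last step, an observer of the published state can never see a change that the durable store does not already contain.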

| Handler | Commits via | Cursor ownership | In-memory after commit |
| --- | --- | --- | --- |
| process_operator_added | commit_operator_added / commit_seen_operator_id | CursorCommitted | Yes |
| process_operator_removed | commit_operator_removed | CursorCommitted | Yes |
| process_validator_added | commit_validator_added / commit_owner_nonce | CursorCommitted | Yes |
| process_validator_removed | commit_validator_removed | CursorCommitted | Yes |
| process_cluster_liquidated | commit_cluster_status | CursorCommitted | Yes |
| process_cluster_reactivated | commit_cluster_status | CursorCommitted | Yes |
| process_fee_recipient_updated | commit_fee_recipient_updated | CursorCommitted | Yes |
| process_validator_exited | caller via NeedsCursorAdvance | mark_event_processed | Correct (side-effect only) |
| unknown/missing topic | caller via mark_event_processed | explicit | Correct |

Issues

No new correctness issues found. The previous review's three actionable items have all been addressed with appropriate fixes and test coverage.

One minor observation remains:

apply_delete_validator_state uses expect on committed state

anchor/database/src/cluster_operations.rs:198-202

let metadata = state
    .multi_state
    .validator_metadata
    .remove(validator_pubkey)
    .expect("Data should have existed");

This runs inside the apply_state closure of commit_db_update, after tx.commit() has already succeeded. If the in-memory state diverges from the DB (a bug elsewhere), this expect will panic rather than returning an error. In practice this is fine — divergence between committed DB state and in-memory state would indicate a serious bug that warrants a crash. This is not new to this PR and is consistent with the existing pattern, so not actionable here.

Observations

  • The notify parameter usage is correct throughout: state-changing events (commit_operator_added, commit_validator_added, etc.) use true, while bookkeeping operations (commit_owner_nonce, commit_seen_operator_id, mark_event_processed, advance_processed_block) use false.

  • The SkippableCommitted error variant correctly prevents double cursor advancement in finish_processed_log — when a handler commits partial progress (nonce bump or max-seen-operator-id) before returning a skip error, the caller does not call mark_event_processed again.

  • skip_processed_logs uses lexicographic (block_number, transaction_index, log_index) comparison with <=, which correctly skips the cursor event itself on replay.

  • The historical_sync outer loop now re-reads committed progress (sync.rs:508-511) instead of start_block = end_block + 1, keeping it aligned with the DB-owned progress model.

  • Schema v4 migration (schema.rs:32-36) adds three nullable INTEGER columns with CHECK (>= 0) constraints. UPDATE_BLOCK_NUMBER (sql_operations.rs:111-117) correctly NULLs cursor columns when collapsing to a block boundary.

  • Test coverage is solid: test_processed_event_cursor_after_restart, test_resume_skips_already_processed_logs_in_same_block, and the new test_older_end_block_does_not_regress_progress together prove the three key properties of the cursor system.
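The lexicographic skip mentioned above maps directly onto Rust tuple ordering. The following is a sketch under assumptions: the `Log` struct is a hypothetical stand-in for the RPC log type, the function bodies are guesses at the shape of the real helpers, and the input is assumed to be sorted in chain order.

```rust
/// (block_number, transaction_index, log_index)
pub type Cursor = (u64, u64, u64);

/// Hypothetical minimal log record; the real log type carries more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Log {
    pub block_number: u64,
    pub transaction_index: u64,
    pub log_index: u64,
}

pub fn cursor_for_log(log: &Log) -> Cursor {
    (log.block_number, log.transaction_index, log.log_index)
}

/// Tuples compare lexicographically, so `<=` also covers the cursor event itself.
pub fn log_at_or_before_cursor(log: &Log, cursor: Cursor) -> bool {
    cursor_for_log(log) <= cursor
}

/// Drop the already-committed prefix of a re-fetched range.
/// Assumes `logs` is sorted by (block, tx, log).
pub fn skip_processed_logs(logs: &[Log], cursor: Option<Cursor>) -> &[Log] {
    let Some(cursor) = cursor else {
        // No partial progress recorded: process everything.
        return logs;
    };
    let first_new = logs
        .iter()
        .position(|log| !log_at_or_before_cursor(log, cursor))
        .unwrap_or(logs.len());
    &logs[first_new..]
}
```

Using `<=` rather than `<` is what prevents the event at the cursor from being applied a second time after a restart.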

Summary

The previous review's actionable items have been addressed. The commit boundary invariant holds across all production paths. The defensive guards (debug_assert! for cursor ordering, explicit regression check in advance_processed_block_if_needed) and documentation (slashing idempotence contract, next_block_to_fetch invariant) strengthen the model. No new issues found.


@diegomrsantos diegomrsantos force-pushed the fix/db-publication-boundary-simple branch from 79ea6cd to 53c6cda Compare March 17, 2026 14:04
@diegomrsantos diegomrsantos force-pushed the fix/db-publication-boundary-simple branch from 53c6cda to e506639 Compare March 17, 2026 14:52
let operators_valid = self
.db
.with_state(|state| validate_operators(&operator_ids, &cluster_id, state));
if let Err(err) = operators_valid {
Member Author

@diegomrsantos diegomrsantos Mar 17, 2026

@dknopik @shane-moore I reverted validate_operators back to ExecutionError::Database, but this call site is still where the ambiguity matters. If required operators are missing from committed state, we know the validator event is not safely applicable, but we do not know whether that comes from malformed history or local state inconsistency. This PR is not meant to redefine that boundary, so I kept the old classification and wanted to call out the ambiguity explicitly.

"Failed to fetch validator metadata from database"
);
return Err(ExecutionError::Database(
return Err(EventActionError::Fatal(ExecutionError::Database(
Member Author

@diegomrsantos diegomrsantos Mar 17, 2026

@dknopik @shane-moore I reverted these missing-metadata and missing-cluster branches back to DB-style failures. From this code we cannot prove the ValidatorRemoved event is invalid; all we know is that committed local state is missing the validator state the event expects. That could still be malformed history, but it could also be local inconsistency, so treating it as skippable InvalidEvent felt out of scope for this PR.

@diegomrsantos diegomrsantos marked this pull request as draft March 18, 2026 07:58
@diegomrsantos
Member Author

Closing in favor of the smaller replacement stack discussed on #880:
#880 (comment)

The problem is valid, but this PR is too large for the boundary fix and it introduces event-level resume machinery that we do not want to take forward. The replacement path is a smaller series from unstable: database prep first, then the block-scoped transaction fix, then cleanup.
