feature-design: data migrations #503

Open

aristidesstaffieri wants to merge 10 commits into main from feature/data-migrations-design

Conversation

@aristidesstaffieri
Contributor

Closes #468

What

Adds a design document for the data migrations feature.

Why

Document the feature for implementation and gather feedback from stakeholders.

Known limitations

N/A

Issue that this PR addresses

#468

Checklist

PR Structure

  • It is not possible to break this PR down into smaller PRs.
  • This PR does not mix refactoring changes with feature changes.
  • This PR's title starts with the name of the package that is most changed in the PR, or all if the changes are broad or impact many packages.

Thoroughness

  • This PR adds tests for the new functionality or fixes.
  • All updated queries have been tested (refer to this check if the data set returned by the updated query is expected to be the same as the original one).

Release

  • This is not a breaking change.
  • This is ready to be tested in development.
  • The new functionality is gated with a feature flag if this is not ready for production.

@aristidesstaffieri aristidesstaffieri self-assigned this Feb 13, 2026
Copilot AI review requested due to automatic review settings February 13, 2026 19:44

Copilot AI left a comment


Pull request overview

Adds a feature design document describing a proposed “data migrations” system for protocol classification, backfill processing, and API exposure in the wallet-backend.

Changes:

  • Introduces a detailed design doc covering proposed schema, workflows (setup/live/backfill), and cursor tracking.
  • Documents contract classification via WASM inspection and a known_wasms cache approach.
  • Describes planned API surface changes for history enrichment and current-state gating by migration status.


```
-- 'failed' - Migration failed
```

**Migration Cursor Tracking** (via `ingest_store` table):
Contributor


💡 Rather than requiring an end-ledger for protocol-migrate, we could also use ingest_store key/val pairs to store the first ledger at which live ingestion began applying the newly supported processors.

Contributor Author


Yeah, it's true. This was actually part of the previous design, and it seemed like the consensus was for a more implicit approach where the operator controls the ranges. I do think we could have live ingestion write this value per protocol, and the migration could clean up the ingest store before exiting.

If we do this, the backfill migration should check the ingest store to ensure that live ingestion set an end-ledger for the protocol(s) before the migration starts, and fail early if not.
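
To make that failure mode concrete, here is a minimal sketch of the early check in Go; the key name, the IngestStore interface, and the helper are illustrative assumptions rather than existing wallet-backend code:

```go
package migrations

import (
	"context"
	"fmt"
	"strconv"
)

// IngestStore stands in for the real ingest_store accessors; the concrete
// model may expose different methods (hypothetical interface).
type IngestStore interface {
	Get(ctx context.Context, key string) (string, error)
}

// liveStartLedger fails early if live ingestion has not yet recorded the
// first ledger at which it began applying the new protocol's processors.
// The backfill migration can then use that ledger as its end boundary.
func liveStartLedger(ctx context.Context, store IngestStore, protocolID string) (uint32, error) {
	key := fmt.Sprintf("protocol_%s_live_start_ledger", protocolID) // hypothetical key name
	raw, err := store.Get(ctx, key)
	if err != nil {
		return 0, fmt.Errorf("live ingestion has not recorded a start ledger for protocol %s: %w", protocolID, err)
	}
	ledger, err := strconv.ParseUint(raw, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid ledger value %q for key %s: %w", raw, key, err)
	}
	return uint32(ledger), nil
}
```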

Contributor Author


@aditya1702 what's your input here? I believe you were a fan of the implicit approach, but if we use the ingest store pattern we can be more explicit and clean up the row after the migration.

Comment on lines +907 to +908
│ (holds out-of-order results │
│ until ready to commit) │
Contributor


We need to be careful not to get stuck here. Meaning, if the buffer is waiting to receive result N and holds result N+1, we need to be sure N is coming, or we abort the whole process.

Contributor Author


Yeah, this is a good point. The commit stages depend on each other, so the system should have some time threshold before it either re-processes a batch or exits entirely. I can add more detail around this.
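
For illustration, a bounded reorder buffer along these lines could look like the sketch below; the types, channel shape, and timeout behavior are assumptions, not the design's actual worker pool:

```go
package backfill

import (
	"fmt"
	"time"
)

// batchResult is a placeholder for one completed backfill batch; Seq is its
// position in the commit order.
type batchResult struct {
	Seq int
}

// commitInOrder commits batches strictly in sequence, holding out-of-order
// results until their predecessors arrive. If nothing arrives for maxWait
// while a gap is outstanding, it aborts so the caller can re-process the
// missing batch or exit, rather than waiting forever.
func commitInOrder(results <-chan batchResult, commit func(batchResult) error, maxWait time.Duration) error {
	pending := make(map[int]batchResult) // out-of-order results held until ready to commit
	next := 0

	for {
		// Flush everything now contiguous with the last committed batch.
		for {
			r, ok := pending[next]
			if !ok {
				break
			}
			if err := commit(r); err != nil {
				return err
			}
			delete(pending, next)
			next++
		}

		select {
		case r, open := <-results:
			if !open {
				if len(pending) > 0 {
					return fmt.Errorf("results closed while still waiting for batch %d", next)
				}
				return nil
			}
			pending[r.Seq] = r
		case <-time.After(maxWait):
			return fmt.Errorf("timed out waiting for batch %d; aborting for re-processing", next)
		}
	}
}
```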

Comment on lines +975 to +978
type OperationProtocol {
protocol: Protocol!
contractId: String!
}
Contributor


Why have contractId here?

Contributor Author


If the client wants to know which OperationProtocol is the root invocation, it can use this field to match the contract ID that was invoked in the operation. This can be useful for displaying titles or a hierarchy for the call stack with richer details.

…_status to differentiate between not started and in progress migrations
…steps in the ContractData branch, removes Balance branch
…istinction between uploads and upgrades/deployments
…col-setup in the "When Checkpoint Classification Runs" section
…dow, in order to discard state changes outside of retention.

1. Schema changes: enabled field removed, display_name removed, status default is not_started
2. Status values: All updated to new naming scheme (not_started, classification_in_progress, classification_success, backfilling_in_progress, backfilling_success, failed)
3. protocol-setup: Now uses --protocol-id flag (opt-in), updated command examples and workflow
4. Classification section (line 125): Updated to describe ContractCode validation and ContractData lookup
5. Checkpoint population diagram: Removed Balance branch, updated to show WASM hash storage in known_wasms
6. Live ingestion classification diagram: Separated into ContractCode and ContractData paths with RPC fallback
7. Live State Production diagram: Updated classification box to mention ContractCode uploads and ContractData Instance changes
8. Backfill migration: Added retention-aware processing throughout (flow diagram, workflow diagram, parallel processing)
9. Parallel backfill worker pool: Added steps for retention window filtering
… relationship between classification and state production
@aristidesstaffieri
Contributor Author

aristidesstaffieri commented Feb 19, 2026

@JakeUrban @aditya1702 after spending more time thinking about this, I realize there is an under-explored part of this design.

Current state production will be specific to the protocol that produces the state, but it should fall into two categories: additive state changes and non-additive state changes.

The way you track current state from the migration to the live ingestion state production will depend on the type of state produced. For example -

Non Additive State Changes:
These can use a last-write-wins approach because the state change has no relation to the previous state. If we ingest a collectible transfer, we don't need to know who owned it previously in order to correctly represent the current ownership.
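
As a concrete illustration, a last-write-wins current-state update could be a single guarded upsert like the sketch below; the table and column names are hypothetical, and the ledger guard only prevents a stale backfill write from overwriting a newer live write:

```go
package state

import (
	"context"
	"database/sql"
)

// upsertNFTOwner applies a non-additive (last-write-wins) current-state
// update: the new row fully replaces the old one, and the last_ledger guard
// ensures an older write can never clobber a newer one.
// Table and column names are hypothetical.
func upsertNFTOwner(ctx context.Context, db *sql.DB, contractID, tokenID, owner string, ledger uint32) error {
	const q = `
		INSERT INTO nft_owners (contract_id, token_id, owner, last_ledger)
		VALUES ($1, $2, $3, $4)
		ON CONFLICT (contract_id, token_id)
		DO UPDATE SET owner = EXCLUDED.owner, last_ledger = EXCLUDED.last_ledger
		WHERE nft_owners.last_ledger < EXCLUDED.last_ledger`
	_, err := db.ExecContext(ctx, q, contractID, tokenID, owner, ledger)
	return err
}
```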

Additive State Changes:
This is more complicated. Live ingestion cannot write current state without access to the previous state, which may still be in the process of being collected by the backfill migration. If I see a transfer for a SEP-41 token during live ingestion, I need to know what the balance was before the transfer in order to produce the balance at that ledger.

Possible solutions -

**Option A**: Assume that protocols that produce additive changes will expose an interface to produce the dependent state. This works for SEP-41 but requires an RPC call during live ingestion, at least until the migration completes.

**Option B**: State changes table IS the source. If the state changes persisted for history contain enough information to derive current state updates, then:

 1. Live ingestion writes state changes to state_changes table (already doing this)
 2. When migration completes, query state_changes WHERE ledger >= N and apply to current state

**Option C**: Separate pending updates table

 1. Live ingestion writes current state updates to pending_current_state_updates table during migration
 2. When migration completes, apply pending updates in ledger order, then drop/truncate the table

Option B seems like the most complete solution to me, but option A is simpler. It may be hard to assume that all protocols that produce additive state will have an interface to access the dependent state, but this seems true for the few that I've considered (SEP-41, SEP-56). I propose we go with option B.

@JakeUrban
Contributor

@aristidesstaffieri boiling the problem down, is this an accurate description?

Live ingestion may not be able to update current state without knowing the previous state. For example, if live ingestion observes a transfer of 5, it needs to know the previous balances of the sender and receiver in order to know the balances of both after the transfer.

The problem is that live ingestion may not be able to get the previous state until protocol-migrate completes.

That problem statement makes sense to me, but I don't understand the solutions you're proposing.

Option A: I don't think we can assume protocols expose an interface for answering historical state queries like "what was my balance at ledger N".

Option B: I think it's safe to assume that our historical state changes will have enough information to derive current state -- we can design our historical state changes schema with that as a requirement. But how does step 2 work exactly?

Option C: I don't think live ingestion can write current state anywhere, because it won't know it, as explained in the problem statement.

@aristidesstaffieri
Contributor Author

aristidesstaffieri commented Feb 19, 2026


ok after some offline discussion, here is the proposed solution to this problem -

We will remove the --end-ledger flag from the protocol-migrate command, and the migration process will run until it catches up to the tip.

The live ingestion process will now do the following (per protocol):

  • If the current state for the protocol has been written up to the last ledger before this one, produce new state changes and current state.
  • If the current state is not available for the ledger before the current one, only produce state changes.

The live ingestion process will keep an in-memory map of protocol <-> current state as a write-through cache, in order to avoid querying the database when a migration is not in progress.

Proposed steps to change:
Introduce a per-protocol cursor in ingest_store:

protocol_{ID}_current_state_cursor = last ledger for which current state was written

Both processes use this cursor. The key rule:

  • Migration: Processes each ledger, writes current state, atomically advances cursor from N-1 to N using compare-and-swap (CAS). If CAS fails, migration is done.
  • Live ingestion: At ledger N, reads cursor. If cursor >= N-1, produces current state for N and CAS-advances cursor. Otherwise, only produces state changes (history).

The CAS is a conditional update:
UPDATE ingest_store SET value = $new WHERE key = $cursor_name AND value = $expected
This is atomic under PostgreSQL READ COMMITTED. Exactly one process succeeds per ledger. The existing IngestStoreModel.Update() (ingest_store.go:48) is an unconditional upsert and cannot be used -- a new CompareAndSwap method is needed.
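
A rough sketch of what that CompareAndSwap and the per-ledger rule could look like together; the function names, signatures, and transaction wiring here are assumptions for illustration, not existing wallet-backend code:

```go
package ingest

import (
	"context"
	"database/sql"
	"fmt"
)

// compareAndSwap atomically advances a cursor only if it still holds the
// expected value; exactly one caller wins a given transition. This is the
// new conditional method described above -- the existing unconditional
// upsert cannot express it. (Hypothetical signature.)
func compareAndSwap(ctx context.Context, tx *sql.Tx, key, expected, next string) (bool, error) {
	res, err := tx.ExecContext(ctx,
		`UPDATE ingest_store SET value = $1 WHERE key = $2 AND value = $3`,
		next, key, expected)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}

// processLedger sketches the rule at ledger n for either process: state
// changes (history) are always written; current state is written only by
// the CAS winner, and only if the cursor already covers n-1.
// writeStateChanges and writeCurrentState stand in for the real writers.
func processLedger(ctx context.Context, db *sql.DB, protocolID string, n, cursor uint32,
	writeStateChanges, writeCurrentState func(ctx context.Context, tx *sql.Tx, ledger uint32) error,
) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	if err := writeStateChanges(ctx, tx, n); err != nil {
		return err
	}
	if cursor >= n-1 { // the previous ledger's current state exists: try to take over
		key := fmt.Sprintf("protocol_%s_current_state_cursor", protocolID)
		won, err := compareAndSwap(ctx, tx, key, fmt.Sprint(n-1), fmt.Sprint(n))
		if err != nil {
			return err
		}
		if won { // only the CAS winner writes current state for n
			if err := writeCurrentState(ctx, tx, n); err != nil {
				return err
			}
		}
	}
	return tx.Commit()
}
```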

The critical property: migration processes ALL ledgers including the overlap with live ingestion. It doesn't stop at live ingestion's start ledger. Both processes independently process each ledger near the tip, but only one writes current state (the CAS winner).

Timeline:
T=0s: Cursor=10004. Migration processes 10005, CAS 10004->10005. Success.
T=0.5s: Migration processes 10006, CAS 10005->10006. Success.
T=1s: Migration processes 10007, CAS 10006->10007. Success.
T=5s: Live ingestion processes 10008. Reads cursor=10007. 10007 >= 10007 (N-1). YES. Produces current state for 10008, CAS 10007->10008. Success.
T=5.5s: Migration processes 10008, CAS 10007->10008. FAILS. Migration detects handoff. Sets status=backfilling_success. Exits.

Current state was produced for every single ledger. No gap.

What If Live Ingestion Checks Before Migration Commits?

T=0s: Cursor=10004. Migration starts processing 10005.
T=5s: Live ingestion processes 10008. Reads cursor=10004. 10004 < 10007. NO current state.
T=5.5s: Migration commits 10005, CAS success. Cursor=10005.
T=6s: Migration commits 10006. Cursor=10006.
T=6.5s: Migration commits 10007. Cursor=10007.
T=7s: Migration commits 10008, CAS 10007->10008. Success. (Migration wrote current state for 10008; live ingestion didn't.)
T=10s: Live ingestion processes 10009. Reads cursor=10008. 10008 >= 10008. YES. Takes over. CAS 10008->10009. Success.
T=10.5s: Migration processes 10009, CAS 10008->10009. FAILS. Exits.

Still no gap. Migration filled in ledger 10008's current state because it processes everything. The "race" just determines which process writes current state for a given ledger, but one always does.
