feature-design: data migrations #503

Open

aristidesstaffieri wants to merge 10 commits into main from feature/data-migrations-design

Conversation

@aristidesstaffieri
Contributor

Closes #468

What

Adds a design document for the data migrations feature.

Why

Document the feature for implementation and gather feedback from stakeholders.

Known limitations

N/A

Issue that this PR addresses

#468

Checklist

PR Structure

  • It is not possible to break this PR down into smaller PRs.
  • This PR does not mix refactoring changes with feature changes.
  • This PR's title starts with the name of the package that is most changed in the PR, or all if the changes are broad or impact many packages.

Thoroughness

  • This PR adds tests for the new functionality or fixes.
  • All updated queries have been tested (refer to this check if the data set returned by the updated query is expected to be the same as the original one).

Release

  • This is not a breaking change.
  • This is ready to be tested in development.
  • The new functionality is gated with a feature flag if this is not ready for production.

@aristidesstaffieri aristidesstaffieri self-assigned this Feb 13, 2026
Copilot AI review requested due to automatic review settings February 13, 2026 19:44

Copilot AI left a comment


Pull request overview

Adds a feature design document describing a proposed “data migrations” system for protocol classification, backfill processing, and API exposure in the wallet-backend.

Changes:

  • Introduces a detailed design doc covering proposed schema, workflows (setup/live/backfill), and cursor tracking.
  • Documents contract classification via WASM inspection and a known_wasms cache approach.
  • Describes planned API surface changes for history enrichment and current-state gating by migration status.


```
-- 'failed' - Migration failed
```

**Migration Cursor Tracking** (via `ingest_store` table):
Contributor


💡 Rather than requiring an end-ledger for protocol-migrate, we could also use ingest_store key/val pairs to store the first ledger at which live ingestion began applying the newly supported processors.

Contributor Author


Yeah, it's true. This was actually part of the previous design, and it seemed like the consensus was for a more implicit approach where the operator controls the ranges. I do think we could have live ingestion write this value per protocol, and the migration could clean up the ingest store before exiting.

If we do this, the backfill migration should check the ingest store to ensure that live ingestion set an end-ledger for the protocol(s) before the migration starts, and fail early if not.
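
To make that failure mode concrete, here is a minimal sketch of the early check in Go; the key name, the IngestStore interface, and the helper are illustrative assumptions rather than existing wallet-backend code:

```go
package migrations

import (
	"context"
	"fmt"
	"strconv"
)

// IngestStore stands in for the real ingest_store accessors; the concrete
// model may expose different methods (hypothetical interface).
type IngestStore interface {
	Get(ctx context.Context, key string) (string, error)
}

// liveStartLedger fails early if live ingestion has not yet recorded the
// first ledger at which it began applying the new protocol's processors.
// The backfill migration can then use that ledger as its end boundary.
func liveStartLedger(ctx context.Context, store IngestStore, protocolID string) (uint32, error) {
	key := fmt.Sprintf("protocol_%s_live_start_ledger", protocolID) // hypothetical key name
	raw, err := store.Get(ctx, key)
	if err != nil {
		return 0, fmt.Errorf("live ingestion has not recorded a start ledger for protocol %s: %w", protocolID, err)
	}
	ledger, err := strconv.ParseUint(raw, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid ledger value %q for key %s: %w", raw, key, err)
	}
	return uint32(ledger), nil
}
```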

Contributor Author


@aditya1702 what's your input here? I believe you were a fan of the implicit approach, but if we use the ingest store pattern we can be more explicit and clean up the row after the migration.

Comment on lines +907 to +908
│ (holds out-of-order results │
│ until ready to commit) │
Contributor


We need to be careful not to get stuck here. Meaning, if the buffer is waiting to receive result N and holds result N+1, we need to be sure N is coming, or we abort the whole process.

Contributor Author


Yeah, this is a good point. The commit stages depend on each other, so the system should have some time threshold before it either re-processes a batch or exits entirely. I can add more detail around this.
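
For illustration, a bounded reorder buffer along these lines could look like the sketch below; the types, channel shape, and timeout behavior are assumptions, not the design's actual worker pool:

```go
package backfill

import (
	"fmt"
	"time"
)

// batchResult is a placeholder for one completed backfill batch; Seq is its
// position in the commit order.
type batchResult struct {
	Seq int
}

// commitInOrder commits batches strictly in sequence, holding out-of-order
// results until their predecessors arrive. If nothing arrives for maxWait
// while a gap is outstanding, it aborts so the caller can re-process the
// missing batch or exit, rather than waiting forever.
func commitInOrder(results <-chan batchResult, commit func(batchResult) error, maxWait time.Duration) error {
	pending := make(map[int]batchResult) // out-of-order results held until ready to commit
	next := 0

	for {
		// Flush everything now contiguous with the last committed batch.
		for {
			r, ok := pending[next]
			if !ok {
				break
			}
			if err := commit(r); err != nil {
				return err
			}
			delete(pending, next)
			next++
		}

		select {
		case r, open := <-results:
			if !open {
				if len(pending) > 0 {
					return fmt.Errorf("results closed while still waiting for batch %d", next)
				}
				return nil
			}
			pending[r.Seq] = r
		case <-time.After(maxWait):
			return fmt.Errorf("timed out waiting for batch %d; aborting for re-processing", next)
		}
	}
}
```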

Comment on lines +975 to +978
type OperationProtocol {
protocol: Protocol!
contractId: String!
}
Contributor


Why have contractId here?

Contributor Author


If the client wants to know which OperationProtocol is the root invocation, it can use this field to match the contract ID that was invoked in the operation. This can be useful for displaying titles or a hierarchy for the call stack with richer details.

…_status to differentiate between not started and in progress migrations
…steps in the ContractData branch, removes Balance branch
…istinction between uploads and upgrades/deployments
…col-setup in the "When Checkpoint Classification Runs" section
…dow, in order to discard state changes outside of retention.

1. Schema changes: enabled field removed, display_name removed, status default is not_started
2. Status values: All updated to new naming scheme (not_started, classification_in_progress, classification_success, backfilling_in_progress, backfilling_success, failed)
3. protocol-setup: Now uses --protocol-id flag (opt-in), updated command examples and workflow
4. Classification section (line 125): Updated to describe ContractCode validation and ContractData lookup
5. Checkpoint population diagram: Removed Balance branch, updated to show WASM hash storage in known_wasms
6. Live ingestion classification diagram: Separated into ContractCode and ContractData paths with RPC fallback
7. Live State Production diagram: Updated classification box to mention ContractCode uploads and ContractData Instance changes
8. Backfill migration: Added retention-aware processing throughout (flow diagram, workflow diagram, parallel processing)
9. Parallel backfill worker pool: Added steps for retention window filtering
… relationship between classification and state production
@aristidesstaffieri
Contributor Author

aristidesstaffieri commented Feb 19, 2026

@JakeUrban @aditya1702 after spending more time thinking about this, I realize there is an under-explored part of this design.

Current state production will be specific to the protocol that produces the state, but it should fall into two categories: additive state changes and non-additive state changes.

The way you track current state from the migration to the live ingestion state production will depend on the type of state produced. For example -

Non Additive State Changes:
These can use a last-write-wins approach because the state change has no relation to the previous state. If we ingest a collectible transfer, we don't need to know who owned it previously in order to correctly represent the current ownership.
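
As a concrete illustration, a last-write-wins current-state update could be a single guarded upsert like the sketch below; the table and column names are hypothetical, and the ledger guard only prevents a stale backfill write from overwriting a newer live write:

```go
package state

import (
	"context"
	"database/sql"
)

// upsertNFTOwner applies a non-additive (last-write-wins) current-state
// update: the new row fully replaces the old one, and the last_ledger guard
// ensures an older write can never clobber a newer one.
// Table and column names are hypothetical.
func upsertNFTOwner(ctx context.Context, db *sql.DB, contractID, tokenID, owner string, ledger uint32) error {
	const q = `
		INSERT INTO nft_owners (contract_id, token_id, owner, last_ledger)
		VALUES ($1, $2, $3, $4)
		ON CONFLICT (contract_id, token_id)
		DO UPDATE SET owner = EXCLUDED.owner, last_ledger = EXCLUDED.last_ledger
		WHERE nft_owners.last_ledger < EXCLUDED.last_ledger`
	_, err := db.ExecContext(ctx, q, contractID, tokenID, owner, ledger)
	return err
}
```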

Additive State Changes:
This is more complicated. Live ingestion cannot write current state without access to the previous state, which may still be in the process of being collected by the backfill migration. If I see a transfer for a SEP-41 token during live ingestion, I need to know what the balance was before the transfer in order to produce the balance at that ledger.

Possible solutions -

**Option A**: Assume that protocols that produce additive changes will expose an interface to produce the dependent state. This works for SEP-41 but requires an RPC call during live ingestion, at least until the migration completes.

**Option B**: State changes table IS the source. If the state changes persisted for history contain enough information to derive current state updates, then:

 1. Live ingestion writes state changes to state_changes table (already doing this)
 2. When migration completes, query state_changes WHERE ledger >= N and apply to current state

**Option C**: Separate pending updates table

 1. Live ingestion writes current state updates to pending_current_state_updates table during migration
 2. When migration completes, apply pending updates in ledger order, then drop/truncate the table

Option B seems like the most complete solution to me, but option A is simpler. It may be hard to assume that all protocols that produce additive state will have an interface to access the dependent state, but this seems true for the few that I've considered (SEP-41, SEP-56). I propose we go with option B.

@JakeUrban
Contributor

@aristidesstaffieri boiling the problem down, is this an accurate description?

Live ingestion may not be able to update current state without knowing the previous state. For example, if live ingestion observes a transfer of 5, it needs to know the previous balances of the sender and receiver in order to know the balances of both after the transfer.

The problem is that live ingestion may not be able to get the previous state until protocol-migrate completes.

That problem statement makes sense to me, but I don't understand the solutions you're proposing.

Option A: I don't think we can assume protocols expose an interface for answering historical state queries like "what was my balance at ledger N".

Option B: I think it's safe to assume that our historical state changes will have enough information to derive current state -- we can design our historical state changes schema with that as a requirement. But how does step 2 work exactly?

Option C: I don't think live ingestion can write current state anywhere, because it won't know it, as explained in the problem statement.

@aristidesstaffieri
Contributor Author

aristidesstaffieri commented Feb 19, 2026


ok after some offline discussion, here is the proposed solution to this problem -

We will remove the --end-ledger flag from the protocol-migrate command, and the migration process will run until it catches up to the tip.

The live ingestion process will now do the following (per protocol):

  • If the current state for the protocol has been written up to the last ledger before this one, produce new state changes and current state.
  • If the current state is not available for the ledger before the current one, only produce state changes.

The live ingestion process will keep an in-memory map of protocol <-> current state as a write-through cache, in order to avoid querying the database when a migration is not in progress.

Proposed steps to change:
Introduce a per-protocol cursor in ingest_store:

protocol_{ID}_current_state_cursor = last ledger for which current state was written

Both processes use this cursor. The key rule:

  • Migration: Processes each ledger, writes current state, atomically advances cursor from N-1 to N using compare-and-swap (CAS). If CAS fails, migration is done.
  • Live ingestion: At ledger N, reads cursor. If cursor >= N-1, produces current state for N and CAS-advances cursor. Otherwise, only produces state changes (history).

The CAS is a conditional update:
UPDATE ingest_store SET value = $new WHERE key = $cursor_name AND value = $expected
This is atomic under PostgreSQL READ COMMITTED. Exactly one process succeeds per ledger. The existing IngestStoreModel.Update() (ingest_store.go:48) is an unconditional upsert and cannot be used -- a new CompareAndSwap method is needed.
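
A rough sketch of what that CompareAndSwap and the per-ledger rule could look like together; the function names, signatures, and transaction wiring here are assumptions for illustration, not existing wallet-backend code:

```go
package ingest

import (
	"context"
	"database/sql"
	"fmt"
)

// compareAndSwap atomically advances a cursor only if it still holds the
// expected value; exactly one caller wins a given transition. This is the
// new conditional method described above -- the existing unconditional
// upsert cannot express it. (Hypothetical signature.)
func compareAndSwap(ctx context.Context, tx *sql.Tx, key, expected, next string) (bool, error) {
	res, err := tx.ExecContext(ctx,
		`UPDATE ingest_store SET value = $1 WHERE key = $2 AND value = $3`,
		next, key, expected)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}

// processLedger sketches the rule at ledger n for either process: state
// changes (history) are always written; current state is written only by
// the CAS winner, and only if the cursor already covers n-1.
// writeStateChanges and writeCurrentState stand in for the real writers.
func processLedger(ctx context.Context, db *sql.DB, protocolID string, n, cursor uint32,
	writeStateChanges, writeCurrentState func(ctx context.Context, tx *sql.Tx, ledger uint32) error,
) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	if err := writeStateChanges(ctx, tx, n); err != nil {
		return err
	}
	if cursor >= n-1 { // the previous ledger's current state exists: try to take over
		key := fmt.Sprintf("protocol_%s_current_state_cursor", protocolID)
		won, err := compareAndSwap(ctx, tx, key, fmt.Sprint(n-1), fmt.Sprint(n))
		if err != nil {
			return err
		}
		if won { // only the CAS winner writes current state for n
			if err := writeCurrentState(ctx, tx, n); err != nil {
				return err
			}
		}
	}
	return tx.Commit()
}
```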

The critical property: migration processes ALL ledgers including the overlap with live ingestion. It doesn't stop at live ingestion's start ledger. Both processes independently process each ledger near the tip, but only one writes current state (the CAS winner).

Timeline:
T=0s: Cursor=10004. Migration processes 10005, CAS 10004->10005. Success.
T=0.5s: Migration processes 10006, CAS 10005->10006. Success.
T=1s: Migration processes 10007, CAS 10006->10007. Success.
T=5s: Live ingestion processes 10008. Reads cursor=10007. 10007 >= 10007 (N-1). YES. Produces current state for 10008, CAS 10007->10008. Success.
T=5.5s: Migration processes 10008, CAS 10007->10008. FAILS. Migration detects handoff. Sets status=backfilling_success. Exits.

Current state was produced for every single ledger. No gap.

What If Live Ingestion Checks Before Migration Commits?

T=0s: Cursor=10004. Migration starts processing 10005.
T=5s: Live ingestion processes 10008. Reads cursor=10004. 10004 < 10007. NO current state.
T=5.5s: Migration commits 10005, CAS success. Cursor=10005.
T=6s: Migration commits 10006. Cursor=10006.
T=6.5s: Migration commits 10007. Cursor=10007.
T=7s: Migration commits 10008, CAS 10007->10008. Success. (Migration wrote current state for 10008; live ingestion didn't.)
T=10s: Live ingestion processes 10009. Reads cursor=10008. 10008 >= 10008. YES. Takes over. CAS 10008->10009. Success.
T=10.5s: Migration processes 10009, CAS 10008->10009. FAILS. Exits.

Still no gap. Migration filled in ledger 10008's current state because it processes everything. The "race" just determines which process writes current state for a given ledger, but one always does.
