Add trigger for periodic topology snapshots #3188
base: main
Conversation
Commits signed off by Julien Tinguely <julien.tinguely@digitalasset.com> (including a [ci] refactor); force-pushed from e7fee87 to 7a1cb7e.
martinflorian-da left a comment:
Thanks for your patience! It's very nice work, thank you!
I did collect a fair number of comments, though; let's do another iteration once you've had a chance to look at them.
```typescript
export async function topologySnapshotConfig(prefix: string): Promise<BackupConfig> {
  const bucketSpec = await bootstrapDataBucketSpec(
    config.optionalEnv('TOPOLOGY_SNAPSHOT_PROJECT') || 'da-cn-devnet',
    config.optionalEnv('TOPOLOGY_SNAPSHOT_BUCKET') || 'da-cn-topology-snapshots'
```
Let's add these to the config schema! We want to avoid adding new env vars whenever it's reasonable...
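A minimal sketch of what moving these settings into the config schema could look like; the field names and the resolver function here are illustrative, not the actual Splice schema:

```typescript
// Hypothetical config-schema shape; names are illustrative only.
interface TopologySnapshotSchema {
  project?: string;
  bucket?: string;
}

// Resolve the bucket spec from the parsed config instead of env vars,
// falling back to the same defaults the current code uses.
function resolveSnapshotBucket(cfg: TopologySnapshotSchema): { project: string; bucket: string } {
  return {
    project: cfg.project ?? 'da-cn-devnet',
    bucket: cfg.bucket ?? 'da-cn-topology-snapshots',
  };
}
```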
```typescript
    config.optionalEnv('TOPOLOGY_SNAPSHOT_BUCKET') || 'da-cn-topology-snapshots'
  );
  return { backupInterval: '2h', location: { bucket: bucketSpec, prefix: prefix } };
```
Let's also make the backupInterval configurable in the config.yaml schema. Seems cheap enough to do, and if we need it we'll already have it.
Maybe also add a comment to that config option that it will produce a dump at most once per day, even if you set the interval to <24h
```typescript
  onboardingPollingInterval?: string;
  cometBftGovernanceKey?: CnInput<SvCometBftGovernanceKey>;
  initialRound?: string;
  periodicTopologySnapshot?: boolean;
```
Doesn't the existence of a periodicTopologySnapshotConfig already imply true/false here? Why the extra variable?
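The presence-implies-enabled idea could be sketched like this (hypothetical names, not the actual config type):

```typescript
// Hypothetical: the optional config object itself encodes on/off,
// so no separate boolean flag is needed.
interface SvAppConfigSketch {
  periodicTopologySnapshotConfig?: { backupInterval: string };
}

function topologySnapshotEnabled(cfg: SvAppConfigSketch): boolean {
  return cfg.periodicTopologySnapshotConfig !== undefined;
}
```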
It looks right to me but I would probably have added a small Helm test for this. (Don't mind it if you think it's not necessary.)
```yaml
limits:
  memory: '18Gi'
svApp:
  periodicTopologySnapshot: false
```
Suggestion: How about using this here for all defaults that you currently have in pulumi?
And making something like enabled one of the parameters of (what would have to become) something like periodicTopologySnapshotConfig.
Actually I think we can make this a bit more future-proof by having a config path like svApp.periodicSnapshots.topology. This way it's clear how to add ACS later on, for example.
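The suggested nesting might look like this (purely illustrative; the `'2h'` default mirrors what is currently hard-coded in pulumi):

```typescript
// Hypothetical nested config path: svApp.periodicSnapshots.topology.
// Adding ACS snapshots later would just mean a sibling `acs` section.
interface PeriodicSnapshotsSketch {
  topology?: { enabled?: boolean; backupInterval?: string };
}

function topologyBackupInterval(cfg: PeriodicSnapshotsSketch): string {
  // Default mirrors the '2h' currently hard-coded in pulumi.
  return cfg.topology?.backupInterval ?? '2h';
}
```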
```scala
  logEntries,
)(logEntry =>
  logEntry.message should (include(s"Took a new topology snapshot for $utcDate") or
    include("Today's topology snapshot already exists."))
```
Hm so for identity dumps we seem to be fine with taking multiple per day - why limit ourselves on topology snapshots?
OK I found the explanation for this now on the issue...
Let's add a comment about our reasoning here though (to the trigger code).
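For reference, the once-per-day behaviour seen in the log assertions boils down to keying snapshots by UTC date; a sketch of that guard (not the actual trigger code, which is Scala):

```typescript
// Snapshots are keyed by UTC date, so re-running the trigger within the
// same day finds an existing key and skips the dump (illustrative sketch).
function snapshotPathFor(prefix: string, now: Date): string {
  const utcDate = now.toISOString().slice(0, 10); // e.g. '2031-01-01'
  return `${prefix}/${utcDate}`;
}

function shouldTakeSnapshot(existingPaths: Set<string>, prefix: string, now: Date): boolean {
  return !existingPaths.has(snapshotPathFor(prefix, now));
}
```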
```scala
BackupDump.write(
  config.location,
  Paths.get(s"$folderName/transactions-summary"),
  summary.toString,
```
I wonder: Shall we also add the sequencer ID here? Might be informative, and that file is only for human consumption anyway.
Maybe also the specific timestamp of the dump? The bucket only has day-level granularity; it might be nice to quickly see the exact time.
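Both suggestions together could look something like this hypothetical helper (names are illustrative; the real code is Scala):

```typescript
// Hypothetical helper: prepend sequencer ID and the exact dump timestamp to
// the human-readable summary, since the bucket path only has day granularity.
function summaryWithMetadata(sequencerId: string, takenAt: Date, summary: string): string {
  return [
    `sequencerId: ${sequencerId}`,
    `takenAt: ${takenAt.toISOString()}`,
    summary,
  ].join('\n');
}
```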
```scala
onboardingState <- sequencerAdminConnection.getOnboardingState(sequencerId)
authorizedStore <- sequencerAdminConnection.exportAuthorizedStoreSnapshot(sequencerId.uid)
// list a summary of the transactions state at the time of the snapshot to validate further imports
summary <- sequencerAdminConnection.getSummaryOfTransactions(
```
Don't you have a race here? You want the summary of transactions at exactly the time the snapshot was taken.
You probably also want to pass an appropriate TimeQuery to the authorized store snapshot, if possible.
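The race-free shape is to fix one point-in-time first and use it for every read; a sketch (the `api` shape is hypothetical, standing in for the sequencer admin connection):

```typescript
// Sketch of avoiding the race: fix one timestamp up front and pass it to
// both reads, so the summary describes exactly the exported snapshot.
// The `api` parameter is a hypothetical stand-in for the admin connection.
async function consistentExport(api: {
  genesisState(at: Date): Promise<string>;
  summaryOfTransactions(at: Date): Promise<string>;
}): Promise<{ state: string; summary: string }> {
  const at = new Date(); // single point-in-time for both queries
  const state = await api.genesisState(at);
  const summary = await api.summaryOfTransactions(at);
  return { state, summary };
}
```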
Copy-pasting what we did last time on the console:

```scala
import com.digitalasset.canton.data.CantonTimestamp
val now = CantonTimestamp.now()
val genesisState = sequencer.topology.transactions.genesis_stateV2(None, Some(now))
sequencer.topology.transactions.list(store = sequencer.synchronizer_id, proposals = false, timeQuery = TimeQuery.Snapshot(now)).result.groupMapReduce(_.mapping.code)(_ => 1)(_ + _)
genesisState.writeTo(new java.io.FileOutputStream("/tmp/state-export/genesis-state"))
```
```scala
)(implicit traceContext: TraceContext): Future[TaskSuccess] =
  for {
    sequencerId <- sequencerAdminConnection.getSequencerId
    onboardingState <- sequencerAdminConnection.getOnboardingState(sequencerId)
```
Huh don't we want getGenesisState here?
This mistake makes me think that you might want to test a topology import from the export that you're doing here; one time manually should be more than enough.
OK I read about it now on the issue...
So do you understand how to test an import from this? Would we also need keys from the sequencer (probably not?)?
And why do we actually need the authorized store? (I'd avoid dumping something we don't know anyone will use.)
Oh, and aren't you missing this part of Moritz's design: "stream the snapshot so we never hold the full one in memory"?
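The memory-friendly pattern amounts to iterating the snapshot in bounded chunks rather than materializing one big string; a minimal, purely illustrative sketch:

```typescript
// Minimal chunking sketch: yield the snapshot in fixed-size chunks so a
// streaming writer never holds the full byte string in memory.
// (Illustrative only; the real trigger code is Scala.)
function* chunked(data: Uint8Array, chunkSize: number): Generator<Uint8Array> {
  for (let i = 0; i < data.length; i += chunkSize) {
    // subarray is a view, so no chunk-sized copies are made here either
    yield data.subarray(i, Math.min(i + chunkSize, data.length));
  }
}
```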
```scala
BackupDump.write(
  config.location,
  Paths.get(s"$folderName/genesis-state"),
  onboardingState.toStringUtf8,
```
Can't we just dump raw bytes, to make it smaller and easier to use/import?
fixes #1133
test bucket: https://console.cloud.google.com/storage/browser/da-splice-topology-snapshots;tab=objects?forceOnBucketsSortingFiltering=true&project=da-cn-splice&prefix=&forceOnObjectsSortingFiltering=false
The plan would be to activate it on sv1 in our production clusters (will be available from 0.5.2).
Remaining:
Pull Request Checklist
Cluster Testing
- Comment /cluster_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.
- Comment /hdm_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.
PR Guidelines
- Mention fixed issues with "Fixes #n", and mention issues worked on using #n.
Merge Guidelines