
Conversation


@julientinguely-da julientinguely-da commented Nov 12, 2025

fixes #1133

test bucket: https://console.cloud.google.com/storage/browser/da-splice-topology-snapshots;tab=objects?forceOnBucketsSortingFiltering=true&project=da-cn-splice&prefix=&forceOnObjectsSortingFiltering=false

The plan would be to activate it on sv1 in our production clusters (it will be available from 0.5.2).

Remaining:

  • refactor code
  • add test
  • pulumi changes

Pull Request Checklist

Cluster Testing

  • If a cluster test is required, comment /cluster_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.
  • If a hard-migration test is required (from the latest release), comment /hdm_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.

PR Guidelines

  • Include any change that might be observable by our partners or affect their deployment in the release notes.
  • Specify fixed issues with Fixes #n, and mention issues worked on using #n
  • Include a screenshot for frontend-related PRs - see README or use your favorite screenshot tool

Merge Guidelines

  • Make the git commit message look sensible when squash-merging on GitHub (most likely: just copy your PR description).

Signed-off-by: Julien Tinguely <julien.tinguely@digitalasset.com>
@julientinguely-da julientinguely-da force-pushed the julien/1133-periodic-snapshots-prod-cluster branch from e7fee87 to 7a1cb7e on November 14, 2025 at 14:31
@julientinguely-da julientinguely-da marked this pull request as ready for review on November 14, 2025 at 17:06

@martinflorian-da martinflorian-da left a comment


Thanks for your patience! It's very nice work, thank you!

I did collect a fair number of comments though... Let's do another iteration once you've had a chance to look at them.

export async function topologySnapshotConfig(prefix: string): Promise<BackupConfig> {
  const bucketSpec = await bootstrapDataBucketSpec(
    config.optionalEnv('TOPOLOGY_SNAPSHOT_PROJECT') || 'da-cn-devnet',
    config.optionalEnv('TOPOLOGY_SNAPSHOT_BUCKET') || 'da-cn-topology-snapshots'

Let's add these to the config schema! We want to avoid adding new env vars whenever it's reasonable...

    config.optionalEnv('TOPOLOGY_SNAPSHOT_BUCKET') || 'da-cn-topology-snapshots'
  );

  return { backupInterval: '2h', location: { bucket: bucketSpec, prefix: prefix } };

Let's also make the backupInterval configurable in the config.yaml schema. Seems cheap enough to do, and if we need it we'll already have it.


Maybe also add a comment to that config option noting that it will produce a dump at most once per day, even if you set the interval to less than 24h.

  onboardingPollingInterval?: string;
  cometBftGovernanceKey?: CnInput<SvCometBftGovernanceKey>;
  initialRound?: string;
  periodicTopologySnapshot?: boolean;

Doesn't the existence of a periodicTopologySnapshotConfig already imply true/false here? Why the extra variable?


It looks right to me but I would probably have added a small Helm test for this. (Feel free to skip it if you think it's not necessary.)

limits:
  memory: '18Gi'
svApp:
  periodicTopologySnapshot: false

Suggestion: How about using this here for all defaults that you currently have in pulumi?

And making something like enabled one of the parameters of (what would have to become) something like periodicTopologySnapshotConfig.

Actually I think we can make this a bit more future-proof by having a config path like svApp.periodicSnapshots.topology. This way it's clear how to add ACS later on, for example.

  logEntries,
)(logEntry =>
  logEntry.message should (include(s"Took a new topology snapshot for $utcDate") or
    include("Today's topology snapshot already exists."))

Hm, so for identity dumps we seem to be fine with taking multiple per day - why limit ourselves for topology snapshots?


OK I found the explanation for this now on the issue...

Let's add a comment about our reasoning here though (to the trigger code).
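Something along these lines maybe (wording is only a sketch; the actual rationale is the one from the issue):

// Note: even if the configured backupInterval is shorter than 24h, this trigger
// takes at most one topology snapshot per UTC day and otherwise only logs
// "Today's topology snapshot already exists." See #1133 for the reasoning.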

BackupDump.write(
  config.location,
  Paths.get(s"$folderName/transactions-summary"),
  summary.toString,

I wonder: Shall we also add the sequencer ID here? Might be informative, and that file is only for human consumption anyway.


Maybe also the specific timestamp of the dump? The bucket has a granularity of "days"; it would be nice to be able to quickly see the exact time.
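For illustration, something along these lines (purely a sketch; the summary file is free-form text anyway, the field names below are made up, and CantonTimestamp is the one from the console snippet further down):

// Sketch only: prepend the sequencer ID and the exact snapshot time to the
// human-readable summary before writing it out.
val summaryText =
  s"""sequencerId: $sequencerId
     |snapshotTakenAt: ${CantonTimestamp.now()}
     |$summary
     |""".stripMargin
BackupDump.write(
  config.location,
  Paths.get(s"$folderName/transactions-summary"),
  summaryText,
)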

      onboardingState <- sequencerAdminConnection.getOnboardingState(sequencerId)
      authorizedStore <- sequencerAdminConnection.exportAuthorizedStoreSnapshot(sequencerId.uid)
      // list a summary of the transactions state at the time of the snapshot to validate further imports
      summary <- sequencerAdminConnection.getSummaryOfTransactions(

Don't you have a race here? You want the summary of transactions at exactly the time when the snapshots were taken.

Probably also want to add an appropriate TimeQuery to the authorized store snapshot, if possible.
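Roughly what I mean, mirroring the console snippet in the next comment (everything here is just a sketch; `asOf` and `getGenesisState` are hypothetical stand-ins for whatever the admin connection actually exposes):

// Sketch only: pin the genesis state and the summary to a single timestamp so
// they describe the same topology state.
val snapshotTime = CantonTimestamp.now()
for {
  sequencerId <- sequencerAdminConnection.getSequencerId
  genesisState <- sequencerAdminConnection.getGenesisState(sequencerId, asOf = snapshotTime)
  summary <- sequencerAdminConnection.getSummaryOfTransactions(asOf = snapshotTime)
} yield (genesisState, summary, snapshotTime)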


Copy-pasting what we did last time on the console:

import com.digitalasset.canton.data.CantonTimestamp
val now = CantonTimestamp.now()
val genesisState = sequencer.topology.transactions.genesis_stateV2(None, Some(now))
sequencer.topology.transactions.list(store = sequencer.synchronizer_id, proposals = false, timeQuery = TimeQuery.Snapshot(now)).result.groupMapReduce(_.mapping.code)(_ => 1)(_ + _)
genesisState.writeTo(new java.io.FileOutputStream("/tmp/state-export/genesis-state"))

  )(implicit traceContext: TraceContext): Future[TaskSuccess] =
    for {
      sequencerId <- sequencerAdminConnection.getSequencerId
      onboardingState <- sequencerAdminConnection.getOnboardingState(sequencerId)

Huh don't we want getGenesisState here?

This mistake makes me think that you might want to test a topology import from the export that you're doing here; one time manually should be more than enough.


@martinflorian-da martinflorian-da Nov 18, 2025


OK I read about it now on the issue...

So do you understand how to test an import from this? Would we also need keys from the sequencer (probably not?)?

And why do we actually need the authorized store? (I'd avoid dumping something we don't know anyone will use.)


Oh and aren't you missing this part of Moritz's design?

"stream the snapshot so we never hold the full one in memory."

BackupDump.write(
  config.location,
  Paths.get(s"$folderName/genesis-state"),
  onboardingState.toStringUtf8,

We can't just dump the bytes to make it smaller and easier to use/import?
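For example (sketch only; `writeStream` is a hypothetical stream-based variant of BackupDump, and this assumes onboardingState is a protobuf ByteString, which its toStringUtf8 suggests):

// Sketch: write the raw bytes through an OutputStream instead of materializing
// a UTF-8 String first; ByteString.writeTo(OutputStream) streams the content.
BackupDump.writeStream(config.location, Paths.get(s"$folderName/genesis-state")) { out =>
  onboardingState.writeTo(out)
}

Something like that would also cover the "stream the snapshot" point from the design above.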

