In a recent mission, parallel catchup failed with the following message:
2025-12-17T09:12:18.582 GAJSL [default INFO] Performing maintenance
2025-12-17T09:12:18.582 GAJSL [History INFO] Trimming history <= ledger 28109119
2025-12-17T09:12:34.801 GAJSL [Process WARNING] process 7596 exited 1: aws s3 cp --region us-east-1 s3://ssc-history-archive/prd/core-live/core_live_001/results/01/ac/e9/results-01ace9bf.xdr.gz /data/buckets/tmp/catchup-c0109315de5346b9/results/01/ac/e9/results-01ace9bf.xdr.gz.tmp
2025-12-17T09:12:34.801 GAJSL [History WARNING] Could not download file: archive core_live_001 maybe missing file results/01/ac/e9/results-01ace9bf.xdr.gz
2025-12-17T09:12:34.806 GAJSL [History INFO] Catching up to ledger 28113151: Download & apply checkpoints: num checkpoints left to apply:255 (0% done)
2025-12-17T09:12:34.806 GAJSL [History INFO] Catching up to ledger 28113151: Failed: catchup-seq
2025-12-17T09:12:34.806 GAJSL [History WARNING] Catchup failed
Full log attached here (the mission artifacts will be gone soon): stellar-core-2025-12-17_08-48-07.log
There is no other failure message, and the process failed immediately after the S3 failure, without retrying.
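As a first check, it may help to confirm whether the object is genuinely absent from the mirror or whether the failure was transient. A minimal sketch, assuming boto3 is installed and credentials with read access to the mirror bucket are configured (bucket and key taken from the log above):

```python
import boto3
from botocore.exceptions import ClientError

BUCKET = "ssc-history-archive"
KEY = "prd/core-live/core_live_001/results/01/ac/e9/results-01ace9bf.xdr.gz"

s3 = boto3.client("s3", region_name="us-east-1")
try:
    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    print(f"Object exists: {head['ContentLength']} bytes, ETag {head['ETag']}")
except ClientError as e:
    code = e.response["Error"]["Code"]
    if code in ("404", "NoSuchKey", "NotFound"):
        print("Object is genuinely missing from the archive")
    else:
        # Any other error (403, throttling, networking) points to a transient
        # or permission problem rather than a missing file.
        print(f"Request failed with {code}: {e}")
```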
The simplest command equivalent to the mission:
dotnet run --project src/App/App.fsproj -- mission HistoryPubnetParallelCatchupV2 --destination ./logs --image=docker-registry.services.stellar-ops.com/dev/stellar-core:25.0.1-2925.0d5731bae.noble-vnext-buildtests --old-image=docker-registry.services.stellar-ops.com/dev/stellar-core:24.1.0-2861.5a7035d49.focal-buildtests --netdelay-image=docker-registry.services.stellar-ops.com/dev/sdf-netdelay:latest --postgres-image=docker-registry.services.stellar-ops.com/dev/postgres:9.5.22 --nginx-image=docker-registry.services.stellar-ops.com/dev/nginx:latest --prometheus-exporter-image=docker-registry.services.stellar-ops.com/dev/stellar-core-prometheus-exporter:latest --ingress-internal-domain=stellar-supercluster.kube001-ssc-eks.services.stellar-ops.com --job-monitor-external-host=ssc-job-monitor-eks.services.stellar-ops.com --pubnet-parallel-catchup-starting-ledger=0 --require-node-labels-pc-v2=purpose:catchup --tolerate-node-taints-pc-v2=catchup:NoSchedule --require-node-labels=purpose:largetests --tolerate-node-taints=largetests --asan-options quarantine_size_mb=1:malloc_context_size=5:alloc_dealloc_mismatch=0 --catchup-skip-known-results-for-testing=true --pubnet-parallel-catchup-ledgers-per-job=1280 --service-account-annotations-pc-v2=eks.amazonaws.com/role-arn:arn:aws:iam::746476062914:role/kube001-ssc-eks-supercluster --s3-history-mirror-override-pc-v2=ssc-history-archive/prd/core-live --pubnet-parallel-catchup-num-workers=2
Stellar Core should retry failed downloads with exponential backoff. When I try to reproduce this, I can observe the retries working; here are the logs from my local run (with an auth error, but the retries happen).
We need to investigate why the retries did not happen in this case.
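For illustration, here is a minimal sketch (in Python, not stellar-core's actual implementation) of the expected behavior: wrapping the `aws s3 cp` call in a retry loop with exponential backoff instead of failing catchup on the first non-zero exit. The attempt count and delays are assumptions.

```python
import subprocess
import time

def download_with_backoff(src: str, dst: str, max_attempts: int = 5) -> None:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(
            ["aws", "s3", "cp", "--region", "us-east-1", src, dst],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return
        if attempt == max_attempts:
            raise RuntimeError(
                f"download failed after {max_attempts} attempts: {result.stderr.strip()}"
            )
        # Exponential backoff: 1s, 2s, 4s, ... between attempts.
        time.sleep(delay)
        delay *= 2

download_with_backoff(
    "s3://ssc-history-archive/prd/core-live/core_live_001/results/01/ac/e9/results-01ace9bf.xdr.gz",
    "/tmp/results-01ace9bf.xdr.gz",
)
```

In the failing mission above, no such retry is visible: the single process exit at 09:12:34 is immediately followed by the catchup failure.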