Open
Description
After writing a checkpoint to the parallel file system, a later job attempts to restart from it. SCR detects that the checkpoint exists, but the fetch fails when adding the files to the AXL transfer handle, and the run falls back to restarting from the beginning.
SCR v3.0.0: rank 0 on frontier00010: NPROCS=24576
SCR v3.0.0: rank 0 on frontier00010: NNODES=3072
SCR v3.0.0: rank 0 on frontier00010: Stopping all async flush operations
SCR v3.0.0: rank 0 on frontier00010: Attempting fetch: cycle=40
SCR v3.0.0 ERROR: rank 0 on frontier00010: Failed to add files to AXL transfer handle 0 @ /gpfs/scr-v3.0.1/scr/src/scr_util_mpi.c:354
SCR v3.0.0: rank 0 on frontier00010: Deleting dataset 1 `cycle=40' from cache
SCR v3.0.0: rank 0 on frontier00010: One or more processes failed to read its files @ /gpfs/scr-v3.0.1/scr/src/scr_fetch.c:471
SCR v3.0.0: rank 0 on frontier00010: scr_fetch_latest: return code 1, 2.088182 secs
SCR v3.0.0 ERROR: rank 0 on frontier00010: Failed to fetch checkpoint set into cache. Restarting from the beginning @ /gpfs/scr-v3.0.1/scr/src/scr.c:2549
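For triage, the error sites can be pulled out of a log like the one above with a short script. This is only a minimal sketch: the line pattern is inferred from the output in this report, not from any documented SCR log format, and `parse_scr_line` and `error_sites` are hypothetical helper names.

```python
import re

# Assumed line shape, inferred from the log in this report (not an official format):
#   SCR v<version>[ ERROR]: rank <n> on <host>: <message>[ @ <file>:<line>]
LOG_RE = re.compile(
    r"^SCR v(?P<version>\S+)(?P<error> ERROR)?: "
    r"rank (?P<rank>\d+) on (?P<host>\S+): "
    r"(?P<message>.*?)"
    r"(?: @ (?P<file>\S+):(?P<line>\d+))?$"
)


def parse_scr_line(text):
    """Split one SCR log line into fields; return None if it does not match."""
    m = LOG_RE.match(text.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["error"] = rec["error"] is not None   # True only for "ERROR" lines
    rec["rank"] = int(rec["rank"])
    if rec["line"] is not None:
        rec["line"] = int(rec["line"])        # source line number, if present
    return rec


def error_sites(lines):
    """Return (file, line, message) for every line tagged ERROR."""
    sites = []
    for text in lines:
        rec = parse_scr_line(text)
        if rec is not None and rec["error"]:
            sites.append((rec["file"], rec["line"], rec["message"]))
    return sites
```

Passing the lines of the log above to `error_sites` would list the two ERROR entries with their source locations, which shows the AXL failure in `scr_util_mpi.c` as the first error before the fetch is abandoned.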