This is a local issue related to the reproducible sums PR. The purpose is to plan and take notes on what is likely to be a pretty complicated debugging task.
Here's my comment on the issue:
Hi Bill, quick update. On Tuesday night I started an Intel "ERP" test (exact restart with a changed processor count) in CESM. I checked today and it failed with a seg fault after the restart, with the stack trace below. This test runs once for 3 years with 256 tasks and then restarts for 1 year with 128 tasks.
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000001A82C30 cism_reprosum_mod 856 cism_reprosum_mod.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000000E8CC55 cism_parallel_mp_ 9600 parallel_mpi.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000000DD323D cism_parallel_mp_ 6222 parallel_mpi.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000000A4D5F6 glide_diagnostics 385 glide_diagnostics.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000000A47E73 glide_diagnostics 82 glide_diagnostics.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000001A40E17 glad_initialise_m 277 glad_initialise.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 00000000009EAE50 glad_main_mp_glad 275 glad_main.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 00000000009C2178 glc_initmod_mp_gl 368 glc_InitMod.F90
Bill responded with:
@Katetc, thanks for the update. It looks like the failure is at l. 856 of cism_reprosum_mod.F90, which is this call:
call mpi_allreduce (arr_lextremes, arr_gextremes, 2*(nflds+1), &
MPI_INTEGER, MPI_MIN, mpi_comm, ierr)
Backing up a bit, the parallel_global_sum call is here in glide_diagnostics:
tot_area = parallel_global_sum(cell_area, parallel, ice_mask)
I don't see anything problematic about the cell_area field. But this is the first call to parallel_global_sum on restart, and I suspect that any call to parallel_global_sum will generate the same error.
Just to be sure, could you try commenting out the call to glide_write_diag at l. 82 and repeating the test? If you get another seg fault traceable to a different module, that would show that the issue is with the reprosums and not the cell_area field. From there, we can try to figure out what might go wrong on restart in CESM.
Meanwhile, could you point me to your config file? I'll try a similar test with the standalone model and see if it completes.
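For context on the failing allreduce: the extremes reduction at l. 856 is the first stage of a reproducible-sum algorithm — the ranks agree on the largest binary exponent among the summands so that every value can be converted to a fixed-point integer and accumulated exactly, making the result independent of task count and summation order. A minimal single-process Python sketch of that idea (the 200-bit scaling width and the function name are illustrative only, not CISM's actual implementation):

```python
import math
import itertools

def repro_sum(values):
    """Order-independent sum: scale each value to a fixed-point integer
    relative to the largest exponent, sum the integers exactly (Python
    ints are arbitrary precision), then convert back to a float."""
    if not values:
        return 0.0
    # Largest binary exponent among the inputs -- in the parallel code
    # this is the role of the extremes mpi_allreduce over all ranks.
    emax = max(math.frexp(v)[1] for v in values)
    shift = 200  # fixed-point precision; generous so no bits are lost here
    total = sum(int(math.ldexp(v, shift - emax)) for v in values)
    return math.ldexp(float(total), emax - shift)

# Naive float summation depends on order; the fixed-point sum does not.
vals = [1e16, 1.0, -1e16, 1.0]
results = {repro_sum(list(p)) for p in itertools.permutations(vals)}
print(results)  # every ordering yields the same answer
```

In the MPI version, each rank would convert its local values with the globally agreed exponent and the integer accumulators would then be combined with an exact integer allreduce, which is why the extremes exchange has to succeed before any sum can be formed.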
Further testing showed errors with singular matrices. I tried Claude, which suggested a missing halo update, but that didn't help. Here are my ideas for how to fix this: