Debugging CISM Reprosums #36

@Katetc

Description

This is a local issue related to the reproducible sums PR. The purpose is to plan and take notes on what is likely to be a pretty complicated debugging task.

Here's my comment on the issue:

Hi Bill, quick update. On Tuesday night I started an Intel "ERP" test (exact restart with a changed processor count) in CESM. I checked today and it failed with a seg fault after the restart, with the stack trace below. This test runs once for 3 years with 256 tasks and then restarts for 1 year with 128 tasks.

dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000001A82C30 cism_reprosum_mod 856 cism_reprosum_mod.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000000E8CC55 cism_parallel_mp_ 9600 parallel_mpi.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000000DD323D cism_parallel_mp_ 6222 parallel_mpi.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000000A4D5F6 glide_diagnostics 385 glide_diagnostics.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000000A47E73 glide_diagnostics 82 glide_diagnostics.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 0000000001A40E17 glad_initialise_m 277 glad_initialise.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 00000000009EAE50 glad_main_mp_glad 275 glad_main.F90
dec0385.hsn.de.hpc.ucar.edu 75: cesm.exe 00000000009C2178 glc_initmod_mp_gl 368 glc_InitMod.F90

Bill responded with:

@Katetc, thanks for the update. It looks like the failure is at l. 856 of cism_reprosum_mod.F90, which is this call:

    call mpi_allreduce (arr_lextremes, arr_gextremes, 2*(nflds+1), &
                        MPI_INTEGER, MPI_MIN, mpi_comm, ierr)

Backing up a bit, the parallel_global_sum call is here in glide_diagnostics:

  tot_area = parallel_global_sum(cell_area, parallel, ice_mask)

I don't see anything problematic about the cell_area field. But this is the first call to parallel_global_sum on restart, and I suspect that any call to parallel_global_sum will generate the same error.

Just to be sure, could you try commenting out the call to glide_write_diag at l. 82 and repeating the test? If you get another seg fault traceable to a different module, that would show that the issue is with the reprosums and not the cell_area field. From there, we can try to figure out what might go wrong on restart in CESM.

Meanwhile, could you point me to your config file? I'll try a similar test with the standalone model and see if it completes.

Further testing showed errors with singular matrices. I tried Claude, which suggested a missing halo update, but that didn't help. Here are my ideas for how to fix this:

  • Notes on the exact error for each compiler/test case
  • Notes on what Claude suggested
  • Send Bill the requested config file and case setup
  • Get DDT set up
  • Run restart in debugger
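
A possible cheap check before the debugger run, for the seg fault at the mpi_allreduce call quoted above: confirm that both integer buffers really hold the 2*(nflds+1) elements passed as the count. The sketch below is standalone and hypothetical. The module name allreduce_check_mod, the function buffers_ok, and the array shape (0:nflds, 2) are assumptions for illustration and are not taken from cism_reprosum_mod.F90. It does not test whether mpi_comm is valid after the restart, which may well be the real problem.

```fortran
! Hypothetical standalone sketch; not code from cism_reprosum_mod.F90.
! It only checks that two integer buffers are large enough for the
! element count that would be passed to mpi_allreduce.
module allreduce_check_mod
  implicit none
contains

  ! Returns .true. when both buffers hold at least nelem elements.
  logical function buffers_ok(sendbuf, recvbuf, nelem)
    integer, intent(in) :: sendbuf(:,:)   ! stands in for arr_lextremes
    integer, intent(in) :: recvbuf(:,:)   ! stands in for arr_gextremes
    integer, intent(in) :: nelem          ! e.g. 2*(nflds+1)

    buffers_ok = (size(sendbuf) >= nelem) .and. (size(recvbuf) >= nelem)
  end function buffers_ok

end module allreduce_check_mod

program demo
  use allreduce_check_mod
  implicit none
  integer, parameter :: nflds = 1          ! assumed: a single field such as cell_area
  integer :: arr_lextremes(0:nflds, 2)     ! assumed shape, 2*(nflds+1) elements
  integer :: arr_gextremes(0:nflds, 2)
  integer :: too_small(0:nflds, 1)         ! deliberately undersized buffer

  arr_lextremes = 0
  arr_gextremes = 0
  too_small = 0

  ! Correctly sized pair, then a pair with an undersized receive buffer.
  print *, 'matching buffers ok:  ', buffers_ok(arr_lextremes, arr_gextremes, 2*(nflds+1))
  print *, 'undersized buffer ok: ', buffers_ok(arr_lextremes, too_small, 2*(nflds+1))
end program demo
```

In the real code, a check like this would sit just before the mpi_allreduce call and print from the failing task. If the sizes are fine on every task, that would point away from the buffers and toward the communicator or the restart state.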
