NaNs appearing in model domain

Hi,

I am trying to spin up the DINO configuration at 1/16 degree in NEMO v4.2.1. My method is to run NEMO for x years, rebuild the restart file at year x, pick NEMO up from that restart point, and repeat.

The problem is that the model appears to populate the domain with NaNs (see figures below). This happens at different times depending on how long the simulation is first run. For example, after running for one year and then picking up, the restart file and output files have a NaN region. The same happens with 3 years, rebuild, 3 years. The model does eventually blow up. Without picking up, the model can run for longer.

Figure 1: restart field after 2 years of simulation.

Figure 2: restart field after 6 years of simulation.

I think the problem is related to processors/tasks assigned to NEMO and XIOS, or possibly the Cray Fortran compiler, or a combination of these.

Some other details:

We use 4160 CPUs across 50 nodes: 4060 for NEMO and 100 for XIOS. This is run on LUMI, which uses the Cray Fortran compilers. I have tried running across fewer CPUs, e.g. 200 and 2000, but the crashes persist.

Any thoughts greatly appreciated.

Cheers

Hi Thomas,

If you haven’t already, you could try recompiling and running with debugging flags that include array bounds checking. That might identify an out-of-bounds error associated with restart I/O. Those debugging flags should probably also reduce or disable compiler code optimisation, e.g. -O1 or whatever is appropriate for that compiler, just in case heavy optimisation has introduced some weird behaviour for your particular setup.

Chris

Hi, are you sure it is a technical problem? I have experienced this very often, and the problem is that the divergence check is done only on the baroclinic time step, but not on the barotropic one.

So at the end of the baroclinic time step, the fields contain a lot of NaNs, which are not detected by the divergence check, so the model continues and just crunches NaNs, even prints them out.
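
One way to find out when the NaNs first appear is to scan a sequence of output snapshots and report the first one that contains a NaN, together with the affected cells. The following is only a minimal sketch on toy data: the function names are hypothetical, and in practice the fields would be read from the NEMO output or restart files with a NetCDF library rather than written out as nested lists.

```python
import math


def find_nan_cells(field):
    """Return the (j, i) indices of every NaN in a 2D field (nested lists)."""
    return [
        (j, i)
        for j, row in enumerate(field)
        for i, value in enumerate(row)
        if math.isnan(value)
    ]


def first_nan_step(snapshots):
    """Return the index of the first snapshot containing a NaN, or None."""
    for step, field in enumerate(snapshots):
        if find_nan_cells(field):
            return step
    return None


# Toy data (not model output): three snapshots of a 2x3 field.
nan = float("nan")
snapshots = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.1, nan, 0.3], [0.4, 0.5, 0.6]],
    [[nan, nan, 0.3], [0.4, nan, 0.6]],
]

print(first_nan_step(snapshots))      # 1
print(find_nan_cells(snapshots[2]))   # [(0, 0), (0, 1), (1, 1)]
```

Writing output more frequently around the time the NaN region shows up and running a check like this would show whether the NaNs start from a single point and spread, which is what a numerical instability might be expected to do.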

Hope this helps,

Thanks, Chris. I’ll explore the debugging steps.

Robinson - both the simulations I have shown are independent of each other, with the NaN region changing. E.g. in the second simulation (Figure 2), the restart file at year 3 is fine. But in the first (Figure 1) the restart file at year 2 has NaNs. I would have thought a numerical instability would show in the same place at the same time? This leads me to think it is something else.

If the model is unstable from a barotropic point of view, NaNs can appear in the barotropic mode at any location, at least in my experience. I would recommend trying to lower the time step to see if it solves the problem.
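
As a rough guide to how small the barotropic time step needs to be, a one-dimensional CFL estimate based on the external gravity wave speed sqrt(g*H) can be sketched as below. This is a back-of-the-envelope calculation, not what NEMO computes internally. The 4000 m depth, the 300 s baroclinic time step, the regular latitude-longitude grid spacing and the Courant limit of 1 are all assumptions chosen for illustration, not values taken from the DINO setup.

```python
import math

G = 9.81                # gravitational acceleration (m/s^2)
EARTH_RADIUS = 6.371e6  # mean Earth radius (m)


def grid_spacing_m(resolution_deg, latitude_deg=0.0):
    """Zonal grid spacing in metres, assuming a regular lat-lon grid."""
    return (
        math.radians(resolution_deg)
        * EARTH_RADIUS
        * math.cos(math.radians(latitude_deg))
    )


def max_barotropic_dt(dx_m, depth_m, courant_max=1.0):
    """Largest time step (s) satisfying a 1D CFL limit for gravity waves."""
    wave_speed = math.sqrt(G * depth_m)  # external gravity wave speed (m/s)
    return courant_max * dx_m / wave_speed


def min_substeps(baroclinic_dt_s, dx_m, depth_m, courant_max=1.0):
    """Smallest number of barotropic sub-steps per baroclinic step."""
    dt_max = max_barotropic_dt(dx_m, depth_m, courant_max)
    return math.ceil(baroclinic_dt_s / dt_max)


# Illustrative values only (assumptions, see the text above).
dx = grid_spacing_m(1.0 / 16.0)          # about 6.9 km at the equator
dt_bt = max_barotropic_dt(dx, 4000.0)    # about 35 s
print(round(dx), round(dt_bt, 1), min_substeps(300.0, dx, 4000.0))
```

If the number of barotropic sub-steps in the actual configuration is close to or below an estimate like this, reducing the time step or increasing the number of sub-steps would be a reasonable first test. The grid spacing shrinks away from the equator, so the limit is tighter at higher latitudes.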