NaNs appearing in model domain

tmwilder · 11 December 2025 14:10

Hi,

I am trying to spin up the DINO configuration at 1/16 degree in NEMO v4.2.1. My method is to run NEMO for x years, rebuild the restart file at year x, then pick up NEMO from that restart point, and repeat.

The problem is that the model appears to populate the domain with NaNs (see figures below). This happens at different timestamps depending on how long the simulation is first run. For example, after running for one year and then picking up, the restart file and output files have a NaN region. The same happens with 3 years, rebuild, 3 years. The model does eventually blow up. So without picking up, the model can run for longer.

Figure 1: restart field after 2 years of simulation.

Figure 2: restart field after 6 years of simulation.
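To compare the NaN regions between runs in numbers rather than by eye, a small helper along the following lines could report where the NaNs sit in a 2D field. This is only a sketch: `nan_bounding_box` is a hypothetical name, the tiny `year2` array is made-up data standing in for one level of a restart variable, and reading the field from the rebuilt restart file (with whatever NetCDF reader is to hand) is not shown.

```python
import math


def nan_bounding_box(field):
    """Return the index range and count of NaN cells in a 2D field (list of rows).

    Returns None when the field contains no NaNs.
    """
    j_min = j_max = i_min = i_max = None
    count = 0
    for j, row in enumerate(field):
        for i, value in enumerate(row):
            if math.isnan(value):
                count += 1
                j_min = j if j_min is None else min(j_min, j)
                j_max = j if j_max is None else max(j_max, j)
                i_min = i if i_min is None else min(i_min, i)
                i_max = i if i_max is None else max(i_max, i)
    if count == 0:
        return None
    return {"j": (j_min, j_max), "i": (i_min, i_max), "count": count}


# Made-up 4x4 field with a 2x2 NaN patch, for illustration only.
nan = float("nan")
year2 = [
    [0.1, 0.2, 0.3, 0.4],
    [0.1, nan, nan, 0.4],
    [0.1, nan, nan, 0.4],
    [0.1, 0.2, 0.3, 0.4],
]
print(nan_bounding_box(year2))
```

Running the same helper on the restart files from both experiments would give index ranges that can be compared directly.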

I think the problem is related to processors/tasks assigned to NEMO and XIOS, or possibly the Cray Fortran compiler, or a combination of these.

Some other details:

We use 4160 CPUs across 50 nodes, 4060 for NEMO and 100 for XIOS. This is run on LUMI, which uses the Cray Fortran compilers. I have tried running across fewer CPUs, e.g. 200 and 2000, but the crashes persist.
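One way to test the processor hypothesis would be to check whether the edges of the NaN region line up with MPI subdomain boundaries. The sketch below maps a global grid index to a subdomain under a deliberately simplified assumption: an even rectangular split with row-major rank numbering. The arguments `jpni` and `jpnj` are named after NEMO's decomposition parameters, but the function itself is hypothetical and ignores halos and any land-processor elimination, so it will not reproduce NEMO's actual layout exactly. The grid and decomposition sizes in the example are made up.

```python
import math


def subdomain_of(i, j, ni_glo, nj_glo, jpni, jpnj):
    """Return (pi, pj, rank) of the subdomain holding global point (i, j).

    Assumes an even split of an ni_glo x nj_glo grid into jpni x jpnj
    rectangles, zero-based indices, and row-major rank numbering.
    This is an illustration, not NEMO's real decomposition algorithm.
    """
    ni_loc = math.ceil(ni_glo / jpni)  # points per subdomain in i
    nj_loc = math.ceil(nj_glo / jpnj)  # points per subdomain in j
    pi = i // ni_loc
    pj = j // nj_loc
    return pi, pj, pj * jpni + pi


# Made-up example: a 100 x 80 grid split into 4 x 2 subdomains.
for point in [(0, 0), (25, 40), (99, 79)]:
    print(point, subdomain_of(*point, ni_glo=100, nj_glo=80, jpni=4, jpnj=2))
```

If the corners of the NaN region map onto whole subdomains, that would point towards a decomposition or I/O issue rather than the physics.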

Any thoughts greatly appreciated.

Cheers

ChrisWilson · 11 December 2025 16:31

Hi Thomas,

If you haven’t already, you could try recompiling and running with debugging flags that include array bounds checking perhaps. That might identify an out-of-bounds error associated with restart I/O. Those debugging flags should probably include a reduced or zero level of compiler code optimisation, e.g. -O1 or whatever is appropriate for that compiler, just in case heavy optimisation has introduced some weird behaviour for your particular setup.

Chris

robinson · 12 December 2025 11:21

Hi, are you sure it is a technical problem? I have experienced this very often, and the problem is that the divergence check is done only on the baroclinic time step, but not on the barotropic one.

So at the end of the baroclinic time step, the fields contain a lot of NaNs, which are not detected by the divergence check, so the model continues and just crunches NaNs, even prints them out.
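The general idea of catching bad values at the step where they first appear, rather than letting the run carry on, can be sketched as below. This is a toy illustration and not NEMO's own check: `first_nonfinite`, `run_with_check` and `toy_step` are hypothetical names, and the toy step simply multiplies the field until it overflows.

```python
import math


def first_nonfinite(field):
    """Return (j, i) of the first NaN or Inf in a 2D field, or None if all finite."""
    for j, row in enumerate(field):
        for i, value in enumerate(row):
            if not math.isfinite(value):
                return (j, i)
    return None


def run_with_check(step, state, n_steps):
    """Advance `state` with `step` and stop at the first non-finite value.

    Returns (step_number, (j, i)) for the first bad value, or None if the
    run completes cleanly.
    """
    for n in range(1, n_steps + 1):
        state = step(state)
        bad = first_nonfinite(state)
        if bad is not None:
            return n, bad
    return None


def toy_step(field):
    """Toy 'model': grows every value so that it overflows after two steps."""
    return [[value * 1e200 for value in row] for row in field]


print(run_with_check(toy_step, [[1.0, 1.0]], n_steps=5))
```

Applying a check like this to output written at high frequency around the failure would show at which step, and where, the first non-finite value appears.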

Hope this helps,

tmwilder · 12 December 2025 11:39

Thanks, Chris. I’ll explore the debugging steps.

Robinson - both the simulations I have shown are independent of each other, with the NaN region changing. E.g. in the second simulation (Figure 2), the restart file at year 3 is fine. But in the first (Figure 1) the restart file at year 2 has NaNs. I would have thought a numerical instability would show in the same place at the same time? This leads me to think it is something else.

robinson · 12 December 2025 11:52

If the model is unstable from a barotropic point of view, NaNs can appear in the barotropic mode at any location, at least in my experience. I would recommend lowering the time step to see if that solves the problem.
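As a rough guide to how the time step enters, the barotropic Courant number c * dt_bt / dx, with c = sqrt(g * H), can be estimated as in the sketch below. All numbers in the example (latitude, depth, time steps, number of barotropic sub-steps) are made-up assumptions for illustration and are not taken from the DINO setup. The function names are hypothetical, and the acceptable limit depends on the time-stepping scheme.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius
GRAVITY = 9.81                # m s^-2


def grid_spacing_m(resolution_deg, latitude_deg):
    """Zonal grid spacing in metres for a regular longitude step at a latitude."""
    dlon = math.radians(resolution_deg)
    return EARTH_RADIUS_M * math.cos(math.radians(latitude_deg)) * dlon


def barotropic_courant(dt_baroclinic_s, n_substeps, depth_m, dx_m):
    """Courant number c * dt_bt / dx for gravity waves with c = sqrt(g * H)."""
    wave_speed = math.sqrt(GRAVITY * depth_m)
    dt_barotropic = dt_baroclinic_s / n_substeps
    return wave_speed * dt_barotropic / dx_m


# Assumed values for illustration only: 1/16 degree at 60N, 4000 m depth,
# 30 barotropic sub-steps per baroclinic step.
dx = grid_spacing_m(1.0 / 16.0, latitude_deg=60.0)
for dt in (600.0, 300.0):
    courant = barotropic_courant(dt, n_substeps=30, depth_m=4000.0, dx_m=dx)
    print(f"dt = {dt:.0f} s -> barotropic Courant number {courant:.3f}")
```

Halving the baroclinic time step with the same number of sub-steps halves this number, which is why lowering the time step is a quick test for a barotropic instability.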
