Dear all,
I have now been able to explore this problem in exhaustive detail on a new cluster, with two new coupled modelling systems, and with standalone simulations.
The cluster consists of nodes with 64 Intel(R) Xeon(R) Platinum 8462Y+ cores.
I can use either an Intel compiler/Intel MPI setup or a GNU/Open MPI setup.
My two coupled modelling systems are:
(1) a modified version of the UK regional environmental prediction system that couples NEMO 4.2.0 to WAVEWATCH III using OASIS3-MCT5
(2) a system that couples MPAS-A, NEMO, and WAVEWATCH III on global grids using OASIS3-MCT5.
With all of these modelling systems, and with either Intel MPI or Open MPI, I encounter the same basic problem: when I put NEMO on more than some critical number of cores (call it x), the run aborts (output.abort) on the first or second timestep with unrealistically high or infinite values in the ssh or U-current fields.
The critical x varies with the modelling system and the MPI/compiler combination; it is lower for the global modelling system and when using Open MPI. Increasing the processor count further above x produces more unrealistic values.
Most importantly, while I can run NEMO standalone for one model day on up to 512 CPUs (the maximum available to me), in the coupled system with WAVEWATCH III I cannot run NEMO beyond a timestep or two above 319 CPUs (though the underlying issue appears to exist at lower processor counts too; 256 CPUs is definitely reliable with Intel MPI). Open MPI is not reliable above 64 CPUs.
I had NEMO output the initial conditions to look further at what was going on. For cases that progress beyond the first few timesteps, the initial conditions are consistent with the restart file I initialise NEMO from. For cases that do not, the initial conditions show strange artifacts in the vozocrtx, vomecrty, and vovecrtz fields, but not in ssh or any other field.
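For reference, the initial-state dump can be requested via the control namelist (a minimal sketch, assuming the nn_istate switch in &namctl of NEMO 4.x, which writes the initial state at the first timestep):

    !-----------------------------------------------------------------------
    &namctl        !   parameters used for control prints
    !-----------------------------------------------------------------------
       nn_istate = 1        ! output the initial state (1) or not (0)
    /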
Here is an example of a normal initialisation (256 CPUs):
Here is an example of an abnormal initialisation (320 CPUs):
I am able to run the regional ocean coupled modelling system in various configurations. If I turn off ln_wave, the model runs until two hours of model time have elapsed, at which point the coupler phase-locks and crashes; the initial condition is normal at 320 CPUs in this case. If I keep ln_wave turned on but turn off all of the options under &namsbc_wave (ln_sdw, ln_taw, etc.), I get an abnormal initialisation and an abort.
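For concreteness, that second test corresponds to something like the following (a sketch based on the NEMO 4.2 namelist_ref; I may not have listed every switch under &namsbc_wave):

    !-----------------------------------------------------------------------
    &namsbc_wave   !   External fields from wave model              (ln_wave=T)
    !-----------------------------------------------------------------------
       ln_sdw          = .false.   ! 2D surface / 3D Stokes drift
       ln_stcor        = .false.   ! Stokes-Coriolis and tracer advection terms
       ln_cdgw         = .false.   ! neutral drag coefficient from the wave model
       ln_tauoc        = .false.   ! ocean stress modified by wave-induced stress
       ln_taw          = .false.   ! wave-supported stress
       ln_vortex_force = .false.   ! vortex force term
    /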
If I remove all of the exchanged fields from namcouple and from the NEMO and WAVEWATCH III namelists, I also get an abnormal initialisation and an abort.
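The stripped-down namcouple in that test is essentially empty (a sketch, assuming the standard OASIS3-MCT keyword layout; the run length here is illustrative):

    $NFIELDS
      0
    $END
    $RUNTIME
      86400
    $END
    $NLOGPRT
      30 0
    $END
    $STRINGS
    $END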
As I read the code, the only thing NEMO does through the coupler in that last case is to initialise OASIS-based communication on the MPI local communicator. Setting ln_wave to .false. ensures lk_oasis is set to .false., so NEMO makes no communications through OASIS at all. I can see no explanation for the behaviour I observe other than some fundamental issue in the interaction between NEMO, OASIS3-MCT5, and XIOS.
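To be explicit, the initialisation path I mean amounts to the following (a paraphrase of NEMO's cpl_oasis3.F90 using the public OASIS3-MCT API; not the verbatim source, and the component name is illustrative):

    ! Register the component with OASIS and retrieve the local MPI
    ! communicator that NEMO (and XIOS) then use as their world communicator.
    CALL oasis_init_comp( ncomp_id, 'oceanx', nerror )
    CALL oasis_get_localcomm( mpi_comm_oce, nerror )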
I have tried Saeed's recommendations about MPI fabrics and the additional FCFLAGS, as outlined in the messages above; they do not change the behaviour in any way.
Nick