Junk domain input at high processor number

I am currently working with a coupled framework between NEMO 4.2 and WAVEWATCH III for the UK coastal shelf at 1.5 km (AMM15). I am now porting this to a supercomputer and running it with far more processors than the 24 or so used for testing. I started with NEMO on 304 processors, and it crashed after 2 time steps with excessively high ssh; the minimum salinity values in run.stat also looked very low relative to my previous tests. When I looked at the abort file, I saw [salinity example]:

Junk values are being written into particular processor subdomains in the ssh, temperature, salinity, and other fields.

What I have found so far is that the processor count matters. Running NEMO on 144 or 256 processors does not produce this behaviour, and the simulation runs much as it did in the test configuration (nearly bit for bit).
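In case it helps anyone reproduce the sensitivity, the decomposition can be pinned explicitly in &nammpp rather than left to NEMO's automatic splitting. A minimal sketch (the values are illustrative for a 304-process run, not necessarily the split NEMO chose for me):

!-----------------------------------------------------------------------
! &nammpp controls the MPI domain decomposition; jpni x jpnj should match
! the number of ocean processes (0, 0 lets NEMO choose automatically).
!-----------------------------------------------------------------------
&nammpp
   jpni = 16        ! number of subdomains along i
   jpnj = 19        ! number of subdomains along j  (16 x 19 = 304)
/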

If anyone has encountered a similar issue before or has any ideas of what to look into, please let me know.

Best regards,
Nicholas G. Heavens, PhD, FGS, FRAS, FRMetS, FBS
Climate Modeller
CGG Environmental Science

Hi Nicholas,

I’m encountering an issue similar to yours. I am trying to run a global eORCA025 simulation, and NEMO crashes at the second time step with all the variables in the “run.stat” file becoming NaN.

I am curious to know if you found the reason behind your problems. Thanks in advance!

Dear Antoine,

I have yet to find a solution for this particular issue. I am currently pursuing it by trying to test large CPU NEMO runs on another system. What does the abort file look like? I encountered a similar issue when I was running into the memory limits of my system.

Nick

The abort file does not look like much, since the fields have become NaN. I have just launched a run with more memory to test whether the issue is caused by a lack of memory.

Thanks for your reply!

Hi Nicholas,

Have you checked the sensitivity of your setup to I_MPI_FABRICS, as listed below? When I run NEMO 4.2 on the cluster, I have to specify I_MPI_FABRICS explicitly in the submit script.

export I_MPI_FABRICS=shm:ofi
export I_MPI_FABRICS=shm:dapl
export I_MPI_FABRICS=shm:ofa

Also, do you use XIOS in attached mode or detached mode?

Dear Saeed,

It’s a good idea, but fabrics issues typically give me errors related to trying to access the wrong communicator device. Fabrics issues could explain another long-term problem with running on this system, though, so thanks for bringing it up.

I am using XIOS in attached mode. Because WAVEWATCH III (to which NEMO is coupled here) does not use XIOS, it’s impossible to get a stable OASIS layer with XIOS running in detached mode. [This latter observation is mainly for the benefit of someone doing a web search.]

Nick

Hi Nick,

You are welcome, and I hope I can help. I usually use the following flags to compile NEMO 4.2 if you are using ifort:

-debug -no-vec -no-simd

Somehow your results and model behaviour are sensitive to the MPI decomposition, and the -no-vec and -no-simd flags can help to minimise the processor-count dependency.

Saeed

Dear all,

I have been able to explore this problem in exhaustive detail on a new cluster and with two new coupled modelling systems as well as with standalone simulations.

The cluster consists of nodes with 64 Intel(R) Xeon(R) Platinum 8462Y+ cores.

I can use either an Intel MPI compiler setup or a GNU-OpenMPI setup.

My two coupled modelling systems are:

(1) a modified version of the UK regional environmental prediction system that couples NEMO 4.2.0 to WAVEWATCH III using OASIS3-MCT5

(2) a system that couples MPAS-A, NEMO, and WAVEWATCH III on global grids using OASIS3-MCT5.

With all of these modelling systems (whether I use Intel MPI or Open MPI), I encounter the same basic problem: when I put NEMO on more than some number x of cores, the run aborts (output.abort) on the first or second time step with unrealistically high or infinite values in the ssh or U-current fields.

The critical x varies with the modelling system and the MPI/compiler combination; it is lower for the global modelling system and when using Open MPI.

Increasing the processor count above the critical x results in more unrealistic values.

Most importantly, while I can run standalone NEMO for one model day on up to 512 cpu (the maximum available to me), I cannot run NEMO beyond a time step or two in the coupled system with WAVEWATCH III above 319 cpu (though the underlying issue appears to exist at lower processor counts too; 256 cpu is definitely reliable for Intel MPI). Open MPI is not reliable above 64 cpu.

I had NEMO output the initial conditions to look further at what was going on. The initial conditions for cases that progress beyond the first few time steps are consistent with the restart file I initialise NEMO from. The initial conditions for cases that do not progress beyond the first few time steps have strange artifacts in the vozocrtx, vomecrty, and vovecrtz fields, but not in ssh or any other field.
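(For anyone repeating this check: as I understand it, the simplest way to dump the initial state is the nn_istate switch in &namrun, sketched below; it writes an output.init file holding the fields at the first time step.)

&namrun
   nn_istate = 1    ! write the initial state (output.init) so the t=0 fields can be inspected
/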

Here is an example of normal initialisation (256 cpu):

Here is an example of not normal initialisation (320 cpu):

I am able to run the regional ocean coupled modelling system in various configurations. If I turn off ln_wave, I can run the model until two hours of model time have elapsed, at which point the coupler phase-locks and crashes; the initial condition is normal for 320 cpu in this case. If I keep ln_wave turned on but turn off all of the options under &namsbc_wave, such as ln_swd and ln_taw, I get a non-normal initialisation and an abort.
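For reference, the toggles involved are roughly the following (a sketch only, using the option names as I have written them above; check namelist_ref for the exact spelling in your NEMO version):

&namsbc
   ln_wave = .true.      ! master switch for wave coupling; .false. gives the "no-wave" case above
/
&namsbc_wave
   ln_swd  = .false.     ! Stokes drift contribution off
   ln_taw  = .false.     ! wave-modified momentum flux off
/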

If I remove all of the exchanged fields from namcouple and the namelists for NEMO and WAVEWATCH III, I get a non-normal initialisation and an abort.

As I read the code, the only thing NEMO does in that last case is initialise OASIS-based communication on the MPI local communicator. Setting ln_wave to .false. ensures lk_oasis is set to .false., which ensures that NEMO does not make any communications through OASIS. I can see no explanation for the behaviour I observe other than some fundamental issue in how NEMO interacts with OASIS3-MCT5 and XIOS.
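To make that concrete, what remains in that last case is essentially the OASIS3-MCT handshake below (a minimal standalone sketch, not NEMO's actual cpl_oasis3.F90; 'toyoce' is a placeholder component name):

program oasis_init_only
   use mod_oasis                          ! OASIS3-MCT library module
   implicit none
   integer :: comp_id, local_comm, ierr

   ! Register this executable as a coupled component and obtain the MPI
   ! communicator containing only this component's ranks.
   call oasis_init_comp(comp_id, 'toyoce', ierr)
   call oasis_get_localcomm(local_comm, ierr)

   ! With no coupling fields declared, the definition phase is empty.
   call oasis_enddef(ierr)

   ! ... the model would time-step on local_comm here ...

   call oasis_terminate(ierr)
end program oasis_init_only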

I have tried Saeed’s recommendations about MPI fabrics and adding a couple of FCFLAGS, as outlined in the messages above. They do not change the behaviour in any way.

Nick

Could it be related to the NEMO GitLab issue “Wrong default value of xcplmask (#330)”?

Dear smasson,

There are a variety of reasons that make a connection unlikely.

  1. The problem happens both in a system that uses coupled atmospheric forcing on the ocean + wave components and in a system that uses bulk atmospheric forcing on the ocean + wave components.

  2. The mask seems to be used only with the ln_mixcpl option, which I don’t think either system uses. One system uses ln_cpl + ln_wave, the other uses ln_wave only.

  3. xcplmask modifies energy variables, not the current fields. Whatever is going on is corrupting the initial conditions of the current fields.

  4. I tried the fix to which you pointed. It had no effect.

The bigger question is whether anyone is successfully running OASIS-coupled systems with NEMO on many hundreds of processors, particularly with Intel compilers. I tend to be minimalist with compiler settings, so there could be an issue here that others never see because their compiler settings happen to get rid of the problem.

Thanks,

Nick

The solution is indeed compiler settings.

FCFLAGS= -i4 -i8 -init=zero -init=arrays -O3

The -init options initialise all scalars and arrays to zero without having to do it explicitly in the code, so there is evidently some uninitialised array leaking into the current fields. With these flags I have run the case that previously broke at 320 cpu, and I have also run at 384 cpu with no issues.
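As a toy illustration of what the flags do (not NEMO code): compiled normally, the program below prints whatever happens to be in memory; compiled with ifort -init=zero -init=arrays demo.f90, it is guaranteed to print 0.0.

program init_demo
   implicit none
   real :: u(5)      ! deliberately never assigned, standing in for a forgotten field array
   print *, u(3)     ! garbage without -init; zero with -init=zero -init=arrays
end program init_demo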

Thank you all for your advice on this. My key insight was realising that more people would have complained if a standard compiler setting did not fix the issue. I noticed the UK Met Office was using:

FCFLAGS= -i4 -i8 -init=zero -init=arrays -traceback -debug minimal -debug inline-debug-info -O3 -fp-model consistent

I then experimented to see which ones fixed the problem.

Nick
