Dear all,
I have now been able to explore this problem in exhaustive detail on a new cluster, with two new coupled modelling systems, and with standalone simulations.
The cluster consists of nodes with 64 Intel(R) Xeon(R) Platinum 8462Y+ cores.
I can use either an Intel compiler/Intel MPI setup or a GNU/Open MPI setup.
My two coupled modelling systems are:
(1) a modified version of the UK regional environmental prediction system that couples NEMO 4.2.0 to WAVEWATCH III using OASIS3-MCT5
(2) a system that couples MPAS-A, NEMO, and WAVEWATCH III on global grids using OASIS3-MCT5.
With all of these modelling systems, and with either Intel MPI or Open MPI, I encounter the same basic problem: when I put NEMO on more than some critical number of cores (call it x), the run aborts (output.abort) on the first or second timestep with unrealistically high or infinite values in the ssh or U-current fields.
The critical x varies with the modelling system and the MPI/compiler combination; it is lower for the global modelling system and when using Open MPI. Increasing the processor count further above x produces more unrealistic values.
Most importantly, while I can run NEMO standalone for one model day on up to 512 CPUs (the maximum available to me), in the coupled system with WAVEWATCH III I cannot run NEMO beyond a timestep or two above 319 CPUs (though the underlying issue appears to exist at lower processor counts too; 256 CPUs is definitely reliable with Intel MPI). Open MPI is not reliable above 64 CPUs.
I had NEMO output the initial conditions to look further at what was going on. For cases that progress beyond the first few timesteps, the initial conditions are consistent with the restart file I initialise NEMO from. For cases that do not, the initial conditions show strange artifacts in the vozocrtx, vomecrty, and vovecrtz fields, but not in ssh or any other field.
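For reference, the initial-state dump can be requested via the control namelist (a minimal sketch, assuming the nn_istate switch in &namctl of NEMO 4.x, which writes the initial state at the first timestep):

    !-----------------------------------------------------------------------
    &namctl        !   parameters used for control prints
    !-----------------------------------------------------------------------
       nn_istate = 1        ! output the initial state (1) or not (0)
    /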
Here is an example of a normal initialisation (256 CPUs):
Here is an example of an abnormal initialisation (320 CPUs):
I am able to run the regional ocean coupled modelling system in various configurations. If I turn off ln_wave, the model runs until two hours of model time have elapsed, at which point the coupler phase-locks and crashes; the initial condition is normal at 320 CPUs in this case. If I keep ln_wave turned on but turn off all of the options under &namsbc_wave (ln_sdw, ln_taw, etc.), I get an abnormal initialisation and an abort.
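For concreteness, that second test corresponds to something like the following (a sketch based on the NEMO 4.2 namelist_ref; I may not have listed every switch under &namsbc_wave):

    !-----------------------------------------------------------------------
    &namsbc_wave   !   External fields from wave model              (ln_wave=T)
    !-----------------------------------------------------------------------
       ln_sdw          = .false.   ! 2D surface / 3D Stokes drift
       ln_stcor        = .false.   ! Stokes-Coriolis and tracer advection terms
       ln_cdgw         = .false.   ! neutral drag coefficient from the wave model
       ln_tauoc        = .false.   ! ocean stress modified by wave-induced stress
       ln_taw          = .false.   ! wave-supported stress
       ln_vortex_force = .false.   ! vortex force term
    /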
If I remove all of the exchanged fields from namcouple and from the NEMO and WAVEWATCH III namelists, I also get an abnormal initialisation and an abort.
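The stripped-down namcouple in that test is essentially empty (a sketch, assuming the standard OASIS3-MCT keyword layout; the run length here is illustrative):

    $NFIELDS
      0
    $END
    $RUNTIME
      86400
    $END
    $NLOGPRT
      30 0
    $END
    $STRINGS
    $END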
As I read the code, the only thing NEMO does through the coupler in that last case is to initialise OASIS-based communication on the MPI local communicator. Setting ln_wave to .false. ensures lk_oasis is set to .false., so NEMO makes no communications through OASIS at all. I can see no explanation for the behaviour I observe other than some fundamental issue in the interaction between NEMO, OASIS3-MCT5, and XIOS.
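To be explicit, the initialisation path I mean amounts to the following (a paraphrase of NEMO's cpl_oasis3.F90 using the public OASIS3-MCT API; not the verbatim source, and the component name is illustrative):

    ! Register the component with OASIS and retrieve the local MPI
    ! communicator that NEMO (and XIOS) then use as their world communicator.
    CALL oasis_init_comp( ncomp_id, 'oceanx', nerror )
    CALL oasis_get_localcomm( mpi_comm_oce, nerror )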
I have tried Saeed's recommendations about MPI fabrics and the additional FCFLAGS, as outlined in the messages above; they do not change the behaviour in any way.
Nick