I am currently working with a coupled NEMO 4.2 and WAVEWATCH III framework for the UK coastal shelf at 1.5 km (AMM15). I am now porting it to a supercomputer and running it on far more processors than the 24 or so used for testing. I started with NEMO on 304 processors, and it crashed after 2 time steps with excessively high ssh. The minimum salinity values in run.stat looked very low relative to my previous tests. When I looked at the abort file, I saw [salinity example]:
Junk values are being written in particular domains in ssh, temperature, salinity, etc.
What I have found so far is that the number of processors matters. Running NEMO on 144 or 256 processors does not produce this behaviour, and the simulation runs much as it did in the test configuration (nearly bit for bit).
If anyone has encountered a similar issue before or has any ideas of what to look into, please let me know.
Nicholas G. Heavens, PhD, FGS, FRAS, FRMetS, FBS
CGG Environmental Science
I’m encountering an issue similar to yours. I am trying to run a global eORCA025 simulation, and NEMO crashes at the second time step, with all the variables in the “run.stat” file becoming NaN.
I am curious to know if you found the reason behind your problems. Thanks in advance!
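For anyone comparing runs across processor counts, a small script can report the first time step at which run.stat goes bad (NaN values or an implausibly low minimum salinity). This is only a sketch: it assumes each run.stat line looks like `it : <step>  |ssh|_max: <value> |U|_max: <value> S_min: <value> S_max: <value>` with Fortran `D` exponents, which may not match every NEMO version. The function names and the `s_min_floor` threshold are hypothetical.

```python
import math
import re

# Assumed layout of one run.stat line (an assumption, check your own file):
#  it :       1    |ssh|_max:  0.1D+01 |U|_max:  0.2D+01 S_min:  0.3D+02 S_max:  0.4D+02
STEP_RE = re.compile(r"it\s*:\s*(\d+)")
PAIR_RE = re.compile(r"(\S+):\s*(\S+)")


def parse_line(line):
    """Return (step, {label: value}) for one line, or None if no step is found."""
    m = STEP_RE.search(line)
    if m is None:
        return None
    values = {}
    for label, raw in PAIR_RE.findall(line[m.end():]):
        try:
            # Fortran writes exponents as D+01; Python expects E+01.
            values[label] = float(raw.replace("D", "E"))
        except ValueError:
            # Anything unparseable (e.g. asterisks) is treated as NaN.
            values[label] = math.nan
    return int(m.group(1)), values


def first_bad_step(lines, s_min_floor=0.0):
    """Return (step, reason) for the first suspicious line, or None if all look fine.

    s_min_floor is a hypothetical threshold for the minimum salinity.
    """
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        step, values = parsed
        non_finite = [k for k, v in values.items() if not math.isfinite(v)]
        if non_finite:
            return step, "non-finite: " + ", ".join(non_finite)
        if values.get("S_min", s_min_floor) < s_min_floor:
            return step, "S_min below floor"
    return None
```

To use it, pass an open file, for example `first_bad_step(open("run.stat"))`, once for each processor count, and compare the step and reason reported.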
I have yet to find a solution for this particular issue. I am currently pursuing it by trying to test large CPU NEMO runs on another system. What does the abort file look like? I encountered a similar issue when I was running into the memory limits of my system.
The abort file does not look like much since the fields have become NaN. I have just launched a run with more memory to test if issues are caused by a lack of memory.
Thanks for your reply!
Have you checked the sensitivity of your setup to MPI_FABRICS, as I listed below? When I run NEMO 4.2 on the cluster, I have to specify MPI_FABRICS explicitly in the submit script.
Also, do you use XIOS in attached mode or detached mode?
It’s a good idea, but fabrics issues typically give me errors related to trying to access the wrong communicator device. Fabrics issues could explain another long-term problem with running on this system, though, so thanks for bringing it up.
I am running NEMO with XIOS in attached mode. Because WAVEWATCH III (to which NEMO is coupled here) does not use XIOS, it’s impossible to get a stable Oasis layer with XIOS running in detached mode. [This latter observation is more for the benefit of someone doing a web search.]
You are welcome, and I hope I can help. If you are using ifort, these are the flags I usually use to compile NEMO 4.2:
-debug -no-vec -no-simd
Somehow your results and model behaviour are sensitive to the MPI decomposition; the no-vec and no-simd flags can help to minimize the dependence on the number of processors.
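As background on why results can change with the decomposition at all: floating-point addition is not associative, so a global sum built from per-subdomain partial sums can differ depending on how the data are split. The sketch below is plain Python, not NEMO code, and the function names are made up for illustration. It only accounts for small differences of the "nearly bit for bit" kind; it does not explain junk values or NaN.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation, with no compensation for round-off."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum each contiguous chunk first, then add the partial sums.

    This mimics each MPI rank summing its own subdomain before a global reduction.
    """
    size = math.ceil(len(values) / n_chunks)
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


# Values chosen so that the small terms are lost next to the large ones.
values = [1.0e16, 1.0, -1.0e16, 1.0]

one_chunk = chunked_sum(values, 1)   # all values on a single "rank"
two_chunks = chunked_sum(values, 2)  # values split across two "ranks"
print(one_chunk, two_chunks, one_chunk == two_chunks)
```

The two splits give different answers for the same data (and neither matches the exact sum of 2.0), which is why changing the number of processors, or letting the compiler reorder operations through vectorization, can alter the last bits of a result.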