I am currently implementing a new tracer package in Nemo v4.0. However, I am confronted with a problem: whether the model run fails or not depends on minor changes in some routines.
This problem bears some resemblance to the issue posted by Marilia.
I investigated further by building a “minimalist” configuration: ORCA2 + OFF + very basic MY_TRC code (3 tracers which are only transported).
When XIOS is executed in detached (server) mode, the model fails upon the simple addition or removal of PRINT instructions in different modules (I put some at strategic locations in OFF/nemogcm.F90, iom.F90, trcini_my_trc.F90, trcsms_my_trc.F90).
I tested several domain decompositions (optimal or not). For each of these decompositions I have been able to make the model freeze or run to the end simply by adding or removing a PRINT instruction.
The model seems to freeze because some processors never get past the call to xios_close_context_definition() in iom_init.
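For what it is worth, the instrumentation I used to see which processes pass that call is just a pair of per-rank prints around it, roughly as sketched below (a sketch only, assuming narea, the local process number from NEMO, is accessible at that point in iom.F90):

   ! sketch: bracket the call in iom_init (iom.F90) with per-process prints
   WRITE(*,*) 'process ', narea, ': before xios_close_context_definition'
   FLUSH(6)          ! push the message to stdout before a possible hang
   CALL xios_close_context_definition()
   WRITE(*,*) 'process ', narea, ': after  xios_close_context_definition'
   FLUSH(6)

With this, the processes that hang print the "before" line but never the "after" line.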
I also compiled both Nemo and Xios with debugging options, but this did not help.
So far I have not encountered any problem when using xios in attached mode.
It seems our problems (there were more than what I described in the post) were fully solved by a compiler change.
We upgraded to the latest version available on our system (2021 release) and now everything works smoothly.
Here is some information Yann Meurdesoif (the main XIOS developer) gave on the XIOS-users mailing list:
It seems that it comes from a known bug in the Intel MPI library with MPI_Iprobe, which itself comes from the MPICH strain. The problem has been fixed in MPICH but not yet in the Intel version.
In the past, Intel MPI came with two versions: the standard release and the release_mt. You could switch from one to the other with environment variables, by sourcing a script shipped with the library.
The standard release has the bug and the release_mt does not. For recent versions, they decided to merge both releases, so the release_mt has disappeared but the bug is still there. So there is no workaround.
The best option is to use an old Intel MPI version (2019) and its release_mt library. Another solution is to use the OpenMPI library if it is available on your computer.
Intel has received reproducers from different sources; I hope they will fix the bug soon.
We are working on nemo4.0.3 (eORCA1 & eORCA025) at CCCma. The eORCA1 configuration sets ln_use_jattr = .true. to exclude the ice sheet area, so the model actually runs on a 362x292 grid. eORCA025 runs on the original 1442x1207 grid.
I was trying to use xios2.5 to write/read restarts in one-file mode (i.e., ln_xios_read = .TRUE. & nn_wxios = 1) for both configs. However, both runs aborted right before the line CALL xios_close_context_definition() in iom.F90. I couldn't find the source code for xios_close_context_definition (it is probably in the library). Has anyone encountered the same problem or does anyone know how to fix it? Thanks in advance!
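For reference, the two switches were set roughly as in the sketch below (assuming they sit in the &namrun group of namelist_cfg, as in the NEMO 4.0 reference namelist):

   &namrun   ! parameters of the run (only the XIOS restart switches shown)
      ln_xios_read = .true.   ! use XIOS to read the single-file restart
      nn_wxios     = 1        ! use XIOS to write the restart: 1 = one single file
   /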
Is it working without ln_xios_read = .TRUE. and nn_wxios = 1?
Using xios for restarts is quite inefficient, as the restarts are read/written at the very beginning/end of the simulation and the reading/writing cannot be overlapped with computation (except if you write restarts during the simulation). The only benefit is that you get a single restart file, so you can easily change the number of cores between two jobs, but most users don't do this…
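(If you do want restarts written during the simulation, my understanding is that this is controlled by nn_stock in &namrun; a minimal sketch, with an arbitrary example value:)

   &namrun
      nn_stock = 17520   ! restart-writing frequency in time steps (example value only)
   /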
Yes, it is working without ln_xios_read = .TRUE. and nn_wxios = 1. The reason I was trying this is that I recently noticed that the first and last columns of nav_lat and nav_lon in the restarts are incorrect, very likely due to REBUILD (they are OK in the model output), and I thought that using xios might be more efficient (good to know it is the opposite) and would also give a single restart file, as you mentioned. I will also look into the REBUILD code.
BTW, regarding the crash, Yann (Meurdesoif) mentioned that:
It seems that it comes from a known bug in the Intel MPI library with MPI_Iprobe, which itself comes from the MPICH strain. The problem has been fixed in MPICH but not yet in the Intel version. But there are some workarounds, depending on your Intel MPI version.
We are using Intel MPI 2021.5, which comes with inteloneapi-2022.1.2. There were similar discussions within ECCC (Environment and Climate Change Canada) that XIOS2.0 does not work with inteloneapi-2022.1.2 because of this bug (a workaround is needed), although NEMO4.0.3 + XIOS2.5 seems to work fine for us so far, except during our optimization attempts.
I am at ECCC and have started running into intermittent hangs at this place in the code… It is inside a collective gather in XIOS (clientServerMap->computeConnectedClients(client->serverSize, client->clientSize, client->intraComm, connectedServerRank_);). I already use the release_mt version of Intel MPI (inteloneapi-2021.4.0)… I see that, as in the bug case, an intercommunicator is again involved…