[intel] Model hangs on initialization phase at `xios_close_context_definition` call

I am currently implementing a new tracer package in NEMO v4.0. However, I am confronted with the problem that the model run fails or succeeds depending on minor changes in some routines.

This problem bears some resemblance to the issue posted by Marilia.

I investigated further by building a “minimalist” configuration: ORCA2 + OFF + very basic MY_TRC code (3 tracers which are only transported).

When XIOS is executed in detached (server) mode, the model can be made to fail by the simple addition or removal of PRINT statements in different modules (I put some at strategic locations in OFF/nemogcm.F90, iom.F90, trcini_my_trc.F90, trcsms_my_trc.F90).

I tested several domain decompositions (optimal or not). For each of these decompositions I have been able to make the model freeze or run to the end by the simple addition or removal of a PRINT statement. :thinking:

The model seems to freeze because some processes are unable to get past the call to xios_close_context_definition() in iom_init.
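For context, here is a minimal sketch of the client-side XIOS setup sequence where the processes get stuck. This is not the actual iom_init code; the context name and time step value are only illustrative:

```fortran
! Minimal sketch of the XIOS client-side context setup (not the actual
! NEMO iom_init code). The last call is collective over clients and
! servers: if any process does not reach it, the others wait forever,
! which matches the freeze described above.
SUBROUTINE xios_init_sketch( local_comm )
   USE xios
   IMPLICIT NONE
   INTEGER, INTENT(in) :: local_comm   ! communicator returned by xios_initialize
   TYPE(xios_duration) :: dtime

   CALL xios_context_initialize( "nemo", local_comm )   ! create the model context
   dtime%second = 5400.                                  ! illustrative time step
   CALL xios_set_timestep( dtime )
   ! ... calendar, domain, axis and grid definitions go here ...

   CALL xios_close_context_definition()   ! <- processes freeze at this call
END SUBROUTINE xios_init_sketch
```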

I also tried compiling both NEMO and XIOS with debugging options, but this did not help.

So far I have not encountered any problems when using XIOS in attached mode.

Thanks for your suggestions

It seems our problems (there were more than what I described in the post) were fully solved by a compiler change :slightly_smiling_face:
We upgraded to the latest available version on our system (release 2021) and now everything works smoothly.

We have the same problem here. What compiler did you use before and after your upgrade?

Here is some information Yann Meurdesoif (the main XIOS developer) gave on the XIOS-users mailing list:

It seems that it comes from a known bug of the Intel MPI library with MPI_Iprobe, inherited from the MPICH code base it derives from. The problem has been fixed in MPICH but not yet in the Intel version.

In the past, Intel MPI shipped with two builds: the standard release and release_mt. You can switch from one to the other by sourcing a script shipped with the library, which sets the relevant environment variables.

The standard release has the bug and release_mt does not. For recent versions they decided to merge both builds, so release_mt has disappeared but the bug is still there, which leaves no workaround.

The best option is to use an old Intel MPI version (2019) with the release_mt library. Another solution is to use the OpenMPI library if it is available on your machine.

Intel has received reproducers from different sources; I hope they will fix the bug soon.
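For illustration only, here is a minimal Fortran sketch of the kind of communication pattern involved: a "server" rank polling with MPI_Iprobe on an intercommunicator to a group of "client" ranks. This is not XIOS code; the rank split, tag and message contents are made up:

```fortran
! Hypothetical minimal sketch (not from XIOS) of MPI_Iprobe polling over an
! intercommunicator between a client group and a single server rank.
! Run with at least 2 MPI ranks.
program iprobe_intercomm_sketch
   use mpi
   implicit none
   integer :: ierr, world_rank, world_size, color, intra_comm, inter_comm
   integer :: status(MPI_STATUS_SIZE)
   integer :: buf, i, n_clients
   logical :: flag

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

   ! Last rank plays the "server"; all others are "clients".
   color = merge(1, 0, world_rank == world_size - 1)
   call MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, intra_comm, ierr)

   ! Intercommunicator between the client group and the server group.
   if (color == 0) then
      call MPI_Intercomm_create(intra_comm, 0, MPI_COMM_WORLD, world_size - 1, 0, inter_comm, ierr)
   else
      call MPI_Intercomm_create(intra_comm, 0, MPI_COMM_WORLD, 0, 0, inter_comm, ierr)
   end if

   if (color == 0) then
      ! Each client sends one message across the intercommunicator.
      buf = world_rank
      call MPI_Send(buf, 1, MPI_INTEGER, 0, 99, inter_comm, ierr)
   else
      ! The server polls with MPI_Iprobe on the intercommunicator (the call
      ! associated with the Intel MPI bug mentioned above), then receives.
      call MPI_Comm_remote_size(inter_comm, n_clients, ierr)
      do i = 1, n_clients
         flag = .false.
         do while (.not. flag)
            call MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, inter_comm, flag, status, ierr)
         end do
         call MPI_Recv(buf, 1, MPI_INTEGER, status(MPI_SOURCE), status(MPI_TAG), &
                       inter_comm, status, ierr)
      end do
   end if

   call MPI_Finalize(ierr)
end program iprobe_intercomm_sketch
```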

We are working on nemo4.0.3 (eORCA1 & eORCA025) at CCCma. The eORCA1 configuration sets ln_use_jattr = .true. to exclude the ice sheet area, so the model actually runs on a grid of 362x292. eORCA025 is on the original grid of 1442x1207.

I was trying to use xios2.5 to write/read restarts in one-file mode (i.e., ln_xios_read = .TRUE. & nn_wxios = 1) for both configs. However, both runs aborted right before the line CALL xios_close_context_definition() in iom.F90. I couldn't find the source code for xios_close_context_definition (it is probably in the library). Does anyone encounter the same problem or know how to fix it? Thanks in advance!
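For reference, this is a sketch of the part of the namelist I was changing (the two switches are the ones quoted above; I believe they live in &namrun, and the rest of the block is omitted):

```fortran
&namrun   ! sketch: only the XIOS-restart switches are shown
   ln_xios_read = .true.   ! use XIOS to read the (single-file) restart
   nn_wxios     = 1        ! use XIOS to write the restart: 0 = no, 1 = one single file
/
```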

Duo

I should have said both runs aborted at the CALL xios_close_context_definition() line in iom.F90.

Duo

Is it working without ln_xios_read = .TRUE. and nn_wxios = 1?

Using XIOS for restarts is quite inefficient, as the restarts are read/written at the very beginning/end of the simulation and the reading/writing cannot be overlapped with computation (except if you write restarts during the simulation). The only benefit is to get a single restart file, so you can easily change the number of cores between two jobs, but most users don't do this…

Hi Sebastien,

Thanks for your message!

Yes, it is working without ln_xios_read = .TRUE. and nn_wxios = 1. The reason I was trying this is that I noticed lately that the 1st and last columns of nav_lat and nav_lon in the restarts are incorrect, very likely due to REBUILD (while they are OK in the model output), and I thought that using XIOS might be more efficient (good to know it is the opposite) and would also give a single restart file as you mentioned. I will also look into the REBUILD code.

BTW, regarding the crash, Yann (Meurdesoif) mentioned that

It seems that it comes from a known bug of the Intel MPI library with MPI_Iprobe, inherited from the MPICH code base it derives from. The problem has been fixed in MPICH but not yet in the Intel version. But there are some workarounds, depending on your Intel MPI version.

We are using Intel MPI 2021.5, which comes with inteloneapi-2022.1.2. There were similar discussions within ECCC (Environment and Climate Change Canada) suggesting that XIOS2.0 does not work with inteloneapi-2022.1.2 due to the bug (it needs a workaround), although NEMO4.0.3 + XIOS2.5 seems to work fine for us so far, except during this optimization attempt.

Regards,

Duo

Hi:

I am at ECCC and have started running into intermittent hangs at this place in the code. The hang is inside a collective gather in XIOS (clientServerMap->computeConnectedClients(client->serverSize, client->clientSize, client->intraComm, connectedServerRank_);). I already use the release_mt version of Intel MPI (inteloneapi-2021.4.0). I note that, as in the bug described above, an intercommunicator is again involved…

Image              PC                Routine            Line        Source             
nemo.exe           0000000001C7E02C  Unknown               Unknown  Unknown
libc-2.28.so       000014E2FE78B880  Unknown               Unknown  Unknown
libmpi.so.12.0.0   000014E2F30C8780  Unknown               Unknown  Unknown
libmpi.so.12.0.0   000014E2F3272112  Unknown               Unknown  Unknown
libmpi.so.12.0.0   000014E2F2FAA9FF  Unknown               Unknown  Unknown
libmpi.so.12.0.0   000014E2F30497C4  Unknown               Unknown  Unknown
libmpi.so.12.0.0   000014E2F30200CC  Unknown               Unknown  Unknown
libmpi.so.12.0.0   000014E2F3153465  Unknown               Unknown  Unknown
libmpi.so.12       000014E2F2FAC1F8  MPI_Allgather         Unknown  Unknown
nemo.exe           00000000015C50A7  _ZN4xios20CClient          64  client_server_mapping.cpp
nemo.exe           00000000016066E0  _ZN4xios7CDomain2        1757  domain.cpp
nemo.exe           000000000160B8D8  _ZN4xios7CDomain4        1466  domain.cpp
nemo.exe           00000000016D8461  _ZN4xios5CGrid34c         197  grid.cpp
nemo.exe           00000000016D7094  _ZN4xios5CGrid14c         300  grid.cpp
nemo.exe           0000000001648D5E  _ZN4xios6CField29         759  field.cpp
nemo.exe           000000000166F223  _ZN4xios5CFile26s         754  file.cpp
nemo.exe           00000000015D1E53  _ZN4xios8CContext         495  context.cpp
nemo.exe           00000000015CF9EB  _ZN4xios8CContext         895  context.cpp
nemo.exe           00000000015DC7B1  _ZN4xios8CContext         405  context.cpp
nemo.exe           0000000001851991  cxios_context_clo         102  icdata.cpp
nemo.exe           0000000000955366  iom_mp_iom_init_          271  iom.F90
nemo.exe           0000000000485A34  step_mp_stp_               96  step.F90
nemo.exe           000000000043701D  nemogcm_mp_nemo_g         174  nemogcm.F90
nemo.exe           0000000000436F36  MAIN__                     18  nemo.f90
nemo.exe           0000000000436EE2  Unknown               Unknown  Unknown
libc-2.28.so       000014E2FE7777B3  __libc_start_main     Unknown  Unknown
nemo.exe           0000000000436DEE  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)

My NEMO version is 3.6, and I am not using XIOS to write restarts… It says "killed" because the job had hung until it hit the walltime limit…
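In case it helps others debugging this, a small standalone check (standard MPI only, nothing specific to NEMO or XIOS is assumed) to confirm at run time which MPI library and version the executable is actually linked against, which can be useful when several Intel MPI installations coexist on a system:

```fortran
! Prints the MPI library/version string actually linked at run time.
program which_mpi
   use mpi
   implicit none
   integer :: ierr, vlen
   character(len=MPI_MAX_LIBRARY_VERSION_STRING) :: version

   call MPI_Init(ierr)
   call MPI_Get_library_version(version, vlen, ierr)
   write(*,'(a)') version(1:vlen)
   call MPI_Finalize(ierr)
end program which_mpi
```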