Hi, I am running a large ~4800x6600 configuration on ~4000 cores plus 160 XIOS cores, but sporadically get a time-out that I trace back to an MPI operation (mpi_allreduce or mpi_probe) inside the model or XIOS, most often in the latter during initialisation (the dia_mlr portion). We are running with inteloneapi-2024.2.0. Since I am unable to figure out whether this is a hardware or software issue, I am reaching out to the community to see if others have encountered the same difficulty…
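In case it helps with diagnosis: one thing we plan to try on the next re-run is turning up the MPI-level logging so we can see what the ranks are doing when the job stalls. A minimal sketch, assuming Intel MPI over libfabric (the variable names are from Intel's documentation; the specific values and the launch line are just illustrative, not our actual job script):

```shell
# Increase Intel MPI debug verbosity (rank pinning, provider selection, etc.)
export I_MPI_DEBUG=5

# Surface libfabric warnings, since Intel MPI sits on top of libfabric
export FI_LOG_LEVEL=warn

# Then launch as usual, e.g. (placeholder executables/core counts):
# mpirun -n 4000 ./nemo : -n 160 ./xios_server.exe
```

This won't fix anything by itself, but the extra output sometimes shows whether a stall is in the fabric layer or above it.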
Hi!
We have also been seeing this behaviour for about half a year now and likewise cannot pin the issue down to any one of the components (NEMO, XIOS, or the HPC software stack). In our case it appeared coincident with a major software update on the HPC system concerned.
We mostly see the jobs stalling when they are supposed to produce output (but not exclusively). The issue occurs randomly: some jobs that ran into it once simply run through without any issue after re-submission.
We only see this in our AGRIF configurations (~3700 cores for NEMO + 100 cores for XIOS, and ~3700 NEMO + 800 XIOS for a double-nested configuration with a huge amount of output), while un-nested (smaller) configurations (~2000 + 80) don't seem to be affected.
We also use intel-compilers.
Addendum: this appears across NEMO versions (3.6 and 5.0) and XIOS versions/revisions (2 and 3, several revisions tested).
Thanks for the info! Can you confirm you are also using Intel oneAPI? We are thinking of switching to OpenMPI…
We use the Intel classic compiler 2021.10.0, as part of oneAPI release 2023.2.
Directly using oneAPI (ifx, icx…) never worked on our systems - so far.
We were also recommended to try GCC/OpenMPI; that is still on our list, as it didn't work right away.
Sorry, not a solution, but I've had sporadic issues when scaling up my much smaller experiments (~200 cores). I narrowed it down to a fld_read call which the new ifx compiler seemed to be treating differently. What level of optimisation are you using? I believe some bugs were fixed in the 2025 release, although I was still encountering problems and am now in the process of moving to GCC.
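For context on the optimisation question, a conservative first test is simply dropping the optimisation level and forcing stricter floating-point behaviour in the build flags. A sketch of what that could look like in a NEMO-style FCM arch file (the compiler name and flags are illustrative assumptions for an Intel ifx + Intel MPI setup, not an official recommendation):

```
# Illustrative NEMO arch-file fragment: conservative Intel flags to rule
# out optimisation-related miscompilation.  All values are placeholders.
%FC            mpiifx
%FCFLAGS       -O1 -fp-model precise -g -traceback
%FFLAGS        %FCFLAGS
```

If the hang disappears at `-O1` but returns at higher levels, that at least points toward a compiler/optimiser issue rather than the MPI stack.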