Hi, I am running a large ~4800x6600 configuration on ~4000 cores plus 160 XIOS cores, but sporadically get a time-out that I trace back to an MPI operation (mpi_allreduce or mpi_probe) inside the model or XIOS, most often in the latter during initialisation (the dia_mlr portion). We are running with inteloneapi-2024.2.0. Since I am unable to figure out whether this is a hardware or software issue, I am reaching out to the community to see if others have encountered the same difficulty…
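In case it helps with diagnosis: one thing we plan to try on the next re-run is turning up the MPI-level logging so we can see what the ranks are doing when the job stalls. A minimal sketch, assuming Intel MPI over libfabric (the variable names are from Intel's documentation; the specific values and the launch line are just illustrative, not our actual job script):

```shell
# Increase Intel MPI debug verbosity (rank pinning, provider selection, etc.)
export I_MPI_DEBUG=5

# Surface libfabric warnings, since Intel MPI sits on top of libfabric
export FI_LOG_LEVEL=warn

# Then launch as usual, e.g. (placeholder executables/core counts):
# mpirun -n 4000 ./nemo : -n 160 ./xios_server.exe
```

This won't fix anything by itself, but the extra output sometimes shows whether a stall is in the fabric layer or above it.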
Hi!
We have also been seeing this behaviour for about half a year now and likewise cannot pin the issue down to any one of the components (NEMO, XIOS, or the HPC software stack). In our case it appeared coincident with a major software update on the HPC system concerned.
We mostly see the jobs stalling when they are supposed to produce output (but not exclusively). The issue occurs randomly: some jobs that ran into it once simply run through without any issue after re-submission.
We only see this in our AGRIF configurations (~3700 cores for NEMO + 100 cores for XIOS, and ~3700 NEMO + 800 XIOS for a double-nested configuration with a huge amount of output), while un-nested (smaller) configurations (~2000 + 80) don't seem to be affected.
We also use intel-compilers.
Addendum: this appears across NEMO versions (3.6 and 5.0) and XIOS versions/revisions (2 and 3, several revisions tested).
Thanks for the info! Can you confirm you are also using Intel oneAPI? We are thinking of switching to OpenMPI…
We use the Intel classic compiler 2021.10.0, as part of oneAPI release 2023.2.
Directly using oneAPI (ifx, icx…) never worked on our systems - so far.
We were also recommended to try GCC/OpenMPI; that is still on our list, as it didn't work right away.
Sorry, not a solution, but I've had sporadic issues when scaling up my much smaller experiments (~200 cores). I narrowed it down to a fld_read call which the new ifx compiler seemed to be treating differently. What level of optimisation are you using? I believe some bugs were fixed in the 2025 release, although I was still encountering problems and am now in the process of moving to GCC.
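For context on the optimisation question, a conservative first test is simply dropping the optimisation level and forcing stricter floating-point behaviour in the build flags. A sketch of what that could look like in a NEMO-style FCM arch file (the compiler name and flags are illustrative assumptions for an Intel ifx + Intel MPI setup, not an official recommendation):

```
# Illustrative NEMO arch-file fragment: conservative Intel flags to rule
# out optimisation-related miscompilation.  All values are placeholders.
%FC            mpiifx
%FCFLAGS       -O1 -fp-model precise -g -traceback
%FFLAGS        %FCFLAGS
```

If the hang disappears at `-O1` but returns at higher levels, that at least points toward a compiler/optimiser issue rather than the MPI stack.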