NEMO4.2: execution time spikes for some ts

Hello,

Due to a significant execution time difference between two of our HPCs, I am looking at how long each individual time step takes (from the timing.output file) as a function of the output frequency (hourly, daily, monthly, or stations only, which are daily) on each HPC (Bi, Tetralith). It turns out that activating hourly outputs (ssh_inst, in this case) makes the execution time blow up (about x5).

NEMO4.2 runs on 96 CPUs and XIOS on 16. On Bi I run the experiment with mpirun -n 96 ./nemo.exe : -n 16 ./xios_server; on Tetralith it is srun --mpi pmi2 --multi-prog cpu_mapping, with the CPU mapping splitting the XIOS cores equally at the end of the available nodes. The code is compiled with Intel2018 on Bi and with Intel2023 on Tetralith. The CPUs are different, but a normal time step takes about 0.28 s on Bi and 0.24 s on Tetralith.
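For reference, the cpu_mapping file used with --multi-prog is just a plain mapping of MPI ranks to binaries, roughly of this form (the rank ranges and binary names below are only illustrative, not my exact layout, which spreads the XIOS ranks over the end of each node):

```
# SLURM multi-prog configuration: <MPI ranks>  <program>
0-95    ./nemo.exe
96-111  ./xios_server.exe
```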

Here is what I have:


[Figure: y axis: execution time (s); x axis: experiment progress (in simulated days). Time step is 180 s.]

As you can see, in the daily case I get a spike of execution time every day, but only on Tetralith. It’s even worse in the hourly case, where I get spikes every hour, plus intermediate smaller spikes every 20 min or so (~1 or 2 s/ts). All these spikes combined are what makes the running time so bad when hourly outputs are involved.

Does anyone know what the source of these spikes could be? They don’t really make sense to me. The whole point of using XIOS in detached mode is that NEMO sends the data at every time step in the iom_put() call and does not have to stop while XIOS is writing something.
Moreover, how can it be HPC dependent? I only showed the results for Tetralith here, as its hardware is similar to Bi’s, but I get the same problem on our brand new HPC.

I’m in discussion with our HPC support in parallel, but I wondered whether someone from the NEMO community has any insight into this behaviour.


Difficult to give precise advice without much more information about the configuration and machines. However, you are right: when running properly in detached mode, the XIOS servers should prevent these spikes. Your systems are “similar” but not identical, so here are some things to check:

  1. Your process placement in the second case. Make sure the ocean and xios processes are not running unintentionally on the same cores.
  2. Memory limits on both machines. If the second one has less memory per node, you may need to use more cores. XIOS will report its performance in the job log. Check for messages such as:
-> report :  Performance report : Ratio : 0.0006162 %
-> report :  Performance report : This ratio must be close to zero. Otherwise it may be usefull to increase buffer size or numbers of server
  3. Check for poor filesystem performance on the second machine. If it is taking too long to write the output files, then the model will eventually have to wait for XIOS to catch up.

Thanks for your answer.

Your process placement in the second case. Make sure the ocean and xios processes are not running unintentionally on the same cores.

That shouldn’t be possible with srun --multi-prog, as we have to manually map the CPUs to the binaries.

Memory limits on both machines.

That’s the oddest part. Bi is the oldest and slowest machine of the lot, yet I have the same problem on our brand new machine (Freja), which is twice as powerful as Bi and has 3 times the memory per core (I didn’t show it in the previous figure). The networks are different, with InfiniBand QDR for Bi and 100 Gbps OmniPath for the two others, but again, on paper the winners are Tetralith & Freja… yet they are the slower ones.

Check for messages

I had a look at the average ratio for all CPUs, and this is what I got:

We can see that Bi is the one with the best ratios of the lot.

Maybe I configured XIOS poorly; I’m bad at figuring out how many cores and how much buffer space I should give it. My current set-up is about 1 XIOS core for every 6 NEMO cores, with the outputs for each category corresponding to the following (a rough sketch of the file definitions is given after the list):

  • 1M: ~5 MB of data in total
  • 1D: ~30 MB of data per day
  • 1H: ~90 MB per day (or 3.75 MB per hour)
  • STATIONS: ~0.11 MB per day
    With 16 XIOS cores, that shouldn’t be more than 2 or 3 MB per day per core, even with hourly outputs.
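For context, the file definitions behind these four streams look roughly like the sketch below in my file_def XML (file ids and field names here are placeholders, not my exact ones):

```xml
<file_definition type="one_file">
  <!-- placeholder ids/fields: one file per output category -->
  <file id="file_1m"      output_freq="1mo"> <field field_ref="soce"     /> </file>
  <file id="file_1d"      output_freq="1d">  <field field_ref="toce"     /> </file>
  <file id="file_1h"      output_freq="1h">  <field field_ref="ssh_inst" /> </file>
  <file id="file_station" output_freq="1d">  <field field_ref="ssh_inst" /> </file>
</file_definition>
```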

I tried increasing my buffer_size from 200000 (0.2 MB) to 20000000 (20 MB), and buffer_server_factor_size from 2 to 4. It didn’t change anything.
And regardless of the configuration, how can the same config behave so differently on each HPC? It should get better with more memory per core and better connectivity, not worse.
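For reference, I set those two values in the variable_definition block of iodef.xml, roughly as below (the variable names are the ones I used above; they may differ depending on the XIOS version, so treat this as a sketch):

```xml
<context id="xios">
  <variable_definition>
    <!-- buffer settings I tried (largest values tested); names as in my setup -->
    <variable id="buffer_size"               type="int">20000000</variable>
    <variable id="buffer_server_factor_size" type="int">4</variable>
  </variable_definition>
</context>
```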

Check for poor filesystem performance on the second machine.

All of them write to the same filesystem, which is cross-mounted. That shouldn’t be the problem, but it’s one of the points I’m trying to explore with the HPC support.

If it is taking too long to write the output files, then the model will eventually have to wait for XIOS to catch up

Why does NEMO have to wait? At the last time step, I understand, but at the end of a day? The sending of data to XIOS is done with non-blocking communication, isn’t it? So unless it reaches the sync_freq (1 month in my case), it shouldn’t have to wait. Or is this frequency ignored if the experiment lasts less than the frequency?
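For context, sync_freq is the attribute I set on the file definitions, roughly like this (the file id and field are placeholders):

```xml
<!-- sync_freq="1mo" is inherited by the files below; data is still sent to XIOS every time step -->
<file_definition type="one_file" sync_freq="1mo">
  <file id="file_1h" output_freq="1h">
    <field field_ref="ssh_inst" />
  </file>
</file_definition>
```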

Coming back to my own topic ages later, but in case anyone ends up on this subject in the future, here is what we found in the end: the problem came from HDF5/NetCDF4 on our new cluster.

It’s still very unclear why, and we don’t have a real solution to the problem, but the parallel HDF5/NetCDF4 build we have on the cluster hangs like crazy when XIOS uses parallel I/O. Our workaround was to use the multiple_file writing mode in XIOS. With that, there is no more hanging and everything is fast again.
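Concretely, the change is just the type attribute of the file_definition (or of individual files) in the file_def XML, switching from a single shared NetCDF file to one file per XIOS server, roughly as in the sketch below (ids and fields are placeholders); the per-server files can then be recombined offline if needed, e.g. with NEMO’s REBUILD_NEMO tool.

```xml
<!-- before: all XIOS servers write in parallel to one shared NetCDF4 file (this is what hung for us) -->
<file_definition type="one_file">
  <file id="file_1h" output_freq="1h"> <field field_ref="ssh_inst" /> </file>
</file_definition>

<!-- workaround: each XIOS server writes its own file, so no parallel HDF5 writes -->
<file_definition type="multiple_file">
  <file id="file_1h" output_freq="1h"> <field field_ref="ssh_inst" /> </file>
</file_definition>
```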