Very poor scaling with OpenIFS + NEMO 4.2 + XIOS 3 (slow)

Good afternoon

I recently did some test runs with a coupled model with OpenIFS 43r3 + NEMO 4.2 + XIOS 3. The resolutions are 100 km (OpenIFS) and eORCA025 L75, and output from both OpenIFS and NEMO are handled by XIOS.

The model scales relatively well when increasing the number of NEMO cores with no output written, but when 5-daily output of U, V, T and S is turned on there is a significant slowdown and no scaling anymore (see attached figure).
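For context, the 5-daily output in question is declared along these lines in file_def_nemo-oce.xml (a sketch using the standard NEMO field_ref ids; your file ids and names may differ):

```xml
<!-- sketch: 5-day means of T, S, U, V; field_ref ids follow the standard NEMO field_def -->
<file_group id="5d" output_freq="5d" output_level="10" enabled=".TRUE.">
  <file id="file_T_5d" name_suffix="_grid_T" description="ocean T grid variables">
    <field field_ref="toce" name="thetao" operation="average"/>
    <field field_ref="soce" name="so"     operation="average"/>
  </file>
  <file id="file_U_5d" name_suffix="_grid_U" description="ocean U grid variables">
    <field field_ref="uoce" name="uo" operation="average"/>
  </file>
  <file id="file_V_5d" name_suffix="_grid_V" description="ocean V grid variables">
    <field field_ref="voce" name="vo" operation="average"/>
  </file>
</file_group>
```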

The model was initially really slow, but I got some speedup by:

  • distributing 2 XIOS cores per compute node
  • increasing "buffer_size_factor" from 1 to 3 in iodef.xml (increasing it to 4 had no further impact)

but the model is still scaling very poorly.
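For reference, that buffer setting lives in the variable definitions of iodef.xml, and the change amounts to just this (a sketch; element placement depends on your file):

```xml
<!-- sketch: inside <variable_definition> in iodef.xml -->
<variable_group id="buffer">
  <variable id="buffer_size_factor" type="double">3.0</variable>
</variable_group>
```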

Is this to be expected, or does anyone have some hints on how to further speed up the model?
Or is ~5 SYPD around the best that can be done with eORCA025 L75?

I’m attaching my iodef.xml for reference.

Many thanks for any hints or help!
Joakim


[Attached figure: scaling_foci4]

I am not sure I understand what you are doing. You are testing eORCA025 L75 scalability in coupled mode? Are you coupling with OASIS3-MCT? Are NEMO and OpenIFS running sequentially or in parallel? If they run in parallel, how do you check the load balance between the models?
Which exact version of XIOS3 are you using: xios3-beta or xios3?
How many XIOS cores are you using?
Are you using the one_file or the multiple_file mode?

Andrew Coward has reported a memory leak in XIOS3 which will affect you when running longer simulations, so keep an eye out for fixes to XIOS3. He has also spent some time testing the gatherer/writer pools, so he might be able to offer some advice on how to optimise these. I’ve not used XIOS3 myself, but in XIOS2, aside from the one_file/multiple_file choice, the other key parameter for performance is sync_freq. Try to set this reasonably long (e.g. sync_freq="1mo") to prevent NEMO from waiting for XIOS to flush to disk.
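In XIOS2, sync_freq is a file attribute, so setting it once on the file_definition covers all files below it; a sketch:

```xml
<!-- sketch: one sync_freq on the file_definition is inherited by every file -->
<file_definition type="one_file" sync_freq="1mo" min_digits="4">
  <!-- ... file and field definitions ... -->
</file_definition>
```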

The major leaks have just been fixed. XIOS3-trunk @2593 is looking much better. I’ve been testing an eORCA025 with NEMO main and XIOS3. I recommend the trunk version since you can use the new pools and services to take control over how the XIOS3 resources are used. I’ve been blogging here:

which provides an example (based on ORCA2_ICE_PISCES, but the concepts transfer). For eORCA025, I’ve tried using 52 XIOS3 servers (with 1019 ocean cores). The iodef.xml is below. This isn’t necessarily optimal, but it completes a simulated year in just over an hour and produces one-file output.

I’m using the new RK3 time-stepping, but even with MLF you should be able to do better than 5 SYPD.

<?xml version="1.0"?>
<simulation>

<!-- ============================================================================================ -->
<!-- XIOS context                                                                                 -->
<!-- ============================================================================================ -->

<!-- XIOS3 -->
  <context id="xios" >
    <variable_definition>
      <variable_group id="buffer">
        <variable id="min_buffer_size" type="int">4000000</variable>
        <variable id="optimal_buffer_size" type="string">performance</variable>
      </variable_group>

      <variable_group id="parameters" >
        <variable id="using_server" type="bool">true</variable>
        <variable id="info_level" type="int">10</variable>
        <variable id="print_file" type="bool">false</variable>
        <variable id="using_server2" type="bool">false</variable>
        <variable id="transport_protocol" type="string" >p2p</variable>
        <variable id="using_oasis"      type="bool">false</variable>
      </variable_group>
    </variable_definition>
    <pool_definition>
     <pool name="Opool" nprocs="52">
      <service name="t5dwriter" nprocs="2"  type="writer"/>
      <service name="iwriter"   nprocs="2"  type="writer"/>
      <service name="t1ywriter" nprocs="4"  type="writer"/>
      <service name="t1mwriter" nprocs="4"  type="writer"/>
      <service name="u1mwriter" nprocs="2"  type="writer"/>
      <service name="u5dwriter" nprocs="2"  type="writer"/>
      <service name="u1ywriter" nprocs="2"  type="writer"/>
      <service name="tgatherer" nprocs="16" type="gatherer"/>
      <service name="igatherer" nprocs="4"  type="gatherer"/>
      <service name="ugatherer" nprocs="14" type="gatherer"/>
     </pool>
    </pool_definition>
  </context>

<!-- ============================================================================================ -->
<!-- NEMO  CONTEXT add and suppress the components you need                                       -->
<!-- ============================================================================================ -->

  <context id="nemo" default_pool_writer="Opool" default_pool_gatherer="Opool" src="./context_nemo.xml"/>       <!--  NEMO       -->

</simulation>

See the ORCA2_ICE_PISCES blog example for the other changes you’ll need. My current approach is to:

  • use the tgatherer service to gather all T-grid and W-grid fields
  • use the igatherer service to gather all ice fields
  • use the ugatherer service to gather all U-grid and V-grid fields
  • use writer services dedicated to each grid and frequency (t5dwriter, u5dwriter, t1mwriter, etc.)

This seems to be working nicely, but there are plenty of other possibilities.
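The mapping from files to services is done with per-file attributes; the names below follow the ORCA2_ICE_PISCES blog example, so treat this as a sketch and verify them against your XIOS3 revision:

```xml
<!-- sketch: route a 5-day T-grid file through the tgatherer/t5dwriter pair.
     With default_pool_writer/default_pool_gatherer set on the context,
     only the gatherer and writer need naming here.
     Attribute names follow the blog example; check your XIOS3 revision. -->
<file id="file_T_5d" name_suffix="_grid_T" output_freq="5d"
      using_server2="true" gatherer="tgatherer" writer="t5dwriter">
  <field field_ref="toce" name="thetao" operation="average"/>
</file>
```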

These are my results with NEMO "main" as of today for a regional ocean-ice configuration at 1/36 degree resolution and XIOS2 (if it helps).

Thanks everyone for the replies, especially Andrew for the link to the “blog” post :wink: One year of eORCA025 in around 1 hour sounds much more acceptable.

I’ve pulled the most recent XIOS3 trunk and also NEMO 4.2.1 (a small upgrade from before).
However, due to some new security settings on our HPC, the model now refuses to compile, so it will be a few days before I can report anything.

Cheers
Joakim

PS

To clarify, here are my current settings:
Coupled model (OpenIFS 100km, 91 levels + NEMO 4.2 eORCA025 L75 + OASIS3-MCT5).
Lagged coupling, i.e. NEMO and OpenIFS use coupled fields from the previous coupling step (1 hour).
I’ve checked the load balance using the LUCIA tool in OASIS and it reports quite a lot of waiting time for both OpenIFS and NEMO but none for XIOS.

Cores: 288 OpenIFS + 60 XIOS cores + 1 runoff mapper. Tested from 600 to 3500 cores for NEMO.
Hardware: Two 48-core Intel CPUs per node (96 cores per node).
XIOS settings: One file, 1 month sync frequency, 5-daily output of U,V,T,S + surface fluxes and sea ice.
I can’t find the exact XIOS revision we are using, but it was pulled from the trunk sometime in February 2023.

Also, the log file “xios_client_2000.out” (core 2000 is a NEMO core) reports:
→ report : Memory report : Context : server side : total memory used for buffer 345870 bytes
→ info : CContext: Context is finalized.
→ info : Client side context is finalized
→ report : Performance report : Whole time from XIOS init and finalize: 874.682 s
→ report : Performance report : total time spent for XIOS : 359.148 s
→ report : Performance report : time spent for waiting free buffer : 222.988 s
→ report : Performance report : Ratio : 25.4936 %
→ report : Performance report : This ratio must be close to zero. Otherwise it may be usefull to increase buffer size or numbers of server
→ report : Memory report : Minimum buffer size required : 378068 bytes
→ report : Memory report : increasing it by a factor will increase performance, depending of the volume of data wrote in file at each time step of the file

which I think means XIOS is not performing as well as it can.

With XIOS2 I have sometimes found adding:

      <variable_group id="buffer">
        <variable id="optimal_buffer_size" type="string">memory</variable>
        <variable id="buffer_size_factor"  type="double">2.0</variable>
      </variable_group>

to the iodef.xml helpful, and with NEMO + XIOS in general it is worth trying to part-populate nodes. Our HPC has two 32-core Intel 8358 CPUs per node and performs better if 1-2 cores per CPU are left idle. Our national facility, Archer2, has two 64-core AMD 7742 CPUs per node, and there is a significant benefit to part-populating the nodes.

A brief update. I’ve also been running a 4.2.1 version of NEMO with XIOS3 (trunk@2593). This eORCA025 config has a slightly heavier IO load (some 1d fields, more 5d fields, etc.); something closer to a CMIP6-type load. For this, I’ve had to revise my strategy and introduce more writer services and increase the size of the gatherers in order for XIOS to keep up with the run.

It is obvious when XIOS is struggling, because output of the heavier fields starts lagging behind the lighter ones. For example, after 6 months of simulation I was seeing 1d files from month 6 but was still waiting for monthly files from month 3. Eventually the backlog causes OOM issues. With 90 XIOS3 servers and enlarged pools and services (see iodef.xml below), the IO keeps up with the model and it can maintain one-file output throughout. A nice illustration of the extra versatility that XIOS3 provides.

<?xml version="1.0"?>
<simulation>

<!-- ============================================================================================ -->
<!-- XIOS context                                                                                 -->
<!-- ============================================================================================ -->

<!-- XIOS3 -->
  <context id="xios" >
    <variable_definition>
      <variable_group id="buffer">
        <variable id="min_buffer_size" type="int">4000000</variable>
        <variable id="optimal_buffer_size" type="string">performance</variable>
      </variable_group>

      <variable_group id="parameters" >
        <variable id="using_server" type="bool">true</variable>
        <variable id="info_level" type="int">10</variable>
        <variable id="print_file" type="bool">false</variable>
        <variable id="using_server2" type="bool">false</variable>
        <variable id="transport_protocol" type="string" >p2p</variable>
        <variable id="using_oasis"      type="bool">false</variable>
      </variable_group>
    </variable_definition>
    <pool_definition>
     <pool name="Opool" nprocs="90">
      <service name="t1dwriter" nprocs="4"  type="writer"/>
      <service name="t5dwriter" nprocs="4"  type="writer"/>
      <service name="w5dwriter" nprocs="4"  type="writer"/>
      <service name="iwriter"   nprocs="4"  type="writer"/>
      <service name="t1mwriter" nprocs="4"  type="writer"/>
      <service name="w1mwriter" nprocs="4"  type="writer"/>
      <service name="t1ywriter" nprocs="4"  type="writer"/>
      <service name="w1ywriter" nprocs="4"  type="writer"/>
      <service name="u1dwriter" nprocs="4"  type="writer"/>
      <service name="u5dwriter" nprocs="4"  type="writer"/>
      <service name="u1mwriter" nprocs="4"  type="writer"/>
      <service name="u1ywriter" nprocs="4"  type="writer"/>
      <service name="tgatherer" nprocs="24" type="gatherer"/>
      <service name="igatherer" nprocs="4"  type="gatherer"/>
      <service name="ugatherer" nprocs="14" type="gatherer"/>
     </pool>
    </pool_definition>
  </context>

<!-- ============================================================================================ -->
<!-- NEMO  CONTEXT add and suppress the components you need                                       -->
<!-- ============================================================================================ -->

  <context id="nemo" default_pool_writer="Opool" default_pool_gatherer="Opool" src="./context_nemo.xml"/>       <!--  NEMO       -->

</simulation>

BTW this setup is running with 1019 ocean cores + 90 XIOS3 servers across 26 64-core Ice Lake nodes, taking about 1.6 hours per simulated year. This should be very close to your setup.

Thanks for replies and hints, guys!

I updated to XIOS3 (revision 2596), but then the model refused to run. After some debugging and following Andrew’s GitLab post carefully, I managed to set up a run with OpenIFS + NEMO + OASIS + XIOS that compiles and runs.

However, the run freezes near the end: not on the last time step, but on day 30 of a 31-day run.
OASIS indicates that OpenIFS is waiting for SST from NEMO, i.e. NEMO has frozen for some reason. There is no error in ocean.output, and there is nothing in the system logs either (e.g. no OOM kill).

This will take some work to figure out and it could very well be a problem with our system and not NEMO or XIOS. I’ll update this thread again if we get something running and if the scaling is better.

Cheers
Joakim

I am not sure it will help, but a user reported that with XIOS3 you must replace "not used" with "oceanx" in the call to xios_initialize in src/OCE/nemogcm.F90.
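For clarity, the change amounts to something like the following (a sketch; the exact argument list depends on your NEMO revision, and "oceanx" is the OASIS component name for the ocean):

```fortran
! sketch of the change in src/OCE/nemogcm.F90 (coupled run, OASIS + XIOS3);
! the surrounding argument list depends on your NEMO revision
! before:
!   CALL xios_initialize( "not used", local_comm=ilocal_comm )
! after: pass the actual OASIS component name so XIOS3 can match it
CALL xios_initialize( "oceanx", local_comm=ilocal_comm )
```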

Thanks, Sebastien. I already did this.
/J