Hello,
I’m looking for advice / best practices on NetCDF formatting of NEMO input forcing files to reduce unnecessary I/O on I/O-limited, shared HPC systems (parallel filesystem, multiple users running at once).
Context
We run a physical + biogeochemical configuration, so we have many different forcing datasets (atmosphere, rivers, waves, biogeochemical boundary conditions, etc.). On our machine, sustained reads from $SCRATCH can become a bottleneck for everyone. I would like to better understand how NEMO reads forcing files and what file-formatting choices can help.
My expectation is that, in a typical setup, the model should mostly read forcing when it needs a new forcing record (depending on frequency/interpolation), rather than continuously pulling data from disk at a high rate throughout the run. In practice, I sometimes observe sustained disk activity that seems larger than expected (e.g. >100 MB/s).
While investigating, I noticed that certain NetCDF characteristics and preprocessing steps seem to influence the I/O pattern (sometimes a lot), but I don’t have a solid theoretical understanding of why.
What I would like to understand / questions
- How does NEMO read forcing in parallel?
  - Do all MPI ranks open/read forcing files?
  - Does each rank read only its subdomain, or are there collective reads / internal buffering?
  - Is there any known caching/buffering behavior in iom_get / fldread that could explain sustained reads?
- Chunking (NetCDF4/HDF5): what is recommended for input forcing?
  - Is there a recommended chunk layout for variables that are typically accessed one timestep/record at a time?
  - Are any chunking patterns known to be “bad” and to lead to large re-reads or inefficient access? For instance, it seemed to me that larger time chunks increase disk read rates.
- Unlimited time dimension
- Compression
  - Can compression materially change how NEMO reads forcing (e.g. sustained reads because of decompression patterns)?
  - Is it recommended to avoid compressing coordinate variables (time/lat/lon), or does it not matter?
  - Are there any recommended “safe” compression settings for forcing files?
- File splitting / preprocessing tools
  - Does splitting forcing into yearly vs monthly vs daily files have a known impact on disk read rates?
  - Are there known issues where tools like cdo splitmon / ncrcat / ncks silently change chunking, time-axis settings, or variable layout in a way that affects NEMO I/O?
- Existing guidance / references
  - Is there any documentation or previous forum thread that gives a checklist for input forcing file formatting (dimensions, order, chunking, compression, coordinates) for NEMO?
Any guidance, pointers, or “do this / avoid that” rules of thumb would be greatly appreciated.
Thank you!
Mathurin (mchoblet)
Hi,
I can mostly answer point 5 (file splitting / preprocessing tools).
Turning on compression forces chunked storage in NetCDF-4 (deflate is applied per chunk).
By default, CDO strongly prefers to read and write whole 2D (x–y) chunks. It is partially possible to influence chunking in the latest versions of CDO (>= 2.5.2), but unless you have verified the behavior with your version and settings, assume that CDO will rechunk your output file to its default.
For fine-grained control over chunking, I use recent versions of NCO (ncpdq) and set the compression and chunking manually; you can change them on a per-variable basis.
I would double-check with “ncdump -hs” that the file actually has the chunking it is supposed to. I remember that in some circles it is considered bad practice to compress coordinate variables, but that may be outdated info.
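As a concrete sketch of that workflow (file names and chunk sizes are placeholders, not a tested recipe; requires NCO and the netCDF utilities):

```shell
# Rewrite a forcing file as NetCDF-4 with deflate level 1 and explicit
# per-dimension chunking (placeholder file names and chunk sizes):
ncks -4 -L 1 --cnk_dmn time,1 --cnk_dmn lat,64 --cnk_dmn lon,64 \
     forcing.nc forcing_rechunked.nc

# Verify what actually ended up in the file: the hidden attributes
# _ChunkSizes, _DeflateLevel and _Storage show the per-variable layout.
ncdump -hs forcing_rechunked.nc | grep -E '_ChunkSizes|_DeflateLevel|_Storage'
```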
In theory compression should reduce disk I/O at the expense of CPU time, but real machines hold odd surprises. For example: on my machine I can read compressed files with XIOS but not write them.
I suspect that NEMO's reading patterns differ when XIOS is used for that file/field, and depend on the XIOS version and settings. XIOS 3 should offer better control over chunking (and compression), though mostly on the writing side.
Tuning the file system for XIOS's parallel access patterns is therefore another option, though possibly a temporary one. If the I/O comes in bursts, you could consider caching the files for that period on a RAM disk or on node-local storage; maybe there is a setting your cluster admins can help with in that regard. On Lustre, some options can be set on a per-directory basis that influence parallel performance and caching.
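For those Lustre per-directory options, the lfs tool is the usual entry point (a sketch; the stripe count and size values are illustrative, not a recommendation):

```shell
# Set striping on a directory so files created there inherit it
# (illustrative values: 4 stripes of 4 MiB each):
lfs setstripe -c 4 -S 4M /scratch/$USER/forcing

# Check what striping an existing file or directory actually has:
lfs getstripe /scratch/$USER/forcing
```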
The ideal numbers probably depend on your grid configuration and on the number of processes and nodes.
Hope these thoughts are helpful for your optimization.
I don’t have any answers, but I would be curious to hear from the developers or anyone with insight on these questions. We have had issues running NEMO on our local cluster: jobs sometimes hang (not crashing; just ‘frozen’ until the scheduler times them out), and memory usage always maxes out regardless of how high the limit is set, even though the job is never killed. The hanging turned out to be related to NFS file locking, and we now run with it disabled, which lets us complete runs more reliably. The memory issue remains; our IT admins tell us it is due to page caching being counted in the job scheduler’s memory tracking. I’m not an expert on these things, but I suspect both issues could be related to the way forcing files are read during the run, or perhaps there is a better way for us to set them up.
I can’t provide authoritative answers, but from my personal experience:
- If the data is uncompressed, I think that the fldread parallelism is fine – normally each process reads/buffers its own subdomain only.
- NEMO does not read coordinate fields.
- Input files should not be chunked in the (y,x) dimensions.
- Forcing files having time dimension unlimited is good practice (and was necessary at least in NEMO3.6, as far as I remember).
- It’s much faster and more efficient to provide NEMO with uncompressed files. If you start from compressed data, then whatever you do will involve decompression at some stage; what you get to decide is at which stage. When you feed NEMO a compressed file, NEMO does the decompression, which is not cost- or time-effective: you are using a heavily MPI-parallelised job (a NEMO run) to do data decompression (essentially, I/O). IMO, it’s better to decompress all the forcing fields the current job will read as a pre-run task running with limited resources, have NEMO read that, and delete the temporarily uncompressed copies once the run is done.
- File splitting: my gut feeling is that if the time dimension is unlimited and the data’s uncompressed, then it doesn’t matter.
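A minimal sketch of that pre-run decompression idea, assuming the nccopy utility and hypothetical placeholder paths ($FORCING_DIR, $RUNDIR):

```shell
# Make uncompressed scratch copies of the forcing files before the run;
# point the NEMO namelist at $RUNDIR/forcing_uncompressed instead of
# the originals, and remove the copies once the run is done.
mkdir -p "$RUNDIR/forcing_uncompressed"
for f in "$FORCING_DIR"/*.nc; do
    # -d 0: rewrite with deflate level 0, i.e. uncompressed
    nccopy -d 0 "$f" "$RUNDIR/forcing_uncompressed/$(basename "$f")"
done

# ... run NEMO ...

rm -rf "$RUNDIR/forcing_uncompressed"
```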
Take this with a pinch of salt – but still, I hope this helps.
Thanks a lot for the replies — very helpful.
I had the same impression regarding CDO: in my workflow it often seems to rewrite/rechunk files in ways that are not necessarily ideal for NEMO’s access pattern, so I’ll be more systematic about checking the layout with ncdump -hs after CDO operations.
The point about file locking (NFS) is also something I had never considered before, so I’ll keep that in mind if we see any “frozen” jobs or odd I/O behaviour on our side.
Finally, I haven’t really explored spatial chunking yet; the default in my forcing data is maximal spatial chunks (a single chunk covering the full lat/lon extent). For most of my forcings the grids are not huge, but for very high spatial resolution products (e.g. waves) it may be worth testing whether chunk shape affects performance/I/O patterns in practice.
I will also test with fully decompressing some forcing data before running the model.
If anyone has additional insights, references, or concrete recommendations for “NEMO-friendly” input layout (chunking/compression/time axis), I’d be very happy to hear them.
Hi,
It’s been a while since I’ve looked at this, but from memory:
As I understand it, each rank reads what it needs. So for BDY files, only the few ranks along the edge will open the files and load the subset corresponding to their tile’s boundary. For surface forcing, each rank reads the subset it needs (with or without on-the-fly interpolation), and similarly for restart/initial-condition (RST/IC) files. I think this is about as efficient as it could be in terms of minimal wasted loading. If we could delegate this reading to XIOS, that could be better for the filesystem – one process loading data in large sequential chunks instead of N processes loading tiny bits – essentially the reverse of what it does for output.
If compression is on, reading any point requires decompressing the chunk containing it – the NetCDF library handles this part (called via fldread). With no explicit chunking, that means the entire NetCDF variable (or at least up to the point you need, so I suppose amortized it’s half the variable). As you reduce the chunk size, the amount to read/decompress drops. If you chunk surface forcing as (x,y,t)=(nx,ny,1) – along time only – then each access requires decompressing one 2D slab, even if the rank needs just a 3x3 patch of points. Chunking as (nx/10, ny/10, 1) reduces the chunk size to decompress 100-fold.
In terms of recommendations: no compression is likely fastest in most cases (at the expense of extra storage and/or a decompression prep task). If you do compress, I’ve found that chunking 2D surface forcing with (x,y,t)=(64,64,1) works pretty well. Since each rank loads just what it needs, this is often one chunk, sometimes two or three if the particular tile straddles chunk boundaries. That chunk size is 16 kB – big enough that the compression scheme can actually compress it, but small enough to be rounding error in terms of time spent on decompression (which is further amortized over the number of time steps between forcing records).
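To put numbers on the chunk layouts discussed above, here is a back-of-envelope in shell arithmetic; the 1440x720 grid and 4-byte floats are my illustrative assumptions, not figures from this thread:

```shell
# Bytes that must be decompressed per forcing access under three layouts,
# for a hypothetical 1440x720 float32 surface field:
nx=1440; ny=720; bpp=4   # grid size and bytes per point (assumptions)

full_slab=$(( nx * ny * bpp ))           # (nx,ny,1): one whole 2D slab per access
tenth=$(( (nx / 10) * (ny / 10) * bpp )) # (nx/10,ny/10,1): 100x smaller chunks
small=$(( 64 * 64 * bpp ))               # (64,64,1) chunks

echo "full slab: $full_slab bytes"   # ~4 MB decompressed even for a 3x3 patch
echo "1/100    : $tenth bytes"
echo "64x64    : $small bytes"       # 16384 bytes = 16 kB
```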
Interesting topic and replies, thanks everyone.
I once ran a benchmark with NEMO 4.2.2, eORCA1, and ERA5 forcing files: uncompressed versus compressed at deflate level 1 with chunking along time only.
After reading the replies above, I’m not sure what to think about my chunking, but anyway, the result of the benchmark came as a surprise: NEMO ran 15% faster with the compressed files. My interpretation was that reading less data and decompressing it was faster than reading more data uncompressed.
I wouldn’t have bet on this… Were you using pre-interpolated forcing files (on eORCA1, so presumably quite small)? I wonder whether you’d get the same result with, say, eORCA025… It probably depends on local domain size and HPC architecture…
I was using the ERA5 files on their native 0.25° grid, so I don’t think eORCA1 vs eORCA025 matters. But again, I was surprised by the result too.