AGRIF - Subnest processor map

Dear All,

I’m trying to run AGRIF. With one child level (1_), it runs fine. But I have problems with the processor map when trying to run a subnest (a child within the child).

I do not specify jpni/jpnj at any level (they are set to zero), and I get the following error at level 2_ (see below). I’ve tried changing jpni/jpnj at level 2_, and a lot of other tricks, but nothing works. Can anyone help?

Thanks in advance,

Robinson

_____________________

mpp_init:

  The number of mpi processes:   966
  exceeds the maximum number of subdomains (ocean+land) =   961
  defined by the following domain decomposition: jpni =   31 jpnj =   31
   You should: 
     - either prescribe your domain decomposition with the namelist variables
       jpni and jpnj to match the number of mpi process you want to use, 
       even IF it not the best choice...
     - or keep the automatic and optimal domain decomposition by picking up one
       of the number of mpi process proposed in the list bellow


                  For your information:
  list of the best partitions including land supression
  -----------------------------------------------------

nb_cores oce:   1849, land domains excluded:      0 ( 0.0%), largest oce domain:        49 (      7 x      7 )
nb_cores oce:   1643, land domains excluded:      0 ( 0.0%), largest oce domain:        54 (      6 x      9 )
nb_cores oce:   1548, land domains excluded:      0 ( 0.0%), largest oce domain:        56 (      7 x      8 )
nb_cores oce:   1431, land domains excluded:      0 ( 0.0%), largest oce domain:        60 (      6 x     10 )
nb_cores oce:   1333, land domains excluded:      0 ( 0.0%), largest oce domain:        63 (      7 x      9 )
nb_cores oce:   1296, land domains excluded:      0 ( 0.0%), largest oce domain:        64 (      8 x      8 )
nb_cores oce:   1272, land domains excluded:      0 ( 0.0%), largest oce domain:        66 (      6 x     11 )
nb_cores oce:   1161, land domains excluded:      0 ( 0.0%), largest oce domain:        70 (      7 x     10 )
nb_cores oce:   1116, land domains excluded:      0 ( 0.0%), largest oce domain:        72 (      8 x      9 )
nb_cores oce:   1032, land domains excluded:      0 ( 0.0%), largest oce domain:        77 (      7 x     11 )
nb_cores oce:    972, land domains excluded:      0 ( 0.0%), largest oce domain:        80 (      8 x     10 )

If you do not specify jpni and jpnj, NEMO picks the best partition and tries to match it to the number of processes you provide, keeping some land subdomains if needed. For example, if you provide N cores, NEMO computes jpni, jpnj, noce, land_exclude and land_keep such that:

noce is an optimal domain decomposition
noce = jpni * jpnj - land_exclude 
noce <= N
N = noce + land_keep 
0 <= land_keep <= land_exclude

See section 2.1 of Irrmann et al. 2022 for all details.

If NEMO is unable to find a suitable solution, because N = noce + land_keep would require land_keep > land_exclude, the model stops with the error message you got.
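The condition above can be checked by hand with a few lines of Python. This is only a sketch of the arithmetic described in this post, not the actual mpp_init code: the function name `check_decomposition` is hypothetical, and land_exclude = 0 for the 31 x 31 decomposition is an assumption (the result is the same for any value, since 966 > 31 * 31).

```python
def check_decomposition(n_procs, jpni, jpnj, land_exclude):
    """Return land_keep if n_procs fits the decomposition, else None.

    Hypothetical helper: it only reproduces the relations
    noce = jpni * jpnj - land_exclude, N = noce + land_keep,
    0 <= land_keep <= land_exclude.
    """
    noce = jpni * jpnj - land_exclude   # ocean-only subdomains
    land_keep = n_procs - noce          # land subdomains that would have to be kept
    if 0 <= land_keep <= land_exclude:
        return land_keep
    return None


# Values from the error message: 966 MPI processes, jpni = jpnj = 31.
# land_exclude = 0 is an assumption made for this illustration.
print(check_decomposition(966, 31, 31, 0))  # no valid land_keep: 966 > 961
print(check_decomposition(961, 31, 31, 0))  # fits exactly, land_keep = 0
```

With 966 processes the function finds no valid land_keep, which corresponds to the case where the model stops.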

The solution is then to specify jpni and jpnj in the namelist. The simplest way is to define them such that N = jpni * jpnj. You will not use an optimal domain decomposition, but the model will run.

We plan to fix this in a future release by allowing some idle cores when running NEMO. We will then have:

if land_keep <= land_exclude then
  noce = jpni * jpnj - land_keep 
else 
  noce = jpni * jpnj - land_exclude - n_idle_procs

Sébastien

Irrmann, G., Masson, S., Maisonnave, É., Guibert, D., & Raffin, E. (2022). Improving ocean modeling software NEMO 4.0 benchmarking and communication efficiency. Geoscientific Model Development, 15(4), 1567–1582.

Merci Seb, I will try.

OK, now it works. I describe below what I did, with an example, to help others who run into the same problem.

I set jpni*jpnj = 966, the total number of processes used to run NEMO, which is none of the numbers suggested in the list printed when it crashes. I just searched for a factorization of 966 that is as close to square as possible.

ln_listonly  = .false. ! do nothing else than listing the best domain decompositions (with land domains suppression)
!                      ! if T: the largest number of cores tested is defined by max(mppsize, jpni*jpnj)
ln_nnogather = .false. ! activate code to avoid mpi_allgather use at the northfold
jpni         = 42      ! number of processors following i (set automatically if < 1), see also ln_listonly = T
jpnj         = 23      ! number of processors following j (set automatically if < 1), see also ln_listonly = T
nn_hls       = 1       ! halo width (applies to both rows and columns)
nn_comm      = 1       ! comm choice
Robinson, which version are you using? With NEMO 5 you must have nn_hls = 2.

I’m using NEMO 4.2.2.

Ah, now it crashes again, with the error below. Sébastien, I saw on a webpage that you answered a similar question. Should I modify the code as described there (Ice suppression on child grid fails when writing restarts through XIOS (#604) · Issues · NEMO Workspace / Nemo · GitLab), even though all levels should indeed have sea ice?

____________________________________________

iccontext.cpp terminate called after throwing an instance of ‘xios::CException’
void cxios_context_handle_create(xios::CContext **, const char *, int) 34
terminate called after throwing an instance of ‘xios::CException’
In file “iccontext.cpp”, function “void cxios_context_handle_create(xios::CContext **, const char *, int)”, line 54 → Context 2_nemo unknown

The error message you get (Context 2_nemo unknown) indicates that you probably forgot to define a context for your second-level child (“subnest”) in iodef.xml.
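A quick way to spot this kind of omission is to list the context ids declared in iodef.xml and compare them with the ones the run will ask for. The sketch below is only an illustration under assumptions: the XML sample is a hypothetical, heavily trimmed iodef.xml, the file names in `src` are guesses, and the helper `missing_contexts` is not part of NEMO or XIOS. Only the id 2_nemo is taken from the error message; the ids nemo and 1_nemo are assumed to follow the same prefix pattern.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed iodef.xml with the parent and first child only.
IODEF = """<simulation>
  <context id="xios"/>
  <context id="nemo" src="./context_nemo.xml"/>
  <context id="1_nemo" src="./1_context_nemo.xml"/>
</simulation>
"""


def missing_contexts(xml_text, n_agrif_grids):
    """List the context ids expected for the parent grid and n_agrif_grids
    AGRIF grids (assumed to be named k_nemo) that xml_text does not declare."""
    root = ET.fromstring(xml_text)
    found = {ctx.get("id") for ctx in root.iter("context")}
    expected = ["nemo"] + [f"{k}_nemo" for k in range(1, n_agrif_grids + 1)]
    return [cid for cid in expected if cid not in found]


print(missing_contexts(IODEF, 2))  # ['2_nemo']
```

In this sample, declaring a context with id 2_nemo alongside the existing ones would make the list empty.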

Ah yes, indeed. I went a bit too fast. Thanks a lot Franziska.