Dear All,
I’m trying to run AGRIF. With one child level (1_), it runs fine, but I have problems with the processor map when trying to run a subnest (a child within the child).
I do not specify jpni/jpnj at any level (they are all set to zero), and I get the following error at level 2_ (see below). I’ve tried changing jpni/jpnj at level 2_, and many other tricks, but nothing works. Can anyone help?
Thanks in advance,
Robinson
_____________________
mpp_init:
The number of mpi processes: 966
exceeds the maximum number of subdomains (ocean+land) = 961
defined by the following domain decomposition: jpni = 31 jpnj = 31
You should:
- either prescribe your domain decomposition with the namelist variables
jpni and jpnj to match the number of mpi process you want to use,
even IF it not the best choice...
- or keep the automatic and optimal domain decomposition by picking up one
of the number of mpi process proposed in the list bellow
For your information:
list of the best partitions including land supression
-----------------------------------------------------
nb_cores oce: 1849, land domains excluded: 0 ( 0.0%), largest oce domain: 49 ( 7 x 7 )
nb_cores oce: 1643, land domains excluded: 0 ( 0.0%), largest oce domain: 54 ( 6 x 9 )
nb_cores oce: 1548, land domains excluded: 0 ( 0.0%), largest oce domain: 56 ( 7 x 8 )
nb_cores oce: 1431, land domains excluded: 0 ( 0.0%), largest oce domain: 60 ( 6 x 10 )
nb_cores oce: 1333, land domains excluded: 0 ( 0.0%), largest oce domain: 63 ( 7 x 9 )
nb_cores oce: 1296, land domains excluded: 0 ( 0.0%), largest oce domain: 64 ( 8 x 8 )
nb_cores oce: 1272, land domains excluded: 0 ( 0.0%), largest oce domain: 66 ( 6 x 11 )
nb_cores oce: 1161, land domains excluded: 0 ( 0.0%), largest oce domain: 70 ( 7 x 10 )
nb_cores oce: 1116, land domains excluded: 0 ( 0.0%), largest oce domain: 72 ( 8 x 9 )
nb_cores oce: 1032, land domains excluded: 0 ( 0.0%), largest oce domain: 77 ( 7 x 11 )
nb_cores oce: 972, land domains excluded: 0 ( 0.0%), largest oce domain: 80 ( 8 x 10 )
If you do not specify jpni and jpnj, NEMO will pick the best partition and try to match it to the number of processes you provide, keeping some land subdomains if needed. For example, if you provide N cores, NEMO will compute jpni, jpnj, noce, land_exclude and land_keep such that:
noce is an optimal domain decomposition
noce = jpni * jpnj - land_exclude
noce <= N
N = noce + land_keep
0 <= land_keep <= land_exclude
See section 2.1 of Irrmann et al. 2022 for all details.
If NEMO is unable to find a suitable solution, because N = noce + land_keep would require land_keep > land_exclude, the model stops with the error message you got.
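The conditions above can be checked by hand for a given candidate decomposition. The following is a minimal Python sketch, not NEMO code: the function name `land_keep_needed` is hypothetical, and `land_exclude = 0` for the 31 x 31 case is an assumption based on the "land domains excluded: 0" entries in the log.

```python
def land_keep_needed(n_procs, jpni, jpnj, land_exclude):
    """Return land_keep for a candidate decomposition, or None if no valid value exists.

    Implements only the relations quoted in the post:
      noce = jpni * jpnj - land_exclude
      N    = noce + land_keep,  with 0 <= land_keep <= land_exclude
    """
    noce = jpni * jpnj - land_exclude
    land_keep = n_procs - noce
    if 0 <= land_keep <= land_exclude:
        return land_keep
    return None  # would need land_keep > land_exclude (or noce > N)


# Case from the error message: 966 processes, jpni = jpnj = 31.
# land_exclude = 0 is assumed here (the log reports no excluded land domains).
print(land_keep_needed(966, 31, 31, 0))  # None: 5 extra subdomains would be needed
print(land_keep_needed(961, 31, 31, 0))  # 0: exact match
```

With these assumed numbers, 966 processes would need land_keep = 5 while land_exclude = 0, which is the situation where the model stops.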
The solution is then to specify jpni and jpnj in the namelist. The simplest way is to define them such that N = jpni * jpnj. You will not use an optimal domain decomposition, but the model will run.
We plan to fix this in a future release by allowing some idle cores when running NEMO. We will then have:
if land_keep <= land_exclude then
noce = jpni * jpnj - land_keep
else
noce = jpni * jpnj - land_exclude - n_idle_procs
Sébastien
Irrmann, G., Masson, S., Maisonnave, É., Guibert, D., & Raffin, E. (2022). Improving ocean modeling software NEMO 4.0 benchmarking and communication efficiency. Geoscientific Model Development, 15(4), 1567–1582.
OK, now it works. I describe below what I did, through an example, to help others who may run into the same problem.
I set jpni*jpnj = 966, which is the total number of processors used to run NEMO and is not one of the numbers suggested in the list printed when it crashes. I simply searched for a factorisation of 966 that is as close to square as possible (42 x 23):
ln_listonly = .false. ! do nothing else than listing the best domain decompositions (with land domains suppression)
! ! if T: the largest number of cores tested is defined by max(mppsize, jpni*jpnj)
ln_nnogather = .false. ! activate code to avoid mpi_allgather use at the northfold
jpni = 42 ! number of processors following i (set automatically if < 1), see also ln_listonly = T
jpnj = 23 ! number of processors following j (set automatically if < 1), see also ln_listonly = T
nn_hls = 1 ! halo width (applies to both rows and columns)
nn_comm = 1 ! comm choice
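The search for a near-square factorisation of the process count can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of NEMO; the helper name `squarest_factor_pair` is hypothetical, and which factor goes to jpni or jpnj is an assumption that depends on the shape of your grid.

```python
import math


def squarest_factor_pair(n):
    """Return (large, small) with large * small == n and the two factors as close as possible."""
    # Walk down from floor(sqrt(n)); the first divisor found gives the most square pair.
    for small in range(math.isqrt(n), 0, -1):
        if n % small == 0:
            return n // small, small


jpni, jpnj = squarest_factor_pair(966)
print(jpni, jpnj)  # 42 23
```

If the process count is prime, the only pair is (N, 1), so it can be worth changing the number of processes rather than accepting a very elongated decomposition.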
Robinson, which version are you using? With NEMO 5 you must have nn_hls = 2.
Ah, now it crashes again with the error below. Sébastien, I saw that you answered a similar question on another page. Should I modify the code as described there (Ice suppression on child grid fails when writing restarts through XIOS (#604) · Issues · NEMO Workspace / Nemo · GitLab), even though all levels should indeed have sea ice?
____________________________________________
iccontext.cpp terminate called after throwing an instance of ‘xios::CException’
void cxios_context_handle_create(xios::CContext **, const char *, int) 34
terminate called after throwing an instance of ‘xios::CException’
In file “iccontext.cpp”, function “void cxios_context_handle_create(xios::CContext **, const char *, int)”, line 54 → Context 2_nemo unknown
The error message you get (Context 2_nemo unknown) indicates that you probably forgot to specify a context for your second-level child (“subnest”) in iodef.xml.
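A quick way to catch this before launching the run is to list the contexts declared in iodef.xml and compare them with the grids you expect. The Python sketch below is an assumption-laden illustration: the helper `missing_contexts` is hypothetical, the sample XML is a simplified stand-in for a real iodef.xml, and the `src` file names are assumed rather than taken from this thread. Only the context id `2_nemo` comes from the error message.

```python
import xml.etree.ElementTree as ET


def missing_contexts(iodef_text, n_child_grids):
    """Return the expected NEMO/AGRIF context ids that are not declared in iodef_text."""
    root = ET.fromstring(iodef_text)
    declared = {ctx.get("id") for ctx in root.iter("context")}
    # Parent grid uses "nemo"; child grid k is assumed to use "<k>_nemo".
    expected = ["nemo"] + [f"{k}_nemo" for k in range(1, n_child_grids + 1)]
    return [name for name in expected if name not in declared]


# Simplified sample with the second-level child missing (file names are assumed).
SAMPLE_IODEF = """<simulation>
  <context id="nemo" src="./context_nemo.xml"/>
  <context id="1_nemo" src="./1_context_nemo.xml"/>
  <context id="xios"/>
</simulation>"""

print(missing_contexts(SAMPLE_IODEF, 2))  # ['2_nemo']
```

To check a real setup, read the file contents and pass them in place of `SAMPLE_IODEF`.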
Ah yes, indeed. I went a bit too quickly. Thanks a lot, Franziska.