Set non-optimal domain decomposition and balance process distribution over CPU nodes

Dear all,
I run PISCES using v4.0.6 with XIOS 2.5. I have already prepared the dynamical fields using the previous version, NEMO 3.6. The configuration uses 797 procs (jpni x jpnj = 32 x 52) and I kept this domain decomposition for the PISCES run as well. In several submitted jobs the model “freezes” before the first time step: it stops writing to ocean.output before reading the dynamical fields, but the job itself is not killed.
However, when I use the optimal decomposition suggested in ocean.output (55 x 31 = 793), this problem doesn’t seem to occur.
Is there any way to fix this problem, or will I have to follow the new domain decomposition?

Note that this problem sometimes disappears when I change nn_itend, but this seems to be random.

Specifying jpni and jpnj should not be a problem as long as you run the model on the appropriate number of cores (i.e. jpni*jpnj).
The strange behaviour of the model in your tests suggests that you may have a problem with the amount of memory you are trying to allocate.

  • Can you try running the model on 797 procs (jpni x jpnj = 32 x 52) without XIOS (i.e. without key_iomput)?
  • How do you distribute the XIOS processes among the NEMO processes? Did you try to spread the XIOS processes among the NEMO processes? For example, put 1 or 2 XIOS processes on each node.

I put 19 NEMO processes and 1 XIOS process on each node (each node has 20 processes), and 2 XIOS processes on the last node. The administrator of the HPC system told me that he didn’t find any memory issue during the run, but I don’t know if there is any other way to check for the problem.
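To make the layout described above easier to check, here is a minimal Python sketch of the arithmetic. It assumes that every node is filled with NEMO processes plus a fixed number of XIOS servers, and that any cores left over on the last node go to extra XIOS servers. The function name `node_layout` and its parameters are hypothetical and are not part of NEMO or XIOS.

```python
def node_layout(n_nemo, cores_per_node, xios_per_node=1):
    """Return a list of (nemo, xios) process counts, one tuple per node.

    Assumption: leftover cores on the last node are given to XIOS servers.
    """
    nemo_per_node = cores_per_node - xios_per_node
    if nemo_per_node <= 0:
        raise ValueError("xios_per_node must be smaller than cores_per_node")

    layout = []
    remaining = n_nemo
    while remaining > 0:
        nemo_here = min(nemo_per_node, remaining)
        # A partially filled node uses its spare cores for XIOS.
        xios_here = cores_per_node - nemo_here
        layout.append((nemo_here, xios_here))
        remaining -= nemo_here
    return layout


# Values from the post: 797 NEMO processes, 20 cores per node, 1 XIOS per node.
layout = node_layout(797, 20)
print(len(layout), "nodes; last node (nemo, xios) =", layout[-1])
```

Summing the second element of each tuple gives the total number of XIOS servers implied by this reading of the layout.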

So does that mean you are using 42 XIOS servers? You don’t mention which configuration you are running but if it is one of the eORCA grids then you may have latitude bands in Antarctica for which some of your XIOS servers have no sea points. This is another possible cause.

Dear Marilia,
Did you solve this issue?
I am encountering similar troubles: the model freezes for no obvious reason at the end of the initialization phase.
Thanks,
Anne

Dear Anne,
I have not completely solved the problem, but I managed to overcome this issue in a way. I chose an appropriate domain decomposition according to the note:

Due to the different domain decompositions between XIOS and NEMO, if the total number of cores is larger than the number of grid points in the j direction then the model run will fail

I also spread the XIOS processes among the NEMO processes. When the freeze happened again, I changed the value of nn_itend and the model ran, although I do not know why this makes a difference!
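The quoted note can be turned into a quick check before submitting a job. The sketch below is a literal reading of the note, assuming that "total number of cores" means NEMO plus XIOS processes and that the number of grid points in the j direction is the global size jpjglo. The function name and the example values for the XIOS count and jpjglo are hypothetical; take the real ones from your own configuration.

```python
def cores_exceed_j_points(n_nemo, n_xios, jpjglo):
    """Return True if NEMO + XIOS cores exceed the number of j grid points.

    Assumption: this is a literal reading of the quoted note, not an
    official NEMO or XIOS check.
    """
    return (n_nemo + n_xios) > jpjglo


# 797 NEMO processes as in the thread; the XIOS count and jpjglo
# below are placeholders only.
N_XIOS_EXAMPLE = 42    # hypothetical
JPJGLO_EXAMPLE = 1000  # hypothetical
at_risk = cores_exceed_j_points(797, N_XIOS_EXAMPLE, JPJGLO_EXAMPLE)
print("decomposition at risk according to the note:", at_risk)
```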

Do you use xios in attached or detached (server) mode? Thanks for your reply, Anne