[v4.2.x] MPI error at open boundary

Dear all,

The current CMEMS Mediterranean Sea configuration runs with NEMO v3.6, and we are now testing the 4.2_RC. Simulations with closed boundaries run fine, but when we open the Atlantic boundaries several problems appear. We would appreciate any comments that might help us.

Here is a picture of the current CMEMS Medsea configuration and the boundary options defined in the 3.6 version:

  1. We have a problem with the domain decomposition. We tried several options for jpni/jpnj, and the cases with jpni>1 gave this error:
Abort(738313486) on node 71 (rank 71 in comm 0): Fatal error in PMPI_Recv: Message truncated, error stack:
PMPI_Recv(173): MPI_Recv(buf=0x48cbb0d0, count=655, MPI_DOUBLE_PRECISION, src=69, tag=4, comm=0x84000006, status=0x1) failed
(unknown)(): Message truncated


jpni=1 / jpnj=72 / nn_comm=1 / ln_bdy=.true. / OK
jpni=2 / jpnj=36 / nn_comm=1 / ln_bdy=.false. / OK
jpni=2 / jpnj=36 / nn_comm=1 / ln_bdy=.true. / Fatal error in PMPI_Recv: Message truncated, error stack
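
For reference, the failing case corresponds to settings along these lines (a sketch of the relevant namelist blocks only; the exact entries of course depend on the full configuration):

```fortran
&nammpp        ! massively parallel processing
   jpni    =  2       ! number of MPI subdomains along i (failing case; jpni=1 works)
   jpnj    = 36       ! number of MPI subdomains along j
   nn_comm =  1       ! communication scheme
/
&nambdy        ! unstructured open boundaries
   ln_bdy  = .true.   ! activate open boundaries (error only appears when .true.)
/
```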

We tried to define the boundary with segments (in case coordinates.bdy.nc was wrong), but the same error appears when the northern and/or southern Atlantic boundaries are opened. There is no problem with the western boundary.
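
For context, defining a boundary segment in the namelist instead of reading coordinates.bdy.nc looks roughly like this (the indices below are placeholders, not our actual values):

```fortran
&nambdy
   ln_coords_file = .false.   ! define boundaries via nambdy_index instead of a coordinates file
/
&nambdy_index
   ctypebdy = 'N'             ! orientation of the segment: northern boundary
   nbdyind  = -1              ! i/j index of the segment (-1 = default, along the domain edge)
   nbdybeg  =  2              ! segment start index (placeholder)
   nbdyend  = 10              ! segment end index (placeholder)
/
```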

If it can help, this error does not appear with the trunk revision 14657.

Would you have suggestions for the choice of the domain decomposition or any idea about this error?

  2. If we continue the run using the decomposition with jpni=1, the model blows up after a few time steps. The model blows up only with the orlanski_npo option for dyn3d; the problem is not visible with dyn3d='frs' or 'orlanski'. It seems to be an MPI issue at points 1 and jpj-1 (see picture with 48 processors).

We are using the same coordinates.bdy and nambdy_dta files as in the CMEMS Medsea configuration with NEMO 3.6; could that be a problem?

  3. Another point concerns the Dardanelles strait. Could someone explain to us why the Orlanski option can no longer be used when boundaries lie in the interior of the computational domain (see the message in bdyini: 'Orlanski is not safe when the open boundaries are on the interior of the computational domain')?



Which value do you use for nn_hls?
I am currently tracking a bug somewhere in BDY. I have it with some domain decomposition and nn_hls = 2. @jchanut also has the error with nn_hls = 1 but for me it is ok…

Regarding Orlanski: as far as I remember, it uses diagonal points (like (i+1,j+1) or (i-1,j-1)), which are quite a nightmare to deal with in MPI, and it does not work if the MPI domain decomposition passes right along (or within 1 point of) your BDY.
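
For context (my reading of the scheme, in the oblique-radiation form of Marchesiello et al. 2001, which is what NEMO's Orlanski options are based on as far as I know), the radiated field $\phi$ on the boundary satisfies

$$\frac{\partial \phi}{\partial t} + c_x \frac{\partial \phi}{\partial x} + c_y \frac{\partial \phi}{\partial y} = 0, \qquad c_y \propto -\frac{\partial \phi}{\partial t}\,\frac{\partial \phi / \partial y}{\left(\nabla \phi\right)^2} .$$

Estimating the tangential phase speed $c_y$ requires discretising $\partial \phi / \partial y$ one row inside the boundary, which (for, say, an eastern boundary at index $i$) brings in points like $(i-1, j+1)$ and $(i-1, j-1)$. That is where the diagonal/corner dependence comes from, and why a subdomain edge falling on or next to the rim is problematic.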

I am using nn_hls=1.

nn_hls=2 does not change anything for the domain-decomposition error; it seems to fix the MPI error with orlanski_npo, but it gives quite different results at the boundary compared with NEMO v3.6.

Do you use the very last revision of the trunk? I tried an intermediate revision (like 5 revisions back?) which crashed.

I have just updated the revision to the last one and I have the same issue.

Please try r15345; hopefully it solves the problem, at least for the non-Orlanski BDY.

Unfortunately, the MPI error is still there when I activate ln_bdy.

yes, it is better, but there is still a potential bug when the BDY rim is located on a corner of an MPI subdomain. I am working on it…

please try r15368, it should work…

The [15368] revision solves the issue. I have no more MPI errors with bdy. Thanks a lot!

I also had problems with different domain decompositions and was following this issue. I can confirm that the problem is solved with revision 15368.
Thank you very much @smasson!

yes, it took me a while but I think (hope) I finally got it! :sweat_smile:


I have a similar issue for a different domain in UKMO/NEMO_4.0.4_mirror @14075. When a corner is required for BDY, I get exactly the same problem as the OP.

I have tried introducing the fix in [15368] into r4.0.4, but I had to substitute some variables to account for differences between the versions:

nn_hls = 1
Nis0 --> 1 + nn_hls
Nie0 --> jpi - nn_hls
Njs0 --> 1 + nn_hls
Nje0 --> jpj - nn_hls
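
In other words, where the 4.2 code addresses the interior of a subdomain through the new index variables, the backport spells the halo offsets out explicitly. Schematically (this is just the bound substitution, not the actual BDY routine):

```fortran
! NEMO 4.2 (r15368) style, looping over the interior of the MPI subdomain:
DO jj = Njs0, Nje0
   DO ji = Nis0, Nie0
      ! ... BDY corner fix ...
   END DO
END DO

! NEMO 4.0.4 equivalent, with nn_hls = 1:
DO jj = 1 + nn_hls, jpj - nn_hls
   DO ji = 1 + nn_hls, jpi - nn_hls
      ! ... BDY corner fix ...
   END DO
END DO
```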

This implementation does seem to fix the corner issue with BDY; however, I am concerned about possible side effects I might have overlooked.

Any suggestions on how to properly apply the fix in [15368] to NEMO 4.0.4 would be greatly appreciated.