[v4.2.x] MPI error at open boundary

amoulin · 6 October 2021 15:00

Dear all,

The current CMEMS Mediterranean Sea configuration runs on NEMO v3.6, and we are now testing 4.2_RC. Simulations with closed boundaries run, but when we open the Atlantic boundaries several problems arise. We would appreciate any comments that might help us.

Here is a picture of the current CMEMS Medsea configuration and the boundary options defined in the 3.6 version:

We have a problem with domain decomposition. We tried several options for jpni/jpnj and the cases with jpni>1 gave this error:

Abort(738313486) on node 71 (rank 71 in comm 0): Fatal error in PMPI_Recv: Message truncated, error stack:
PMPI_Recv(173): MPI_Recv(buf=0x48cbb0d0, count=655, MPI_DOUBLE_PRECISION, src=69, tag=4, comm=0x84000006, status=0x1) failed
(unknown)(): Message truncated

Example:

jpni=1 / jpnj=72 / nn_comm=1 / ln_bdy=.true. / OK
jpni=2 / jpnj=36 / nn_comm=1 / ln_bdy=.false. / OK
jpni=2 / jpnj=36 / nn_comm=1 / ln_bdy=.true. / Fatal error in PMPI_Recv: Message truncated, error stack
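
For anyone wanting to test other layouts, the candidate jpni/jpnj pairs for a given number of MPI tasks can be listed with a few lines of Python. This is only a sketch: the helper name `decompositions` is hypothetical, and it simply enumerates the factor pairs of the task count (72 here, since 1 x 72 and 2 x 36 in the example above both give 72). It knows nothing about land-only subdomains or about what NEMO would pick on its own.

```python
def decompositions(ntasks):
    """Return all (jpni, jpnj) pairs with jpni * jpnj == ntasks."""
    pairs = []
    for jpni in range(1, ntasks + 1):
        if ntasks % jpni == 0:
            pairs.append((jpni, ntasks // jpni))
    return pairs


if __name__ == "__main__":
    # List every layout that uses exactly 72 MPI tasks.
    for jpni, jpnj in decompositions(72):
        print(f"jpni={jpni} / jpnj={jpnj}")
```

Running each listed layout with ln_bdy=.true. and ln_bdy=.false. would show which splits along i trigger the error.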

We tried to define the boundary with segments (in case the coordinate.bdy.nc was wrong) but the same error appears when the northern and/or southern Atlantic boundaries are opened. There is no problem with the western boundary.

In case it helps, this error does not appear with trunk revision 14657.

Would you have suggestions for the choice of the domain decomposition or any idea about this error?

If we continue the run with the jpni=1 decomposition, the model blows up after a few time steps. It blows up only with the orlanski_npo option for dyn3d; the problem does not appear with dyn3d='frs' or 'orlanski'. It looks like an MPI issue at points 1 and jpj-1 (see picture with 48 processors).

We are using the same coordinates.bdy and nambdy_dta files as for the CMEMS Medsea configuration with NEMO 3.6. Could that be a problem?

Another point concerns the Dardanelles Strait. Could someone explain why the Orlanski option can no longer be used when boundaries are in the interior of the computational domain (see the message in bdyini: 'Orlanski is not safe when the open boundaries are on the interior of the computational domain')?

Aimie

smasson · 6 October 2021 15:47

Which value do you use for nn_hls?
I am currently tracking a bug somewhere in BDY. I get it with some domain decompositions and nn_hls = 2. @jchanut also gets the error with nn_hls = 1, but for me that case is ok…

Regarding Orlanski, as far as I remember, it uses diagonal points (like (i+1,j+1) or (i-1,j-1)), which are quite a nightmare to deal with in MPI, and it does not work if the MPI domain decomposition passes just along (or within 1 point of) your BDY.
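
To illustrate that last point, here is a rough Python sketch, not NEMO code. It assumes a hypothetical helper `bdy_points_near_edge` that takes the inner-domain bounds of one MPI subdomain (1-based, inclusive) and a list of BDY points in the same local indexing, and flags the points lying within `margin` points of the inner-domain edge. That is one simple way to spot a decomposition line running along, or one point away from, the BDY.

```python
def bdy_points_near_edge(bdy_points, is0, ie0, js0, je0, margin=1):
    """Return the BDY points within `margin` points of the inner-domain edge.

    All indices are 1-based and inclusive, local to one MPI subdomain.
    """
    flagged = []
    for i, j in bdy_points:
        inside = is0 <= i <= ie0 and js0 <= j <= je0
        if not inside:
            continue  # point lies in the halo or in another subdomain
        # Distance (in grid points) to the closest inner-domain edge.
        dist = min(i - is0, ie0 - i, j - js0, je0 - j)
        if dist <= margin:
            flagged.append((i, j))
    return flagged


if __name__ == "__main__":
    # Hypothetical 10 x 10 local array with a 1-point halo: inner domain is 2..9.
    rim = [(5, 2), (5, 3), (5, 5), (9, 9)]
    print(bdy_points_near_edge(rim, is0=2, ie0=9, js0=2, je0=9))
```

In this made-up example the points (5, 2), (5, 3) and (9, 9) are flagged, while (5, 5) sits well inside the subdomain.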

amoulin · 6 October 2021 15:54

I am using nn_hls=1.

nn_hls=2 does not change anything for the domain decomposition error. It seems to solve the MPI error with orlanski_npo, but it gives quite different results at the boundary compared with NEMO v3.6.

clem · 6 October 2021 17:24

Are you using the very latest revision of the trunk? I tried an intermediate revision (like 5 revisions back?) which crashed.

amoulin · 7 October 2021 08:04

I have just updated to the latest revision and I still have the same issue.

smasson · 7 October 2021 15:13

Please try r15345. Hopefully it solves the problem, at least for the non-Orlanski BDY.

amoulin · 8 October 2021 08:52

Unfortunately, the MPI error is still there when I activate ln_bdy.

smasson · 8 October 2021 16:05

Yes, it is better, but there is still a potential bug when the BDY rim is located on a corner of an MPI subdomain. I am working on it…

smasson · 14 October 2021 08:30

Please try r15368; it should work…

amoulin · 14 October 2021 10:18

The [15368] revision solves the issue. I have no more MPI errors with bdy. Thanks a lot!

Franziska · 14 October 2021 10:35

I also had problems with different domain decompositions and was following this issue. I can confirm that the problem is solved with revision 15368.
Thank you very much @smasson!

smasson · 14 October 2021 10:48

yes, it took me a while but I think (hope) I finally got it!

afstyles · 8 November 2021 10:11

I have a similar issue for a different domain in UKMO/NEMO_4.0.4_mirror @14075. When a corner is required for BDY I get the exact same problem as the OP.

I have tried introducing the fix from [15368] into r4.0.4, but I had to substitute some variables to allow for differences between the versions:

nn_hls = 1
Nis0 --> 1 + nn_hls
Nie0 --> jpi - nn_hls
Njs0 --> 1 + nn_hls
Nje0 --> jpj - nn_hls

This implementation does seem to fix the corner issue with BDY; however, I am concerned about possible issues I might have overlooked.

Any suggestions on how to properly apply the fix in [15368] to NEMO 4.0.4 would be greatly appreciated.
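
For reference, the substitution listed above can be written out as a small Python sketch. The function name `inner_domain_bounds` is hypothetical, and the code only restates the mapping given in this post (Nis0, Nie0, Njs0, Nje0 expressed through jpi, jpj and nn_hls). It does not check whether that mapping is a complete backport of [15368].

```python
def inner_domain_bounds(jpi, jpj, nn_hls=1):
    """Return (Nis0, Nie0, Njs0, Nje0) from the local array size and halo width."""
    nis0 = 1 + nn_hls
    nie0 = jpi - nn_hls
    njs0 = 1 + nn_hls
    nje0 = jpj - nn_hls
    return nis0, nie0, njs0, nje0


if __name__ == "__main__":
    # Hypothetical local array of 20 x 15 points, with nn_hls = 1 as in the post.
    print(inner_domain_bounds(20, 15))
```

With these made-up sizes the inner domain spans i = 2..19 and j = 2..14.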
