MPI & restart inconsistencies with activated solar light penetration

r-hartmann · 11 September 2024 09:50

Hi everyone,
I recently started to use NEMO. Unfortunately, I obtain different results depending on the number of MPI tasks used whenever solar light penetration is activated (ln_traqsr = true, for both options ln_qsr_rgb and ln_qsr_2bd). Furthermore, for a constant number of MPI tasks, the results after a restart differ from those of a single long run. These inconsistencies severely affect (i) the correct restartability (claimed in the NEMO manual) and (ii) the reproducibility of my simulations.

As soon as light penetration is deactivated (ln_traqsr = false), the results are again numerically identical (both after restarts and for varying numbers of MPI tasks).

I encounter this problem in v4.2.1 (recently downloaded via svn) as well as in an older v3.6 version. For v4.2.1, it persists regardless of whether OASIS is used and whether XIOS is used in attached or detached mode. I tested it in different configurations, and it also occurs in the plain AMM12 example case of v4.2.1 when ln_traqsr is additionally set in the namelist_cfg.

Does anyone else encounter the same or similar problems? And does anyone have ideas on how to resolve this? Is it due to (potentially) missing compiler flags and options, the system architecture, or a bug in the source code?

Thanks in advance for your help!
Robert

dupontf · 12 September 2024 20:32

Yes, we have found that problem as well using partial steps. It arises because the following call is missing after the local computation is done:
IF( lk_mpp ) CALL mpp_max( 'tra_qsr', nksr )   ! max over global domain
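
To illustrate why a missing global maximum makes a level index depend on the decomposition, here is a toy Python sketch. It is not NEMO code: the function names, the bathymetry values, the cutoff depth and the way the level is computed are all assumptions made for illustration, and the real logic in traqsr.F90 is more involved. The only point is that a value derived from local columns changes with the number of subdomains, while the same value after a max-reduction does not.

```python
# Toy model (hypothetical, not the NEMO algorithm): each "rank" derives a
# light-penetration level from its own columns only.

def local_nksr(columns, cutoff):
    """Largest number of levels shallower than `cutoff` among the local columns."""
    return max(sum(1 for depth in col if depth < cutoff) for col in columns)

def split(columns, nranks):
    """Contiguous 1-D decomposition of the columns into `nranks` chunks."""
    size = -(-len(columns) // nranks)  # ceiling division
    return [columns[i:i + size] for i in range(0, len(columns), size)]

# Terrain-following-like columns: same number of levels everywhere,
# level depths scale with the local water depth (values are made up).
bathy = [50.0, 80.0, 120.0, 400.0, 900.0, 1500.0]
nlev = 10
columns = [[h * k / nlev for k in range(1, nlev + 1)] for h in bathy]
cutoff = 100.0  # assumed depth below which light is neglected

def nksr_per_rank(nranks, use_global_max):
    """Level index seen by each rank, with or without a max-reduction."""
    local = [local_nksr(chunk, cutoff) for chunk in split(columns, nranks)]
    if use_global_max:
        # What a max-reduction over all ranks would hand back to every rank.
        local = [max(local)] * len(local)
    return local

for nranks in (2, 3):
    print(nranks, nksr_per_rank(nranks, False), nksr_per_rank(nranks, True))
```

In this toy setup the column with a depth of 400 is processed with a different level index in the 2-rank and 3-rank decompositions when only local values are used, whereas every rank sees the same value once the maximum is taken over all ranks.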

Hope this helps!

r-hartmann · 13 September 2024 11:51

Thanks a lot for your reply! Just to clarify, you are talking about subroutine tra_qsr, right after the SELECT block, before the temperature field is updated? (traqsr.F90, l. 277)

dupontf · 13 September 2024 14:15

Oh sorry, no. To be more precise, it has to be added in src/OCE/TRA/traqsr.F90 (4.2.2, NEMO Workspace / Nemo on GitLab), after lines 425 and 435 for completeness (i.e., during the initialization process).

r-hartmann · 13 September 2024 16:33

Many thanks! This seems to fix it.
Cheers,
Robert

Antoine_Barthelemy · 17 October 2024 09:30

Hello!

@r-hartmann I wonder about the importance of this bug. What was the effect when you had more frequent restarts?

@dupontf Does the bug also impact a configuration with ln_zps = True? The issue "Reproducibility issue with ln_traqsr=.true. and ln_sco=.true. configurations" (#372, NEMO Workspace / Nemo on GitLab) and the way the fix is implemented suggest that it concerns only s or s-z coordinates, but you mention partial steps, so I am confused.

Thanks in advance!

r-hartmann · 17 October 2024 12:15

Hi @Antoine_Barthelemy,

In my opinion it is crucial, as it can lead to unpredictable, non-reproducible results. Your results will differ after the first restart compared to a longer continuous run. Perhaps only a little at the beginning, but it is unpredictable how they will diverge over a long time and/or with multiple restarts, since you are simulating a turbulent system. This means that anyone who wants to reproduce your results with identical input data, configuration, etc. also needs to know your exact restart times (and number of MPI processes).
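
As a rough illustration of how a tiny difference can grow in a chaotic system, here is a minimal Python sketch. It uses the logistic map as a stand-in, which is an assumption made purely for illustration and has nothing to do with the NEMO equations; the function name, starting value and perturbation size are likewise hypothetical.

```python
# Toy illustration (not ocean physics): two runs of a chaotic map that
# start 1e-12 apart, mimicking a round-off-sized difference between runs.

def logistic_run(x0, nsteps, r=4.0):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    xs = [x0]
    for _ in range(nsteps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

reference = logistic_run(0.3, 100)
perturbed = logistic_run(0.3 + 1e-12, 100)

# Absolute difference between the two runs at every step.
diffs = [abs(a - b) for a, b in zip(reference, perturbed)]
print("after 5 steps:", diffs[5])
print("largest difference over 100 steps:", max(diffs))
```

The difference stays tiny for the first few steps and later becomes as large as the signal itself, which is the behaviour I mean by "unpredictable" above.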

Cheers,
Robert

dupontf · 17 October 2024 18:27

Bonjour @Antoine_Barthelemy, yes, it also impacts partial-step configurations, even though only the last level is modified (you have to be more unlucky than for sigma coordinates, but that happens too).

Thanks for the link to the issue "Reproducibility issue with ln_traqsr=.true. and ln_sco=.true. configurations" (#372, NEMO Workspace / Nemo on GitLab). I agree with @acc that it would be more elegant to add the call to mpp_max in trc_oce_ext_lev, with the caveat that I would not restrict it to the ln_sco case.
