MPI & restart inconsistencies when solar light penetration is activated

Hi everyone,
I recently started to use NEMO. Unfortunately, I obtain different results depending on the number of MPI tasks used whenever solar light penetration is activated (ln_traqsr = .true., with either ln_qsr_rgb or ln_qsr_2bd). Furthermore, for a fixed number of MPI tasks, the results after a restart differ from those of an equivalent continuous run. These inconsistencies therefore severely affect (i) the correct restartability (as claimed in the NEMO manual) and (ii) the reproducibility of my simulations.

As soon as light penetration is deactivated (ln_traqsr = .false.), the results are numerically identical again (both after restarts and across MPI task counts).

I encounter this problem in v4.2.1 (recently downloaded via svn) as well as in an older v3.6. For v4.2.1, it persists regardless of whether OASIS is used and whether XIOS runs in attached or detached mode. I tested different configurations, and it also occurs in the plain AMM12 example case of v4.2.1 when ln_traqsr is additionally set in the namelist_cfg.
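For reference, this is the kind of namelist excerpt I mean (a sketch: only the flags discussed here are shown, and the remaining &namtra_qsr entries are left at their reference values):

```fortran
!-----------------------------------------------------------------------
&namtra_qsr    !   penetrative solar radiation
!-----------------------------------------------------------------------
   ln_traqsr   = .true.    !  light penetration in the ocean
   ln_qsr_rgb  = .true.    !  RGB light penetration (Red-Green-Blue)
   ln_qsr_2bd  = .false.   !  two-band light penetration
/
```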

Does anyone else encounter the same or similar problems? And does anyone have an idea how to resolve it? Is it due to (potentially) missing compiler flags or options, the system architecture, or a bug in the source code?

Thanks in advance for your help!
Robert

Yes, we have found that problem as well, using partial steps. It arises because there is a missing IF( lk_mpp ) CALL mpp_max( 'tra_qsr', nksr )   ! max over global domain after the local computation is done.
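To illustrate what goes wrong (a standalone sketch, not NEMO code): nksr, the deepest level reached by the penetrating solar flux, is computed from local data only, so each MPI subdomain can end up with a different value unless a global maximum is taken; mpp_max essentially wraps an MPI_Allreduce with MPI_MAX:

```fortran
program nksr_demo
   ! Standalone illustration (not NEMO code): each rank computes a
   ! local "deepest light-penetration level"; without a global MAX
   ! the value, and hence the loop bounds of the qsr update, depend
   ! on the domain decomposition.
   use mpi
   implicit none
   integer :: ierr, rank, nksr_local, nksr_global

   call MPI_Init( ierr )
   call MPI_Comm_rank( MPI_COMM_WORLD, rank, ierr )

   nksr_local = 10 + rank   ! stand-in for the local computation

   ! what the missing reduction would provide:
   call MPI_Allreduce( nksr_local, nksr_global, 1, MPI_INTEGER, &
      &                MPI_MAX, MPI_COMM_WORLD, ierr )

   print '(a,i0,a,i0,a,i0)', 'rank ', rank, ': local nksr = ', &
      &    nksr_local, ', global nksr = ', nksr_global

   call MPI_Finalize( ierr )
end program nksr_demo
```

Run with e.g. mpirun -np 4: the local values differ per rank, while every rank agrees on the reduced value, which is what the solar-flux loop needs.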

Hope this helps!

Thanks a lot for your reply! Just to clarify, you are talking about subroutine tra_qsr, right after the SELECT block, before the temperature field is updated? (traqsr.F90, l. 277)

Oh sorry, no. To be more precise, it has to be added after lines 425 and 435 of traqsr.F90 (src/OCE/TRA/traqsr.F90 · 4.2.2 · NEMO Workspace / Nemo · GitLab) for completeness, i.e., during the initialization.
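Schematically, after each of the two local computations of nksr in the initialization (only the IF( lk_mpp ) line is the actual addition; the preceding statement is sketched as a comment):

```fortran
! ... existing local computation of nksr (the statements ending at l. 425 and l. 435) ...
IF( lk_mpp )   CALL mpp_max( 'tra_qsr', nksr )   ! added: max over the global domain
```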

Many thanks! This seems to fix it.
Cheers,
Robert

Hello!

@r-hartmann I wonder about the importance of this bug. What was the effect when you had more frequent restarts?

@dupontf Does the bug also impact a configuration with ln_zps = .true.? This issue (Reproducibility issue with ln_traqsr=.true. and ln_sco=.true. configurations (#372) · Issues · NEMO Workspace / Nemo · GitLab) and the way the fix is implemented suggest that it concerns only s or s-z coordinates, but you mention partial steps, so I am confused.

Thanks in advance!

Hi @Antoine_Barthelemy,

In my opinion it is crucial, as it can lead to unpredictable/non-reproducible results. Your results will differ after the first restart compared to a potentially longer continuous run. Perhaps only a little at the beginning, but it is unpredictable how they will diverge over long times and/or with multiple restarts, since you are simulating a turbulent system. This means that anyone who wants to reproduce your results with identical input data, configuration, etc. also needs exact knowledge of your restart times (and MPI processes).

Cheers,
Robert


Bonjour @Antoine_Barthelemy, yes, it also impacts partial-step configurations, even though only the last level is modified there: the extinction level is derived from the local depth field, so any vertical coordinate in which level depths vary horizontally can yield a different local maximum on each subdomain (you have to be more unlucky than with sigma coordinates, but that happens too).

Thanks for the link to Reproducibility issue with ln_traqsr=.true. and ln_sco=.true. configurations (#372) · Issues · NEMO Workspace / Nemo · GitLab. I agree with @acc that it would be more elegant to add the call to mpp_max in trc_oce_ext_lev, with the caveat that I would not restrict it to the ln_sco case.
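A standalone sketch of that idea (hypothetical names; this is not NEMO's actual routine): with the reduction inside the helper, every caller automatically gets a domain-wide consistent level, whatever the vertical coordinate:

```fortran
program helper_reduction_demo
   ! Hypothetical sketch (not NEMO's trc_oce_ext_lev): the helper
   ! itself performs the global MAX, so callers cannot forget it.
   use mpi
   implicit none
   integer :: ierr, rank, nksr

   call MPI_Init( ierr )
   call MPI_Comm_rank( MPI_COMM_WORLD, rank, ierr )

   nksr = ext_lev_global( 10 + rank )   ! identical on every rank
   print '(a,i0,a,i0)', 'rank ', rank, ': nksr = ', nksr

   call MPI_Finalize( ierr )

contains

   integer function ext_lev_global( klocal )
      ! the reduction lives inside the helper, mirroring the
      ! suggestion to move the mpp_max call into trc_oce_ext_lev
      integer, intent(in) :: klocal
      call MPI_Allreduce( klocal, ext_lev_global, 1, MPI_INTEGER, &
         &                MPI_MAX, MPI_COMM_WORLD, ierr )
   end function ext_lev_global

end program helper_reduction_demo
```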
