Initialisation problem in NEMO versions 4.0.4 & 4.0.7

While running a low-resolution ocean model built with the Intel compiler's debugging compilation options, the following crash was detected:

[gadi-cpu-clx-1929:327681:0:327681] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid: 327681) ====
0 0x0000000000012d10 funlockfile() :0
1 0x00000000007fda7e lbcnfd_mp_lbc_nfd_nogather_2d
() /scratch/dp9/iib548/NEMO/304_4.1.4.dbg/preprocess-ocean/src/nemo/src/OCE/LBC/lbcnfd.F90:1133
2 0x00000000006abddc lbclnk_mp_mpp_nfd_2d_ptr
() /scratch/dp9/iib548/NEMO/304_4.1.4.dbg/preprocess-ocean/src/nemo/src/OCE/LBC/lbclnk.F90:2758
3 0x00000000005e47eb lbclnk_mp_mpp_lnk_2d_ptr_() /scratch/dp9/iib548/NEMO/304_4.1.4.dbg/preprocess-ocean/src/nemo/src/OCE/LBC/lbclnk.F90:867
4 0x00000000005ad54a lbclnk_mp_lbc_lnk_2d_multi_() /scratch/dp9/iib548/NEMO/304_4.1.4.dbg/preprocess-ocean/src/nemo/src/OCE/LBC/lbclnk.F90:152
5 0x000000000316f756 geo2ocean_mp_angle_() /scratch/dp9/iib548/NEMO/304_4.1.4.dbg/preprocess-ocean/src/nemo/src/OCE/SBC/geo2ocean.F90:299
6 0x000000000314dec2 geo2ocean_mp_rot_rep_() /scratch/dp9/iib548/NEMO/304_4.1.4.dbg/preprocess-ocean/src/nemo/src/OCE/SBC/geo2ocean.F90:96
. . .

pointing to the following source lines:

1129 IF ( .NOT. l_fast_exchanges ) THEN
1130 DO jl = 1, ipl; DO jk = 1, ipk
1131 DO ji = 1, endloop
1132 iju = jpiglo - ji - nimpp - nfiimpp(isendto(1),jpnj) + 3
1133 ptab(ji,nlcj-1) = psgn * ptab2(iju,ijpjp1,jk,jl)
1134 END DO
1135 END DO; END DO
1136 ENDIF

in lbcnfd.F90, where the term ptab2(iju,ijpjp1,jk,jl) is not defined on some grid points and PEs.

Tracing this uninitialised value back through the following call sequence

angle => lbc_lnk_multi (lbc_lnk_2d_multi) => load_ptr_2d
=> lbc_lnk_ptr (mpp_lnk_2d_ptr) => mpp_nfd_2d_ptr => lbc_nfd_nogather_2d (routine of the crash problem, file lbcnfd.F90)

I found that the problem is caused by the following setting

fs_2=2

used in the angle routine (file geo2ocean.F90) where an allocatable array gcosf is declared as

gcosf(jpi,jpj)

and not all elements of this array are set in the corresponding loop

  DO jj = 2, jpjm1
     DO ji = fs_2, jpi   ! vector opt.

due to

#if defined key_vectopt_loop
#   define fs_2      1
#   define fs_jpim1  jpi
#else
#   define fs_2      2
#   define fs_jpim1  jpim1
#endif

These settings come from vectopt_loop_substitute.h90. With fs_2 = 2 the first column (ji = 1) of gcosf is never written by the loop, yet the whole array is used in the sources in some places.
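
As an illustration only, here is a sketch of one possible workaround (not the official NEMO fix, and it assumes the context of the angle routine in geo2ocean.F90): giving every element of gcosf a defined value before the fs_2-based loop removes the uninitialised reads.

   ! sketch only: pre-fill the array so that the column ji = 1, which the loop
   ! below never touches when fs_2 = 2, holds defined values when gcosf is
   ! later passed through lbc_lnk_multi to lbc_nfd_nogather_2d
   gcosf(:,:) = 0._wp        ! same treatment for the other gsin*/gcos* arrays set in this loop
   DO jj = 2, jpjm1
      DO ji = fs_2, jpi      ! vector opt. : ji = 1 is skipped when fs_2 = 2
         ! ... existing computation of gcosf(ji,jj) etc. ...
      END DO
   END DO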

Note that the building procedure passed to me for optimising the NEMO sources does not use the key_vectopt_loop pre-processor key when compiling the model sources.

Out of curiosity I decided to check what happens when key_vectopt_loop is added to the pre-processor keys of the build.

Unfortunately this change causes a different crash:

forrtl: severe (408): fort: (2): Subscript #1 of the array TMASK has value 54 which is greater than the upper bound of 53

Image PC Routine Line Source
nemo-si3.exe 00000000042C360F Unknown Unknown Unknown
nemo-si3.exe 00000000017B062F dommsk_mp_dom_msk 212 dommsk.F90
nemo-si3.exe 0000000001724B5F domain_mp_dom_ini 149 domain.F90
nemo-si3.exe 000000000041EC2C nemogcm_mp_nemo_i 364 nemogcm.F90
nemo-si3.exe 000000000041BE99 nemogcm_mp_nemo_g 134 nemogcm.F90
nemo-si3.exe 000000000041BDA1 MAIN__ 18 nemo.f90

pointing to the following source lines in dommsk.F90:

210          DO jj = 1, jpjm1
211             DO ji = 1, jpi   ! vector loop
212                umask(ji,jj,jk) = tmask(ji,jj  ,jk) * tmask(ji+1,jj  ,jk)
213                vmask(ji,jj,jk) = tmask(ji,jj  ,jk) * tmask(ji  ,jj+1,jk)
214             END DO

Note that the same uninitialised-data problem was found when running a high-resolution model.

Could you please advise on the nature of the problem and how it can be fixed?

Thank you.

I strongly suspect a bug in the north-pole folding for the F-points (gcosf). I completely rewrote this piece of code in NEMO 4.2, and I suggest you use that version (or, even better, NEMO 5.0!).
If you want to stick to NEMO 4.0, you should try ln_nnogather = .false.. It should solve your problem, but at the price of lower performance.
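
For reference (from memory, so please check the namelist_ref of your configuration), in NEMO 4.0 this switch sits in the &nammpp namelist block:

   !-----------------------------------------------------------------------
   &nammpp        !   Massively Parallel Processing
   !-----------------------------------------------------------------------
      ln_nnogather = .false.   !  fall back to the mpi_allgather exchange at the
                               !  north fold instead of lbc_nfd_nogather
   /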

When you use key_vectopt_loop, as you saw, the i-loops run from 1 to jpi instead of 2 to jpi-1. This facilitated vectorisation on vector machines, and maybe also on scalar machines?
When you address tmask(ji+1,jj,jk) with ji = jpi you are indeed out of bounds for the first dimension. In memory, however, the point (jpi+1,jj,jk) is the point (1,jj+1,jk), so you make an error in your computation, but you are not addressing anything outside the tmask array, since jj < jpj (the j-loop stops at jpjm1).

With this vectorisation trick you therefore make an error, but the error is not fatal: you are addressing an existing value, and the call to lbc_lnk that follows the do loops corrects the wrong values in the halo points.
=> everything is OK and you get exactly the same results with or without key_vectopt_loop.
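
To make this concrete, here is a small standalone Fortran program (illustrative only, not NEMO code) showing that, with Fortran's column-major storage, the out-of-bounds subscript pair (jpi+1, jj) falls on the same memory location as the perfectly valid element (1, jj+1):

   PROGRAM colmajor_demo
      IMPLICIT NONE
      INTEGER, PARAMETER :: jpi = 4, jpj = 3
      REAL, TARGET  :: tmsk(jpi,jpj)
      REAL, POINTER :: flat(:)
      INTEGER :: ji, jj
      DO jj = 1, jpj                   ! encode the (i,j) position in the value
         DO ji = 1, jpi
            tmsk(ji,jj) = 10*ji + jj
         END DO
      END DO
      flat(1:jpi*jpj) => tmsk          ! 1-D view of the same column-major storage
      jj = 2
      ! the linear offset of the out-of-bounds pair (jpi+1, jj) is
      ! (jpi+1) + (jj-1)*jpi = jpi*jj + 1, exactly the offset of (1, jj+1)
      PRINT *, 'flat(jpi*jj + 1) = ', flat(jpi*jj + 1)
      PRINT *, 'tmsk(1, jj+1)    = ', tmsk(1, jj+1)
   END PROGRAM colmajor_demo

Both prints show the value stored at (1, jj+1), which is why -check bounds reports an error while a run without bounds checking reads a well-defined (if logically wrong) value.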

Thank you for your quick response and for the recommendations provided.

At this stage I cannot change the NEMO version, as this version is going to be used operationally and I have been asked to work on the optimisation aspects of that version.

Unfortunately the recommended setting of ln_nnogather = .false. made no difference to the crash.

Based on this outcome I decided to investigate whether there are any other problems in the sources when the key_vectopt_loop setting is used in the build.

I agree with your explanation that, with this trick in the model sources, the out-of-bounds access actually lands inside the corresponding array. After suppressing bounds checking for the dommsk.F90 sources, a number of similar problems were found in several other files, such as domvvl.F90, geo2ocean.F90 and sbcblk.F90 (details can be provided if there is interest in this information).
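
For completeness, bounds checking can be dropped for just those files while all the other run-time checks are kept; with the Intel compiler this amounts to changing the -check option for those files only (flag spelling indicative, please check the ifort documentation):

   -check all             used for the rest of the sources (includes array-bounds checking)
   -check all,nobounds    used only for dommsk.F90, domvvl.F90, geo2ocean.F90, sbcblk.F90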

Eventually another uninitialised-data problem was detected, with the following crash:

[gadi-cpu-clx-2710:2320147:0:2320147] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid:1427830) ====
0 0x0000000000012d10 funlockfile() :0
1 0x0000000001394b4d sbcblk_mp_blk_ice_tau
() /scratch/dp9/iib548/NEMO/304_4.1.4.fix.dbg/preprocess-ocean/src/nemo/src/OCE/SBC/sbcblk.F90:768
2 0x000000000181eaf5 icesbc_mp_ice_sbc_tau
() /scratch/dp9/iib548/NEMO/304_4.1.4.fix.dbg/preprocess-ocean/src/nemo/src/ICE/icesbc.F90:93
3 0x00000000014044b7 icestp_mp_ice_stp_() /scratch/dp9/iib548/NEMO/304_4.1.4.fix.dbg/preprocess-ocean/src/nemo/src/ICE/icestp.F90:161
4 0x0000000000ff992b sbcmod_mp_sbc_() /scratch/dp9/iib548/NEMO/304_4.1.4.fix.dbg/preprocess-ocean/src/nemo/src/OCE/SBC/sbcmod.F90:452
5 0x0000000002bad94a step_mp_stp_() /scratch/dp9/iib548/NEMO/304_4.1.4.fix.dbg/preprocess-ocean/src/nemo/src/OCE/step.F90:107
6 0x000000000041c050 nemogcm_mp_nemo_gcm_() /scratch/dp9/iib548/NEMO/304_4.1.4.fix.dbg/preprocess-ocean/src/nemo/src/OCE/nemogcm.F90:160
7 0x000000000041bda1 MAIN__() /g/data/dp9/iib548/NEMO/tracing/extract/nemo/src/OCE/nemo.f90:18
8 0x000000000041bd22 main() ???:0
9 0x000000000003a7e5 __libc_start_main() ???:0
10 0x000000000041bc2e _start() ???:0

forrtl: error (75): floating point exception
Image PC Routine Line Source
nemo-si3.exe 00000000040F414B Unknown Unknown Unknown
libpthread-2.28.s 000014F916FE7D10 Unknown Unknown Unknown
nemo-si3.exe 0000000001394B4D sbcblk_mp_blk_ice 768 sbcblk.F90
nemo-si3.exe 000000000181EAF5 icesbc_mp_ice_sbc 93 icesbc.F90
nemo-si3.exe 00000000014044B7 icestp_mp_ice_stp 161 icestp.F90
nemo-si3.exe 0000000000FF992B sbcmod_mp_sbc_ 452 sbcmod.F90
nemo-si3.exe 0000000002BAD94A step_mp_stp_ 107 step.F90
nemo-si3.exe 000000000041C050 nemogcm_mp_nemo_g 160 nemogcm.F90
nemo-si3.exe 000000000041BDA1 MAIN__ 18 nemo.f90

pointing to the following source lines in sbcblk.F90:

764 DO jj = 2, jpjm1 ! U & V-points (same as ocean).
765 DO ji = 1, jpi ! vect. opt.
766 ! take care of the land-sea mask to avoid "pollution" of coastal stress. p[uv]taui used in frazil and rheology
767 zztmp1 = 0.5_wp * ( 2. - umask(ji,jj,1) ) * MAX( tmask(ji,jj,1),tmask(ji+1,jj ,1) )
768 zztmp2 = 0.5_wp * ( 2. - vmask(ji,jj,1) ) * MAX( tmask(ji,jj,1),tmask(ji ,jj+1,1) )
769 utau_ice(ji,jj) = zztmp1 * ( utau_ice(ji,jj) + utau_ice(ji+1,jj ) )
770 vtau_ice(ji,jj) = zztmp2 * ( vtau_ice(ji,jj) + vtau_ice(ji ,jj+1) )
771 END DO
772 END DO

This latest problem has not been investigated further, as tracking down similar problems in the NEMO sources takes an enormous amount of time.

I don't really understand what you are doing. key_vectopt_loop is an old optimisation key that was used on vector machines, a kind of machine that no longer exists (apart from very rare exceptions). key_vectopt_loop should have a very limited impact on scalar machines; you should not use it. As I explained, key_vectopt_loop will generate "false" out-of-bounds errors, and that is what you get at line 768 in sbcblk_mp_blk_ice_tau. If you switch on key_vectopt_loop, you will get this kind of error in every do loop that uses it.
For your information, NEMO 4.0 was not tested with debugging options.
NEMO 4.2 is the first version we ran with a complete set of ifort and gfortran debugging options. Speaking of optimisation, you should tell the people who asked you to optimise an older version of NEMO that NEMO v5.0, which was released recently, can be up to 2.5 times faster than NEMO v4: see v5.0 · NEMO Workspace / Nemo · GitLab.

I have not worked with the NEMO sources before, so my experience with them is close to zero. But in the far past (20-25 years ago) I optimised numerous applications for NEC vector systems, so I decided to check, out of curiosity, what kind of performance benefit the key_vectopt_loop option provides.

Thank you for your advice not to use this option.

It is my “golden” rule, before making any optimisation changes that are expected to leave the results unchanged, to first check whether there are any problems in the sources. For this purpose, before doing any optimisation, I build and run the model with the compiler's debugging options. The main reason for working this way is that, if there are existing problems in the sources, optimisation changes may trigger bugs that were previously hidden.
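
To be concrete about what I mean by the compiler's debugging options, an indicative ifort set for such a check (not necessarily the exact flags of the build discussed here) is

   -O0 -g -traceback -check all -init=snan,arrays -fpe0

where -init=snan,arrays pre-fills real arrays with signalling NaNs and -fpe0 traps any arithmetic on them, which is presumably how the uninitialised values above surfaced as signal 8 (floating-point invalid operation).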

I will pass your information about the NEMO 5.0 speed-up on to the people who run the NEMO model here.

Thanks again for your information.