Seg. fault after compiling XIOS and NEMO with openmpi 4.1 and intel 18

Hi,

I appreciate this probably isn’t an easy one without giving a lot of information about my environment and compilation settings, but I’m at a loss so any suggestions would be welcome…

I’m using openmpi 4.1.0 with gcc 6.3.0 for C and ifort 18.0.5 (intel 18) for Fortran to compile XIOS trunk and NEMO 4.2.0.
Compilation of XIOS is successful and test_client.exe runs without issue.
I can also build NEMO experiments without issue, and have compiled one based on the AMM12 reference configuration:

./makenemo -m linux_ifort -n AMM12_TEST -r AMM12 -j 6

with arch/arch-linux_ifort.fcm

%NCDF_INC -I%NCDF_HOME/include
%NCDF_LIB -lm -L%NCDF_HOME/lib -lnetcdff -lnetcdf -L%HDF5_HOME/lib -lhdf5_hl -lhdf5 -lxml2 -lsz -lz -ldl -lbz2
%XIOS_INC -I%XIOS_HOME/inc
%XIOS_LIB -L%XIOS_HOME/lib -lxios
%OASIS_INC -I%OASIS_HOME/build/lib/mct -I%OASIS_HOME/build/lib/psmile.MPI1
%OASIS_LIB -L%OASIS_HOME/lib -lpsmile.MPI1 -lmct -lmpeu -lscrip
%CPP cpp
%FC mpifort -static-intel
%FCFLAGS -align dcommons -assume byterecl -fp-speculation=safe -i4 -r8 -g -O0
%FFLAGS %FCFLAGS
%LD mpifort -static-intel
%LDFLAGS -lstdc++ -lpthread
%FPPFLAGS -P -traditional
%AR ar
%ARFLAGS rs
%MK make
%USER_INC %XIOS_INC %OASIS_INC %NCDF_INC
%USER_LIB %XIOS_LIB %OASIS_LIB %NCDF_LIB
%CC mpicc
%CFLAGS -fPIC -pthread -std=c++11 -g -O0

But it fails at initialisation with a segmentation fault.
When running nemo under valgrind’s memory checker, I immediately get:

valgrind --max-stackframe=18095616 --track-origins=yes nemo
==35777== Memcheck, a memory error detector
==35777== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==35777== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==35777== Command: nemo
==35777==
==35777== Conditional jump or move depends on uninitialised value(s)
==35777== at 0x1BBC2BD: intel_sse2_strcpy (in nemo.exe)
==35777== by 0x1B5AA83: for__open_proc (in nemo.exe)
==35777== by 0x1AFBBD6: for_open (in nemo.exe)
==35777== by 0xAE84DC: lib_mpp_mp_ctl_opn (lib_mpp.f90:2587)
==35777== by 0x4566F9: nemogcm_mp_nemo_init (nemogcm.f90:247)
==35777== by 0x455055: nemogcm_mp_nemo_gcm_ (nemogcm.f90:119)
==35777== by 0x455018: MAIN__ (nemo.f90:18)
==35777== by 0x454FCD: main (in nemo.exe)
==35777== Uninitialised value was created by a stack allocation
==35777== at 0x1B5A91D: for__open_proc (in nemo.exe)
==35777==
==35777== Conditional jump or move depends on uninitialised value(s)
==35777== at 0x1BBC2BD: intel_sse2_strcpy (in nemo.exe)
==35777== by 0x1AF9EBD: for__add_to_lf_table (in nemo.exe)
==35777== by 0x1B5C2CC: for__open_proc (in nemo.exe)
==35777== by 0x1AFBBD6: for_open (in nemo.exe)
==35777== by 0xAE84DC: lib_mpp_mp_ctl_opn (lib_mpp.f90:2587)
==35777== by 0x4566F9: nemogcm_mp_nemo_init (nemogcm.f90:247)
==35777== by 0x455055: nemogcm_mp_nemo_gcm_ (nemogcm.f90:119)
==35777== by 0x455018: MAIN__ (nemo.f90:18)
==35777== by 0x454FCD: main (in nemo.exe)
==35777== Uninitialised value was created by a stack allocation
==35777== at 0x1B5A91D: for__open_proc (in nemo.exe)
==35777==
==35777== Invalid read of size 1
==35777== at 0x1B21E73: skip_nml_buffer (in nemo.exe)
==35777== by 0x1B1F9FC: for__nml_lex (in nemo.exe)
==35777== by 0x1B1890D: for_read_seq_nml (in nemo.exe)
==35777== by 0x1B17ECA: for_read_int_nml (in nemo.exe)
==35777== by 0x632F8A: wet_dry_mp_wad_init_ (wet_dry.f90:114)
==35777== by 0x45847A: nemogcm_mp_nemo_init_ (nemogcm.f90:345)
==35777== by 0x455055: nemogcm_mp_nemo_gcm_ (nemogcm.f90:119)
==35777== by 0x455018: MAIN__ (nemo.f90:18)
==35777== by 0x454FCD: main (in nemo.exe)
==35777== Address 0x2ab82ee9 is 0 bytes after a block of size 6,393 alloc'd
==35777== at 0x4C2D0AF: malloc (vg_replace_malloc.c:381)
==35777== by 0x1B292AD: for__get_vm (in nemo.exe)
==35777== by 0x1B1AD66: for_read_seq_nml (in nemo.exe)
==35777== by 0x1B17ECA: for_read_int_nml (in nemo.exe)
==35777== by 0x632F8A: wet_dry_mp_wad_init_ (wet_dry.f90:114)
==35777== by 0x45847A: nemogcm_mp_nemo_init_ (nemogcm.f90:345)
==35777== by 0x455055: nemogcm_mp_nemo_gcm_ (nemogcm.f90:119)
==35777== by 0x455018: MAIN__ (nemo.f90:18)
==35777== by 0x454FCD: main (in nemo.exe)
==35777==
==35777== Warning: bad signal number 0 in sigaction()
==35777== Warning: ignored attempt to set SIGKILL handler in sigaction();
==35777== the SIGKILL signal is uncatchable
==35777== Warning: ignored attempt to set SIGSTOP handler in sigaction();
==35777== the SIGSTOP signal is uncatchable
==35777== Warning: ignored attempt to set SIGRT32 handler in sigaction();
==35777== the SIGRT32 signal is used internally by Valgrind
==35777== Warning: bad signal number 0 in sigaction()
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
nemo.exe 0000000001AEB24D for__signal_handl Unknown Unknown
libpthread-2.24.s 00000000051D10C0 Unknown Unknown Unknown
nemo.exe 00000000016BE1A3 Unknown Unknown Unknown
nemo.exe 0000000000A313FA Unknown Unknown Unknown
nemo.exe 0000000000A05CFC Unknown Unknown Unknown
nemo.exe 0000000000A0B275 Unknown Unknown Unknown
nemo.exe 00000000008585B1 Unknown Unknown Unknown
nemo.exe 00000000008500D8 Unknown Unknown Unknown
nemo.exe 000000000082987D Unknown Unknown Unknown
nemo.exe 000000000045848F Unknown Unknown Unknown
nemo.exe 0000000000455056 Unknown Unknown Unknown
nemo.exe 0000000000455019 Unknown Unknown Unknown
nemo.exe 0000000000454FCE Unknown Unknown Unknown
libc-2.24.so 0000000006C8C2E1 __libc_start_main Unknown Unknown
nemo.exe 0000000000454E9A Unknown Unknown Unknown
==35777==
==35777== HEAP SUMMARY:
==35777== in use at exit: 913,521,516 bytes in 11,910 blocks
==35777== total heap usage: 274,736 allocs, 262,826 frees, 1,058,062,826 bytes allocated
==35777==
==35777== LEAK SUMMARY:
==35777== definitely lost: 6,522 bytes in 4 blocks
==35777== indirectly lost: 79 bytes in 2 blocks
==35777== possibly lost: 909,488,630 bytes in 165 blocks
==35777== still reachable: 4,026,285 bytes in 11,739 blocks
==35777== suppressed: 0 bytes in 0 blocks
==35777== Rerun with --leak-check=full to see details of leaked memory
==35777==
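
The forrtl traceback in the output above only gives raw addresses ("Unknown" routine and line). Since the arch file compiles with -g, the nemo.exe frames can usually be mapped back to source lines with addr2line from binutils. A minimal sketch, assuming the traceback lines have been saved to a file called traceback.txt (a hypothetical name) next to nemo.exe; nemo_pcs is a made-up helper, not part of NEMO:

```shell
#!/bin/sh
# nemo_pcs: read a forrtl traceback on stdin and print the PC column
# for the frames that belong to nemo.exe (hypothetical helper).
nemo_pcs() {
    awk '$1 == "nemo.exe" { print $2 }'
}

# traceback.txt is an assumed file holding the "Image PC Routine ..." lines.
if command -v addr2line >/dev/null 2>&1 && [ -f nemo.exe ] && [ -f traceback.txt ]; then
    # -f also prints the function name, -e selects the executable.
    nemo_pcs < traceback.txt | while read -r pc; do
        addr2line -f -e nemo.exe "$pc"
    done
else
    echo "need addr2line, nemo.exe and traceback.txt in the current directory"
fi
```

The addresses in the PC column are return addresses, so the reported line may be the one just after the call that failed.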

The ocean.output ends like:

==>>> Read vertical mesh in AMM_R12_sco_domcfg file
zgr_read : read the vertical coordinates in AMM_R12_sco_domcfg file
~~~~~~~~
iom_nf90_open ~~~ open existing file: AMM_R12_sco_domcfg.nc
in READ mode
---> AMM_R12_sco_domcfg.nc OK
read top_level (rec: 1) in AMM_R12_sco_domcfg.nc ok
read bottom_level (rec: 1) in AMM_R12_sco_domcfg.nc ok
read e3t_1d (rec: 1) in AMM_R12_sco_domcfg.nc ok
read e3w_1d (rec: 1) in AMM_R12_sco_domcfg.nc ok

The following libraries were also installed using the same compilers:
szip 2.1.1
hdf5 1.12.0 --enable-parallel --disable-shared
netcdf-c 4.9.0 --enable-parallel-tests --disable-shared
netcdf-fortran 4.6.0 --enable-parallel-tests --disable-shared

And running make check on netcdf-fortran confirms all the tests pass.
I have also tried building the libraries as shared instead of static, for what it’s worth.

Any help is appreciated, thanks.

I’ve managed to pin down that the segmentation fault occurs when executing netcdf::nf90_get_var_3d_eightbytereal (netcdf_expanded.F90:2484) during a read of AMM_R12_sco_domcfg.nc, because of an unaligned memory access.
My best guess is that netcdf and nemo are using different type sizes, even though this should be set by compilation flags and checked by the compiler. Assuming that the difference originates from using a combination of gcc and ifort, is there any advice on fixing type sizes to be the same when building netcdf or nemo?

I’ve tried setting environment variables as described in “NetCDF: Building the NetCDF-4.2 and later Fortran libraries” (ucar.edu), but those didn’t seem to have any effect.
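
For anyone checking the same thing, a quick way to see which compilers and flags each piece was actually built with is to ask the netCDF config tools and the MPI wrappers directly. This is only a sketch, assuming nc-config, nf-config and the Open MPI wrappers are on the PATH; the report helper is made up for illustration:

```shell
#!/bin/sh
# report CMD ARGS...: run CMD if it is installed, otherwise say so.
# (hypothetical helper for this sketch)
report() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: %s\n' "$*" "$("$@" 2>&1)"
    else
        printf '%s: not found\n' "$1"
    fi
}

report nc-config --cc             # C compiler used to build netcdf-c
report nf-config --fc             # Fortran compiler used to build netcdf-fortran
report nf-config --fflags         # Fortran flags recorded by netcdf-fortran
report mpicc --showme:command     # compiler behind the Open MPI C wrapper
report mpifort --showme:command   # compiler behind the Open MPI Fortran wrapper
```

If the Fortran compilers or the flags that change default kinds (such as -r8 in the arch file above) differ between the netCDF build and the NEMO build, that is where I would look first.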

Update for anyone following this topic:
I’ve gotten around the issue by recompiling openmpi and the other libraries using the gcc and gfortran compilers.

My best guess at the cause: our installation of openmpi with the intel compilers used “-auto”, which stores local variables without the SAVE attribute on the run-time stack. This overrides the behaviour some Fortran programs rely on, where local variables are kept in static storage and still hold their previous values when a procedure is called again.
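
One way to test this guess is to check whether the Open MPI Fortran wrapper really adds “-auto” when compiling. A minimal sketch, assuming the wrappers are on the PATH and that the flag, if present, was baked into the wrapper's compile flags (if it was only used to build Open MPI itself it will not show up here); has_flag is a made-up helper:

```shell
#!/bin/sh
# has_flag FLAGS FLAG: succeed if FLAG appears as a whole word in FLAGS.
# (hypothetical helper, not part of Open MPI or NEMO)
has_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

if command -v mpifort >/dev/null 2>&1; then
    # --showme:compile prints the flags the wrapper adds when compiling.
    wrapper_flags=$(mpifort --showme:compile)
    echo "mpifort compile flags: $wrapper_flags"
    if has_flag "$wrapper_flags" "-auto"; then
        echo "wrapper adds -auto: locals without SAVE go on the stack"
    else
        echo "wrapper does not add -auto"
    fi
    # ompi_info lists the compilers Open MPI was configured with.
    ompi_info | grep -i 'compiler'
else
    echo "mpifort not found on PATH"
fi
```

The whole-word match matters here because ifort also has flags such as -auto-scalar, which should not be mistaken for -auto.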