Configuration crashes with new OASIS3-MCT 4.0: malloc(): memory corruption (fast)

Dear all,

I am experiencing issues running a nested NEMO3.6 simulation coupled with ECHAM-6.3.05p2 via OASIS3-MCT 4.0. For output, we are using XIOS2.5, revisions r1910 and r2497. We upgraded from OASIS3-MCT 3.6 to newer OASIS and XIOS versions because of compiler problems with this configuration.

The model crashes at the beginning of the simulation at ts=1. ECHAM is waiting for input files from OASIS, so everything seems fine on that side. However, NEMO throws an error message indicating memory corruption during memory allocation. The backtrace shows it crashes in iom_p3d after the malloc(): memory corruption (fast) error appears.

Additionally, XIOS generates various error messages, but they are not consistent. The "Awaiting data of size" value in these messages looks incorrect to me.

It seems to me that the problem lies with XIOS, but I welcome any suggestions. Thank you very much in advance!

All the best, Tronje

XIOS error messages:

Error [void CGrid::inputField(const  CArray<double,n>& field, CArray<double,1>& stored) const] : In file '/home/shktkeme/esm/models/foci-agrif_mops_oasismct4/xios/inc/grid.hpp', line 381 -> [ Awaiting data of size = 1, Received data size = 48438 ] The data array does not have the right size! Grid = grid_T_3D

> Error [const CCalendar& CDate::getRelCalendar(void) const] : In file '/home/shktkeme/esm/models/foci-agrif_mops_oasismct4/xios/src/date.cpp', line 149 -> Invalid state: The date is not associated with any calendar.

Error [void CGrid::inputField(const  CArray<double,n>& field, CArray<double,1>& stored) const] : In file '/home/shktkeme/esm/models/foci-agrif_mops_oasismct4/xios/inc/grid.hpp', line 381 -> [ Awaiting data of size = 1, Received data size = 1053 ] The data array does not have the right size! Grid = grid_T_2D

> Error [void CGrid::inputField(const  CArray<double,n>& field, CArray<double,1>& stored) const] : In file '/home/shktkeme/esm/models/foci-agrif_mops_oasismct4/xios/inc/grid.hpp', line 381 -> [ Awaiting data of size = 0, Received data size = 48438 ] The data array does not have the right size! Grid = grid_T_3D
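To compare the sizes reported in messages like these, a small script can group them by grid. This is only a sketch: it assumes each message sits on a single line in the format shown above, the file paths in the sample are shortened, and the function names are made up for illustration. One detail it makes visible is that 48438 = 1053 x 46, so the two received sizes would be consistent with a 46-level grid if both messages came from the same subdomain (an assumption), whereas the awaited sizes of 0 and 1 are the values that look off.

```python
import re
from collections import defaultdict

# Matches the size-mismatch message format quoted in this thread.
SIZE_ERROR = re.compile(
    r"Awaiting data of size = (\d+), Received data size = (\d+) \]"
    r".*?Grid = (\w+)"
)

def summarize_size_errors(log_text):
    """Group (awaited, received) size pairs by grid id."""
    summary = defaultdict(set)
    for awaited, received, grid in SIZE_ERROR.findall(log_text):
        summary[grid].add((int(awaited), int(received)))
    return {grid: sorted(pairs) for grid, pairs in summary.items()}

def implied_levels(size_3d, size_2d):
    """Return size_3d / size_2d if it divides evenly, else None."""
    if size_2d and size_3d % size_2d == 0:
        return size_3d // size_2d
    return None

# Messages copied from the thread, with the file paths shortened.
sample = """\
Error [...] : In file '.../xios/inc/grid.hpp', line 381 -> [ Awaiting data of size = 1, Received data size = 48438 ] The data array does not have the right size! Grid = grid_T_3D
Error [...] : In file '.../xios/inc/grid.hpp', line 381 -> [ Awaiting data of size = 1, Received data size = 1053 ] The data array does not have the right size! Grid = grid_T_2D
> Error [...] : In file '.../xios/inc/grid.hpp', line 381 -> [ Awaiting data of size = 0, Received data size = 48438 ] The data array does not have the right size! Grid = grid_T_3D
"""

summary = summarize_size_errors(sample)
print(summary)
# {'grid_T_3D': [(0, 48438), (1, 48438)], 'grid_T_2D': [(1, 1053)]}
print(implied_levels(48438, 1053))  # 46
```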

log error messages:

line:2004089
 642: *** Error in `/scratch/usr/shktkeme/esm-experiments/agrifmopstest/run_19000101-19000101/work/./oceanx': malloc(): memory corruption (fast): 0x0000000003bf0670 ***
642: ======= Backtrace: =========
 642: /lib64/libc.so.6(+0x7f474)[0x2aaaae690474]
 642: /lib64/libc.so.6(+0x82bb0)[0x2aaaae693bb0]
 642: /lib64/libc.so.6(__libc_malloc+0x4c)[0x2aaaae69678c]
 642: /sw/compiler/gcc/9.3.0/skl/lib64/libstdc++.so.6(_Znwm+0x15)[0x2aaaaad743f5]
 642: /scratch/usr/shktkeme/esm-experiments/agrifmopstest/run_19000101-19000101/work/./oceanx[0x12467fc]
 ...
642: ======= Memory map: ========
 642: 00400000-01cef000 r-xp 00000000 3d8:ad14e 1116896249929355080            /scratch/usr/shktkeme/esm-experiments/agrifmopstest/run_19000101-19000101/work/oceanx
 642: 01eef000-01ef1000 r--p 018ef000 3d8:ad14e 1116896249929355080            /scratch/usr/shktkeme/esm-experiments/agrifmopstest/run_19000101-19000101/work/oceanx
 642: 01ef1000-027bb000 rw-p 018f1000 3d8:ad14e 1116896249929355080            /scratch/usr/shktkeme/esm-experiments/agrifmopstest/run_19000101-19000101/work/oceanx
 642: 027bb000-1359f000 rw-p 00000000 00:00 0                                  [heap]
 642: 2aaaaaaab000-2aaaaaacd000 r-xp 00000000 00:1d 205111                     /usr/lib64/ld-2.17.so
 642: 2aaaaaacd000-2aaaaaacf000 r-xp 00000000 00:00 0                          [vdso]
 642: 2aaaaaacf000-2aaaaaae3000 rw-p 00000000 00:00 0 
 642: 2aaaaaae3000-2aaaaaae4000 r--s dabbad0003420000 00:05 18058              /dev/hfi1_0
...
2005706  642: forrtl: error (76): Abort trap signal
2005707  642: Image              PC                Routine            Line        Source
2005708  642: oceanx             0000000001630DF4  Unknown               Unknown  Unknown
2005709  642: libpthread-2.17.s  00002AAAADEEA630  Unknown               Unknown  Unknown
2005710  642: libc-2.17.so       00002AAAAE647387  gsignal               Unknown  Unknown
...
2005716  642: libstdc++.so.6.0.  00002AAAAAD743F5  _Znwm                 Unknown  Unknown
2005717  642: oceanx             00000000012467FC  Unknown               Unknown  Unknown
...
2005725  642: oceanx             00000000009F5285  Unknown               Unknown  Unknown
2005726  642: oceanx             00000000006C58ED  iom_mp_iom_p3d_          1524  iom.f90
2005727  642: oceanx             00000000004FC068  trcwri_my_trc_mp_         157  trcwri_my_trc.f90
2005728  642: oceanx             00000000004FBD1F  trcwri_mp_sub_loo         184  trcwri.f90
2005729  642: oceanx             00000000004FBD05  trcwri_mp_trc_wri         137  trcwri.f90
2005730  642: oceanx             00000000004DFDA2  trcstp_mp_sub_loo         200  trcstp.f90
2005731  642: oceanx             00000000004DFAA1  trcstp_mp_trc_stp          99  trcstp.f90
2005732  642: oceanx             000000000044A92E  step_mp_sub_loop_         425  step.f90
2005733  642: oceanx             0000000000449F93  step_mp_stp_              106  step.f90
2005734  642: oceanx             000000000057A51F  agrif_util_mp_agr         581  modutil.f90
2005735  642: oceanx             000000000044B71C  step_mp_sub_loop_         533  step.f90
2005736  642: oceanx             0000000000449F93  step_mp_stp_              106  step.f90
...
2030996 srun: error: bcn1291: task 642: Aborted
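With a log of this size, it can help to pull out only the traceback frames that carry source information. The sketch below assumes the column layout of the traceback shown above (optional log line number, rank, image, PC, routine, line, source) and that routine names follow the `module_mp_procedure_` pattern visible in the log. The function name is hypothetical, and routine names that are already truncated in the log stay truncated.

```python
import re

# One traceback row: [log line no.] rank: image PC routine line source
FRAME = re.compile(
    r"^\s*(?:\d+\s+)?(?P<rank>\d+):\s+(?P<image>\S+)\s+"
    r"(?P<pc>[0-9A-Fa-f]{16})\s+(?P<routine>\S+)\s+"
    r"(?P<line>\d+|Unknown)\s+(?P<source>\S+)\s*$"
)

def resolved_frames(log_text):
    """Return (source, line, module, procedure) for frames with known source."""
    frames = []
    for raw in log_text.splitlines():
        m = FRAME.match(raw)
        if not m or m["source"] == "Unknown" or m["line"] == "Unknown":
            continue
        # Split 'module_mp_procedure_' into its two parts.
        module, _, proc = m["routine"].partition("_mp_")
        frames.append((m["source"], int(m["line"]), module, proc.rstrip("_")))
    return frames

# Rows copied from the traceback in this thread.
sample = """\
2005707  642: Image              PC                Routine            Line        Source
2005709  642: libpthread-2.17.s  00002AAAADEEA630  Unknown               Unknown  Unknown
2005725  642: oceanx             00000000009F5285  Unknown               Unknown  Unknown
2005726  642: oceanx             00000000006C58ED  iom_mp_iom_p3d_          1524  iom.f90
2005727  642: oceanx             00000000004FC068  trcwri_my_trc_mp_         157  trcwri_my_trc.f90
"""

for frame in resolved_frames(sample):
    print(frame)
```

For the sample rows this leaves two frames, `iom.f90` line 1524 and `trcwri_my_trc.f90` line 157, which are the two places discussed in the rest of the thread.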

iom.f90:

   SUBROUTINE iom_p3d( cdname, pfield3d )
      use Agrif_Types, only : Agrif_tabvars

      character(*), intent(in) :: cdname
      real(wp), intent(in), dimension(:,:,:) :: pfield3d

      CALL xios_send_field(cdname, pfield3d)
   END SUBROUTINE iom_p3d

Which variable are you trying to write in trcwri_my_trc.f90 at line 157?
It looks like the error comes from this line and that you are trying to write a variable that is not properly defined in your XML files. Could the field be on a 2D grid in the XML but 3D in the code, or the other way round?

We have 11 passive tracers here that can be output as 3D fields. 10 of these tracers are requested via XIOS. I put WRITE statements before and after line 157 in trcwri_my_trc.f90, and all 11 variables are sent to XIOS. In fact, all variables are sent to XIOS 3 times. After this, I decided to turn off the output for these variables in file_def.xml.
This leads to 2 new errors, which I will describe in the next post.

In the model code:

   LOGICAL, PUBLIC, PARAMETER ::   lk_my_trc     = .TRUE.   !: PTS flag 
   INTEGER, PUBLIC, PARAMETER ::   jp_my_trc     =  11       !: number of PTS tracers 
   INTEGER, PUBLIC, PARAMETER ::   jp_my_trc_2d  =  2       !: additional 2d output arrays ('key_trc_diaadd') !DE changed from 0 to 2 as in LP code
   INTEGER, PUBLIC, PARAMETER ::   jp_my_trc_3d  =  0       !: additional 3d output arrays ('key_trc_diaadd')
   INTEGER, PUBLIC, PARAMETER ::   jp_my_trc_trd =  0       !: number of sms trends for MY_TRC

   ! assign an index in trc arrays for each PTS prognostic variables
   INTEGER, PUBLIC, PARAMETER ::   jpmyt1 = jp_lm + 1     !: 1st MY_TRC tracer (water age)
   INTEGER, PUBLIC, PARAMETER ::   jppo4 = jp_lm + 2     !: 2nd MY_TRC tracer !DE this and the tracers below are biogeochemistry tracers
   INTEGER, PUBLIC, PARAMETER ::   jpdop = jp_lm + 3     !: 3rd MY_TRC tracer
   INTEGER, PUBLIC, PARAMETER ::   jpoxy = jp_lm + 4     !: 4th MY_TRC tracer
   INTEGER, PUBLIC, PARAMETER ::   jpphy = jp_lm + 5     !: 5th MY_TRC tracer
   INTEGER, PUBLIC, PARAMETER ::   jpzoo = jp_lm + 6     !: 6th MY_TRC tracer
   INTEGER, PUBLIC, PARAMETER ::   jpdet = jp_lm + 7     !: 7th MY_TRC tracer
   INTEGER, PUBLIC, PARAMETER ::   jpdin = jp_lm + 8     !: 8th MY_TRC tracer
   INTEGER, PUBLIC, PARAMETER ::   jpdic = jp_lm + 9     !: 9th MY_TRC tracer
   INTEGER, PUBLIC, PARAMETER ::   jpalk = jp_lm + 10     !: 10th MY_TRC tracer
   INTEGER, PUBLIC, PARAMETER ::   jpmyt11 = jp_lm + 11    !: 11th MY_TRC tracer !DE ideal tracer to test mass conservation

In file_def.xml as well as file_def_agrix.xml:

         <file id="file11" name_suffix="_ptrc_T" description="transient tracers" enabled=".TRUE.">
             <field field_ref="AGE_d" name="votrcage" />
             <field field_ref="O2" name="O2" />
             <field field_ref="DIN" name="DIN" />
             <field field_ref="DIC" name="DIC" />
             <field field_ref="ALK" name="ALK" />
             <field field_ref="DOP" name="DOP" />
             <field field_ref="PO4" name="PO4" />
             <field field_ref="PHY" name="PHY" />
             <field field_ref="ZOO" name="ZOO" />
             <field field_ref="DET" name="DET" />
        </file>

iodef.xml

  <grid_definition>    
     <grid id="grid_T_2D" >    <domain id="grid_T" />                           </grid>
     <grid id="grid_T_3D" >    <domain id="grid_T" />    <axis id="deptht" />   </grid>
...

field_def.xml

     <field_group id="ptrc_T" grid_ref="grid_T_3D">
       <!-- tracers : variables available with key_my_trc -->
       <field id="AGE" long_name="water mass age" unit="sec" />
       <field id="AGE_d" long_name="water mass age" unit="days" > AGE / 86400. </field >
       <field id="IDEAL" long_name="ideal tracer, no sources or sinks" unit="none" />

       <!-- NPZD-O : variables available with key_my_trc -->
       <field id="PO4"       long_name="Phosphate Concentration"                 unit="umol/L" />
       <field id="DOP"       long_name="DOP Concentration"                     unit="umol/L" />
       <field id="O2"        long_name="Oxygen Concentration"                  unit="umol/L" />
       <field id="PHY"       long_name="Phytoplankton Concentration"           unit="umol/L" />
       <field id="ZOO"       long_name="Zooplankton Concentration"             unit="umol/L" />
       <field id="DET"       long_name="Detritus Concentration"                unit="umol/L" />
       <field id="DIC"       long_name="DIC Concentration"                     unit="umol/kg" />
       <field id="DICP"       long_name="DIC Concentration pre-ind"            unit="umol/kg" />
       <field id="ALK"       long_name="Alkalinity"                            unit="umol/kg" />
       <field id="DIN"       long_name="Nitrate concentration"                 unit="umol/L" />
       <field id="ciz"         long_name="penetration of visible light"            unit="W/m2"     />
       <field id="DICm3"   long_name="DIC in mmol per m3"                  unit="umol/L"     />
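One way to cross-check definitions like these against the file definition is to resolve, for every field_ref requested in a file, which grid_ref it inherits from field_def.xml. The following is a minimal sketch only. The XML strings are shortened from the snippets in this thread with closing tags added so that they parse, the `NO3` entry is a hypothetical undefined field included to show what a miss looks like, and the function names are invented for illustration. It only handles grid_ref set on a field or inherited from an enclosing group, which is a simplification of what XIOS itself resolves.

```python
import xml.etree.ElementTree as ET

# Shortened from field_def.xml in this thread; closing tags added.
FIELD_DEF = """
<field_definition>
  <field_group id="ptrc_T" grid_ref="grid_T_3D">
    <field id="AGE" long_name="water mass age" unit="sec" />
    <field id="AGE_d" long_name="water mass age" unit="days"> AGE / 86400. </field>
    <field id="O2" long_name="Oxygen Concentration" unit="umol/L" />
  </field_group>
</field_definition>
"""

# Shortened from file_def.xml; NO3 is a hypothetical, undefined field.
FILE_DEF = """
<file_definition>
  <file id="file11" name_suffix="_ptrc_T" enabled=".TRUE.">
    <field field_ref="AGE_d" name="votrcage" />
    <field field_ref="O2" name="O2" />
    <field field_ref="NO3" name="NO3" />
  </file>
</file_definition>
"""

def field_grids(field_def_xml):
    """Map each field id to its grid_ref, inheriting from enclosing groups."""
    grids = {}

    def walk(node, inherited):
        grid = node.get("grid_ref", inherited)
        if node.tag == "field" and node.get("id"):
            grids[node.get("id")] = grid
        for child in node:
            walk(child, grid)

    walk(ET.fromstring(field_def_xml), None)
    return grids

def check_file_def(file_def_xml, grids):
    """Return (field_ref, grid_ref or None) for each field requested in a file."""
    return [
        (field.get("field_ref"), grids.get(field.get("field_ref")))
        for field in ET.fromstring(file_def_xml).iter("field")
    ]

for ref, grid in check_file_def(FILE_DEF, field_grids(FIELD_DEF)):
    print(ref, grid if grid else "NOT DEFINED in field_def")
```

A field reported as not defined, or resolved to a 2D grid while the code sends a 3D array, would be a candidate for the kind of mismatch suggested earlier in the thread.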

domain_def.xml

   <domain_definition>
     <domain_group id="grid_T">
       <domain id="grid_T" long_name="grid T"/>
       <!--   My zoom: example of hand defined zoom   -->
       <domain id="myzoom" domain_ref="grid_T" >
        <zoom_domain id="myzoom"  ibegin="10" jbegin="10" ni="5" nj="5" />
       </domain>

I tried another run without any output defined for the passive tracers in file_def.xml. There are 2 new errors, one in the ECHAM model and the other in the NEMO model:

In the NEMO model, the error occurs at line 812 of trc_sms_my_trc.f90: area_nest = glob_sum( e1e2t(:,:) * tmask(:,:,1) ). This line calculates the global sum on the nest grid.

The other one occurred when ECHAM tried to initialize the communication with OASIS:

      ! Initialize parallel I/O with CDI and return communicator for model
      ! compute PEs
#ifdef __prism  /* coupled */
#ifdef __oa3mct
      p_all_comm = pioInit(p_global_comm, nprocio, iomode, pio_namespace, &
                           partInFlate, pio_uncouple)
      IF (p_all_comm == p_comm_null) RETURN