Segmentation fault crash in a channel configuration based on AMM12

Hi,

I am trying to set up an idealised configuration in NEMO v4.0.4 that is loosely based on the channel model detailed here.

I have also followed this guide on setting up XIOS and NEMO on Monsoon, and I have successfully run one of the example configurations, GYRE_PISCES, so it looks like everything is working as it should.

For some context, I have based the channel model on NEMO's example configuration AMM12, since the channel model will be ocean based. To set up the channel model, I have defined a bathy_meter.nc file and then used tools/DOMAINcfg to generate a domain_cfg.nc file. This is then used to calculate the background and forcing states, which go into the namelist_cfg file. So in this idealised config I am setting ln_read_cfg = .true. and cn_domcfg = "domain_cfg" under &namcfg, which is not the case in GYRE_PISCES.
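Concretely, the relevant part of namelist_cfg looks roughly like this (a sketch; other &namcfg entries are left at their namelist_ref values):

&namcfg        ! parameters of the configuration
   ln_read_cfg = .true.         ! read the domain configuration from a file
   cn_domcfg   = "domain_cfg"   ! name of the domain_cfg.nc file produced by tools/DOMAINcfg
/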

So, what isn't working? The idealised configuration is crashing. Here is some of the testing.output:

ModuleCmd_Switch.c(172):ERROR:152: Module 'PrgEnv-cray/5.2.82' is currently not loaded
ModuleCmd_Switch.c(172):ERROR:152: Module 'intel/15.0.0.090' is currently not loaded
 _ __   ___ _ __ ___   ___           
| '_ \ / _ \ '_ ' _ \ / _ \          
| | | |  __/ | | | | | (_) |         
|_| |_|\___|_| |_| |_|\___/  v4.0.4  
Application 181022084 is crashing. ATP analysis proceeding...
atpFrontend.exe: main: retrieveRawMBT:: recv of BT_HERE_IS_BACKTRACE failed

atpAppSigHandler timed out waiting for shutdown. Re-raising signal.
atpAppSigHandler timed out waiting for shutdown. Re-raising signal.
............ 
atpAppSigHandler timed out waiting for shutdown. Re-raising signal.
atpAppSigHandler timed out waiting for shutdown. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 06872] [c7-2c2s6n0] [Tue Nov  8 16:04:36 2022] PE RANK 22 exit signal Segmentation fault
[NID 06872] 2022-11-08 16:04:36 Apid 181022084: initiated application termination
Application 181022084 exit codes: 139
Application 181022084 exit signals: Killed
Application 181022084 resources: utime ~10s, stime ~17s, Rss ~51432, inblocks ~208435, outblocks ~4
going into postprocessing stage...
============================= PBS epilogue =============================

The line "atpAppSigHandler timed out waiting for shutdown. Re-raising signal." repeats once for each process, 60 in total. The two module errors also appear at the top of the GYRE_PISCES testing.output, so I doubt they are the cause; why they appear at all is not clear to me, since those modules are swapped out for more recent ones early on.

Interestingly, after running ./nemo, some output files (…grid_T_0000.nc, …grid_W_0000.nc) do materialise, but they disappear once the run crashes completely.

I would really appreciate some help getting this configuration up and running. I’m happy to provide more details if need be.

Cheers

Hei,

Are you running in mpp/batch mode, or interactively?

/robinson

The ./nemo command is being executed in a batch job script like this:

#!/bin/bash
#!
##submit_nemo.pbs
#PBS -q normal
#PBS -A user
#PBS -l select=3 
#PBS -l walltime=04:00:00
#PBS -N UNAGI 
#PBS -o testing.output 
#PBS -e testing.error 
#PBS -j oe
#PBS -V

cd $HOME/NEMO/NEMO_4.0.4_mirror/cfgs/UNAGI/EXP_R100/

export OCEANCORES=60 #Number of cores used (30 per node running NEMO) 
export XIOSCORES=2 #Number of cores to run XIOS (=number of nodes running NEMO)

ulimit -c unlimited
ulimit -s unlimited

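# MPMD launch: 2 XIOS server ranks on one node, 60 NEMO ranks spread over two nodes (30 per node)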
aprun -b -n $XIOSCORES -N 2 ./xios_server.exe : -n $OCEANCORES -N 30 ./nemo

It is difficult to say what causes the crash, but in terms of method I would go back to the configuration that runs and modify one thing at a time, testing that it still runs after each change.

Do you have an ocean.output file?
If yes, does it contain “E R R O R”?
If there is no ocean.output at all, the model did not start; you may have issues in iodef.xml or in the nammpp block of namelist_cfg.
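In case it helps, a quick way to check from the run directory (a shell sketch; adjust the path if your experiment writes its output elsewhere):

grep -n "E R R O R" ocean.output   # NEMO writes its error tag with the letters spaced out
ls -l ocean.output                 # if this file does not exist, the model never got past initialisation

One thing I would also double-check in nammpp (from memory of v4.0, so treat as a hint): jpni and jpnj can usually be left at 0 so the domain decomposition is chosen automatically; if you set them explicitly they need to be consistent with the 60 NEMO processes in your aprun line.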

Thanks for your responses. Following @robinson's simple advice, I went back to a simple model, GYRE_PISCES, and made sure that it ran, which it did. I then modified the files accordingly to set up a simple channel model, which is now also running.


I guess it is too late, but there is already a channel configuration that comes with the code: tests/CANAL.