What do these parameters mean?

Hi, everyone. I’m a beginner. I want to use NEMO-4.2.2 as a workload to test my machine, which has 80 physical cores and 160 logical cores.

I use OpenMPI and the GYRE_PISCES configuration for the test, and I ran the command

mpirun --use-hwthread-cpus -n 160 ./nemo

However, my terminal displays an error:

Abort(123) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 123) - process 0
Abort(123) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 123) - process 0
Abort(123) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 123) - process 0

prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

Process name: [prterun-black-1592854@1,10] Exit code: 123

I only modified some parameters in namelist_cfg and namelist_ref. The modified parts of namelist_cfg and namelist_ref and the ocean.output are as follows:

The modified part of namelist_cfg is: nn_GYRE, jpiglo, jpjglo, rn_Dt

!-----------------------------------------------------------------------
&namusr_def ! GYRE user defined namelist
!-----------------------------------------------------------------------
nn_GYRE = 4 ! GYRE resolution [1/degrees]
ln_bench = .true. ! ! =T benchmark with gyre: the gridsize is kept constant
jpiglo = 1440
jpjglo = 720
jpkglo = 31 ! number of model levels
/
!-----------------------------------------------------------------------
&namdom ! time and space domain
!-----------------------------------------------------------------------
ln_linssh = .true. ! =T linear free surface ==>> model level are fixed in time
!
rn_Dt = 1200. ! time step for the dynamics
/

The modified part of namelist_ref is: jpni, jpnj

!-----------------------------------------------------------------------
&nammpp ! Massively Parallel Processing
!-----------------------------------------------------------------------
ln_listonly = .false. ! do nothing else than listing the best domain decompositions (with land domains suppression)
! ! if T: the largest number of cores tested is defined by max(mppsize, jpni*jpnj)
ln_nnogather = .true. ! activate code to avoid mpi_allgather use at the northfold
jpni = 10 ! number of processors following i (set automatically if < 1), see also ln_listonly = T
jpnj = 16 ! number of processors following j (set automatically if < 1), see also ln_listonly = T
nn_hls = 1 ! halo width (applies to both rows and columns)
nn_comm = 1 ! comm choice
/

And the ocean.output is as follows:

AAAAAAAA

par_kind : wp = Working precision = dp = double-precision



 ===>>> : E R R O R

         ===========

misspelled variable in namelist namusr_def in configuration namelist iostat =  5010

   
usr_def_nam  : read the user defined namelist (namusr_def) in namelist_cfg
Namelist namusr_def : GYRE case
   GYRE used as Benchmark (=T)                      ln_bench  =  T
   inverse resolution & implied domain size         nn_GYRE   =            4
   Ni0glo = 30*nn_GYRE                              Ni0glo =          122
   Nj0glo = 20*nn_GYRE                              Nj0glo =           82
   number of model levels                           jpkglo =            0

Namelist nammpp
   processor grid extent in i                            jpni =           10
   processor grid extent in j                            jpnj =           16
   avoid use of mpi_allgather at the north fold  ln_nnogather =  T
   halo width (applies to both rows and columns)       nn_hls =            1
   choice of communication method                     nn_comm =            1

mpp_init:

  The chosen domain decomposition   10 x   16 with     159 land subdomains
     - uses a total of    1 mpi process
     - has mpi subdomains with a maximum size of (jpi =   15, jpj =    8, jpi*jpj =     120)
  The best domain decompostion    1 x    1 with       0 land subdomains
     - uses a total of    1 mpi process
     - has mpi subdomains with a maximum size of (jpi =  124, jpj =   84, jpi*jpj =   10416)

 ===>>> : E R R O R

         ===========

   With this specified domain decomposition: jpni =   10 jpnj =   16
   we can eliminate only    0 land mpi subdomains therefore
   the number of ocean mpi subdomains ( 160) exceed the number of MPI processes:   1

    ==>>> There is the list of best domain decompositions you should use:



                  For your information:
  list of the best partitions including land supression
  -----------------------------------------------------

nb_cores oce:      1, land domains excluded:      0 ( 0.0%), largest oce domain:     10416 (    124 x     84 )

Can someone tell me what these parameters mean? How should I modify the parameters?

Thank you very much.


Exit code 123 means that NEMO detected at least 1 error. As you saw, errors are reported in the ocean.output file.
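If it helps, a quick way to spot them (assuming ocean.output sits in the directory you launched mpirun from) is:

grep -B 1 -A 3 'E R R O R' ocean.output

which prints each error banner together with the lines that explain it.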

In your case, as written in your ocean.output file, NEMO detected 2 errors:

  1. misspelled variable in namelist namusr_def in configuration namelist: in other words, you have a typo somewhere in one of the variable names in the namusr_def block of the file namelist_cfg.
jpiglo = 1440
jpjglo = 720

are not part of namusr_def and should not be there.

  2. the number of ocean mpi subdomains ( 160) exceed the number of MPI processes: 1: this means that you ran NEMO on 1 MPI process instead of 160. In other words, your command mpirun --use-hwthread-cpus -n 160 ./nemo is not working as expected, since it attributes 1 MPI task to NEMO instead of 160… I would try mpirun -np 80 ./nemo (use only the physical cores if you want good performance). A sketch of the resulting namelist block is shown below.
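For reference, a minimal sketch of what your &namusr_def block could then look like, based only on the values you already posted (jpiglo and jpjglo removed, everything else unchanged):

!-----------------------------------------------------------------------
&namusr_def    !   GYRE user defined namelist
!-----------------------------------------------------------------------
   nn_GYRE  = 4        ! GYRE resolution [1/degrees]
   ln_bench = .true.   ! =T benchmark with gyre: the gridsize is kept constant
   jpkglo   = 31       ! number of model levels
/

launched with, for example:

mpirun -np 80 ./nemo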

Thank you for your reply. I just deleted jpiglo, jpjglo and used the command

mpirun -n 80 ./nemo

to run the application again.
However, it still displays an error message:

Abort(123) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 123) - process 0
Abort(123) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 123) - process 0 …

prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

Process name: [prterun-black-2885641@1,43] Exit code: 123

I have two conjectures:

  1. The parameters jpni and jpnj were set wrongly. If so, how should I set these two parameters?
  2. The number of MPI processes doesn't match the number of physical cores I'm trying to run on, as the output shows uses a total of 1 mpi process even though I use 80 physical cores. If so, how do I get the number of MPI processes actually running up from 1 to 80?

Looking forward to your reply.

You must look at the error message in ocean.output file.
You can keep jpni and jpnj at their default value of 0.
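Concretely, that means a &nammpp block along these lines (a sketch; as the comments in your own namelist already say, values < 1 let NEMO choose the decomposition automatically):

!-----------------------------------------------------------------------
&nammpp        !   Massively Parallel Processing
!-----------------------------------------------------------------------
   jpni = 0    ! number of processors following i (set automatically if < 1)
   jpnj = 0    ! number of processors following j (set automatically if < 1)
/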

I looked at the ocean.output file. Part of the file reads as follows:

mpp_init:

  The chosen domain decomposition    1 x    1 with    -155 land subdomains
     - uses a total of  156 mpi process
     - has mpi subdomains with a maximum size of (jpi =  124, jpj =   84, jpi*jpj =   10416)
  The best domain decompostion   13 x   12 with       0 land subdomains
     - uses a total of  156 mpi process
     - has mpi subdomains with a maximum size of (jpi =   12, jpj =    9, jpi*jpj =     108)

 ===>>> : W A R N I N G

         ===============


    ==> You could therefore have smaller mpi subdomains with the same number of mpi processes

    ---   YOU ARE WASTING CPU...   ---


  The number of mpi processes:   156
  exceeds the maximum number of subdomains (ocean+land) =     1
  defined by the following domain decomposition: jpni =    1 jpnj =    1
   You should: 
     - either properly prescribe your domain decomposition with jpni and jpnj
       in order to be consistent with the number of mpi process you want to use
       even IF it not the best choice...
     - or use the automatic and optimal domain decomposition and pick up one of
       the domain decomposition proposed in the list bellow


                  For your information:
  list of the best partitions including land supression
  -----------------------------------------------------

Therefore, I think I have a problem with the domain decomposition. The solution is either to prescribe my domain decomposition with jpni and jpnj or to use the automatic and optimal domain decomposition.
I prefer the latter, but I don't know how to do that.

Can you help me? Thank you very much.

As I said, set jpni and jpnj to 0.

I ran it successfully. Thank you very much. :pray: