REBUILD not working: tiling in ocean.output and the actual set of netCDF files do not match

Good evening

I have recently installed NEMO 4.0 on a local Ubuntu machine. The code runs fine and writes its output correctly to np separate .nc files (one per MPI process), and the same code works as expected on other systems. However, rebuild_nemo is unable to rebuild the files. I have also noticed that ocean.output reports a different domain decomposition than the one actually written to the files, so I suspect the two things are related. Below is a copy-paste of the decomposition figure from ocean.output:

MPI Message Passing MPI - domain lay out over processors

    defines mpp subdomains
       jpni =            5
       jpnj =            3

       sum ilci(i,1) =          100  jpiglo =           92
       sum ilcj(1,j) =           66  jpjglo =           62

           ****************************************************************
           *              *               *               *               *
         3 *   20  x 22   *    20  x 22   *    20  x 22   *    20  x 22   *
           *         10   *          11   *          12   *          13   *
           *              *               *               *               *
           ****************************************************************
           *              *               *               *               *
         2 *   20  x 22   *    20  x 22   *    20  x 22   *    20  x 22   *
           *          5   *           6   *           7   *           8   *
           *              *               *               *               *
           ****************************************************************
           *              *               *               *               *
         1 *   20  x 22   *    20  x 22   *    20  x 22   *    20  x 22   *
           *          0   *           1   *           2   *           3   *
           *              *               *               *               *
           ****************************************************************
                   1               2               3               4

           ****************
           *              *
         3 *   20  x 22   *
           *         14   *
           *              *
           ****************
           *              *
         2 *   20  x 22   *
           *          9   *
           *              *
           ****************
           *              *
         1 *   20  x 22   *
           *          4   *
           *              *
           ****************
                   5

    resulting internal parameters :
       nproc  =            0
       nowe   =           -1    noea  =             1
       nono   =            5    noso  =            -1
       nbondi =           -1
       nbondj =           -1
       npolj  =            0
     l_Iperio =  F
     l_Jperio =  F
       nlci   =           20
       nlcj   =           22
       nimpp  =            1
       njmpp  =            1
       nreci  =            2
       nrecj  =            2
       nn_hls =            1

 mpp_init_ioipsl :   iloc  =           20          22
 ~~~~~~~~~~~~~~~     iabsf =            1           1
                     ihals =            0           0
                     ihale =            1           1

As you can see, according to ocean.output I have 20x22-point subdomains. However, the domain is actually “sliced” along only one axis. Below is the result of ncdump -h output_0000.nc:

dimensions:
	axis_nbounds = 2 ;
	x = 92 ;
	y = 5 ;
	deptht = 31 ;
	time_counter = UNLIMITED ; // (1080 currently)
[...]
// global attributes:
		:name = "GYRE_5d_00010101_00151230_grid_T" ;
		:description = "ocean T grid variables" ;
		:title = "ocean T grid variables" ;
		:Conventions = "CF-1.6" ;
		:timeStamp = "2024-Jul-10 16:03:44 GMT" ;
		:uuid = "001c09ff-ba70-43fc-871b-37186c6833bf" ;
		:ibegin = 0 ;
		:ni = 92 ;
		:jbegin = 0 ;
		:nj = 5 ;
		:DOMAIN_number_total = 15 ;
		:DOMAIN_number = 0 ;
		:DOMAIN_dimensions_ids = 2, 3 ;
		:DOMAIN_size_global = 92, 62 ;
		:DOMAIN_size_local = 92, 5 ;
		:DOMAIN_position_first = 1, 1 ;
		:DOMAIN_position_last = 92, 5 ;
		:DOMAIN_halo_size_start = 0, 0 ;
		:DOMAIN_halo_size_end = 0, 0 ;
		:DOMAIN_type = "box" ;
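
To confirm that this is not just the first file, a quick loop like the one below (assuming the grid_T file names shown above) prints the decomposition attributes of every per-rank file:

for f in GYRE_5d_00010101_00151230_grid_T_*.nc ; do
    ncdump -h "$f" | grep -E 'DOMAIN_(number|size_local|position)'
done

All of them report full-width slices (ni equal to jpiglo), i.e. a 1x15 split instead of the 5x3 layout printed in ocean.output.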

When I execute rebuild_nemo, it gives me:

 Rebuilding the following files:
 GYRE_5d_00010101_00151230_grid_W_0000.nc
 [...]
 GYRE_5d_00010101_00151230_grid_W_0013.nc
 GYRE_5d_00010101_00151230_grid_W_0014.nc
 Size of global arrays:           92          62
 Finding rebuild dimensions from the first file...
 Rebuilding across dimensions x and y
 Copying attribute name into destination file...
 Copying attribute description into destination file...
 Copying attribute title into destination file...
 Copying attribute Conventions into destination file...
 Copying attribute timeStamp into destination file...
 Copying attribute uuid into destination file...
 Copying attribute ibegin into destination file...
 Copying attribute ni into destination file...
 Copying attribute jbegin into destination file...
 Copying attribute nj into destination file...
 Writing new file_name attribute
 Writing new TimeStamp attribute
 ERROR! : NetCDF: Name contains illegal characters
 *** NEMO rebuild failed ***
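
For reference, the call is just the standard REBUILD_NEMO wrapper usage as far as I know, i.e. the base file name plus the number of domains, with no extra options, something like:

./rebuild_nemo GYRE_5d_00010101_00151230_grid_W 15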

Is there any reason for this behavior, and any idea how to solve it (short of writing my own custom rebuild_nemo)?

Thanks to all.

Ciao,

But why don’t you use XIOS?

To be fair, because I have never needed XIOS, and there is nobody in my team who can help me set up a configuration with it. I want to stress that the code has worked as expected on all other machines over the past 4 years, so I am really wondering what kind of behavior this is, since it is unexpected.

I have just tested it, and this behavior occurs at higher resolutions as well, to the point that if I run the GYRE setup on 80 cores I get slices of only 2 points:

		:name = "GYRE_5d_01110101_01211230_grid_T" ;
		:description = "ocean T grid variables" ;
		:title = "ocean T grid variables" ;
		:Conventions = "CF-1.6" ;
		:timeStamp = "2024-Jul-12 08:46:13 GMT" ;
		:uuid = "425e9412-419b-4863-89d1-e850b65ef33d" ;
		:ibegin = 0 ;
		:ni = 272 ;
		:jbegin = 82 ;
		:nj = 2 ;
		:DOMAIN_number_total = 80 ;
		:DOMAIN_number = 30 ;
		:DOMAIN_dimensions_ids = 2, 3 ;
		:DOMAIN_size_global = 272, 182 ;
		:DOMAIN_size_local = 272, 2 ;
		:DOMAIN_position_first = 1, 83 ;
		:DOMAIN_position_last = 272, 84 ;
		:DOMAIN_halo_size_start = 0, 0 ;
		:DOMAIN_halo_size_end = 0, 0 ;
		:DOMAIN_type = "box" ;

while ocean.output shows that the subdomains are 36x20. This could be seen as “not a big problem”, solvable with a custom rebuild, but it also has an impact on communication: in the “tiled” layout each subdomain exchanges roughly 2 × (36 + 20) = 112 points per vertical level, whereas the slices each share 2 × 272 = 544 points, which makes the communication much more expensive.

Any suggestions on where to look to investigate this issue?

Well, it turns out this particular machine does not like being given a hostfile, so instead of running

mpirun -np X -hostfile hostfile ./nemo

(where the hostfile contains localhost slots=X) I tried a simple

mpirun -np X ./nemo

and then the slices are coherent with ocean.output and the rebuild works. So I am marking the topic as solved, but if someone wants to offer suggestions on this behavior, it would still be appreciated.
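
For the record, the hostfile contained nothing exotic, just the single line below with the slot count matching the number of MPI tasks (so slots=15 for the 15-process run above):

localhost slots=15

I still do not see why its presence should change the domain decomposition.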
Cheers!

Hello.

OK, have you tried to use the REBUILD package instead of REBUILD_NEMO? When you download the NEMO source code, you have 2 packages for rebuilding the NEMO files:

1- REBUILD_NEMO
2- REBUILD

Both are located in the tools directory. Try the REBUILD package: it is an old package, but it works, even if it is slow.

rebuild_nemo does not always work as expected, so I recommend using the REBUILD package instead.
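
If you have not compiled it yet, both tools are built with maketools from the tools directory, if I remember correctly something like this (use the same arch file as for NEMO itself; YOUR_ARCH here is just a placeholder):

./maketools -m YOUR_ARCH -n REBUILD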

Best

Saeed