Inexplicable XIOS troubles

Good afternoon to all,

This is more of a long shot than a real help request, and I am sorry if this post does not follow the conventions of this community, but a colleague and I have been having trouble with XIOS since last week.

In short, NEMO stalls when it calls XIOS to write the output. This started happening out of the blue last week, on installations that were previously fully functional. Furthermore, my colleague and I do not work on the same cluster, and yet it happened to both of us simultaneously. I would like to stress that the two clusters we work on are not even maintained by the same people or institutions.

Is anyone else experiencing this unexpected stalling? Does anyone want to take a shot in the dark and suggest a way to debug this? Any hint is appreciated, as usual.

(The obvious workaround is to compile without the key_xios key, and that works as usual, but we rather liked not having 84 different files to rebuild and we hope to keep it that way.)

Let me know,
Francesco

Hi Francesco,

We have had similar issues a few times, always at least coinciding with software updates on the HPC system.

The last occurrence was related to updates to the Lustre file system on that machine. The issue was solved by switching back and forth between different versions of the Lustre driver. Unfortunately, I could not get reliable information on which version caused the problem and which one is robust. I suggest reaching out to your system admins and asking about any changes made around the time the problem first occurred.

I cannot say whether this was related, but since you mention different systems being affected at the same time: when we last encountered it, another HPC center also announced issues with their Lustre system, which I think could be explained by a common driver update.

Another indication that this was not a NEMO/XIOS issue in itself was that we saw it with different NEMO and XIOS versions and revisions at the same time.
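
One way to gather evidence before contacting the admins is to check whether plain file I/O on the output file system also hangs, with no NEMO or XIOS involved. The following is only a minimal sketch, assuming Python 3 is available on the cluster. The function name probe_write and its parameters are made up for this illustration, and it exercises only a single serial write, not the parallel I/O that XIOS performs, so a clean result does not rule out a file-system problem.

```python
import subprocess
import sys
import tempfile
import time

# Code run in a child process: write n MiB to a temp file in directory d,
# force it to disk with fsync, then delete the file.
_CHILD_CODE = """
import os, sys, tempfile
d, n = sys.argv[1], int(sys.argv[2])
fd, path = tempfile.mkstemp(dir=d, prefix="io_probe_")
try:
    with os.fdopen(fd, "wb") as f:
        for _ in range(n):
            f.write(bytes(1024 * 1024))
        f.flush()
        os.fsync(f.fileno())
finally:
    os.remove(path)
"""


def probe_write(directory=None, size_mb=16, timeout_s=60):
    """Hypothetical helper: time a write+fsync in `directory`.

    Returns the elapsed time in seconds, or None if the child process
    did not finish within `timeout_s` (i.e. the write appears to stall).
    `directory` defaults to the system temp dir; pass the run's output
    directory to probe the file system that NEMO actually writes to.
    """
    if directory is None:
        directory = tempfile.gettempdir()
    start = time.monotonic()
    child = subprocess.Popen(
        [sys.executable, "-c", _CHILD_CODE, directory, str(size_mb)]
    )
    try:
        returncode = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Do not wait after kill(): a process blocked in file-system I/O
        # may not terminate until the I/O call returns.
        child.kill()
        return None
    if returncode != 0:
        raise RuntimeError(f"probe failed with exit code {returncode}")
    return time.monotonic() - start
```

Calling probe_write("/path/to/your/output/dir") from a compute node (the path here is a placeholder) and getting None back would suggest that the stall is reproducible outside the model, which is useful information to pass on to the system admins.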

I hope this gives some hints.

Best wishes

Franziska