Description
What is wrong?
I started a year-long run on 1 May 2021. Increasing the number of retries allowed it to run through mid-November, but it only saved restart files through 18 August 2021.
I increased restart_interval_gfs in config.base to 24, but this caused the model to crash.
I decreased restart_interval_gfs in config.base to 21, which seems to work fine, and should produce restart files through October.
What should have happened?
The model should be able to read lines longer than 1024 characters. In the meantime, it would be nice to allow a restart_interval of 24 hours.
With a version of the code from just after the Rocky8 switchover, I was able to get the model to run through early December on a regular basis, so I suspect this behavior is new.
What machines are impacted?
Hera
Steps to reproduce
- Compile a recent version of global-workflow
- Create an experiment with FHMAX set to at least 2640. I did an ATM-only experiment to reduce run-time.
- Check for restart files after hour 2616
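As a quick cross-check, forecast hour 2616 lines up with the 18 August 2021 date of the last restart files reported above. The sketch below assumes the run started at 00Z on 1 May 2021 (the start hour is not stated in this report), and `valid_time` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def valid_time(start, forecast_hour):
    """Return the calendar time reached after `forecast_hour` hours of model time."""
    return start + timedelta(hours=forecast_hour)


# Assumption: the experiment started at 00Z on 1 May 2021.
start = datetime(2021, 5, 1, 0)

print(valid_time(start, 2616))  # last hour with restart files in this report
print(valid_time(start, 2640))  # minimum FHMAX suggested in the steps above
```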
Additional information
It would be nice if the model produced a clear error message when the last restart_interval value is less than the preceding one, before all the MPI_Terminate notifications.
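A minimal sketch of the kind of check requested here, written in Python for illustration only. The function name `check_restart_intervals` and the message wording are hypothetical and are not part of the model or of ESMF.

```python
def check_restart_intervals(values):
    """Raise ValueError naming the first entry that does not exceed its predecessor.

    `values` is the list of restart hours as parsed from the configuration line.
    """
    for i in range(1, len(values)):
        if values[i] <= values[i - 1]:
            raise ValueError(
                f"restart_interval entry {i + 1} ({values[i]}) is not greater "
                f"than entry {i} ({values[i - 1]}); the list may have been "
                "truncated while reading the configuration line"
            )


if __name__ == "__main__":
    check_restart_intervals([2568, 2592, 2616])      # passes silently
    try:
        check_restart_intervals([2592, 2616, 26])    # a cut-off final entry
    except ValueError as err:
        print(err)
```

Running a check like this right after the line is parsed would report the problem once, before any MPI ranks abort.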
Do you have a proposed solution?
Investigating the problem showed that the last restart file written was the last one listed in the first 1024 characters of the line in the model_configure file.
Further investigation led me to this ESMF constant, which defines the length of the line buffer in ESMF. ESMF does not appear to update that buffer in the ESMF_ConfigGetString calls that ESMF_ConfigGetLen uses to count how many entries are on the line.
It appears that restart_interval=12 and restart_interval=21 do not crash while restart_interval=24 does because ESMF only reads in the first 1023 characters of the line and uses the last character of the buffer as a sentinel (I'm not sure whether that is a null character or an EOL character). I will be testing that hypothesis with restart_interval=40. (EDIT: the restart_interval=40 run is proceeding well, with restart files available in December.)
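The hypothesis can be illustrated with a small Python sketch that keeps only the first 1023 characters of a generated line and parses what is left. This is not ESMF code. The line layout (key, then hours separated by single spaces) and the helper names `build_line`, `parse_hours`, and `hours_seen` are assumptions made for illustration, so the exact cut point will differ from the real model_configure file.

```python
# Assumption taken from this report: a 1024-character buffer whose last
# character is reserved, leaving 1023 usable characters.
USABLE_CHARS = 1023


def build_line(interval, fhmax, key="restart_interval:"):
    """Hypothetical layout: the key, then every restart hour, space-separated."""
    hours = range(interval, fhmax + 1, interval)
    return key + " " + " ".join(str(h) for h in hours)


def parse_hours(line):
    """Return the integer entries that follow the key."""
    return [int(token) for token in line.split()[1:]]


def hours_seen(line, usable=USABLE_CHARS):
    """Entries a reader gets if it keeps only the first `usable` characters."""
    return parse_hours(line[:usable])


if __name__ == "__main__":
    for interval in (12, 21, 24, 40):
        line = build_line(interval, fhmax=8760)
        seen = hours_seen(line)
        # Show the line length and the last two entries that survive the cut.
        print(interval, len(line), seen[-2:])
```

If the cut falls between two entries, the list simply ends early and no later restart files are written. If it falls inside an entry, the final value is a shortened number that is smaller than the one before it, which matches the condition described under "Additional information".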