UFS will not write more than three hundred restart files #2761

@DWesl

What is wrong?

I started a year-long run on 1 May 2021. Increasing the number of retries allowed it to run through mid-November, but it only saved restart files through 18 August 2021.

I increased restart_interval_gfs in config.base to 24, but this caused the model to crash.
I decreased restart_interval_gfs in config.base to 21, which seems to work fine, and should produce restart files through October.

What should have happened?

The model should be able to read lines longer than 1024 characters. In the meantime, it would be nice to allow a restart_interval of 24 hours.

With a version of the code from just after the Rocky8 switchover, I was able to get the model to run through early December on a regular basis, so I suspect this behavior is new.

What machines are impacted?

Hera

Steps to reproduce

  1. Compile a recent version of global-workflow
  2. Create an experiment with FHMAX set to at least 2640. I did an ATM-only experiment to reduce run-time.
  3. Check for restart files after hour 2616
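As a quick check on the hour numbers in the steps above, here is a small sketch that converts a forecast hour to a calendar date. It assumes the run starts at 00Z on 1 May 2021; the helper name `valid_time` is only for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the run starts at 00Z on the start date given in this report.
START = datetime(2021, 5, 1)

def valid_time(forecast_hour):
    """Return the calendar time that a forecast hour corresponds to."""
    return START + timedelta(hours=forecast_hour)

print(valid_time(2616).date())  # 2021-08-18, the date of the last restart files I got
print(valid_time(2640).date())  # 2021-08-19, the minimum FHMAX suggested above
```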

Additional information

It would be helpful if the model produced a clear error message when the last restart_interval value is less than the preceding one, before all the MPI_Terminate notifications.
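A rough sketch of the kind of check I have in mind, written in Python only to illustrate the logic. The function name `check_restart_hours` and the message wording are hypothetical and are not taken from the model code.

```python
def check_restart_hours(hours):
    """Raise ValueError naming the first entry that does not increase.

    `hours` is the list of restart hours as parsed from the restart_interval line.
    """
    for i in range(1, len(hours)):
        if hours[i] <= hours[i - 1]:
            raise ValueError(
                f"restart_interval entry {i + 1} ({hours[i]}) is not greater "
                f"than the preceding entry ({hours[i - 1]}); "
                "the line may have been truncated when it was read"
            )

check_restart_hours([12, 24, 36])      # passes silently
try:
    check_restart_hours([12, 24, 3])   # last entry looks cut short
except ValueError as err:
    print(err)
```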

Do you have a proposed solution?

Investigating the problem showed that the last restart file written was the last one listed in the first 1024 characters of the restart_interval line in the model_configure file.

Further investigation led me to this ESMF constant, which defines the length of the line buffer in ESMF. ESMF does not appear to update that buffer in the ESMF_ConfigGetString calls that ESMF_ConfigGetLen uses to count how many entries are on the line.

My hypothesis for why the model runs with restart_interval=12 and restart_interval=21 but crashes with restart_interval=24 is that ESMF only reads in the first 1023 characters of the line and uses the last character of the buffer as a sentinel (not sure if that's a null character or EOL character). I will be testing that hypothesis with restart_interval=40. (EDIT: restart_interval=40 run proceeding well, with restart files available in December).
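To illustrate the hypothesis, here is a minimal Python sketch of what a fixed-width line buffer would do to a long list of restart hours. This is not ESMF code. The label spacing, the 1023-character default, and the function name `visible_restart_hours` are assumptions, so the exact cut-off hour will differ from what a real model_configure file gives.

```python
def visible_restart_hours(interval, fhmax, label="restart_interval:", width=1023):
    """Return (hours, partial) as seen through a buffer of `width` characters.

    hours   -- integers parsed from the part of the line that fits in the buffer
    partial -- True if the buffer ends in the middle of a number
    """
    all_hours = range(interval, fhmax + 1, interval)
    line = label + " " + " ".join(str(h) for h in all_hours)
    visible = line[:width]  # everything past the buffer is silently dropped
    hours = [int(tok) for tok in visible[len(label):].split()]
    # The last number is cut short when digits sit on both sides of the cut.
    partial = len(line) > width and line[width] != " " and visible[-1] != " "
    return hours, partial

# Tiny example with a 10-character buffer so the effect is easy to see:
# the full line is "r: 12 24 36 48 60", the buffer holds "r: 12 24 3".
print(visible_restart_hours(12, 60, label="r:", width=10))  # ([12, 24, 3], True)
```

If the cut lands between numbers, the list just ends early, which would match restart files silently stopping. If it lands inside a number, the last value read is smaller than the one before it, which would match the crash.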
