
ww3_tp2.6 regression test hanging #1350

Open · sbanihash opened this issue Jan 23, 2025 · 3 comments · May be fixed by #1357
Labels: bug (Something isn't working)

@sbanihash (Collaborator)

Describe the bug
The ww3_tp2.6 regression test stalls and reaches the time limit. The log file shows that the model hangs at this stage:

        output dates out of run dates : Track point output deactivated
        output dates out of run dates : Nesting data deactivated
        output dates out of run dates : Partitioned wave field data deactivated
        output dates out of run dates : Restart files second request deactivated
   Wave model ...

slurmstepd: error: *** JOB 5132044 ON h11c53 CANCELLED AT 2025-01-13T23:27:44 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 5132044.36 ON h11c53 CANCELLED AT 2025-01-13T23:27:44 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
#############################

The error seems to have been introduced by PR #1333.

To Reproduce
Run ww3_tp2.6 (I encountered the issue while running matrix05):
run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -w work_pdlib -g pdlib -f -p srun -n 24 ../model ww3_tp2.6

Expected behavior
The test should complete within the allotted wall-clock time; instead, the run stalls and hits the time limit.

Log file:
matrix05_out.txt

sbanihash added the bug label on Jan 23, 2025
@JessicaMeixner-NOAA (Collaborator)

@thesser1 reported this: #1333 (comment)

@JessicaMeixner-NOAA (Collaborator)

@thesser1, when you get the error:

Rank 220 [Mon Jan 13 18:09:49 2025] [c1-0c1s12n1] Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
MPIR_CRAY_Bcast_Tree(405): message sizes do not match across processes in the collective routine: Received 1 but expected 18

Is there any chance you can get a line number associated with this?
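
For context, that error usually means different ranks entered the same MPI_Bcast call with different counts (here 1 vs. 18). I believe the trace comes from Cray MPICH's collective consistency checking; on builds where that check isn't active, the same mismatch tends to deadlock instead, which would match the time-limit cancellation in the original report. Below is a minimal sketch of the failure mode only (illustration, not WW3 code; the buffer name and sizes are made up):

    /* bcast_mismatch.c - illustration of ranks disagreeing on the count
     * passed to MPI_Bcast.  With collective parameter checking enabled
     * this aborts with a "message sizes do not match" error; otherwise
     * it may hang in the broadcast or silently truncate data. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int buf[18] = {0};
        /* Root broadcasts 18 integers, every other rank expects only 1. */
        int count = (rank == 0) ? 18 : 1;

        MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d done\n", rank);
        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and run on a few ranks, this either aborts with a size-mismatch error or hangs in the broadcast, depending on the MPI library; a debug build with tracebacks would then point at whichever WW3 broadcast has ranks disagreeing on the buffer size.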

@JessicaMeixner-NOAA (Collaborator)

Okay, I think I had the right idea but the wrong execution of the fix. Trying again, @thesser1. I'll post a PR after I've run tests this time... but hopefully I'll have something for you tomorrow.
