
ww3_tp2.6 regression test hanging #1350

Open · sbanihash opened this issue Jan 23, 2025 · 3 comments · May be fixed by #1357
Labels: bug (Something isn't working)

@sbanihash (Collaborator)

Describe the bug
The ww3_tp2.6 regression test stalls and reaches the time limit. The log file shows that the model hangs at this stage:

        output dates out of run dates : Track point output deactivated
        output dates out of run dates : Nesting data deactivated
        output dates out of run dates : Partitioned wave field data deactivated
        output dates out of run dates : Restart files second request deactivated
   Wave model ...

slurmstepd: error: *** JOB 5132044 ON h11c53 CANCELLED AT 2025-01-13T23:27:44 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 5132044.36 ON h11c53 CANCELLED AT 2025-01-13T23:27:44 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
#############################

The error seems to have been introduced by PR #1333.

To Reproduce
Run ww3_tp2.6 (I encountered the issue while running matrix05):
run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -w work_pdlib -g pdlib -f -p srun -n 24 ../model ww3_tp2.6

Expected behavior
The test should complete within the allotted wall-clock time; instead, the run stalls and hits the time limit.

Log file:
matrix05_out.txt

sbanihash added the bug label on Jan 23, 2025
@JessicaMeixner-NOAA (Collaborator)

@thesser1 reported this: #1333 (comment)

@JessicaMeixner-NOAA (Collaborator)

@thesser1, when you get the error:

Rank 220 [Mon Jan 13 18:09:49 2025] [c1-0c1s12n1] Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
MPIR_CRAY_Bcast_Tree(405): message sizes do not match across processes in the collective routine: Received 1 but expected 18

Is there any chance you can get a line number associated with this?
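
For context, that error usually means different ranks entered the same MPI_Bcast call with different counts (here 1 vs. 18). I believe the trace comes from Cray MPICH's collective consistency checking; on builds where that check isn't active, the same mismatch tends to deadlock instead, which would match the time-limit cancellation in the original report. Below is a minimal sketch of the failure mode only (illustration, not WW3 code; the buffer name and sizes are made up):

    /* bcast_mismatch.c - illustration of ranks disagreeing on the count
     * passed to MPI_Bcast.  With collective parameter checking enabled
     * this aborts with a "message sizes do not match" error; otherwise
     * it may hang in the broadcast or silently truncate data. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int buf[18] = {0};
        /* Root broadcasts 18 integers, every other rank expects only 1. */
        int count = (rank == 0) ? 18 : 1;

        MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d done\n", rank);
        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and run on a few ranks, this either aborts with a size-mismatch error or hangs in the broadcast, depending on the MPI library; a debug build with tracebacks would then point at whichever WW3 broadcast has ranks disagreeing on the buffer size.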

@JessicaMeixner-NOAA (Collaborator)

Okay, I think I had the right idea but the wrong execution of the fix. Trying again, @thesser1. I'll post a PR after I've run tests this time... but hopefully I'll have something for you tomorrow.
