Describe the bug
The ww3_tp2.6 regression test stalls and reaches the time limit. The log file shows that the model hangs at this stage:
output dates out of run dates : Track point output deactivated
output dates out of run dates : Nesting data deactivated
output dates out of run dates : Partitioned wave field data deactivated
output dates out of run dates : Restart files second request deactivated
Wave model ...
slurmstepd: error: *** JOB 5132044 ON h11c53 CANCELLED AT 2025-01-13T23:27:44 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 5132044.36 ON h11c53 CANCELLED AT 2025-01-13T23:27:44 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
#############################
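For triaging logs like the one above, here is a minimal Python sketch that pulls out the last line printed before Slurm cancelled the job. It is not part of WW3 or its regression scripts. The function name `find_stall` and the regular expression are assumptions, modelled only on the `slurmstepd` lines shown in this report.

```python
import re

# Assumed pattern, modelled on the slurmstepd lines in this report, e.g.
# slurmstepd: error: *** JOB 5132044 ON h11c53 CANCELLED AT ... DUE TO TIME LIMIT ***
TIMEOUT_RE = re.compile(
    r"slurmstepd: error: \*\*\* (JOB|STEP) (\S+) ON (\S+) "
    r"CANCELLED AT (\S+) DUE TO TIME LIMIT"
)


def find_stall(log_text):
    """Return (last_line_before_timeout, timeouts), or None if no timeout was logged.

    `find_stall` is a hypothetical helper for illustration only.
    """
    last_line = None
    timeouts = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        match = TIMEOUT_RE.search(stripped)
        if match:
            kind, ident, node, when = match.groups()
            timeouts.append({"kind": kind, "id": ident, "node": node, "time": when})
        elif not timeouts:
            # Only track output that appears before the first cancellation line.
            last_line = stripped
    if not timeouts:
        return None
    return last_line, timeouts


sample_log = """\
 output dates out of run dates : Restart files second request deactivated
 Wave model ...
slurmstepd: error: *** JOB 5132044 ON h11c53 CANCELLED AT 2025-01-13T23:27:44 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 5132044.36 ON h11c53 CANCELLED AT 2025-01-13T23:27:44 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
"""

last_line, timeouts = find_stall(sample_log)
print(last_line)  # the last output before the job was cancelled
```

On the excerpt above this points at `Wave model ...`, i.e. the hang occurs right after the output requests are processed and the model run starts.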
The error seems to have entered the code after PR #1333.
To Reproduce
Run ww3_tp2.6 (I encountered the issue while running matrix05):
run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -w work_pdlib -g pdlib -f -p srun -n 24 ../model ww3_tp2.6
Expected behavior
The test should run to completion within the time limit. Instead, the run stalls and reaches the time limit.
Rank 220 [Mon Jan 13 18:09:49 2025] [c1-0c1s12n1] Fatal error in MPIR_CRAY_Bcast_Tree: Other MPI error, error stack:
MPIR_CRAY_Bcast_Tree(405): message sizes do not match across processes in the collective routine: Received 1 but expected 18
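The message says the ranks disagreed on the size of a broadcast (one side sent 1 element, another expected 18). Here is a minimal Python sketch for pulling those numbers out of an error-stack entry when comparing logs. The helper name `parse_stack_entry` is hypothetical, and the `routine(number): message` layout is assumed from the line above. The number in parentheses is assumed to be a location inside the MPI library, not a WW3 source line.

```python
import re

# Assumed layout of one error-stack entry: routine(number): message
STACK_RE = re.compile(r"^(?P<routine>\w+)\((?P<line>\d+)\): (?P<message>.*)$")
SIZE_RE = re.compile(r"Received (\d+) but expected (\d+)")


def parse_stack_entry(entry):
    """Split an error-stack entry into its parts; return None if it does not match.

    `parse_stack_entry` is a hypothetical helper for illustration only.
    """
    match = STACK_RE.match(entry.strip())
    if match is None:
        return None
    info = {
        "routine": match.group("routine"),
        # Assumption: this is a line in the MPI library, not in the model source.
        "library_line": int(match.group("line")),
        "message": match.group("message"),
    }
    sizes = SIZE_RE.search(info["message"])
    if sizes:
        info["received"] = int(sizes.group(1))
        info["expected"] = int(sizes.group(2))
    return info


entry = (
    "MPIR_CRAY_Bcast_Tree(405): message sizes do not match across processes "
    "in the collective routine: Received 1 but expected 18"
)
info = parse_stack_entry(entry)
print(info["routine"], info["received"], info["expected"])
```

A received/expected pair like this suggests looking for a broadcast call whose count argument can differ between ranks.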
Is there any chance you could get a line number associated with this?
Okay, I think I had the right idea but the wrong execution of the fix. Trying again, @thesser1. I'll post a PR after I've run the tests this time, and hopefully I'll have something for you tomorrow.
Log file:
matrix05_out.txt