

bug fix to have save point weight file be different name #1357

Open
wants to merge 3 commits into base: develop

Conversation

JessicaMeixner-NOAA (Collaborator)

Pull Request Summary

A bug fix for #1350

Description

On some machines, for unstructured grid cases such as:
./bin/run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -w work_pdlib -g pdlib -f -p srun -n 24 ../model ww3_tp2.6
processor 1 was so much faster than the other processors that the NetCDF file writing out the point output existed for some processors but not all. This caused the model to hang. We did not see this on every machine.

To fix this issue, I have renamed the output file to a different file name. On Hercules with Intel, this fixed the issue. Additional testing is needed to ensure this fixes everyone's issue.
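The hang described above is a classic partial-file race: one rank finishes writing a file while other ranks poll for its existence and may open it before it is complete. Below is a minimal sketch of one common way to avoid that failure mode (write under a temporary name, then rename atomically). This is only an illustration of the general pattern, not WW3's actual Fortran/NetCDF code, and the function name `write_atomically` is hypothetical.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write `data` to `path` so that readers never see a partial file."""
    # Create the temporary file in the same directory as the target,
    # so the final rename stays on one filesystem.
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        # os.replace is atomic on POSIX: another process polling for
        # `path` either sees nothing or sees the complete file.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, remove the temporary file so it is not mistaken
        # for real output.
        os.unlink(tmp_path)
        raise
```

This PR addresses the same class of problem from a different angle: giving the saved point-weight file a distinct name so that a file left behind by one rank (or a previous step) cannot be mistaken for the output another rank is waiting on.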

Issue(s) addressed

Fixes #1350

Commit Message

bug fix to have save point weight file be different name

Check list

Testing

  • How were these changes tested?

Currently have just run one test on hercules with intel, additional testing to follow.

  • Are the changes covered by regression tests? (If not, why? Do new tests need to be added?)
  • Have the matrix regression tests been run (if yes, please note HPC and compiler)?
  • Please indicate the expected changes in the regression test output. (Note the list of known non-identical tests.)
  • Please provide the summary output of matrix.comp (matrix.Diff.txt, matrixCompFull.txt and matrixCompSummary.txt):

@JessicaMeixner-NOAA (Collaborator, Author)

@thesser1 - Can you try this bugfix on your machine?

I should have more info and test results on my end by tomorrow.

@JessicaMeixner-NOAA (Collaborator, Author)

@thesser1 - I was incorrect about this bug fix. It worked once, but didn't after that. I'm closing this PR; I don't think it's worth trying. I'll keep you posted.

@thesser1 (Collaborator)

thesser1 commented Jan 27, 2025 via email

@JessicaMeixner-NOAA (Collaborator, Author)

@thesser1: I have now run this on 3 machines with Intel and all the regtests went through without hanging. I was getting hangs on some of these machines before. I forgot to hit submit on one machine with GNU, so those are running now along with the compare scripts.

I think this is ready to test. This is basically the same fix as yesterday except I renamed the wrong filename (the input instead of the output), so I just had a bug in my bugfix but all the comments/descriptions above are the same.

I think this is worth trying on your end now, but I'll continue to keep you posted on my end on the output of the last gnu run and the comparison outputs if you'd like to wait for those before trying on your end.

@JessicaMeixner-NOAA (Collaborator, Author)

@thesser1 - I think this is ready for you to test if you don't mind.

@JessicaMeixner-NOAA (Collaborator, Author)

Okay - I'm still getting hangs with ww3_ufs1.1/work_unstr_a on Hercules with GNU, but no other machine/compiler is hanging for me anymore. So I think this might be an unrelated issue, but I'm not 100% sure.

@thesser1 - It would be a helpful data point to know how this branch goes on your end if you have time.

@thesser1 (Collaborator)

Sorry for the delay @JessicaMeixner-NOAA. I will set up and test now.

@JessicaMeixner-NOAA (Collaborator, Author)

Thanks Ty! I'm not sure if it'll work or not, but I would definitely appreciate you taking the time to test - and if it doesn't work, any/all error information would be really helpful.

@thesser1 (Collaborator)

thesser1 commented Jan 29, 2025

Not sure this helps you, but on one computer where tp2.6 was failing, it is now running with your fix. On the other computer where tp2.6 failed, the code is still failing with your fix. I did recheck that when I roll the commit back on that computer, the regtest runs on that system. Both systems are using the Intel compiler.

@JessicaMeixner-NOAA (Collaborator, Author)

> Not sure this helps you, but on one computer where tp2.6 was failing, it is now running with your fix. On the other computer where tp2.6 failed, the code is still failing with your fix. I did recheck on that computer when I roll the commit back, the regtest runs on that system. Both systems are using intel compiler.

Any chance you have error messages with line numbers and other details? Thanks again for running things.

I'll continue to prioritize getting a fix for this.

@JessicaMeixner-NOAA (Collaborator, Author)

@thesser1 - I have pushed some additional fixes. I haven't fully tested everything, but it looks like I'm getting past the previous errors. I should have my testing info by tomorrow morning, if not sooner. I'll post here if I find anything negative to report as soon as I come across it.

@thesser1 (Collaborator)

thesser1 commented Feb 5, 2025 via email

@JessicaMeixner-NOAA (Collaborator, Author)

@thesser1 I have not had any errors in any of my tests, including on hercules with GNU where I saw some issues before. For the comparisons that have run, everything looks fine. The last of the compare scripts are running this morning and I'll post results this afternoon. I'm hopeful this PR will fix the issues you have been seeing.


Successfully merging this pull request may close these issues.

ww3_tp2.6 regression test hanging