bug fix to have save point weight file be different name #1357
base: develop
Conversation
@thesser1 - Can you try this bugfix on your machine? I should have more info and test results on my end tomorrow.
@thesser1 - I was incorrect about this bug-fix. It worked once, but didn't after that. I'm closing this PR; I don't think it's worth trying. I'll keep you posted.
Ok, I was just setting it up for testing. Thanks for keeping me in the loop.
Ty
On Mon, Jan 27, 2025 at 8:26 AM Jessica Meixner wrote:
> @thesser1 - I was incorrect about this bug-fix. It worked once, but didn't after that. I'm closing this PR, I don't think it's worth trying. I'll keep you posted.
@thesser1: I have now run this on 3 machines with intel, and all the regtests went through without hanging. I was getting hangs on some of these machines before. I forgot to hit submit on one machine with gnu, so those are running now along with the compare scripts. I think this is ready to test. This is basically the same fix as yesterday, except that yesterday I renamed the wrong filename (the input instead of the output). So I just had a bug in my bugfix, and all the comments/descriptions above still apply. I think this is worth trying on your end now, but I'll keep you posted on the output of the last gnu run and the comparison outputs if you'd like to wait for those before trying on your end.
@thesser1 - I think this is ready for you to test if you don't mind.
Okay - I'm still getting hangs with ww3_ufs1.1/work_unstr_a on hercules with gnu, but no other machine/compiler combination is hanging for me anymore. So I think this might be an unrelated issue, but I'm not 100% sure. @thesser1 - It would be a helpful data point to know how this branch goes on your end if you have time.
Sorry for the delay @JessicaMeixner-NOAA. I will set up and test now.
Thanks Ty! I'm not sure if it'll work or not, but would definitely appreciate you taking time to test - and if it doesn't work, any/all error information would be really helpful.
Not sure this helps you, but on one computer where tp2.6 was failing, it is now running with your fix. On the other computer where tp2.6 failed, the code is still failing with your fix. I did recheck on that computer: when I roll the commit back, the regtest runs on that system. Both systems are using the intel compiler.
Any chance you have error messages with line numbers and other details? Thanks again for running things. I'll continue to prioritize getting a fix for this.
@thesser1 - I have pushed some additional fixes. I haven't fully tested everything, but it's looking like I'm getting past the previous errors. I should have my testing info by tomorrow morning, if not sooner. I'll post here if I find anything negative to report as soon as I come across it.
I will submit my tests today.
Thanks
Ty.
On Wed, Feb 5, 2025 at 9:04 AM Jessica Meixner wrote:
> @thesser1 - I have pushed some additional fixes. I haven't fully tested everything, but it's looking like I'm getting past the previous errors. I should have my testing info by tomorrow morning, if not sooner. I'll post here if I find anything negative to report as soon as I come across it.
@thesser1 I have not had any errors in any of my tests, including on hercules with GNU where I saw some issues before. For the comparisons that have run, everything looks fine. The last of the compare scripts are running this morning and I'll post results this afternoon. I'm hopeful this PR will fix the issues you have been seeing.
Pull Request Summary
A bug fix for #1350
Description
On some machines, for the unstructured grid cases such as:
./bin/run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -w work_pdlib -g pdlib -f -p srun -n 24 ../model ww3_tp2.6
processor 1 was so much faster than the other processors that the NetCDF file used for writing out the point output existed for some processors, but not all. This was causing the model to then hang. We did not see this on every machine.
To fix this issue, I have renamed the output file to a different file name. On hercules with intel, this fixed the issue. Additional testing is needed to ensure this fixes everyone's issue.
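For readers unfamiliar with this kind of race, here is a minimal Python sketch of the failure mode as I understand it from the description above. It is not the WW3 code. It assumes (my assumption, not confirmed here) that each rank decides between a "read" and a "compute" path by checking whether a file exists, and that the fastest rank writes a file of the same name before the others reach that check. The function names and the file names `weights.nc` and `weights_out.nc` are hypothetical.

```python
import os
import tempfile


def simulate(arrival_order, input_name, output_name):
    """Return the branch each rank takes at an 'if input file exists' check.

    Ranks are processed in arrival order to mimic one rank running far ahead.
    A rank that does not find the input file takes the 'compute' branch, and
    the first such rank writes the output file.
    """
    branches = {}
    with tempfile.TemporaryDirectory() as workdir:
        input_path = os.path.join(workdir, input_name)
        output_path = os.path.join(workdir, output_name)
        for rank in arrival_order:
            if os.path.exists(input_path):
                branches[rank] = "read"
            else:
                branches[rank] = "compute"
                if not os.path.exists(output_path):
                    with open(output_path, "w") as handle:
                        handle.write("placeholder weights\n")
    return branches


def is_consistent(branches):
    """True when every rank took the same branch."""
    return len(set(branches.values())) == 1


ranks = [1, 2, 3, 4]  # rank 1 arrives first, as in the description

# Same name for input and output: later ranks see the file rank 1 just wrote.
shared = simulate(ranks, "weights.nc", "weights.nc")

# Distinct output name: the existence check gives every rank the same answer.
renamed = simulate(ranks, "weights.nc", "weights_out.nc")

print(shared, is_consistent(shared))
print(renamed, is_consistent(renamed))
```

In a real MPI run, ranks that take different branches can end up waiting on collective calls that the other ranks never make, which would show up as a hang rather than an error. Writing to a distinct output name is one way to keep the existence check identical across ranks. Another common option, not used in this PR as far as I know, is to have a single rank perform the check and broadcast the result.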
Issue(s) addressed
Commit Message
bug fix to have save point weight file be different name
Check list
Testing
So far I have run just one test, on hercules with intel; additional testing to follow.