Assertion error with boundary planes #1525

Closed · hgopalan opened this issue Mar 7, 2025 · 6 comments
Labels: bug:amr-wind (Something isn't working)

Comments

@hgopalan (Contributor) commented Mar 7, 2025

## Bug description

I was running an inflow-outflow simulation with boundary plane data from a precursor and got an assertion error. Interestingly, one of the cases ran fine while the other crashed.

- 4 GPU nodes on Kestrel: crashed
- 5 GPU nodes on Kestrel: ran to completion

## Steps to reproduce

I can share the case file on Kestrel.

Steps to reproduce the behavior:

1. Compiler used
   - [ ] GCC
   - [ ] LLVM
   - [ ] oneapi (Intel)
   - [x] nvcc (NVIDIA)
   - [ ] rocm (AMD)
   - [ ] with MPI
   - [ ] other:
2. Operating system
   - [x] Linux
   - [ ] OSX
   - [ ] Windows
   - [ ] other (do tell ;)):
3. Hardware:
   - [ ] CPU
   - [x] GPU
4. Machine details ():

Kestrel:

```
module purge
module load binutils
module load PrgEnv-nvhpc
module load cray-libsci/22.12.1.1
module load cmake
module load cmake/3.27.9
module load cray-python
module load netcdf-fortran/4.6.1-oneapi
module load craype-x86-genoa
module load craype-accel-nvidia90

export MPICH_GPU_SUPPORT_ENABLED=0
export CUDAFLAGS="-L/nopt/nrel/apps/gpu_stack/libraries-gcc/06-24/linux-rhel8-zen4/gcc-12.3.0/hdf5-1.14.3-zoremvtiklvvkbtr43olrq3x546pflxe/lib -I/nopt/nrel/apps/gpu_stack/libraries-gcc/06-24/linux-rhel8-zen4/gcc-12.3.0/hdf5-1.14.3-zoremvtiklvvkbtr43olrq3x546pflxe/include -lhdf5 -lhdf5_hl -I${MPICH_DIR}/include -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_nvidia90} ${PE_MPICH_GTL_LIBS_nvidia90}"
export CXXFLAGS="-L/nopt/nrel/apps/gpu_stack/libraries-gcc/06-24/linux-rhel8-zen4/gcc-12.3.0/hdf5-1.14.3-zoremvtiklvvkbtr43olrq3x546pflxe/lib -I/nopt/nrel/apps/gpu_stack/libraries-gcc/06-24/linux-rhel8-zen4/gcc-12.3.0/hdf5-1.14.3-zoremvtiklvvkbtr43olrq3x546pflxe/include -lhdf5 -lhdf5_hl -I${MPICH_DIR}/include -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_nvidia90} ${PE_MPICH_GTL_LIBS_nvidia90}"
export HDF5_USE_FILE_LOCKING=FALSE
export MPICH_OFI_SKIP_NIC_SYMMETRY_TEST=1
```

5. Input file attachments

6. Error (paste or attach); the time-window check this assertion encodes is sketched just after this list:

    ```
    terminate called after throwing an instance of 'std::runtime_error'
    what(): Assertion `(m_in_times[0] <= time + constants::LOOSE_TOL) && (time < m_in_times.back() + constants::LOOSE_TOL)' failed, file "/projects/total/codes/main/amr-wind/amr-wind/wind_energy/ABLBoundaryPlane.cpp", line 1067
    ```
7. If this is a segfault, a stack trace from a debug build (paste or attach):
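
For context, the assertion above encodes a bounds check: the requested simulation time must fall within the span of the boundary-plane time series, up to `constants::LOOSE_TOL`. Below is a minimal, self-contained sketch of that check; the names mirror the error message, and the tolerance and plane times are placeholder values, not the ones AMR-Wind uses.

```cpp
// Sketch of the time-window check behind the assertion above.
// Names mirror the error message; LOOSE_TOL here is a placeholder value.
#include <iostream>
#include <vector>

namespace constants {
constexpr double LOOSE_TOL = 1.0e-6; // hypothetical; not AMR-Wind's actual value
}

bool time_in_window(const std::vector<double>& in_times, double time)
{
    // The requested time must lie between the first and last boundary-plane
    // times, allowing a loose tolerance at both ends.
    return (in_times.front() <= time + constants::LOOSE_TOL) &&
           (time < in_times.back() + constants::LOOSE_TOL);
}

int main()
{
    const std::vector<double> in_times = {7200.0, 7205.0, 7210.0}; // hypothetical plane times
    std::cout << time_in_window(in_times, 7207.5) << '\n'; // 1: inside the window
    std::cout << time_in_window(in_times, 7215.0) << '\n'; // 0: would trip the assertion
    return 0;
}
```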

## Expected behavior
<!-- A clear and concise description of what is expected behavior. -->

## AMR-Wind information
<!-- Please provide as much detail as possible including git commit. The best information is a snapshot of the AMR-Wind header. -->

```
==============================================================================
AMR-Wind (https://github.com/exawind/amr-wind)

AMR-Wind version :: v3.4.0
AMR-Wind Git SHA :: 38d1b9f
AMReX version :: 25.01-16-g92d35c2c8163

Exec. time :: Wed Mar 5 11:57:16 2025
Build time :: Feb 12 2025 06:44:50
C++ compiler :: GNU 8.5.0

MPI :: ON (Num. ranks = 16)
GPU :: ON (Backend: CUDA)
OpenMP :: OFF

Enabled third-party libraries:
NetCDF 4.9.2
```

## Additional context
<!-- Screenshots, related issues, etc -->
@hgopalan added the bug:amr-wind (Something isn't working) label on Mar 7, 2025.
@mpolimeno (Contributor)

Been chatting with @hgopalan offline about this. It appears to be an old issue that occurs less frequently than it used to, thanks to a number of relevant fixes (see here, for reference).
Keeping it open so we are aware it still occurs sometimes, however.

@marchdf (Contributor) commented Mar 10, 2025

Is constants::LOOSE_TOL still too tight? Can you print out the values that are failing?
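
A minimal sketch of the kind of diagnostic that could answer this, kept standalone so it compiles on its own. The helper below is hypothetical; in ABLBoundaryPlane.cpp the equivalent print would go just before the assertion and would presumably use amrex::Print() rather than std::cout.

```cpp
// Hypothetical standalone helper to report the values involved in the check.
// In ABLBoundaryPlane.cpp this would print m_in_times[0], m_in_times.back(),
// the requested time, and constants::LOOSE_TOL just before the assertion.
#include <iostream>
#include <vector>

void report_time_window(const std::vector<double>& in_times, double time, double tol)
{
    std::cout << "boundary-plane time check: time = " << time
              << ", first plane time = " << in_times.front()
              << ", last plane time = " << in_times.back()
              << ", tol = " << tol << '\n';
}

int main()
{
    report_time_window({7200.0, 7210.0}, 7215.0, 1.0e-6); // hypothetical values
    return 0;
}
```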

@hgopalan (Contributor, Author)

It is hard to say, since the same boundary-plane dataset ran with one set of GPUs and failed with another. Wondering if it could be a Kestrel-specific issue.

@mpolimeno (Contributor)

Potential improvement: change the assertion to an abort statement so we can run better diagnostics.
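
A rough, standalone sketch of what that could look like under the assumptions stated in the comments: instead of the bare assertion, build a message carrying the offending values and abort with it. A thrown std::runtime_error stands in for the abort here so the example compiles alone; in AMR-Wind the call would presumably be amrex::Abort with the same message.

```cpp
// Standalone sketch of the suggested assert -> abort change.
// LOOSE_TOL and the plane times are placeholders, not AMR-Wind's actual values.
#include <sstream>
#include <stdexcept>
#include <vector>

constexpr double LOOSE_TOL = 1.0e-6; // hypothetical

void check_time_window(const std::vector<double>& in_times, double time)
{
    const bool ok = (in_times.front() <= time + LOOSE_TOL) &&
                    (time < in_times.back() + LOOSE_TOL);
    if (!ok) {
        // Build a message that carries the failing values for diagnostics.
        std::ostringstream oss;
        oss << "Boundary plane time out of bounds: time = " << time
            << ", first plane time = " << in_times.front()
            << ", last plane time = " << in_times.back();
        // In AMR-Wind this would be amrex::Abort(oss.str()).
        throw std::runtime_error(oss.str());
    }
}

int main()
{
    check_time_window({7200.0, 7210.0}, 7205.0);    // ok
    // check_time_window({7200.0, 7210.0}, 7215.0); // would abort with the message above
    return 0;
}
```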

@hgopalan (Contributor, Author)

Had a repeat today. It happens more often on CPU. I had mostly been running on GPU before the current queue issues and did not observe it there.

@hgopalan (Contributor, Author) commented Apr 7, 2025

I will close this for now and reopen it in the future, if required.

@hgopalan closed this as completed on Apr 7, 2025.