
Conversation

@jeremyfirst22
Contributor

Bugfix

There is some relevant conversation in #103 about where this fix is most appropriate.

&Aq[Component * 7 * N], N);
ScaLBL_DeviceBarrier();
req1[0] =
Collaborator


You should be able to:

  1. make all the calls to ScaLBL_D3Q19_Pack in a row
  2. make a single call to ScaLBL_DeviceBarrier()
  3. make all of the calls to MPI_COMM_SCALBL.Isend in a row

This way there is only one synchronization point, which should be faster (see the sketch below).
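For concreteness, a minimal sketch of that batched structure, modeled on how SendD3Q19AA is organized, might look like the following. The member names (dvcSendList_*, sendCount_*, sendbuf_*, rank_*, sendtag, req1, N) and the q indices passed to ScaLBL_D3Q19_Pack are assumptions based on the existing routines, not a drop-in patch:

```cpp
// Sketch only: all packs first, one barrier, then all sends.
// Shown for two of the six faces; the remaining faces follow the same pattern.
void ScaLBL_Communicator::SendD3Q7AA(double *Aq, int Component) {
    // 1. Queue every pack back to back (no barrier between them).
    ScaLBL_D3Q19_Pack(2, dvcSendList_x, 0, sendCount_x, sendbuf_x,
                      &Aq[Component * 7 * N], N);
    ScaLBL_D3Q19_Pack(1, dvcSendList_X, 0, sendCount_X, sendbuf_X,
                      &Aq[Component * 7 * N], N);
    // ... y, Y, z, Z faces packed the same way ...

    // 2. A single synchronization point once all packs are queued.
    ScaLBL_DeviceBarrier();

    // 3. Post all of the sends in a row.
    req1[0] = MPI_COMM_SCALBL.Isend(sendbuf_x, sendCount_x, rank_x, sendtag);
    req1[1] = MPI_COMM_SCALBL.Isend(sendbuf_X, sendCount_X, rank_X, sendtag);
    // ... remaining faces posted the same way ...
}
```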

Contributor Author


I agree. This is the way that SendD3Q19AA, TriSendD3Q7AA, and SendHalo are structured.

I wasn't sure whether there was a reason that SendD3Q7AA (from this commit) and BiSendD3Q7AA (from fe6f38a) use the interwoven packing/sending structure, and I didn't want to introduce bugs while trying to fix this one.

If you're confident that these two routines can separate packing and sending, I'll update this commit to use this structure for both this function and BiSendD3Q7AA. That should put all sending routines in the same structure.

Collaborator

@JamesEMcClure Dec 7, 2025


I'm confident that this will work.

The D3Q19 distributions all have buffers that are explicitly created to hold the data needed for each timestep. If you use the same communicator (ScaLBL_Comm) to send multiple distributions (as BiSendD3Q7 does), then you need to synchronize so that you don't overwrite the first distribution with the second. In the multi-component diffusion cases, for example, you might have an arbitrarily large number of D3Q7 distributions.
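To make that concrete, here is a hypothetical multi-component loop in which the same ScaLBL_Comm buffers carry every D3Q7 distribution. The receive/wait has to complete before the next component is packed, otherwise the second distribution overwrites the first in the shared send buffers. The loop variable and Ncomp are illustrative, and the SendD3Q7AA/RecvD3Q7AA pairing is assumed to follow the existing routines:

```cpp
// Illustration: the same communicator (and the same send/recv buffers) is
// reused for every component, so each send must be paired with a receive/wait
// before the buffers are packed again for the next component.
for (int ic = 0; ic < Ncomp; ic++) {
    ScaLBL_Comm->SendD3Q7AA(Aq, ic);   // pack into shared buffers, post Isends
    ScaLBL_Comm->RecvD3Q7AA(Aq, ic);   // wait and unpack before reusing the buffers
}
```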

In general, the fastest way to catch communication errors is to run the following test:

https://github.com/OPM/LBPM/blob/master/tests/TestCommD3Q19.cpp

If you ever build with a new version of MPI, it is a good idea to run this test first, until you have tuned the compile flags and the launcher flags. Certain behaviors aren't part of the MPI standard, so you can't design the software around them. The GPU conventions are almost always configurable at compile time and/or runtime via flags, so you can't get around twiddling these.

This test was used to debug LBPM communications on many large supercomputers with several thousand GPUs (both AMD and NVIDIA).
