Optimise how the CshiftMap table is calculated and copied to the device #476
This PR improves the performance of `Cshift` by optimising the way the `Cshift_table` is calculated and moved to the device:
- The table no longer stores a `std::pair<int, int>` per index, since those two ints are the same number + `lo` or `ro`. So a `std::vector<int>` is saved instead.
- `Cshift_local` and `Cshift_comms`.
- `Copy_plane` and `Copy_plane_permute`.

For the rest of the functions (`Gather`s and `Scatter`s) in `Cshift_common.h`, `Cshift_table` remains the same, so some template functions have been added to accommodate both cases while avoiding code duplication.

The improvement in performance varies quite a bit depending on the MPI distribution scheme (see table below), but it is significant nonetheless for cases heavy in gauge calculations.
The PR also contains changes from PR #465 and #471, which were used to assess the performance gains in the WilsonFlow and sp2n test cases.
Four test cases were run on Tursa with up to 3 different MPI configurations:
- `--mpi 1.1.1.1` (no MPI)
- `--mpi 1.1.1.4` (1 node)
- `--mpi 1.1.2.4` (2 nodes)

Based on the table below, test case 3 shows no measurable difference before and after the changes, as expected for a setup with not much contribution from gauge action calculations (and thus not many calls to `Cshift`).

But the other test cases show improvements, from mild ones for test cases 1 and 2, where the runtime is roughly split equally between gauge and fermion actions, to significant (~18% improvement for the no-MPI case) for test case 4, which is gauge-dominated.
The above improvements are compounded when the changes in this PR are considered on top of the ones in #473, giving an improvement of ~32% for test case 4 with no MPI and ~18% for test case 4 with single-node MPI.
This table compares the `develop` branch at hash 3d01486 with this branch:

| `--grid` | `--mpi` | `develop` (s) | `cshift-map-optimise` (s) |
|---|---|---|---|