Rearchitecture: Xe epilogue #621
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR builds on #573, adding a
CollectiveEpiloguewith support for the new block 2D copy atoms.The existing epilogue implementation was mostly rewritten, as it had many hardcoded assumptions and limitations:
The new implementation removes all these restrictions.
Its API is also somewhat different, mostly in ways that more closely match the SM90 epilogues:
ComputeVectorLenconstexpr variable. Currently this is set to operate on one MMA atom's worth of accumulator data at a time, but if we want to make it user-configurable like the NV epilogues (where it's a template parameter for the dispatch policy), that's possible.operator().The new implementation glues together C/D loads and compute with reorders, so it can support efficient data type and layout conversions outside of the epilogue computation.