-
Notifications
You must be signed in to change notification settings - Fork 110
TL/UCP: Add all-reduce ring alogorithm #1082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Can one of the admins verify this patch? |
Working on Gtest |
Added the gtest,
|
04e5531
to
d8396c6
Compare
75c3b77
to
97ddcc2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, however I do not see the tests passing the CI logs. Can you push a new commit to Trigger the CI? @janjust do we know why the Jenkin CI hasnt been triggered?
I think the algo might run into a segfault because the executor is not initialized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the alg!
@janjust do we know why the Jenkin CI hasnt been triggered? |
4c4d9a8
to
4d0548b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please double check the gtest, it may not have been running the ring alg during the test
4d0548b
to
5ffe341
Compare
e120c3f
to
5bd39b5
Compare
|
Scale context – 128 ranks / 4 nodes / 32 PPN |
Signed-off-by: Armen Ratner <[email protected]>
- Add tests for various data types and reduction operations - Test edge cases with non-power-of-two team sizes and odd message sizes - Test persistent operations for stability - Test with different memory types (HOST, CUDA, CUDA_MANAGED where available) Signed-off-by: Armen Ratner <[email protected]>
5bd39b5
to
d11bff0
Compare
What
This PR adds a new ring-based Allreduce algorithm (named
"ring"
) to the UCP transport layer within UCC. It introduces:allreduce_ring.c
implementing the ring-based method.Makefile.am
) to include the new file.allreduce.[ch]
), including a new enum valueUCC_TL_UCP_ALLREDUCE_ALG_RING
, new function prototypes, and references in the algorithm registration.allreduce_ring.c
that manages per-rank scratch buffers, chunk-based sending/receiving, and reduction.Why ?
A ring-based Allreduce can be more efficient for large message sizes, especially on relatively simple or homogeneous network topologies. It complements existing Allreduce algorithms (e.g., knomial, sliding window, DBT) by providing:
How ?
The ring algorithm splits the input data into chunks, then circulates these chunks around the ring of ranks. Each rank performs local partial reductions on received data and passes it along. The main changes include:
File Additions/Modifications:
allreduce_ring.c
: Implements the ring-based send/recv steps, in-place or out-of-place usage, and partial data reductions viaucc_dt_reduce
.Makefile.am
: Includes the new file in the build.allreduce.c/allreduce.h
: Adds the new"ring"
algorithm ID and associated function prototypes.Implementation Details:
num_chunks
, typically equal to the number of ranks. Each chunk is passed around the ring (sendto
/recvfrom
) and reduced in a scratch buffer.scratch
buffer is allocated per rank to hold incoming chunk data before reduction.Code Flow:
ucc_dt_reduce
on each incoming portion.By adding this ring-based approach, UCC gains a more complete suite of collective algorithms for Allreduce, allowing users and internal heuristics to pick the best method based on message size, topology, and system capabilities.