TL/UCP: Add all-reduce ring algorithm #1082

armratner · 2025-02-25T19:15:08Z

What

This PR adds a new ring-based Allreduce algorithm (named "ring") to the UCP transport layer within UCC. It introduces:

- A new source file, allreduce_ring.c, implementing the ring-based method.
- Modifications to the build system (Makefile.am) to include the new file.
- Updates to the Allreduce interface (allreduce.[ch]), including a new enum value UCC_TL_UCP_ALLREDUCE_ALG_RING, new function prototypes, and references in the algorithm registration.
- The ring algorithm's logic (init, start, progress, and finalize) in allreduce_ring.c, which manages per-rank scratch buffers, chunk-based sending/receiving, and reduction.

Why ?

A ring-based Allreduce can be more efficient for large message sizes, especially on relatively simple or homogeneous network topologies. It complements existing Allreduce algorithms (e.g., knomial, sliding window, DBT) by providing:

- Improved scalability for certain message sizes.
- A straightforward method for ring-style communication patterns common in distributed HPC and AI workloads.

How ?

The ring algorithm splits the input data into chunks, then circulates these chunks around the ring of ranks. Each rank performs local partial reductions on received data and passes it along. The main changes include:

File Additions/Modifications:
- allreduce_ring.c: Implements the ring-based send/recv steps, in-place or out-of-place usage, and partial data reductions via ucc_dt_reduce.
- Makefile.am: Includes the new file in the build.
- allreduce.c/allreduce.h: Adds the new "ring" algorithm ID and associated function prototypes.
Implementation Details:
- Data is divided into num_chunks, typically equal to the number of ranks. Each chunk is passed around the ring (sendto/recvfrom) and reduced in a scratch buffer.
- A scratch buffer is allocated per rank to hold incoming chunk data before reduction.
- The algorithm ensures all chunks complete one round in the ring, then finalizes once the entire data is fully reduced on each rank.
Code Flow:
- Init: Sets up the ring task, scratch buffer, and references to the team’s executor.
- Start: Posts initial sends/receives and enqueues the progress function.
- Progress: Drives the ring of sends/receives chunk by chunk, calling ucc_dt_reduce on each incoming portion.
- Finalize: Cleans up (frees scratch space and finishes the task).
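The chunked send, receive, and reduce pattern described above can be illustrated with a small single-process simulation. This is a minimal sketch and not the code from allreduce_ring.c: the names `chunk_len`, `chunk_off`, `ring_allreduce_sim`, and `ring_demo_mismatches` are hypothetical, the partitioning rule is an assumption, and the two-phase schedule (reduce-scatter followed by allgather) is the textbook ring formulation, which may differ from the schedule the PR uses. Ranks are modeled as rows of an array, and `memcpy` stands in for the UCP send/recv calls.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers: split `count` elements into `n` near-equal chunks.
 * The first (count % n) chunks carry one extra element. */
static size_t chunk_len(size_t count, size_t n, size_t c)
{
    return count / n + (c < count % n ? 1 : 0);
}

static size_t chunk_off(size_t count, size_t n, size_t c)
{
    size_t rem = count % n;
    return c * (count / n) + (c < rem ? c : rem);
}

/* In-place sum allreduce over `n` simulated ranks; bufs[r] holds rank r's
 * `count` ints. Returns 0 on success, -1 if the scratch allocation fails. */
int ring_allreduce_sim(int **bufs, size_t n, size_t count)
{
    assert(n > 0);
    size_t slot    = count / n + 1; /* upper bound on any chunk length */
    int   *scratch = malloc(n * slot * sizeof(*scratch));
    if (scratch == NULL) {
        return -1;
    }

    /* Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n
     * to its right neighbour and reduces chunk (r - s - 1) mod n, which it
     * received from its left neighbour, into its own buffer. */
    for (size_t s = 0; s + 1 < n; s++) {
        for (size_t r = 0; r < n; r++) { /* "send" into receiver's scratch */
            size_t c = (r + n - s) % n;
            memcpy(scratch + ((r + 1) % n) * slot,
                   bufs[r] + chunk_off(count, n, c),
                   chunk_len(count, n, c) * sizeof(int));
        }
        for (size_t r = 0; r < n; r++) { /* reduce scratch into own buffer */
            size_t c   = (r + 2 * n - s - 1) % n;
            size_t off = chunk_off(count, n, c);
            for (size_t i = 0; i < chunk_len(count, n, c); i++) {
                bufs[r][off + i] += scratch[r * slot + i];
            }
        }
    }

    /* Phase 2, allgather: rank r now owns the fully reduced chunk
     * (r + 1) mod n. In step s it copies chunk (r - s) mod n from its
     * left neighbour, which already holds the final value of that chunk. */
    for (size_t s = 0; s + 1 < n; s++) {
        for (size_t r = 0; r < n; r++) {
            size_t left = (r + n - 1) % n;
            size_t c    = (r + n - s) % n;
            size_t off  = chunk_off(count, n, c);
            memcpy(bufs[r] + off, bufs[left] + off,
                   chunk_len(count, n, c) * sizeof(int));
        }
    }

    free(scratch);
    return 0;
}

/* Demo: 4 ranks, 10 elements, rank r contributes r + i at index i, so every
 * rank should end with 6 + 4 * i. Returns the number of mismatching
 * elements, or -1 if the simulation itself failed. */
int ring_demo_mismatches(void)
{
    enum { N = 4, COUNT = 10 };
    int  data[N][COUNT];
    int *bufs[N];
    int  bad = 0;

    for (int r = 0; r < N; r++) {
        bufs[r] = data[r];
        for (int i = 0; i < COUNT; i++) {
            data[r][i] = r + i;
        }
    }
    if (ring_allreduce_sim(bufs, N, COUNT) != 0) {
        return -1;
    }
    for (int r = 0; r < N; r++) {
        for (int i = 0; i < COUNT; i++) {
            bad += (data[r][i] != 6 + 4 * i);
        }
    }
    return bad;
}
```

Calling `ring_demo_mismatches()` from a test or a `main` function should return 0 when every simulated rank ends up with the same reduced buffer. Each rank moves roughly `2 * (n - 1) / n` of the buffer in total, which is why the ring pattern tends to suit large messages.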

By adding this ring-based approach, UCC gains a more complete suite of collective algorithms for Allreduce, allowing users and internal heuristics to pick the best method based on message size, topology, and system capabilities.

swx-jenkins3 · 2025-02-25T20:01:06Z

Can one of the admins verify this patch?

armratner · 2025-02-25T20:35:52Z

Working on Gtest

armratner · 2025-03-06T22:12:53Z

Added the gtest:

Data Type and Operation Coverage
- INT32 with sum operation
- FLOAT32 with sum operation
- INT32 with product operation
- INT32 with max operation
- INT32 with min operation
- FLOAT64 with sum operation
Standard Test (ring)
- Tests the ring algorithm with standard configurations:
  - Different data sizes (8, 65536, 123567 elements)
  - Both in-place and non-in-place operations
  - Different memory types (HOST, CUDA, CUDA_MANAGED where available)
  - 3 iterations per configuration to ensure stability
Edge Case Test (ring_edge_cases)
- Tests the ring algorithm with non-power-of-two team sizes (3, 7, 13)
- Tests edge cases with message sizes:
  - Empty message (0 elements)
  - Single element (1 element)
  - Small odd sizes (3, 17 elements)
Persistent Operation Test (ring_persistent)
- Tests the algorithm's behavior in persistent mode
- Uses a consistent buffer size (1024 elements)
- Runs 5 iterations to verify consistency
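The edge cases listed above matter because a message with fewer elements than ranks leaves some chunks empty, so the algorithm has to tolerate zero-length sends and reductions. The sketch below checks one possible partitioning rule against exactly those team sizes and element counts. The rule and the names `chunk_range`, `partition_is_valid`, `empty_chunks`, and `edge_case_failures` are assumptions for illustration; how allreduce_ring.c actually computes chunk boundaries is not shown in this thread.

```c
#include <stddef.h>

/* Hypothetical partitioning rule (not taken from allreduce_ring.c):
 * the first (count % nchunks) chunks receive one extra element. */
void chunk_range(size_t count, size_t nchunks, size_t idx,
                 size_t *offset, size_t *len)
{
    size_t base = count / nchunks;
    size_t rem  = count % nchunks;

    *len    = base + (idx < rem ? 1 : 0);
    *offset = idx * base + (idx < rem ? idx : rem);
}

/* Returns 1 if the chunks are contiguous, cover exactly `count` elements,
 * and differ in length by at most one element; returns 0 otherwise. */
int partition_is_valid(size_t count, size_t nchunks)
{
    size_t next = 0, min_len = count, max_len = 0;

    for (size_t c = 0; c < nchunks; c++) {
        size_t off, len;

        chunk_range(count, nchunks, c, &off, &len);
        if (off != next) {
            return 0; /* gap or overlap between consecutive chunks */
        }
        if (len < min_len) {
            min_len = len;
        }
        if (len > max_len) {
            max_len = len;
        }
        next = off + len;
    }
    return next == count && max_len - min_len <= 1;
}

/* Number of zero-length chunks under the rule above. */
size_t empty_chunks(size_t count, size_t nchunks)
{
    return count < nchunks ? nchunks - count : 0;
}

/* Walks the team sizes and element counts named in the edge-case test and
 * returns how many combinations produce an invalid partition. */
int edge_case_failures(void)
{
    static const size_t teams[]  = {3, 7, 13};
    static const size_t counts[] = {0, 1, 3, 17};
    int                 failures = 0;

    for (size_t t = 0; t < sizeof(teams) / sizeof(teams[0]); t++) {
        for (size_t k = 0; k < sizeof(counts) / sizeof(counts[0]); k++) {
            failures += !partition_is_valid(counts[k], teams[t]);
        }
    }
    return failures;
}
```

Under this rule, a single-element message on a 13-rank team leaves 12 of the 13 chunks empty, which is the kind of configuration the ring_edge_cases test is meant to exercise.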

samnordmann

LGTM, however I do not see the tests passing in the CI logs. Can you push a new commit to trigger the CI? @janjust do we know why the Jenkins CI hasn't been triggered?

I think the algo might run into a segfault because the executor is not initialized.

nsarka

Thanks for the alg!

janjust · 2025-04-15T19:39:44Z

> @janjust do we know why the Jenkins CI hasn't been triggered?

Not sure. We have had issues with Jenkins for a while, but we are working with Artem and Andrii to resolve them. It should be good now, so we can just retrigger.

nsarka

Please double-check the gtest; it may not have been running the ring alg during the test.

armratner · 2025-04-23T20:32:45Z

| Message Size (MB) | UCC Ring (µs) | UCC Knomial (µs) | OMPI Tuned (µs) |
|---|---|---|---|
| 1 | 2 287.61 | 579.78 | 3 710.72 |
| 2 | 2 656.15 | 983.58 | 7 397.25 |
| 4 | 4 380.01 | 3 531.90 | 15 340.96 |
| 8 | 11 228.21 | 12 728.49 | 32 010.70 |
| 16 | 26 984.31 | 31 163.98 | 70 183.02 |
| 32 | 50 100.87 | 69 753.03 | 154 048.94 |
| 64 | 100 628.53 | 145 987.70 | 314 291.89 |
| 128 | 200 764.59 | 295 283.81 | 638 286.61 |
| 256 | 405 462.40 | 590 662.48 | n/a |
| 512 | 992 232.12 | 1 182 699.49 | n/a |
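To read the crossover off these numbers, the sketch below computes the knomial-to-ring latency ratio per row and finds the smallest measured size at which ring is faster. It uses only the values posted in the table; the function names `knomial_over_ring`, `first_ring_win_mb`, and `print_ratios` are hypothetical. A ratio above 1.0 means ring is faster for that size.

```c
#include <stddef.h>
#include <stdio.h>

/* Latencies in microseconds, copied from the table in this comment. */
static const int    size_mb[]    = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512};
static const double ring_us[]    = {2287.61,   2656.15,   4380.01,
                                    11228.21,  26984.31,  50100.87,
                                    100628.53, 200764.59, 405462.40,
                                    992232.12};
static const double knomial_us[] = {579.78,    983.58,    3531.90,
                                    12728.49,  31163.98,  69753.03,
                                    145987.70, 295283.81, 590662.48,
                                    1182699.49};

enum { NUM_ROWS = sizeof(size_mb) / sizeof(size_mb[0]) };

/* Ratio of knomial latency to ring latency for one table row. */
double knomial_over_ring(size_t row)
{
    return knomial_us[row] / ring_us[row];
}

/* Smallest measured message size (MB) where ring beats knomial,
 * or -1 if ring never wins in the measured range. */
int first_ring_win_mb(void)
{
    for (size_t row = 0; row < NUM_ROWS; row++) {
        if (ring_us[row] < knomial_us[row]) {
            return size_mb[row];
        }
    }
    return -1;
}

/* Prints one line per table row with the computed ratio. */
void print_ratios(void)
{
    for (size_t row = 0; row < NUM_ROWS; row++) {
        printf("%4d MB: knomial/ring = %.2f\n", size_mb[row],
               knomial_over_ring(row));
    }
}
```

On this data, knomial is about 3.9x faster than ring at 1 MB, ring first wins at 8 MB, and ring is about 1.19x faster at 512 MB. These ratios describe this one run only and say nothing about other scales or fabrics.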

armratner · 2025-04-24T15:35:08Z

Scale context – 128 ranks / 4 nodes / 32 PPN

Signed-off-by: Armen Ratner <[email protected]>

- Add tests for various data types and reduction operations
- Test edge cases with non-power-of-two team sizes and odd message sizes
- Test persistent operations for stability
- Test with different memory types (HOST, CUDA, CUDA_MANAGED where available)

Signed-off-by: Armen Ratner <[email protected]>

nsarka requested review from Sergei-Lebedev, janjust and samnordmann February 27, 2025 16:09

samnordmann reviewed Mar 5, 2025

wfaderhold21 added the Ready-for-Review label Mar 5, 2025

armratner force-pushed the all_reduce_ring_new branch from 04e5531 to d8396c6 Compare March 7, 2025 04:03

janjust force-pushed the all_reduce_ring_new branch from 75c3b77 to 97ddcc2 Compare March 20, 2025 16:08

janjust requested a review from nsarka March 20, 2025 16:08

janjust requested a review from samnordmann April 10, 2025 15:25

samnordmann reviewed Apr 15, 2025

nsarka reviewed Apr 15, 2025

ikryukov reviewed Apr 15, 2025

armratner force-pushed the all_reduce_ring_new branch from 4c4d9a8 to 4d0548b Compare April 15, 2025 20:38

armratner requested review from samnordmann, nsarka and ikryukov April 15, 2025 20:44

nsarka reviewed Apr 15, 2025

Sergei-Lebedev reviewed Apr 16, 2025

armratner force-pushed the all_reduce_ring_new branch from 4d0548b to 5ffe341 Compare April 17, 2025 02:38

armratner requested review from nsarka and Sergei-Lebedev April 17, 2025 02:39

armratner force-pushed the all_reduce_ring_new branch 2 times, most recently from e120c3f to 5bd39b5 Compare April 23, 2025 15:49

armratner added 2 commits May 15, 2025 10:29

TL/UCP: Add all-reduce ring algorithm

f982a4b

Signed-off-by: Armen Ratner <[email protected]>

janjust force-pushed the all_reduce_ring_new branch from 5bd39b5 to d11bff0 Compare May 15, 2025 15:29
