Skip to content

TL/UCP: Add all-reduce ring alogorithm #1082

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

armratner
Copy link

What

This PR adds a new ring-based Allreduce algorithm (named "ring") to the UCP transport layer within UCC. It introduces:

  • A new source file allreduce_ring.c implementing the ring-based method.
  • Modifications to the build system (Makefile.am) to include the new file.
  • Updates to the Allreduce interface (allreduce.[ch]), including a new enum value UCC_TL_UCP_ALLREDUCE_ALG_RING, new function prototypes, and references in the algorithm registration.
  • The ring-based algorithm’s logic (init, start, progress, and finalize) in allreduce_ring.c that manages per-rank scratch buffers, chunk-based sending/receiving, and reduction.

Why ?

A ring-based Allreduce can be more efficient for large message sizes, especially on relatively simple or homogeneous network topologies. It complements existing Allreduce algorithms (e.g., knomial, sliding window, DBT) by providing:

  • Improved scalability for certain message sizes.
  • A straightforward method for ring-style communication patterns common in distributed HPC and AI workloads.

How ?

The ring algorithm splits the input data into chunks, then circulates these chunks around the ring of ranks. Each rank performs local partial reductions on received data and passes it along. The main changes include:

  1. File Additions/Modifications:

    • allreduce_ring.c: Implements the ring-based send/recv steps, in-place or out-of-place usage, and partial data reductions via ucc_dt_reduce.
    • Makefile.am: Includes the new file in the build.
    • allreduce.c/allreduce.h: Adds the new "ring" algorithm ID and associated function prototypes.
  2. Implementation Details:

    • Data is divided into num_chunks, typically equal to the number of ranks. Each chunk is passed around the ring (sendto/recvfrom) and reduced in a scratch buffer.
    • A scratch buffer is allocated per rank to hold incoming chunk data before reduction.
    • The algorithm ensures all chunks complete one round in the ring, then finalizes once the entire data is fully reduced on each rank.
  3. Code Flow:

    • Init: Sets up the ring task, scratch buffer, and references to the team’s executor.
    • Start: Posts initial sends/receives and enqueues the progress function.
    • Progress: Drives the ring of sends/receives chunk by chunk, calling ucc_dt_reduce on each incoming portion.
    • Finalize: Cleans up (frees scratch space and finishes the task).

By adding this ring-based approach, UCC gains a more complete suite of collective algorithms for Allreduce, allowing users and internal heuristics to pick the best method based on message size, topology, and system capabilities.

@swx-jenkins3
Copy link

Can one of the admins verify this patch?

@armratner
Copy link
Author

Working on Gtest

@armratner
Copy link
Author

Added the gtest,

  1. Data Type and Operation Coverage
    • INT32 with sum operation
    • FLOAT32 with sum operation
    • INT32 with product operation
    • INT32 with max operation
    • INT32 with min operation
    • FLOAT64 with sum operation
  2. Standard Test (ring)
    - Tests the ring algorithm with standard configurations:
    - Different data sizes (8, 65536, 123567 elements)
    - Both in-place and non-in-place operations
    - Different memory types (HOST, CUDA, CUDA_MANAGED where available)
    - 3 iterations per configuration to ensure stability
  3. Edge Case Test (ring_edge_cases)
    - Tests the ring algorithm with non-power-of-two team sizes (3, 7, 13)
    - Tests edge cases with message sizes:
    - Empty message (0 elements)
    - Single element (1 element)
    - Small odd sizes (3, 17 elements)
  4. Persistent Operation Test (ring_persistent)
    - Tests the algorithm's behavior in persistent mode
    - Uses a consistent buffer size (1024 elements)
    - Runs 5 iterations to verify consistency

@armratner armratner force-pushed the all_reduce_ring_new branch from 04e5531 to d8396c6 Compare March 7, 2025 04:03
@janjust janjust force-pushed the all_reduce_ring_new branch from 75c3b77 to 97ddcc2 Compare March 20, 2025 16:08
@janjust janjust requested a review from nsarka March 20, 2025 16:08
@janjust janjust requested a review from samnordmann April 10, 2025 15:25
Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, however I do not see the tests passing the CI logs. Can you push a new commit to Trigger the CI? @janjust do we know why the Jenkin CI hasnt been triggered?

I think the algo might run into a segfault because the executor is not initialized

Copy link
Collaborator

@nsarka nsarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the alg!

@janjust
Copy link
Collaborator

janjust commented Apr 15, 2025

@janjust do we know why the Jenkin CI hasnt been triggered?
Not sure - we had issues with jenkins for a bit now, but working with Artem and Andrii to resolve it - should be good now, we can just retrigger.

Copy link
Collaborator

@nsarka nsarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please double check the gtest, it may not have been running the ring alg during the test

@armratner armratner force-pushed the all_reduce_ring_new branch from 4d0548b to 5ffe341 Compare April 17, 2025 02:38
@armratner armratner force-pushed the all_reduce_ring_new branch 2 times, most recently from e120c3f to 5bd39b5 Compare April 23, 2025 15:49
@armratner
Copy link
Author

| Message Size
Emphasising the Latency Jump at 8 MB

(MB) UCC Ring (µs) UCC Knomial (µs) OMPI Tuned (µs)
1 2 287.61 579.78 3 710.72
2 2 656.15 983.58 7 397.25
4 4 380.01 3 531.90 15 340.96
8 11 228.21 12 728.49 32 010.70
16 26 984.31 31 163.98 70 183.02
32 50 100.87 69 753.03 154 048.94
64 100 628.53 145 987.70 314 291.89
128 200 764.59 295 283.81 638 286.61
256 405 462.40 590 662.48
512 992 232.12 1 182 699.49

@armratner
Copy link
Author

Scale context – 128 ranks / 4 nodes / 32 PPN

armratner added 2 commits May 15, 2025 10:29
- Add tests for various data types and reduction operations
- Test edge cases with non-power-of-two team sizes and odd message sizes
- Test persistent operations for stability
- Test with different memory types (HOST, CUDA, CUDA_MANAGED where available)

Signed-off-by: Armen Ratner <[email protected]>
@janjust janjust force-pushed the all_reduce_ring_new branch from 5bd39b5 to d11bff0 Compare May 15, 2025 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants