Support vectorized local reduction for p2p-based ReduceScatter overlap #1452

erhoo82 · 2025-02-04T04:29:05Z

Description

Vectorized load/store for p2p-based ReduceScatter overlap.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Vectorize the loads/stores of local reduction after TP ReduceScatter for (1) FP8 comm > FP32 reduce > BF16 out, (2) BF16 comm > FP32 reduce > BF16 out
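For readers who want a reference point, here is a minimal host-side sketch of what the local reduction computes, independent of how the loads and stores are vectorized. It assumes the inputs are laid out back to back (`num_inputs` slices of `input_size` elements each) and that the same scale is applied to every input; both are assumptions, not something the PR text confirms. `reduce_reference` is a hypothetical name, and plain `float` stands in for the FP8/BF16 types.

```cpp
#include <cstddef>
#include <vector>

// Scalar reference (hypothetical helper, not part of the PR):
// out[j] = sum over input_id of inputs[input_id][j] * scale, accumulated in float.
std::vector<float> reduce_reference(const std::vector<float>& inputs, int num_inputs,
                                    std::size_t input_size, float scale) {
  std::vector<float> out(input_size, 0.0f);
  for (int input_id = 0; input_id < num_inputs; ++input_id) {
    for (std::size_t j = 0; j < input_size; ++j) {
      // Inputs are assumed to be stored contiguously, one slice after another.
      out[j] += inputs[input_id * input_size + j] * scale;
    }
  }
  return out;
}
```

A vectorized kernel should produce the same values up to rounding in the final down-cast; the only difference is that each thread moves several elements per memory transaction.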

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Sangkug Lym <[email protected]>


timmoon10 · 2025-02-20T21:40:09Z

transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers.cu

+#pragma unroll
+  for (int input_id = 1; input_id < num_inputs; ++input_id) {
+    loader.load(tid + num_aligned_elements_per_input * input_id, tot_input_size);
+#pragma unroll
+    for (int i = 0; i < nvec; ++i) {
+      accum_buf[i] += static_cast<float>(loader.separate()[i]) * (*scale);
+      if (input_id == num_inputs - 1) {
+        storer.separate()[i] = static_cast<half_dtype>(accum_buf[i]);
+      }
+    }
+  }


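Correctness bug when num_inputs == 1: the loop body never executes, so the store inside it never runs and the output is never written.

Unrolling the loop over num_inputs is not necessary, since num_inputs is not known at compile time.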

Suggested change

-#pragma unroll
-  for (int input_id = 1; input_id < num_inputs; ++input_id) {
-    loader.load(tid + num_aligned_elements_per_input * input_id, tot_input_size);
-#pragma unroll
-    for (int i = 0; i < nvec; ++i) {
-      accum_buf[i] += static_cast<float>(loader.separate()[i]) * (*scale);
-      if (input_id == num_inputs - 1) {
-        storer.separate()[i] = static_cast<half_dtype>(accum_buf[i]);
-      }
-    }
-  }
+  for (int input_id = 1; input_id < num_inputs; ++input_id) {
+    loader.load(tid + num_aligned_elements_per_input * input_id, tot_input_size);
+#pragma unroll
+    for (int i = 0; i < nvec; ++i) {
+      accum_buf[i] += static_cast<float>(loader.separate()[i]) * (*scale);
+    }
+  }
+#pragma unroll
+  for (int i = 0; i < nvec; ++i) {
+    storer.separate()[i] = static_cast<half_dtype>(accum_buf[i]);
+  }

Same issue in reduce_bf16_cuda.
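To make the num_inputs == 1 case concrete, here is a small host-side model of one thread's work, with the store placed inside the loop versus after it. This is only a sketch: `kVec`, both function names, and the flat `float` buffers are hypothetical stand-ins for `nvec`, the kernel body, and the loader/storer objects, and it assumes the accumulator is initialized from the first input with the same scale applied.

```cpp
#include <cstddef>
#include <vector>

constexpr int kVec = 4;  // hypothetical vector width; the PR uses nvec = 32

// Store inside the input loop: runs only if the loop body executes at least once.
void reduce_chunk_store_in_loop(const std::vector<float>& inputs, std::vector<float>& out,
                                std::size_t tid, std::size_t chunks_per_input,
                                int num_inputs, float scale) {
  float accum[kVec];
  // Assumption: input 0 seeds the accumulator, scaled like the others.
  for (int i = 0; i < kVec; ++i) accum[i] = inputs[tid * kVec + i] * scale;
  for (int input_id = 1; input_id < num_inputs; ++input_id) {
    const std::size_t base = (tid + chunks_per_input * input_id) * kVec;
    for (int i = 0; i < kVec; ++i) {
      accum[i] += inputs[base + i] * scale;
      if (input_id == num_inputs - 1) out[tid * kVec + i] = accum[i];
    }
  }
}

// Store hoisted after the loop: always writes the output, even for num_inputs == 1.
void reduce_chunk_store_after_loop(const std::vector<float>& inputs, std::vector<float>& out,
                                   std::size_t tid, std::size_t chunks_per_input,
                                   int num_inputs, float scale) {
  float accum[kVec];
  for (int i = 0; i < kVec; ++i) accum[i] = inputs[tid * kVec + i] * scale;
  for (int input_id = 1; input_id < num_inputs; ++input_id) {
    const std::size_t base = (tid + chunks_per_input * input_id) * kVec;
    for (int i = 0; i < kVec; ++i) accum[i] += inputs[base + i] * scale;
  }
  for (int i = 0; i < kVec; ++i) out[tid * kVec + i] = accum[i];
}
```

With a single input, the first variant leaves `out` untouched while the second writes the scaled input; with two or more inputs, both produce the same result.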

timmoon10 · 2025-02-20T21:43:18Z

transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers.cu

 }

 template <typename fp8type>
 void reduce_fp8_in_bf16_out(void *inputs, void *output, float *scale, int num_inputs,
                            int input_size, cudaStream_t stream) {
+  const int nvec = 32;


Since we're using this as a template arg, better to make it explicit that it is known at compile-time:

Suggested change

-  const int nvec = 32;
+  constexpr int nvec = 32;

Same issue in reduce_bf16.
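As a small illustration of the point, the sketch below passes a `constexpr` width to a template. `ChunkLoader` is a hypothetical stand-in and not the real `VectorizedLoader`; it exists only to show the value being consumed at compile time.

```cpp
#include <cstddef>

// Hypothetical stand-in for a loader templated on the vector width.
template <typename T, int kNumVec>
struct ChunkLoader {
  // Bytes moved by one vectorized access of kNumVec elements of type T.
  static constexpr std::size_t kBytes = sizeof(T) * kNumVec;
};

// constexpr both documents and enforces that the value is a compile-time constant.
constexpr int nvec = 32;

// One-byte elements (e.g. an FP8 payload) give a 32-byte access per thread.
static_assert(ChunkLoader<unsigned char, nvec>::kBytes == 32,
              "nvec must be usable in a constant expression");
```

A `const int` initialized from a literal is also usable as a template argument in C++, so the original compiles; `constexpr` just turns an accidental runtime initializer into a compile error instead of a confusing template failure.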

timmoon10 · 2025-02-20T21:47:28Z

transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers.cu

+  transformer_engine::VectorizedLoader<fp8type, nvec, true> loader(inputs_fp8, tot_input_size);
+  transformer_engine::VectorizedStorer<half_dtype, nvec, true> storer(output_half, input_size);
+
+  const size_t tid = threadIdx.x + blockDim.x * blockIdx.x;


Do we handle the case where the block size doesn't neatly divide the input size? Maybe we can fix with something like:

Suggested change

-  const size_t tid = threadIdx.x + blockDim.x * blockIdx.x;
+  const size_t tid = threadIdx.x + blockDim.x * blockIdx.x;
+  if (tid >= num_aligned_elements_per_input) {
+    return;
+  }

Alternatively, we can change how we configure the CUDA blocks:

  size_t num_threads = MAX_THREADS / 4;
  assert(num_aligned_elements_per_input % num_threads == 0);
  size_t num_blocks = num_aligned_elements_per_input / num_threads;
  dim3 block(num_threads);
  dim3 grid(num_blocks);

Same issue in reduce_bf16_cuda.
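The first option (an early-return guard) is usually paired with a grid size that is rounded up. The host-side sketch below models that combination and checks that every element is visited exactly once even when the block size does not divide the element count. `ceil_div` and `simulate_guarded_launch` are hypothetical names, and whether the PR rounds the grid up this way is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Ceiling division: the smallest block count that covers num_items.
constexpr std::size_t ceil_div(std::size_t num_items, std::size_t block_size) {
  return (num_items + block_size - 1) / block_size;
}

// Models a 1D launch with a bounds guard and counts visits per element.
std::vector<int> simulate_guarded_launch(std::size_t num_items, std::size_t block_size) {
  std::vector<int> visits(num_items, 0);
  const std::size_t num_blocks = ceil_div(num_items, block_size);
  for (std::size_t block = 0; block < num_blocks; ++block) {
    for (std::size_t thread = 0; thread < block_size; ++thread) {
      const std::size_t tid = thread + block_size * block;
      if (tid >= num_items) continue;  // models the early `return` in the kernel
      ++visits[tid];
    }
  }
  return visits;
}
```

For example, 10 vector chunks with 4 threads per block need 3 blocks; the last two threads of the final block fall past the end and are skipped by the guard. The second option avoids the guard entirely but turns a non-divisible size into an assertion failure.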

Signed-off-by: Sangkug Lym <[email protected]>

erhoo82 · 2025-02-24T21:11:31Z

@timmoon10
Can you review this one again?

BTW, regarding the lint error: should nvec be a compile-time constant?

erhoo82 requested a review from ksivaman February 5, 2025 16:48

Support vectorized local reduction for p2p-based ReduceScatter overlap

4f697a9

Signed-off-by: Sangkug Lym <[email protected]>

timmoon10 changed the base branch from release_v2.0 to main February 7, 2025 23:56

timmoon10 force-pushed the rs_local_red branch from febe2ec to 4f697a9 Compare February 7, 2025 23:56

[pre-commit.ci] auto fixes from pre-commit.com hooks

47bb178

for more information, see https://pre-commit.ci

timmoon10 requested changes Feb 20, 2025

cleanup

b1ad009

Signed-off-by: Sangkug Lym <[email protected]>

erhoo82 force-pushed the rs_local_red branch from fb891d8 to b1ad009 Compare February 22, 2025 00:17

timmoon10 self-requested a review February 26, 2025 01:33
