CUDA Packet Scatter Reduce of f16x2 #151

DoeringChristian · 2025-06-11T13:31:17Z

This PR adds a specialized CUDA code path for the PacketScatter op that is used when a reduction operation is specified and the values have type f16. It uses the red.global.add.noftz.f16x2 instruction.

wjakob · 2025-06-11T13:39:38Z

Potentially relevant: sm_90+ supports red.global.add.v2.f16x2 and red.global.add.v4.f16x2 (so 4 or 8 FP16 accumulations in one instruction). This should be available on RTX5090 and similar Blackwell GPUs.

merlinND · 2025-06-12T14:59:01Z

src/cuda_packet.cpp

+                fmt("        @$v ", mask);
+            else
+                put("        ");
+            fmt("red.global.$s.noftz.f16x2 [%rd3+$u], %tmp;\n", op,


Are you sure this works when op != ".add"?
Looking carefully at the PTX docs for red, it seems that non-vector variants of red only support add for f16 and f16x2. Do you agree or am I misreading it?

If there are only 1 or 2 values left to reduce (e.g. packets of size 2, 6, 10, etc.), I don't think we could use the vector variants?
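
One way to handle the leftover values raised here is to pick the widest instruction that still fits and then fall back to narrower ones. The following is a minimal sketch, not drjit-core code: the name plan_f16_chunks and the has_vector_red flag are hypothetical, and the widths (8 or 4 FP16 values per instruction for the sm_90+ vector variants, 2 for plain f16x2) are taken from the discussion in this thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: split a packet of `n` FP16 values (n assumed even)
// into chunks of 8, 4 or 2 values. `has_vector_red` stands in for a
// compute-capability check (sm_90+); without it, only 2-wide f16x2 chunks
// are produced.
std::vector<size_t> plan_f16_chunks(size_t n, bool has_vector_red) {
    std::vector<size_t> chunks;
    const size_t widths[] = { 8, 4, 2 };
    for (size_t w : widths) {
        if (w > 2 && !has_vector_red)
            continue; // wide variants are unavailable on this target
        while (n >= w) {
            chunks.push_back(w);
            n -= w;
        }
    }
    return chunks; // a trailing odd value is not covered by this sketch
}
```

With this scheme a packet of 10 values would become one 8-wide and one 2-wide reduction on sm_90+, and five 2-wide reductions otherwise. Alignment requirements of the wider instructions are not modeled here.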

merlinND · 2025-06-12T14:59:06Z

src/cuda_packet.cpp

 void jitc_cuda_render_scatter_packet(const Variable *v, const Variable *ptr,
                                     const Variable *index, const Variable *mask) {
    bool is_masked = !mask->is_literal() || mask->literal != 1;
    PacketScatterData *psd = (PacketScatterData *) v->data;
    const std::vector<uint32_t> &values = psd->values;
    const Variable *v0 = jitc_var(values[0]);

+    // Handle non-Identity reduction case
+    if (psd->op != ReduceOp::Identity){



merlinND · 2025-06-12T15:01:28Z

src/cuda_packet.cpp

+    if (v0->type != (uint32_t)VarType::Float16)
+        jitc_fail("Packeted scatter reductions are only supported with f16 "
+                  "variables.");


Move this check earlier, and include the name of the type you actually got in the error message (type_name[v0->type], I think).
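
A minimal sketch of what such a message could look like. The VarType enum and type_name table below are hypothetical stand-ins for the real ones, and the function returns the message as a string instead of calling jitc_fail so that it can be tried in isolation.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-ins for the real VarType enum and type_name table
enum class VarType : uint32_t { Float16, Float32, Float64, Count };
static const char *type_name[(int) VarType::Count] = { "float16", "float32",
                                                       "float64" };

// Returns an empty string when the type is accepted, otherwise a message
// that names the type that was actually received.
std::string check_packet_reduce_type(uint32_t type) {
    if (type == (uint32_t) VarType::Float16)
        return "";
    return std::string("Packet scatter reductions are only supported with "
                       "f16 variables (got ") + type_name[type] + ").";
}
```

Running the check before any code is emitted means a rejected type never leaves partially rendered output behind.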

merlinND · 2025-06-12T15:02:30Z

src/cuda_packet.cpp

@@ -15,6 +15,10 @@
 #include "op.h"
 #include "log.h"

+static const char *reduce_op_name[(int) ReduceOp::Count] = {


Would it be possible to avoid redefining this (it already exists in cuda_scatter.cpp)?
Having it duplicated in separate locations makes it error-prone when the enum changes.
Although I see it's already defined another time in llvm_scatter.cpp, so maybe it's not easy to have it only once.

merlinND · 2025-06-12T15:04:23Z

src/cuda_packet.cpp

+                                            const Variable *mask) {
+    bool is_masked = !mask->is_literal() || mask->literal != 1;
+    PacketScatterData *psd = (PacketScatterData *) v->data;
+    const std::vector<uint32_t> &values = psd->values;


I wonder if this could use dr::vector?



wjakob

This is cool. Some feedback from me: the first two comments are from an earlier partial review and may no longer apply (the files are marked as "Outdated").


wjakob · 2025-06-11T13:35:45Z

src/cuda_packet.cpp

+                byte_offset);
+        }
+    } else {
+        jitc_fail("jitc_cuda_render_scatter_reduce_packet(): Number of elements not supported for reduction.");


Could this cause some existing reductions to fail, or does the logic in drjit-core preclude this case from being reachable?

wjakob · 2025-06-11T13:37:10Z

src/op.cpp

@@ -2639,6 +2639,11 @@ uint32_t jitc_var_scatter_packet(size_t n, uint32_t target_,
                             mode == ReduceMode::NoConflicts ||
                             (mode == ReduceMode::Auto &&
                              target_info.size <= llvm_expand_threshold));
+        } else if (op == ReduceOp::Add && backend == JitBackend::CUDA) {
+            use_packet_op = (mode == ReduceMode::Expand ||


We don't have ReduceMode::Expand in CUDA.

wjakob · 2025-07-18T08:11:00Z

src/cuda_packet.cpp

+                                            const Variable *mask) {
+    bool is_masked = !mask->is_literal() || mask->literal != 1;
+    PacketScatterData *psd = (PacketScatterData *) v->data;
+    const std::vector<uint32_t> &values = psd->values;


It's fine to use std::vector. I just prefer dr::vector in header files, so that we don't have to pull in STL code everywhere.

wjakob · 2025-07-18T08:11:37Z

src/cuda_packet.cpp

+                  "elements not supported for reduction.");
+
+    if (ts->compute_capability >= 90) {
+        // Use the new `red.global.v2` instructions. This enables both min & max


Is this comment up-to-date? .v2 is only two elements, and we're doing wider ones AFAIK.


wjakob · 2025-07-18T08:13:18Z

src/op.cpp

@@ -1905,11 +1905,11 @@ void jitc_var_gather_packet(size_t n, uint32_t src_, uint32_t index, uint32_t ma
    auto [var_info, index_v, mask_v] =
        jitc_var_check("jit_var_gather_packet", index, mask);

-    if ((n & (n-1)) || n == 1)
+    if (n == 1)
        jitc_raise("jitc_var_gather_packet(): vector size must be a power of two "


Error message seems out of date now.

wjakob · 2025-07-18T08:13:37Z

src/op.cpp

        jitc_raise("jitc_var_gather_packet(): vector size must be a power of two "
                   "and >= 1 (got %zu)!", n);

-    if ((src_info.size & (n-1)) != 0 && src_info.size != 1)
+    if (src_info.size % 2 != 0 && src_info.size != 1)
        jitc_raise("jitc_var_gather_packet(): source r%u has size %u, which is not "


Error message seems out of date now.
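
To illustrate the point made in this and the previous comment, here is a hedged sketch of messages that match the relaxed conditions in the diff. The helper name and the exact wording are hypothetical, and errors are returned as strings rather than raised through jitc_raise.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical validation helper mirroring the relaxed checks in the diff.
// Returns an empty string on success, otherwise the error text.
std::string check_gather_packet(size_t n, uint32_t src_size) {
    // The power-of-two requirement is gone, so the text no longer mentions it
    if (n == 1)
        return "vector size must be greater than 1 (got " +
               std::to_string(n) + ")!";
    // Sources of size 1 are still accepted, as in the original condition
    if (src_size % 2 != 0 && src_size != 1)
        return "source has size " + std::to_string(src_size) +
               ", which is not a multiple of 2!";
    return "";
}
```

Keeping each message next to its condition makes it easier to spot when one changes without the other.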

wjakob · 2025-07-18T08:15:55Z

src/op.cpp

        }
    }

+    // If the packet size is not divisible by two we cannot use packet ops.
+    use_packet_op = use_packet_op && n > 1 && n % 2 == 0;


If we move the logic to code generation, perhaps it's easier to not even have special handling for the n=1 case here.
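
A rough sketch of how a split performed during code generation could make the n = 1 case fall out naturally. PacketSplit and split_packet are hypothetical names, and the sketch assumes that a single leftover value can be handled by a scalar reduction.

```cpp
#include <cstddef>

// Hypothetical result of splitting a packet during code generation
struct PacketSplit {
    size_t pairs;   // number of f16x2 reductions
    size_t scalars; // number of leftover single-value reductions (0 or 1)
};

// Any n >= 1 is handled uniformly, so the caller needs no n == 1 special case
PacketSplit split_packet(size_t n) {
    return PacketSplit{ n / 2, n % 2 };
}
```

For n = 1 this yields zero pairs and one scalar reduction, so the gating in op.cpp would not need to look at n at all.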

merlinND reviewed Jun 12, 2025


merlinND mentioned this pull request Jun 12, 2025

Fixes for 50-series GPUs (CUDA backend) #152

Merged

3 tasks

DoeringChristian force-pushed the scatter-reduce-f16x2 branch 5 times, most recently from 2e4993e to 2620cc7 Compare June 18, 2025 14:17

DoeringChristian force-pushed the scatter-reduce-f16x2 branch 5 times, most recently from bd67bc4 to 27702ea Compare July 9, 2025 15:18

DoeringChristian added 6 commits July 16, 2025 13:40

Added option for f16x2 scatter reductions using the existing scatter_…

33b9177

…packed functions
Specialization for vector scatter reductions on sm_90
Improved failure messages
Removed include of llvm in cuda
Cleanup packet scatter reduce
Improved packet scatter reduce gating

Relax packet scatter splitting condition

34a88f4

Fixed jitc_var_scatter_packet

8ab60b5

Fixed packet size inference

87f3f63

Fixed max_packet_size inference

f7f9de5

Improved packeting for gather_packet

e9c4a5d

DoeringChristian force-pushed the scatter-reduce-f16x2 branch from a45d7d8 to e9c4a5d Compare July 16, 2025 11:40

DoeringChristian marked this pull request as ready for review July 18, 2025 07:09

wjakob reviewed Jul 18, 2025


DoeringChristian added 4 commits July 18, 2025 14:00

Moved cuda packet splitting into codegen

3531169

Improved packeting for odd sizes

99f05a5

Formatting

939a5fd

Small fix

5b99b06
