Fixes for 50-series GPUs (CUDA backend) #152

merlinND · 2025-06-12T14:03:23Z

Reductions

Somewhat related to #151.

~~For f16 or f16x2, only .add is supported by the red instruction.~~
~~Mark other reduction operations as unsupported. Note this means that DrJit will now raise an exception in cases where it was previously "fine".~~

~~TODO: actually raise an exception in the unsupported cases, right now it's triggering a confusing error message.~~

Edit: @wjakob implemented the scatter reduce operations in a different way that is actually supported (included in this PR).
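For context, one common way to provide a reduction that lacks a native atomic instruction is a compare-and-swap retry loop. This PR does not describe how the replacement scatter-reduce was implemented, so the C++ sketch below only illustrates that general fallback on the CPU. `atomic_min_cas` is a hypothetical helper, not a DrJit function.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of DrJit): emulate an atomic "min" reduction
// with a compare-and-swap loop, for targets without a native instruction.
inline float atomic_min_cas(std::atomic<float> &target, float value) {
    assert(!std::isnan(value)); // NaN handling is out of scope for this sketch

    float observed = target.load(std::memory_order_relaxed);

    // Retry until the stored value is already <= 'value', or until our
    // exchange succeeds. On failure, compare_exchange_weak refreshes
    // 'observed' with the current contents of 'target'.
    while (value < observed &&
           !target.compare_exchange_weak(observed, value,
                                         std::memory_order_relaxed)) {
    }

    // Last value read from 'target' before any update made by this call.
    return observed;
}
```

For example, calling `atomic_min_cas(t, 2.5f)` on a target holding `5.0f` leaves `2.5f` in place, and a later call with `7.0f` changes nothing.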

Compress / prefix reduction

Symptom: on 50-series GPUs, any usage of dr::compress() with an input of size larger than 4096 would hang with 100% GPU usage.
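To make the size threshold concrete: assuming (this is not stated in the PR) that inputs above 4096 elements are split across several thread blocks, each block needs the number of active entries in all preceding blocks before it can write its output. The sequential C++ sketch below models that dependency. `compress_blockwise` and `block_size` are hypothetical names. On the GPU the offsets would be obtained while blocks run concurrently, not in a separate pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential model of a block-wise compress: returns the indices of all
// nonzero mask entries. 'block_size' is a hypothetical tile size.
std::vector<uint32_t> compress_blockwise(const std::vector<uint8_t> &mask,
                                         size_t block_size) {
    assert(block_size > 0);
    const size_t n = mask.size();
    const size_t block_count = (n + block_size - 1) / block_size;

    // Pass 1: each block counts its own active entries.
    std::vector<uint32_t> block_total(block_count, 0);
    for (size_t i = 0; i < n; ++i)
        block_total[i / block_size] += mask[i] ? 1u : 0u;

    // Pass 2: an exclusive prefix sum over the block totals yields the
    // output offset of each block. This is the cross-block dependency.
    std::vector<uint32_t> block_offset(block_count, 0);
    for (size_t b = 1; b < block_count; ++b)
        block_offset[b] = block_offset[b - 1] + block_total[b - 1];

    const uint32_t total =
        block_count ? block_offset.back() + block_total.back() : 0;

    // Pass 3: each block writes its indices starting at its own offset.
    std::vector<uint32_t> out(total);
    for (size_t b = 0; b < block_count; ++b) {
        uint32_t cursor = block_offset[b];
        const size_t end = std::min(n, (b + 1) * block_size);
        for (size_t i = b * block_size; i < end; ++i)
            if (mask[i])
                out[cursor++] = static_cast<uint32_t>(i);
    }
    return out;
}
```

With a single block, pass 2 is trivial, which would be consistent with small inputs not being affected.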

After looking at the corresponding CUB implementation, it seems that our implementations of the decoupled-lookback pattern in the Compress and Prefix Reduction kernels needed a `__threadfence_block()`.

Edit: @wjakob proposed a different implementation using volatile stores/loads, which is more targeted.
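The decoupled-lookback pattern mentioned above can be sketched on the CPU as follows. This is a rough analogue under stated assumptions, not the kernel code from this PR: one OS thread stands in for each GPU block, and `std::atomic` acquire loads stand in for the volatile loads, since both force the spin loop to re-read memory on every iteration. `lookback_offsets`, `pack`, and the flag names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Per-block descriptor: flag in the upper 32 bits, value in the lower 32 bits,
// so that flag and value are always published together by one 64-bit store.
enum : uint64_t { Invalid = 0, Aggregate = 1, Inclusive = 2 };

static uint64_t pack(uint64_t flag, uint32_t value) {
    return (flag << 32) | value;
}

// Computes the exclusive prefix sum of 'block_total' with one thread per block.
std::vector<uint32_t> lookback_offsets(const std::vector<uint32_t> &block_total) {
    const size_t n = block_total.size();
    assert(n <= 64); // one OS thread per block: keep the sketch small

    std::vector<std::atomic<uint64_t>> state(n);
    for (auto &s : state)
        s.store(pack(Invalid, 0));
    std::vector<uint32_t> offset(n, 0);

    auto block = [&](size_t b) {
        // Publish the local total first so that successors can make progress.
        state[b].store(pack(b == 0 ? Inclusive : Aggregate, block_total[b]),
                       std::memory_order_release);

        uint32_t prefix = 0;
        for (size_t p = b; p-- > 0;) {
            uint64_t s;
            // Spin until predecessor 'p' has published something. The load
            // must really be repeated: a cached or hoisted read spins forever.
            while (((s = state[p].load(std::memory_order_acquire)) >> 32) == Invalid)
                std::this_thread::yield();
            prefix += static_cast<uint32_t>(s);
            if ((s >> 32) == Inclusive)
                break; // 's' already covers all earlier blocks
        }

        offset[b] = prefix;
        if (b > 0)
            state[b].store(pack(Inclusive, prefix + block_total[b]),
                           std::memory_order_release);
    };

    std::vector<std::thread> threads;
    for (size_t b = 0; b < n; ++b)
        threads.emplace_back(block, b);
    for (auto &t : threads)
        t.join();
    return offset;
}
```

If the load inside the `while` loop were an ordinary non-atomic read, the compiler would be free to read the descriptor once and spin on the stale value, which is the kind of hang with full utilization described in the symptom.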

Testing

On a 5080 with driver 575.57.08:

- DrJit-core test suite (except test_graphviz, but that's unrelated and already failing on master)
- DrJit test suite
- Mitsuba 3 test suite


merlinND commented Jun 12, 2025

resources/compress.cuh (outdated, resolved)

src/op.cpp (outdated, resolved)

merlinND self-assigned this Jun 12, 2025

merlinND requested review from wjakob and njroussel June 12, 2025 14:15

merlinND mentioned this pull request Jun 12, 2025

Ray tracing hangs or crashes with 50-series GPUs (driver 575.xx) NVlabs/sionna#913

Closed

wjakob added 2 commits June 16, 2025 14:01

Built-in kernels: fix strong load/stores not being used

c048ce8

Scatter reduce: fix min, max op on recent GPUs

5ac6f14

merlinND force-pushed the fix-cuda-50-series branch from 3d8744e to 5ac6f14 June 16, 2025 12:53

merlinND added 2 commits June 17, 2025 12:13

Scatter reduce: fix float16 variant on CUDA

c8e99c6

Fix usage of the `.v2` variant. Prefer the `.f16x2` variant for `.add` since it is more broadly supported.

CUDA reductions: use only volatile loads

29113be

We probably do not need volatile stores in order to guarantee progress in decoupled lookback-style algorithms.

merlinND force-pushed the fix-cuda-50-series branch from 3603c28 to 29113be June 17, 2025 10:13

wjakob marked this pull request as ready for review June 17, 2025 11:10

wjakob merged commit 5c3dabc into master Jun 17, 2025
5 checks passed

wjakob deleted the fix-cuda-50-series branch June 17, 2025 11:11

merlinND mentioned this pull request Jul 18, 2025

Path Solver Freezes On GPU NVlabs/sionna-rt#26

Closed
