Fixes for 50-series GPUs (CUDA backend) #152


Merged: 4 commits merged into master on Jun 17, 2025

Conversation

@merlinND (Member) commented on Jun 12, 2025

Reductions

Somewhat related to #151.

For `f16` or `f16x2` operands, only `.add` is supported by the `red` instruction.
Mark the other reduction operations as unsupported. Note that this means Dr.Jit will now raise an exception in cases where it previously appeared to be "fine".

TODO: actually raise an exception in the unsupported cases; right now, they trigger a confusing error message.

Edit: @wjakob implemented the scatter-reduce operations in a different way that is actually supported (included in this PR).
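
For reference, a minimal sketch of what the supported case looks like at the instruction level. This is not Dr.Jit-core's actual code generation (it emits PTX directly rather than going through inline assembly), and the wrapper name `red_add_f16x2` is invented for illustration:

```cuda
#include <cuda_fp16.h>

// Atomic reduction of a packed half2 via the PTX `red` instruction.
// `.add` is the only operation kept for f16/f16x2 on this path; the other
// reduction ops are now reported as unsupported.
__device__ void red_add_f16x2(__half2 *addr, __half2 val) {
    // Reinterpret the packed half2 as a 32-bit register for the PTX operand.
    unsigned bits = *reinterpret_cast<unsigned *>(&val);
    asm volatile("red.global.add.noftz.f16x2 [%0], %1;"
                 :: "l"(addr), "r"(bits)
                 : "memory");
}
```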

Compress / prefix reduction

Symptom: on 50-series GPUs, any use of `dr::compress()` with an input larger than 4096 entries would hang at 100% GPU utilization.

After looking at this CUB implementation, it seems that our implementations of the decoupled-lookback pattern in the Compress and Prefix Reduction kernels were missing a `__threadfence_block()`.

Edit: @wjakob proposed a different implementation using volatile stores/loads, which is more targeted.
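
To make the two candidate fixes concrete, here is a minimal sketch of the hand-off between partitions in a decoupled-lookback pass. This is not the actual Dr.Jit-core kernel; the 64-bit packing and the names `publish` / `spin_wait` are invented for illustration:

```cuda
// Each partition publishes (ready flag | partial result) as one 64-bit word,
// so the flag can never be observed separately from the value it guards.
//
// Fix (a): plain stores preceded by a fence such as __threadfence_block(),
// ordering the partial result before the flag becomes visible.
// Fix (b), used in this PR: volatile accesses, which force the compiler to
// re-issue the load on every spin iteration instead of reusing a cached
// value -- otherwise a waiting block can spin forever on a stale read.

#define FLAG_READY (1ull << 63)

__device__ void publish(volatile unsigned long long *slot,
                        unsigned long long partial) {
    *slot = FLAG_READY | partial; // volatile store: emitted exactly as written
}

__device__ unsigned long long spin_wait(volatile unsigned long long *slot) {
    unsigned long long v;
    do {
        v = *slot; // volatile load: re-reads memory on each iteration
    } while (!(v & FLAG_READY));
    return v & ~FLAG_READY;
}
```

The volatile variant is "more targeted" in the sense that it only constrains the specific loads/stores involved in the hand-off, rather than ordering all outstanding memory traffic the way a fence does.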

Testing

On a 5080 with driver 575.57.08:

  • Dr.Jit-core test suite (except test_graphviz, but that is unrelated and already failing on master)
  • Dr.Jit test suite
  • Mitsuba 3 test suite

@merlinND force-pushed the fix-cuda-50-series branch from 3d8744e to 5ac6f14 on June 16, 2025 at 12:53
merlinND added 2 commits on June 17, 2025 at 12:13:
  • Fix usage of the `.v2` variant.
  • Prefer the `.f16x2` variant for `.add` since it is more broadly supported.
  • We probably do not need volatile stores in order to guarantee progress in decoupled lookback-style algorithms.
@merlinND force-pushed the fix-cuda-50-series branch from 3603c28 to 29113be on June 17, 2025 at 10:13
@wjakob marked this pull request as ready for review on June 17, 2025 at 11:10
@wjakob merged commit 5c3dabc into master on June 17, 2025 (5 checks passed)
@wjakob deleted the fix-cuda-50-series branch on June 17, 2025 at 11:11