Expand CUDA intrinsics coverage to 120+ operations #8

Merged
mivertowski merged 1 commit into main from claude/expand-cuda-intrinsics-015QNYA1vEFCNZoLMdLafWQ8
Dec 11, 2025

@mivertowski
Owner

This commit significantly expands the CUDA codegen transpiler's intrinsic coverage from 45+ to 120+ GPU intrinsics across 13 categories:

New Intrinsics Added:

Synchronization (3 new)

  • sync_threads_count, sync_threads_and, sync_threads_or

Atomic Operations (5 new)

  • atomic_and, atomic_or, atomic_xor, atomic_inc, atomic_dec
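
Of the atomics above, `atomic_inc` and `atomic_dec` are the least obvious, because CUDA's `atomicInc` and `atomicDec` wrap against a caller-supplied limit rather than simply adding or subtracting one. The following is a minimal sketch of a CPU fallback with those wrap rules. It assumes the DSL functions follow the CUDA semantics; the signatures are hypothetical and are not taken from this PR.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical CPU fallback mirroring CUDA `atomicInc`:
/// stores 0 if old >= limit, otherwise old + 1, and returns the old value.
pub fn atomic_inc(cell: &AtomicU32, limit: u32) -> u32 {
    cell.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |old| {
        Some(if old >= limit { 0 } else { old + 1 })
    })
    .unwrap() // the closure always returns Some, so this cannot fail
}

/// Hypothetical CPU fallback mirroring CUDA `atomicDec`:
/// stores limit if old == 0 or old > limit, otherwise old - 1,
/// and returns the old value.
pub fn atomic_dec(cell: &AtomicU32, limit: u32) -> u32 {
    cell.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |old| {
        Some(if old == 0 || old > limit { limit } else { old - 1 })
    })
    .unwrap() // the closure always returns Some, so this cannot fail
}
```

With a limit of 3, repeated `atomic_inc` calls cycle the stored value through 0, 1, 2, 3, which is the usual way these operations are used for ring-buffer indices.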

Math Functions (8 new)

  • trunc, fmod, remainder, copysign, cbrt, hypot, plus warp_size

Trigonometric (7 new)

  • asin, acos, atan, atan2, sincos, sinpi, cospi

Hyperbolic (6 new)

  • sinh, cosh, tanh, asinh, acosh, atanh

Exponential/Logarithmic (14 new)

  • exp2, exp10, expm1, log2, log10, log1p, ldexp, scalbn, ilogb
  • lgamma, tgamma, erf, erfc, erfinv, erfcinv
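
Rust's standard library has no `erf`, so a CPU fallback for it has to be written by hand. One common choice is the Abramowitz and Stegun approximation 7.1.26, sketched below in `f64`. This is an illustration only: the PR does not say which approximation `dsl.rs` uses, and the function names and signatures here are assumptions.

```rust
/// Sketch of erf(x) using Abramowitz & Stegun formula 7.1.26
/// (maximum absolute error about 1.5e-7). Not the project's actual code.
pub fn erf(x: f64) -> f64 {
    const P: f64 = 0.3275911;
    const A: [f64; 5] = [
        0.254829592,
        -0.284496736,
        1.421413741,
        -1.453152027,
        1.061405429,
    ];
    // erf is odd, so evaluate on |x| and restore the sign at the end.
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let ax = x.abs();
    let t = 1.0 / (1.0 + P * ax);
    // Horner evaluation of a1*t + a2*t^2 + ... + a5*t^5.
    let poly = A.iter().rev().fold(0.0, |acc, &a| (acc + a) * t);
    sign * (1.0 - poly * (-ax * ax).exp())
}

/// Complementary error function built on the sketch above.
/// Loses relative accuracy for large x, where erf(x) is close to 1.
pub fn erfc(x: f64) -> f64 {
    1.0 - erf(x)
}
```

The other functions in this group that exist on Rust floats (`exp2`, `ln_1p`, `exp_m1`, `log2`, `log10`) can presumably delegate to the standard library directly.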

Classification (8 new)

  • isnan, isinf, isfinite, isnormal, signbit, nextafter, fdim, nan

Warp Operations (8 new)

  • warp_match_any, warp_match_all
  • warp_reduce_add/min/max/and/or/xor
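
On the CPU there is no warp, so a fallback has to emulate one. The sketch below models a warp as an array of 32 lane values and assumes the semantics of CUDA's `__reduce_add_sync` and `__match_any_sync`. The signatures are hypothetical; the PR does not show how the DSL represents lanes.

```rust
pub const WARP_SIZE: usize = 32;

/// Sum the values of all lanes selected by `mask` (wrapping on overflow).
/// Every participating lane would receive this same result.
pub fn warp_reduce_add(lanes: &[u32; WARP_SIZE], mask: u32) -> u32 {
    let mut sum = 0u32;
    for (i, v) in lanes.iter().enumerate() {
        if (mask >> i) & 1 == 1 {
            sum = sum.wrapping_add(*v);
        }
    }
    sum
}

/// Return a bitmask of the lanes in `mask` whose value equals the value
/// held by `lane`, i.e. the result that `lane` itself would observe.
pub fn warp_match_any(lanes: &[u32; WARP_SIZE], mask: u32, lane: usize) -> u32 {
    let mut out = 0u32;
    for (i, v) in lanes.iter().enumerate() {
        if (mask >> i) & 1 == 1 && *v == lanes[lane] {
            out |= 1u32 << i;
        }
    }
    out
}
```

The min, max, and, or, and xor reductions would follow the same loop with a different combining operation.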

Bit Manipulation (8 new)

  • popc, clz, ctz, ffs, brev, byte_perm
  • funnel_shift_left, funnel_shift_right
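
Most of these bit operations map directly onto Rust integer methods, which makes CPU fallbacks short. The following sketch assumes the DSL functions take and return `u32` and follow the CUDA conventions (`__clz(0)` is 32, `__ffs` is 1-based with 0 for no set bit, funnel shifts use `shift & 31`). The signatures are illustrative, not copied from the PR, and `byte_perm` is omitted.

```rust
/// Population count: number of set bits.
pub fn popc(x: u32) -> u32 {
    x.count_ones()
}

/// Count leading zeros; returns 32 for x == 0, as CUDA `__clz` does.
pub fn clz(x: u32) -> u32 {
    x.leading_zeros()
}

/// Count trailing zeros; returns 32 for x == 0.
pub fn ctz(x: u32) -> u32 {
    x.trailing_zeros()
}

/// Find first set: 1-based position of the lowest set bit, 0 if none.
pub fn ffs(x: u32) -> u32 {
    if x == 0 { 0 } else { x.trailing_zeros() + 1 }
}

/// Reverse the bit order of a 32-bit word.
pub fn brev(x: u32) -> u32 {
    x.reverse_bits()
}

/// Shift the 64-bit value hi:lo left by (shift & 31), return the upper 32 bits.
pub fn funnel_shift_left(lo: u32, hi: u32, shift: u32) -> u32 {
    let wide = ((hi as u64) << 32) | lo as u64;
    ((wide << (shift & 31)) >> 32) as u32
}

/// Shift the 64-bit value hi:lo right by (shift & 31), return the lower 32 bits.
pub fn funnel_shift_right(lo: u32, hi: u32, shift: u32) -> u32 {
    let wide = ((hi as u64) << 32) | lo as u64;
    (wide >> (shift & 31)) as u32
}
```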

Memory Operations (3 new)

  • ldg, prefetch_l1, prefetch_l2

Special Functions (13 new)

  • rcp, fdividef, saturate
  • j0, j1, jn, y0, y1, yn (Bessel functions)
  • normcdf, normcdfinv, cyl_bessel_i0, cyl_bessel_i1

Clock/Timing (3 new)

  • clock, clock64, nanosleep

3D Stencil Support

  • Added pos.up(buf) and pos.down(buf) for 3D volumetric kernels
  • Full 3D at() offset support: pos.at(buf, dx, dy, dz)
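
To illustrate what a 3D `at()` offset has to compute, here is a minimal sketch of the index arithmetic for a row-major volume. The `Pos3` struct, its field names, and the row-major layout are all assumptions made for this example; the PR does not show the real type. The sign convention of `up` and `down` is also not stated in the PR, so the sketch only implements `at`, of which `up` and `down` would be the unit z-offset cases.

```rust
/// Hypothetical 3D grid position over a row-major buffer
/// (x fastest, then y, then z). Field names are illustrative only.
#[derive(Clone, Copy)]
pub struct Pos3 {
    pub x: usize,
    pub y: usize,
    pub z: usize,
    pub width: usize,
    pub height: usize,
}

impl Pos3 {
    /// Read the neighbor at offset (dx, dy, dz).
    /// No boundary handling: the caller must keep the offset inside the
    /// volume, otherwise the slice index panics.
    pub fn at(&self, buf: &[f32], dx: isize, dy: isize, dz: isize) -> f32 {
        let x = (self.x as isize + dx) as usize;
        let y = (self.y as isize + dy) as usize;
        let z = (self.z as isize + dz) as usize;
        buf[(z * self.height + y) * self.width + x]
    }
}
```

For a 3x3x3 volume, the center cell (1, 1, 1) sits at linear index 13, and a z offset of one moves the index by `width * height`, that is 9.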

DSL Module

  • Comprehensive CPU fallback implementations for all intrinsics
  • 20+ new tests for DSL functions

Testing

  • 171 total tests (up from 143)
  • New test coverage for all intrinsic categories
  • Tests for intrinsic flags, categories, and CUDA output

Documentation

  • Updated README with complete intrinsic reference
  • Updated CLAUDE.md with new capabilities

@mivertowski mivertowski merged commit a91a823 into main Dec 11, 2025
5 of 7 checks passed
@mivertowski mivertowski deleted the claude/expand-cuda-intrinsics-015QNYA1vEFCNZoLMdLafWQ8 branch December 11, 2025 09:14
mivertowski added a commit that referenced this pull request Dec 11, 2025
- Add CUDA Codegen Intrinsics Expansion section to CHANGELOG
- Update README with 120+ intrinsics count and 3D stencil patterns
- Update docs/13-cuda-codegen.md with complete intrinsics reference
- Fix clippy excessive_precision warnings in dsl.rs erf() function
- Format code with cargo fmt

Changes from merged PR #8:
- Expanded GPU intrinsics from ~45 to 120+ operations
- Added 11 atomic operations (and, or, xor, inc, dec, etc.)
- Added 3D stencil intrinsics (up, down, at with dz)
- Added warp match (Volta+, SM 7.0+) and warp reduce (SM 8.0+) operations
- Added bit manipulation, memory, special, and timing ops
- Updated tests from 143 to 171

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>