Summary
Implement CUDA backend support for MetalFish to enable GPU-accelerated NNUE evaluation on NVIDIA GPUs, matching the functionality of the existing Metal backend for Apple Silicon.
Background
MetalFish currently uses Apple's Metal framework for GPU acceleration on macOS. To support Linux and Windows users with NVIDIA GPUs, we need a CUDA implementation of the GPU backend.
Requirements
Core Implementation
- CUDA Backend (`src/gpu/cuda/cuda_backend.cu`, `cuda_backend.h`)
  - Implement `CUDABuffer`, `CUDAKernel`, `CUDACommandEncoder`, `CUDABackend` classes (a rough buffer/stream sketch follows this list)
  - Mirror the interface defined in `src/gpu/backend.h`
  - Support unified memory (managed memory) for newer GPUs
  - Multi-stream support for parallel kernel execution
- CUDA NNUE Kernels (`src/gpu/cuda/kernels/nnue_kernels.cu`, `nnue_kernels.h`)
  - Feature extraction (HalfKA, Threat features)
  - Feature transformer (full, incremental, optimized)
  - Network layers (FC0, FC1, FC2)
  - Fused forward pass kernel
  - PSQT accumulation
  - Match the efficiency of the Metal kernels (see the kernel sketch after this list) using:
    - Warp-level optimizations (`__shfl_*` intrinsics)
    - Shared memory for feature indices
    - Sparse input optimization (skip zero values)
    - Vectorized memory access (`int4`)
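
As a starting point for the backend classes, here is a minimal sketch of the buffer side using the CUDA Runtime API, with optional managed memory and stream-based async copies. The class shape and method names (`upload`, `download`) are placeholders; the real implementation must mirror whatever `src/gpu/backend.h` actually defines.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>

// Illustrative only: the real class must mirror src/gpu/backend.h.
class CUDABuffer {
public:
    // Prefer unified (managed) memory on newer GPUs so host and device
    // share a single pointer; fall back to plain device memory otherwise.
    CUDABuffer(std::size_t bytes, bool use_managed) : size_(bytes) {
        cudaError_t err = use_managed ? cudaMallocManaged(&ptr_, bytes)
                                      : cudaMalloc(&ptr_, bytes);
        if (err != cudaSuccess)
            throw std::runtime_error(cudaGetErrorString(err));
    }
    ~CUDABuffer() { cudaFree(ptr_); }

    // Asynchronous copies on a caller-supplied stream let several command
    // encoders overlap transfers with kernel execution (multi-stream support).
    void upload(const void* src, std::size_t bytes, cudaStream_t stream) {
        cudaMemcpyAsync(ptr_, src, bytes, cudaMemcpyHostToDevice, stream);
    }
    void download(void* dst, std::size_t bytes, cudaStream_t stream) {
        cudaMemcpyAsync(dst, ptr_, bytes, cudaMemcpyDeviceToHost, stream);
    }

    void*       data() const { return ptr_; }
    std::size_t size() const { return size_; }

private:
    void*       ptr_  = nullptr;
    std::size_t size_ = 0;
};
```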
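And a rough sketch of the kind of kernel the optimization bullets describe: a feature-accumulation kernel that stages active feature indices in shared memory, skips padded (inactive) entries, uses a `__shfl_down_sync` warp reduction, and reads weights with vectorized `int4` loads. All dimensions, layouts, and names are illustrative; the real values live in `src/gpu/gpu_constants.h` and the Metal kernels.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

// Placeholder sizes; the real values live in src/gpu/gpu_constants.h.
constexpr int kHalfDimensions    = 1024;  // accumulator width in int16 lanes
constexpr int kMaxActiveFeatures = 64;    // active-feature slots per position

// One block per position, blockDim.x = kHalfDimensions / 8 = 128 threads.
// Only the active (non-padding) features contribute: sparse-input optimization.
__global__ void accumulate_features(
    const int32_t* __restrict__ feature_indices, // [positions][kMaxActiveFeatures], -1 padded at the end
    const int16_t* __restrict__ weights,         // [features][kHalfDimensions]
    int16_t*       __restrict__ accumulators)    // [positions][kHalfDimensions]
{
    const int pos = blockIdx.x;

    // Stage the feature indices in shared memory so every thread in the
    // block can walk them without repeated global-memory traffic.
    __shared__ int32_t s_indices[kMaxActiveFeatures];
    __shared__ int     s_count;
    if (threadIdx.x < kMaxActiveFeatures)
        s_indices[threadIdx.x] = feature_indices[pos * kMaxActiveFeatures + threadIdx.x];
    __syncthreads();

    // First warp counts the valid entries with a __shfl_down_sync reduction.
    if (threadIdx.x < 32) {
        int valid = 0;
        for (int i = threadIdx.x; i < kMaxActiveFeatures; i += 32)
            valid += (s_indices[i] >= 0);
        for (int offset = 16; offset > 0; offset >>= 1)
            valid += __shfl_down_sync(0xffffffffu, valid, offset);
        if (threadIdx.x == 0) s_count = valid;
    }
    __syncthreads();

    // Each thread owns 8 consecutive int16 lanes = one 16-byte int4 vector.
    const int lane_base = threadIdx.x * 8;
    if (lane_base >= kHalfDimensions) return;

    int4 acc_vec = make_int4(0, 0, 0, 0);
    int16_t* acc = reinterpret_cast<int16_t*>(&acc_vec);

    for (int f = 0; f < s_count; ++f) {
        const int16_t* col =
            weights + static_cast<size_t>(s_indices[f]) * kHalfDimensions + lane_base;
        int4 w = *reinterpret_cast<const int4*>(col);  // vectorized 16-byte load
        const int16_t* wv = reinterpret_cast<const int16_t*>(&w);
        #pragma unroll
        for (int k = 0; k < 8; ++k) acc[k] += wv[k];
    }

    *reinterpret_cast<int4*>(
        accumulators + static_cast<size_t>(pos) * kHalfDimensions + lane_base) = acc_vec;
}
```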
Build System
- CMakeLists.txt Updates (a possible wiring sketch follows this list)
  - `find_package(CUDAToolkit)` integration
  - Conditional compilation with `-DUSE_CUDA=ON`
  - CUDA architecture detection/specification
  - Proper linking of CUDA libraries
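
A possible shape for the CMake wiring, assuming the main target is called `metalfish` (adjust to the project's actual target name) and that only the Runtime API is linked, per the Technical Notes below:

```cmake
option(USE_CUDA "Build the CUDA GPU backend" OFF)

if(USE_CUDA)
  enable_language(CUDA)
  find_package(CUDAToolkit 12.0 REQUIRED)

  # Architectures from the Technical Notes section below.
  set(CMAKE_CUDA_ARCHITECTURES 70 75 80 86 89 90)

  target_sources(metalfish PRIVATE
    src/gpu/cuda/cuda_backend.cu
    src/gpu/cuda/kernels/nnue_kernels.cu)

  target_compile_definitions(metalfish PRIVATE USE_CUDA)

  # Runtime API only; skipping CUDA::cuda_driver keeps GPU-less CI linkable.
  target_link_libraries(metalfish PRIVATE CUDA::cudart)
endif()
```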
Testing
- CUDA Tests (`tests/test_cuda.cpp`); a minimal round-trip sketch follows this list
  - Backend initialization
  - Buffer management
  - Kernel execution
  - NNUE integration tests
  - Performance benchmarks
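
A minimal, framework-agnostic sketch of the first two checks (device initialization and a buffer round trip), written directly against the CUDA Runtime API since the `CUDABackend` class does not exist yet:

```cpp
// Sketch for tests/test_cuda.cpp; plain main(), no test framework assumed.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    // Backend initialization: is there at least one usable CUDA device?
    int devices = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) {
        std::printf("No CUDA device available, skipping GPU tests.\n");
        return 0;  // skip (not fail) on GPU-less CI runners
    }

    // Buffer management: round-trip a small buffer through device memory.
    std::vector<int> host(1024, 7), back(1024, 0);
    const auto bytes = host.size() * sizeof(int);
    int* dev = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&dev), bytes) != cudaSuccess) return 1;
    if (cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) return 1;
    if (cudaMemcpy(back.data(), dev, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(dev);

    if (back != host) return 1;
    std::printf("CUDA buffer round-trip OK on %d device(s).\n", devices);
    return 0;
}
```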
CI/CD
- GitHub Actions Workflow (`.github/workflows/ci.yml`)
  - Ubuntu CUDA build job
  - CUDA kernel syntax linting
  - Note: free-tier GitHub-hosted runners have no GPUs, so tests requiring actual GPU execution should be skipped or mocked (see the helper sketch after this list)
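
One way to handle the GPU-less runners is to gate every GPU-dependent test on a runtime device query rather than on compile-time flags; a tiny shared helper along these lines would do:

```cpp
#include <cuda_runtime.h>

// True only when at least one CUDA device is actually present. On a GPU-less
// runner cudaGetDeviceCount fails (or reports zero), so GPU-execution tests
// can be skipped instead of failing the workflow.
inline bool cuda_device_available() {
    int count = 0;
    return cudaGetDeviceCount(&count) == cudaSuccess && count > 0;
}
```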
Reference Implementation
The Metal implementation can be used as a reference:
- `src/gpu/metal/metal_backend.mm` - Backend implementation
- `src/gpu/metal/kernels/nnue_full.metal` - Kernel implementations
- `src/gpu/gpu_constants.h` - Shared constants
Technical Notes
- Driver API vs Runtime API: The CUDA Driver API (`libcuda.so`) requires an actual NVIDIA GPU with drivers installed. For CI compatibility, consider using only the Runtime API (`cudart`) or making Driver API usage optional.
- NVRTC: Runtime kernel compilation requires both NVRTC and the Driver API. Pre-compiled kernels (compiled by nvcc at build time) are preferred for CI compatibility (see the launch sketch below).
- Architecture Support: Target common architectures: Volta (70), Turing (75), Ampere (80, 86), Ada Lovelace (89), Hopper (90).
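
To make the Driver API / NVRTC point concrete: a kernel compiled by nvcc at build time can be launched entirely through the Runtime API with the usual triple-chevron syntax, so the backend never needs `cuModuleLoad`/`cuLaunchKernel` or NVRTC. The kernel below is a throwaway placeholder:

```cuda
#include <cuda_runtime.h>

// Compiled by nvcc at build time; no NVRTC, no cuModuleLoad/cuLaunchKernel.
__global__ void scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void scale_on_gpu(float* device_data, float factor, int n, cudaStream_t stream) {
    // The triple-chevron launch goes through the Runtime API (cudart) only,
    // so nothing in the build has to link against libcuda.so directly.
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads, 0, stream>>>(device_data, factor, n);
}
```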
Environment Requirements
- CUDA Toolkit 12.0+ (or 13.x for latest features)
- NVIDIA GPU with Compute Capability 7.0+
- Linux or Windows
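
The toolkit and compute-capability requirements above can also be enforced in code; a small sketch (exact placement in the backend is open):

```cpp
#include <cuda_runtime.h>

#if CUDART_VERSION < 12000
#error "The CUDA backend targets CUDA Toolkit 12.0 or newer"
#endif

// Enforce the Compute Capability 7.0+ requirement at runtime.
bool device_meets_requirements(int device_id) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess)
        return false;
    return prop.major >= 7;  // Volta (7.0) and newer
}
```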