Add CUDA GPU Backend Support for NVIDIA GPUs #7

@NripeshN

Description

Summary

Implement CUDA backend support for MetalFish to enable GPU-accelerated NNUE evaluation on NVIDIA GPUs, matching the functionality of the existing Metal backend for Apple Silicon.

Background

MetalFish currently uses Apple's Metal framework for GPU acceleration on macOS. To support Linux and Windows users with NVIDIA GPUs, we need a CUDA implementation of the GPU backend.

Requirements

Core Implementation

  • CUDA Backend (src/gpu/cuda/cuda_backend.cu, cuda_backend.h)

    • Implement CUDABuffer, CUDAKernel, CUDACommandEncoder, CUDABackend classes
    • Mirror the interface defined in src/gpu/backend.h
    • Support unified memory (managed memory) for newer GPUs
    • Multi-stream support for parallel kernel execution
  • CUDA NNUE Kernels (src/gpu/cuda/kernels/nnue_kernels.cu, nnue_kernels.h)

    • Feature extraction (HalfKA, Threat features)
    • Feature transformer (full, incremental, optimized)
    • Network layers (FC0, FC1, FC2)
    • Fused forward pass kernel
    • PSQT accumulation
    • Match efficiency of Metal kernels using:
      • Warp-level optimizations (__shfl_* intrinsics)
      • Shared memory for feature indices
      • Sparse input optimization (skip zero values)
      • Vectorized memory access (int4)
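To make the "mirror the interface" requirement concrete, here is a minimal sketch of what the `CUDABuffer` side could look like. The interface names (`Buffer`, `upload`, `download`) are assumptions for illustration; the actual abstract classes must be taken from `src/gpu/backend.h`. The CUDA class would wrap `cudaMalloc`/`cudaMallocManaged`; a host-memory mock is shown so the sketch compiles without the toolkit.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical shape of the abstract buffer interface in src/gpu/backend.h.
class Buffer {
public:
    virtual ~Buffer() = default;
    virtual void upload(const void* src, std::size_t bytes) = 0;
    virtual void download(void* dst, std::size_t bytes) const = 0;
};

// Stand-in for CUDABuffer: the real class would allocate with
// cudaMalloc (or cudaMallocManaged for the unified-memory path) and
// copy with cudaMemcpy / cudaMemcpyAsync on a stream.
class MockCUDABuffer : public Buffer {
    std::vector<unsigned char> storage_;
public:
    explicit MockCUDABuffer(std::size_t bytes) : storage_(bytes) {}
    void upload(const void* src, std::size_t bytes) override {
        std::memcpy(storage_.data(), src, bytes);
    }
    void download(void* dst, std::size_t bytes) const override {
        std::memcpy(dst, storage_.data(), bytes);
    }
};
```

Keeping the CUDA classes behind the same abstract interface lets the engine select Metal or CUDA at runtime without touching NNUE code.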
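The sparse-input optimization deserves a reference point: NNUE layer inputs after the clipped activation are mostly zero, so the kernel should first gather the indices of nonzero activations (into shared memory on the GPU) and accumulate weight columns only for those. The CPU sketch below captures the intended semantics; the function and parameter names are illustrative, not taken from the MetalFish sources.

```cpp
#include <cstdint>
#include <vector>

// CPU reference for the sparse affine layer. The CUDA kernel would stage
// the nonzero index list in shared memory and reduce partial sums with
// warp-level __shfl_* intrinsics; this version only models the
// skip-zero-values behavior the kernels must reproduce.
std::vector<int32_t> sparse_affine(const std::vector<uint8_t>& input,
                                   const std::vector<int8_t>& weights,   // [out][in], row-major
                                   const std::vector<int32_t>& biases) {
    const std::size_t out_dim = biases.size();
    const std::size_t in_dim  = input.size();

    // Gather nonzero input indices (shared-memory staging on the GPU).
    std::vector<std::size_t> nnz;
    for (std::size_t i = 0; i < in_dim; ++i)
        if (input[i] != 0) nnz.push_back(i);

    // Accumulate only the weight columns for nonzero inputs.
    std::vector<int32_t> out(biases);
    for (std::size_t i : nnz)
        for (std::size_t o = 0; o < out_dim; ++o)
            out[o] += int32_t(input[i]) * int32_t(weights[o * in_dim + i]);
    return out;
}
```

A CUDA kernel matching this reference output is easy to validate in `tests/test_cuda.cpp` by comparing against the dense computation.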

Build System

  • CMakeLists.txt Updates
    • find_package(CUDAToolkit) integration
    • Conditional compilation with -DUSE_CUDA=ON
    • CUDA architecture detection/specification
    • Proper linking of CUDA libraries
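The build-system items above could be wired up roughly as follows. This is a sketch, not the final configuration: the target name `metalfish` and the option name are assumptions, and the architecture list is taken from the Technical Notes below.

```cmake
# Sketch only: verify target/option names against the existing CMakeLists.txt.
option(USE_CUDA "Build the CUDA GPU backend" OFF)

if(USE_CUDA)
  enable_language(CUDA)
  find_package(CUDAToolkit REQUIRED)

  # Architectures listed in this issue: Volta through Hopper.
  set(CMAKE_CUDA_ARCHITECTURES 70 75 80 86 89 90)

  target_sources(metalfish PRIVATE
    src/gpu/cuda/cuda_backend.cu
    src/gpu/cuda/kernels/nnue_kernels.cu)

  # Runtime API only (see Technical Notes): link cudart, not the driver API.
  target_link_libraries(metalfish PRIVATE CUDA::cudart)
  target_compile_definitions(metalfish PRIVATE USE_CUDA)
endif()
```

Using `find_package(CUDAToolkit)` with the `CUDA::cudart` imported target avoids hand-written library paths and keeps the non-CUDA build untouched when `USE_CUDA` is off.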

Testing

  • CUDA Tests (tests/test_cuda.cpp)
    • Backend initialization
    • Buffer management
    • Kernel execution
    • NNUE integration tests
    • Performance benchmarks
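Because the CI runners have no GPU (see the CI/CD section), the test file needs a skip guard. A possible shape for `tests/test_cuda.cpp` is sketched below; `cuda_device_count()` stands in for `cudaGetDeviceCount` and is stubbed to 0 here so the sketch runs without the toolkit.

```cpp
#include <cstdio>

// Stub standing in for a Runtime API probe (cudaGetDeviceCount).
// Returns 0 to model a GPU-less CI runner.
static int cuda_device_count() { return 0; }

// Hypothetical test-suite entry point: run GPU-dependent cases only
// when a device is present; report "skipped" (not "failed") otherwise.
bool run_gpu_tests() {
    if (cuda_device_count() == 0) {
        std::puts("no CUDA device: skipping kernel-execution tests");
        return true;  // skipped, not failed
    }
    // ... backend initialization, buffer round-trips, kernel execution,
    //     NNUE integration checks, performance benchmarks ...
    return true;
}
```

Backend-initialization and buffer-management tests that only touch the Runtime API can still run everywhere; only kernel-execution and benchmark cases need the guard.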

CI/CD

  • GitHub Actions Workflow (.github/workflows/ci.yml)
    • Ubuntu CUDA build job
    • CUDA kernel syntax linting
    • Note: free-tier runners have no GPUs, so tests that require actual GPU execution should be skipped or mocked
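A compile-only CI job could look like the sketch below. Job and step names are placeholders, and the `nvidia-cuda-toolkit` apt package may lag behind the 12.x toolkit this issue targets, so the install step should be verified before merging.

```yaml
# Sketch for .github/workflows/ci.yml; names and versions are assumptions.
cuda-build:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Install CUDA toolkit
      run: sudo apt-get update && sudo apt-get install -y nvidia-cuda-toolkit
    - name: Configure and build
      run: |
        cmake -B build -DUSE_CUDA=ON
        cmake --build build -j
    # No GPU on the runner: build + kernel syntax checks only,
    # execution tests are skipped (see Testing section).
```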

Reference Implementation

The Metal implementation can be used as a reference:

  • src/gpu/metal/metal_backend.mm - Backend implementation
  • src/gpu/metal/kernels/nnue_full.metal - Kernel implementations
  • src/gpu/gpu_constants.h - Shared constants

Technical Notes

  1. Driver API vs Runtime API: The CUDA Driver API (libcuda.so) requires an actual NVIDIA GPU with drivers. For CI compatibility, consider using only the Runtime API (cudart) or making driver API usage optional.

  2. NVRTC: Runtime kernel compilation requires both NVRTC and the Driver API. Pre-compiled kernels (compiled by nvcc at build time) are preferred for CI compatibility.

  3. Architecture Support: Target common architectures: Volta (70), Turing (75), Ampere (80, 86), Ada Lovelace (89), Hopper (90).
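Note 1 can be made concrete with a Runtime-API-only availability probe. This is a sketch of the pattern, not code from the repository:

```cuda
#include <cuda_runtime.h>

// Runtime-API-only probe: cudaGetDeviceCount comes from cudart and fails
// gracefully on machines without an NVIDIA driver, whereas the Driver
// API's cuInit requires libcuda.so from an installed driver and cannot
// even be loaded on a driverless CI box.
bool cuda_available() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    // On a GPU-less machine this returns an error code (e.g.
    // cudaErrorNoDevice or cudaErrorInsufficientDriver) rather than
    // crashing, so the engine can fall back to the CPU path.
    return err == cudaSuccess && count > 0;
}
```

Gating all GPU paths behind a check like this keeps a single binary usable on machines with and without NVIDIA hardware.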

Environment Requirements

  • CUDA Toolkit 12.0+ (or 13.x for latest features)
  • NVIDIA GPU with Compute Capability 7.0+
  • Linux or Windows
