Summary
Implement CUDA backend support for MetalFish to enable GPU-accelerated NNUE evaluation on NVIDIA GPUs, matching the functionality of the existing Metal backend for Apple Silicon.
Background
MetalFish currently uses Apple's Metal framework for GPU acceleration on macOS. To support Linux and Windows users with NVIDIA GPUs, we need a CUDA implementation of the GPU backend.
Requirements
Core Implementation
- CUDA Backend (`src/gpu/cuda/cuda_backend.cu`, `cuda_backend.h`)
  - Implement `CUDABuffer`, `CUDAKernel`, `CUDACommandEncoder`, `CUDABackend` classes (a rough buffer/stream sketch follows this list)
  - Mirror the interface defined in `src/gpu/backend.h`
  - Support unified memory (managed memory) for newer GPUs
  - Multi-stream support for parallel kernel execution
- CUDA NNUE Kernels (`src/gpu/cuda/kernels/nnue_kernels.cu`, `nnue_kernels.h`)
  - Feature extraction (HalfKA, Threat features)
  - Feature transformer (full, incremental, optimized)
  - Network layers (FC0, FC1, FC2)
  - Fused forward pass kernel
  - PSQT accumulation
  - Match the efficiency of the Metal kernels (see the kernel sketch after this list) using:
    - Warp-level optimizations (`__shfl_*` intrinsics)
    - Shared memory for feature indices
    - Sparse input optimization (skip zero values)
    - Vectorized memory access (`int4`)
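
As a starting point for the backend classes, here is a minimal sketch of the buffer side using the CUDA Runtime API, with optional managed memory and stream-based async copies. The class shape and method names (`upload`, `download`) are placeholders; the real implementation must mirror whatever `src/gpu/backend.h` actually defines.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>

// Illustrative only: the real class must mirror src/gpu/backend.h.
class CUDABuffer {
public:
    // Prefer unified (managed) memory on newer GPUs so host and device
    // share a single pointer; fall back to plain device memory otherwise.
    CUDABuffer(std::size_t bytes, bool use_managed) : size_(bytes) {
        cudaError_t err = use_managed ? cudaMallocManaged(&ptr_, bytes)
                                      : cudaMalloc(&ptr_, bytes);
        if (err != cudaSuccess)
            throw std::runtime_error(cudaGetErrorString(err));
    }
    ~CUDABuffer() { cudaFree(ptr_); }

    // Asynchronous copies on a caller-supplied stream let several command
    // encoders overlap transfers with kernel execution (multi-stream support).
    void upload(const void* src, std::size_t bytes, cudaStream_t stream) {
        cudaMemcpyAsync(ptr_, src, bytes, cudaMemcpyHostToDevice, stream);
    }
    void download(void* dst, std::size_t bytes, cudaStream_t stream) {
        cudaMemcpyAsync(dst, ptr_, bytes, cudaMemcpyDeviceToHost, stream);
    }

    void*       data() const { return ptr_; }
    std::size_t size() const { return size_; }

private:
    void*       ptr_  = nullptr;
    std::size_t size_ = 0;
};
```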
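And a rough sketch of the kind of kernel the optimization bullets describe: a feature-accumulation kernel that stages active feature indices in shared memory, skips padded (inactive) entries, uses a `__shfl_down_sync` warp reduction, and reads weights with vectorized `int4` loads. All dimensions, layouts, and names are illustrative; the real values live in `src/gpu/gpu_constants.h` and the Metal kernels.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

// Placeholder sizes; the real values live in src/gpu/gpu_constants.h.
constexpr int kHalfDimensions    = 1024;  // accumulator width in int16 lanes
constexpr int kMaxActiveFeatures = 64;    // active-feature slots per position

// One block per position, blockDim.x = kHalfDimensions / 8 = 128 threads.
// Only the active (non-padding) features contribute: sparse-input optimization.
__global__ void accumulate_features(
    const int32_t* __restrict__ feature_indices, // [positions][kMaxActiveFeatures], -1 padded at the end
    const int16_t* __restrict__ weights,         // [features][kHalfDimensions]
    int16_t*       __restrict__ accumulators)    // [positions][kHalfDimensions]
{
    const int pos = blockIdx.x;

    // Stage the feature indices in shared memory so every thread in the
    // block can walk them without repeated global-memory traffic.
    __shared__ int32_t s_indices[kMaxActiveFeatures];
    __shared__ int     s_count;
    if (threadIdx.x < kMaxActiveFeatures)
        s_indices[threadIdx.x] = feature_indices[pos * kMaxActiveFeatures + threadIdx.x];
    __syncthreads();

    // First warp counts the valid entries with a __shfl_down_sync reduction.
    if (threadIdx.x < 32) {
        int valid = 0;
        for (int i = threadIdx.x; i < kMaxActiveFeatures; i += 32)
            valid += (s_indices[i] >= 0);
        for (int offset = 16; offset > 0; offset >>= 1)
            valid += __shfl_down_sync(0xffffffffu, valid, offset);
        if (threadIdx.x == 0) s_count = valid;
    }
    __syncthreads();

    // Each thread owns 8 consecutive int16 lanes = one 16-byte int4 vector.
    const int lane_base = threadIdx.x * 8;
    if (lane_base >= kHalfDimensions) return;

    int4 acc_vec = make_int4(0, 0, 0, 0);
    int16_t* acc = reinterpret_cast<int16_t*>(&acc_vec);

    for (int f = 0; f < s_count; ++f) {
        const int16_t* col =
            weights + static_cast<size_t>(s_indices[f]) * kHalfDimensions + lane_base;
        int4 w = *reinterpret_cast<const int4*>(col);  // vectorized 16-byte load
        const int16_t* wv = reinterpret_cast<const int16_t*>(&w);
        #pragma unroll
        for (int k = 0; k < 8; ++k) acc[k] += wv[k];
    }

    *reinterpret_cast<int4*>(
        accumulators + static_cast<size_t>(pos) * kHalfDimensions + lane_base) = acc_vec;
}
```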
Build System
- CMakeLists.txt Updates (a possible wiring sketch follows this list)
  - `find_package(CUDAToolkit)` integration
  - Conditional compilation with `-DUSE_CUDA=ON`
  - CUDA architecture detection/specification
  - Proper linking of CUDA libraries
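
A possible shape for the CMake wiring, assuming the main target is called `metalfish` (adjust to the project's actual target name) and that only the Runtime API is linked, per the Technical Notes below:

```cmake
option(USE_CUDA "Build the CUDA GPU backend" OFF)

if(USE_CUDA)
  enable_language(CUDA)
  find_package(CUDAToolkit 12.0 REQUIRED)

  # Architectures from the Technical Notes section below.
  set(CMAKE_CUDA_ARCHITECTURES 70 75 80 86 89 90)

  target_sources(metalfish PRIVATE
    src/gpu/cuda/cuda_backend.cu
    src/gpu/cuda/kernels/nnue_kernels.cu)

  target_compile_definitions(metalfish PRIVATE USE_CUDA)

  # Runtime API only; skipping CUDA::cuda_driver keeps GPU-less CI linkable.
  target_link_libraries(metalfish PRIVATE CUDA::cudart)
endif()
```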
Testing
- CUDA Tests (`tests/test_cuda.cpp`); a minimal round-trip sketch follows this list
  - Backend initialization
  - Buffer management
  - Kernel execution
  - NNUE integration tests
  - Performance benchmarks
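
A minimal, framework-agnostic sketch of the first two checks (device initialization and a buffer round trip), written directly against the CUDA Runtime API since the `CUDABackend` class does not exist yet:

```cpp
// Sketch for tests/test_cuda.cpp; plain main(), no test framework assumed.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    // Backend initialization: is there at least one usable CUDA device?
    int devices = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) {
        std::printf("No CUDA device available, skipping GPU tests.\n");
        return 0;  // skip (not fail) on GPU-less CI runners
    }

    // Buffer management: round-trip a small buffer through device memory.
    std::vector<int> host(1024, 7), back(1024, 0);
    const auto bytes = host.size() * sizeof(int);
    int* dev = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&dev), bytes) != cudaSuccess) return 1;
    if (cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) return 1;
    if (cudaMemcpy(back.data(), dev, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(dev);

    if (back != host) return 1;
    std::printf("CUDA buffer round-trip OK on %d device(s).\n", devices);
    return 0;
}
```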
CI/CD
- GitHub Actions Workflow (`.github/workflows/ci.yml`)
  - Ubuntu CUDA build job
  - CUDA kernel syntax linting
  - Note: free-tier GitHub-hosted runners have no GPUs, so tests requiring actual GPU execution should be skipped or mocked (see the helper sketch after this list)
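
One way to handle the GPU-less runners is to gate every GPU-dependent test on a runtime device query rather than on compile-time flags; a tiny shared helper along these lines would do:

```cpp
#include <cuda_runtime.h>

// True only when at least one CUDA device is actually present. On a GPU-less
// runner cudaGetDeviceCount fails (or reports zero), so GPU-execution tests
// can be skipped instead of failing the workflow.
inline bool cuda_device_available() {
    int count = 0;
    return cudaGetDeviceCount(&count) == cudaSuccess && count > 0;
}
```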
Reference Implementation
The Metal implementation can be used as a reference:
- `src/gpu/metal/metal_backend.mm` - Backend implementation
- `src/gpu/metal/kernels/nnue_full.metal` - Kernel implementations
- `src/gpu/gpu_constants.h` - Shared constants
Technical Notes
- Driver API vs Runtime API: The CUDA Driver API (`libcuda.so`) requires an actual NVIDIA GPU with drivers installed. For CI compatibility, consider using only the Runtime API (`cudart`) or making Driver API usage optional.
- NVRTC: Runtime kernel compilation requires both NVRTC and the Driver API. Pre-compiled kernels (compiled by nvcc at build time) are preferred for CI compatibility (see the launch sketch below).
- Architecture Support: Target common architectures: Volta (70), Turing (75), Ampere (80, 86), Ada Lovelace (89), Hopper (90).
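
To make the Driver API / NVRTC point concrete: a kernel compiled by nvcc at build time can be launched entirely through the Runtime API with the usual triple-chevron syntax, so the backend never needs `cuModuleLoad`/`cuLaunchKernel` or NVRTC. The kernel below is a throwaway placeholder:

```cuda
#include <cuda_runtime.h>

// Compiled by nvcc at build time; no NVRTC, no cuModuleLoad/cuLaunchKernel.
__global__ void scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void scale_on_gpu(float* device_data, float factor, int n, cudaStream_t stream) {
    // The triple-chevron launch goes through the Runtime API (cudart) only,
    // so nothing in the build has to link against libcuda.so directly.
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads, 0, stream>>>(device_data, factor, n);
}
```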
Environment Requirements
- CUDA Toolkit 12.0+ (or 13.x for latest features)
- NVIDIA GPU with Compute Capability 7.0+
- Linux or Windows
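
The toolkit and compute-capability requirements above can also be enforced in code; a small sketch (exact placement in the backend is open):

```cpp
#include <cuda_runtime.h>

#if CUDART_VERSION < 12000
#error "The CUDA backend targets CUDA Toolkit 12.0 or newer"
#endif

// Enforce the Compute Capability 7.0+ requirement at runtime.
bool device_meets_requirements(int device_id) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, device_id) != cudaSuccess)
        return false;
    return prop.major >= 7;  // Volta (7.0) and newer
}
```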