@LucasBoTang
Collaborator

Summary

This PR introduces a CUDA-based implementation of the preconditioner module, covering Ruiz, Pock–Chambolle, and objective-bound scaling.

Main Changes

  • Replaced preconditioner.c with preconditioner.cu
  • Modified initialize_solver_state in solver.cu for GPU preconditioner integration

Implementation Details

  • The matrix is stored in CSR format, with an additional row ID array to enable efficient row-wise scaling (A[i,j] *= E[i]) without extra lookups.
  • Added an auxiliary array recording the mapping of each A element to its corresponding position in Aᵀ, enabling synchronized scaling of A and Aᵀ without atomics or additional CSR/CSC conversions.
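
The two auxiliary arrays described above can be sketched as a single kernel with one thread per nonzero. This is a minimal illustration, not the code in this PR: the names `scale_rows_a_and_at`, `scale_rows_host`, `row_ids`, `a_to_at`, `a_val`, and `at_val` are hypothetical, and only the row scaling A[i,j] *= E[i] is shown.

```cuda
#include <cuda_runtime.h>

// One thread per stored nonzero of A. All names here are hypothetical.
//   row_ids[k] : row index of the k-th nonzero (CSR row pointer, expanded)
//   a_to_at[k] : index of the same entry inside the value array of A^T
//   E[i]       : scaling factor for row i
__global__ void scale_rows_a_and_at(int nnz,
                                    const int *row_ids,
                                    const int *a_to_at,
                                    const double *E,
                                    double *a_val,
                                    double *at_val)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nnz) return;

    double v = a_val[k] * E[row_ids[k]];  // A[i,j] *= E[i], no row search
    a_val[k] = v;
    at_val[a_to_at[k]] = v;               // distinct slot per k: no atomics
}

// Host wrapper for illustration only: copy in, launch, copy out.
// Assumes nnz > 0 and m > 0; allocation error checks are omitted for brevity.
// Returns 0 on success, 1 if the launch or synchronization failed.
int scale_rows_host(int nnz, int m,
                    const int *row_ids, const int *a_to_at, const double *E,
                    double *a_val, double *at_val)
{
    int *d_row = nullptr, *d_map = nullptr;
    double *d_E = nullptr, *d_a = nullptr, *d_at = nullptr;
    size_t ib = nnz * sizeof(int), vb = nnz * sizeof(double);

    cudaMalloc((void **)&d_row, ib);
    cudaMalloc((void **)&d_map, ib);
    cudaMalloc((void **)&d_E, m * sizeof(double));
    cudaMalloc((void **)&d_a, vb);
    cudaMalloc((void **)&d_at, vb);
    cudaMemcpy(d_row, row_ids, ib, cudaMemcpyHostToDevice);
    cudaMemcpy(d_map, a_to_at, ib, cudaMemcpyHostToDevice);
    cudaMemcpy(d_E, E, m * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_a, a_val, vb, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (nnz + threads - 1) / threads;
    scale_rows_a_and_at<<<blocks, threads>>>(nnz, d_row, d_map, d_E, d_a, d_at);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    if (err == cudaSuccess) {
        cudaMemcpy(a_val, d_a, vb, cudaMemcpyDeviceToHost);
        cudaMemcpy(at_val, d_at, vb, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_row); cudaFree(d_map); cudaFree(d_E);
    cudaFree(d_a); cudaFree(d_at);
    return err == cudaSuccess ? 0 : 1;
}
```

Because `a_to_at` is a one-to-one mapping between the nonzeros of A and those of Aᵀ, no two threads write the same slot of `at_val`, which is why the mirror write needs no atomics. A column scaling could presumably be folded into the same kernel through the CSR column-index array.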

Next Step

  • Benchmark GPU vs CPU preconditioner runtime before merging

Note

reduce_bound_norm_sq_atomic currently relies on atomicAdd(double*) for the bound-norm reduction, which requires compute capability 6.0 or higher (CMAKE_CUDA_ARCHITECTURES ≥ 60).

Would it be preferable to:

  • Switch to a portable single-block shared-memory reduction (no atomics),
  • Redesign the reduction kernel, or
  • Keep the current implementation and require sm_60+?

@LucasBoTang
Collaborator Author

LucasBoTang commented Nov 16, 2025

This update fixes two issues in the GPU preconditioner:

  1. Corrected objective/bound rescaling: The previous GPU code applied the wrong scaling to the bounds and objective, which led to very long PDHG runs. All scaling is now applied correctly on the GPU.

  2. Improved reduce_bound_norm_sq_kernel: Replaced atomic accumulation with a shared-memory block reduction. This removes the overhead of the atomics and makes the reduction both consistent and fast.
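
For readers unfamiliar with the pattern, a single-block shared-memory reduction can be sketched as follows. This is an illustration under assumptions, not the kernel in this PR: the names `reduce_bound_norm_sq_sketch` and `bound_norm_sq_host` are hypothetical, and the quantity reduced (the sum of squares of all finite entries of a lower and an upper bound vector) is an assumed definition of the bound norm.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define REDUCE_THREADS 256  // must be a power of two for the tree reduction

// Launch with exactly one block of REDUCE_THREADS threads.
// Assumed definition: sum of squares of the finite entries of both bounds.
__global__ void reduce_bound_norm_sq_sketch(int n,
                                            const double *lower,
                                            const double *upper,
                                            double *out)
{
    __shared__ double partial[REDUCE_THREADS];

    // Each thread accumulates a strided slice of the vectors in a register.
    double acc = 0.0;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        double l = lower[i], u = upper[i];
        if (isfinite(l)) acc += l * l;
        if (isfinite(u)) acc += u * u;
    }
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) *out = partial[0];  // single writer, no atomicAdd
}

// Host wrapper for illustration only. Assumes n > 0; allocation error
// checks are omitted for brevity. Returns 0 on success, 1 on failure.
int bound_norm_sq_host(int n, const double *lower, const double *upper,
                       double *result)
{
    double *d_l = nullptr, *d_u = nullptr, *d_out = nullptr;
    size_t vb = n * sizeof(double);

    cudaMalloc((void **)&d_l, vb);
    cudaMalloc((void **)&d_u, vb);
    cudaMalloc((void **)&d_out, sizeof(double));
    cudaMemcpy(d_l, lower, vb, cudaMemcpyHostToDevice);
    cudaMemcpy(d_u, upper, vb, cudaMemcpyHostToDevice);

    reduce_bound_norm_sq_sketch<<<1, REDUCE_THREADS>>>(n, d_l, d_u, d_out);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        cudaMemcpy(result, d_out, sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(d_l); cudaFree(d_u); cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

With a fixed launch configuration the additions happen in the same order on every run, so the floating-point result does not vary the way an atomic accumulation across blocks can. The sketch also avoids atomicAdd(double*) entirely, at the cost of using only one block for the reduction.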
