
Invalid results of type 1 transform into (64, 64, 64) grid on A100 GPU #575

Open
pavel-shmakov opened this issue Oct 15, 2024 · 19 comments

@pavel-shmakov commented Oct 15, 2024

We've encountered an issue where cufinufft.nufft3d1 outputs wildly incorrect results for very specific inputs and only on certain GPUs. This can be reproduced by running the following code on an A100 GPU:

import torch
import cufinufft

points = torch.load("points.pt")
values = torch.load("values.pt")
spectrum = cufinufft.nufft3d1(
    *points,
    values,
    (64, 64, 64),
    isign=-1,
    eps=1e-06,
)
print(torch.linalg.norm(spectrum).item())

Here's an archive with points.pt and values.pt: inputs.zip

The printed norm is many orders of magnitude greater than it should be. It also grows quickly as eps decreases.
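
For context, one way to quantify "many orders of magnitude" is to compare against the CPU library. A minimal sketch, assuming the CPU finufft package is installed and that points.pt holds a (3, M) array (so that *points unpacks into x, y, z, as in the call above):

import numpy as np
import finufft  # CPU implementation, used here only as a reference

# Same inputs as above, moved to the host as NumPy arrays.
x, y, z = (p.cpu().numpy() for p in points)
c = values.cpu().numpy()

ref = finufft.nufft3d1(x, y, z, c, (64, 64, 64), isign=-1, eps=1e-6)
print(np.linalg.norm(ref))  # the GPU norm should be close to this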

Notes:

  • We reproduced this with both cufinufft 2.2.0 and 2.3.0.
  • Reproduced on an A100, but not on an A10G. We haven't tried other GPUs.
  • The blow-up happens for specific grid sizes: 61 through 64, while for 60, 65, and beyond the result goes back to normal. This is for float32 inputs; for float64 we saw a blow-up at grid size 32 (see the sweep sketch after this list).
  • We compiled cufinufft from source to investigate further, but surprisingly couldn't reproduce the bug. We tried compiling from master and v2.3.X with various compilation options. If you could point us to the compilation options with which the release version of libcufinufft.so is built, that would be helpful, and we could investigate further!
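
To map out which sizes blow up, a sweep along these lines can be used (a sketch reusing the points.pt and values.pt inputs from above; the range is just a window around the reported sizes):

import torch
import cufinufft

points = torch.load("points.pt")
values = torch.load("values.pt")

# On the affected A100 setup, grid sizes 61 through 64 should print norms
# that are many orders of magnitude too large, while 60 and 65+ look normal.
for n in range(58, 68):
    spectrum = cufinufft.nufft3d1(*points, values, (n, n, n), isign=-1, eps=1e-06)
    print(n, torch.linalg.norm(spectrum).item())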
@pavel-shmakov (Author)

Smaller reproducer with just one point:

import torch
import cufinufft

batch_size = 32
v = torch.tensor([[1] for _ in range(batch_size)], dtype=torch.complex64, device="cuda")
p = torch.tensor([[0], [0], [0]], dtype=torch.float32, device="cuda")
spectrum = cufinufft.nufft3d1(*p, v, (64, 64, 64), eps=1e-6)

The spectrum should be 1 everywhere, which it is for batch_size < 16. For batch_size >= 16 it starts misbehaving (see the sweep sketch below).
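
A quick sweep over batch_size makes the threshold visible; a sketch based on the same single-point setup:

import torch
import cufinufft

p = torch.tensor([[0], [0], [0]], dtype=torch.float32, device="cuda")  # one point at the origin
for batch_size in (8, 15, 16, 17, 32):
    v = torch.ones((batch_size, 1), dtype=torch.complex64, device="cuda")
    spectrum = cufinufft.nufft3d1(*p, v, (64, 64, 64), eps=1e-6)
    # Every mode should be 1 (up to rounding); print one entry per batch size.
    print(batch_size, spectrum[0, 0, 0, 0].item())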

@DiamonDinoia (Collaborator)

What happens if you use GM (global memory) instead of SM (shared memory)? See https://finufft.readthedocs.io/en/latest/c_gpu.html#options-for-gpu-code; gpu_method should be supported in Python too.

@pavel-shmakov (Author)

With gpu_method=1 we also get an incorrect, but very different, answer on the A100:

import torch
import cufinufft

batch_size = 32
n_modes = 64
points = torch.tensor([[0], [0], [0]], dtype=torch.float32, device="cuda")
values = torch.tensor([[1] for _ in range(batch_size)], dtype=torch.complex64, device="cuda")
for gpu_method in [1, 2]:
    spectrum = cufinufft.nufft3d1(*points, values, (n_modes, n_modes, n_modes), eps=1e-6, gpu_method=gpu_method)
    print(f"{gpu_method=}: {spectrum[0, 0, 0, 0].item()}")

A100:

gpu_method=1: (-4.974409603164531e-05-0.00036744182580150664j)
gpu_method=2: (2097152.5-0.16087542474269867j)

On a T4, everything is fine:

gpu_method=1: (1.000000238418579+0j)
gpu_method=2: (1.000000238418579+0j)

@DiamonDinoia (Collaborator)

@janden could you provide the command to do a debug build with pip? I've seen this type of error when using debug symbols. In my tests, if I compile with -G, nvcc generates an incorrect binary whose stack overflows while spreading but which does not crash; it just produces output that is wrong at some points.

@pavel-shmakov could you try a bigger eps? 1e-2 or 1e-3?

@pavel-shmakov (Author)

> @pavel-shmakov could you try a bigger eps? 1e-2 or 1e-3?

eps=0.1: (0.5681691765785217-1.674233078956604j)
eps=0.01: 0j
eps=0.001: (0.1657368689775467-0.00020104330906178802j)
eps=0.0001: (1.0010954141616821+0j)
eps=1e-05: (2097339.75+10.764257431030273j)
eps=1e-06: (2097152.5-0.16087542474269867j)
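
For reference, a loop along these lines produces the sweep above; a sketch based on the single-point reproducer (gpu_method=2 is assumed here, since the eps=1e-06 value matches the SM result reported earlier):

import torch
import cufinufft

p = torch.tensor([[0], [0], [0]], dtype=torch.float32, device="cuda")
v = torch.ones((32, 1), dtype=torch.complex64, device="cuda")
for eps in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
    spectrum = cufinufft.nufft3d1(*p, v, (64, 64, 64), eps=eps, gpu_method=2)
    print(f"eps={eps}: {spectrum[0, 0, 0, 0].item()}")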

@DiamonDinoia (Collaborator) commented Jan 21, 2025

@pavel-shmakov for the local compilation, which version of CUDA are you using?
We create the binary using this script: https://github.com/flatironinstitute/finufft/blob/master/tools/cufinufft/distribution_helper.sh

If we move to email, we could share binary wheels with different flags to narrow down the issue.

@pavel-shmakov (Author)

I'm using CUDA 12.3.

> If we move to email, we could share binary wheels with different flags to narrow down the issue.

Great, please feel free to reach out on [email protected]

@DiamonDinoia (Collaborator) commented Jan 22, 2025

I am able to reproduce the issue locally on my machine with an A6000 GPU:

gpu_method=1: (-0.0020656900014728308-0.0015528309158980846j)
gpu_method=2: (2097152.5-0.16087542474269867j)

I think the issue might be the nvcc version. If I build locally with "pip install cufinufft --no-binary cufinufft", the error goes away.

gpu_method=1: (1.000000238418579+0j)
gpu_method=2: (1.000000238418579+0j)

@DiamonDinoia (Collaborator) commented Jan 22, 2025

@janden we should investigate the compile flags we pass to pip. Can we test this with CUDA 11.2? Maybe it is a problem specific to 11.2; in that case, upgrading to 11.3 or newer might be the solution.

We could also ship cufinufft as source only: if one installs the NVIDIA runtime, nvcc is also present, so in principle users can compile it locally.

@janden (Collaborator) commented Jan 27, 2025

I've compiled the master branch for CUDA 11.3 and 11.4 here:

https://users.flatironinstitute.org/~janden/cufinufft-2.4.0dev0-cuda11.3/
https://users.flatironinstitute.org/~janden/cufinufft-2.4.0dev0-cuda11.4/

Let me know how these work out.

FWIW, I can't reproduce the bug above on my local machine with the published 2.3.1 binary wheels.


@DiamonDinoia (Collaborator)

I would call it either a CUDA bug or me forgetting some sort of synchronisation that might have been needed before; since I develop on the latest CUDA, the requirement could have been relaxed.

@janden for the next release can we upgrade to 11.4? It was released in 2021.

@janden (Collaborator) commented Jan 28, 2025

@pavel-shmakov I can confirm the same behavior on an FI machine (i.e., the bug reproduces with CUDA 11.3 and not with CUDA 11.4).

@DiamonDinoia That's fine with me. We're talking 2.4.0 here, right?

@DiamonDinoia (Collaborator)

Yes

@pavel-shmakov (Author)

> @janden for the next release can we upgrade to 11.4?

That would be great! Any chance of an even newer CUDA version? The CUDA release notes mention many improvements to cuFFT performance, for instance.

@DiamonDinoia (Collaborator)

I would recommend "pip install --no-binary cufinufft cufinufft" if you have a working CUDA setup.

@janden shall we follow torch and have "pip install cufinufft" pull CUDA 12.4 by default, using --index-url for wheels hosted at the foundation for people who want older CUDA support? https://pytorch.org/

@DiamonDinoia (Collaborator)

An alternative is linking cuFFT as a shared library, but in cufinufft the FFT time is negligible.

@mreineck (Collaborator)

My colleagues working with FFTs on GPUs mentioned to me that VkFFT (https://github.com/DTolm/VkFFT) is the way to go if high performance is required. But I agree that it should not have a big impact on overall NUFFT performance.

@DiamonDinoia (Collaborator)

I agree; this is something to consider when we target non-NVIDIA GPUs. It would not take much to use it, since we have the CMake facilities in place, but I'd expect problems with the complex data type and the different API naming.
