
Invalid results of type 1 transform into (64, 64, 64) grid on A100 GPU #575

Open
pavel-shmakov opened this issue Oct 15, 2024 · 19 comments

@pavel-shmakov commented Oct 15, 2024

We've encountered an issue where cufinufft.nufft3d1 outputs wildly incorrect results for very specific inputs and only on certain GPUs. This can be reproduced by running the following code on an A100 GPU:

import torch
import cufinufft

points = torch.load("points.pt")
values = torch.load("values.pt")
spectrum = cufinufft.nufft3d1(
    *points,
    values,
    (64, 64, 64),
    isign=-1,
    eps=1e-06,
)
print(torch.linalg.norm(spectrum).item())

Here's an archive with points.pt and values.pt: inputs.zip

The printed norm is many orders of magnitude greater than it should be. It also grows quickly as eps decreases.
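
For context, one way to quantify "many orders of magnitude" is to compare against the CPU library. A minimal sketch, assuming the CPU finufft package is installed and that points.pt holds a (3, M) array (so that *points unpacks into x, y, z, as in the call above):

import numpy as np
import finufft  # CPU implementation, used here only as a reference

# Same inputs as above, moved to the host as NumPy arrays.
x, y, z = (p.cpu().numpy() for p in points)
c = values.cpu().numpy()

ref = finufft.nufft3d1(x, y, z, c, (64, 64, 64), isign=-1, eps=1e-6)
print(np.linalg.norm(ref))  # the GPU norm should be close to this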

Notes:

  • We reproduced this with both cufinufft 2.2.0 and 2.3.0.
  • Reproduced on an A100, but not on an A10G. We haven't tried other GPUs.
  • The blow-up happens for specific grid sizes: 61 through 64, while for 60, 65, and beyond the result goes back to normal. This is for float32 inputs; for float64 we saw a blow-up at grid size 32 (see the sweep sketch after this list).
  • We compiled cufinufft from source to investigate further, but surprisingly couldn't reproduce the bug. We tried compiling from master and v2.3.X with various compilation options. If you could point us to the compilation options with which the release version of libcufinufft.so is built, that would be helpful, and we could investigate further!
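
To map out which sizes blow up, a sweep along these lines can be used (a sketch reusing the points.pt and values.pt inputs from above; the range is just a window around the reported sizes):

import torch
import cufinufft

points = torch.load("points.pt")
values = torch.load("values.pt")

# On the affected A100 setup, grid sizes 61 through 64 should print norms
# that are many orders of magnitude too large, while 60 and 65+ look normal.
for n in range(58, 68):
    spectrum = cufinufft.nufft3d1(*points, values, (n, n, n), isign=-1, eps=1e-06)
    print(n, torch.linalg.norm(spectrum).item())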
@pavel-shmakov (Author)

Smaller reproducer with just one point:

import torch
import cufinufft

batch_size = 32
v = torch.tensor([[1] for _ in range(batch_size)], dtype=torch.complex64, device="cuda")
p = torch.tensor([[0], [0], [0]], dtype=torch.float32, device="cuda")
spectrum = cufinufft.nufft3d1(*p, v, (64, 64, 64), eps=1e-6)

The spectrum should be 1 everywhere, which it is for batch_size < 16. For batch_size >= 16 it starts misbehaving (see the sweep sketch below).
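
A quick sweep over batch_size makes the threshold visible; a sketch based on the same single-point setup:

import torch
import cufinufft

p = torch.tensor([[0], [0], [0]], dtype=torch.float32, device="cuda")  # one point at the origin
for batch_size in (8, 15, 16, 17, 32):
    v = torch.ones((batch_size, 1), dtype=torch.complex64, device="cuda")
    spectrum = cufinufft.nufft3d1(*p, v, (64, 64, 64), eps=1e-6)
    # Every mode should be 1 (up to rounding); print one entry per batch size.
    print(batch_size, spectrum[0, 0, 0, 0].item())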

@DiamonDinoia (Collaborator)

What happens if you use GM (global memory) instead of SM (shared memory)? See https://finufft.readthedocs.io/en/latest/c_gpu.html#options-for-gpu-code; gpu_method should be supported in Python too.

@pavel-shmakov (Author)

With gpu_method=1 we also get an incorrect, but very different, answer on the A100:

import torch
import cufinufft

batch_size = 32
n_modes = 64
points = torch.tensor([[0], [0], [0]], dtype=torch.float32, device="cuda")
values = torch.tensor([[1] for _ in range(batch_size)], dtype=torch.complex64, device="cuda")
for gpu_method in [1, 2]:
    spectrum = cufinufft.nufft3d1(*points, values, (n_modes, n_modes, n_modes), eps=1e-6, gpu_method=gpu_method)
    print(f"{gpu_method=}: {spectrum[0, 0, 0, 0].item()}")

A100:

gpu_method=1: (-4.974409603164531e-05-0.00036744182580150664j)
gpu_method=2: (2097152.5-0.16087542474269867j)

On a T4, everything is fine:

gpu_method=1: (1.000000238418579+0j)
gpu_method=2: (1.000000238418579+0j)

@DiamonDinoia (Collaborator)

@janden could you provide the command to do a debug build with pip? I've seen this type of error when using debug symbols. In my tests, if I compile with -G, nvcc generates an incorrect binary whose stack overflows while spreading but which does not crash; it just produces output that is wrong at some points.

@pavel-shmakov could you try a bigger eps? 1e-2 or 1e-3?

@pavel-shmakov (Author)

> @pavel-shmakov could you try a bigger eps? 1e-2 or 1e-3?

eps=0.1: (0.5681691765785217-1.674233078956604j)
eps=0.01: 0j
eps=0.001: (0.1657368689775467-0.00020104330906178802j)
eps=0.0001: (1.0010954141616821+0j)
eps=1e-05: (2097339.75+10.764257431030273j)
eps=1e-06: (2097152.5-0.16087542474269867j)
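
For reference, a loop along these lines produces the sweep above; a sketch based on the single-point reproducer (gpu_method=2 is assumed here, since the eps=1e-06 value matches the SM result reported earlier):

import torch
import cufinufft

p = torch.tensor([[0], [0], [0]], dtype=torch.float32, device="cuda")
v = torch.ones((32, 1), dtype=torch.complex64, device="cuda")
for eps in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
    spectrum = cufinufft.nufft3d1(*p, v, (64, 64, 64), eps=eps, gpu_method=2)
    print(f"eps={eps}: {spectrum[0, 0, 0, 0].item()}")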

@DiamonDinoia (Collaborator) commented Jan 21, 2025

@pavel-shmakov for the local compilation, which version of CUDA are you using?
We create the binary using this script: https://github.com/flatironinstitute/finufft/blob/master/tools/cufinufft/distribution_helper.sh

If we move to email, we could share binary wheels with different flags to narrow down the issue.

@pavel-shmakov (Author)

I'm using CUDA 12.3.

> If we move to email, we could share binary wheels with different flags to narrow down the issue.

Great, please feel free to reach out on [email protected]

@DiamonDinoia (Collaborator) commented Jan 22, 2025

I am able to reproduce the issue locally on my machine with an A6000 GPU:

gpu_method=1: (-0.0020656900014728308-0.0015528309158980846j)
gpu_method=2: (2097152.5-0.16087542474269867j)

I think the issue might be the nvcc version. If I build locally with "pip install cufinufft --no-binary cufinufft", the error goes away.

gpu_method=1: (1.000000238418579+0j)
gpu_method=2: (1.000000238418579+0j)

@DiamonDinoia (Collaborator) commented Jan 22, 2025

@janden we should investigate the compile flags we pass to pip. Can we test this with CUDA 11.2? Maybe it is a problem specific to 11.2; in that case, upgrading to 11.3 or newer might be the solution.

We could also ship cufinufft as source only: if one installs the NVIDIA runtime, nvcc is also present, so in principle users can compile it locally.

@janden (Collaborator) commented Jan 27, 2025

I've compiled the master branch for CUDA 11.3 and 11.4 here:

https://users.flatironinstitute.org/~janden/cufinufft-2.4.0dev0-cuda11.3/
https://users.flatironinstitute.org/~janden/cufinufft-2.4.0dev0-cuda11.4/

Let me know how these work out.

FWIW, I can't reproduce the bug above on my local machine with the published 2.3.1 binary wheels.


@DiamonDinoia (Collaborator)

I would call it either a CUDA bug or me forgetting some sort of synchronisation that might have been needed before; since I develop on the latest CUDA, the requirement could have been relaxed.

@janden for the next release can we upgrade to 11.4? It was released in 2021.

@janden (Collaborator) commented Jan 28, 2025

@pavel-shmakov I can confirm the same behavior on an FI machine (i.e., the bug reproduces with CUDA 11.3 and not with CUDA 11.4).

@DiamonDinoia That's fine with me. We're talking 2.4.0 here, right?

@DiamonDinoia (Collaborator)

Yes

@pavel-shmakov (Author)

> @janden for the next release can we upgrade to 11.4?

That would be great! Any chance of an even newer CUDA version? The CUDA release notes mention many improvements to cuFFT performance, for instance.

@DiamonDinoia (Collaborator)

I would recommend "pip install --no-binary cufinufft cufinufft" if you have a working CUDA setup.

@janden shall we follow torch and have "pip install cufinufft" pull CUDA 12.4 by default, using --index-url for wheels hosted at the foundation for people who want older CUDA support? https://pytorch.org/

@DiamonDinoia (Collaborator)

An alternative is linking cuFFT as a shared library, but in cufinufft the FFT time is negligible.

@mreineck (Collaborator)

My colleagues working with FFTs on GPUs mentioned to me that VkFFT (https://github.com/DTolm/VkFFT) is the way to go if high performance is required. But I agree that it should not have a big impact on overall NUFFT performance.

@DiamonDinoia (Collaborator)

I agree; this is something to consider when we target non-NVIDIA GPUs. It would not take much to use it, since we have the CMake facilities in place, but I'd expect problems with the complex data type and the different API naming.
