[BUG] Enabling NRT slows down compilation unacceptably #203

Open
gmarkall opened this issue Apr 14, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@gmarkall
Collaborator

Describe the bug

Noticed whilst reviewing #167: I enabled NRT with the environment variable NUMBA_CUDA_ENABLE_NRT=1 and gave up waiting for the test suite after 10 minutes (it usually completes in under 2 minutes on my machine).

Steps/Code to reproduce bug

Run:

NUMBA_CUDA_ENABLE_NRT=1 python -m numba.runtests numba.cuda.tests -v -m

and be prepared to wait a long time.
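
To put a number on the slowdown, the same command can be timed with and without the environment variable. The following is a minimal sketch; `time_command` and `TEST_CMD` are hypothetical names introduced here for illustration, not part of Numba or numba-cuda, and the script assumes the test suite is runnable from the current environment.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, extra_env=None):
    """Run cmd with optional extra environment variables.

    Returns (elapsed_seconds, return_code). Hypothetical helper.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    result = subprocess.run(cmd, env=env)
    return time.perf_counter() - start, result.returncode


# The test command from the reproduction steps above.
TEST_CMD = [sys.executable, "-m", "numba.runtests", "numba.cuda.tests", "-v", "-m"]
```

Calling `time_command(TEST_CMD)` and then `time_command(TEST_CMD, {"NUMBA_CUDA_ENABLE_NRT": "1"})` gives the two wall-clock times to compare.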

Expected behavior

For the test suite to complete in approximately the same amount of time as with NRT disabled.

Environment details (please complete the following information):

Numba-cuda main.

Additional context

There are two problems:

  • I think we are linking in NRT more often than we need to. I haven't yet looked deeper, but I see it being added for ufunc tests, which I don't think should require it.
  • The NRT is added as a path, so the driver has to read it from disk every time. We should instead read it into a CUSource object on first use and link that, so that we only read it once and cut disk I/O out of the compilation path.
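
The read-once idea in the second point could look roughly like the sketch below. It is an illustration under assumptions, not the numba-cuda implementation: `InMemorySource` is a hypothetical stand-in for a linkable in-memory source object such as CUSource, and `load_source_once` is a hypothetical helper name.

```python
import functools
import tempfile
from pathlib import Path


class InMemorySource:
    """Hypothetical stand-in for a linkable in-memory source object."""

    def __init__(self, data: bytes, name: str):
        self.data = data
        self.name = name


@functools.lru_cache(maxsize=None)
def load_source_once(path: str) -> InMemorySource:
    """Read the file on first use; later calls return the cached object."""
    p = Path(path)
    return InMemorySource(p.read_bytes(), p.name)


# Demonstration with a placeholder file (not the real NRT source).
with tempfile.TemporaryDirectory() as tmp:
    src_path = Path(tmp) / "nrt_stub.cu"
    src_path.write_bytes(b"// placeholder source\n")
    first = load_source_once(str(src_path))
    src_path.unlink()  # the file is gone, but the cached copy remains
    second = load_source_once(str(src_path))
```

The second call succeeds even though the file has been deleted, which shows that no disk read happens after the first use.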
gmarkall added the bug label Apr 14, 2025
@gmarkall
Collaborator Author

> I think we are linking in NRT more than we need to be

With NRT enabled, we now incref and decref every array that gets passed in. We need to add the refcount pruning pass so that unnecessary refcount operations don't degrade the performance of every kernel that uses arrays.

@gmarkall
Collaborator Author

> We need to add in the refcount pruning pass

I just noticed that Numba's built-in refcount pruning pass is running, but a lot of functions fail to satisfy its preconditions:

https://github.com/numba/numba/blob/1ba9c54e395e611bbe8d2dbe7a726488f5a75fbd/numba/core/removerefctpass.py#L99-L114

Remove unnecessary NRT incref/decref in the given LLVM function. It uses highlevel type info to determine if the function does not need NRT. Such a function does not:

  • return array object(s);
  • take arguments that need refcounting except array;
  • call function(s) that return refcounted object.

In effect, the function will not capture or create references that extend
the lifetime of any refcounted objects beyond the lifetime of the function.

The rewrite is performed in place.
If rewrite has happened, this function returns True, otherwise, it returns False.

The problem is the "call function(s) that return refcounted object" condition: a lot of functions do this if they return an array that was passed in to them (which was always legal in the CUDA target, even without NRT).
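
To make the failing precondition concrete, here is a toy model of the three conditions from the docstring. It is not Numba's implementation, which works on the LLVM function and its type information; `FuncSummary` and `can_prune_refcounts` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FuncSummary:
    """Hypothetical, simplified description of a compiled function."""

    returns_array: bool = False
    non_array_refcounted_args: int = 0
    callees_returning_refcounted: list = field(default_factory=list)


def can_prune_refcounts(summary: FuncSummary) -> bool:
    """Toy version of the docstring's three preconditions."""
    return (
        not summary.returns_array
        and summary.non_array_refcounted_args == 0
        and not summary.callees_returning_refcounted
    )


# A kernel that only takes arrays and calls nothing refcount-returning.
elementwise_kernel = FuncSummary()

# A kernel calling a device function that returns the array it was given.
kernel_calling_passthrough = FuncSummary(
    callees_returning_refcounted=["passthrough"]
)
```

In this model the first kernel qualifies for pruning and the second does not, which mirrors the case described above where a callee hands back an array it received.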
