[BUG] Enabling NRT slows down compilation unacceptably #203

gmarkall · 2025-04-14T10:29:02Z

Describe the bug

Noticed whilst reviewing #167: I enabled NRT with the environment variable NUMBA_CUDA_ENABLE_NRT=1 and gave up waiting for the test suite after 10 minutes (it usually completes in under 2 minutes on my machine).

Steps/Code to reproduce bug

Run:

NUMBA_CUDA_ENABLE_NRT=1 python -m numba.runtests numba.cuda.tests -v -m

and be prepared to wait a long time.

Expected behavior

For the test suite to complete in approximately the same amount of time as with NRT disabled.

Environment details (please complete the following information):

Numba-cuda main.

Additional context

There are two problems:

1. I think we are linking in NRT more often than we need to - I haven't yet looked deeper, but I see it getting added for ufunc tests, which I don't think should require it.
2. The NRT is added as a path, so it has to be read from disk by the driver every time. We should switch to reading it in on first use into a CUSource object and linking that, so we only read it once and cut disk I/O out of the compilation path.
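The "read once, reuse afterwards" idea in the second point can be sketched roughly as below. This is only an illustration: `load_source_once` is a hypothetical helper, not part of numba-cuda, and the step of wrapping the cached text in a CUSource object and handing it to the linker is deliberately left out.

```python
import functools
from pathlib import Path


@functools.lru_cache(maxsize=None)
def load_source_once(path):
    """Read a source file from disk the first time it is requested.

    Hypothetical helper: later calls with the same path return the
    cached text, so the file is not reopened on every compilation.
    """
    return Path(path).read_text()
```

With something like this, only the first compilation that needs the NRT source pays for the disk read; every later link step would reuse the in-memory copy.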


gmarkall · 2025-04-14T10:35:38Z

> I think we are linking in NRT more than we need to be

With NRT enabled we now incref / decref every array that gets passed in. We need to add the refcount pruning pass so that unnecessary refcount operations don't degrade the performance of every kernel that uses arrays.

gmarkall · 2025-04-14T10:48:58Z

> We need to add in the refcount pruning pass

I just noticed that Numba's built-in refcount pruning pass is running, but a lot of functions fail to satisfy its preconditions:

https://github.com/numba/numba/blob/1ba9c54e395e611bbe8d2dbe7a726488f5a75fbd/numba/core/removerefctpass.py#L99-L114

Remove unnecessary NRT incref/decref in the given LLVM function. It uses highlevel type info to determine if the function does not need NRT. Such a function does not:

- return array object(s);
- take arguments that need refcounting except array;
- call function(s) that return refcounted object.

In effect, the function will not capture or create references that extend
the lifetime of any refcounted objects beyond the lifetime of the function.

The rewrite is performed in place.
If rewrite has happened, this function returns True, otherwise, it returns False.

The problem is the third condition, "call function(s) that return refcounted object": a lot of functions will do this if they return an array that was passed in to them (which was always legal in the CUDA target, even without NRT).
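To make the failure mode concrete, here is a toy model of the three preconditions from the docstring. It is not Numba's implementation (the real pass works on LLVM functions and high-level type info); `FunctionInfo` and `can_prune_refcounts` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionInfo:
    """Hypothetical summary of a function's high-level type information."""
    returns_array: bool = False
    # Number of arguments that need refcounting and are not arrays.
    refcounted_non_array_args: int = 0
    # One flag per callee: True if that callee returns a refcounted object.
    callee_returns_refcounted: list = field(default_factory=list)


def can_prune_refcounts(info):
    """Return True only if all three docstring preconditions hold."""
    if info.returns_array:
        return False
    if info.refcounted_non_array_args > 0:
        return False
    if any(info.callee_returns_refcounted):
        return False
    return True


# A kernel that only takes arrays and calls nothing refcounted qualifies.
simple_kernel = FunctionInfo()

# A kernel calling a device function that hands back its array argument
# fails the third precondition, so its increfs / decrefs are kept.
kernel_with_passthrough_call = FunctionInfo(callee_returns_refcounted=[True])
```

Under this model, a single callee that returns an array is enough to block pruning for the whole calling function, which matches the behaviour described above.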

gmarkall added the "bug" label on Apr 14, 2025
