Switch to use CUDA driver APIs in `Device` constructor #460

Open

leofang wants to merge 8 commits into NVIDIA:main from leofang:reduce_cudart

Member

leofang commented Feb 21, 2025 (edited)

~~Blocked by #459 & #439 (comment).~~

Before this PR:

In [1]: %timeit Device()
658 ns ± 1.11 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

With this PR:

In [1]: %timeit Device()
412 ns ± 2.22 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

(Bindings are built from the main branch.)

leofang added 5 commits

February 21, 2025 00:16


          cache cc to speed it up

2afcb20


          avoid using cudart APIs in Device constructor

87405ad


          avoid silly, redundant lock

95777c4


          Merge branch 'main' into cache_cc

4cfd505


          Merge branch 'cache_cc' into reduce_cudart

Contributor

copy-pr-bot bot commented Feb 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

leofang self-assigned this

leofang added the blocked label

leofang added enhancement P1 cuda.core and removed blocked labels

leofang added this to the cuda.core beta 4 milestone


          Merge branch 'main' into reduce_cudart

7f11565

leofang changed the title ~~WIP: Switch to use CUDA driver APIs in Device constructor~~ Switch to use CUDA driver APIs in Device constructor

Member Author

leofang commented Apr 6, 2025

/ok to test

github-actions bot commented Apr 6, 2025

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-460/
https://nvidia.github.io/cuda-python/pr-preview/pr-460/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-460/cuda-bindings/
Preview will be ready when the GitHub Pages deployment is complete.

leofang requested review from rwgk and ksimpson-work

April 7, 2025 17:39

leofang marked this pull request as ready for review

April 7, 2025 17:39

ksimpson-work reviewed

cuda_core/cuda/core/experimental/_device.py (resolved)

rwgk reviewed

cuda_core/cuda/core/experimental/_device.py (outdated, resolved)

cuda_core/cuda/core/experimental/_device.py (resolved)

leofang marked this pull request as draft

April 7, 2025 22:19

leofang mentioned this pull request

[FEA]: Faster initialization time for cuda.core abstractions #658

Open

1 task

leofang added 2 commits

May 24, 2025 00:56


          Merge branch 'main' into reduce_cudart


          minor perf opt: try-except + skip assert

c9fac0b

leofang marked this pull request as ready for review

May 24, 2025 02:16

Member Author

leofang commented May 24, 2025

/ok to test c9fac0b

Member Author

leofang commented May 28, 2025

This is ready.

rwgk approved these changes

cuda_core/cuda/core/experimental/_device.py

-                          total = handle_return(runtime.cudaGetDeviceCount())
-                          assert_type(device_id, int)
-                          if not (0 <= device_id < total):
+                          total = handle_return(driver.cuDeviceGetCount())

Collaborator

rwgk May 28, 2025

Assuming that the happy path is common, this driver.cuDeviceGetCount() call seems redundant.

Also assuming it's not actually worth the cycles checking for isinstance, we could replace the else block here with:

        elif device_id < 0:
            raise ValueError(f"device_id must be >= 0, got {device_id!r}")

Then below (new line 998) we could do this:

        try:
            return devices[device_id]
        except IndexError:
            raise ValueError(f"device_id must be within [0, {len(devices)}), got {device_id!r}")

WDYT?
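
A minimal sketch of the suggestion above, using a plain list as a stand-in for the cached per-thread device list. `get_device` and the string entries are hypothetical names for illustration and are not part of cuda.core. The explicit negative check matters because Python lists accept negative indices, so `devices[-1]` would otherwise succeed silently.

```python
def get_device(devices, device_id):
    """Return devices[device_id], validating only on the failure path."""
    # Negative indices are valid for Python lists, so reject them up front.
    if device_id < 0:
        raise ValueError(f"device_id must be >= 0, got {device_id!r}")
    try:
        # Happy path: a single list lookup, no device-count query.
        return devices[device_id]
    except IndexError:
        # Only the failure path pays for building the range message.
        raise ValueError(
            f"device_id must be within [0, {len(devices)}), got {device_id!r}"
        ) from None


# Stand-in for the cached device objects (hypothetical).
devices = ["dev0", "dev1"]
```

With this shape, a valid ID costs one comparison and one list lookup, and the device count is only consulted (via `len`) when building the error message.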

kkraus14 reviewed

cuda_core/cuda/core/experimental/_device.py

Comment on lines +963 to +964

        else:
            ctx = handle_return(driver.cuCtxGetCurrent())

Collaborator

kkraus14 May 29, 2025

Is there a specific error code or set of error codes we should be handling here? If the above driver.cuCtxGetDevice() call returns an error that we don't expect we should probably raise it as an exception instead of it propagating to the driver.cuCtxGetCurrent() call?
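
One way to read this question is sketched below: treat a single, expected error code as "no current context" and raise on everything else. This is a sketch under assumptions, not the PR's code. `Result`, `DriverError`, and `current_device_or_none` are hypothetical stand-ins, the enum values are placeholders, and the choice of an invalid-context error as the expected case is an assumption that should be checked against the driver API documentation.

```python
import enum


class Result(enum.IntEnum):
    # Stand-in for the driver's result enum; values are placeholders.
    SUCCESS = 0
    ERROR_INVALID_CONTEXT = 1
    ERROR_NOT_INITIALIZED = 2


class DriverError(RuntimeError):
    """Raised for any driver error the caller did not anticipate."""


def current_device_or_none(get_device):
    """Call get_device, a hypothetical callable returning (Result, device_id)."""
    err, dev = get_device()
    if err == Result.SUCCESS:
        return dev
    if err == Result.ERROR_INVALID_CONTEXT:
        # Assumed expected case: no context is current on this thread,
        # so the caller can fall back to a default device.
        return None
    # Anything else is unexpected and should surface immediately.
    raise DriverError(f"unexpected driver error: {err.name}")
```

The point of the split is that only the anticipated code leads to the fallback branch; any other failure is reported at the call that produced it rather than at a later driver call.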

cuda_core/cuda/core/experimental/_device.py

+                      try:
+                          devices = _tls.devices
+                      except AttributeError:
+                          total = handle_return(driver.cuDeviceGetCount())

Collaborator

kkraus14 May 29, 2025

I think we can reuse total that was already calculated above?

cuda_core/cuda/core/experimental/_device.py

+                          devices = _tls.devices
+                      except AttributeError:
+                          total = handle_return(driver.cuDeviceGetCount())
+                          devices = _tls.devices = []
                           for dev_id in range(total):

Collaborator

kkraus14 May 29, 2025

If someone tries to create a Device with a specific ID, why do we need to initialize all of the devices at that point? Instead of calling driver.cuDeviceGetCount() to get the number of devices to initialize all of _tls.devices, could we use driver.cuDeviceGet and lazily populate _tls.devices as devices are created?

Collaborator

rwgk May 29, 2025

We had discussions about this code back in March. @leofang wrote here:

Each thread always has its own copy of _tls.devices.

I'm still unclear though TBH: Is that why we cannot lazily populate?

Labels

cuda.core enhancement P1