Can not find device after 565+ on GH200 NVL2 #774

Open
1 of 2 tasks
zilingzhang opened this issue Feb 2, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@zilingzhang

NVIDIA Open GPU Kernel Modules Version

nvidia-driver-565-open(565.57.01), nvidia-driver-570-open (570.86.15)

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 24.04.1 LTS

Kernel Release

Linux 6.11.0-1002-nvidia-64k #2-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 23 19:17:25 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GH200 144G HBM3e | GPU 1: NVIDIA GH200 144G HBM3e

Describe the bug

I installed Ubuntu 24.04 on a new GH200 NVL2 system and used the NVIDIA open kernel driver from apt.
Neither GPU is found with the nvidia-driver-565-open or nvidia-driver-570-open apt package.
Both GPUs are found with nvidia-driver-560-open.

To Reproduce

sudo apt install nvidia-driver-565-open
nvidia-smi
No devices were found

sudo apt install nvidia-driver-570-open
nvidia-smi
No devices were found

sudo apt install nvidia-driver-560-open
nvidia-smi
both GPUs found.

Both GPUs are at VBIOS 96.00.A0.00.01,
which should be newer than the 96.00.68.00.xx mentioned in the known issues:
https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-565-57-01/index.html#known-issues
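
The "newer than" claim above can be checked mechanically. The following is a minimal sketch, assuming each dotted VBIOS field is hexadecimal and that fields compare left to right; neither assumption is confirmed by the release notes, and the helper names `parse_vbios` and `vbios_at_least` are hypothetical.

```python
def parse_vbios(version: str) -> list:
    """Split a dotted VBIOS string into ints; placeholders like 'xx' become None."""
    fields = []
    for part in version.strip().split("."):
        try:
            # Assumption: each field is a hexadecimal byte (e.g. "A0" == 160).
            fields.append(int(part, 16))
        except ValueError:
            fields.append(None)  # wildcard such as "xx"
    return fields


def vbios_at_least(installed: str, minimum: str) -> bool:
    """Compare field by field, left to right, stopping at the first wildcard."""
    for have, need in zip(parse_vbios(installed), parse_vbios(minimum)):
        if have is None or need is None:
            break
        if have != need:
            return have > need
    return True


# Versions quoted in this issue.
print(vbios_at_least("96.00.A0.00.01", "96.00.68.00.xx"))
```

Under these assumptions the third field decides the comparison (0xA0 against 0x68), so the installed VBIOS counts as newer.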

nvidia-bug-report.log.gz

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

@zilingzhang zilingzhang added the bug Something isn't working label Feb 2, 2025
@aritger
Collaborator

aritger commented Feb 2, 2025

Could you also generate an nvidia-bug-report.log.gz for 560? It may help to compare logs between working and failing configurations.

@zilingzhang
Author

nvidia-bug-report.log.gz
This is from 560.35.05, but the collection hung. Let me know if it's good enough for diagnosis.

@aritger
Collaborator

aritger commented Feb 3, 2025

If I'm reading the log correctly, 560.35.05 has the same symptom as 570.86.15:

[    6.488488] kernel: NVRM: numa memblock size of zero found during device start
[    6.488492] kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00090100] Failed to allocate NvKmsKapiDevice
[    6.504217] kernel: [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00090100] Failed to register device
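
For comparing several logs, these three messages can be searched for automatically. This is a rough sketch, assuming the kernel log has already been extracted to text; the patterns are copied from the lines quoted above, while the names `SIGNATURES` and `scan_kernel_log` are hypothetical and not part of any NVIDIA tool.

```python
import re

# Patterns taken from the kernel messages quoted in this thread.
SIGNATURES = {
    "numa_memblock_zero": re.compile(r"NVRM: numa memblock size of zero found"),
    "kapi_alloc_failed": re.compile(r"Failed to allocate NvKmsKapiDevice"),
    "drm_register_failed": re.compile(r"\[nvidia-drm\].*Failed to register device"),
}


def scan_kernel_log(text: str) -> dict:
    """Return, per signature name, the log lines that match it."""
    hits = {name: [] for name in SIGNATURES}
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits


sample = """\
[    6.488488] kernel: NVRM: numa memblock size of zero found during device start
[    6.488492] kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00090100] Failed to allocate NvKmsKapiDevice
[    6.504217] kernel: [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00090100] Failed to register device
"""

for name, lines in scan_kernel_log(sample).items():
    print(name, len(lines))
```

Running the same scan over the logs from a working and a failing driver version would show whether the symptom is shared.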

Once the GPU gets into this state, I think the problem will persist until a reboot. Could I trouble you to do the following for each of 560.35.05 and 570.86.15?

  • install driver
  • reboot
  • run nvidia-smi
  • run nvidia-bug-report.sh
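
The `nvidia-smi` step in the list above can be recorded consistently across driver versions with a small check. This is only a sketch: it assumes the failure always prints the "No devices were found" text quoted earlier in this issue, and the helper names are hypothetical.

```python
import subprocess

# Message quoted in this issue for the failing driver versions.
NO_DEVICES_MESSAGE = "No devices were found"


def reports_no_devices(smi_output: str) -> bool:
    """True if the captured nvidia-smi output contains the failure message."""
    return NO_DEVICES_MESSAGE in smi_output


def run_nvidia_smi() -> str:
    """Run nvidia-smi and return stdout and stderr combined.

    Defined for use on the affected machine after a reboot; not called here.
    """
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr
```

On the affected system, `reports_no_devices(run_nvidia_smi())` could be logged once per driver version before running nvidia-bug-report.sh.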

@zilingzhang
Author

I did reboot between installs; otherwise nvidia-smi would show a driver mismatch.
Currently I'm running 560.35.05 (New bug report); the GPUs show up in nvidia-smi and are serving a Llama model correctly.

I will run the sequence again for both versions tonight during downtime and get back to you. Thanks!

@zilingzhang
Author

  • Installed 570.86.15
  • reboot
  • run nvidia-smi -> No devices were found
  • run nvidia-bug-report.sh

570.nvidia-bug-report.log.gz

  • Installed 565.57.01
  • reboot
  • run nvidia-smi -> No devices were found
  • run nvidia-bug-report.sh

565.nvidia-bug-report.log.gz

  • Installed 560
  • reboot
  • run nvidia-smi -> both GPUs were found
  • run nvidia-bug-report.sh

560.nvidia-bug-report.log.gz

2 participants