Random boot fail to initialize with timeout or reset required #709

Closed
2 tasks
yamboo-efi opened this issue Sep 22, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@yamboo-efi

NVIDIA Open GPU Kernel Modules Version

560.35.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Gentoo Linux

Kernel Release

6.6.52

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4070 Ti SUPER

Describe the bug

In general, the NVIDIA driver fails to load properly: sometimes with a "RESET REQUIRED" error, other times with a timeout while communicating with the GPU.

I tried other kernels (6.8 and 6.11) with the same results, and also the proprietary driver with and without GSP firmware loading. In all cases the result is the same.

It's strange and I cannot isolate the root cause, so I will try to explain the case.

I changed the motherboard and CPU, from a Gigabyte X570 and Ryzen 3700X to a Gigabyte X670 and Ryzen 9700X. The GPU is the same, a 4070 Ti SUPER. Previously the system worked fine with the same OS and the 560.35.03 driver.

After the hardware upgrade, I cannot boot reliably: most of the time GPU initialization fails. I did a lot of trial and error (different kernel and NVIDIA driver configurations) with the same result.

The driver produces two types of errors (I don't know why): one is a GPU communication timeout and the other says a reset is required.

I tried the closed-source driver (same issues) and also disabling GSP firmware loading.

The most intriguing thing is that I can boot properly after some tries, generally 2 or 3, combining soft resets with cold boots. When I get a "RESET REQUIRED" error I do a cold boot, and when I get a "TIMEOUT" I do a soft reset.
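The workaround above can be sketched as a small helper. This is only an illustration of the decision rule, not a diagnostic tool: the function name `suggest_recovery` and the matched substrings are assumptions made for the example, and the exact wording of the driver's kernel log messages may differ.

```python
from typing import Optional

# Hypothetical markers: substrings assumed to appear in the failing boot log.
# The real driver messages may be worded differently.
RESET_MARKER = "reset required"
TIMEOUT_MARKER = "timeout"


def suggest_recovery(log_text: str) -> Optional[str]:
    """Map the failure seen in a boot log to the recovery step from the report.

    Returns "cold boot" for a reset-required failure, "soft reset" for a
    timeout, and None when neither marker is found.
    """
    lowered = log_text.lower()
    # If both markers appear, the reset-required case wins (arbitrary choice).
    if RESET_MARKER in lowered:
        return "cold boot"
    if TIMEOUT_MARKER in lowered:
        return "soft reset"
    return None


if __name__ == "__main__":
    # Made-up sample lines, not real driver output.
    for sample in ("gpu: RESET REQUIRED", "gpu: TIMEOUT waiting for reply", "ok"):
        print(sample, "->", suggest_recovery(sample))
```

Feeding it the text of the kernel log from a failed boot would return the step that has usually worked on this machine.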

Once I get a proper init, everything works fine.

I think there is an issue with the initialization order or logic in combination with AGESA.

The most stable configuration (it also needs some reboots to work) is the closed-source driver with GSP firmware disabled.

If you need more details or logs, please tell me what you need.

P.S.: I am posting here because, as I understand it, the open-source driver is the default option in the 560 driver.

To Reproduce

On almost every boot

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

At least a consistent boot, but a full fix would be great.

@yamboo-efi yamboo-efi added the bug Something isn't working label Sep 22, 2024
@ptr1337

ptr1337 commented Sep 22, 2024

I think some AM5 boards had problems with the 40xx generation on old BIOS versions. Did you try upgrading to the latest available BIOS?

On my machine (9950X, ASRock X670E, 4070 Super) I do not see such issues.

@yamboo-efi
Author

I'm using the latest BIOS and also tried older BIOS versions, but with the same result. I also tried changing BIOS settings (disabling TPM, TSME, and other things).

As a last resort I plugged in an RTX 3060, and it works fine. I decided to try another GPU because, after installing Windows 10 (using the latest drivers), the system boots but shows poor performance and freezes, and it also becomes stable after some random reboots. With the RTX 3060 both OSes work fine, as expected.

I suspect that the GPU could be faulty, but I'm not sure, because once it boots everything runs fine: I can play games (loading the GPU at 60-70) for more than 4 hours without any issue.

I also read that a faulty 12VHPWR connection could cause these issues, so I double-checked the connection and tried another cable. Same result. The power supply is sufficient (1000W) and ATX 5.0 certified.

My last suspicion is that there could be some incompatibility between the motherboard and the GPU, but I'm not sure. I find it strange because the motherboard and GPU are from the same manufacturer and are not very new hardware.

Now, honestly, I'm not sure whether this is a driver bug, faulty hardware, or some sort of hardware/firmware incompatibility. I'm continuing to investigate.

@yamboo-efi
Author

I'm in conversations with the manufacturer because I'm quite sure that a hardware incompatibility between the GPU and the motherboard is causing improper initialization of the GPU internals.

The issue also occurs in Windows with the latest drivers: it sometimes takes a long time to boot, and performance after booting is very, very poor.

The GPU is fine: I tested it on another system and checked it with different tools (including NVIDIA MODS/MATS tools) without any error. I separately tested the motherboard and CPU with another GPU, and everything works as expected on Linux.

So I will close this issue because it isn't related to a driver bug.

@yamboo-efi yamboo-efi closed this as not planned Oct 7, 2024
2 participants