CUDNN error at training iteration 1000 when calculating mAP% #8669
Comments
Downgraded CUDNN from 8.5.0 back to 8.4.1.50. Training works again. This is the command I used to downgrade:
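On Ubuntu this kind of downgrade is typically an apt version pin along the following lines. The package names match the NVIDIA apt repository; the exact version suffix (the `-1+cuda11.x` part) is an assumption here and should be matched to your installed CUDA version:

```sh
# Pin cuDNN back to the 8.4.1.50 build, then hold it so apt does not upgrade it again
sudo apt-get install libcudnn8=8.4.1.50-1+cuda11.6 libcudnn8-dev=8.4.1.50-1+cuda11.6
sudo apt-mark hold libcudnn8 libcudnn8-dev
```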
The latest version of cuDNN always seems to have various bugs.
Modify copy_weights_net(...) in network.c:

void copy_weights_net(network net_train, network* net_map)
{
    ...
}
Please refer to issue #8667
Tried to use libcudnn8 8.6.0.163 today with CUDA 11.8. The same problem still exists; it aborts when it hits iteration #1000. Used the command in the comment above and downgraded to libcudnn8 8.4.1.50, and the problem went away. This needs to be fixed...
@AlexeyAB do you have thoughts on the fix for this? Do you need a pull request for @chgoatherd's proposed changes, or is this going down the wrong path?
Same problem as in this issue, but it was solved after applying @chgoatherd's suggested change.
I got the same error here at 1000 iterations; at first I just got the error, then I downgraded cuDNN and it worked. In my case I'm using CUDA 11.2 with a container.
I got the same error using Docker.
Same problem.
@chgoatherd your solution and a rebuild fixed my issues when using -map with CUDA 11.x on a 3090 as well. Solid. You should create a pull request and get that merged in.
@chgoatherd Thanks a lot. I hit the same problem and your solution helped me solve it. I changed network.c and recompiled darknet with vcpkg, CUDA v11.8, and CUDNN v8.6 on Windows 11. Now everything works fine.
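For anyone repeating that Windows rebuild, a rough sketch of a CMake-plus-vcpkg build follows; the vcpkg install path is an assumption and should be adjusted to wherever your copy lives:

```sh
# Configure darknet against the vcpkg toolchain, then build a Release binary
cmake -S . -B build -DCMAKE_TOOLCHAIN_FILE=C:/vcpkg/scripts/buildsystems/vcpkg.cmake
cmake --build build --config Release
```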
I've made the changes that @chgoatherd listed above, switching out the code as described.
I'm using Ubuntu 20.04.6, CUDA 12.1.105-1, and CUDNN 8.9.1.23-1+cuda12.1. With the changes to network.c from @chgoatherd (listed above on 2022-09-15), the error looks like this:
Without the changes to network.c, the error looks like this:
So the call stack and the error message from CUDA/CUDNN are not exactly the same. I think there are multiple issues, and the changes from above expose the next problem.

IMPORTANT: For people looking for a quick workaround for this issue, especially if training on hardware you don't own (like Google Colab) where it is complicated to downgrade CUDA/CUDNN: disable CUDNN in the Darknet Makefile and rebuild, as sketched below. This is not ideal, but it will get you past the problem until a solution is found.
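For reference, a minimal sketch of that Makefile workaround. The flag names match the stock AlexeyAB/darknet Makefile; the other values shown are assumptions about a typical GPU build:

```makefile
# Top of the darknet Makefile: keep GPU training, but build without cuDNN
GPU=1
CUDNN=0        # disabling cuDNN avoids the abort at the iteration-1000 mAP calculation
CUDNN_HALF=0   # depends on cuDNN, so it must be off as well
OPENCV=1
```

After changing the flags, run `make clean && make` so everything is rebuilt without the cuDNN code paths.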
The release notes for CUDNN v8.5.0 -- where the problem started -- contain this text:
This sounds like a possible cause. I believe the cuDNN handle is initialized in dark_cuda.c, and it looks like it is a global variable shared between all threads. See the two calls where that handle is set up in dark_cuda.c.
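To illustrate the concern, here is a simplified sketch of the kind of per-device handle cache being described. This is an illustration of the pattern, not the exact dark_cuda.c source:

```c
#include <cuda_runtime.h>
#include <cudnn.h>

// Sketch: a lazily created cuDNN handle cached per GPU in static storage.
// Because the cache is global, the training thread and the separate mAP
// thread running on the same device end up sharing one handle, which is
// exactly the cross-thread sharing the comment above is worried about.
cudnnHandle_t cudnn_handle(void)
{
    static int init[16] = { 0 };
    static cudnnHandle_t handle[16];

    int device = 0;
    cudaGetDevice(&device);           // device the calling thread is bound to

    if (!init[device]) {
        cudnnCreate(&handle[device]); // created once, then reused by every thread
        init[device] = 1;
    }
    return handle[device];
}
```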
Until a proper solution is found, this is still the solution I employ on my training rigs: downgrading CUDNN and holding it at the older version, as described above.
As stated 2 comments above, another possible workaround is to disable CUDNN in the Darknet Makefile.
I made the change @chgoatherd suggested above, and it seems to work on Ubuntu 22.04.2 LTS with CUDA 11.7 + CUDNN 8.9.0.
Unfortunately, several (but not all) of my neural networks still trigger the error even with those changes.
Wondering if the fixes made here might finally solve this issue: hank-ai/darknet@1ea2baf
Preliminary tests show that this appears to have been fixed by that commit. See the Hank.ai Darknet repo: hank-ai/darknet@1ea2baf
Thanks for sharing it, @stephanecharette!
How do I solve this problem?
I have the same error as you on an RTX 3060, while an RTX 2070 Super works normally. Have you fixed it yet?
Yes, this is fixed in the new Darknet/YOLO repo: https://github.com/hank-ai/darknet#table-of-contents
Upgraded my Ubuntu 20.04 training rig to install the latest patches. This included a new version of CUDNN. Now using CUDA 11.7.1-1 and CUDNN 8.5.0.96-1+cuda11.7. Darknet is at the latest version from 2022-08-16.

All of my existing neural networks fail to train. Some are YOLOv4-tiny, others are YOLOv4-tiny-3L. The training rig is an NVIDIA 3090 with 24 GB of VRAM, and the networks fit well in VRAM. When Darknet gets to iteration 1000 in training, where it does the first mAP calculation, it produces this error:
The only important thing I can think of that has changed today is that I installed the latest version of CUDNN8. This is the relevant portion of the upgrade log:
Curious to know if anyone else has a problem with CUDNN 8.5.0.96, or has an idea as to how to fix this problem.
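If anyone wants to check whether their own rig picked up the same packages, one quick way on Ubuntu is shown below; it assumes the NVIDIA apt packages are installed, and the header path can vary between installs:

```sh
# Show installed cuDNN / CUDA package versions
dpkg -l | grep -Ei 'libcudnn|cuda-toolkit'

# Or read the version straight from the cuDNN header
grep -E 'define CUDNN_(MAJOR|MINOR|PATCHLEVEL)' /usr/include/cudnn_version.h
```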