
cuDNN Error: CUDNN_STATUS_BAD_PARAM while training #2367

Open

divid3d opened this issue Dec 13, 2020 · 6 comments

divid3d commented Dec 13, 2020

While training, at the moment the mAP calculation should start, I get an assertion error:

(next mAP calculation at 1900 iterations)
 1900: 185.299484, 158.218445 avg loss, 0.001000 rate, 6.550167 seconds, 121600 images, 16.111683 hours left
4
cuDNN Error: CUDNN_STATUS_BAD_PARAM: File exists
darknet: ./src/utils.c:331: error: Assertion 0 failed.


Lerseb commented Dec 20, 2020

I have the same issue.
I noticed it happens when I have 2 classes.
Right now I am training without the -map flag and it is training fine.
It just isn't validating against the 20% of images in your test.txt.
I don't know if it is related to darknet or cuDNN.
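
A possible workaround along these lines (just a sketch, assuming the AlexeyAB fork, since -map and these logs come from it, and with placeholder file names): train without -map, then compute mAP separately from a saved checkpoint with the map subcommand. The same cuDNN error may still appear there, but at least training itself keeps running.

# train without the in-training mAP pass (placeholder paths)
./darknet detector train dataset/obj.data dataset/obj.cfg yolov4.conv.137 -dont_show
# evaluate mAP afterwards from a saved checkpoint
./darknet detector map dataset/obj.data dataset/obj.cfg backup/obj_last.weights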


niemiaszek commented Dec 20, 2020

EDIT: Didn't notice this is the pjreddie repo. Opening an issue on the AlexeyAB fork instead.
Got the same issue with a 1080 Ti, built with ZED 3.2 and OpenCV 4.5 with CUDA. Training for only 1 class.
Driver Version: 455.32.00, CUDA Version: 11.1, cuDNN 8.0.4
The issue didn't happen on CUDA 10.1 / Ubuntu 18 on the same setup without ZED.
/darknet$ git log
commit 14b196d

Crash on map:
(next mAP calculation at 1000 iterations)
1000: 0.831169, 0.674900 avg loss, 0.001000 rate, 4.804868 seconds, 64000 images, 3.878937 hours left
Resizing to initial size: 416 x 416 try to allocate additional workspace_size = 89.65 MB
CUDA allocate done!

calculation mAP (mean average precision)...
Detection layer: 139 - type = 28
Detection layer: 150 - type = 28
Detection layer: 161 - type = 28
4
cuDNN status Error in: file: /home/patryk/darknet/src/convolutional_kernels.cu : () : line: 533 : build time: Nov 7 2020 - 09:24:25

cuDNN Error: CUDNN_STATUS_BAD_PARAM
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Resource temporarily unavailable
./szkolenie.sh: line 2: 36330 Segmentation fault (core dumped) ./darknet detector train dataset/obj.data dataset/op14.cfg yolov4.conv.137 -map

@Remco-Terwal-Bose

Was this ever understood and/or addressed?

command:
darknet detector train yolov3-ambulance-setup.data yolov3-ambulance-train.cfg ./darknet53.conv.74 -dont_show -map 2> train_log.txt

GPU: Nvidia 2080 SUPER, OpenCV 4.5.5 with CUDA, training for 1 class
Driver version 512.15
CUDA version: 11.6
cuDNN version 8.3

(next mAP calculation at 100 iterations)
100: 0.637962, 1.110619 avg loss, 0.001000 rate, 4.220000 seconds, 6400 images, 1.343723 hours left
Resizing to initial size: 416 x 416 try to allocate additional workspace_size = 154.24 MB
CUDA allocate done!

calculation mAP (mean average precision)...
Detection layer: 82 - type = 28
Detection layer: 94 - type = 28
Detection layer: 106 - type = 28

cuDNN status Error in: file: C:\darknet\src\convolutional_kernels.cu : forward_convolutional_layer_gpu() : line: 555 : build time: Apr 8 2022 - 13:14:27

cuDNN Error: CUDNN_STATUS_BAD_PARAM

@niemiaszek

@Remco-Terwal-Bose see the issue mentioned above, or here

@Rizama03

This is my error message; everything crashes just when it's about to calculate the mAP:

(next mAP calculation at 1000 iterations)
 1000/10000: loss=14.6 hours left=4.9
 1000: 14.551660, 16.706905 avg loss, 0.002610 rate, 1.315968 seconds, 64000 images, 4.874600 hours left
4
Darknet error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #541
cuDNN Error: CUDNN_STATUS_BAD_PARAM: Succes

@PriyankaIITI

I also encountered the error when using -map in the command:
"cuDNN status Error in: file: ./src/convolutional_kernels.cu function: forward_convolutional_layer_gpu() line: 541 cuDNN Error: CUDNN_STATUS_BAD_PARAM Darknet error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #541"

Here are the solutions I found:

  1. Downgrade CUDA (link)
  2. Use subdivisions=64 (link; see the sketch after this list)
  3. Another comment (link)
  4. export CUDA_VISIBLE_DEVICES=0 (link)
  5. GPU architecture (link)
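
For item 2, a minimal sketch of the relevant [net] section (the values other than subdivisions are just common defaults, not taken from this thread): with batch=64 and subdivisions=64 each mini-batch is a single image, which reduces GPU memory use per forward/backward pass.

[net]
batch=64
subdivisions=64
width=416
height=416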

But I was able to run with the following Makefile settings:
GPU=1
CUDNN=0
CUDNN_HALF=0
OPENCV=1

I don't know if this is the correct way to do it, though.
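
For reference, the usual rebuild after changing those Makefile flags (standard darknet build steps, nothing specific to this issue):

# from the darknet repo root, after editing the Makefile
make clean
make -j$(nproc)

With CUDNN=0, convolutions fall back to darknet's own CUDA kernels instead of cuDNN, so this sidesteps the failing cuDNN call at the cost of slower training.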

@AlexeyAB
