CUDNN error at training iteration 1000 when calculating mAP% #8669
Comments
Downgraded CUDNN from 8.5.0 back to 8.4.1.50. Training works again. This is the command I used to downgrade:
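On Ubuntu this kind of downgrade is typically an apt version pin along the following lines. The package names match the NVIDIA apt repository; the exact version suffix (the `-1+cuda11.x` part) is an assumption here and should be matched to your installed CUDA version:

```sh
# Pin cuDNN back to the 8.4.1.50 build, then hold it so apt does not upgrade it again
sudo apt-get install libcudnn8=8.4.1.50-1+cuda11.6 libcudnn8-dev=8.4.1.50-1+cuda11.6
sudo apt-mark hold libcudnn8 libcudnn8-dev
```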
The latest version of cuDNN always seems to have various bugs.
Modify copy_weights_net(...) in network.c:

void copy_weights_net(network net_train, network* net_map)
{
    ...
}
Please refer to issue #8667
Tried to use libcudnn8 8.6.0.163 today with CUDA 11.8. The same problem still exists; it aborts when it hits iteration #1000. Used the command in the comment above and downgraded to libcudnn8 8.4.1.50, and the problem went away. This needs to be fixed...
@AlexeyAB do you have thoughts on the fix for this? Do you need a pull request for @chgoatherd's proposed changes, or is this going down the wrong path?
Same problem as in this issue, but it was solved after applying @chgoatherd's suggested change.
I got the same error here at 1000 iterations; at first I just got the error, then I downgraded cuDNN and it worked. In my case I'm using CUDA 11.2 with a container.
I got the same error using Docker.
Same problem.
@chgoatherd your solution and a rebuild fixed my issues when using -map with CUDA 11.x on a 3090 as well. Solid. You should create a pull request and get that merged in.
@chgoatherd Thanks a lot. I hit the same problem and your solution helped me solve it. I changed network.c and recompiled darknet with vcpkg, CUDA v11.8, and CUDNN v8.6 on Windows 11. Now everything works fine.
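For anyone repeating that Windows rebuild, a rough sketch of a CMake-plus-vcpkg build follows; the vcpkg install path is an assumption and should be adjusted to wherever your copy lives:

```sh
# Configure darknet against the vcpkg toolchain, then build a Release binary
cmake -S . -B build -DCMAKE_TOOLCHAIN_FILE=C:/vcpkg/scripts/buildsystems/vcpkg.cmake
cmake --build build --config Release
```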
I've made the changes that @chgoatherd listed above, switching out the code as described.
I'm using Ubuntu 20.04.6, CUDA 12.1.105-1, and CUDNN 8.9.1.23-1+cuda12.1. With the changes to network.c from @chgoatherd (listed above on 2022-09-15), the error looks like this:
Without the changes to network.c, the error looks like this:
So the call stack and the error message from CUDA/CUDNN are not exactly the same. I think there are multiple issues, and the changes from above expose the next problem.

IMPORTANT: For people looking for a quick workaround for this issue, especially if training on hardware you don't own (like Google Colab) where it is complicated to downgrade CUDA/CUDNN: disable CUDNN in the Darknet Makefile and rebuild, as sketched below. This is not ideal, but it will get you past the problem until a solution is found.
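For reference, a minimal sketch of that Makefile workaround. The flag names match the stock AlexeyAB/darknet Makefile; the other values shown are assumptions about a typical GPU build:

```makefile
# Top of the darknet Makefile: keep GPU training, but build without cuDNN
GPU=1
CUDNN=0        # disabling cuDNN avoids the abort at the iteration-1000 mAP calculation
CUDNN_HALF=0   # depends on cuDNN, so it must be off as well
OPENCV=1
```

After changing the flags, run `make clean && make` so everything is rebuilt without the cuDNN code paths.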
The release notes for CUDNN v8.5.0 -- where the problem started -- contain this text:
This sounds like a possible cause. I believe the cuDNN handle is initialized in dark_cuda.c, and it looks like it is a global variable shared between all threads. See the two calls where that handle is set up in dark_cuda.c.
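To illustrate the concern, here is a simplified sketch of the kind of per-device handle cache being described. This is an illustration of the pattern, not the exact dark_cuda.c source:

```c
#include <cuda_runtime.h>
#include <cudnn.h>

// Sketch: a lazily created cuDNN handle cached per GPU in static storage.
// Because the cache is global, the training thread and the separate mAP
// thread running on the same device end up sharing one handle, which is
// exactly the cross-thread sharing the comment above is worried about.
cudnnHandle_t cudnn_handle(void)
{
    static int init[16] = { 0 };
    static cudnnHandle_t handle[16];

    int device = 0;
    cudaGetDevice(&device);           // device the calling thread is bound to

    if (!init[device]) {
        cudnnCreate(&handle[device]); // created once, then reused by every thread
        init[device] = 1;
    }
    return handle[device];
}
```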
Until a proper solution is found, this is still the solution I employ on my training rigs: downgrading CUDNN and holding it at the older version, as described above.
As stated 2 comments above, another possible workaround is to disable CUDNN in the Darknet Makefile.
I made the change @chgoatherd suggested above, and it seems to work on Ubuntu 22.04.2 LTS with CUDA 11.7 + CUDNN 8.9.0.
Unfortunately, several (but not all) of my neural networks still trigger the error even with those changes.
Wondering if the fixes made here might finally solve this issue: hank-ai/darknet@1ea2baf
Preliminary tests show that this appears to have been fixed by that commit. See the Hank.ai Darknet repo: hank-ai/darknet@1ea2baf
Thanks for sharing it, @stephanecharette!
How do I solve this problem?
I have the same error as you on an RTX 3060, while an RTX 2070 Super works normally. Have you fixed it yet?
Yes, this is fixed in the new Darknet/YOLO repo: https://github.com/hank-ai/darknet#table-of-contents
Upgraded my Ubuntu 20.04 training rig to install the latest patches. This included a new version of CUDNN. Now using CUDA 11.7.1-1 and CUDNN 8.5.0.96-1+cuda11.7. Darknet is at the latest version from 2022-08-16.

All of my existing neural networks fail to train. Some are YOLOv4-tiny, others are YOLOv4-tiny-3L. The training rig is an NVIDIA 3090 with 24 GB of VRAM, and the networks fit well in VRAM. When Darknet gets to iteration 1000 in training, where it does the first mAP calculation, it produces this error:
The only important thing I can think of that has changed today is that I installed the latest version of CUDNN8. This is the relevant portion of the upgrade log:
Curious to know if anyone else has a problem with CUDNN 8.5.0.96, or has an idea as to how to fix this problem.
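If anyone wants to check whether their own rig picked up the same packages, one quick way on Ubuntu is shown below; it assumes the NVIDIA apt packages are installed, and the header path can vary between installs:

```sh
# Show installed cuDNN / CUDA package versions
dpkg -l | grep -Ei 'libcudnn|cuda-toolkit'

# Or read the version straight from the cuDNN header
grep -E 'define CUDNN_(MAJOR|MINOR|PATCHLEVEL)' /usr/include/cudnn_version.h
```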