
darknet crashes when calculating mAP% at iteration #1000 #2

Closed
stephanecharette opened this issue Jul 17, 2023 · 11 comments

@stephanecharette
Collaborator

User "cmorzy" reported today that they're still seeing the error/crash when Darknet reaches iteration #1000. A copy of the dataset, .names, and .cfg is available.

The exact message they're seeing is:

* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected.  Darknet will now exit.
* Error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #546
* Error message:  cuDNN current error: status=3, CUDNN_STATUS_BAD_PARAM
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (13 entries):
1/13: ./darknet(log_backtrace+0x38) [0x560b3fb79128]
2/13: ./darknet(darknet_fatal_error+0x19d) [0x560b3fb7936d]
3/13: ./darknet(cudnn_check_error_extended+0x83) [0x560b3fb7bf83]
4/13: ./darknet(forward_convolutional_layer_gpu+0x2c5) [0x560b3fc56985]
5/13: ./darknet(forward_network_gpu+0xe1) [0x560b3fc6af81]
6/13: ./darknet(network_predict_gpu+0x140) [0x560b3fc6d800]
7/13: ./darknet(validate_detector_map+0xa49) [0x560b3fc02f29]
8/13: ./darknet(train_detector+0x1ce0) [0x560b3fc05f70]
9/13: ./darknet(run_detector+0x9f6) [0x560b3fc09996]
10/13: ./darknet(main+0x4b3) [0x560b3fb308b3]
11/13: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6ed5bd7d90]
12/13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6ed5bd7e40]
13/13: ./darknet(_start+0x25) [0x560b3fb32b25]
Segmentation fault (core dumped)
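For reference, the `darknet_fatal_error` and `cudnn_check_error_extended` frames in the backtrace come from Darknet's cuDNN status check, which turns a non-success cuDNN status into the fatal message above (status=3 is `CUDNN_STATUS_BAD_PARAM`, i.e. one of the descriptors or parameters passed to the convolution call was rejected). A minimal sketch of that pattern, illustrative only and not Darknet's exact implementation (the `CHECK_CUDNN` name is made up here):

```cpp
#include <cudnn.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message when a cuDNN call fails. Darknet's
// cudnn_check_error_extended() follows the same idea, but also prints
// a backtrace (log_backtrace) before exiting via darknet_fatal_error().
#define CHECK_CUDNN(call)                                                     \
    do {                                                                      \
        const cudnnStatus_t status = (call);                                  \
        if (status != CUDNN_STATUS_SUCCESS) {                                 \
            std::fprintf(stderr, "cuDNN error at %s:%d: status=%d (%s)\n",    \
                         __FILE__, __LINE__, static_cast<int>(status),        \
                         cudnnGetErrorString(status));                        \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)
```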
stephanecharette self-assigned this Jul 17, 2023
@stephanecharette
Collaborator Author

This is a continuation of AlexeyAB/darknet#8669

@chrislytras

Using:

libcudnn8=8.5.0.96-1+cuda11.7
libcudnn8-dev=8.5.0.96-1+cuda11.7

I also reproduced it using 8.9.3.28-1+cuda11.8.

@sinyb

sinyb commented Sep 8, 2023

Me too:

Ubuntu 22.04.3

libcudnn8=8.9.4.25-1+cuda12.2
libcudnn8-dev=8.9.4.25-1+cuda12.2

```
...
-> next mAP calculation will be at iteration #1000
Tensor Cores are disabled until iteration #3000.
1000: loss=4.558, avg loss=4.317, rate=0.001000, 103.801 milliseconds, 32000 images, time remaining=30 hours

calculating mAP (mean average precision)...
Detection layer #30 is type 28 (yolo)
Detection layer #37 is type 28 (yolo)
using 4 threads to load 420 validation images for mAP% calculations
processing #0 (0%)
cuDNN status error in /home/user/src/darknet/src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #554

  • A fatal error has been detected. Darknet will now exit.
  • Errno 2: No such file or directory
  • Error location: /home/user/src/darknet/src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #554
  • Error message: cuDNN current error: status=3, CUDNN_STATUS_BAD_PARAM
  • Version v2.0-4-g7d84f744 built on Sep 8 2023 09:13:21

backtrace (13 entries):
1/13: darknet(_Z13log_backtracev+0x38) [0x55b121550ce8]
2/13: darknet(darknet_fatal_error+0x1bd) [0x55b121550f4d]
3/13: darknet(cudnn_check_error_extended+0x83) [0x55b1214982b3]
4/13: darknet(forward_convolutional_layer_gpu+0x2d5) [0x55b12148bce5]
5/13: darknet(forward_network_gpu+0xe1) [0x55b12152b9d1]
6/13: darknet(network_predict_gpu+0x140) [0x55b12152e660]
7/13: darknet(validate_detector_map+0xa06) [0x55b1214afa56]
8/13: darknet(train_detector+0x1475) [0x55b1214b2185]
9/13: darknet(_Z12run_detectoriPPc+0xa85) [0x55b1214b60f5]
10/13: darknet(main+0x4a1) [0x55b1214454e1]
11/13: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6dd2e29d90]
12/13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6dd2e29e40]
13/13: darknet(_start+0x25) [0x55b121447ef5]
Command exited with non-zero status 1
```

@kdill00

kdill00 commented Sep 18, 2023

You probably know this, but it usually works if you set subdivisions to 64. That just leaves a lot of wasted memory on the card and quadruples training time. Thanks for working on this; it has probably been the biggest pain in the ass with Darknet for the last two years. I gave up and wrote bash scripts to stop training, run the mAP calculation, post it online, and start training again. It would be nice to get mAP-during-training working reliably.

@kdill00

kdill00 commented Sep 18, 2023

Just to note, I have tried and experienced this on CUDA 11.4 through 12.2 over the last year and a half, with all kinds of datasets. A smaller training resolution and higher subdivisions will allow it to work most of the time, but as in the previous post, that increases training time too much.

@chrislytras

chrislytras commented Sep 18, 2023

> Just to note, I have tried and experienced this on CUDA 11.4 through 12.2 over the last year and a half, with all kinds of datasets. A smaller training resolution and higher subdivisions will allow it to work most of the time, but as in the previous post, that increases training time too much.

To work around it, downgrade cuDNN to 8.4.1:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8_8.4.1.50-1+cuda11.6_amd64.deb
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8-dev_8.4.1.50-1+cuda11.6_amd64.deb

This should do it.

@kdill00

kdill00 commented Sep 22, 2023

I'll give it a try, thank you.

@suminoshi

suminoshi commented Oct 2, 2023

When this error occurs, check this value in the config:

[net]
burn_in=1000

If you set it to 800, a similar error occurs at iteration #800; if you set it to 100, the result is the same.

However, if you set subdivisions to a value that is not a power of two, such as 6 or 10, the error does not occur. I think it's a problem with the result of an internal multiplication or division. Whether burn-in produces the error may also depend on the number of training files or other factors.
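To make the multiplication/division suspicion above concrete, here is an arithmetic-only sketch. It is not Darknet's code; it only assumes the documented cfg behaviour that the per-step mini-batch is batch divided by subdivisions, which is an integer division:

```cpp
#include <cstdio>

int main()
{
    // With batch=64, a subdivisions value that does not divide 64 evenly
    // truncates in the integer division, so mini_batch * subdivisions no
    // longer adds back up to the configured batch size.
    const int batch = 64;
    const int subdivision_options[] = {8, 6, 10};

    for (const int subdivisions : subdivision_options)
    {
        const int mini_batch = batch / subdivisions;  // truncating integer division
        std::printf("subdivisions=%2d -> mini_batch=%2d -> %2d images per iteration\n",
                    subdivisions, mini_batch, mini_batch * subdivisions);
    }
    return 0;
}
```

With subdivisions=8 this gives 64 images per iteration, while 6 or 10 give 60, so the effective per-iteration counts genuinely differ between the power-of-two and non-power-of-two cases.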

@Rares926

Rares926 commented Sep 12, 2024

Just as a side note, some people recommended setting the minibatch to 64 to avoid this problem, and that does work; however, take into consideration that with the minibatch set to 64 the model can overfit the training data more easily. This is particularly a concern if your dataset isn't very large or diverse.

@stephanecharette
Collaborator Author

The problem was solved long ago with the new Darknet repo. Please use https://github.com/hank-ai/darknet as that repo is maintained and correctly solves this issue.

@stephanecharette
Collaborator Author

This memory issue should be fixed in V2.

In V3, the fix was modified for performance reasons. If this problem comes back in V3, please see the comment block in cudnn_convolutional_setup() within convolutional_layer.cpp. Specifically, the fix is where the variable compu_capability_ver gets used.

Instead of keying off the major/minor compute-capability version numbers, perhaps we should do a better job of calculating memory usage and use that to decide which cuDNN algorithms to include?
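For anyone who wants to experiment with that idea, here is a minimal sketch of choosing a forward algorithm by workspace memory rather than by compute-capability version. It is not Darknet's actual cudnn_convolutional_setup() code; the function name pick_forward_algo and the workspace_limit_bytes parameter are illustrative, and the descriptors are assumed to be already configured:

```cpp
#include <cudnn.h>

// Enumerate cuDNN's candidate forward algorithms (ranked by expected speed)
// and return the first one that is supported for these descriptors and whose
// workspace fits within the given memory budget.
cudnnConvolutionFwdAlgo_t pick_forward_algo(cudnnHandle_t handle,
                                            cudnnTensorDescriptor_t x_desc,
                                            cudnnFilterDescriptor_t w_desc,
                                            cudnnConvolutionDescriptor_t conv_desc,
                                            cudnnTensorDescriptor_t y_desc,
                                            size_t workspace_limit_bytes)
{
    const int requested = CUDNN_CONVOLUTION_FWD_ALGO_COUNT;
    int returned = 0;
    cudnnConvolutionFwdAlgoPerf_t results[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];

    if (cudnnGetConvolutionForwardAlgorithm_v7(handle, x_desc, w_desc, conv_desc, y_desc,
                                               requested, &returned, results) != CUDNN_STATUS_SUCCESS)
    {
        // Safe default: implicit GEMM needs no extra workspace.
        return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
    }

    for (int i = 0; i < returned; ++i)
    {
        if (results[i].status == CUDNN_STATUS_SUCCESS &&
            results[i].memory <= workspace_limit_bytes)
        {
            return results[i].algo;
        }
    }

    return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
}
```

Checking both the returned status and the workspace requirement filters out candidates that cuDNN itself flags as unsuitable for the current descriptors, in addition to those that would not fit in the chosen memory budget.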
