
darknet crashes when calculating mAP% at iteration #1000 #2

Closed
stephanecharette opened this issue Jul 17, 2023 · 11 comments

@stephanecharette
Collaborator

User "cmorzy" reported today that they're still seeing the error/crash when Darknet reaches iteration #1000. A copy of the dataset, .names, and .cfg is available.

The exact message they're seeing is:

* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected.  Darknet will now exit.
* Error location: ./src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #546
* Error message:  cuDNN current error: status=3, CUDNN_STATUS_BAD_PARAM
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (13 entries):
1/13: ./darknet(log_backtrace+0x38) [0x560b3fb79128]
2/13: ./darknet(darknet_fatal_error+0x19d) [0x560b3fb7936d]
3/13: ./darknet(cudnn_check_error_extended+0x83) [0x560b3fb7bf83]
4/13: ./darknet(forward_convolutional_layer_gpu+0x2c5) [0x560b3fc56985]
5/13: ./darknet(forward_network_gpu+0xe1) [0x560b3fc6af81]
6/13: ./darknet(network_predict_gpu+0x140) [0x560b3fc6d800]
7/13: ./darknet(validate_detector_map+0xa49) [0x560b3fc02f29]
8/13: ./darknet(train_detector+0x1ce0) [0x560b3fc05f70]
9/13: ./darknet(run_detector+0x9f6) [0x560b3fc09996]
10/13: ./darknet(main+0x4b3) [0x560b3fb308b3]
11/13: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6ed5bd7d90]
12/13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6ed5bd7e40]
13/13: ./darknet(_start+0x25) [0x560b3fb32b25]
Segmentation fault (core dumped)
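For reference, the `darknet_fatal_error` and `cudnn_check_error_extended` frames in the backtrace come from Darknet's cuDNN status check, which turns a non-success cuDNN status into the fatal message above (status=3 is `CUDNN_STATUS_BAD_PARAM`, i.e. one of the descriptors or parameters passed to the convolution call was rejected). A minimal sketch of that pattern, illustrative only and not Darknet's exact implementation (the `CHECK_CUDNN` name is made up here):

```cpp
#include <cudnn.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message when a cuDNN call fails. Darknet's
// cudnn_check_error_extended() follows the same idea, but also prints
// a backtrace (log_backtrace) before exiting via darknet_fatal_error().
#define CHECK_CUDNN(call)                                                     \
    do {                                                                      \
        const cudnnStatus_t status = (call);                                  \
        if (status != CUDNN_STATUS_SUCCESS) {                                 \
            std::fprintf(stderr, "cuDNN error at %s:%d: status=%d (%s)\n",    \
                         __FILE__, __LINE__, static_cast<int>(status),        \
                         cudnnGetErrorString(status));                        \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)
```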
stephanecharette self-assigned this Jul 17, 2023
@stephanecharette
Collaborator Author

This is a continuation of AlexeyAB/darknet#8669

@chrislytras

Using:

libcudnn8=8.5.0.96-1+cuda11.7
libcudnn8-dev=8.5.0.96-1+cuda11.7

I also reproduced it using 8.9.3.28-1+cuda11.8.

@sinyb

sinyb commented Sep 8, 2023

Me too:

Ubuntu 22.04.3

libcudnn8=8.9.4.25-1+cuda12.2
libcudnn8-dev=8.9.4.25-1+cuda12.2

```
...
-> next mAP calculation will be at iteration #1000
Tensor Cores are disabled until iteration #3000.
1000: loss=4.558, avg loss=4.317, rate=0.001000, 103.801 milliseconds, 32000 images, time remaining=30 hours

calculating mAP (mean average precision)...
Detection layer #30 is type 28 (yolo)
Detection layer #37 is type 28 (yolo)
using 4 threads to load 420 validation images for mAP% calculations
processing #0 (0%)
cuDNN status error in /home/user/src/darknet/src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #554

  • A fatal error has been detected. Darknet will now exit.
  • Errno 2: No such file or directory
  • Error location: /home/user/src/darknet/src/convolutional_kernels.cu, forward_convolutional_layer_gpu(), line #554
  • Error message: cuDNN current error: status=3, CUDNN_STATUS_BAD_PARAM
  • Version v2.0-4-g7d84f744 built on Sep 8 2023 09:13:21

backtrace (13 entries):
1/13: darknet(_Z13log_backtracev+0x38) [0x55b121550ce8]
2/13: darknet(darknet_fatal_error+0x1bd) [0x55b121550f4d]
3/13: darknet(cudnn_check_error_extended+0x83) [0x55b1214982b3]
4/13: darknet(forward_convolutional_layer_gpu+0x2d5) [0x55b12148bce5]
5/13: darknet(forward_network_gpu+0xe1) [0x55b12152b9d1]
6/13: darknet(network_predict_gpu+0x140) [0x55b12152e660]
7/13: darknet(validate_detector_map+0xa06) [0x55b1214afa56]
8/13: darknet(train_detector+0x1475) [0x55b1214b2185]
9/13: darknet(_Z12run_detectoriPPc+0xa85) [0x55b1214b60f5]
10/13: darknet(main+0x4a1) [0x55b1214454e1]
11/13: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6dd2e29d90]
12/13: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6dd2e29e40]
13/13: darknet(_start+0x25) [0x55b121447ef5]
Command exited with non-zero status 1
```

@kdill00

kdill00 commented Sep 18, 2023

You probably know this, but it usually works if you set subdivisions to 64. That just leaves a lot of wasted memory on the card and quadruples training time. Thanks for working on this; it has probably been the biggest pain in the ass with Darknet for the last two years. I gave up and wrote bash scripts to stop training, run the mAP calculation, post it online, and start training again. It would be nice to get mAP-during-training working reliably.

@kdill00

kdill00 commented Sep 18, 2023

Just to note, I have tried and experienced this on CUDA 11.4 through 12.2 over the last year and a half, with all kinds of datasets. A smaller training resolution and higher subdivisions will allow it to work most of the time, but as in the previous post, that increases training time too much.

@chrislytras

chrislytras commented Sep 18, 2023

> Just to note, I have tried and experienced this on CUDA 11.4 through 12.2 over the last year and a half, with all kinds of datasets. A smaller training resolution and higher subdivisions will allow it to work most of the time, but as in the previous post, that increases training time too much.

To work around it, downgrade cuDNN to 8.4.1:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8_8.4.1.50-1+cuda11.6_amd64.deb
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8-dev_8.4.1.50-1+cuda11.6_amd64.deb

This should do it.

@kdill00

kdill00 commented Sep 22, 2023

I'll give it a try, thank you.

@suminoshi

suminoshi commented Oct 2, 2023

When this error occurs, check this value in the config:

[net]
burn_in=1000

If you set it to 800, a similar error occurs at iteration #800; if you set it to 100, the result is the same.

However, if you set subdivisions to a value that is not a power of two, such as 6 or 10, the error does not occur. I think it's a problem with the result of an internal multiplication or division. Whether burn-in produces the error may also depend on the number of training files or other factors.
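To make the multiplication/division suspicion above concrete, here is an arithmetic-only sketch. It is not Darknet's code; it only assumes the documented cfg behaviour that the per-step mini-batch is batch divided by subdivisions, which is an integer division:

```cpp
#include <cstdio>

int main()
{
    // With batch=64, a subdivisions value that does not divide 64 evenly
    // truncates in the integer division, so mini_batch * subdivisions no
    // longer adds back up to the configured batch size.
    const int batch = 64;
    const int subdivision_options[] = {8, 6, 10};

    for (const int subdivisions : subdivision_options)
    {
        const int mini_batch = batch / subdivisions;  // truncating integer division
        std::printf("subdivisions=%2d -> mini_batch=%2d -> %2d images per iteration\n",
                    subdivisions, mini_batch, mini_batch * subdivisions);
    }
    return 0;
}
```

With subdivisions=8 this gives 64 images per iteration, while 6 or 10 give 60, so the effective per-iteration counts genuinely differ between the power-of-two and non-power-of-two cases.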

@Rares926

Rares926 commented Sep 12, 2024

Just as a side note, some people recommended setting the minibatch to 64 to avoid this problem, and that does work; however, take into consideration that with the minibatch set to 64 the model can overfit the training data more easily. This is particularly a concern if your dataset isn't very large or diverse.

@stephanecharette
Collaborator Author

The problem was solved long ago with the new Darknet repo. Please use https://github.com/hank-ai/darknet as that repo is maintained and correctly solves this issue.

@stephanecharette
Collaborator Author

This memory issue should be fixed in V2.

In V3, the fix was modified for performance reasons. If this problem comes back in V3, please see the comment block in cudnn_convolutional_setup() within convolutional_layer.cpp. Specifically, the fix is where the variable compu_capability_ver gets used.

Instead of keying off the major/minor compute-capability version numbers, perhaps we should do a better job of calculating memory usage and use that to decide which cuDNN algorithms to include?
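For anyone who wants to experiment with that idea, here is a minimal sketch of choosing a forward algorithm by workspace memory rather than by compute-capability version. It is not Darknet's actual cudnn_convolutional_setup() code; the function name pick_forward_algo and the workspace_limit_bytes parameter are illustrative, and the descriptors are assumed to be already configured:

```cpp
#include <cudnn.h>

// Enumerate cuDNN's candidate forward algorithms (ranked by expected speed)
// and return the first one that is supported for these descriptors and whose
// workspace fits within the given memory budget.
cudnnConvolutionFwdAlgo_t pick_forward_algo(cudnnHandle_t handle,
                                            cudnnTensorDescriptor_t x_desc,
                                            cudnnFilterDescriptor_t w_desc,
                                            cudnnConvolutionDescriptor_t conv_desc,
                                            cudnnTensorDescriptor_t y_desc,
                                            size_t workspace_limit_bytes)
{
    const int requested = CUDNN_CONVOLUTION_FWD_ALGO_COUNT;
    int returned = 0;
    cudnnConvolutionFwdAlgoPerf_t results[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];

    if (cudnnGetConvolutionForwardAlgorithm_v7(handle, x_desc, w_desc, conv_desc, y_desc,
                                               requested, &returned, results) != CUDNN_STATUS_SUCCESS)
    {
        // Safe default: implicit GEMM needs no extra workspace.
        return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
    }

    for (int i = 0; i < returned; ++i)
    {
        if (results[i].status == CUDNN_STATUS_SUCCESS &&
            results[i].memory <= workspace_limit_bytes)
        {
            return results[i].algo;
        }
    }

    return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
}
```

Checking both the returned status and the workspace requirement filters out candidates that cuDNN itself flags as unsuitable for the current descriptors, in addition to those that would not fit in the chosen memory budget.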
