darknet crashes when calculating mAP% at iteration #1000 #2
This is a continuation of AlexeyAB/darknet#8669
Using libcudnn8=8.5.0.96-1+cuda11.7, but also recreated with 8.9.3.28-1+cuda11.8.
Me too: Ubuntu 22.04.3, libcudnn8=8.9.4.25-1+cuda12.2.
`... calculating mAP (mean average precision)...`
backtrace (13 entries):
You probably know this, but it usually works if you set subdivisions to 64. That just leaves a lot of wasted memory on the card and quadruples training time. Thanks for working on this; it has probably been the biggest pain in the ass with darknet for the last two years. I gave up and wrote bash scripts to stop training, run mAP, post it online, and start training again. It would be nice to get mAP-during-training working well.
Just to note, I have tried and experienced this on CUDA 11.4 through 12.2 over the last year and a half, with all kinds of datasets. A smaller training resolution and higher subdivisions will let it work most of the time, but as in the previous post, it increases training time too much.
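For anyone following along, here is a minimal sketch of what the workaround above looks like in a YOLO-style .cfg; the numbers are illustrative, not taken from any reporter's file. Darknet's effective mini-batch is batch / subdivisions, so batch=64 with subdivisions=64 processes one image per GPU pass, which is what trades training speed for lower memory use.

```ini
# Illustrative [net] values only -- not from an actual crashing config.
[net]
batch=64          # images accumulated per weight update
subdivisions=64   # mini-batch = batch / subdivisions = 1 image per GPU pass
width=416         # lowering the training resolution also reduces memory use
height=416
```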
This should do it. |
I'll give it a try, thank you.
If this error occurs, it is tied to a value in the config: if you set that value to 800, a similar error will occur at iteration 800. But if you set subdivisions to a non-power-of-two value such as 6 or 10, the error does not occur.
Just as a side note: some people recommended setting the minibatch to 64 in order to avoid this problem, and it does work; however, take into consideration that with the minibatch set to 64 the model can overfit the training data more easily. This is particularly a concern if your dataset isn't very large or diverse.
The problem was solved long ago with the new Darknet repo. Please use https://github.com/hank-ai/darknet as that repo is maintained and correctly solves this issue. |
This memory issue should be fixed in V2. In V3, the fix was modified for performance reasons. If this problem comes back in V3, please see the comment block in cudnn_convolutional_setup() within convolutional_layer.cpp; specifically, the fix is where the relevant variable is set in that function. Instead of using the major/minor version numbers, perhaps we should be better at calculating memory usage and deciding which algorithm to include?
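To make that last idea concrete, here is a minimal sketch, not the actual Darknet code: the helper name pick_fwd_algo, its arguments, and the 50% memory budget are all assumptions. It shows one way to choose a cuDNN forward algorithm by its measured workspace size against currently free GPU memory, rather than keying the choice on CUDA/cuDNN major/minor versions.

```cpp
#include <cudnn.h>
#include <cuda_runtime.h>

// Hypothetical helper: pick the fastest forward algorithm whose workspace
// fits within a budget derived from currently free GPU memory. Assumes the
// layer's cuDNN descriptors are already configured.
cudnnConvolutionFwdAlgo_t pick_fwd_algo(
        cudnnHandle_t handle,
        cudnnTensorDescriptor_t srcDesc,
        cudnnFilterDescriptor_t weightDesc,
        cudnnConvolutionDescriptor_t convDesc,
        cudnnTensorDescriptor_t dstDesc)
{
    // Ask cuDNN for all forward algorithms, ranked fastest-first.
    cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
    int returned = 0;
    cudnnGetConvolutionForwardAlgorithm_v7(
        handle, srcDesc, weightDesc, convDesc, dstDesc,
        CUDNN_CONVOLUTION_FWD_ALGO_COUNT, &returned, perf);

    // How much device memory is free right now?
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    // Leave headroom, since the mAP pass allocates extra buffers on top of
    // training -- which is roughly when the crash at iteration 1000 appears.
    const size_t budget = free_bytes / 2;

    for (int i = 0; i < returned; ++i)
    {
        if (perf[i].status == CUDNN_STATUS_SUCCESS && perf[i].memory <= budget)
        {
            return perf[i].algo;    // fastest algorithm that fits the budget
        }
    }

    // Fall back to implicit GEMM, which requires no workspace and so should
    // always fit, even while mAP is being calculated.
    return CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
}
```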
User "cmorzy" reported today that they're still seeing the error/crash when Darknet reaches iteration #1000. A copy of the dataset, .names, and .cfg is available.
The exact message they're seeing is: