illegal memory access was encountered during first mAP calculations while training network #1

Closed
stephanecharette opened this issue May 22, 2023 · 4 comments

Comments

@stephanecharette
Collaborator

stephanecharette commented May 22, 2023

I am training a new network. Once training reaches iteration 1000, the following is logged:

 (next mAP calculation at 1000 iterations) 
 1000: 0.483266, 0.592546 avg loss, 0.002610 rate, 2.077195 seconds, 64000 images, 2.909745 hours left
4* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected.  Darknet will now exit.
* Error location: ./src/network_kernels.cu, network_predict_gpu(), line #744
* Error message:  current CUDA error: status=700, an illegal memory access was encountered
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (11 entries):
1/11: /home/stephane/src/darknet/darknet(log_backtrace+0x38) [0x55ab1d538e18]
2/11: /home/stephane/src/darknet/darknet(darknet_fatal_error+0x178) [0x55ab1d539038]
3/11: /home/stephane/src/darknet/darknet(check_error+0x5c) [0x55ab1d53becc]
4/11: /home/stephane/src/darknet/darknet(check_error_extended+0x7c) [0x55ab1d53bf8c]
5/11: /home/stephane/src/darknet/darknet(network_predict_gpu+0x15f) [0x55ab1d63bb5f]
6/11: /home/stephane/src/darknet/darknet(validate_detector_map+0x9af) [0x55ab1d5cbdcf]
7/11: /home/stephane/src/darknet/darknet(train_detector+0x1698) [0x55ab1d5ceaf8]
8/11: /home/stephane/src/darknet/darknet(run_detector+0x897) [0x55ab1d5d2ad7]
9/11: /home/stephane/src/darknet/darknet(main+0x375) [0x55ab1d4ed685]
10/11: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f424cd09083]
11/11: /home/stephane/src/darknet/darknet(_start+0x2e) [0x55ab1d4ef8fe]

This error does not happen when I turn off CUDNN in Darknet's Makefile:

CUDNN=0
CUDNN_HALF=0
@TommiHonkanen

The same thing happens to me. I am using an RTX 3070 on Ubuntu 23.04 with CUDA 12.0 and cuDNN 8.9.1.

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
4CUDA status error: file: ./src/network_kernels.cu: func: network_predict_gpu() line: 744
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected.  Darknet will now exit.
* Errno 11: Resource temporarily unavailable
* Error location: ./src/network_kernels.cu, network_predict_gpu(), line #744
* Error message:  current CUDA error: status=700, an illegal memory access was encountered
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (10 entries):
1/10: ./darknet(log_backtrace+0x38) [0x564261048698]
2/10: ./darknet(darknet_fatal_error+0x16f) [0x5642610488af]
3/10: ./darknet(check_error_extended+0xc5) [0x56426104c1a5]
4/10: ./darknet(network_predict_gpu+0x166) [0x564261166b16]
5/10: ./darknet(train_detector+0x1e61) [0x5642610dec51]
6/10: ./darknet(run_detector+0xa05) [0x5642610e85b5]
7/10: ./darknet(main+0x37a) [0x56426100669a]
8/10: /lib/x86_64-linux-gnu/libc.so.6(+0x23a90) [0x7f3832023a90]
9/10: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x89) [0x7f3832023b49]
10/10: ./darknet(_start+0x25) [0x564261008915]

stephanecharette added a commit that referenced this issue May 31, 2023
This reverts commit 1491d4f since it seems to be the cause of the illegal memory access error described in #1
@stephanecharette
Collaborator Author

I'm 100% certain that the fix described in AlexeyAB/darknet#8669 (comment) is the cause of this "illegal memory access was encountered". The changes have been reverted.

@TommiHonkanen

I pulled the new changes. It didn't fix the error for me, but it changed the error message a little. Now some additional errors are printed along with the illegal memory access.

 (next mAP calculation at 2000 iterations) 
 2000: 1.204594, 1.385511 avg loss, 0.000100 rate, 0.770990 seconds, 128000 images, 2.817951 hours left

 calculation mAP (mean average precision)...
 Detection layer: 30 - type = 28 
 Detection layer: 37 - type = 28 
4
CUDA status error: file: ./src/network_kernels.cu, network_predict_gpu(), line #744

* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected.  Darknet will now exit.
* Errno 11: Resource temporarily unavailable
* Error location: ./src/network_kernels.cu, network_predict_gpu(), line #744
* Error message:  current CUDA error: status=700, an illegal memory access was encountered
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (12 entries):
1/12: ./darknet(log_backtrace+0x38) [0x5604a11c10b8]
2/12: ./darknet(darknet_fatal_error+0x15b) [0x5604a11c12bb]
3/12: ./darknet(check_error+0x5c) [0x5604a11c3ecc]
4/12: ./darknet(check_error_extended+0x83) [0x5604a11c3f93]
5/12: ./darknet(network_predict_gpu+0x166) [0x5604a12c15c6]
6/12: ./darknet(validate_detector_map+0x9af) [0x5604a1253b9f]
7/12: ./darknet(train_detector+0x14ca) [0x5604a12565da]
8/12: ./darknet(run_detector+0xa85) [0x5604a125a5d5]
9/12: ./darknet(main+0x406) [0x5604a117add6]
10/12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7eff0ca29d90]
11/12: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7eff0ca29e40]
12/12: ./darknet(_start+0x25) [0x5604a117d075]

* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected.  Darknet will now exit.
* Error location: ./src/darknet.c, darknet_signal_handler(), line #439
* Error message:  signal handler invoked for signal #11 (Segmentation fault)
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (14 entries):
1/14: ./darknet(log_backtrace+0x38) [0x5604a11c10b8]
2/14: ./darknet(darknet_fatal_error+0x15b) [0x5604a11c12bb]
3/14: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7eff0ca42520]
4/14: /lib/x86_64-linux-gnu/libopencv_core.so.4.5d(_ZN2cv13parallel_for_ERKNS_5RangeERKNS_16ParallelLoopBodyEd+0x277) [0x7eff1c9bcbd7]
5/14: /lib/x86_64-linux-gnu/libopencv_imgproc.so.4.5d(+0x2e7442) [0x7eff1cee7442]
6/14: /lib/x86_64-linux-gnu/libopencv_imgproc.so.4.5d(_ZN2cv3hal11cvtBGRtoBGREPKhmPhmiiiiib+0x262) [0x7eff1ccce552]
7/14: /lib/x86_64-linux-gnu/libopencv_imgproc.so.4.5d(_ZN2cv8cvtColorERKNS_11_InputArrayERKNS_12_OutputArrayEii+0xa9b) [0x7eff1ccb6c1b]
8/14: ./darknet(load_image_mat_cv+0x7b2) [0x5604a1185ee2]
9/14: ./darknet(load_image_mat+0x5f) [0x5604a11860ef]
10/14: ./darknet(load_image_cv+0x34) [0x5604a11861a4]
11/14: ./darknet(load_image+0x32) [0x5604a12080a2]
12/14: ./darknet(load_thread+0x52d) [0x5604a122482d]
13/14: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7eff0ca94b43]
14/14: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7eff0cb26a00]

* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected.  Darknet will now exit.
* Errno 11: Resource temporarily unavailable
* Error location: ./src/darknet.c, darknet_signal_handler(), line #439
* Error message:  signal handler invoked for signal #11 (Segmentation fault)
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (16 entries):
1/16: ./darknet(log_backtrace+0x38) [0x5604a11c10b8]
2/16: ./darknet(darknet_fatal_error+0x15b) [0x5604a11c12bb]
3/16: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7eff0ca42520]
4/16: /lib/x86_64-linux-gnu/libc.so.6(+0x453fc) [0x7eff0ca453fc]
5/16: /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7eff0ca45610]
6/16: ./darknet(darknet_fatal_error+0x165) [0x5604a11c12c5]
7/16: ./darknet(check_error+0x5c) [0x5604a11c3ecc]
8/16: ./darknet(check_error_extended+0x83) [0x5604a11c3f93]
9/16: ./darknet(network_predict_gpu+0x166) [0x5604a12c15c6]
10/16: ./darknet(validate_detector_map+0x9af) [0x5604a1253b9f]
11/16: ./darknet(train_detector+0x14ca) [0x5604a12565da]
12/16: ./darknet(run_detector+0xa85) [0x5604a125a5d5]
13/16: ./darknet(main+0x406) [0x5604a117add6]
14/16: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7eff0ca29d90]
15/16: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7eff0ca29e40]
16/16: ./darknet(_start+0x25) [0x5604a117d075]

@stephanecharette
Collaborator Author

I believe this is now fixed. The illegal memory access was caused by the image memory being freed incorrectly (prematurely) when the image is resized. See commit 1ea2baf.
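
For readers unfamiliar with this class of bug, here is a minimal sketch in plain C of the ownership rule involved. It is not the code from the commit above. The `sketch_image` type and every `sketch_*` function are hypothetical names made up for illustration and do not match Darknet's actual API. The assumed rule: the resize produces a new buffer, and the original buffer is freed only after the resized copy exists, and only when a new buffer was actually created.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical image type for illustration only; not Darknet's struct. */
typedef struct {
    int w, h, c;
    float *data;   /* owned heap buffer holding w*h*c floats */
} sketch_image;

static sketch_image sketch_make_image(int w, int h, int c)
{
    sketch_image im = { w, h, c, NULL };
    im.data = calloc((size_t)w * h * c, sizeof(float));
    assert(im.data != NULL);
    return im;
}

static void sketch_free_image(sketch_image *im)
{
    free(im->data);
    im->data = NULL;   /* a repeated free of this struct becomes a no-op */
}

/* Nearest-neighbour resize into a NEW buffer. The source is only read,
 * never freed here, so the caller keeps ownership of it. */
static sketch_image sketch_resize(const sketch_image *src, int w, int h)
{
    sketch_image dst = sketch_make_image(w, h, src->c);
    for (int k = 0; k < src->c; ++k) {
        for (int y = 0; y < h; ++y) {
            int sy = y * src->h / h;
            for (int x = 0; x < w; ++x) {
                int sx = x * src->w / w;
                dst.data[(k * h + y) * w + x] =
                    src->data[(k * src->h + sy) * src->w + sx];
            }
        }
    }
    return dst;
}

/* Takes ownership of `im` and returns an image of the requested size.
 * The original is released only after the resized copy is complete.
 * When no resize is needed the same buffer is returned and nothing is
 * freed, which avoids handing back a dangling pointer. */
static sketch_image sketch_fit(sketch_image im, int w, int h)
{
    if (im.w == w && im.h == h) {
        return im;
    }
    sketch_image out = sketch_resize(&im, w, h);
    sketch_free_image(&im);
    return out;
}
```

A caller would pass a loaded image to `sketch_fit()` and from then on use only the returned value, releasing it once with `sketch_free_image()`. Freeing the source before or during the resize, or freeing it when the function returned the same buffer, is the kind of mistake that leaves a later reader (for example a prediction call) pointing at released memory.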
