Overflow encountered in exp after 10 iters, and Segmentation fault (core dumped) after 40 iters. #20
Comments
Hi @maxenceliu, how long does it take per iteration?
2-3 seconds per iteration on a GTX 1080.
After the newest commit, total_loss explodes after 350 iterations because rpn_cls_loss explodes.
The result is not stable. This time, regular_loss became NaN after 500 iterations...
I also encounter the total_loss explosion when trying the newest commit. I implemented a Caffe version of Mask R-CNN and hit the same problem. Here is the loss with the newest commit:
    iter 584: image-id:0262213, time:0.359(sec), regular_loss: 0.177546, total-loss 739.0580(47.7280, 112.5653, 1.444757, 577.0145, 0.3054), instances: 1, batch:(1|33, 1|19, 1|1)
    iter 585: image-id:0534559, time:0.429(sec), regular_loss: 0.355617, total-loss nan(nan, 1685118073183372762735414607872.0000, nan, 713696030880020606040835379691520.0000, 4810318291543261184.0000), instances: 16, batch:(128|528, 14|18, 14|14)
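To pin down which of the five bracketed terms (presumably rpn_cls, rpn_box, cls, box, and mask loss) goes non-finite first, one option is to wrap each loss in tf.check_numerics so the session fails with a named error instead of printing nan. This is only a minimal sketch against the TF 1.x API; the placeholder tensors below are stand-ins for whatever loss tensors the repo's graph actually builds:

```python
import tensorflow as tf

# Hypothetical stand-ins for the five loss terms printed in the log above.
# Replace them with the real loss tensors from the training graph.
rpn_cls_loss = tf.placeholder(tf.float32, shape=[], name='rpn_cls_loss')
rpn_box_loss = tf.placeholder(tf.float32, shape=[], name='rpn_box_loss')
cls_loss     = tf.placeholder(tf.float32, shape=[], name='cls_loss')
box_loss     = tf.placeholder(tf.float32, shape=[], name='box_loss')
mask_loss    = tf.placeholder(tf.float32, shape=[], name='mask_loss')

losses = {'rpn_cls': rpn_cls_loss, 'rpn_box': rpn_box_loss,
          'cls': cls_loss, 'box': box_loss, 'mask': mask_loss}

# tf.check_numerics makes session.run raise an InvalidArgumentError naming
# the first term that contains NaN/Inf, instead of silently propagating nan.
checked = [tf.check_numerics(t, message=name) for name, t in losses.items()]
total_loss = tf.add_n(checked)
```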
I have got a similar issue at iter 493; it seems to be caused by an RPN loss explosion. We might need to double-check the box matching strategy. The error: [screenshot of the error output]
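Until the box matching is verified, clipping the global gradient norm at least keeps a single bad RPN batch from blowing up the weights. This is a sketch assuming a standard TF 1.x optimizer setup, not the repo's actual training code; the toy loss only makes the snippet self-contained, and the clip_norm value is arbitrary:

```python
import tensorflow as tf

# Toy stand-in; in the real trainer, total_loss is the sum printed in the log.
weights = tf.Variable(tf.random_normal([10]))
total_loss = tf.reduce_sum(tf.square(weights))

optimizer = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
grads_and_vars = optimizer.compute_gradients(total_loss)

# Clip the global gradient norm so one exploding batch cannot push the
# weights into a regime where the losses overflow.
grads, variables = zip(*grads_and_vars)
grads, _ = tf.clip_by_global_norm(list(grads), clip_norm=10.0)
train_op = optimizer.apply_gradients(list(zip(grads, variables)))
```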
I have experienced the same issue, then updated to cuda_8.0.61_375.26 and cuDNN 5.1, and it went away. It could be sporadic too.
I confirm that upgrading to CUDA 8.0 fixes the problem. Thank you very much @opikalo.
I also encountered this problem. After 29683 iters, it gives warnings:
Then, at iter 29684, the loss becomes unusual:
NaN happens... @opikalo @Nikasa1889 My CUDA version is already 8.0 and my cuDNN is 5.1.
I've gotten the overflow error quite a few times, all without changing anything. It seems that the overflow errors occur randomly, possibly caused by poor convergence in the weights. Unfortunately, the trick for now is simply to restart training and hope it doesn't overflow again; that's working for me so far. @Kongsea, it seems like you got pretty lucky reaching 29000+ iters before seeing overflow; my first overflow was at < 1000 iters.
I would suggest changing the checkpoint interval in train/train.py from 10000 to a smaller value, e.g. 3000, so that you have more checkpoints to fall back on in case of an overflow error.
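For reference, this is the kind of change being suggested. It is a generic TF 1.x training-loop sketch, not the repo's actual train/train.py; the dummy op and paths exist only to make it runnable:

```python
import os
import tensorflow as tf

CKPT_EVERY = 3000   # smaller interval = less training lost when a run overflows
MAX_ITERS = 50000
CKPT_DIR = './output/ckpt'   # hypothetical path, not the repo's default

os.makedirs(CKPT_DIR, exist_ok=True)

# Dummy op standing in for the real Mask R-CNN train step.
step_var = tf.Variable(0, trainable=False, name='dummy_step')
train_op = tf.assign_add(step_var, 1)

saver = tf.train.Saver(max_to_keep=5)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1, MAX_ITERS + 1):
        sess.run(train_op)
        if step % CKPT_EVERY == 0:
            saver.save(sess, os.path.join(CKPT_DIR, 'model'), global_step=step)
```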
Can you "reproduce" the random occurance with a certain seed for the random initialization? |
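Fixing the seeds would make the failure at least partially repeatable; a sketch of the usual TF 1.x incantation, with an arbitrary seed value:

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary; what matters is that every run uses the same value

random.seed(SEED)
np.random.seed(SEED)       # data shuffling / augmentation done in NumPy
tf.set_random_seed(SEED)   # graph-level seed for TF weight initializers
```

Note that even with fixed seeds, some cuDNN kernels are non-deterministic, so two runs can still diverge after enough iterations.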
I had a similar error. My CUDA is 8.0 and cuDNN is 5.1. export PATH=/usr/local/cuda-8.0/bin:$PATH
Check my comment here.
Has anyone encountered these two problems and fixed them?
Overflow in exp after 10 iters, and Segmentation fault after 40 iters.
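For the first problem: the "overflow encountered in exp" warning typically comes from exponentiating unbounded width/height deltas when decoding boxes. Assuming the decode step looks roughly like the standard Faster R-CNN bbox_transform_inv (which may not match this repo's exact code), clamping the deltas before np.exp suppresses the overflow; the clip value below is the one used in common Faster R-CNN implementations, not taken from this repo:

```python
import numpy as np

# exp(log(1000/16)) = 62.5, so a decoded box can grow at most ~62x per step.
# This bound is borrowed from common Faster R-CNN code, not from this repo.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def decode_wh(anchor_w, anchor_h, dw, dh):
    """Decode width/height deltas, clamping them to avoid overflow in exp."""
    dw = np.minimum(dw, BBOX_XFORM_CLIP)
    dh = np.minimum(dh, BBOX_XFORM_CLIP)
    return anchor_w * np.exp(dw), anchor_h * np.exp(dh)

# Example: an extreme predicted delta no longer overflows.
print(decode_wh(16.0, 16.0, dw=200.0, dh=200.0))   # -> (1000.0, 1000.0)
```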