numerical issues in the mish implementation #5452
YOLOv4 support was added to the CUDA backend in OpenCV. While profiling darknet and OCV to find the performance discrepancy, I noticed that Darknet's bias and mish kernels together take 3.3x more time than OpenCV's fused bias activation kernel. The convolution itself, on the other hand, seems to be faster in Darknet. Here are some stats for the first convolution layer in YOLOv4 on a GTX 1050 for a 608x608 image.
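For illustration, here is a minimal sketch of what a fused bias + mish kernel can look like (my own simplification; the names, layout, and launch parameters are assumptions, not OpenCV's actual kernel):

```cuda
// Hypothetical fused bias + mish kernel: one pass over memory instead of two.
// Assumes NCHW layout; `inner_size` = H*W and `bias` holds one value per channel.
__global__ void fused_bias_mish(float* data, const float* bias,
                                int channels, int inner_size, int total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= total)
        return;

    int c = (i / inner_size) % channels;
    float x = data[i] + bias[c];

    // mish(x) = x * tanh(softplus(x)); log1pf keeps softplus accurate for
    // very negative x (see the discussion of log1p later in the thread).
    data[i] = x * tanhf(log1pf(expf(x)));
}
```

Doing the bias add and the activation in a single kernel halves the global-memory traffic compared to two separate element-wise passes, which is what matters for a bandwidth-bound step.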
There are some neat approximations to mish which can be considerably faster: https://cs.stackexchange.com/questions/125002/fast-and-stable-x-tanhlog1pexpx-computation
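For reference, the function being approximated is

$$\operatorname{mish}(x) = x \tanh\bigl(\operatorname{softplus}(x)\bigr) = x \tanh\bigl(\ln(1 + e^{x})\bigr).$$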
Do you mean that
What is SM SOL?
Yes, in the first few layers.
NVIDIA Nsight Compute (profiled all kernels that were executed with
The numbers I reported are for the first convolution layer in YOLOv4. The first 4 kernels that are executed by darknet are:
The total darknet inference time is around 116ms on my device.
OpenCV is executing YOLOv4 roughly 15% faster than Darknet on my device. On profiling, at least on my device, the bias and mish kernels seem to be much slower than their OpenCV equivalents. Darknet's convolution is faster only in the initial few layers; it eventually becomes slower at some point.
The accepted answer is nearly as accurate as using
I think it's possible to come up with accurate fast approximations for the gradient.
Streaming Multiprocessor Speed Of Light. It's a measure of compute usage relative to the maximum theoretical performance: 100% indicates that compute usage has reached the maximum theoretical compute performance. Bias and activation steps are generally expected to be bandwidth-bound kernels, so they should have high memory utilization and low compute utilization. The darknet kernels seem to have high compute usage, which is preventing maximal utilization of the memory bandwidth (because enough memory requests aren't issued fast enough to keep the memory subsystem busy).
Do you suggest using one of these implementations?

```cuda
__device__ float mish_njuffa(float x)
{
    float r;
    float e = expf(x);
    // r = -2 / (e^2 + 2e + 2), evaluated with fused multiply-adds
    r = 1.0f / fmaf(fmaf(-0.5f, e, -1.0f), e, -1.0f);
    // x + r*x = x * (e^2 + 2e) / (e^2 + 2e + 2) = x * tanh(softplus(x))
    r = fmaf(r, x, x);
    return r;
}

__device__ float mish_yashas(float x)
{
    auto e = __expf(x);
    if (x <= -18.0f)
        return x * e;                        // mish(x) ~ x * exp(x) for very negative x
    auto n = e * e + 2 * e;                  // n = exp(2x) + 2*exp(x)
    if (x <= -5.0f)
        return x * __fdividef(n, n + 2);     // x * tanh(softplus(x)) = x * n / (n + 2)
    return x - 2 * __fdividef(x, n + 2);     // same value algebraically; stays finite when n overflows
}
```
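Both variants rest on the same algebraic identity. Writing $e = e^{x}$ and $n = e^{2} + 2e$,

$$\tanh\bigl(\operatorname{softplus}(x)\bigr) = \tanh\bigl(\ln(1+e)\bigr) = \frac{(1+e)^{2} - 1}{(1+e)^{2} + 1} = \frac{e^{2} + 2e}{e^{2} + 2e + 2} = \frac{n}{n+2},$$

so $\operatorname{mish}(x) = x\,\frac{n}{n+2} = x - \frac{2x}{n+2}$, which is exactly what the branches above evaluate using `expf`/`__expf`, fused multiply-adds, and `__fdividef`.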
Did you calculate it by using the TFLOPS from the GPU specification, the number of operations in your function, and the execution time? Is this your own definition? I cannot find it even on Google.
Nope. Mish is not being changed at all. Mish is still mish. The mathematical formulation is still the same. The difference is in the implementation. The fast implementations are very good approximations to mish (errors in the range of a few ULPs). Even using
Not really. The trained models will still work. YOLOv4 is giving really good detections in OpenCV with the approximation. The max relative error for a sample image I used was very small.

I had written some code while designing and testing the approximations. You can find it here. With it, you can actually see how close the approximations are to the original mish function by comparing individual places in the mantissa. From memory, the approximations are practically identical for numbers below -20 and numbers above -2. There are small differences for some values in the range [-20, -2], and these differences are often in the last few significant digits.
I don't think a less accurate implementation (at least the ones we are talking about currently) will change the accuracy. They are still more or less mish. Any improvements or losses could just be noise. It's my intuition and I could be wrong.
Mine is faster but less accurate. I have put it in OpenCV because it doesn't seem to alter the results, and the approximation in the CUDA backend is still more accurate than OpenCV's CPU implementation. Mine gives at least 5 decimal places of precision if I remember correctly. Njuffa's implementation has errors only in the 7th decimal place, which is almost perfect, but it's a tad slower. I don't have much experience in DL, so I don't know which is best, but given that people use half precision for training (which is far less accurate than the fp32 approximations presented here), I think these single-precision approximations should do really well.
NVIDIA Nsight Compute does all the calculation and reports it.
No, it's NVIDIA's metric. This article explains it.
Comparisons of the double-precision version vs darknet vs njuffa vs mine: https://gist.github.com/YashasSamaga/3fdf001d32f04062e3f36495d5c962db

Note that even the direct double-precision implementation becomes less accurate after a certain point. The differences you see near -100 are there because the direct double-precision version is less accurate than the approximations in those ranges. You can see darknet's accuracy falling rapidly beyond a certain point on the negative side. All of them are the same in the positive half-plane (probably bitwise identical); the differences are primarily in the negative half-plane.

Ideally, I should have compared the accuracy against a multi-precision library instead of using a double-precision reference, but I wasn't able to set one up on CUDA. I had to use CUDA for testing because I was using the CUDA intrinsics, which I cannot test on a CPU.
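As an aside, here is a minimal sketch of the kind of comparison harness being described (my own reconstruction for illustration, not the code in the gist; `mish_fast` is just a stand-in for whichever single-precision variant is under test):

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Stand-in for whichever single-precision implementation is being checked.
__device__ float mish_fast(float x)
{
    float e = __expf(x);
    float n = e * e + 2 * e;
    if (x <= -0.6f)
        return x * __fdividef(n, n + 2);
    return x - 2 * __fdividef(x, n + 2);
}

__global__ void eval_mish(const float* xs, float* ys, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        ys[i] = mish_fast(xs[i]);
}

int main()
{
    const int count = 1 << 20;                       // ~1M samples over [-100, 100]
    std::vector<float> xs(count), ys(count);
    for (int i = 0; i < count; i++)
        xs[i] = -100.0f + 200.0f * i / (count - 1);

    float *d_xs, *d_ys;
    cudaMalloc(&d_xs, count * sizeof(float));
    cudaMalloc(&d_ys, count * sizeof(float));
    cudaMemcpy(d_xs, xs.data(), count * sizeof(float), cudaMemcpyHostToDevice);
    eval_mish<<<(count + 255) / 256, 256>>>(d_xs, d_ys, count);
    cudaMemcpy(ys.data(), d_ys, count * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against an fp64 reference evaluated on the host.
    double max_rel = 0.0;
    for (int i = 0; i < count; i++) {
        double x = xs[i];
        double ref = x * std::tanh(std::log1p(std::exp(x)));
        if (ref != 0.0)
            max_rel = std::max(max_rel, std::fabs((ys[i] - ref) / ref));
    }
    std::printf("max relative error: %g\n", max_rel);

    cudaFree(d_xs);
    cudaFree(d_ys);
    return 0;
}
```

Swapping `mish_fast` for any of the variants above gives a quick relative-error figure against the fp64 reference.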
As I understand it, all these approximations do not correspond strictly/analytically to the original Mish formula. I mean that the values of your mish implementation are closer to the fp64 mish than Darknet's mish. Does your
Did you check whether we have the same issue for the derivative? Maybe we have the same issue for training too? Do we just need to use darknet/src/activation_kernels.cu Line 43 in 2500278
Or should we change anything else?
Yes, the new mish is faster, gives the same detection accuracy (AP), and gives a smoother and more accurate curve. Can you please check whether we have any problem with the derivative, and whether we should solve it?
I tested the new Mish implementation
If darknet is trying to implement the gradient of the analytical mish activation, then I think using
I have never even written the derivative of mish on paper. I haven't checked if there are any issues with the derivative. I'll look into it.
The problems with the forward pass itself might affect training, since the forward pass computation affects the outputs of the layers that follow it (including the loss and, eventually, the gradients). The precision of the current mish implementation is really poor in part of the negative range.

And then there could be numerical problems in the gradient implementation itself.

There is also a possibility that analytical mish is bad and the current darknet implementation of mish is better than the analytical mish activation. If that's the case, then there is a new activation which is better than mish!
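For reference, the analytical gradient that the implementation is supposed to match is, with $\sigma$ the logistic sigmoid and $\operatorname{sp}$ the softplus,

$$\frac{d}{dx}\operatorname{mish}(x) = \tanh\bigl(\operatorname{sp}(x)\bigr) + x\,\sigma(x)\,\Bigl(1 - \tanh^{2}\bigl(\operatorname{sp}(x)\bigr)\Bigr),$$

using $\operatorname{sp}'(x) = \sigma(x) = \frac{1}{1+e^{-x}}$ and $\tanh'(u) = 1 - \tanh^{2}(u)$.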
Yes, softplus should return
I had experimented with YOLOv4 two weeks ago. If I remember correctly, the dynamic range of the activations happens to fall within a fairly limited interval. The gradient would become really small for large negative numbers, so it probably doesn't matter what happens there.

Looks like I am really slow at replying; this reply was written without looking at your latest reply.
I'm looking into it.
darknet/src/activation_kernels.cu Lines 322 to 346 in 36c73c5
The introduction of
This has numerical problems in some cases. I have compared these three:
Code and results here: https://gist.github.com/YashasSamaga/1572a1b446c66594000fc1e59d82158d

The current darknet gradient has problems. I don't know how better numerical accuracy will translate to the actual performance of the models; I hope it causes some improvement.
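A sketch of what a numerically safer gradient can look like, computing softplus with `log1pf` and reusing the analytical form derived above (my own illustration, not necessarily the exact fix that went into darknet; the cutoff value is an assumption):

```cuda
// Gradient of mish(x) = x * tanh(sp(x)) with sp(x) = softplus(x):
//   mish'(x) = tanh(sp) + x * sigmoid(x) * (1 - tanh(sp)^2)
__device__ float mish_grad_stable(float x)
{
    // Thresholded softplus: exact limits at both ends of the range,
    // log1pf in the middle avoids the precision loss of logf(1 + expf(x)).
    const float threshold = 20.0f;            // assumed cutoff
    float sp;
    if (x > threshold)       sp = x;          // softplus(x) ~ x for large x
    else if (x < -threshold) sp = expf(x);    // softplus(x) ~ exp(x) for very negative x
    else                     sp = log1pf(expf(x));

    float tsp = tanhf(sp);
    float sigmoid = 1.0f / (1.0f + expf(-x)); // sp'(x)
    return tsp + x * sigmoid * (1.0f - tsp * tsp);
}
```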
@YashasSamaga Thanks! @WongKinYiu Hi, I fixed Mish-activation and Mish-gradient for Training and Detection:
It doesn't reduce accuracy (AP) during inference, but increases FPS by about 3%: #5452 (comment). It's closer to the original Mish formula. There is a difference in the derivative (gradient) only for
By the way, when I compared the mish implementations, the njuffa version I used was:
This is the correct version but it is slightly slower. The other short version he gave gives zeros below
Hi, @AlexeyAB and @YashasSamaga! I came here by way of the CS StackExchange question. Thought I'd give you a heads up that I added an answer there. I also saw #5922, which points to this code. That code is branch-free, and does use

TL;DR: I would suggest one of these two branch-free implementations:
Or:
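The two suggested implementations themselves aren't reproduced above; as a rough illustration of the branch-free idea only (my own sketch, not armadillojim's code), one can build the whole thing on `log1pf` and let IEEE special values handle the extremes:

```cuda
// Branch-free mish: softplus via log1pf keeps precision for very negative x.
// For x large enough that expf overflows to +inf, log1pf(+inf) = +inf and
// tanhf(+inf) = 1, so the result degrades gracefully to x.
__device__ float mish_branch_free(float x)
{
    return x * tanhf(log1pf(expf(x)));
}
```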
Follow up: #5922 also points to:
@armadillojim Looking at your error graph, I think I have messed up the thresholds. Thanks for pointing that out.
@armadillojim
The compute usage is least in
And they might not be the same in terms of precision. I haven't checked but my intuition suggests that
@AlexeyAB There is a faster and more accurate mish implementation:
* L2 norm of the error vector for 16 million+ activations uniformly sampled from the ranges listed below

Code: https://gist.github.com/YashasSamaga/8ad0cd3b30dbd0eb588c1f4c035db28c

The accuracy has improved by an entire unit place. There was a bug in my threshold-finding code. Thanks to the error graph in @armadillojim's answer (I wouldn't have cross-checked otherwise).
Raw accuracy output (with other implementations):
For [-100, -80]:
For [-80, -20]:
For [-20, 0]:
For [0, 100]:
The performance improvement is not significant as the activation is not compute-bound. But the reduction in compute usage makes more room for other compute-heavy operations to be fused with the activation (like the division step in bias addition). There will probably be improvements in the timings of fused operations in OpenCV.
@YashasSamaga Hi, Thanks!

```cuda
__device__ float mish_yashas2(float x)
{
    float e = __expf(x);
    float n = e * e + 2 * e;
    if (x <= -0.6f)
        return x * __fdividef(n, n + 2);
    return x - 2 * __fdividef(x, n + 2);
}
```
darknet/src/activation_kernels.cu Lines 40 to 44 in 6cbb75d
darknet/src/activation_kernels.cu Line 238 in 6cbb75d
This implementation has numerical issues due to the use of `log(1 + exp(x))`.

`log(1 + exp(x))` suffers from numerical problems for small values of `exp(x)`. There is a significant loss of precision in the range `[-20, -10]`. It keeps losing precision as `x` gets more and more negative and eventually results in zeros for `x` in the range `[-20, -16.6]`. The precision is regained below `-20` as `softplus` switches to `exp(x)` for that range.

The standard library has `log1p`, which takes care of this issue. CUDA also provides `log1pf` for single-precision floats.
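A minimal sketch of the difference (a standalone illustration, not darknet's actual `softplus_kernel`; the threshold value is an assumption):

```cuda
// Naive softplus: for very negative x, 1.0f + expf(x) rounds to 1.0f and
// logf(1.0f) returns exactly 0 -- the precision loss described above.
__device__ float softplus_naive(float x)
{
    return logf(1.0f + expf(x));
}

// log1pf-based softplus: log1pf(e) stays accurate even when e is far below
// machine epsilon, so the result only becomes 0 once expf(x) itself underflows.
__device__ float softplus_log1p(float x)
{
    const float threshold = 20.0f;             // assumed cutoff
    if (x > threshold)  return x;              // softplus(x) ~ x for large x
    if (x < -threshold) return expf(x);        // softplus(x) ~ exp(x) for very negative x
    return log1pf(expf(x));                    // accurate in between
}
```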