Improvements to Julia version of lap3dkernel #7
Conversation
Dear Alex,
Disclaimer #1: I am not a SIMD expert in Julia or any other language :-). With that in mind, here are some possible explanations.
— I guess having the Julia and the C++ code run at roughly the same speed should not be a big surprise, provided both are appropriately vectorized. This was not the case with the prior version, which (as you can easily check) runs at the same speed with or without the @simd macro. I did not bother to check, but I suspect the assembly generated for the innermost loop is very similar in both cases.
— The speed difference between single and double precision can probably be explained by the following observation. The function lap3dkernel is compute-bound, but adding two floats (single precision) is not significantly faster than adding two doubles (double precision). What you gain by going to single precision is the ability to pack more elements into a vector register, so if your code is vectorized you can do twice as many operations with a single instruction. The fact that a significant difference is observed only with the new version of the code is consistent with this, since the new version is the only one that is actually vectorized!
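The lane-count argument can be made concrete with SIMD.jl's `Vec` type. This is an illustrative sketch, assuming 256-bit (AVX2) registers; it is not code from the repository:

```julia
using SIMD

# At a fixed register width, Float32 gets twice as many lanes as Float64,
# so one packed instruction does twice the work in single precision.
v32 = Vec{8,Float32}(1.0f0)   # 8 lanes fill one 256-bit register
v64 = Vec{4,Float64}(1.0)     # only 4 lanes at the same width

s32 = v32 + v32               # one instruction: 8 single-precision adds
s64 = v64 + v64               # one instruction: 4 double-precision adds
```

On an AVX-512 machine the lane counts double again (16 and 8), which is why the choice of vector width matters for portability.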
— As for the 11 digits… I don’t have any good guesses at the moment.
Regards,
Luiz
On Sep 7, 2020, at 5:36 PM, Alex Barnett wrote:
Dear Luiz,
Thanks for this. Looks like you are a SIMD expert in Julia - very useful. Indeed this matches the VCL C++ lib with standard sqrt, in double precision. Single precision is the same as double except for your manual SIMD, which is twice as fast as the custom rsqrt tweak to VCL ... this is a surprise! I'm confused why the answer matches to 11 digits for a single-precision calc here, which presumably doesn't sum in exactly the same order. Any thoughts?
Here's the 2017 i7 laptop:
Result with type Float32:
targ-vec: 100000000 src-targ pairs, ans: 92799.578125
time 1.54 s 0.065 Gpair/s
devec: 100000000 src-targ pairs, ans: 92799.578125
time 0.405 s 0.247 Gpair/s
devec par: 100000000 src-targ pairs, ans: 92799.578125
time 0.0975 s 1.03 Gpair/s
devec par new: 100000000 src-targ pairs, ans: 92799.578125
time 0.0112 s 8.9 Gpair/s
Result with type Float64:
targ-vec: 100000000 src-targ pairs, ans: 63886.595569
time 1.78 s 0.0563 Gpair/s
devec: 100000000 src-targ pairs, ans: 63886.595569
time 0.3 s 0.333 Gpair/s
devec par: 100000000 src-targ pairs, ans: 63886.595569
time 0.0783 s 1.28 Gpair/s
devec par new: 100000000 src-targ pairs, ans: 63886.595569
time 0.0372 s 2.69 Gpair/s
I am adding your results to the main table.
I'm also trying an AVX-512 desktop but don't have Julia on it yet.
Anyway, thanks for showing that Julia manual SIMD is getting to a similar level to C++ and VCL manual SIMD!
Best, Alex
This PR improves the Julia version of the lap3dkernel function. It relies on the package SIMD.jl to perform "explicit" vectorization. On my machine, this brings the performance close to that observed for the vectorized C++ code.
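To illustrate the approach, here is a hedged sketch of what an explicitly vectorized Laplace 3-D potential sum might look like with SIMD.jl. The function name, argument names, and the 4-lane Float64 width are illustrative assumptions, not taken from the PR:

```julia
using SIMD

# Sum 1/r from many sources to one target point, 4 Float64 lanes at a time.
function lap3d_point(sx::Vector{Float64}, sy::Vector{Float64},
                     sz::Vector{Float64}, tx::Float64, ty::Float64, tz::Float64)
    N = 4                               # lanes per 256-bit register (assumed)
    acc = Vec{N,Float64}(0.0)
    i = 1
    @inbounds while i + N - 1 <= length(sx)
        dx = vload(Vec{N,Float64}, sx, i) - tx   # packed loads and subtracts
        dy = vload(Vec{N,Float64}, sy, i) - ty
        dz = vload(Vec{N,Float64}, sz, i) - tz
        r2 = dx*dx + dy*dy + dz*dz
        acc += 1.0 / sqrt(r2)           # packed sqrt + divide
        i += N
    end
    pot = sum(acc)                      # horizontal reduction of the lanes
    @inbounds for j in i:length(sx)     # scalar remainder loop
        r2 = (sx[j]-tx)^2 + (sy[j]-ty)^2 + (sz[j]-tz)^2
        pot += 1.0 / sqrt(r2)
    end
    return pot
end
```

Note that the vectorized accumulation sums the pairs in a different order than a scalar loop would, so bit-identical answers between versions should not be expected in general.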