Improvements to Julia version of lap3dkernel #7
Conversation
Dear Alex,
Disclaimer #1: I am not a SIMD expert in Julia or any other language :-). With that in mind, here are some possible explanations.
— I guess having the Julia and the C++ code run at roughly the same speed should not be a big surprise, provided both are appropriately vectorized. This was not the case with the prior version, which (as you can easily check) runs at the same speed with or without the @simd macro. I did not bother to check, but I suspect the assembly generated for the innermost loop is very similar in both cases.
— The speed difference between single and double precision can probably be explained by the following observation. The function lap3dkernel is compute-bound, but adding two floats (single precision) is not significantly faster than adding two doubles (double precision). What you gain by going to single precision is the ability to pack more elements into a vector register, so if your code is vectorized you can do twice as many operations with a single instruction. The fact that a significant difference is observed only with the new version of the code is consistent with this, since the new version is the only one that is actually vectorized!
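The lane-count argument can be made concrete with SIMD.jl's `Vec` type. This is an illustrative sketch, assuming 256-bit (AVX2) registers; it is not code from the repository:

```julia
using SIMD

# At a fixed register width, Float32 gets twice as many lanes as Float64,
# so one packed instruction does twice the work in single precision.
v32 = Vec{8,Float32}(1.0f0)   # 8 lanes fill one 256-bit register
v64 = Vec{4,Float64}(1.0)     # only 4 lanes at the same width

s32 = v32 + v32               # one instruction: 8 single-precision adds
s64 = v64 + v64               # one instruction: 4 double-precision adds
```

On an AVX-512 machine the lane counts double again (16 and 8), which is why the choice of vector width matters for portability.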
— As for the 11 digits… I don’t have any good guesses at the moment.
Regards,
Luiz
On Sep 7, 2020, at 5:36 PM, Alex Barnett wrote:
Dear Luiz,
Thanks for this. Looks like you are a SIMD expert in Julia - very useful. Indeed this matches the VCL C++ lib with standard sqrt, in double precision. Single precision is the same as double except for your manual SIMD, which is twice as fast as the custom rsqrt tweak to VCL ... this is a surprise! I'm confused why the answer matches to 11 digits for a single-precision calc here, which presumably doesn't sum in exactly the same order. Any thoughts?
Here's the 2017 i7 laptop:
Result with type Float32:
targ-vec: 100000000 src-targ pairs, ans: 92799.578125
time 1.54 s 0.065 Gpair/s
devec: 100000000 src-targ pairs, ans: 92799.578125
time 0.405 s 0.247 Gpair/s
devec par: 100000000 src-targ pairs, ans: 92799.578125
time 0.0975 s 1.03 Gpair/s
devec par new: 100000000 src-targ pairs, ans: 92799.578125
time 0.0112 s 8.9 Gpair/s
Result with type Float64:
targ-vec: 100000000 src-targ pairs, ans: 63886.595569
time 1.78 s 0.0563 Gpair/s
devec: 100000000 src-targ pairs, ans: 63886.595569
time 0.3 s 0.333 Gpair/s
devec par: 100000000 src-targ pairs, ans: 63886.595569
time 0.0783 s 1.28 Gpair/s
devec par new: 100000000 src-targ pairs, ans: 63886.595569
time 0.0372 s 2.69 Gpair/s
I am adding your results to the main table.
I'm also trying an AVX-512 desktop but don't have Julia on it yet.
Anyway, thanks for showing that Julia manual SIMD is getting to a similar level to C++ and VCL manual SIMD!
Best, Alex
This PR improves the Julia version of the lap3dkernel function. It relies on the package SIMD.jl to perform "explicit" vectorization. On my machine, this brings the performance close to that observed for the vectorized C++ code.
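To illustrate the approach, here is a hedged sketch of what an explicitly vectorized Laplace 3-D potential sum might look like with SIMD.jl. The function name, argument names, and the 4-lane Float64 width are illustrative assumptions, not taken from the PR:

```julia
using SIMD

# Sum 1/r from many sources to one target point, 4 Float64 lanes at a time.
function lap3d_point(sx::Vector{Float64}, sy::Vector{Float64},
                     sz::Vector{Float64}, tx::Float64, ty::Float64, tz::Float64)
    N = 4                               # lanes per 256-bit register (assumed)
    acc = Vec{N,Float64}(0.0)
    i = 1
    @inbounds while i + N - 1 <= length(sx)
        dx = vload(Vec{N,Float64}, sx, i) - tx   # packed loads and subtracts
        dy = vload(Vec{N,Float64}, sy, i) - ty
        dz = vload(Vec{N,Float64}, sz, i) - tz
        r2 = dx*dx + dy*dy + dz*dz
        acc += 1.0 / sqrt(r2)           # packed sqrt + divide
        i += N
    end
    pot = sum(acc)                      # horizontal reduction of the lanes
    @inbounds for j in i:length(sx)     # scalar remainder loop
        r2 = (sx[j]-tx)^2 + (sy[j]-ty)^2 + (sz[j]-tz)^2
        pot += 1.0 / sqrt(r2)
    end
    return pot
end
```

Note that the vectorized accumulation sums the pairs in a different order than a scalar loop would, so bit-identical answers between versions should not be expected in general.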