
Robust Benchmarking #2

Closed

quinnj opened this issue Aug 18, 2016 · 10 comments

Comments

@quinnj

quinnj commented Aug 18, 2016

It'd be great to have a robust benchmarking suite/process where the pure Julia versions could be compared against a C library implementation (or two or three). I wonder if @jrevels could help us get something setup.

@simonbyrne
Member

simonbyrne commented Aug 19, 2016

Accuracy testing would also be useful (most functions here should also be defined in MPFR, so we can use that as a reference).

My previous experience has been that the OS X libm is one of the fastest (and also fairly accurate), so might be a good candidate as a reference (though obviously we can't run that on nanosoldier).
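
For the MPFR comparison, something along these lines could serve as a starting point (a rough sketch: on Julia 0.5 erf is in Base and its BigFloat method goes through MPFR; the ulp measure here is only approximate):

x   = rand()
ref = erf(big(x))                                 # BigFloat erf is computed by MPFR
err = abs(big(erf(x)) - ref) / eps(Float64(ref))  # rough error in ulps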

@simonbyrne
Member

Reliable performance regression testing will also be necessary as we start trying to optimise things. Based on my previous experience, small changes can often have large (unforeseen) effects.

@simonbyrne
Member

I've been playing around a bit with performance benchmarking.

It seems that the current approach of benchmarking the vectorised ops (e.g. @benchmark Libm.erf.($X)) incurs a bit of array overhead, and induces a lot of gc noise.
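
For reference, the vectorised form looks roughly like this (a sketch, assuming Libm is loaded and X is a Vector{Float64}); the non-zero allocation figures are where the GC noise comes from:

using BenchmarkTools

X = rand(1000)
t = @benchmark Libm.erf.($X)   # broadcasting allocates a fresh output array every evaluation
memory(t), allocs(t)           # non-zero => allocation/GC time is mixed into the measurement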

It is possible to benchmark the functions directly, e.g.

@benchmark Libm.erf(1.0)

but the problem then is that we're (a) testing only one value, and (b) we hit the nanosecond resolution problem (if each function call takes only 9 nanoseconds, it is hard to detect small performance changes).

We can partially address (a) by doing something like

@benchmark Libm.erf(x) setup=(x = rand())

however this still only tests 1 value per sample, which might be misleading for things like branch prediction, and does not address (b).

The best I have come up with is using reduction operators with Julia's new generator syntax:

@benchmark foo(Libm.erf(x) for x in $X)

An easy choice here is sum: the cost of a floating point addition is fairly minor, however it can be problematic if we hit weird regions (such as subnormals, and I think NaNs can be slow on some processors).
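
Concretely, the sum variant would be:

@benchmark sum(Libm.erf(x) for x in $X)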

We could be even more clever, and do something like

@benchmark mapreduce(identity, (x,y) -> nothing, Libm.erf(x) for x in X)

which reduces the overhead even further, however we need to be careful that LLVM in future doesn't just optimise the whole lot away as a no-op.

@ViralBShah
Member

Shouldn't we compare against extended precision implementations for correctness as well?

@simonbyrne
Member

Shouldn't we compare against extended precision implementations for correctness as well?

Yes, but that discussion is probably better in #16. We can keep this thread for performance benchmarking.

@simonbyrne
Member

The lowest overhead function that I've found which isn't optimised away is:

@benchmark mapreduce(x -> reinterpret(Unsigned,x), |, Libm.erf(x) for x in $X)
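
Putting that together with a C reference (per the original issue), a self-contained comparison could look something like this (a sketch; the ccall targets openlibm, which the versioninfo outputs below show Julia is built against):

using BenchmarkTools

X = rand(1000)

# pure-Julia implementation
@benchmark mapreduce(x -> reinterpret(Unsigned, x), |, (Libm.erf(x) for x in $X))

# C reference from openlibm (already loaded when Julia reports LIBM: libopenlibm)
cerf(x::Float64) = ccall((:erf, "libopenlibm"), Float64, (Float64,), x)
@benchmark mapreduce(x -> reinterpret(Unsigned, x), |, (cerf(x) for x in $X))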

@musm
Collaborator

musm commented Sep 30, 2016

Benchmarking is giving me a lot of headaches: it seems wildly inconsistent because it depends on many different factors (CPU, etc.) and on the whole benchmarking procedure.

This makes it really hard to tell what's good enough or what needs more work.

one machine:
[benchmark results screenshot]
julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

another machine:
[benchmark results screenshot]
julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
System: NT (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

@vchuravy

@jrevels has put in a lot of work to make benchmarking as robust as possible for Base, and it is very tricky to get this right. Jarrett, how tricky would it be to teach nanosoldier about Libm.jl?

@jrevels

jrevels commented Sep 30, 2016

how tricky would it be to teach nanosoldier about Libm.jl?

I'd have to make Nanosoldier capable of tracking multiple repos at a time if you'd want to use our current hardware. JuliaCI/Nanosoldier.jl#18 is probably a good issue for more discussion of this.

Benchmarking is giving me a lot of headaches: it seems wildly inconsistent because it depends on many different factors (CPU, etc.)

Benchmarking can definitely be a headache-inducing endeavor. Differences between platforms aren't unexpected. The benchmarks should be consistent between runs on a single platform, though - are you experiencing problems in that vein? This document might help if you're benchmarking on Linux.

the whole benchmarking procedure.

I'd be interested to hear more, if you're able to provide specifics here. It might be more useful to do your benchmarking interactively instead of running all of them in the benchmark definition script. You also might check out https://github.com/JuliaCI/PkgBenchmark.jl.
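
For reference, a minimal PkgBenchmark-style suite might look something like this (a sketch; the exact file name and entry point depend on the PkgBenchmark version, but the suite itself is plain BenchmarkTools):

# benchmark/benchmarks.jl (hypothetical layout)
using BenchmarkTools, Libm

const SUITE = BenchmarkGroup()
const X = rand(1000)

# one entry per Libm function under test
SUITE["erf"] = @benchmarkable mapreduce(x -> reinterpret(Unsigned, x), |,
                                        (Libm.erf(x) for x in $X))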

Since I'm already here: I hit a roadblock in my other research, so I'm taking a break from it today by reviving and finishing up JuliaCI/BenchmarkTools.jl#12, which will enable picosecond timing resolution.

@musm
Collaborator

musm commented Oct 1, 2016

Benchmarking can definitely be a headache-inducing endeavor. Differences between platforms aren't unexpected. The benchmarks should be consistent between runs on a single platform, though - are you experiencing problems in that vein? This document might help if you're benchmarking on Linux.

Indeed. Especially since parts of this library's performance depend strongly on whether hardware FMA is available, which adds another variable to the mix. Typically I don't see huge differences between multiple benchmark runs on the same machine, but occasionally I do between different days.
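
One quick way to check which case a given machine falls into (a sanity check, not a substitute for benchmarking) is to inspect the native code for muladd and look for a fused instruction:

# if the CPU (and Julia build) support FMA, this should show a fused
# multiply-add instruction (e.g. vfmadd... on x86); otherwise a separate
# multiply and add
@code_native muladd(1.0, 2.0, 3.0)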

I'll have to look at PkgBenchmark and think about using that instead of the script I have.

Since I'm already here: I hit a roadblock in my other research, so I'm taking a break from it today by reviving and finishing up JuliaCI/BenchmarkTools.jl#12, which will enable picosecond timing resolution.

Very glad to hear! This'll make benchmarking so much easier at single test points, so I'm looking forward to its merge.

musm closed this as completed Jan 17, 2017