Notes.txt
Visual Studio compiler flags:
/W4
/WX
/sdl ???
/O2 in Release mode (same as /Og /Oi /Ot /Oy /Ob2 /GF /Gy)
/GL in Release mode, does it make any difference?
/arch:AVX2 faster than /arch:AVX, mainly due to vectorized exp() in BoostTrainer
/fp:fast much faster than /fp:strict and /fp:precise
/std:c++17
/permissive-
/openmp
/Zc:__cplusplus pdqsort uses __cplusplus to detect C++11
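A minimal sketch of why /Zc:__cplusplus matters for code that tests __cplusplus. Without the flag, MSVC reports 199711L regardless of /std, so a check like the one below would take the pre-C++11 path. The function names are hypothetical and are not taken from pdqsort.

```cpp
// Value of __cplusplus as seen by this translation unit.
// With MSVC and no /Zc:__cplusplus this is 199711L even under /std:c++17.
constexpr long reportedStandard() { return __cplusplus; }

// The kind of feature test a header-only library might perform
// (hypothetical helper, for illustration only).
constexpr bool looksLikeCpp11OrLater() { return __cplusplus >= 201103L; }
```

Printing reportedStandard() with and without the flag is a quick way to confirm that the setting took effect.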
========================================================================================================================
some profiling numbers from Visual Studio 2019, providing justification for using
splitmix, pdqsort, VeryFastBernoulliDistribution and FastBernoulliDistribution
............................................
                      clock cycles per call
splitmix              2.8
pcg                   3.5
xorshift              3.6
std::mt19937          7.5
std::minstd_rand      8.7
std::minstd_rand0     8.9
std::mt19937_64       8.9
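For reference, a sketch of the widely published splitmix64 step, assuming that this is the variant meant by "splitmix" above. The struct name and interface are hypothetical and may differ from the project's own class.

```cpp
#include <cstdint>

// Sketch of the public-domain splitmix64 generator (assumed variant).
struct SplitMix64 {
    std::uint64_t state;

    explicit SplitMix64(std::uint64_t seed) : state(seed) {}

    // One add, two multiplies and three xor-shifts per call, with no
    // branches, which is consistent with a low cycle count per call.
    std::uint64_t operator()() {
        std::uint64_t z = (state += 0x9e3779b97f4a7c15ULL);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        return z ^ (z >> 31);
    }
};
```

The whole state is a single 64-bit word, so copies of the generator are cheap to make.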
..........................................
pdqsort_branchless is 2.2 times faster than std::sort
..........................................
the following code snippets were used to profile different implementations of the Bernoulli distribution
(rng is an instance of splitmix)
while (n != 0) {
    bool b = VeryFastBernoulliDistribution(m, n)(rng);
    m -= b;
    --n;
}
3.8 clock cycles per iteration
while (n != 0) {
    bool b = FastBernoulliDistribution(m, n)(rng);
    m -= b;
    --n;
}
8.9 clock cycles per iteration
while (n != 0) {
    bool b = std::bernoulli_distribution(double(m) / n)(rng);
    m -= b;
    --n;
}
51.7 clock cycles per iteration!
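The loops above draw with probability m/n and decrement m on success, so they select exactly m of n items. Below is a sketch of an integer-only Bernoulli(m/n) draw that avoids the floating-point division. It is a hypothetical illustration, not the actual FastBernoulliDistribution or VeryFastBernoulliDistribution code. It assumes m <= n, that both fit in 32 bits, and that the generator returns uniformly distributed 64-bit values. The multiply-shift range reduction is slightly biased.

```cpp
#include <cstdint>
#include <random>

// Hypothetical integer-only Bernoulli(m/n); not the project's implementation.
struct IntegerBernoulli {
    std::uint32_t m, n;

    IntegerBernoulli(std::uint32_t m_, std::uint32_t n_) : m(m_), n(n_) {}

    template <class Rng>
    bool operator()(Rng& rng) const {
        std::uint64_t x = static_cast<std::uint64_t>(rng()) >> 32; // top 32 bits
        std::uint64_t r = (x * n) >> 32;  // maps x into [0, n), slightly biased
        return r < m;                     // always true if m == n, never if m == 0
    }
};

// Same loop shape as the profiled snippets; returns how many draws succeeded.
template <class Rng>
std::uint32_t selectCount(std::uint32_t m, std::uint32_t n, Rng& rng) {
    std::uint32_t picked = 0;
    while (n != 0) {
        bool b = IntegerBernoulli(m, n)(rng);
        m -= b;
        picked += b;
        --n;
    }
    return picked;
}
```

Because the draw always succeeds when m == n and never succeeds when m == 0, selectCount(m, n, rng) returns exactly m for any m <= n. Any 64-bit generator can be passed in, for example std::mt19937_64 or a splitmix instance.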