-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathperformance.slide
540 lines (337 loc) · 18.5 KB
/
performance.slide
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
Performance in Go
Lviv, UA
18 May 2017
Ivan Kutuzov
http://golang.org.ua/
http://golang-ua.slack.com/
http://gophers.in.ua/
@arbrix
* License and Materials
This presentation is licensed under the [[https://creativecommons.org/licenses/by-sa/4.0/][Creative Commons Attribution-ShareAlike 4.0 International]] licence.
This materials was prepared by Dave Cheney.
The materials for this presentation are available on GitHub:
.link https://github.com/davecheney/high-performance-go-workshop
You are encouraged to remix, transform, or build upon the material, providing you give appropriate credit and distribute your contributions under the same license.
If you have suggestions or corrections to this presentation, please raise [[https://github.com/davecheney/high-performance-go-workshop/isues][an issue on the GitHub project]].
* Agenda
Today we are going to cover three areas:
- Recap
- What does performance mean, what is possible?
- Benchmarking
- Performance measurement and profiling
* Recap
We need more practice
.link http://adventofcode.com
.link http://gocode.io/
.link https://www.codingame.com/
.link https://coderbyte.com/
.link http://users.csc.calpoly.edu/~jdalbey/103/Projects/ProgrammingPractice.html
.link http://www.w3resource.com/c-programming-exercises/
.link http://www.codeabbey.com/index/task_list
* Tasks
- Elementary
- Array, Strings
- Intermediate
- Advance
* What does performance mean?
Before we talk about writing high performance code, we need to talk about the hardware that will execute this code.
What are its properties and how have they changed over time?
As software authors we have benefited from Moore's Law, the doubling of the number of available transistors on a chip every 18 months, for 50 years.
No other industry has experienced a _six_order_of_magnitude_ improvement in their tools in the space of a lifetime.
* Network and disk I/O are still expensive
Network and disk I/O are still expensive, so expensive that the Go runtime will schedule something else while those operations are in progress.
.image images/media-20160803.jpg
* Benchmarking
* Benchmarking
Before you can begin to tune your application, you need to establish a reliable baseline to measure the impact of your change to know if you're making things better, or worse.
In other words, _"Don't_guess,_measure"_
This section focuses on how to construct useful benchmarks using the Go testing framework, and gives practical tips for avoiding the pitfalls.
Benchmarking is closely related to profiling, which we'll touch on during this section, then cover it in detail in the next.
* Benchmarking ground rules
Before you benchmark, you must have a stable environment to get repeatable results.
- The machine must be idle—don't profile on shared hardware, don't browse the web while waiting for a long benchmark to run.
- Watch out for power saving and thermal scaling.
- Avoid virtual machines and shared cloud hosting; they are too noisy for consistent measurements.
If you can afford it, buy dedicated performance test hardware. Rack it, disable all the power management and thermal scaling and never update the software on those machines.
For everyone else, have a before and after sample and run them multiple times to get consistent results.
* Using the testing package for benchmarking
The `testing` package has built in support for writing benchmarks.
// Fib computes the n'th number in the Fibonacci series.
func Fib(n int) int {
if n < 2 {
return n
}
return Fib(n-1) + Fib(n-2)
}
.caption fib.go
func BenchmarkFib(b *testing.B) {
for n := 0; n < b.N; n++ {
Fib(20) // run the Fib function b.N times
}
}
.caption fib_test.go
DEMO: `go`test`-bench=.`./examples/fib`
* How benchmarks work
Each benchmark is run `b.N` times until it takes longer than 1 second.
`b.N` starts at 1, if the benchmark completes in under 1 second `b.N` is increased and the benchmark run again.
`b.N` increases in the approximate sequence; 1, 2, 3, 5, 10, 20, 30, 50, 100, ...
% go test -bench=. ./examples/fib
BenchmarkFib-4 30000 46408 ns/op
PASS
ok _/Users/dfc/devel/high-performance-go-workshop/examples/fib 1.910s
_Beware:_ below the μs mark you will start to see the relativistic effects of instruction reordering and code alignment.
- Run benchmarks longer to get more accuracy; `go`test`-benchtime=10s`
- Run benchmarks multiple times; `go`test`-count=10`
_Tip:_ If this is required, codify it in a `Makefile` so everyone is comparing apples to apples.
# Setups costs usually amortized. Reset timer better than stop and start timer.
* Comparing benchmarks
For repeatable results, you should run benchmarks multiple times.
You can do this manually, or use the `-count=` flag.
Determining the performance delta between two sets of benchmarks can be tedious and error prone.
Tools like [[https://godoc.org/rsc.io/benchstat][rsc.io/benchstat]] are useful for comparing results.
% go test -c
% mv fib.test fib.golden
DEMO: Improve `Fib`
% go test -c
% ./fib.golden -test.bench=. -test.count=5 > old.txt
% ./fib.test -test.bench=. -test.count=5 > new.txt
% benchstat old.txt new.txt
DEMO: `benchstat`{old,new}.txt`
* Avoid benchmarking start up costs
Sometimes your benchmark has a once per run setup cost. `b.ResetTimer()` will can be used to ignore the time accrued in setup.
func BenchmarkExpensive(b *testing.B) {
boringAndExpensiveSetup()
b.ResetTimer() // HL
for n := 0; n < b.N; n++ {
// function under test
}
}
If you have some expensive setup logic _per_loop_iteration, use `b.StopTimer()` and `b.StartTimer()` to pause the benchmark timer.
func BenchmarkComplicated(b *testing.B) {
for n := 0; n < b.N; n++ {
b.StopTimer() // HL
complicatedSetup()
b.StartTimer() // HL
// function under test
}
}
* Benchmarking allocations
Allocation count and size is strongly correlated with benchmark time.
You can tell the `testing` framework to record the number of allocations made by code under test with
func BenchmarkRead(b *testing.B) {
b.ReportAllocs() // HL
for n := 0; n < b.N; n++ {
// function under test
}
}
DEMO: `go`test`-run=^$`-bench=.`bufio`
_Note:_ you can also use the `go`test`-benchmem` flag to do the same for _all_ benchmarks.
DEMO: `go`test`-run=^$`-bench=.`-benchmem`bufio`
* Watch out for compiler optimisations
This example comes from [[https://github.com/golang/go/issues/14813#issue-140603392][issue 14813]]. How fast will this function benchmark?
const m1 = 0x5555555555555555
const m2 = 0x3333333333333333
const m4 = 0x0f0f0f0f0f0f0f0f
const h01 = 0x0101010101010101
func popcnt(x uint64) uint64 {
x -= (x >> 1) & m1
x = (x & m2) + ((x >> 2) & m2)
x = (x + (x >> 4)) & m4
return (x * h01) >> 56
}
func BenchmarkPopcnt(b *testing.B) {
for i := 0; i < b.N; i++ {
popcnt(uint64(i))
}
}
* What happened?
% go test -bench=. ./examples/popcnt
BenchmarkPopcnt-8 2000000000 0.29 ns/op
PASS
ok github.com/davecheney/high-performance-go-workshop/examples/popcnt 0.625s
`popcnt` is a leaf function, so the compiler can inline it.
Because the function is inlined, the compiler can see it has no side effects, so the call is eliminated. This is what the compiler sees:
func BenchmarkPopcnt(b *testing.B) {
for i := 0; i < b.N; i++ {
// optimised away
}
}
The same optimisations that make real code fast, by removing unnecessary computation, are the same ones that remove benchmarks that have no observable side effects.
This is only going to get more common as the Go compiler improves.
* Benchmark mistakes
The `for` loop is crucial to the operation of the benchmark.
Here are two incorrect benchmarks, can you explain what is wrong with them?
func Fib(n int) int {
a, b := 0, 1
for i := 0; i < n; i++ {
a, b = b, a+b
}
return a
}
func BenchmarkFibWrong(b *testing.B) {
Fib(b.N)
}
func BenchmarkFibWrong2(b *testing.B) {
for n := 0; n < b.N; n++ {
Fib(n)
}
}
* Discussion
Are there any questions?
* Performance measurement and profiling
* Performance measurement and profiling
In the previous section we studied how to measure the performance of programs from the outside.
In this section we'll use profiling tools built into Go to investigate the operation of the program from the inside.
Don't trade performance for reliability
* pprof
The primary tool we're going to be talking about today is _pprof_.
[[https://github.com/google/pprof][pprof]] descends from the [[https://github.com/gperftools/gperftools][Google Perf Tools]] suite of tools.
`pprof` profiling is built into the Go runtime.
It consists of two parts:
- `runtime/pprof` package built into every Go program
- `go`tool`pprof` for investigating profiles.
pprof supports several types of profiling, we'll discuss three of these today:
- CPU profiling.
- Memory profiling.
- Block (or blocking) profiling.
* CPU profiling
CPU profiling is the most common type of profile, and the most obvious.
When CPU profiling is enabled the runtime will interrupt itself every 10ms and record the stack trace of the currently running goroutines.
Once the profile is complete we can analyse it to determine the hottest code paths.
The more times a function appears in the profile, the more time that code path is taking as a percentage of the total runtime.
* Memory profiling
Memory profiling records the stack trace when a _heap_ allocation is made.
Stack allocations are assumed to be free and are _not_tracked_ in the memory profile.
Memory profiling, like CPU profiling is sample based, by default memory profiling samples 1 in every 1000 allocations. This rate can be changed.
Because of memory profiling is sample based and because it tracks _allocations_ not _use_, using memory profiling to determine your application's overall memory usage is difficult.
Block profiling is quite unique.
A block profile is similar to a CPU profile, but it records the amount of time a goroutine spent waiting for a shared resource.
This can be useful for determining _concurrency_ bottlenecks in your application.
Block profiling can show you when a large number of goroutines _could_ make progress, but were _blocked_. Blocking includes:
- Sending or receiving on a unbuffered channel.
- Sending to a full channel, receiving from an empty one.
- Trying to `Lock` a `sync.Mutex` that is locked by another goroutine.
Block profiling is a very specialised tool, it should not be used until you believe you have eliminated all your CPU and memory usage bottlenecks.
* One profile at at time
Profiling is not free.
Profiling has a moderate, but measurable impact on program performance—especially if you increase the memory profile sample rate.
Most tools will not stop you from enabling multiple profiles at once.
If you enable multiple profile's at the same time, they will observe their own interactions and throw off your results.
*Do*not*enable*more*than*one*kind*of*profile*at*a*time.*
* Using pprof
[[https://www.youtube.com/watch?v=xxDZuPEgbBU&t=2533s][Profiling & Optimizing in Go / Brad Fitzpatrick]]
Now that I've talked about what pprof can measure, I will talk about how to use pprof to analyse a profile.
pprof should always be invoked with _two_ arguments.
go tool pprof /path/to/your/binary /path/to/your/profile
The `binary` argument *must* be the binary that produced this profile.
The `profile` argument *must* be the profile generated by this binary.
*Warning*: Because pprof also supports an online mode where it can fetch profiles from a running application over http, the pprof tool can be invoked without the name of your binary ([[https://github.com/golang/go/issues/10863][issue 10863]]):
go tool pprof /tmp/c.pprof
*Do*not*do*this*or*pprof*will*report*your*profile*is*empty.*
* Using pprof (cont.)
This is a sample CPU profile:
% go tool pprof $BINARY /tmp/c.p
Entering interactive mode (type "help" for commands)
(pprof) top
Showing top 15 nodes out of 63 (cum >= 4.85s)
flat flat% sum% cum cum%
21.89s 9.84% 9.84% 128.32s 57.71% net.(*netFD).Read
17.58s 7.91% 17.75% 40.28s 18.11% runtime.exitsyscall
15.79s 7.10% 24.85% 15.79s 7.10% runtime.newdefer
12.96s 5.83% 30.68% 151.41s 68.09% test_frame/connection.(*ServerConn).readBytes
11.27s 5.07% 35.75% 23.35s 10.50% runtime.reentersyscall
10.45s 4.70% 40.45% 82.77s 37.22% syscall.Syscall
9.38s 4.22% 44.67% 9.38s 4.22% runtime.deferproc_m
9.17s 4.12% 48.79% 12.73s 5.72% exitsyscallfast
8.03s 3.61% 52.40% 11.86s 5.33% runtime.casgstatus
7.66s 3.44% 55.85% 7.66s 3.44% runtime.cas
7.59s 3.41% 59.26% 7.59s 3.41% runtime.onM
6.42s 2.89% 62.15% 134.74s 60.60% net.(*conn).Read
6.31s 2.84% 64.98% 6.31s 2.84% runtime.writebarrierptr
6.26s 2.82% 67.80% 32.09s 14.43% runtime.entersyscall
Often this output is hard to understand.
* Using pprof (cont.)
A better way to understand your profile is to visualise it.
% go tool pprof application /tmp/c.p
Entering interactive mode (type "help" for commands)
(pprof) web
Opens a web page with a graphical display of the profile.
.link images/profile.svg
_Note_: visualisation requires graphviz.
I find this method to be superior to the text mode, I strongly recommend you try it.
pprof also supports these modes in a non interactive form with flags like `-svg`, `-pdf`, etc. See `go`tool`pprof`-help` for more details.
.link http://blog.golang.org/profiling-go-programs Further reading: Profiling Go programs
.link https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs Further reading: Debugging performance issues in Go programs
* Using pprof (cont.)
The output of a memory profile can be similarly visualised.
% go build -gcflags='-memprofile=/tmp/m.p'
% go tool pprof --alloc_objects -svg $(go tool -n compile) /tmp/m.p > alloc_objects.svg
% go tool pprof --inuse_objects -svg $(go tool -n compile) /tmp/m.p > inuse_objects.svg
Memory profiles come in two varieties
- Alloc objects reports the call site where each allocation was made
.link images/alloc_objects.svg
- Inuse objects reports the call site where an allocation was made _iff_ it was reachable at the end of the profile
.link images/inuse_objects.svg
* Using pprof (cont.)
Here is a visualisation of a block profile:
% go test -run=XXX -bench=ClientServer -blockprofile=/tmp/b.p net/http
% go tool pprof -svg http.test /tmp/b.p > block.svg
.link images/block.svg
* Profiling benchmarks
The `testing` package has built in support for generating CPU, memory, and block profiles.
- `-cpuprofile=$FILE` writes a CPU profile to `$FILE`.
- `-memprofile=$FILE`, writes a memory profile to `$FILE`, `-memprofilerate=N` adjusts the profile rate to `1/N`.
- `-blockprofile=$FILE`, writes a block profile to `$FILE`.
Using any of these flags also preserves the binary.
% go test -run=XXX -bench=. -cpuprofile=c.p bytes
% go tool pprof bytes.test c.p
_Note:_ use `-run=XXX` to disable tests, you only want to profile benchmarks. You can also use `-run=^$` to accomplish the same thing.
* Profiling applications
Profiling `testing` benchmarks is useful for _microbenchmarks_.
We use microbenchmarks inside the standard library to make sure individual packages do not regress, but what if you want to profile a complete application?
The Go runtime's profiling interface is in the `runtime/pprof` package.
`runtime/pprof` is a very low level tool, and for historic reasons the interfaces to the different kinds of profile are not uniform.
A few years ago I wrote a small package, [[https://github.com/pkg/profile][github.com/pkg/profile]], to make it easier to profile an application.
import "github.com/pkg/profile"
func main() {
defer profile.Start().Stop()
...
}
* Framepointers
Go 1.7 has been released and along with a new compiler for amd64, the compiler now enables frame pointers by default.
The frame pointer is a register that always points to the top of the current stack frame.
Framepointers enable tools like `gdb(1)`, and `perf(1)` to understand the Go call stack.
Further reading:
.link talks.godoc.org/github.com/davecheney/presentations/seven.slide Seven ways to profile a Go program (slides)
.link https://www.youtube.com/watch?v=2h_NFBFrciI Video (30 mins)
.link https://www.bigmarker.com/remote-meetup-go/Seven-ways-to-profile-a-Go-program Recording (60 mins)
* Always use the latest released version of Go
Old versions of Go will never get better. They will never get bug fixes or optimisations.
- Go 1.4 should not be used.
- Go 1.5 and 1.6 had a slower compiler, but it produces faster code, and has a faster GC.
- Go 1.7 delivered roughly a 30% improvement in compilation speed over 1.6, a 2x improvement in linking speed (better than any previous version of Go).
- Go 1.8 will deliver a smaller improvement in compilation speed (at this point), but a significant improvement in code quality for non Intel architectures.
Old version of Go receive no updates. Do not use them. Use the latest and you will get the best performance.
.link http://dave.cheney.net/2016/04/02/go-1-7-toolchain-improvements Go 1.7 toolchain improvements
.link http://dave.cheney.net/2016/09/18/go-1-8-performance-improvements-one-month-in Go 1.8 performance improvements
* Conclusion
* Always write the simplest code you can
Start with the simplest possible code.
Measure.
If performance is good, _stop_. You don't need to optimise everything, only the hottest parts of your code.
As your application grows, or your traffic pattern evolves, the performance hot spots will change.
Don't leave complex code that is not performance critical, rewrite it with simpler operations if the bottleneck moves elsewhere.
* Don't trade performance for reliability
"I can make things very fast if they don't have to be correct."
.caption Russ Cox
"Readable means reliable"
.caption Rob Pike
Performance and reliability are equally important.
I see little value in having a very fast server that panics, deadlocks or OOMs on a regular basis.
* In conclusion
Profile your code to identify the bottlenecks, _do_not_guess_.
Always write the simplest code you can, the compiler is optimised for _normal_ code.
Shorter code is faster code; Go is not C++, do not expect the compiler to unravel complicated abstractions.
Shorter code is _smaller_ code; which is important for the CPU's cache.
Pay very close attention to allocations, avoid unnecessary allocation where possible.
Don't trade performance for reliability.