Take the small snippet:
incq %r15
addq $0x4, %r13
cmpq $0x3f, %r15
Running this through MCA on skylake/skylake-avx512 produces the following:
Iterations: 100
Instructions: 300
Total Cycles: 104
Total uOps: 300
Dispatch Width: 6
uOps Per Cycle: 2.88
IPC: 2.88
Block RThroughput: 0.8
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        incq %r15
 1      1     0.25                        addq $4, %r13
 1      1     0.25                        cmpq $63, %r15
Resources:
[0] - SKXDivider
[1] - SKXFPDivider
[2] - SKXPort0
[3] - SKXPort1
[4] - SKXPort2
[5] - SKXPort3
[6] - SKXPort4
[7] - SKXPort5
[8] - SKXPort6
[9] - SKXPort7
Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     0.75   0.75    -      -      -     0.75   0.75    -
Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -     0.24   0.25    -      -      -     0.26   0.25    -     incq %r15
 -      -     0.25   0.25    -      -      -     0.25   0.25    -     addq $4, %r13
 -      -     0.26   0.25    -      -      -     0.24   0.25    -     cmpq $63, %r15
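The summary figures above can be re-derived from the raw counts, which helps show where the 0.8 comes from. The following is a minimal sketch, not llvm-mca's actual code. It assumes that Block RThroughput is the larger of a dispatch bound (uOps per iteration divided by dispatch width) and a resource bound (uOps per iteration divided by the number of ports that can execute them), and it takes the port count of 4 from the resource pressure table (SKXPort0, SKXPort1, SKXPort5, SKXPort6).

```python
# Figures copied from the llvm-mca summary in this report.
iterations = 100
instructions = 300
total_cycles = 104
total_uops = 300
dispatch_width = 6

uops_per_iteration = total_uops / iterations       # 3 uOps per loop body

# Simulated cost of one iteration and the reported IPC.
cycles_per_iteration = total_cycles / iterations   # 1.04
ipc = instructions / total_cycles                  # about 2.88

# Assumption: the block bound is the max of the dispatch bound and the
# resource bound. Four ports show pressure in the table (0, 1, 5 and 6).
alu_ports = 4
dispatch_bound = uops_per_iteration / dispatch_width   # 0.5
resource_bound = uops_per_iteration / alu_ports        # 0.75
block_rthroughput = max(dispatch_bound, resource_bound)

print(f"cycles/iteration (simulated): {cycles_per_iteration:.2f}")
print(f"IPC:                          {ipc:.2f}")
print(f"block bound:                  {block_rthroughput:.2f}")
```

Under these assumptions the bound is 0.75 cycles per iteration, which matches the 0.75 pressure shown on each of the four ports and is presumably what the summary prints as 0.8 after rounding to one decimal.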
However, running this within llvm-exegesis (llvm-exegesis -snippets-file=/tmp/test.s --mode=latency) produces the following:
---
mode: latency
key:
  instructions:
    - 'INC64r R15 R15'
    - 'ADD64ri8 R13 R13 i_0x4'
    - 'CMP64ri8 R15 i_0x3f'
  config: ''
  register_initial_values:
    - 'R15=0x123456'
    - 'R13=0x123456'
cpu_name: skylake-avx512
llvm_triple: x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 0.4234, per_snippet_value: 1.26995, validation_counters: {} }
error: ''
info: ''
assembled_snippet: 4157415549BF563412000000000049BD563412000000000049FFC74983C5044983FF3F49FFC74983C5044983FF3F49FFC74983C5044983FF3F49FFC74983C5044983FF3F415D415FC3
...
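As a sanity check that the benchmark measured the same three instructions, the assembled_snippet dump can be searched for the loop body. The sketch below is only an illustration: the byte encodings in `BODY` are my own decoding of the standard x86-64 forms of these instructions, not something taken from llvm-exegesis output.

```python
# Hex dump copied verbatim from the assembled_snippet field.
ASSEMBLED = (
    "4157415549BF563412000000000049BD5634120000000000"
    "49FFC74983C5044983FF3F49FFC74983C5044983FF3F"
    "49FFC74983C5044983FF3F49FFC74983C5044983FF3F"
    "415D415FC3"
)

# Assumed encodings (REX.W + opcode + ModRM [+ imm8]) of the snippet.
BODY = {
    "incq %r15":      "49FFC7",
    "addq $4, %r13":  "4983C504",
    "cmpq $63, %r15": "4983FF3F",
}

body_hex = "".join(BODY.values())

# Number of back-to-back copies of the snippet present in the dump.
copies = ASSEMBLED.count(body_hex)

# Everything before the first copy is setup (register saves and initial values).
prologue = ASSEMBLED[:ASSEMBLED.index(body_hex)]
# Everything after the last copy is teardown (register restores and return).
epilogue = ASSEMBLED[ASSEMBLED.rindex(body_hex) + len(body_hex):]

print(f"copies of the snippet in the dump: {copies}")
print(f"prologue bytes: {len(prologue) // 2}, epilogue bytes: {len(epilogue) // 2}")
```

The dump contains the three instructions back to back, in the same order and with the same operands as the input snippet, so the measurement and the llvm-mca prediction refer to the same instruction sequence.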
The reciprocal throughput predicted by llvm-mca (Block RThroughput: 0.8) is almost 40% lower than the experimental value (per_snippet_value: 1.26995 cycles per iteration). uiCA seems to agree with the experimental value, predicting a reciprocal throughput of 1.25 cycles/iteration.
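The size of the discrepancy can be made explicit with a short calculation. This sketch uses only the numbers quoted in this report; `relative_gap` is a hypothetical helper name, and the gap is expressed relative to the measured value.

```python
def relative_gap(predicted: float, measured: float) -> float:
    """Fraction by which a prediction falls short of the measurement."""
    return (measured - predicted) / measured


# Measured per-snippet value reported by llvm-exegesis (cycles per iteration).
measured = 1.26995

# Predictions quoted in this report (cycles per iteration).
predictions = {
    "llvm-mca Block RThroughput": 0.8,
    "llvm-mca Total Cycles / Iterations": 104 / 100,
    "uiCA": 1.25,
}

for name, predicted in predictions.items():
    gap = relative_gap(predicted, measured)
    print(f"{name:36s} {predicted:5.2f}  gap: {gap:6.1%}")
```

With these inputs the Block RThroughput figure is about 37% below the measurement, the simulated 1.04 cycles per iteration is about 18% below it, and the uiCA prediction is within 2%.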