Moving to use a single lookup table #116

Open: franz1981 wants to merge 1 commit into base: master

@franz1981
Contributor

franz1981 commented Jan 6, 2025

This is a funny one: on Ryzen, the existing code produces this performance:

Benchmark                                                                     (size)  (specialCharPercentage)   Mode  Cnt       Score    Error      Units
MyBenchmark.benchReplaceBackslashRawCompressedTable3                           65536                        0  thrpt   20   26783.951 ± 66.355      ops/s
MyBenchmark.benchReplaceBackslashRawCompressedTable3:CPI                       65536                        0  thrpt    2       0.297           clks/insn
MyBenchmark.benchReplaceBackslashRawCompressedTable3:IPC                       65536                        0  thrpt    2       3.370           insns/clk
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-dcache-load-misses     65536                        0  thrpt    2    2118.079                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-dcache-loads           65536                        0  thrpt    2  263567.211                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-icache-load-misses     65536                        0  thrpt    2       0.594                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-icache-loads           65536                        0  thrpt    2     268.828                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:branch-misses             65536                        0  thrpt    2      34.027                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:branches                  65536                        0  thrpt    2   82121.699                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:cycles                    65536                        0  thrpt    2  204572.765                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-load-misses          65536                        0  thrpt    2       0.073                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-loads                65536                        0  thrpt    2       4.141                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:iTLB-load-misses          65536                        0  thrpt    2       0.256                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:iTLB-loads                65536                        0  thrpt    2       0.192                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:instructions              65536                        0  thrpt    2  689362.358                #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:stalled-cycles-frontend   65536                        0  thrpt    2     766.591                #/op

This is decent, and close to what the existing table approach achieves when there are no special chars, but it still underperforms the whopping 33K ops/s on a newish Intel, which seems able to perform the 2 lookups in parallel in the same cycle for each char read.
For comparison, the closed PR #114, run on Ryzen, was outperforming the existing approach with Intel, even if not by a great margin, i.e. ~28K ops/s.

Moving to a single table lookup, the performance on Ryzen (4) is the best we could achieve, i.e.

Benchmark                                                                     (size)  (specialCharPercentage)   Mode  Cnt       Score     Error      Units
MyBenchmark.benchReplaceBackslashRawCompressedTable3                           65536                        0  thrpt   20   38490.297 ± 225.671      ops/s
MyBenchmark.benchReplaceBackslashRawCompressedTable3:CPI                       65536                        0  thrpt    2       0.186            clks/insn
MyBenchmark.benchReplaceBackslashRawCompressedTable3:IPC                       65536                        0  thrpt    2       5.369            insns/clk
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-dcache-load-misses     65536                        0  thrpt    2    2098.366                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-dcache-loads           65536                        0  thrpt    2  197925.670                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-icache-load-misses     65536                        0  thrpt    2       0.656                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-icache-loads           65536                        0  thrpt    2     216.512                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:branch-misses             65536                        0  thrpt    2      30.163                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:branches                  65536                        0  thrpt    2   82099.583                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:cycles                    65536                        0  thrpt    2  140648.734                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-load-misses          65536                        0  thrpt    2       0.171                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-loads                65536                        0  thrpt    2       3.432                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:iTLB-load-misses          65536                        0  thrpt    2       0.204                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:iTLB-loads                65536                        0  thrpt    2       0.195                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:instructions              65536                        0  thrpt    2  755059.926                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:stalled-cycles-frontend   65536                        0  thrpt    2     589.245                 #/op

IPC is now much better (5.369 vs 3.370) and L1-dcache-loads are far fewer (~198K vs ~264K per op), as expected.
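To compare the two Ryzen runs per input byte rather than per operation, the counters can be divided by the input size. The following is only a minimal sketch: the class and method names are made up for illustration, the values are copied from the two tables above, and it assumes `(size)` is the input length in bytes.

```java
// Hypothetical helper: normalizes JMH perfnorm counters (#/op) by the input size.
final class PerByteMetrics {
    // Input length taken from the (size) column of the runs above (assumed to be bytes).
    static final int SIZE = 65536;

    // Converts a per-operation counter into a per-input-byte value.
    static double perByte(double perOp) {
        return perOp / SIZE;
    }

    public static void main(String[] args) {
        String[] names = {"existing code", "single lookup"};
        // Columns: cycles, L1-dcache-loads, instructions (all #/op, Ryzen runs).
        double[][] runs = {
            {204572.765, 263567.211, 689362.358},
            {140648.734, 197925.670, 755059.926},
        };
        for (int i = 0; i < runs.length; i++) {
            System.out.printf("%-14s cycles/byte=%.2f loads/byte=%.2f insns/byte=%.2f%n",
                    names[i],
                    perByte(runs[i][0]),
                    perByte(runs[i][1]),
                    perByte(runs[i][2]));
        }
    }
}
```

With these inputs the cycles come out at roughly 3.1 vs 2.1 per byte, and the L1 data-cache loads at roughly 4.0 vs 3.0 per byte, i.e. about one load fewer per input byte, even though the instruction count per byte goes up slightly.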

I'll give it a shot on Intel to see how it performs, but I'm starting to think we're entering territory where the CPU frontend design matters enough that small code changes give very different outcomes.

Just for reference, the assembly produced by this version is not really great, so it surprises me that it gives so much better performance...

  0x00007f7b005d75b0:   vmovq  %xmm0,%rsi
  0x00007f7b005d75b5:   vmovd  %xmm1,%r8d
  0x00007f7b005d75ba:   movslq %ecx,%rax
  0x00007f7b005d75bd:   movzbl 0x10(%rsi,%rax,1),%r9d
  0x00007f7b005d75c3:   mov    0x10(%rdi,%r9,4),%r9d
  0x00007f7b005d75c8:   cmp    %ebx,%r10d
  0x00007f7b005d75cb:   jae    0x00007f7b005d76cd
  0x00007f7b005d75d1:   vmovd  %r8d,%xmm1
  0x00007f7b005d75d6:   vmovq  %rsi,%xmm0
  0x00007f7b005d75db:   mov    %r9d,%r8d
  0x00007f7b005d75de:   sar    $0x10,%r8d
  0x00007f7b005d75e2:   lea    (%r8,%r10,1),%r13d
  0x00007f7b005d75e6:   movslq %r10d,%rsi
  0x00007f7b005d75e9:   mov    %r9w,0x10(%rdx,%rsi,1)
  0x00007f7b005d75ef:   vmovq  %xmm0,%r10
  0x00007f7b005d75f4:   movzbl 0x11(%r10,%rax,1),%r10d
  0x00007f7b005d75fa:   mov    0x10(%rdi,%r10,4),%r9d
  0x00007f7b005d75ff:   cmp    %ebx,%r13d
  0x00007f7b005d7602:   jae    0x00007f7b005d76d5
  0x00007f7b005d7608:   mov    %r9d,%r10d
  0x00007f7b005d760b:   sar    $0x10,%r10d
  0x00007f7b005d760f:   add    %r13d,%r10d
  0x00007f7b005d7612:   movslq %r8d,%r8
  0x00007f7b005d7615:   add    %rsi,%r8
  0x00007f7b005d7618:   mov    %r9w,0x10(%rdx,%r8,1)
  0x00007f7b005d761e:   vmovq  %xmm0,%r8
  0x00007f7b005d7623:   movzbl 0x12(%r8,%rax,1),%r9d
  0x00007f7b005d7629:   mov    0x10(%rdi,%r9,4),%r9d
  0x00007f7b005d762e:   cmp    %ebx,%r10d
  0x00007f7b005d7631:   jae    0x00007f7b005d76c7
  0x00007f7b005d7637:   mov    %r9d,%r8d
  0x00007f7b005d763a:   sar    $0x10,%r8d
  0x00007f7b005d763e:   lea    (%r8,%r10,1),%r13d
  0x00007f7b005d7642:   movslq %r10d,%rsi
  0x00007f7b005d7645:   mov    %r9w,0x10(%rdx,%rsi,1)
  0x00007f7b005d764b:   vmovq  %xmm0,%r10
  0x00007f7b005d7650:   movzbl 0x13(%r10,%rax,1),%r10d
  0x00007f7b005d7656:   mov    0x10(%rdi,%r10,4),%r9d
  0x00007f7b005d765b:   cmp    %ebx,%r13d
  0x00007f7b005d765e:   jae    0x00007f7b005d76d2
  0x00007f7b005d7660:   mov    %r9d,%r10d
  0x00007f7b005d7663:   sar    $0x10,%r10d
  0x00007f7b005d7667:   add    %r13d,%r10d                  ;   {no_reloc}
  0x00007f7b005d766a:   movslq %r8d,%r8
  0x00007f7b005d766d:   add    %rsi,%r8
  0x00007f7b005d7670:   mov    %r9w,0x10(%rdx,%r8,1)
  0x00007f7b005d7676:   add    $0x4,%ecx
  0x00007f7b005d7679:   cmp    %r11d,%ecx
  0x00007f7b005d767c:   jl     0x00007f7b005d75b0

Still unrolled 4 times, but with more xmm* spilling, likely due to register pressure.
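Reading the assembly above, each input byte does one 4-byte table load (`mov 0x10(%rdi,%r9,4)`), an unconditional 2-byte store of the low half (`mov %r9w,...`), and then advances the output position by the high half (`sar $0x10`). A Java loop with that shape could look like the sketch below. This is not the PR's code: the class name, the table layout, and the choice to escape only the backslash are assumptions made for illustration, and the two-table method is only a guess at the shape of a "two lookups per char" variant, included for contrast.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of a packed single-table escaper.
final class SingleTableEscaper {
    // Writes two bytes at once into a byte[] (little-endian).
    private static final VarHandle SHORT =
            MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);

    // Assumed layout: low 16 bits = up to two output bytes, high 16 bits = output length.
    private static final int[] TABLE = new int[256];
    // Two-table variant (assumption): same data split across two arrays.
    private static final short[] BYTES = new short[256];
    private static final byte[] LENGTH = new byte[256];

    static {
        for (int b = 0; b < 256; b++) {
            TABLE[b] = (1 << 16) | b;                   // copy the byte, advance by 1
        }
        TABLE['\\'] = (2 << 16) | ('\\' << 8) | '\\';   // emit two backslashes, advance by 2
        for (int b = 0; b < 256; b++) {
            BYTES[b] = (short) TABLE[b];
            LENGTH[b] = (byte) (TABLE[b] >> 16);
        }
    }

    static byte[] escape(byte[] in) {
        // Worst case doubles the input; this also covers the unconditional 2-byte store.
        byte[] out = new byte[in.length * 2];
        int pos = 0;
        for (byte b : in) {
            int entry = TABLE[b & 0xFF];                // the only table load per input byte
            SHORT.set(out, pos, (short) entry);         // store both bytes unconditionally
            pos += entry >> 16;                         // advance by 1 or 2
        }
        return Arrays.copyOf(out, pos);
    }

    static byte[] escapeTwoTables(byte[] in) {
        byte[] out = new byte[in.length * 2];
        int pos = 0;
        for (byte b : in) {
            int i = b & 0xFF;
            SHORT.set(out, pos, BYTES[i]);              // first table load
            pos += LENGTH[i];                           // second table load
        }
        return Arrays.copyOf(out, pos);
    }

    public static void main(String[] args) {
        byte[] in = "a\\b".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(escape(in), StandardCharsets.US_ASCII));
    }
}
```

In the single-table version the output position depends on the same load that produces the bytes to store, whereas the two-table version issues two independent loads per byte. Which of the two is faster appears to depend on the core, as the numbers in this thread suggest.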

@franz1981
Contributor Author

As expected, on my Intel

Benchmark                                                                   (size)  (specialCharPercentage)   Mode  Cnt       Score     Error      Units
MyBenchmark.benchReplaceBackslashRawCompressedTable3                         65536                        3  thrpt   20   31742.389 ± 318.571      ops/s
MyBenchmark.benchReplaceBackslashRawCompressedTable3:CPI                     65536                        3  thrpt    2       0.199            clks/insn
MyBenchmark.benchReplaceBackslashRawCompressedTable3:IPC                     65536                        3  thrpt    2       5.015            insns/clk
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-dcache-load-misses   65536                        3  thrpt    2    2142.053                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-dcache-loads         65536                        3  thrpt    2  131228.535                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-dcache-stores        65536                        3  thrpt    2   65594.234                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-icache-load-misses   65536                        3  thrpt    2      17.519                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:LLC-load-misses         65536                        3  thrpt    2       0.088                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:LLC-loads               65536                        3  thrpt    2       0.777                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:LLC-store-misses        65536                        3  thrpt    2       0.021                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:LLC-stores              65536                        3  thrpt    2       0.515                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:branch-misses           65536                        3  thrpt    2      20.430                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:branches                65536                        3  thrpt    2   81996.179                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:cycles                  65536                        3  thrpt    2  150337.545                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-load-misses        65536                        3  thrpt    2       0.129                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-loads              65536                        3  thrpt    2  131380.625                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-store-misses       65536                        3  thrpt    2       0.149                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-stores             65536                        3  thrpt    2   65620.880                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:iTLB-load-misses        65536                        3  thrpt    2       0.946                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:instructions            65536                        3  thrpt    2  753879.967                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3                         65536                       50  thrpt   20   31622.627 ±  88.974      ops/s
MyBenchmark.benchReplaceBackslashRawCompressedTable3:CPI                     65536                       50  thrpt    2       0.200            clks/insn
MyBenchmark.benchReplaceBackslashRawCompressedTable3:IPC                     65536                       50  thrpt    2       5.003            insns/clk
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-dcache-load-misses   65536                       50  thrpt    2    2632.865                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-dcache-loads         65536                       50  thrpt    2  131171.290                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-dcache-stores        65536                       50  thrpt    2   65554.079                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:L1-icache-load-misses   65536                       50  thrpt    2      19.551                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:LLC-load-misses         65536                       50  thrpt    2       0.064                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:LLC-loads               65536                       50  thrpt    2       0.749                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:LLC-store-misses        65536                       50  thrpt    2       0.007                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:LLC-stores              65536                       50  thrpt    2       0.512                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:branch-misses           65536                       50  thrpt    2      19.852                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:branches                65536                       50  thrpt    2   82030.355                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:cycles                  65536                       50  thrpt    2  150790.287                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-load-misses        65536                       50  thrpt    2       0.099                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-loads              65536                       50  thrpt    2  131321.071                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-store-misses       65536                       50  thrpt    2       0.147                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:dTLB-stores             65536                       50  thrpt    2   65644.900                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:iTLB-load-misses        65536                       50  thrpt    2       0.983                 #/op
MyBenchmark.benchReplaceBackslashRawCompressedTable3:instructions            65536                       50  thrpt    2  754475.330                 #/op

Performance is not WOW compared to #116, but it is decent enough to justify merging this: the Ryzen improvement is just too large to ignore.
I wish I had, for Ryzen, a tool similar to the one I have for Intel to improve its performance...

@lucamolteni

On Apple M2 Pro

the ryzen_opt branch

MyBenchmark.benchReplaceBackslashRawCompressedTable3   65536                        3  thrpt   20  23588.168 ± 352.752  ops/s
MyBenchmark.benchReplaceBackslashRawCompressedTable3   65536                       50  thrpt   20  23697.671 ± 177.000  ops/s

on lemire/master

MyBenchmark.benchReplaceBackslashRawCompressedTable3   65536                        3  thrpt   20  27976.637 ± 420.268  ops/s
MyBenchmark.benchReplaceBackslashRawCompressedTable3   65536                       50  thrpt   20  28038.792 ± 435.524  ops/s

@franz1981
Contributor Author

Thanks @lucamolteni, as expected: since Apple cores have a lot of parallelism on loads, the data-dependent operation using a single table makes it a bit slower :"(

@franz1981
Contributor Author

@lemire this new version is slightly slower on M1/M2 than master, but still immensely better than the original table approach ❤️

@franz1981
Contributor Author

What do you think, @lemire? This PR shines on x86 AMD but not on Arm; master does the opposite.

@lemire
Owner

lemire commented Jan 19, 2025

@franz1981 Thanks for the comments. I will come back to this issue at a later date.
