Moving to use a single lookup table #116
As expected, on my Intel
performance is not amazing compared to #116, but decent enough to justify merging this: the Ryzen improvement is just too large to ignore.
On Apple M2 Pro the
on
Thanks @lucamolteni, as expected: since Apple cores have a lot of parallelism on loads, the data-dependent operation using a single table makes it a bit slower :"(
@lemire this new version is slightly slower on M1/M2 than master, but still immensely better than the original table approach ❤️
What do you think, @lemire? This PR shines on x86 AMD but not on Arm; master does the opposite.
@franz1981 Thanks for the comments. I will come back to this issue at a later date.
This is funny: on Ryzen, the existing code produces this performance
which is decent and close to what can be obtained with the existing table approach when there are no special chars, but it still underperforms compared to the whopping 33K with a newish Intel, which seems able to perform 2 lookups in parallel in the same cycle for each char read.
For comparison, the closed PR #114 with Ryzen was outperforming the existing approach with Intel, even if not by a great margin, i.e. ~28K ops/sec
Moving to a single table lookup, with Ryzen (4) the performance is the best we could achieve, i.e.
IPC is now much better and L1-dcache-loads are far fewer, as expected. I'll give it a shot on Intel to see how it performs, but I'm starting to think we're in territory where the CPU frontend design matters enough to give very different outcomes with small changes in code.
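To make the load-count argument concrete, here is a minimal sketch of the general idea and not the code from this PR. It assumes the hot loop classifies each input byte against two properties. The class name `LookupSketch`, the table names, and the table contents are all hypothetical. The two-table variant issues two table loads per byte, while the single-table variant packs both properties into bit flags and issues one.

```java
// Hypothetical sketch: names and table contents are illustrative assumptions,
// not taken from the project's source.
final class LookupSketch {
    // Two-table variant: one table per property, so two table loads per byte.
    static final boolean[] NEEDS_ESCAPE = new boolean[256];
    static final boolean[] IS_CONTROL = new boolean[256];

    // Single-table variant: both properties packed as bit flags, one load per byte.
    static final int ESCAPE_BIT = 1;
    static final int CONTROL_BIT = 2;
    static final byte[] FLAGS = new byte[256];

    static {
        for (int c = 0; c < 0x20; c++) {
            IS_CONTROL[c] = true;
        }
        NEEDS_ESCAPE['"'] = true;
        NEEDS_ESCAPE['\\'] = true;
        // Merge the two tables into one table of flags.
        for (int c = 0; c < 256; c++) {
            int flags = 0;
            if (NEEDS_ESCAPE[c]) flags |= ESCAPE_BIT;
            if (IS_CONTROL[c]) flags |= CONTROL_BIT;
            FLAGS[c] = (byte) flags;
        }
    }

    // Returns the index of the first special byte, or -1 if there is none.
    static int firstSpecialTwoTables(byte[] in) {
        for (int i = 0; i < in.length; i++) {
            int c = in[i] & 0xFF;
            // Two independent loads that a wide core may issue in parallel.
            if (NEEDS_ESCAPE[c] | IS_CONTROL[c]) {
                return i;
            }
        }
        return -1;
    }

    // Same result, but a single load per byte.
    static int firstSpecialSingleTable(byte[] in) {
        for (int i = 0; i < in.length; i++) {
            if (FLAGS[in[i] & 0xFF] != 0) {
                return i;
            }
        }
        return -1;
    }
}
```

Both methods return the same index for any input. Which one is faster depends on the microarchitecture, as the numbers in this thread suggest, so it is worth measuring on each target rather than assuming.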
Just for reference, the assembly produced by this version is not really great, so it surprises me that it gives so much better performance...
Still an unrolling of 4, but more xmm* spilling, likely due to register pressure