Question #53
Comments
What you are describing are conventional approaches that may work well at least some of the time. There are, of course, several alternatives and tuning parameters. What works best is likely to depend on your data.
Right, even using an 8-bit mask is interesting. My wish is to benchmark the fastest AVX2 functions (on an i5-7200U), and also to include the fastest grep. Simply, I wonder how different variants are executed by different CPUs; last year I saw how changing "a trivial"
In a few days I will start testing; at the moment I wanted to share the first unfinished draft/skeleton:
@lemire
@lemire If you could run it against other memmem implementations, it would be interesting and appreciated.
Finally I can share how the i5-7200U executes my AVX2 memmem: at 11,999,906,341/1024/1024/1024 = 11.1 GB/s when finding "Linus Torvalds" in the Linux kernel. Roughly, NyoTengu_YMM is 58.231/(0.572/2) = 203x faster than Windows 10's find.exe tool.
It is always good to test short patterns/needles (4 bytes) along with medium ones (8 bytes) in order to get a more realistic picture, with low/high occurrence counts and with English/Japanese/DNA/binary testdata. As for the DNA test (small alphabet), the hits discrepancy (1,607,850 vs 1,548,684) arises because grep/find tools report the number of lines containing hit(s), not the actual number of hits, as in:
Below, again, I execute both my scalar and vector functions back-to-back (the reported time covers both), so each one on its own is most likely about 2x faster than shown:
ripgrep is downloadable at: https://github.com/BurntSushi/ripgrep By the way, I hadn't seen a GitHub project with 22,000 stars before! My amateurish text toy outspeeds it in the exact-search department :P The benchmark is reproducible and downloadable: www.sanmayce.com/Railgun/Benchmark_Linus-Torvalds_unfinished_Nyotengu.zip Still no results on how NyoTengu_ZMM performs; at the end of the year I intend to buy a Lenovo ThinkBook (i5-1035G4, 16 GB DDR4 2133 MHz) and share how 4 threads running NyoTengu_ZMM saturate the memory READ bandwidth...
And my invitation for you to provide something that I can build, benchmark and analyze still stands. If you can't meet my request, then please stop pinging me.
I didn't know about those runs, but by speed stats I meant human-readable (and built-in) ones such as bytes/second: simple figures that can be distinguished from wall time and the like. Or is there some option that reports the pure search speed? If not, why not add one in new releases?
If you are really open (the goal of all benchmarking is to find new ways to improve existing functions), please use the current initial version of my memmem; I think there is more refinement left to be done:
I was pinging out of courtesy, to let you know how ripgrep performs in other benchmarks. You obviously are not pleased, so I won't bother you again.
If you want to speak more with me, you can come to my issue tracker. But I'd like to request that you re-read my invitation more carefully. A simple code dump isn't good enough. Look at the other tools I've benchmarked: those weren't benchmarked from a code dump on GitHub. They were tools with well-documented build processes.
Finally, I did what I wanted: a side-by-side comparison of GCC 7.3.0 vs ICL v19.0, and of Intel's latest Ice Lake vs AMD's Renoir. Yet I still wonder how things can be improved; I'm thinking about streaming, not only the all-in-RAM etude.
#Regarding_all_SIMD_memmem_variants_in_here.
Hi,
I'm a Johnny-come-lately, yet I have to ask: what is the fastest of all the variants shared here?
I want to include it in my searcher and compare it with my variants... and if mine turn out fast, to share them here.
In the next few days I will try to write a SIMD memmem, just to see whether it is slower than the already existing variants.
My idea is to emulate Boyer-Moore-Horspool of order 2, or even order 3.
So my question: have you tried this approach? Instead of filling the pattern vector with epi_8 elements (as in the above scheme), fill it with epi_16; as the manual shows, the latter could be implemented either with vectors only, or with a scalar mask plus a vector. I don't know how much the latency hurts in a real etude:
I will explore whether the mask (mixing scalars and vectors) variant stalls the pipeline in practical searches, with the Linux kernel (1 GB strong) used as the haystack... the needle will be "Linus Torvalds".