Question #53 (Open)

Sanmayce opened this issue Sep 15, 2020 · 10 comments

@Sanmayce

#Regarding_all_SIMD_memmem_variants_in_here.

Hi,
Johnny-come-lately here, but I have to ask: which is the fastest of all the variants shared here?
I want to include it in my searcher and compare it with my own variants... and if mine turn out fast, I'll share them here.

In the next few days I will try to write a SIMD memmem, just to see whether it is slower than the already existing variants.

My idea is to emulate Boyer-Moore-Horspool of order 2, or even order 3.

Pattern: "to" 
Haystack:                           "otto...........................toz" 
YMM HaystackVector1:                "otto...........................t" 
YMM HaystackVector2:                "tto...........................to" 
YMM Vector1:                        "tttttttttttttttttttttttttttttttt" 
YMM Vector2:                        "oooooooooooooooooooooooooooooooo" 
 
Mask1=(HaystackVector1 AND Vector1): 0110...........................1 
Mask2=(HaystackVector2 AND Vector2): 001............................1 
Result=(Mask1 AND Mask2):            0010...........................1 
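
For illustration, a minimal sketch of this order-2 scheme with AVX2 intrinsics: compare one haystack load against the broadcast first byte, compare the one-byte-shifted load against the broadcast last byte, AND the two byte-masks and take a movemask. The function name and the missing tail handling are illustrative only (and __builtin_ctz assumes a GCC-style compiler), not the final routine:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Order-2 sketch: find a 2-byte needle {n0,n1} in a haystack, 32 bytes per step.
// Assumes haylen >= 33 so the +1 shifted load stays in bounds; no tail handling.
static const char * find2_avx2(const char * hay, size_t haylen, char n0, char n1)
{
	const __m256i v0 = _mm256_set1_epi8(n0);	// YMM Vector1: "tttt...t"
	const __m256i v1 = _mm256_set1_epi8(n1);	// YMM Vector2: "oooo...o"
	for (size_t i = 0; i + 32 + 1 <= haylen; i += 32) {
		const __m256i h0 = _mm256_loadu_si256((const __m256i *)(hay + i));	// HaystackVector1
		const __m256i h1 = _mm256_loadu_si256((const __m256i *)(hay + i + 1));	// HaystackVector2
		const __m256i eq0 = _mm256_cmpeq_epi8(h0, v0);	// Mask1: first needle byte here
		const __m256i eq1 = _mm256_cmpeq_epi8(h1, v1);	// Mask2: last needle byte one to the right
		uint32_t mask = (uint32_t)_mm256_movemask_epi8(_mm256_and_si256(eq0, eq1));	// Result
		if (mask != 0)
			return hay + i + __builtin_ctz(mask);	// lowest set bit = leftmost match
	}
	return NULL;
}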

So my question: have you tried this approach?

Instead of filling the pattern vector with epi8 elements (as in the scheme above), fill it with epi16; as the manual shows, the latter can be implemented either with vectors only or with a scalar mask and a vector:

Pattern: "to" 
Haystack:                           "otto...........................toz" 
YMM HaystackVector1:                "otto...........................t" 
YMM HaystackVector2:                "tto...........................to" 
YMM Vector1:                        "totototototototototototototototo" 
 
Mask1=(HaystackVector1 AND Vector1): 0 1 . . . . . . . . . . . . . 0    ! 16bit !
Mask2=(HaystackVector2 AND Vector1): 0 0 . . . . . . . . . . . . . 1    ! 16bit !
Result=(Mask1 OR Mask2):             0 1 . . . . . . . . . . . . . 1    ! 16bit !
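
A corresponding sketch for the 16-bit-lane variant: one vector filled with the 2-byte pattern repeated, compared as epi16 lanes against the haystack at offsets 0 and 1, and the two results OR-ed. Again only an illustrative skeleton (2-byte needle, no tail handling, __builtin_ctz assumed available):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// epi16 sketch: "toto...to" compared against the haystack at two alignments.
// An even-offset hit shows up in Mask1, an odd-offset hit in Mask2.
static const char * find2_epi16_avx2(const char * hay, size_t haylen, char n0, char n1)
{
	const uint16_t pair = (uint16_t)((uint8_t)n0 | ((uint16_t)(uint8_t)n1 << 8));	// "to", little-endian
	const __m256i v = _mm256_set1_epi16((short)pair);	// YMM Vector1
	for (size_t i = 0; i + 32 + 1 <= haylen; i += 32) {
		const __m256i h0 = _mm256_loadu_si256((const __m256i *)(hay + i));	// HaystackVector1
		const __m256i h1 = _mm256_loadu_si256((const __m256i *)(hay + i + 1));	// HaystackVector2
		uint32_t m0 = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi16(h0, v));	// Mask1
		uint32_t m1 = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi16(h1, v));	// Mask2
		if ((m0 | m1) != 0) {
			// each epi16 hit sets 2 mask bits; recover the byte offset of the leftmost hit
			uint32_t even = m0 ? (uint32_t)__builtin_ctz(m0) : 32;		// offset within h0
			uint32_t odd  = m1 ? (uint32_t)__builtin_ctz(m1) + 1 : 33;	// +1 because h1 is shifted
			return hay + i + (even < odd ? even : odd);
		}
	}
	return NULL;
}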

I don't know how much the latency hurts in a real etude:

__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)
Architecture Latency  Throughput (CPI)
Skylake            1         0.5
__mmask16 _mm256_cmpeq_epi16_mask (__m256i a, __m256i b)
Architecture Latency Throughput (CPI)
Skylake            3          1
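
One thing the table does not show: the two forms also differ in what they require from the CPU. The vector-result compare needs only AVX2, while the mask-returning compare needs AVX-512 (BW+VL for the epi16 form, F+VL for the epi32 form). A hedged side-by-side, purely to show the types involved (helper names are illustrative):

#include <immintrin.h>
#include <stdint.h>

// Vector-result compare: AVX2 only. The verdict lives in a __m256i, so a
// movemask (or vptest) is still needed to get a scalar out of it.
static inline uint32_t cmp16_avx2(__m256i hay, __m256i pat)
{
	return (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi16(hay, pat));
}

#if defined(__AVX512BW__) && defined(__AVX512VL__)
// Mask-result compare: needs AVX-512BW + AVX-512VL. The verdict arrives
// directly as a 16-bit scalar mask, one bit per epi16 lane.
static inline __mmask16 cmp16_avx512vl(__m256i hay, __m256i pat)
{
	return _mm256_cmpeq_epi16_mask(hay, pat);
}
#endif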

I will explore whether the mask variant (mixing scalars and vectors) stalls the pipeline in practical searches, with the Linux kernel (a good 1 GB) used as the haystack... the needle will be "Linus Torvalds".

@lemire
Owner

lemire commented Sep 15, 2020

What you are describing are conventional approaches that may work well at least some of the time. There are, of course, several alternatives and tuning parameters. What works best is likely to depend on your data.

@Sanmayce
Author

Right, even using an 8-bit mask is interesting - the __mmask8 _mm256_cmpeq_epi32_mask (__m256i a, __m256i b) case.

My wish is to benchmark the fastest AVX2 functions (on an i5-7200U), and also to include the fastest grep - ripgrep, which uses a SIMD variant called from Rust. A few days ago, thanks to overclock.net members, I saw how 8 threads running my plain C Boyer-Moore-Horspool order-2 function outperformed 1 thread of the SIMDed Rust, but only on a 10th-gen CPU (on 7th, 8th and 9th gen it was slower), go figure.

Simply put, I wonder how differently the variants are executed by different CPUs. Last year I saw how replacing "a trivial" uint64_t with size_t resulted in a detectable speed-up - GCC generated 3 fewer instructions. I didn't expect it, so benchmarking even small changes can be instrumental in tuning.

@Sanmayce
Author

In a few days I will start testing; at the moment I just wanted to share the first unfinished draft/skeleton:

// Caution: It doesn't work for needles 1 byte long!
char * Railgun_AVX2 (char * pbTarget, char * pbPattern, uint32_t cbTarget, uint32_t cbPattern)
{
	char * pbTargetMax = pbTarget + cbTarget;
	register uint32_t ulHashPattern;
	register uint32_t ulHashTarget;
	signed long count;

	unsigned char SINGLET;
	uint32_t Quadruplet2nd;
	uint32_t Quadruplet3rd;
	uint32_t Quadruplet4th;

	uint32_t AdvanceHopperGrass;

	if (cbPattern > cbTarget) return(NULL);

	if ( cbPattern<4 ) { // needle 2..3; SCALAR

        	pbTarget = pbTarget+cbPattern;
		ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
		if ( cbPattern==3 ) {
			for ( ;; ) {
				if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
					if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
				}
				if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) { 
					pbTarget++;
					if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
				}
				pbTarget++;
				if (pbTarget > pbTargetMax) return(NULL);
			}
		} else {
		}
		for ( ;; ) {
			if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) ) return((pbTarget-2));
			if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
			pbTarget++;
			if (pbTarget > pbTargetMax) return(NULL);
		}

	} else { // Below: haystack <128; needle >=4; SCALAR
		if (cbTarget<128) { // This value is arbitrary (not tuned precisely); it should ensure better performance than 'Boyer_Moore_Horspool'.

		pbTarget = pbTarget+cbPattern;
		ulHashPattern = *(uint32_t *)(pbPattern);
		SINGLET = ulHashPattern & 0xFF;
		Quadruplet2nd = SINGLET<<8;
		Quadruplet3rd = SINGLET<<16;
		Quadruplet4th = SINGLET<<24;
		for ( ;; ) {
			AdvanceHopperGrass = 0;
			ulHashTarget = *(uint32_t *)(pbTarget-cbPattern);
			if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
				count = cbPattern-1;
				while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
					if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
					count--;
				}
				if ( count == 0) return((pbTarget-cbPattern));
			} else { // The goal here: to avoid memory accesses by stressing the registers.
				if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
					AdvanceHopperGrass++;
					if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
						AdvanceHopperGrass++;
						if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
					}
				}
			}
			AdvanceHopperGrass++;
			pbTarget = pbTarget + AdvanceHopperGrass;
			if (pbTarget > pbTargetMax) return(NULL);
		}
		} else { // Below: haystack >=128; needle >=4; VECTOR

		// Stage 1: SSE2 or AVX2 i.e. 16 or 32 strides.
		// Stage 2: Dealing with the eventual remainder.
		// The main idea: Stressing the registers as it was done in Quadruplet (the above fastest etude) - outperforms Stephen R. van den Berg's strstr at http://www.scs.stanford.edu/histar/src/pkg/uclibc/libc/string/generic/strstr.c
		// __m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b) needs AVX2; the more attractive __mmask8 _mm256_cmpeq_epi32_mask (__m256i a, __m256i b) needs AVX512??
		
// Pattern: "Linus Torvalds" 
// Order4:            [    ] skip 32 if not a single occurrence of 'alds' within YMM + (Order - 1) = 32 + 3 = 35 bytes window:
// Haystack:                                 "otto.......................Torvalds" 
// YMM HaystackVector1:                      "otto.......................Torva" 
// YMM HaystackVector2:                      "tto.......................Torval" 
// YMM HaystackVector3:                      "to.......................Torvald" 
// YMM HaystackVector4:                      "o.......................Torvalds" 
// YMM Vector1:                              "aldsaldsaldsaldsaldsaldsaldsalds" 
// 
// Mask1=(HaystackVector1 eqd Vector1):       0   0   0   0   0   0   0   0     ! 8bit !
// Mask2=(HaystackVector2 eqd Vector1):       0   0   0   0   0   0   0   0     ! 8bit !
// Mask3=(HaystackVector3 eqd Vector1):       0   0   0   0   0   0   0   0     ! 8bit !
// Mask4=(HaystackVector4 eqd Vector1):       0   0   0   0   0   0   0   1     ! 8bit !
// Result=(Mask1 OR Mask2 OR Mask3 OR Mask4): 0   0   0   0   0   0   0   1     ! 8bit !

	size_t YMMchunks = cbTarget/32 -1; // in here, ensured at least 4 chunks; in order to avoid past-haystack YMM reads - decrease by 1 chunk and finish with Scalar_Quadruplet
	const __m256i last4 = _mm256_set1_epi32(*(uint32_t *)&pbPattern[cbPattern - 1 - 3]); // broadcast the needle's last 4 bytes, e.g. 'alds'
	for (size_t i = 0; i < YMMchunks*32; i += 32) {

	const __m256i HaystackVector1 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 0));
	const __m256i HaystackVector2 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 1));
	const __m256i HaystackVector3 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 2));
	const __m256i HaystackVector4 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 3));

	const __m256i EQD1 = _mm256_cmpeq_epi32(HaystackVector1, last4);
	const __m256i EQD2 = _mm256_cmpeq_epi32(HaystackVector2, last4);
	const __m256i EQD3 = _mm256_cmpeq_epi32(HaystackVector3, last4);
	const __m256i EQD4 = _mm256_cmpeq_epi32(HaystackVector4, last4);

	const __m256i FinalVector12 = _mm256_or_si256(EQD1, EQD2);
	const __m256i FinalVector34 = _mm256_or_si256(EQD3, EQD4);

	uint32_t mask = _mm256_movemask_epi8(_mm256_or_si256(FinalVector12, FinalVector34));
	//uint8_t mask = _mm256_movemask_ps(_mm256_or_si256(FinalVector12, FinalVector34)); //AVX is 8x4float _mm256_movemask_ps, couldn't find _mm256_movemask_epi32 ! Is it allowed?

	// ...

	}

	// ...

		} //if (cbTarget<128) {
	} //if ( cbPattern<4 ) { needle 2..3; SCALAR
}
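
A note on the commented-out _mm256_movemask_ps question above: AVX2 has no _mm256_movemask_epi32 intrinsic, and the usual idiom (the one the finished version further down ends up using) is to reinterpret the integer vector as floats and take _mm256_movemask_ps, which collects the top bit of each 32-bit lane. A minimal helper, name illustrative:

#include <immintrin.h>
#include <stdint.h>

// One bit per 32-bit lane from an AVX2 compare result.
// _mm256_cmpeq_epi32 sets a matching lane to all-ones, so its top bit carries
// the per-lane verdict; vmovmskps gathers those 8 top bits into a scalar.
static inline uint32_t movemask_epi32(__m256i eq)
{
	return (uint32_t)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
}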

@Sanmayce
Author

Sanmayce commented Oct 6, 2020

@lemire
Still not enough time to finish the C source of my XMM/YMM/ZMM variant of Boyer-Moore-Horspool order 4, yet the end is near.
My AVX2 machine is unavailable for 9 more days; then I will share the C source here.
I believe it will scream.
https://www.overclock.net/threads/cpu-benchmark-finding-linus-torvalds.1754066/page-2#post-28644885

@Sanmayce
Author

@lemire
I did write the initial version; can you suggest how to boost it even further:
https://www.overclock.net/threads/cpu-benchmark-finding-linus-torvalds.1754066/page-2#post-28648619

If you could run it against the other memmems, it would be interesting and appreciated.

@Sanmayce
Author

Finally I can share how the i5-7200U executes my AVX2 memmem:

It runs at 11,999,906,341/1024/1024/1024 = 11.1 GB/s when finding "Linus Torvalds" in the kernel.
Disappointingly, the GCC 9.2.0 build is significantly slower: its speed is 8,707,896,637 bytes/second, roughly 3 GB/s less, and I have no idea why. It was compiled with gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC920.exe -DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY.

Roughly, NyoTengu_YMM is 58.231/(0.572/2) = 203x faster than Windows 10's find.exe tool.
If only @BurntSushi reported speed stats, the fastest performer would clearly be NyoTengu_YMM, especially with DNA haystacks - where it is more than 2x faster.

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>copy linux-5.8.5.tar nul
        1 file(s) copied.

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>timer64 find /C "Linus Torvalds" linux-5.8.5.tar

---------- LINUX-5.8.5.TAR: 602

Kernel  Time =     0.140 =    0%
User    Time =    58.078 =   99%
Process Time =    58.218 =   99%    Virtual  Memory =      3 MB
Global  Time =    58.231 =  100%    Physical Memory =    942 MB

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>timer64 Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar n.txt
NyoTengu a.k.a. 'SHETENGU' - the skydogess exact searcher, written by Kaze, 2020-Oct-10, for contacts: [email protected]
Needle = Linus Torvalds
Current priority class is REALTIME_PRIORITY_CLASS.
Allocating Source-Buffer 983,992,320 bytes ... OK
Searching into Haystack (983,992,320) for all occurrences of Needle (14) with fastest (known to me) SCALAR memmem() - 'Railgun_Trolldom_64' ...
Hits: 602
Search took 0.096 seconds.
Pure Search Performance: 10,249,920,000 bytes/second.
Searching into Haystack (983,992,320) for all occurrences of Needle (14) with fastest (known to me) VECTOR memmem() - 'Railgun_Nyotengu_YMM' ...
Hits: 602
Search took 0.082 seconds.
Pure Search Performance: 11,999,906,341 bytes/second.

Kernel  Time =     0.468 =   81%
User    Time =     0.187 =   32%
Process Time =     0.656 =  114%    Virtual  Memory =    941 MB
Global  Time =     0.572 =  100%    Physical Memory =    941 MB

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>timer64.exe "ripgrep-12.1.1-x86_64-pc-windows-gnu.exe" -c "Linus Torvalds" linux-5.8.5.tar
602

Kernel  Time =     0.109 =   35%
User    Time =     0.203 =   65%
Process Time =     0.312 =  100%    Virtual  Memory =      4 MB
Global  Time =     0.311 =  100%    Physical Memory =    944 MB

It is always good to test short patterns/needles (4 bytes) along with medium ones (8 bytes) in order to get a more realistic picture, with low/high occurrence counts and with English/Japanese/DNA/binary test data.

And now the DNA test (small alphabet). The hit-count discrepancy (1607850 vs 1548684) is because grep/find tools report the number of lines containing the hit(s), not the actual hits, as in:

E:\Kazahana_UTF8_2020-Oct-10\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>copy con test
ACGT ACGT
^Z
1 file(s) copied.

E:\Kazahana_UTF8_2020-Oct-10\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>find /C "ACGT" test

---------- TEST: 1

E:\Kazahana_UTF8_2020-Oct-10\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>

Below, I again execute both my scalar and vector functions back-to-back, so the reported wall time covers both searches (the vector one alone is most likely about 2x faster than the totals suggest):

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>copy "SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28" nul
        1 file(s) copied.

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>timer64 find /C "ACGT" "SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28"

---------- SCB_DNA-GENOME-HOMO-SAPIENS-GCA_000001405.28-2019-02-28: 1548684

Kernel  Time =     0.750 =    0%
User    Time =   222.500 =   99%
Process Time =   223.250 =   99%    Virtual  Memory =      1 MB
Global  Time =   223.337 =  100%    Physical Memory =      3 MB

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>timer64 find /C "AACCGGTT" "SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28"

---------- SCB_DNA-GENOME-HOMO-SAPIENS-GCA_000001405.28-2019-02-28: 2709

Kernel  Time =     0.703 =    0%
User    Time =   217.000 =   99%
Process Time =   217.703 =   99%    Virtual  Memory =      1 MB
Global  Time =   217.751 =  100%    Physical Memory =      3 MB

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>timer64 Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28" 4.txt
NyoTengu a.k.a. 'SHETENGU' - the skydogess exact searcher, written by Kaze, 2020-Oct-10, for contacts: [email protected]
Needle = ACGT
Current priority class is REALTIME_PRIORITY_CLASS.
Allocating Source-Buffer 3,353,989,743 bytes ... OK
Searching into Haystack (3,353,989,743) for all occurrences of Needle (4) with fastest (known to me) SCALAR memmem() - 'Railgun_Trolldom_64' ...
Hits: 1607850
Search took 5.112 seconds.
Pure Search Performance: 656,101,279 bytes/second.
Searching into Haystack (3,353,989,743) for all occurrences of Needle (4) with fastest (known to me) VECTOR memmem() - 'Railgun_Nyotengu_YMM' ...
Hits: 1607850
Search took 1.048 seconds.
Pure Search Performance: 3,200,371,892 bytes/second.

Kernel  Time =     1.578 =   20%
User    Time =     6.156 =   81%
Process Time =     7.734 =  102%    Virtual  Memory =   3206 MB
Global  Time =     7.568 =  100%    Physical Memory =   3201 MB

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>timer64 Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28" 8.txt
NyoTengu a.k.a. 'SHETENGU' - the skydogess exact searcher, written by Kaze, 2020-Oct-10, for contacts: [email protected]
Needle = AACCGGTT
Current priority class is REALTIME_PRIORITY_CLASS.
Allocating Source-Buffer 3,353,989,743 bytes ... OK
Searching into Haystack (3,353,989,743) for all occurrences of Needle (8) with fastest (known to me) SCALAR memmem() - 'Railgun_Trolldom_64' ...
Hits: 2723
Search took 2.522 seconds.
Pure Search Performance: 1,329,892,840 bytes/second.
Searching into Haystack (3,353,989,743) for all occurrences of Needle (8) with fastest (known to me) VECTOR memmem() - 'Railgun_Nyotengu_YMM' ...
Hits: 2723
Search took 1.623 seconds.
Pure Search Performance: 2,066,537,118 bytes/second.

Kernel  Time =     1.625 =   29%
User    Time =     4.140 =   74%
Process Time =     5.765 =  103%    Virtual  Memory =   3206 MB
Global  Time =     5.573 =  100%    Physical Memory =   3201 MB

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>timer64.exe "ripgrep-12.1.1-x86_64-pc-windows-gnu.exe" -c "ACGT" "SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28"
1548684

Kernel  Time =     0.390 =    7%
User    Time =     4.937 =   89%
Process Time =     5.328 =   96%    Virtual  Memory =      8 MB
Global  Time =     5.506 =  100%    Physical Memory =   3205 MB

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>timer64.exe "ripgrep-12.1.1-x86_64-pc-windows-gnu.exe" -c "AACCGGTT" "SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28"
2709

Kernel  Time =     0.421 =    7%
User    Time =     5.078 =   92%
Process Time =     5.500 =   99%    Virtual  Memory =      8 MB
Global  Time =     5.514 =  100%    Physical Memory =   3205 MB

D:\Benchmark_Linus-Torvalds_HUMAN-GENOME_unfinished_Nyotengu>

ripgrep is downloadable at: https://github.com/BurntSushi/ripgrep

By the way, I had never seen a GitHub project with 22,000 stars before! My amateurish text toy outspeeds it in the exact-search department :P

The benchmark is reproducible and downloadable at www.sanmayce.com/Railgun/Benchmark_Linus-Torvalds_unfinished_Nyotengu.zip
The DNA haystack is at: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz

Still no results on how NyoTengu_ZMM performs; at the end of the year I intend to buy a Lenovo ThinkBook (i5-1035G4, 16 GB DDR4 2133 MHz) and share how 4 threads running NyoTengu_ZMM saturate the memory READ bandwidth...

@BurntSushi

If only @BurntSushi reported speed stats

I do.

And my invitation for you to provide something that I can build, benchmark and analyze still stands. If you can't meet my request, then please stop pinging me.

@Sanmayce
Author

I do.

I didn't know about those runs, but by speed stats I meant human-readable (and built-in) ones in bytes/second - simple figures that can be distinguished from wall time and whatnot. Or is there some option reporting the pure search speed? If not, why not add one in new releases!

And my invitation for you to provide something that I can build, benchmark and analyze still stands.

If you are really open (the goal of all benchmarking is to find new ways to improve existing functions), please use the current initial version of my memmem; I think there is still more left to refine:

// Caution: It doesn't work for needles 1 byte long!
char * Railgun_Nyotengu_XMM_YMM_ZMM (char * pbTarget, char * pbPattern, uint32_t cbTarget, uint32_t cbPattern)
{
	char * pbTargetMax = pbTarget + cbTarget;
	register uint32_t ulHashPattern;
	register uint32_t ulHashTarget;
	signed long count;

	unsigned char SINGLET;
	uint32_t Quadruplet2nd;
	uint32_t Quadruplet3rd;
	uint32_t Quadruplet4th;

	uint32_t AdvanceHopperGrass;

	size_t i;
	size_t j;
	size_t VECTORchunks;
	uint32_t mask;
	//uint8_t mask;

#ifdef XMMtengu
	int SkipWholeVector=16;
	__m128i last4;
	__m128i last1;
	__m128i first2;
	__m128i first4;
	__m128i HaystackVector1;
	__m128i HaystackVector2;
	__m128i HaystackVector3;
	__m128i HaystackVector4;

	__m128i EQD1;
	__m128i EQD2;
	__m128i EQD3;
	__m128i EQD4;

	__m128i FinalVector12;
	__m128i FinalVector34;
#endif 

#ifdef YMMtengu
	int SkipWholeVector=32;
	__m256i last4;
	__m256i last1;
	__m256i first2;
	__m256i first4;
	__m256i HaystackVector1;
	__m256i HaystackVector2;
	__m256i HaystackVector3;
	__m256i HaystackVector4;

	__m256i EQD1;
	__m256i EQD2;
	__m256i EQD3;
	__m256i EQD4;

	__m256i FinalVector12;
	__m256i FinalVector34;
#endif 

#ifdef ZMMtengu
	int SkipWholeVector=64;
	__m512i last4;
	__m512i last1;
	__m512i first2;
	__m512i first4;
	__m512i HaystackVector1;
	__m512i HaystackVector2;
	__m512i HaystackVector3;
	__m512i HaystackVector4;

	uint16_t EQD1mask16;
	uint16_t EQD2mask16;
	uint16_t EQD3mask16;
	uint16_t EQD4mask16;

	uint16_t FinalVector12mask16;
	uint16_t FinalVector34mask16;
#endif 

	if (cbPattern > cbTarget) return(NULL);

	if ( cbPattern<4 ) { // needle 2..3; SCALAR

        	pbTarget = pbTarget+cbPattern;
		ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
		if ( cbPattern==3 ) {
			for ( ;; ) {
				if ( ulHashPattern == ( (*(char *)(pbTarget-3))<<8 ) + *(pbTarget-1) ) {
					if ( *(char *)(pbPattern+1) == *(char *)(pbTarget-2) ) return((pbTarget-3));
				}
				if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) { 
					pbTarget++;
					if ( (char)(ulHashPattern>>8) != *(pbTarget-2) ) pbTarget++;
				}
				pbTarget++;
				if (pbTarget > pbTargetMax) return(NULL);
			}
		} else {
		}
		for ( ;; ) {
			if ( ulHashPattern == ( (*(char *)(pbTarget-2))<<8 ) + *(pbTarget-1) ) return((pbTarget-2));
			if ( (char)(ulHashPattern>>8) != *(pbTarget-1) ) pbTarget++;
			pbTarget++;
			if (pbTarget > pbTargetMax) return(NULL);
		}

	} else { // Below: haystack <128; needle >=4; SCALAR
		if (cbTarget<128) { // This value is arbitrary (not tuned precisely); it should ensure better performance than 'Boyer_Moore_Horspool'.

		pbTarget = pbTarget+cbPattern;
		ulHashPattern = *(uint32_t *)(pbPattern);
		SINGLET = ulHashPattern & 0xFF;
		Quadruplet2nd = SINGLET<<8;
		Quadruplet3rd = SINGLET<<16;
		Quadruplet4th = SINGLET<<24;
		for ( ;; ) {
			AdvanceHopperGrass = 0;
			ulHashTarget = *(uint32_t *)(pbTarget-cbPattern);
			if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
				count = cbPattern-1;
				while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
					if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
					count--;
				}
				if ( count == 0) return((pbTarget-cbPattern));
			} else { // The goal here: to avoid memory accesses by stressing the registers.
				if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
					AdvanceHopperGrass++;
					if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
						AdvanceHopperGrass++;
						if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
					}
				}
			}
			AdvanceHopperGrass++;
			pbTarget = pbTarget + AdvanceHopperGrass;
			if (pbTarget > pbTargetMax) return(NULL);
		}
		} else { // Below: haystack >=128; needle >=4; VECTOR

		// Stage 1: SSE2 or AVX2 i.e. 16 or 32 strides.
		// Stage 2: Dealing with the eventual remainder.
		//          Careful! Remainder starts (overlapping with the last 32byte chunk, if Needle<32) at NEXT position to 32*(YMM_Chunks_Traversed)+Order4-Needle_Length = 2*32+4-14 = 54:
		//          Chunk #0                          Chunk #1                          Remainder
		//          [00000000001111111111222222222233][33333333444444444455555555556666][6666...
		//          [01234567890123456789012345678901][23456789012345678901234567890123][4567...
		//                                                                  Linus Torva  lds  ! The needle's postfix of order 4 was sought up to 63 (ensuring the 32 bytes skips).
		//                                                                   Linus Torv  alds  ! Then next possible hit is at 54 position, or suffix starting at next to 63 or 63+1=64.
		// The main idea: Stressing the registers as it was done in Quadruplet (the above fastest etude) - outperforms Stephen R. van den Berg's strstr at http://www.scs.stanford.edu/histar/src/pkg/uclibc/libc/string/generic/strstr.c
		// __m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b) needs AVX2; the more attractive __mmask8 _mm256_cmpeq_epi32_mask (__m256i a, __m256i b) needs AVX512??
		
// Pattern: "Linus Torvalds" 
// Order4:            [    ] skip 32 if not a single occurrence of 'alds' within YMM + (Order - 1) = 32 + 3 = 35 bytes window:
// Haystack:                                 "otto.......................Torvalds" 
// YMM HaystackVector1:                      "otto.......................Torva" 
// YMM HaystackVector2:                      "tto.......................Torval" 
// YMM HaystackVector3:                      "to.......................Torvald" 
// YMM HaystackVector4:                      "o.......................Torvalds" 
// YMM Vector1:                              "aldsaldsaldsaldsaldsaldsaldsalds" 
// 
// Mask1=(HaystackVector1 eqd Vector1):       0   0   0   0   0   0   0   0     ! 8bit !
// Mask2=(HaystackVector2 eqd Vector1):       0   0   0   0   0   0   0   0     ! 8bit !
// Mask3=(HaystackVector3 eqd Vector1):       0   0   0   0   0   0   0   0     ! 8bit !
// Mask4=(HaystackVector4 eqd Vector1):       0   0   0   0   0   0   0   1     ! 8bit !
// Result=(Mask1 OR Mask2 OR Mask3 OR Mask4): 0   0   0   0   0   0   0   1     ! 8bit !

// printf("&pbPattern[cbPattern - 1 -3] = %s\n",&pbPattern[cbPattern - 1 -3]); //debug

#ifdef XMMtengu
	VECTORchunks = cbTarget/SkipWholeVector -1; // in here, ensured at least 7 chunks; in order to avoid past haystack XMM reads - decrease 1 chunk and finish with Scalar_Quadruplet
	// The preemptive search for the first char is slower SIGNIFICANTLY on i7 3rd gen?!
//	last1 = _mm_set1_epi8(*(uint8_t*)&pbPattern[0]); //cbPattern - 1
//	first2 = _mm_set1_epi16(*(uint16_t*)&pbPattern[0]); 
	first4 = _mm_set1_epi32(*(uint32_t*)&pbPattern[0]);
	for (i = 0; i < VECTORchunks*SkipWholeVector; i += SkipWholeVector) {
	HaystackVector1 = _mm_loadu_si128 ((const __m128i*)(pbTarget + i + 0));
//	EQD1 = _mm_cmpeq_epi8(HaystackVector1, last1);
//	mask = _mm_movemask_epi8( EQD1 );

//	if ( mask != 0 ) 
	{
	HaystackVector2 = _mm_loadu_si128 ((const __m128i*)(pbTarget + i + 1));
	EQD1 = _mm_cmpeq_epi32(HaystackVector1, first4);
	EQD2 = _mm_cmpeq_epi32(HaystackVector2, first4);
	HaystackVector3 = _mm_loadu_si128 ((const __m128i*)(pbTarget + i + 2));
	HaystackVector4 = _mm_loadu_si128 ((const __m128i*)(pbTarget + i + 3));
	EQD3 = _mm_cmpeq_epi32(HaystackVector3, first4);
	EQD4 = _mm_cmpeq_epi32(HaystackVector4, first4);

	FinalVector12 = _mm_or_si128(EQD1, EQD2);
	FinalVector34 = _mm_or_si128(EQD3, EQD4);

	mask = _mm_movemask_ps( _mm_castsi128_ps(_mm_or_si128(FinalVector12, FinalVector34)) );
	}

	_mm_prefetch((char*)(pbTarget + 64*64), _MM_HINT_T0);
#endif 
#ifdef YMMtengu
	VECTORchunks = cbTarget/SkipWholeVector -1; // in here, ensured at least 3 chunks; in order to avoid past haystack YMM reads - decrease 1 chunk and finish with Scalar_Quadruplet
//	last1 = _mm256_set1_epi8(*(uint8_t*)&pbPattern[0]); //cbPattern - 1
	first4 = _mm256_set1_epi32(*(uint32_t*)&pbPattern[0]);
	for (i = 0; i < VECTORchunks*SkipWholeVector; i += SkipWholeVector) {
	HaystackVector1 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 0));
//	EQD1 = _mm256_cmpeq_epi8(HaystackVector1, last1);
//	mask = _mm256_movemask_epi8( EQD1 );

//	if ( mask != 0 ) 
	{
	HaystackVector2 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 1));
	EQD1 = _mm256_cmpeq_epi32(HaystackVector1, first4);
	EQD2 = _mm256_cmpeq_epi32(HaystackVector2, first4);
	HaystackVector3 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 2));
	HaystackVector4 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 3));
	EQD3 = _mm256_cmpeq_epi32(HaystackVector3, first4);
	EQD4 = _mm256_cmpeq_epi32(HaystackVector4, first4);

	FinalVector12 = _mm256_or_si256(EQD1, EQD2);
	FinalVector34 = _mm256_or_si256(EQD3, EQD4);

	mask = _mm256_movemask_ps( _mm256_castsi256_ps(_mm256_or_si256(FinalVector12, FinalVector34)) );
	}
	_mm_prefetch((char*)(pbTarget + 64*64), _MM_HINT_T0);
#endif 
// The vector/main loop is (0012c-000fd+2)+(00175-0016d+2)= 59 bytes
/*
; mark_description "Intel(R) C++ Compiler XE for applications running on IA-32, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY -FeNyotengu_YMM_IntelV150_32bit -FAcs";

.B10.21:                        

;;; 	HaystackVector1 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 0));
;;; 	HaystackVector2 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 1));
;;; 	EQD1 = _mm256_cmpeq_epi32(HaystackVector1, first4);
;;; 	EQD2 = _mm256_cmpeq_epi32(HaystackVector2, first4);
;;; 	HaystackVector3 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 2));
;;; 	HaystackVector4 = _mm256_loadu_si256((const __m256i*)(pbTarget + i + 3));
;;; 	EQD3 = _mm256_cmpeq_epi32(HaystackVector3, first4);
;;; 	EQD4 = _mm256_cmpeq_epi32(HaystackVector4, first4);
;;; 	FinalVector12 = _mm256_or_si256(EQD1, EQD2);
;;; 	FinalVector34 = _mm256_or_si256(EQD3, EQD4);
;;; 	mask = _mm256_movemask_ps( _mm256_castsi256_ps(_mm256_or_si256(FinalVector12, FinalVector34)) );
;;; 	_mm_prefetch((char*)(pbTarget + 64*64), _MM_HINT_T0);

  000fd 0f 18 8e 00 10 
        00 00            prefetcht0 BYTE PTR [4096+esi]         
  00104 c5 fd 76 0b      vpcmpeqd ymm1, ymm0, YMMWORD PTR [ebx] 
  00108 c5 fd 76 54 32 
        01               vpcmpeqd ymm2, ymm0, YMMWORD PTR [1+edx+esi] 
  0010e c5 fd 76 5c 32 
        02               vpcmpeqd ymm3, ymm0, YMMWORD PTR [2+edx+esi] 
  00114 c5 fd 76 64 32 
        03               vpcmpeqd ymm4, ymm0, YMMWORD PTR [3+edx+esi] 
  0011a c5 f5 eb ea      vpor ymm5, ymm1, ymm2                  
  0011e c5 e5 eb f4      vpor ymm6, ymm3, ymm4                  
  00122 c5 d5 eb fe      vpor ymm7, ymm5, ymm6                  
  00126 c5 fc 50 cf      vmovmskps ecx, ymm7                    

;;; 	if ( mask != 0 ) {

  0012a 85 c9            test ecx, ecx                          
  0012c 74 3f            je .B10.26 

...

.B10.26:                        
  0016d 83 c2 20         add edx, 32                            
  00170 83 c3 20         add ebx, 32                            
  00173 3b d0            cmp edx, eax                           
  00175 72 86            jb .B10.21 

.B10.28:                        
*/

// __mmask16 _mm256_cmpeq_epi16_mask (__m256i a, __m256i b)
// __mmask16 _mm512_cmpeq_epi32_mask (__m512i a, __m512i b)
 // __mmask8 _mm256_cmpeq_epi32_mask (__m256i a, __m256i b)
#ifdef ZMMtengu
	VECTORchunks = cbTarget/SkipWholeVector -1; // in here, ensured at least 128/64-1=1 chunk; in order to avoid past haystack ZMM reads - decrease 1 chunk and finish with Scalar_Quadruplet
	first4 = _mm512_set1_epi32(*(uint32_t*)&pbPattern[0]);
	for (i = 0; i < VECTORchunks*SkipWholeVector; i += SkipWholeVector) {
	HaystackVector1 = _mm512_loadu_si512((const __m256i*)(pbTarget + i + 0));
	HaystackVector2 = _mm512_loadu_si512((const __m256i*)(pbTarget + i + 1));
	EQD1mask16 = _mm512_cmpeq_epi32_mask(HaystackVector1, first4);
	EQD2mask16 = _mm512_cmpeq_epi32_mask(HaystackVector2, first4);
	HaystackVector3 = _mm512_loadu_si512((const __m256i*)(pbTarget + i + 2));
	HaystackVector4 = _mm512_loadu_si512((const __m256i*)(pbTarget + i + 3));
	EQD3mask16 = _mm512_cmpeq_epi32_mask(HaystackVector3, first4);
	EQD4mask16 = _mm512_cmpeq_epi32_mask(HaystackVector4, first4);

	FinalVector12mask16 = (EQD1mask16 | EQD2mask16);
	FinalVector34mask16 = (EQD3mask16 | EQD4mask16);

	mask = (FinalVector12mask16 | FinalVector34mask16);
	
	_mm_prefetch((char*)(pbTarget + 64*64), _MM_HINT_T0);
#endif 
// The vector/main loop is (00143-000fc+2)+(00187-0017d+6)= 89 bytes
/*
; mark_description "Intel(R) C++ Compiler XE for applications running on IA-32, Version 15.0.0.108 Build 20140726";
; mark_description "-O3 -DZMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY -FeNyotengu_ZMM_IntelV150_32bit -FAcs";

;;; #ifdef ZMMtengu
;;; 	VECTORchunks = cbTarget/SkipWholeVector -1; // in here, ensured at least 128/64-1=1 chunk; in order to avoid past haystack ZMM reads - decrease 1 chunk and finish with Scalar_Quadruplet
;;; 	first4 = _mm512_set1_epi32(*(uint32_t*)&pbPattern[0]);

  000df 8b 45 0c         mov eax, DWORD PTR [12+ebp]            
  000e2 8d 53 c0         lea edx, DWORD PTR [-64+ebx]           

;;; 	for (i = 0; i < VECTORchunks*SkipWholeVector; i += SkipWholeVector) {

  000e5 83 e2 c0         and edx, -64                           
  000e8 62 f2 7d 48 58 
        00               vpbroadcastd zmm0, DWORD PTR [eax]     
  000ee 0f 84 99 00 00 
        00               je .B10.28 

.B10.20:                        
  000f4 89 54 24 44      mov DWORD PTR [68+esp], edx            
  000f8 33 c0            xor eax, eax                           
  000fa 8b de            mov ebx, esi                           

.B10.21:                        

;;; 	HaystackVector1 = _mm512_loadu_si512((const __m256i*)(pbTarget + i + 0));
;;; 	HaystackVector2 = _mm512_loadu_si512((const __m256i*)(pbTarget + i + 1));
;;; 	EQD1mask16 = _mm512_cmpeq_epi32_mask(HaystackVector1, first4);
;;; 	EQD2mask16 = _mm512_cmpeq_epi32_mask(HaystackVector2, first4);
;;; 	HaystackVector3 = _mm512_loadu_si512((const __m256i*)(pbTarget + i + 2));
;;; 	HaystackVector4 = _mm512_loadu_si512((const __m256i*)(pbTarget + i + 3));
;;; 	EQD3mask16 = _mm512_cmpeq_epi32_mask(HaystackVector3, first4);
;;; 	EQD4mask16 = _mm512_cmpeq_epi32_mask(HaystackVector4, first4);
;;; 	FinalVector12mask16 = (EQD1mask16 | EQD2mask16);
;;; 	FinalVector34mask16 = (EQD3mask16 | EQD4mask16);
;;; 	mask = (FinalVector12mask16 | FinalVector34mask16);
;;; 	_mm_prefetch((char*)(pbTarget + 64*64), _MM_HINT_T0);

  000fc 8b 7d 08         mov edi, DWORD PTR [8+ebp]             
  000ff 0f 18 8f 00 10 
        00 00            prefetcht0 BYTE PTR [4096+edi]         
  00106 62 f1 7d 48 76 
        03               vpcmpeqd k0, zmm0, ZMMWORD PTR [ebx]   
  0010c 62 f1 7d 48 76 
        8c 38 01 00 00 
        00               vpcmpeqd k1, zmm0, ZMMWORD PTR [1+eax+edi] 
  00117 62 f1 7d 48 76 
        94 38 02 00 00 
        00               vpcmpeqd k2, zmm0, ZMMWORD PTR [2+eax+edi] 
  00122 62 f1 7d 48 76 
        9c 38 03 00 00 
        00               vpcmpeqd k3, zmm0, ZMMWORD PTR [3+eax+edi] 
  0012d c5 f8 93 f0      kmovw esi, k0                          
  00131 c5 f8 93 c9      kmovw ecx, k1                          
  00135 c5 f8 93 d2      kmovw edx, k2                          
  00139 c5 f8 93 fb      kmovw edi, k3                          
  0013d 0b f1            or esi, ecx                            
  0013f 0b d7            or edx, edi                            
  00141 0b f2            or esi, edx                            
;;; 	if ( mask != 0 ) {
  00143 74 38            je .B10.26 
                                
...
                                
.B10.26:                        
  0017d 83 c0 40         add eax, 64                            
  00180 83 c3 40         add ebx, 64                            
  00183 3b 44 24 44      cmp eax, DWORD PTR [68+esp]            
  00187 0f 82 6f ff ff 
        ff               jb .B10.21 
                                
.B10.28:                        
*/


//		printf("mask = %02x\n", mask); //debug
	if ( mask != 0 ) {
//		printf("_mm_popcnt_u32(mask) = %d\n", _mm_popcnt_u32(mask)); //debug

// For these two:
// char *Haystack = "CPU Benchmark: Linus Torvalds ................................................................................... Linus Torvalds"; // 128 bytes long
// char *Needle = "Linus Torvalds"; // 14 bytes long
// the 'debug' outcome is:
// &pbPattern[cbPattern - 1 -3] = alds
// mask = 40
// _mm_popcnt_u32(mask) = 1
// Okay, 0x40 is 0000 0010 (LSB first) i.e. 6th bit is set, it means 4 possible positions within Chunk #6 (4*6=24 offset) (in fact it is only 25th):
// DWORD #0DWORD #1DWORD #2DWORD #3DWORD #4DWORD #5DWORD #6DWORD #7
// [00..03][04..07][08..11][12..15][16..19][20..23][24..27][28..31]
//                                                  |      
//                                                  /
//                                                 /
//                                                \/
//                        000000000011111111112222[2222]2233
//                        012345678901234567890123[4567]8901
//                        CPU Benchmark: Linus Tor[vald]s ..

		// Manually find the first suffix position:
		//j = i; // somewhere in chunk #i lie possible POPCNT(mask) matches...
		// Okay, doing it dirty as a start - checking all the 16/32/64 positions one-by-one:
	for (j = 0; j < SkipWholeVector; j++)
		if (memcmp(pbTarget + i + j, &pbPattern[cbPattern - cbPattern], cbPattern) == 0) return( pbTarget + i + j ); //first4
	// pbTarget + i + j points to offset of DWORD/suffix so we have to repoint it to the start offset, namely, pbTarget + i + j - (cbPattern-4)
		//while (memcmp(pbTarget + j, &pbPattern[cbPattern - 1 -3], 4) != 0) j++;
		// Don't forget! Comparing the rest of the Needle (to the left) has to be boundary checked (to not go outside) - for speed, this check has to be done outside this loop.
/*
						if ( *(uint32_t *)&pbTarget[i] == ulHashPattern) { // This fast check ensures not missing a match (for remainder) when going under 0 in loop below:
						// Order 4 [
					// Let's try something "outrageous" like comparing with[out] overlap BBs 4bytes long instead of 1 byte back-to-back:
					// Inhere we are using order 4, 'cbPattern - Order + 1' is the number of BBs for text 'cbPattern' bytes long, for example, for cbPattern=11 'fastest fox' and Order=4 we have BBs = 11-4+1=8:
					//0:"fast" if the comparison failed here, 'count' is 1; 'Gulliver' is cbPattern-(4-1)-7
					//1:"aste" if the comparison failed here, 'count' is 2; 'Gulliver' is cbPattern-(4-1)-6
					//2:"stes" if the comparison failed here, 'count' is 3; 'Gulliver' is cbPattern-(4-1)-5
					//3:"test" if the comparison failed here, 'count' is 4; 'Gulliver' is cbPattern-(4-1)-4
					//4:"est " if the comparison failed here, 'count' is 5; 'Gulliver' is cbPattern-(4-1)-3
					//5:"st f" if the comparison failed here, 'count' is 6; 'Gulliver' is cbPattern-(4-1)-2
					//6:"t fo" if the comparison failed here, 'count' is 7; 'Gulliver' is cbPattern-(4-1)-1
					//7:" fox" if the comparison failed here, 'count' is 8; 'Gulliver' is cbPattern-(4-1)
						count = cbPattern-4+1; 
						//count = count-4; // Double-beauty here of already being checked 'ulHashTarget' and not polluting/repeating the final lookup below.
						while ( count > 0 && *(uint32_t *)(pbPattern+count-1) == *(uint32_t *)(&pbTarget[i]+(count-1)) )
							count = count-4; // - order, of course order 4 is much more SWEET&CHEAP - less loops
						if ( count <= 0 )
							return(pbTarget+i);
						// Order 4 ]
						}
*/

	} //if ( mask != 0 ) {
	} //for (i = 0; i <

// Deal with the remainder (starts right after the last chunk) with Scalar code [
pbTarget = pbTarget+ i; // 'i' has to be the traversed pool by the vector
	//if (cbPattern > cbTarget) return(NULL);
	// Above check precedes all Railguns, inhere 'cbTarget' is the HaystackLen-i i.e. the remainder
	if (cbPattern > cbTarget-(i)) return(NULL);

		pbTarget = pbTarget+cbPattern;
		ulHashPattern = *(uint32_t *)(pbPattern);
		SINGLET = ulHashPattern & 0xFF;
		Quadruplet2nd = SINGLET<<8;
		Quadruplet3rd = SINGLET<<16;
		Quadruplet4th = SINGLET<<24;
		for ( ;; ) {
			AdvanceHopperGrass = 0;
			ulHashTarget = *(uint32_t *)(pbTarget-cbPattern);
			if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
				count = cbPattern-1;
				while ( count && *(char *)(pbPattern+(cbPattern-count)) == *(char *)(pbTarget-count) ) {
					if ( cbPattern-1==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-count) ) AdvanceHopperGrass++;
					count--;
				}
				if ( count == 0) return((pbTarget-cbPattern));
			} else { // The goal here: to avoid memory accesses by stressing the registers.
				if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
					AdvanceHopperGrass++;
					if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
						AdvanceHopperGrass++;
						if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
					}
				}
			}
			AdvanceHopperGrass++;
			pbTarget = pbTarget + AdvanceHopperGrass;
			if (pbTarget > pbTargetMax) return(NULL);
		}
// Deal with the remainder (starts right after the last chunk) with Scalar code ]

		} //if (cbTarget<128) {
	} //if ( cbPattern<4 ) { needle 2..3; SCALAR

return(NULL);
}
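
One possible refinement of the "dirty" candidate loop above (the per-position memcmp over all 16/32/64 offsets once mask != 0): walk only the set bits of the movemask with a trailing-zero count, and verify just the few byte offsets a set lane can stand for. A minimal sketch for the YMM build, where each mask bit covers one 32-bit lane OR-ed over the 4 shifted loads and first4 broadcasts the first 4 needle bytes; the helper name is illustrative and the bounds check against the haystack end is left to the caller:

#include <immintrin.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

// Walk set bits of the 8-bit lane mask; a set lane k can stand for byte
// offsets i + 4*k + d, d = 0..3 (one per shifted load), so verify those four.
// __builtin_ctz is safe here since mask != 0; _tzcnt_u32 would do as well.
static inline char * check_candidates(char * pbTarget, char * pbPattern,
                                      uint32_t cbPattern, size_t i, uint32_t mask)
{
	while (mask != 0) {
		uint32_t lane = (uint32_t)__builtin_ctz(mask);	// index of lowest set lane
		for (uint32_t d = 0; d < 4; d++) {
			char * candidate = pbTarget + i + 4*lane + d;
			if (memcmp(candidate, pbPattern, cbPattern) == 0) return candidate;
		}
		mask &= mask - 1;	// clear lowest set bit
	}
	return NULL;
}

If it returns non-NULL the match is final; if NULL, the outer loop advances by SkipWholeVector exactly as before.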

If you can't meet my request, then please stop pinging me.

I was pinging out of courtesy, to let you know how ripgrep performs in other benchmarks, you obviously are not pleased, so won't bother you again.

@BurntSushi

but by speed stats I meant human readable (and built-in) as Bytes/second

Yes, the --stats flag.

If you want to speak more with me, you can come to my issue tracker. But I'd like to request that you re-read my invitation more clearly. A simple code dump isn't good enough. Look at the other tools I've benchmarked. Those weren't benchmarked by getting a code dump on github. They were tools with well documented build processes.

@Sanmayce
Author

Finally, I did what I wanted - a side-by-side comparison of GCC 7.3.0 vs ICL v19.0, and of Intel's latest Ice Lake vs AMD's Renoir:
https://www.overclock.net/threads/cpu-benchmark-finding-linus-torvalds.1754066/page-2#post-28681743

Yet I still wonder how things can be bettered - thinking about streaming, not only the all-in-RAM etude.
