Optimized shuffle for typesize=12 #649

froody · 2025-02-04T12:15:46Z

Describe the bug
This is a feature request. I'm happy to contribute, but I don't know whether my solutions will be optimal. I compress a lot of data with typesize=12, and with shuffle enabled this falls back to unshuffle_generic, which is slow. It would be nice to have 12-byte variants of all the platform-specific shuffle code. They might not be as fast as a power-of-2 typesize, but they would still be much faster than generic.
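For readers unfamiliar with the transform, here is a minimal scalar sketch of a byte shuffle and its inverse for an arbitrary typesize. The names `shuffle_bytes` and `unshuffle_bytes` are hypothetical and this is not the library's actual code. It only illustrates the kind of element-by-element loop that a generic fallback has to run when no vectorized variant exists for a given typesize.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the library's implementation.
 * Shuffle: gather byte j of every element into stream j, so that
 * dest holds typesize streams of nelems bytes each. */
static void shuffle_bytes(size_t typesize, size_t nelems,
                          const uint8_t *src, uint8_t *dest) {
    for (size_t j = 0; j < typesize; j++) {
        for (size_t i = 0; i < nelems; i++) {
            dest[j * nelems + i] = src[i * typesize + j];
        }
    }
}

/* Hypothetical helper, NOT the library's implementation.
 * Unshuffle: scatter each byte stream back into its position
 * inside the original elements. */
static void unshuffle_bytes(size_t typesize, size_t nelems,
                            const uint8_t *src, uint8_t *dest) {
    for (size_t i = 0; i < nelems; i++) {
        for (size_t j = 0; j < typesize; j++) {
            dest[i * typesize + j] = src[j * nelems + i];
        }
    }
}
```

With typesize=12 and four elements, `shuffle_bytes(12, 4, in, sh)` places the first byte of each element at `sh[0..3]`, and `unshuffle_bytes(12, 4, sh, out)` restores the original layout. A SIMD variant would process many elements per instruction instead of one byte at a time, which is where the requested speedup would come from.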

To Reproduce
Decompress any data that was compressed using shuffle with typesize=12, and observe that unshuffle_generic dominates the overall time.

Expected behavior
unshuffle for typesize=12 is approximately as fast as typesize=8 or typesize=16

Additional context
I think it would be nice to support all possible typesizes up to a point, as for most of them there could be quite a significant speedup compared to the generic implementation.

Here's my attempt at avx512-unshuffle: #648

FrancescAlted · 2025-02-04T17:42:38Z

I agree that this is a worthwhile goal. Thanks for your AVX2 unshuffle for 12-byte types. Contributions for other type sizes are welcome too.
