Describe the bug
This is a feature request. I'm happy to contribute some of the work, but I don't know whether my solutions will be optimal. I compress a lot of data where typesize=12, and when using shuffle the code falls back to unshuffle_generic, which is slow. It would be nice to have 12-byte variants of all the platform-specific shuffle code. They might not be as fast as the power-of-2 typesize paths, but they would still be much faster than the generic one.
To Reproduce
Decompress any data using shuffle with typesize=12, see that unshuffle_generic dominates the overall time.
Expected behavior
Unshuffle for typesize=12 is approximately as fast as for typesize=8 or typesize=16.
Logs
If applicable, add logs to help explain your problem.
System information:
OS: [e.g. OSX]
Compiler [e.g. gcc, clang]
Version [e.g. 2.0.1]
Additional context
I think it would be nice to support all possible typesizes up to some limit, since for most of them this could be a significant speedup over the generic implementation.
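To make the request concrete, here is a minimal scalar sketch of the byte transposition involved, assuming the usual shuffle layout in which byte j of every element is stored in the j-th contiguous stream. The function names (`shuffle_bytes`, `unshuffle_bytes_generic`, `unshuffle_bytes_ts12`, `ts12_roundtrip_ok`) are hypothetical and are not the library's actual code; a real platform-specific variant would use SIMD intrinsics rather than a plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte shuffle: gather byte j of every element into the j-th stream.
   Hypothetical helper, used here only to build test input. */
static void shuffle_bytes(size_t typesize, size_t nelems,
                          const uint8_t *src, uint8_t *dest) {
  assert(typesize > 0);
  for (size_t j = 0; j < typesize; j++)
    for (size_t i = 0; i < nelems; i++)
      dest[j * nelems + i] = src[i * typesize + j];
}

/* Generic unshuffle: typesize is only known at run time, so the inner
   loop has a variable trip count and a variable stride. */
static void unshuffle_bytes_generic(size_t typesize, size_t nelems,
                                    const uint8_t *src, uint8_t *dest) {
  assert(typesize > 0);
  for (size_t i = 0; i < nelems; i++)
    for (size_t j = 0; j < typesize; j++)
      dest[i * typesize + j] = src[j * nelems + i];
}

/* Hypothetical 12-byte variant: the element size is a compile-time
   constant, so the compiler can fully unroll the inner loop. */
static void unshuffle_bytes_ts12(size_t nelems,
                                 const uint8_t *src, uint8_t *dest) {
  enum { TS = 12 };
  for (size_t i = 0; i < nelems; i++) {
    uint8_t *out = dest + i * TS;
    for (size_t j = 0; j < TS; j++)
      out[j] = src[j * nelems + i];
  }
}

/* Returns 1 if both unshuffle variants restore the original bytes
   for nelems 12-byte elements (nelems must be at most 64). */
static int ts12_roundtrip_ok(size_t nelems) {
  enum { TS = 12, MAX_ELEMS = 64 };
  uint8_t orig[TS * MAX_ELEMS], shuf[TS * MAX_ELEMS];
  uint8_t out_generic[TS * MAX_ELEMS], out_ts12[TS * MAX_ELEMS];
  assert(nelems <= MAX_ELEMS);
  size_t nbytes = TS * nelems;
  for (size_t k = 0; k < nbytes; k++)
    orig[k] = (uint8_t)(k * 7 + 3); /* arbitrary deterministic pattern */
  shuffle_bytes(TS, nelems, orig, shuf);
  unshuffle_bytes_generic(TS, nelems, shuf, out_generic);
  unshuffle_bytes_ts12(nelems, shuf, out_ts12);
  return memcmp(orig, out_generic, nbytes) == 0 &&
         memcmp(orig, out_ts12, nbytes) == 0;
}
```

Calling `ts12_roundtrip_ok(17)` checks that the fixed-size variant produces the same output as the generic loop. The sketch only illustrates the idea of specializing on the element size; it makes no claim about actual speedups, which would have to be measured.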
Here's my attempt at avx512-unshuffle: #648