Added fulmicoton codec 5,677,291 bytes #17
Conversation
Hey @fulmicoton - there's currently an outage with GHA; I'll re-kick the CI pipeline when it's resolved. Thanks for the submission, this is exciting!
Force-pushed d10d48a to 1d20fb2
Force-pushed 1d20fb2 to d3c555d
@agavra I shaved off an extra 60KB :)
Best version so far: 5,677,291 bytes
Force-pushed d3c555d to a7a83e1
🔥 nice, this looks like it's leading by a good chunk now!
Very cool, I hadn't heard of this approach before!
Me neither... The distribution of the data suggested it would be a good idea.
Confirmed in CI/CD: Congrats! Will merge and update the leaderboard. |
Fulmicoton Codecs (5,677,291 bytes / ~97% compression):
This is obviously not a meaningful contest: everyone is welcome to steal the ideas below and save a few extra bytes by adding their own. Please consider crediting this codec, though.
I coded all of this using Claude. In fact, I mostly compete in this project to get an idea of what LLMs can do: I bring the ideas, and Claude does the implementation.
This codec achieves 5,677,291 bytes / ~97% compression by:
Columnar layout: events are transposed into separate columns (or a column family for the repos): event IDs, event types, timestamps, and (repo name + repo id). The rows are sorted by event IDs (sketched below).
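A minimal Python sketch of the transposition, assuming a simplified `Event` record; the field names are hypothetical stand-ins for whatever the real rows look like:

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_id: int
    event_type: str
    timestamp: int   # seconds
    repo: str        # "owner/name"

def to_columns(events: list[Event]) -> dict[str, list]:
    # Sort rows by event id, then transpose into one list per column
    # so each column can get its own specialized encoding.
    events = sorted(events, key=lambda e: e.event_id)
    return {
        "event_ids": [e.event_id for e in events],
        "event_types": [e.event_type for e in events],
        "timestamps": [e.timestamp for e in events],
        "repos": [e.repo for e in events],
    }
```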
Specialized encodings per column:
Event IDs: delta encoding + adaptive arithmetic coding (delta-coding sketch after this list).
Event types: ANS entropy coding (small alphabet; rANS sketch after this list).
Timestamps: the timestamps are almost sorted, and there are several events per second. RLE is quite efficient for this, but I went further: I encode the small permutation required to sort the data, and after that I can "histogram encode" the sorted result (sketch after this list).
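Delta coding on the sorted IDs is the simple half of the event-id column; the adaptive arithmetic coder that consumes the gap stream is not shown here. A minimal sketch:

```python
def delta_encode(ids: list[int]) -> list[int]:
    if not ids:
        return []
    # First value kept verbatim; the rest become gaps between
    # consecutive ids, which are tiny once the ids are sorted.
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def delta_decode(deltas: list[int]) -> list[int]:
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out
```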
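For the event-type column, any textbook ANS variant works on a small alphabet. Below is a minimal byte-renormalizing rANS coder, not the codec's actual implementation, assuming symbol frequencies normalized to sum to a power of two:

```python
PROB_BITS = 12
M = 1 << PROB_BITS          # frequencies must sum to M
RANS_L = 1 << 23            # lower bound of the normalized state interval

def rans_encode(symbols, freqs, cum):
    x, out = RANS_L, []
    for s in reversed(symbols):                       # rANS encodes back-to-front
        f = freqs[s]
        while x >= ((RANS_L >> PROB_BITS) << 8) * f:  # renormalize, one byte at a time
            out.append(x & 0xFF)
            x >>= 8
        x = (x // f) * M + (x % f) + cum[s]
    return x, out

def rans_decode(x, data, n, freqs, cum):
    data, syms = list(data), []
    for _ in range(n):
        slot = x & (M - 1)
        # Linear scan is fine for a small alphabet.
        s = next(i for i, c in enumerate(cum) if c <= slot < c + freqs[i])
        x = freqs[s] * (x >> PROB_BITS) + slot - cum[s]
        while x < RANS_L and data:                    # refill from the byte stream
            x = (x << 8) | data.pop()
        syms.append(s)
    return syms

# Toy alphabet of 4 event types, frequencies normalized to sum to M = 4096.
freqs = [2048, 1024, 768, 256]
cum = [0, 2048, 3072, 3840]
msg = [0, 0, 1, 0, 2, 0, 1, 3, 0, 0]
state, stream = rans_encode(msg, freqs, cum)
assert rans_decode(state, stream, len(msg), freqs, cum) == msg
```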
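A sketch of the timestamp idea as I read it: store each element's displacement from its sorted position (tiny for almost-sorted data), then histogram-encode the sorted stream as (gap to previous second, event count) pairs. How the two output streams are entropy-coded afterwards is not shown:

```python
from itertools import groupby

def encode_timestamps(ts):
    # Stable sort: order[i] is the original position of the i-th sorted value.
    order = sorted(range(len(ts)), key=lambda i: ts[i])
    displacements = [pos - i for i, pos in enumerate(order)]  # mostly near zero
    sorted_ts = [ts[i] for i in order]
    # Histogram encoding of the sorted values: one (gap, count) pair per second.
    histogram, prev = [], 0
    for second, group in groupby(sorted_ts):
        histogram.append((second - prev, sum(1 for _ in group)))
        prev = second
    return displacements, histogram

def decode_timestamps(displacements, histogram):
    sorted_ts, prev = [], 0
    for gap, count in histogram:
        prev += gap
        sorted_ts.extend([prev] * count)
    out = [0] * len(sorted_ts)
    for i, d in enumerate(displacements):   # undo the sorting permutation
        out[i + d] = sorted_ts[i]
    return out
```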
I use something that I call VIPCompression a lot. I identify the top K most common values in a stream of n elements and replace the stream with tokens in 0..K: the first K tokens stand for those top values, and a sentinel token stands for "other"; this token stream is compressed using ANS. The not-so-common elements then go into a side stream with a different representation (sketched below).
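A sketch of VIPCompression as described, minus the entropy stages (the token stream would go through ANS, and the exception stream through its own representation):

```python
from collections import Counter

def vip_encode(stream, k):
    # Top-k most common values get token ids 0..k-1; everything else maps
    # to the sentinel token k and is spilled to a side stream of exceptions.
    vip = [v for v, _ in Counter(stream).most_common(k)]
    index = {v: i for i, v in enumerate(vip)}
    sentinel = k
    tokens, exceptions = [], []
    for v in stream:
        t = index.get(v, sentinel)
        tokens.append(t)
        if t == sentinel:
            exceptions.append(v)
    return vip, tokens, exceptions

def vip_decode(vip, tokens, exceptions):
    it = iter(exceptions)
    return [vip[t] if t < len(vip) else next(it) for t in tokens]
```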
Repos are dictionary encoded: the remaining indices fit in 3 bytes. The top byte is encoded using ANS; the two remaining bytes are compressed with zstd. In both cases (owner and name suffix) we use VIP coding with K=1023. We then concatenate the remaining owners and suffixes, \n-separated, into the same blob of text and apply zstd to it (byte-plane split sketched below).
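A sketch of the byte split for the 3-byte dictionary indices; whether the low two bytes are stored as planes or interleaved is my guess, and the ANS (top plane) and zstd (low planes) stages are left out:

```python
def encode_repo_indices(indices):
    # Each dictionary index fits in 3 bytes; split into byte planes so the
    # highly skewed top byte can go through ANS and the rest through zstd.
    assert all(0 <= i < 1 << 24 for i in indices)
    top = bytes((i >> 16) & 0xFF for i in indices)
    mid = bytes((i >> 8) & 0xFF for i in indices)
    low = bytes(i & 0xFF for i in indices)
    return top, mid + low

def decode_repo_indices(top, rest, n):
    mid, low = rest[:n], rest[n:]
    return [(t << 16) | (m << 8) | l for t, m, l in zip(top, mid, low)]
```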
I think most of the remaining headroom is in the repo column. I suspect it will be difficult to improve event IDs/types/timestamps by a lot at this point.