
Added fulmicoton codec 5,677,291 bytes #17

Merged
agavra merged 1 commit into agavra:main from fulmicoton:fulmicoton
Feb 2, 2026

Conversation

@fulmicoton
Contributor

fulmicoton commented Feb 1, 2026

Fulmicoton Codecs (5,677,291 bytes / ~97% compression):

This is obviously not a meaningful contest: everyone is welcome to steal
the ideas below and save a few extra bytes by adding their own. Please
consider citing this codec, though.

I coded all of this using Claude. In fact, I mostly compete in this project
to get an idea of what LLMs can do. I bring the ideas, and
Claude does the implementation.

This codec achieves 5,677,291 bytes / ~97% compression by:

  1. Columnar layout: events are transposed into separate columns (or column families for repos):
    event IDs, event types, timestamps, and (repo name + repo ID).
    The rows are sorted by event ID.

  2. Specialized encodings per column:

    • Event IDs: Delta encoding + adaptive arithmetic coding.

    • Event types: ANS entropy coding (small alphabet)

    • Timestamps: The timestamps are almost sorted. There are several
      events by seconds. RLE is quite efficient for this, but I went further:
      I encode the small permutation required to sort the data.
      After that I can "histogram encode" the result.
      I use something that I call VIPCompression a lot.
      I identify the top K most common things in a stream. I compress the stream of n elements by replacing by
      a stream of tokens in 0..K+1 representing those top elements that I compress using ANS + a sentinel element representing "others". Then I represent the stream of not so common element using a different representation.

    • Repos are dictionary encoded:

      • the repo indices indices are encoded as follows. VIP Coding (2047) + we encode
        the remaining indices over 3 bytes. The top byte is encoded using ANS. The two remaining are compressed with zstd.
      • Dictionary (we sort the repo id + repo name by repo id and we compress them in different columns)
        • repo IDs: BIC (Binary Interpolative Coding) for sorted sequences
        • repo names: We split owner and suffix from the repo names and encode them separately.
          In both case we use VIP Coding(1023). We then concatenate the remaining owner and suffix \n separated in the
          same blob of text and apply zstd on this.
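The VIP-compression idea described above can be sketched as a split into a token stream and an exception stream. This is a minimal illustration, not the PR's implementation: the function names are made up, and the real codec entropy-codes the token stream with ANS rather than keeping plain lists.

```python
from collections import Counter

def vip_split(stream, k):
    """Split a stream into (top_values, tokens, exceptions).

    The k most frequent values map to tokens 0..k-1; every other
    value becomes the sentinel token k, and the actual value is
    appended to a separate exception stream that can use a
    representation suited to rare values (e.g. zstd).
    """
    top = [v for v, _ in Counter(stream).most_common(k)]
    index = {v: i for i, v in enumerate(top)}
    tokens, exceptions = [], []
    for v in stream:
        t = index.get(v)
        if t is None:
            tokens.append(k)          # sentinel: "not a VIP"
            exceptions.append(v)
        else:
            tokens.append(t)
    return top, tokens, exceptions

def vip_join(top, tokens, exceptions):
    """Inverse of vip_split: rebuild the original stream."""
    k = len(top)
    it = iter(exceptions)
    return [next(it) if t == k else top[t] for t in tokens]
```

Because the token alphabet is tiny (K + 1 symbols) and very skewed, the token stream compresses extremely well with an entropy coder, while the rare values are handled by a coder that suits their shape.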

I think most of the remaining headroom is in the repository columns. I suspect it will be difficult to improve event IDs/types/timestamps by much at this point.
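For reference, binary interpolative coding (the BIC mentioned above for the sorted repo IDs) encodes the middle element of a sorted sequence within the narrowest range its neighbours allow, then recurses on both halves. This is a textbook sketch under the assumption of a strictly increasing sequence with known bounds, not the PR's actual code.

```python
def _write_bits(bits, value, lo, hi):
    # Append `value` (known to lie in [lo, hi]) using the minimal
    # fixed number of bits for that range; zero bits if lo == hi.
    span = hi - lo
    v = value - lo
    for i in reversed(range(span.bit_length())):
        bits.append((v >> i) & 1)

def bic_encode(seq, lo, hi, bits=None):
    """Encode a strictly increasing seq with values in [lo, hi]."""
    if bits is None:
        bits = []
    if not seq:
        return bits
    m = len(seq) // 2
    # seq[m] has m elements below it and len(seq)-1-m above it,
    # so it must lie in [lo + m, hi - (len(seq) - 1 - m)].
    _write_bits(bits, seq[m], lo + m, hi - (len(seq) - 1 - m))
    bic_encode(seq[:m], lo, seq[m] - 1, bits)
    bic_encode(seq[m + 1:], seq[m] + 1, hi, bits)
    return bits

def bic_decode(bits, n, lo, hi, pos=0):
    """Mirror of bic_encode: same ranges, hence same bit widths."""
    if n == 0:
        return [], pos
    m = n // 2
    b_lo, b_hi = lo + m, hi - (n - 1 - m)
    v = 0
    for _ in range((b_hi - b_lo).bit_length()):
        v = (v << 1) | bits[pos]
        pos += 1
    mid = b_lo + v
    left, pos = bic_decode(bits, m, lo, mid - 1, pos)
    right, pos = bic_decode(bits, n - m - 1, mid + 1, hi, pos)
    return left + [mid] + right, pos
```

The narrower the ranges get as the recursion descends, the fewer bits each element costs, which is why BIC shines on dense sorted ID sequences.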

@fulmicoton fulmicoton changed the title from "Added fulmicoton codec" to "Added fulmicoton codec 5,841,977 bytes" Feb 1, 2026
@fulmicoton fulmicoton closed this Feb 1, 2026
@fulmicoton fulmicoton reopened this Feb 1, 2026
@fulmicoton fulmicoton changed the title from "Added fulmicoton codec 5,841,977 bytes" to "Added fulmicoton codec 5,749,789 bytes" Feb 1, 2026
@fulmicoton fulmicoton changed the title from "Added fulmicoton codec 5,749,789 bytes" to "Added fulmicoton codec 5,736,674 bytes" Feb 2, 2026
@agavra
Owner

agavra commented Feb 2, 2026

Hey @fulmicoton - there's currently an outage with GHA, I'll kick off the CI pipeline again when it's resolved. Thanks for the submission, this is exciting!

@fulmicoton fulmicoton changed the title from "Added fulmicoton codec 5,736,674 bytes" to "Added fulmicoton codec 5,677,291 bytes" Feb 2, 2026
@fulmicoton
Contributor Author

@agavra I shaved off an extra 60KB :)

Best version so far: 5,677,291 bytes
@agavra
Owner

agavra commented Feb 2, 2026

I shaved off an extra 60KB :)

πŸ”₯ nice, this looks like it's leading by a good chunk now!

I identify the top K most common values in a stream, then replace the stream of n elements with a stream of tokens in 0..K+1 (the top values, plus a sentinel token meaning "other") that I compress using ANS. The not-so-common elements then go into a separate stream with a different representation.

Very cool, I hadn't heard of this approach before!

@fulmicoton
Contributor Author

Me neither... The distribution of the data suggested it would be a good idea.

@agavra
Owner

agavra commented Feb 2, 2026

Confirmed in CI/CD:


β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Codec                  β”‚           Size β”‚ vs Naive   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ Naive                  β”‚    210,727,389 β”‚   baseline β”‚
β”‚ fulmicoton             β”‚      5,677,291 β”‚     -97.3% β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Congrats! Will merge and update the leaderboard.

@agavra agavra merged commit 4ba8608 into agavra:main Feb 2, 2026
1 of 2 checks passed
