Added fulmicoton codec 5,677,291 bytes #17
Conversation
Hey @fulmicoton - there's currently an outage with GHA; I'll re-kick the CI pipeline when it's resolved. Thanks for the submission, this is exciting!
Force-pushed d10d48a to 1d20fb2
Force-pushed 1d20fb2 to d3c555d
@agavra I shaved off an extra 60KB :)
Best version so far: 5,677,291 bytes
Force-pushed d3c555d to a7a83e1
🔥 nice, this looks like it's leading by a good chunk now!
Very cool, I hadn't heard of this approach before!
Me neither... The distribution of the data suggested it would be a good idea.
Confirmed in CI/CD: Congrats! Will merge and update the leaderboard. |
Fulmicoton Codecs (5,677,291 bytes / ~97% compression):
This is obviously not a meaningful contest: everyone is welcome to steal the ideas below and save a few extra bytes by adding their own. Please consider crediting this codec, though.
I coded all of this using Claude. In fact, I mostly compete in this project to get an idea of what LLMs can do: I bring the ideas, and Claude does the implementation.
This codec achieves 5,677,291 bytes / ~97% compression by:
Columnar layout: events are transposed into separate columns (or a column family for the repos): event IDs, event types, timestamps, and (repo name + repo id). The rows are sorted by event IDs (sketched below).
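A minimal Python sketch of the transposition, assuming a simplified `Event` record; the field names are hypothetical stand-ins for whatever the real rows look like:

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_id: int
    event_type: str
    timestamp: int   # seconds
    repo: str        # "owner/name"

def to_columns(events: list[Event]) -> dict[str, list]:
    # Sort rows by event id, then transpose into one list per column
    # so each column can get its own specialized encoding.
    events = sorted(events, key=lambda e: e.event_id)
    return {
        "event_ids": [e.event_id for e in events],
        "event_types": [e.event_type for e in events],
        "timestamps": [e.timestamp for e in events],
        "repos": [e.repo for e in events],
    }
```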
Specialized encodings per column:
Event IDs: delta encoding + adaptive arithmetic coding (delta-coding sketch after this list).
Event types: ANS entropy coding (small alphabet; rANS sketch after this list).
Timestamps: the timestamps are almost sorted, and there are several events per second. RLE is quite efficient for this, but I went further: I encode the small permutation required to sort the data, and after that I can "histogram encode" the sorted result (sketch after this list).
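Delta coding on the sorted IDs is the simple half of the event-id column; the adaptive arithmetic coder that consumes the gap stream is not shown here. A minimal sketch:

```python
def delta_encode(ids: list[int]) -> list[int]:
    if not ids:
        return []
    # First value kept verbatim; the rest become gaps between
    # consecutive ids, which are tiny once the ids are sorted.
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def delta_decode(deltas: list[int]) -> list[int]:
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out
```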
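For the event-type column, any textbook ANS variant works on a small alphabet. Below is a minimal byte-renormalizing rANS coder, not the codec's actual implementation, assuming symbol frequencies normalized to sum to a power of two:

```python
PROB_BITS = 12
M = 1 << PROB_BITS          # frequencies must sum to M
RANS_L = 1 << 23            # lower bound of the normalized state interval

def rans_encode(symbols, freqs, cum):
    x, out = RANS_L, []
    for s in reversed(symbols):                       # rANS encodes back-to-front
        f = freqs[s]
        while x >= ((RANS_L >> PROB_BITS) << 8) * f:  # renormalize, one byte at a time
            out.append(x & 0xFF)
            x >>= 8
        x = (x // f) * M + (x % f) + cum[s]
    return x, out

def rans_decode(x, data, n, freqs, cum):
    data, syms = list(data), []
    for _ in range(n):
        slot = x & (M - 1)
        # Linear scan is fine for a small alphabet.
        s = next(i for i, c in enumerate(cum) if c <= slot < c + freqs[i])
        x = freqs[s] * (x >> PROB_BITS) + slot - cum[s]
        while x < RANS_L and data:                    # refill from the byte stream
            x = (x << 8) | data.pop()
        syms.append(s)
    return syms

# Toy alphabet of 4 event types, frequencies normalized to sum to M = 4096.
freqs = [2048, 1024, 768, 256]
cum = [0, 2048, 3072, 3840]
msg = [0, 0, 1, 0, 2, 0, 1, 3, 0, 0]
state, stream = rans_encode(msg, freqs, cum)
assert rans_decode(state, stream, len(msg), freqs, cum) == msg
```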
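A sketch of the timestamp idea as I read it: store each element's displacement from its sorted position (tiny for almost-sorted data), then histogram-encode the sorted stream as (gap to previous second, event count) pairs. How the two output streams are entropy-coded afterwards is not shown:

```python
from itertools import groupby

def encode_timestamps(ts):
    # Stable sort: order[i] is the original position of the i-th sorted value.
    order = sorted(range(len(ts)), key=lambda i: ts[i])
    displacements = [pos - i for i, pos in enumerate(order)]  # mostly near zero
    sorted_ts = [ts[i] for i in order]
    # Histogram encoding of the sorted values: one (gap, count) pair per second.
    histogram, prev = [], 0
    for second, group in groupby(sorted_ts):
        histogram.append((second - prev, sum(1 for _ in group)))
        prev = second
    return displacements, histogram

def decode_timestamps(displacements, histogram):
    sorted_ts, prev = [], 0
    for gap, count in histogram:
        prev += gap
        sorted_ts.extend([prev] * count)
    out = [0] * len(sorted_ts)
    for i, d in enumerate(displacements):   # undo the sorting permutation
        out[i + d] = sorted_ts[i]
    return out
```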
I use something that I call VIPCompression a lot. I identify the top K most common values in a stream of n elements and replace the stream with tokens in 0..K: the first K tokens stand for those top values, and a sentinel token stands for "other"; this token stream is compressed using ANS. The not-so-common elements then go into a side stream with a different representation (sketched below).
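A sketch of VIPCompression as described, minus the entropy stages (the token stream would go through ANS, and the exception stream through its own representation):

```python
from collections import Counter

def vip_encode(stream, k):
    # Top-k most common values get token ids 0..k-1; everything else maps
    # to the sentinel token k and is spilled to a side stream of exceptions.
    vip = [v for v, _ in Counter(stream).most_common(k)]
    index = {v: i for i, v in enumerate(vip)}
    sentinel = k
    tokens, exceptions = [], []
    for v in stream:
        t = index.get(v, sentinel)
        tokens.append(t)
        if t == sentinel:
            exceptions.append(v)
    return vip, tokens, exceptions

def vip_decode(vip, tokens, exceptions):
    it = iter(exceptions)
    return [vip[t] if t < len(vip) else next(it) for t in tokens]
```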
Repos are dictionary encoded: the remaining indices fit in 3 bytes. The top byte is encoded using ANS; the two remaining bytes are compressed with zstd. In both cases (owner and name suffix) we use VIP coding with K=1023. We then concatenate the remaining owners and suffixes, \n-separated, into the same blob of text and apply zstd to it (byte-plane split sketched below).
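A sketch of the byte split for the 3-byte dictionary indices; whether the low two bytes are stored as planes or interleaved is my guess, and the ANS (top plane) and zstd (low planes) stages are left out:

```python
def encode_repo_indices(indices):
    # Each dictionary index fits in 3 bytes; split into byte planes so the
    # highly skewed top byte can go through ANS and the rest through zstd.
    assert all(0 <= i < 1 << 24 for i in indices)
    top = bytes((i >> 16) & 0xFF for i in indices)
    mid = bytes((i >> 8) & 0xFF for i in indices)
    low = bytes(i & 0xFF for i in indices)
    return top, mid + low

def decode_repo_indices(top, rest, n):
    mid, low = rest[:n], rest[n:]
    return [(t << 16) | (m << 8) | l for t, m, l in zip(top, mid, low)]
```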
I think most of the remaining headroom is in the repo column. I suspect it will be difficult to improve event IDs/types/timestamps by a lot at this point.