
Add jakedgy codec: 6,402,499 bytes #11

Merged
agavra merged 2 commits into agavra:main from jakedgy:jakedgy-codec
Jan 30, 2026

Conversation


jakedgy (Contributor) commented on Jan 30, 2026

Summary

This codec was developed as an experiment with Claude Code's Ralph Loop - an automated iteration system that explores solution spaces. The goal was to see how far automated exploration could push compression on this problem.

The Ralph Loop iteratively tested various compression techniques, keeping what worked and discarding what didn't. This submission represents the result of that exploration rather than a carefully hand-crafted solution.

Result: 6,402,499 bytes (~6.40 MB) on the 1M event dataset.

Techniques Discovered by the Loop

  • 2-bit category encoding for ID deltas - Common zigzag values (0, 2, 4 representing deltas of 0, +1, +2) encoded inline; others as varint exceptions
  • 2-bit category encoding for timestamp deltas - Common values (0, 1, 2 seconds) encoded inline
  • Columnar layout with per-column zstd level 22 compression
  • Timestamp sorting within row groups for optimal ID delta encoding
  • 140K row group size (empirically tuned)

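The last two bullets describe the outer loop: events are split into fixed 140K-row groups, and each group is sorted by timestamp so that GitHub's roughly time-ordered event IDs yield small deltas. A hypothetical shape for that pipeline, reusing the helpers above (the event-dict field names are assumptions):

```python
ROW_GROUP = 140_000  # group size from the PR, empirically tuned

def encode_row_groups(events):
    """Split events into fixed-size row groups, sort each group by
    timestamp so ID deltas stay small, then push each column through
    the 2-bit category encoder and per-column zstd from above."""
    blobs = []
    for start in range(0, len(events), ROW_GROUP):
        group = sorted(events[start:start + ROW_GROUP],
                       key=lambda e: e["timestamp"])
        ids = [e["id"] for e in group]
        deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
        packed, exc = encode_2bit_categories([zigzag(d) for d in deltas])
        blobs.append(cctx.compress(packed) + cctx.compress(exc))
        # ...timestamp, type, and repo columns would be encoded the same way
    return blobs
```
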
What Didn't Work

The loop also tried several approaches that made compression worse:

  • RLE for sparse columns (zstd already handles this)
  • Delta encoding for repo indices (no locality in alphabetical order)
  • 3-bit category encoding (overhead > benefit)
  • Sorting by (type, timestamp) (breaks ID delta locality)
  • Skipping zstd for category-encoded columns

Validation

Tested on 5 different GitHub Archive datasets spanning March 2023 to January 2025 to verify the techniques generalize and aren't overfit to the training data.

πŸ€– Generated with Claude Code using the Ralph Loop

jakedgy and others added 2 commits on January 30, 2026 at 12:33
Experimental codec developed using Claude Code's Ralph Loop for
automated iteration. Achieves ~6.40MB on the 1M event dataset.

Key techniques:
- 2-bit category encoding for ID deltas (0/2/4 as common zigzag values)
- 2-bit category encoding for timestamp deltas (0/1/2 as common values)
- Columnar layout with per-column zstd compression
- Timestamp sorting within row groups for optimal ID delta encoding
- 140K row group size

Validated on 5 different GitHub Archive datasets spanning March 2023
to January 2025 - consistently beats the previous best by 5.6-10%.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

agavra (Owner) commented on Jan 30, 2026

Thanks for the submission! I guess this proves that our AI overlords are pretty darn good 😨. Now I'm extra hoping that someone comes in and improves on this with manual human intuition, haha!

There's just a format issue causing the check failure; I'll merge now and address it when I update the leaderboard.

Confirmed with CI/CD:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Codec                  β”‚           Size β”‚ vs Naive   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Naive                  β”‚    210,727,389 β”‚   baseline β”‚
β”‚ jakedgy                β”‚      6,402,499 β”‚     -97.0% β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

agavra merged commit c70ccc3 into agavra:main on Jan 30, 2026