
Add jakedgy codec: 6,402,499 bytes #11

Merged
agavra merged 2 commits into agavra:main from jakedgy:jakedgy-codec
Jan 30, 2026

Conversation


jakedgy (Contributor) commented on Jan 30, 2026

Summary

This codec was developed as an experiment with Claude Code's Ralph Loop - an automated iteration system that explores solution spaces. The goal was to see how far automated exploration could push compression on this problem.

The Ralph Loop iteratively tested various compression techniques, keeping what worked and discarding what didn't. This submission represents the result of that exploration rather than a carefully hand-crafted solution.

Result: 6,402,499 bytes (~6.40 MB) on the 1M event dataset.

Techniques Discovered by the Loop

  • 2-bit category encoding for ID deltas - Common zigzag values (0, 2, 4 representing deltas of 0, +1, +2) encoded inline; others as varint exceptions
  • 2-bit category encoding for timestamp deltas - Common values (0, 1, 2 seconds) encoded inline
  • Columnar layout with per-column zstd level 22 compression
  • Timestamp sorting within row groups for optimal ID delta encoding
  • 140K row group size (empirically tuned)

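The last two bullets describe the outer loop: events are split into fixed 140K-row groups, and each group is sorted by timestamp so that GitHub's roughly time-ordered event IDs yield small deltas. A hypothetical shape for that pipeline, reusing the helpers above (the event-dict field names are assumptions):

```python
ROW_GROUP = 140_000  # group size from the PR, empirically tuned

def encode_row_groups(events):
    """Split events into fixed-size row groups, sort each group by
    timestamp so ID deltas stay small, then push each column through
    the 2-bit category encoder and per-column zstd from above."""
    blobs = []
    for start in range(0, len(events), ROW_GROUP):
        group = sorted(events[start:start + ROW_GROUP],
                       key=lambda e: e["timestamp"])
        ids = [e["id"] for e in group]
        deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
        packed, exc = encode_2bit_categories([zigzag(d) for d in deltas])
        blobs.append(cctx.compress(packed) + cctx.compress(exc))
        # ...timestamp, type, and repo columns would be encoded the same way
    return blobs
```
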
What Didn't Work

The loop also tried several approaches that made compression worse:

  • RLE for sparse columns (zstd already handles this)
  • Delta encoding for repo indices (no locality in alphabetical order)
  • 3-bit category encoding (overhead > benefit)
  • Sorting by (type, timestamp) (breaks ID delta locality)
  • Skipping zstd for category-encoded columns

Validation

Tested on 5 different GitHub Archive datasets spanning March 2023 to January 2025 to verify the techniques generalize and aren't overfit to the training data.

πŸ€– Generated with Claude Code using the Ralph Loop

jakedgy and others added 2 commits on January 30, 2026 at 12:33
Experimental codec developed using Claude Code's Ralph Loop for
automated iteration. Achieves ~6.40MB on the 1M event dataset.

Key techniques:
- 2-bit category encoding for ID deltas (0/2/4 as common zigzag values)
- 2-bit category encoding for timestamp deltas (0/1/2 as common values)
- Columnar layout with per-column zstd compression
- Timestamp sorting within row groups for optimal ID delta encoding
- 140K row group size

Validated on 5 different GitHub Archive datasets spanning March 2023
to January 2025 - consistently beats the previous best by 5.6-10%.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

agavra (Owner) commented on Jan 30, 2026

Thanks for the submission! I guess this proves that our AI overlords are pretty darn good 😨. Now I'm extra hoping that someone comes in and improves on this with manual human intuition, haha!

There's just a format issue causing the check failure; I'll merge now and address it when I update the leaderboard.

Confirmed with CI/CD:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Codec                  β”‚           Size β”‚ vs Naive   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Naive                  β”‚    210,727,389 β”‚   baseline β”‚
β”‚ jakedgy                β”‚      6,402,499 β”‚     -97.0% β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

agavra merged commit c70ccc3 into agavra:main on Jan 30, 2026