
[Non-record] Codebooks! - val_bpb 1.2067 (3-seed mean)#1433

Open
mtybadger wants to merge 3 commits into openai:main from mtybadger:spruce/04-02

Conversation


@mtybadger commented Apr 7, 2026

N.B. this is not a competitive record submission, but it was done under record conditions and will hopefully make its way into a leaderboard submission at some point!

val bpb: 1.20667 (3-seed mean, std=0.00365)

| Seed | Steps | Pre-quant BPB | Post-quant BPB | Sliding BPB | Artifact |
| --- | --- | --- | --- | --- | --- |
| 42 | 4822 | 1.10450 | 1.2280 | 1.211 | 15863950 |
| 1024 | 4826 | 1.10397 | 1.21940 | 1.20207 | 15881168 |
| 1337 | 4866 | 1.10417 | 1.22427 | 1.20694 | 15859963 |
| Mean | 4838 | 1.10421 | 1.22389 | 1.20667 | 15868360 |

I've been back for a day or two and have been messing about with VQ/codebook approaches; since the competition seems to be dying down a bit, I thought I'd do a little write-up for the benefit of anyone else interested in this line of work. Putting together a record submission requires a bunch of systems/TTT work that I don't want to do anyway. This PR is based on the baseline in #1218 by @clarkkev.

In general, the motivation for trying codebooks is that vector quantization may be able to get us under the int6 limit for MLP/attn weights, down to 1-3 bits per weight. Codebooks are certainly the most powerful mode of compression if you know which codes to use, and knowing that means knowing more about our model's structure than Brotli/LZMA does. Unfortunately I'm not there yet: while I can get down to around ~1.20 bpb under competition conditions with this setup, and I can squeeze in another 2 layers, I can't close the quant gap. I do want to work harder on this over the next few weeks, but I'm going to do some systems work elsewhere first because I wanna learn CuTeDSL.

E8 Lattice Fixed Codebook

I took this from the QuIP# paper, one of several (along with AQLM and VPTQ) that I've nabbed ideas from. In our environment there's a huge upside to a fixed codebook, since we then don't need to store the codebook itself and save 1-2 MB. In particular, this codebook is built on the densest 8D sphere packing (the E8 lattice), so it should be near-optimal for isotropic 8D blocks. I chunk the weights into 8D blocks and store 16-bit indices, for 2.0 bits per weight, plus an 8-bit scale per block for 3.0 bpw total. Pushing the scale vector below 8 bits damages things significantly.
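To make the storage arithmetic concrete, here's a minimal numpy sketch of the block/index/scale layout. The codebook here is a random stand-in (the real one is the structured E8-based set, which needs no storage at all), and `quantize_blocks`/`dequantize` are hypothetical helper names, not the PR's code:

```python
import numpy as np

def quantize_blocks(W, codebook):
    """Chunk a weight matrix into 8D blocks and store one codebook
    index plus one scale per block (hypothetical helper)."""
    blocks = W.reshape(-1, 8)                               # (n_blocks, 8)
    scales = np.linalg.norm(blocks, axis=1, keepdims=True) + 1e-8
    unit = blocks / scales                                  # unit-norm blocks
    # nearest codeword by squared distance, expanded to avoid a
    # (n_blocks, n_codes, 8) temporary
    d2 = ((unit ** 2).sum(1, keepdims=True)
          - 2.0 * unit @ codebook.T
          + (codebook ** 2).sum(1))
    idx = d2.argmin(axis=1).astype(np.uint16)
    return idx, scales.squeeze(1)

def dequantize(idx, scales, codebook, shape):
    return (codebook[idx] * scales[:, None]).reshape(shape)

rng = np.random.default_rng(0)
# Stand-in: a small random unit-norm codebook; real indices are 16-bit
codebook = rng.standard_normal((4096, 8)).astype(np.float32)
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

W = rng.standard_normal((64, 64)).astype(np.float32)
idx, scales = quantize_blocks(W, codebook)
W_hat = dequantize(idx, scales, codebook, W.shape)

# Storage arithmetic from the text: a 16-bit index per 8 weights is
# 2.0 bpw; an 8-bit scale per 8 weights adds 1.0 bpw, for 3.0 total
bpw = (16 + 8) / 8
```

With a structured lattice codebook, the argmin over all codes would be replaced by a closed-form nearest-lattice-point search.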

Hadamard Transform

This was the other part of QuIP#: the idea is that applying a random sign flip plus rotation to the blocked weights makes them more isotropic and closer to i.i.d. Gaussian. Weirdly, I didn't find this worked as well as the paper suggests, and I think that's because the model weights are already pretty isotropic. It may still confer a small benefit, on the order of 0.002 bpb, so it stays.
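A minimal sketch of the randomized Hadamard transform (Sylvester construction plus a random sign vector); the scaled transform is orthogonal, so it's exactly invertible and costs nothing at reconstruction time:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
n = 8
H = hadamard(n) / np.sqrt(n)              # orthogonal: H @ H.T == I
signs = rng.choice([-1.0, 1.0], size=n)   # random sign flip D

blocks = rng.standard_normal((512, n))
rotated = (blocks * signs) @ H.T          # y = H D b, in row form
restored = (rotated @ H) * signs          # exact inverse (H symmetric, orthogonal)
```

Since the sign vector is the only random state, a single stored seed (or nothing, if it's fixed) recovers the transform at decode time.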

Hessian-aware Assignment + Scales

This was definitely the best thing I did. I used the GPTQ machinery already in the baseline and repurposed its Hessians to produce the metric by which codebook indices and scales are selected. This was dramatically better than Euclidean distance at maintaining val_bpb, which is understandable: raw MSE doesn't necessarily capture downstream performance, while a Hessian-weighted error picks the codebook compression least damaging to each weight's role in the loss, similar to GPTQ.
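A sketch of what Hessian-aware selection could look like with a diagonal Hessian proxy; `hessian_aware_assign` is a hypothetical helper, not the PR's code, and the closed-form scale comes from minimizing sum h*(w - s*c)^2 over s:

```python
import numpy as np

def hessian_aware_assign(block, codebook, h_diag):
    """Pick the codeword and closed-form scale minimizing the
    Hessian-weighted error e^T diag(h) e instead of plain MSE.
    h_diag is a diagonal proxy for the GPTQ-style input Hessian."""
    best_cost, best_idx, best_scale = np.inf, 0, 0.0
    for i, c in enumerate(codebook):
        denom = (h_diag * c * c).sum()
        if denom <= 0:
            continue
        s = (h_diag * block * c).sum() / denom   # argmin_s sum h*(w - s*c)^2
        e = block - s * c
        cost = (h_diag * e * e).sum()
        if cost < best_cost:
            best_cost, best_idx, best_scale = cost, i, s
    return best_idx, best_scale

# Toy demo: the same block gets a different code depending on which
# coordinate the Hessian says matters more
block = np.zeros(8); block[0] = 1.0; block[7] = 1.0
codebook = np.eye(8)
h_front = np.array([10.0, 1, 1, 1, 1, 1, 1, 1])
h_back = np.array([1.0, 1, 1, 1, 1, 1, 1, 10.0])
i_front, s_front = hessian_aware_assign(block, codebook, h_front)
i_back, s_back = hessian_aware_assign(block, codebook, h_back)
```

Under plain MSE the two codewords tie; the Hessian weighting breaks the tie toward the coordinate that matters more for the loss.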

Lightweight Codebook Penalties

Unfortunately, while I would really like to do QAT with this setup, it's painfully slow: the rotation is relatively fast, but materializing the codebook and doing the assignment above is very time-consuming, and there's the usual VQ problem of no obvious backward pass, requiring an STE or other hacks. Since we're in such a compute-constrained regime, I have to settle for proxies to QAT, and indeed QAT hasn't worked great in the other record entries so far. I might soon do a non-record submission with super-long step times where I can run codebook quantization in the forward pass.

For now, I simply run an approximate version of the codebook quantization every 16 steps, then apply an auxiliary L2 loss that pulls weights toward their codebook counterparts, which I turn on near the end of training. I tried some cooler ideas, but they worked only about as well; again, I think doing full QAT would be ideal.
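A toy numpy version of the proxy: refresh a quantized snapshot every 16 steps and add the gradient of an L2 penalty pulling weights toward it. The uniform grid here is a cheap stand-in for the real approximate codebook pass, and the task gradient is zeroed out so the pull is visible; none of these names come from the PR:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))
lam, lr, refresh_every = 0.5, 0.1, 16
grid = 0.25                                   # toy fixed quantization grid

def approx_quantize(W):
    # Cheap stand-in for the real approximate codebook quantization
    return np.round(W / grid) * grid

target = approx_quantize(W)
for step in range(200):
    if step % refresh_every == 0:
        target = approx_quantize(W)           # refresh snapshot every 16 steps
    task_grad = np.zeros_like(W)              # stand-in for the real LM gradient
    aux_grad = 2.0 * lam * (W - target)       # d/dW of lam * ||W - target||^2
    W -= lr * (task_grad + aux_grad)
# With no competing task gradient, the penalty pulls W onto the grid
```

In the real setup the task gradient competes with the pull, so the penalty weight controls how hard weights are herded toward representable points before the final quantization pass.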

Outlier Paths

One gimme is always to provide a route around quantization for particularly difficult tensors. I had about 700 kB left, so I simply let the tensors with the worst reconstruction error fall back to int8. This earned back a tiny bit of bpb.
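The selection can be sketched as a greedy pick under the leftover byte budget; the helper name and the accounting (charging the full int8 size rather than the delta over the codebook encoding it replaces) are assumptions, not the PR's code:

```python
def pick_outliers(errors, int8_sizes, budget_bytes=700_000):
    """Spend the leftover byte budget on the worst-reconstructed
    tensors, falling them back to int8 (hypothetical helper).
    errors: {tensor_name: reconstruction_error}
    int8_sizes: {tensor_name: bytes if stored as int8}"""
    fallback, spent = [], 0
    for name in sorted(errors, key=errors.get, reverse=True):
        cost = int8_sizes[name]
        if spent + cost <= budget_bytes:
            fallback.append(name)
            spent += cost
    return fallback, spent

# Toy example: worst tensors first, until the ~700 kB budget runs out
errors = {"mlp.0": 3.0, "attn.5": 1.0, "mlp.7": 2.0}
sizes = {"mlp.0": 400_000, "attn.5": 500_000, "mlp.7": 200_000}
chosen, spent = pick_outliers(errors, sizes)
```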

Reject Bin

Some things I tried that didn't work:

Multiple codebooks sound like an absolutely awesome idea (I love the AQLM paper), but I found them hard to optimize, particularly codebooks intended to store residual corrections. AQLM itself involves some really gnarly machinery, since you're solving a joint optimization over multiple discrete objects, and the extra codebooks take up a lot of space. I think some kind of hierarchical/residual/additive codebook scheme would be cool, but I need to figure out why this codebook isn't working great first.

Shared per-type codebooks: one idea that sounds great is storing one codebook for MLP and one for attention, but that requires storing two codebooks, which wastes space. The sharing worked, but not well enough to justify the extra storage over a single codebook shared between all tensors.

Learning codebooks in general: since these are discrete clusterings we can't use gradient descent directly, so people commonly use k-means. That takes a lot of time (it isn't well accelerated by GPUs) and doesn't optimize the codebook for our downstream goal of compression. Ultimately we have a choice over (a) what the entries in the codebook are and (b) which index to pick per block, and we can only optimize one at a time.
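For reference, the alternation described above is exactly Lloyd's k-means: one half-step fixes the codewords and picks the best index per block, the other fixes the assignments and updates each codeword. A toy sketch:

```python
import numpy as np

def kmeans_codebook(blocks, k, iters=10, seed=0):
    """Lloyd's alternation: (b) fix codewords, pick best index per
    block; (a) fix assignments, set each codeword to its cluster mean.
    Each half-step optimizes one choice while holding the other fixed."""
    rng = np.random.default_rng(seed)
    codebook = blocks[rng.choice(len(blocks), k, replace=False)].copy()
    assign = np.zeros(len(blocks), dtype=int)
    for _ in range(iters):
        d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)                  # (b) index per block
        for j in range(k):                          # (a) codeword per cluster
            members = blocks[assign == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook, assign

# Toy check: two well-separated 8D clusters get recovered
rng = np.random.default_rng(1)
blocks = np.concatenate([rng.normal(0.0, 0.1, (50, 8)),
                         rng.normal(100.0, 0.1, (50, 8))])
codebook, assign = kmeans_codebook(blocks, k=2)
```

Each half-step monotonically reduces the MSE objective, but as the text notes, that objective is a proxy: nothing here sees the downstream compression target.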

Entropy-weighted assignment: I tried various gambits to encourage the model to reuse codes where it could; this worked and decreased compressed storage size as expected, but it damaged performance by more than it saved.
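One way such a gambit could be implemented (an assumption, not necessarily what was tried): add an estimated code length to the assignment cost, so frequently-used codes become cheaper to reuse:

```python
import numpy as np

def entropy_weighted_assign(blocks, codebook, lam):
    """Assignment cost = distortion + lam * estimated code length.
    Reusing popular codes shortens the entropy-coded index stream,
    at the price of extra distortion."""
    counts = np.ones(len(codebook))              # Laplace-smoothed usage
    assign = np.empty(len(blocks), dtype=int)
    for t, b in enumerate(blocks):
        d2 = ((codebook - b) ** 2).sum(axis=1)
        codelen = -np.log2(counts / counts.sum())
        assign[t] = np.argmin(d2 + lam * codelen)
        counts[assign[t]] += 1
    return assign

rng = np.random.default_rng(0)
blocks = rng.standard_normal((64, 8))
codebook = rng.standard_normal((32, 8))
nn = entropy_weighted_assign(blocks, codebook, lam=0.0)      # pure nearest-neighbor
greedy = entropy_weighted_assign(blocks, codebook, lam=1e6)  # collapses to one code
```

The extremes illustrate the trade-off: lam=0 recovers plain nearest-neighbor assignment, while a huge lam collapses every block onto a single code.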

Mega-bitcrushed scales: as expected, going below 2 bpw with this setup produced completely incoherent models, which makes sense; these are not BitNets.

Voronoi auxiliary loss: I had the idea that a loss punishing weights for sitting near the boundary between codebook cells would act as a regularizer; it kind of worked, but not as well as the simpler L2 auxiliary loss described above.
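A sketch of one possible boundary penalty (my own formulation, not necessarily the one tried): the ratio of nearest to second-nearest squared codeword distance is ~0 when a block sits on a codeword and ~1 when it straddles a Voronoi boundary:

```python
import numpy as np

def voronoi_boundary_penalty(blocks, codebook):
    """Mean ratio of nearest to second-nearest squared codeword
    distance; penalizing it pushes weights away from cell boundaries
    where the assignment is ambiguous."""
    d2 = np.sort(((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1),
                 axis=1)
    return (d2[:, 0] / (d2[:, 1] + 1e-12)).mean()

# Toy check: a block on a codeword vs. a block midway between two
codebook = np.vstack([np.zeros(8), np.r_[2.0, np.zeros(7)]])
at_codeword = np.zeros((1, 8))
on_boundary = np.r_[1.0, np.zeros(7)][None, :]
low = voronoi_boundary_penalty(at_codeword, codebook)
high = voronoi_boundary_penalty(on_boundary, codebook)
```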

KL/distilling quantized version: too slow

Snapping: I'm a huge proponent of doing dumb stuff first, so I tried just snapping the weights to their quantized locations every few steps during training. This actually worked surprisingly well, better than many of the gigabrain methods I tried.

Conclusion

I'm sad I wasn't able to reduce the quantization gap further; the raw pre-quantization model with 13 layers could certainly top the leaderboard without any TTT, so the challenge is just fitting the codebook structure to the model. As I said, I suspect that better QAT may unlock codebooks at a competitive level.

The idea is that this all gives me very fine-grained control over where in the model to spend bytes; the codebook lets you set the bpw of a given tensor and allocate more or less to the embedding, more or less to norms vs. directions, etc. With a better understanding of the latent space (or more control over it via regularization), it should be possible to design a codebook for this particular parameter-golf family of models.

As I said, I think the competition is dying down, excited to get to work on some kernels.

taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…ramLite reversal, new directions

Subagent re-verified the 3 still-novel patches (TabHash, GatedAttention, MTP)
against the latest 25 open PRs. Zero hits — they remain uncontested, even
though only MTP shows marginal training-loss benefit at our scale.

EngramLite (Patch 22) verdict SOFT-REVERSED: EL2 cycle-2 = 3.2742, only
+0.0008 above champion. Tied within noise, not falsified.

Spend ~$1.40 / $36 (6% utilization). Pod healthy.

New comp directions worth considering for next research fire: Per-Sample
SLOT (legal variant of suspicious PR openai#1430), Codebook VQ compression
(PR openai#1433), ByteJEPA (PR openai#1443 — non-competitive but novel category).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@MatoTeziTanka

Community Review — [Non-record] Codebooks! - val_bpb 1.2067 (3-seed mean)

BPB: 1.2067 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA d5d10990ab97, file records/track_non_record_16mb/2026-04-07_Codebooks/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 17.86s, dim=512, layers=11, vocab=4096, code=86163 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka / The Agora. Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
