[Non-record] Codebooks! - val_bpb 1.2067 (3-seed mean) #1433
mtybadger wants to merge 3 commits into openai:main
Conversation
Community Review — [Non-record] Codebooks! - val_bpb 1.2067 (3-seed mean)

BPB: 1.2067 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache.

What I found in the code (head SHA): static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 17.86s, dim=512, layers=11, vocab=4096, code=86163 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
N.B. this is not a competitive record submission, but it was done under record conditions and will hopefully make its way into a leaderboard submission at some point!
val bpb: 1.20667 (3-seed mean, std=0.00365)
I've been back for a day or two and have been messing about with VQ/codebook approaches; the competition seems to be dying down a bit, so I thought I'd do a little write-up for the benefit of anyone else interested in this line of work. Putting together a record submission requires a bunch of systems/TTT work anyway that I don't want to do. This PR is based on the baseline in #1218 by @clarkkev.
In general, the motivation for trying codebooks is that vector quantization may be able to get us under the int6 limit for MLP/attn weights, down to 1-3 bits per weight. Codebooks are certainly the most powerful mode of compression if you know which codes to use, and that's downstream of knowing more about our model's structure than Brotli/LZMA does. Unfortunately I'm not there yet - while I can get to around ~1.20 bpb under competition conditions with this setup, and I can squeeze in another two layers, I can't close the quant gap. I do want to work a little harder on this over the next few weeks, but I'm going to do some systems work elsewhere first because I want to learn CuTeDSL.
E8 Lattice Fixed Codebook
I took this from the QuIP# paper, one of several (together with AQLM and VPTQ) that I've nabbed ideas from. In our environment there's huge upside to a fixed codebook: since we don't need to store the codebook itself, we save 1-2 MB. In particular, the E8 lattice gives the densest 8D sphere packing, so it should be close to ideal here. I chunk the weights into 8D blocks and store a 16-bit index per block, for 2.0 bpw, then add an 8-bit per-block scale for 3.0 bpw total. Pushing the scale below 8 bits appears to damage things significantly.
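To make the layout concrete, here's a minimal sketch of the blocked quantization in PyTorch. It assumes a fixed `codebook` tensor of shape (65536, 8) is already enumerated (building the actual E8-derived codebook is out of scope here), and the function names are mine, not the submission's:

```python
import torch

def quantize_blocked(w, codebook, chunk=1024):
    # w: weight tensor, numel divisible by 8
    # codebook: (65536, 8) fixed codebook (e.g. E8-derived points);
    #           since it's fixed, it costs zero stored bytes
    blocks = w.reshape(-1, 8)                                     # (B, 8)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-8)
    # 8-bit per-block scale, stored as a uint8 fraction of one fp32 max
    smax = scale.max()
    q_scale = torch.round(scale / smax * 255).clamp(1, 255).to(torch.uint8)
    deq_scale = q_scale.float() / 255 * smax
    unit = blocks / deq_scale
    # nearest-codebook assignment, chunked to bound the (chunk, 65536)
    # distance matrix; idx is int64 in memory but packs to 16 bits on disk
    idx = torch.empty(blocks.shape[0], dtype=torch.long)
    for i in range(0, blocks.shape[0], chunk):
        d = torch.cdist(unit[i:i + chunk], codebook)
        idx[i:i + chunk] = d.argmin(dim=1)
    return idx, q_scale, smax  # 16 bits/index + 8 bits/scale = 3.0 bpw

def dequantize_blocked(idx, q_scale, smax, codebook, shape):
    blocks = codebook[idx] * (q_scale.float() / 255 * smax)
    return blocks.reshape(shape)
```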
Hadamard Transform
This was the other part of QuIP#; the idea is that applying a random sign flip plus rotation to the blocked weights makes them more isotropic and i.i.d. Gaussian. Weirdly, I didn't find this worked as well as the paper suggested, and I think that's because the model weights are already pretty isotropic. However, it may confer a small benefit on the order of 0.002 bpb, so it stays.
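A minimal sketch of the transform over the same 8D blocks; because the sign pattern and the Hadamard matrix are regenerable from a seed, it adds nothing to the stored artifact (again, names are mine):

```python
import torch

def hadamard8():
    # Sylvester construction; the result is orthonormal and symmetric,
    # hence its own inverse
    H = torch.ones(1, 1)
    while H.shape[0] < 8:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / H.shape[0] ** 0.5

def randomized_hadamard(w, seed=0, inverse=False):
    # random sign flip + rotation per 8D block: pushes block entries
    # toward isotropic/iid Gaussian before codebook assignment
    g = torch.Generator().manual_seed(seed)
    signs = torch.randint(0, 2, (8,), generator=g).float() * 2 - 1
    H = hadamard8()
    blocks = w.reshape(-1, 8)
    out = (blocks @ H) * signs if inverse else (blocks * signs) @ H
    return out.reshape(w.shape)
```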
Hessian-aware Assignment + Scales
This was definitely the best thing I did. I repurposed the GPTQ Hessians already in the baseline to produce the metric for selecting codebook indices and scales. This was dramatically better than Euclidean distance at maintaining val_bpb, which is understandable: raw MSE on the weights doesn't necessarily capture downstream performance, while the Hessian-weighted error lets us pick the codebook entry that is least damaging to the weight's role in the loss, similar to GPTQ.
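As a sketch of what "Hessian-aware" means here: score each candidate code by diagonal-Hessian-weighted squared error instead of plain MSE. `h_diag` - a per-weight diagonal Hessian proxy, e.g. accumulated squared input activations from the baseline's GPTQ machinery - is an assumed input:

```python
import torch

def hessian_weighted_assign(blocks, codebook, h_diag, chunk=1024):
    # cost(b, c) = sum_j h_j * (b_j - c_j)^2, expanded so we never
    # materialize a (chunk, K, 8) tensor:
    #   sum_j h_j b_j^2  -  2 sum_j h_j b_j c_j  +  sum_j h_j c_j^2
    idx = torch.empty(blocks.shape[0], dtype=torch.long)
    cb2 = codebook ** 2                                    # (K, 8)
    for i in range(0, blocks.shape[0], chunk):
        b, h = blocks[i:i + chunk], h_diag[i:i + chunk]    # (chunk, 8)
        cost = ((b ** 2 * h).sum(1, keepdim=True)
                - 2.0 * (b * h) @ codebook.T
                + h @ cb2.T)                               # (chunk, K)
        idx[i:i + chunk] = cost.argmin(dim=1)
    return idx
```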
Lightweight Codebook Penalties
Unfortunately, while I would really like to do QAT with this setup, it's painfully slow - the rotation part is relatively fast, but materializing the codebook and doing the assignment above is very time-consuming, and there's the usual VQ problem that it has no obvious backwards pass, so you need STE or other hacks. Since we're in such a compute-constrained regime, I have to settle for proxies to QAT - and indeed QAT hasn't worked great so far in the other record entries. I might soon do a non-record submission with super-long step times where I can do codebook quantization in the forward pass.
For now, I simply run an approximate version of the codebook assignment every 16 steps, and apply an auxiliary L2 loss that pulls weights toward their codebook counterparts, which I turn on at the end of training (sketched below). I tried some cooler ideas, but they only worked about as well; again, I think doing full QAT would be ideal.
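A minimal sketch of that proxy loop, reusing `quantize_blocked`/`dequantize_blocked` from the earlier sketch; `aux_weight` and `aux_start_step` are illustrative placeholders - only the 16-step refresh and the late turn-on come from the text:

```python
targets = {}  # per-parameter codebook reconstructions

def vq_aux_loss(named_params, codebook, step, aux_start_step,
                aux_weight=1e-3):
    # named_params: a *list* of (name, tensor) pairs (iterated twice)
    # refresh the approximate codebook targets every 16 steps
    if step % 16 == 0:
        with torch.no_grad():
            for name, p in named_params:
                q = quantize_blocked(p.data, codebook)
                targets[name] = dequantize_blocked(*q, codebook, p.shape)
    # L2 pull toward the codebook, enabled only late in training
    if step < aux_start_step:
        return 0.0
    return aux_weight * sum(
        (p - targets[name]).pow(2).mean() for name, p in named_params)

# usage inside the training step:
#   loss = task_loss + vq_aux_loss(params, codebook, step, aux_start_step)
```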
Outlier Paths
One gimme is always to provide a route around quantization for particularly difficult tensors. I had about 700 KB left, so I simply let the tensors with the worst reconstruction error fall back to int8. This earned back a tiny bit of bpb; the selection is sketched below.
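A sketch of that selection, assuming per-tensor reconstruction errors have already been measured; the greedy budget spend is my framing, not necessarily exactly what the submission does:

```python
def pick_int8_fallbacks(named_tensors, recon_errors, budget_bytes=700_000):
    # spend the leftover byte budget on the tensors the codebook hurts
    # most, storing those as int8 (1 byte/weight) instead
    order = sorted(named_tensors, key=lambda nt: recon_errors[nt[0]],
                   reverse=True)
    chosen, spent = set(), 0
    for name, t in order:
        cost = t.numel()  # int8 payload; the per-tensor scale is negligible
        if spent + cost <= budget_bytes:
            chosen.add(name)
            spent += cost
    return chosen
```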
Reject Bin
Some things I tried that didn't work:
Multiple codebooks sound like an absolutely awesome idea (I love the AQLM paper), but I found them hard to optimize, particularly codebooks intended to store residual corrections. AQLM itself has some really gnarly machinery, since you're solving a joint optimization over multiple discrete objects. They also take up a lot of space. I think some kind of hierarchical/residual/additive codebook scheme would be cool, but I need to figure out why this single codebook isn't working great first.
Shared codebooks: one idea that sounds great is storing one codebook for MLP and one for attn, but that obviously requires storing two codebooks, which wastes space. The per-type sharing did work well, but a single codebook shared across all tensors worked well enough that the split wasn't worth it.
Learning codebooks in general: again, since these are discrete clusterings, we can't really use gradient descent, so people commonly use k-means; this takes a lot of time since it's not well accelerated by GPUs, and it doesn't let you optimize the codebook for our actual downstream goal of compression. Ultimately we have a choice over a) what the entries in the codebook are and b) which index to pick, and we can only optimize one at a time.
Entropy-weighted assignment: I tried various gambits to encourage the model to reuse codes when it could; this worked and decreased compressed storage size as expected, but damaged performance more than the bytes saved were worth.
Mega-bitcrushed scales: as expected, going below 2 bpw with this setup produced completely incoherent models, which makes sense - these are not BitNets.
Voronoi auxiliary loss: I had the idea that a loss punishing weights for sitting on the boundary between codebook cells would act as a regularizer; it kind of worked, but not as well as the simpler L2 auxiliary loss described above.
KL/distilling against the quantized model: too slow.
Snapping: I'm a huge proponent of doing dumb stuff first, so I tried just snapping the model weights to their quantized locations every few steps during training. This actually worked surprisingly well - better than many of the gigabrain methods I tried (sketch below).
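For illustration, a sketch of the snapping pass, reusing the blocked quantizer from the first sketch (the snap interval is illustrative; the text just says "every few steps"):

```python
@torch.no_grad()
def snap_to_codebook(params, codebook):
    # overwrite weights with their quantized reconstruction, so training
    # repeatedly pulls the model back to representable points
    for p in params:
        idx, q_scale, smax = quantize_blocked(p.data, codebook)
        p.data.copy_(
            dequantize_blocked(idx, q_scale, smax, codebook, p.shape))

# in the training loop, e.g. every 32 steps (illustrative):
#   if step % 32 == 0:
#       snap_to_codebook(quantized_params, codebook)
```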
Conclusion
I'm sad I wasn't able to reduce the quantization gap further; the raw 13-layer model before quantization could certainly top the leaderboard without any TTT, so the challenge is just fitting the codebook structure to the model. As I said, I suspect that better QAT may unlock codebooks at a competitive level.
The idea is that this all gives me very fine-grained control over where in the model to spend bytes: the codebook makes the bpw of a given tensor explicit, so you can allocate more or fewer bits to the embedding, to norms vs. directions, etc. I think with a better understanding of the latent space (or structure imposed via more regularization) it should be possible to design a codebook for this particular parameter-golf family of models.
As I said, I think the competition is dying down; excited to get to work on some kernels.