
Non-record: Depth Recurrence + Int7 Mixed Quant — val_bpb 1.1324 (3-seed mean) #1453

Open
iverbovoy wants to merge 3 commits into openai:main from iverbovoy:submission/depth-recurrence-int7-mixed-quant

Conversation

@iverbovoy

Summary

  • val_bpb: 1.1324 (3-seed mean, std 0.0131) | ~15.40 MB | 8×H100 SXM, 600s
  • 3 shared blocks × 4 repeats (12 effective layers) with MLP 3× (d=880)
  • Int7 attention (63 levels) + Int5 MLP (16 levels) mixed quantization
  • 8-GPU parallel Hedge Mixer eval (164s)
  • Improves on PR #1384 (1.1441 bpb) by −0.012 bpb

Key Finding

Int7 (63 quantization levels) for attention is the sweet spot between int6 (31) and int8 (127). It recovers 98% of int8's hedge mixer quality while saving ~2MB — enough to widen the model from d=832 MLP 2× to d=880 MLP 3×.

| Quant config | Sliding | Hedge | Size | ≤16 MB |
|---|---|---|---|---|
| Int8 attn + Int5 MLP (d=896) | 1.1760 | 1.1349 | 17.4 MB | ✗ |
| Int7 attn + Int5 MLP (d=880) | 1.1832 | 1.1324 | 15.4 MB | ✓ |
| Int6 attn + Int5 MLP (d=896) | 1.1870 | 1.1480 | 15.4 MB | ✓ |
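To make the mixed-quantization idea concrete, here is a minimal sketch of uniform fake quantization parameterized by a level count, using the 63-level attention / 16-level MLP split quoted above. The function name and the asymmetric per-tensor scheme are illustrative assumptions, not the PR's actual implementation (which may quantize symmetrically or per-channel).

```python
import torch

def fake_quantize(w: torch.Tensor, levels: int) -> torch.Tensor:
    # Uniform quantization to `levels` distinct values over the tensor's range,
    # followed by immediate dequantization ("fake quant") so the rest of the
    # model keeps operating in floating point.
    lo, hi = w.min(), w.max()
    scale = (hi - lo).clamp(min=1e-8) / (levels - 1)
    return torch.round((w - lo) / scale) * scale + lo

# Mixed config from the table above: more levels where attention is sensitive,
# fewer levels for the (larger) MLP weights to save artifact bytes.
attn_w = torch.randn(880, 880)
mlp_w = torch.randn(3 * 880, 880)            # MLP 3x at d=880
attn_q = fake_quantize(attn_w, levels=63)    # "int7" in the PR's naming
mlp_q = fake_quantize(mlp_w, levels=16)      # "int5" in the PR's naming
```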

Evolution

| PR | Score | What changed |
|---|---|---|
| #148 | 1.2196 | Depth recurrence (3×4), cross-repeat skip |
| #784 | 1.2065 | + XSA(4), LeakyReLU², GPTQ-lite |
| #835 | 1.1980 | + Progressive depth training |
| #1384 | 1.1441 | + Hedge Mixer |
| This PR | 1.1324 | + Int7 mixed quant, MLP 3×, parallel hedge |

Test plan

  • 3-seed validation (1337, 42, 7) — mean 1.1324
  • 5-seed variance analysis — mean 1.1361, std 0.0095
  • Artifact size < 16 MB (max 15.40 MB)
  • Eval time < 600s (164s with parallel hedge)
  • Reproduction command in README

…eed mean)

3 shared blocks × 4 repeats (12 effective layers), MLP 3× (d=880),
int7 attention (63 levels) + int5 MLP (16 levels) mixed quantization,
8-GPU parallel Hedge Mixer eval (164s).

Key finding: int7 is the sweet spot for attention quantization —
recovers 98% of int8 hedge quality while saving 2MB for a wider model.

Improves on PR openai#1384 (1.1441) by −0.012 bpb.
@MatoTeziTanka

Community Review — Non-record: Depth Recurrence + Int7 Mixed Quant — val_bpb 1.1324 (3-seed mean)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Summary

PR #1453 implements a Progressive Depth Recurrence model with an Int7 mixed-quantization scheme and a HedgeMixer online ensemble at eval time. The submission is clean.

## Key Checks

### N-gram / Hash Bug (ILLEGAL pattern: target XOR'd into hash key)

NOT PRESENT. The trigram hash key is computed at line 40 (update) and line 62 (scoring) as:

ctx_hash = ((prev2 * 36313) ^ (x_batch * 27191)) % self.TRI_HASH

This hashes only the context tokens t[i-2] and t[i-1] (or prev2 and x_batch). The target y_batch / t[2:] is used only as the lookup dimension into tri_counts[ctx_hash, y_batch] to retrieve the conditional probability — this is legal and identical to how a standard n-gram table is queried.

### BigramHashEmbedding (lines 865-876)

Hash formula: (36313 * t[i] ^ 27191 * t[i-1]) % mod. Both operands are input context positions (the current token and the prior token). The target never enters the hash key. Legal.

### Pre-Quant TTT (ILLEGAL: multi-epoch AdamW on val_tokens)

NOT PRESENT. There are no gradient operations (.backward(), optimizer.step()) inside any eval function. eval_val and eval_val_sliding both run fully under torch.inference_mode().

### Score-First TTT / HedgeMixer Online Update

The HedgeMixer performs online Hedge-algorithm mixing at eval time. In mix_and_score (lines 46-78):

1. Lines 70-71: mixed_nll is computed using the current log_weights (a frozen snapshot for this batch).
2. Line 77: self.log_weights is updated only after mixed_nll has already been computed.
3. Lines 406-420 vs 421-424: the scored NLL is accumulated in the first loop; mixer.update() (the n-gram count update) is called in a...

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.
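For readers unfamiliar with the pattern being audited here, the sketch below shows the score-first discipline in stripped-down form: each batch is scored against a frozen snapshot of the mixing weights, and only afterwards do the weights adapt. The class and variable names are illustrative; they mirror the review's description rather than the PR's actual mix_and_score code.

```python
import torch

class HedgeMixerSketch:
    """Minimal score-first Hedge mixing over K experts (illustrative only)."""

    def __init__(self, n_experts: int, eta: float = 0.1):
        self.log_weights = torch.zeros(n_experts)  # uniform prior over experts
        self.eta = eta

    @torch.inference_mode()
    def mix_and_score(self, expert_logprobs: torch.Tensor, targets: torch.Tensor):
        # expert_logprobs: (K, B, V) per-expert log-probs; targets: (B,) token ids
        w = torch.softmax(self.log_weights, dim=0)           # frozen snapshot
        mixed = torch.logsumexp(
            expert_logprobs + torch.log(w)[:, None, None], dim=0
        )                                                     # (B, V) mixture log-probs
        mixed_nll = -mixed.gather(1, targets[:, None]).squeeze(1)

        # Only AFTER the batch has been scored do the weights adapt
        # (multiplicative-weights / Hedge update on per-expert loss).
        k = expert_logprobs.size(0)
        idx = targets[None, :, None].expand(k, -1, 1)
        per_expert_nll = -expert_logprobs.gather(2, idx).squeeze(2).mean(dim=1)
        self.log_weights = self.log_weights - self.eta * per_expert_nll
        return mixed_nll.mean()
```

The legality hinges on the ordering: mixed_nll is read off before self.log_weights changes, so no information about the scored targets leaks into the weights that scored them.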

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

32-day journey: architecture, experiments catalog (what worked / did not),
GPTQ with Hessian error compensation results (3-seed validated),
hedge-variance finding, reproduction config.
@iverbovoy
Author

Research update & summary — 32-day exploration

Wanted to share a retrospective of the work around this PR, in case it's useful for anyone exploring parameter-constrained recurrent architectures or for challenge post-mortems. The submission holds at a 3-seed mean val_bpb of 1.1324 (seeds 1337/42/7, sliding 1.1834, roundtrip 1.2168, 15.40 MB).

Architecture recap (shared-block recurrence 3×4)

3 shared transformer blocks × 4 repeats = 12 effective layers, d=880, MLP 3×, 23.7M params.

  • value_embeds (2 tables, per-effective-layer scales) — removing them regresses val_bpb by ~0.07, the largest single contributor in our stack
  • cross_repeat_scales — per-block residual from its own output in the previous repeat; makes stateless weight-sharing stateful (see the sketch after this list)
  • loop_embed — per-effective-layer positional signal
  • Mixed int7 attn / int5 MLP — key compression win (−0.012 from int8 uniform), frees ~2 MB for wider d=880 MLP 3× vs d=832 MLP 2×
  • Progressive depth schedule 0.30:2, 0.50:3, 1.0:4 (2 repeats for the first 30% of training, 3 until 50%, 4 thereafter) — unique to shared-weight recurrence; flat architectures can't ramp depth mid-training
  • XSA last 4 layers, LeakyReLU(0.5)² MLP, Muon WD=0.04, 44 SWA checkpoints
  • 5-expert Hedge Mixer at eval (neural + unigram + bigram + trigram + entropy), adapted from Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745) #688 (@RoyiRa) and Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0222, 3-seed mean) #745 (@stukenov)
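A minimal sketch of the recurrence described in the list above, showing how loop_embed and cross_repeat_scales enter the forward pass. RecurrentStack, block_cls, the zero initializations, and the exact residual wiring are illustrative assumptions, not the submission's code.

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """3 shared blocks unrolled for n_repeats passes = 12 effective layers."""

    def __init__(self, block_cls, d_model=880, n_blocks=3, n_repeats=4):
        super().__init__()
        self.blocks = nn.ModuleList(block_cls(d_model) for _ in range(n_blocks))
        self.n_repeats = n_repeats
        # One learned positional signal per *effective* layer.
        self.loop_embed = nn.Parameter(torch.zeros(n_blocks * n_repeats, d_model))
        # Gate on each block's own output from the previous repeat; this is what
        # makes the otherwise stateless weight-shared stack stateful across repeats.
        self.cross_repeat_scales = nn.Parameter(torch.zeros(n_blocks))

    def forward(self, x):
        prev_out = [torch.zeros_like(x) for _ in self.blocks]
        for r in range(self.n_repeats):
            for b, block in enumerate(self.blocks):
                layer = r * len(self.blocks) + b
                h = x + self.loop_embed[layer] + self.cross_repeat_scales[b] * prev_out[b]
                out = block(h)
                prev_out[b] = out
                x = x + out          # residual stream update
        return x
```

Progressive depth training then amounts to running this loop with n_repeats=2 early in training and raising it to 3 and then 4 at the schedule points, which is only possible because the blocks share weights.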

Evolution across our PRs

| PR | Score | Key addition | Status |
|---|---|---|---|
| #148 | 1.2196 | Depth recurrence 3×4 + cross-repeat skip (novel stateful recurrence) | closed in favor of later PRs |
| #784 | 1.2065 | + XSA(4), LeakyReLU², GPTQ-lite | closed |
| #835 | 1.1980 | + Progressive depth 2→3→4 | closed |
| #856 | 1.1454 | + Hedge Mixer | closed |
| #1384 | 1.1441 | + Clean 3-seed validation | closed in favor of #1453 |
| This PR | 1.1324 | + Int7/Int5 mixed quant, MLP 3×, parallel hedge | open |
| #895 | 1.0889 | 4-hour non-record companion (5 repeats, 132K steps) | open |

GPTQ with Hessian error compensation (new, 3-seed validated)

Added column-wise GPTQ (Frantar et al.) on top of this PR's config: X^T X collected per nn.Linear over 5 training-data calibration batches, Cholesky-based error compensation. ~100 lines added, stays within the 1500-line limit.
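As a rough illustration of what the added ~100 lines do, here is a simplified single-pass version of column-wise GPTQ with Cholesky-based error compensation. gptq_quantize, the single global scale, and the dampening constant are illustrative simplifications; the PR's version (per-nn.Linear, 5 calibration batches) may differ in ordering, blocking, and scale granularity.

```python
import torch

def gptq_quantize(W: torch.Tensor, H: torch.Tensor, levels: int = 63,
                  damp: float = 0.01) -> torch.Tensor:
    """Quantize one nn.Linear weight W (out, in) given H = X^T X (in, in)."""
    W = W.clone().float()
    n = W.shape[1]
    H = H.float() + damp * H.diag().mean() * torch.eye(n)     # dampen for stability
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))   # H^-1
    U = torch.linalg.cholesky(Hinv, upper=True)               # upper Cholesky of H^-1

    qmax = (levels - 1) // 2
    scale = W.abs().amax().clamp(min=1e-8) / qmax              # one global scale (simplified)
    Q = torch.zeros_like(W)
    for j in range(n):                                         # quantize column by column
        w = W[:, j]
        q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
        Q[:, j] = q
        # Hessian-aware error compensation: spread this column's quantization
        # error onto the columns that have not been quantized yet.
        err = (w - q) / U[j, j]
        W[:, j + 1:] -= err[:, None] * U[j, j + 1:][None, :]
    return Q
```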

| Seed | Roundtrip Δ | Sliding Δ | Hedge Δ |
|---|---|---|---|
| 1337 | −0.0034 | −0.0033 | +0.008 |
| 42 | −0.0007 | −0.0008 | −0.0006 |
| 7 | −0.0013 | −0.0013 | +0.023 |
| 3-seed mean | −0.0018 | −0.0018 | +0.010 |

Deterministic metrics (sliding, roundtrip) consistently improve by ~0.002 bpb. The hedge 3-seed mean is 0.010 worse: seed 7 hedge came in at 1.1427 versus the lucky 1.1193 in the original PR #1453 run (see the next section on hedge variance). We are not replacing the submission since scoring is on the hedge mean.

Hedge Mixer variance — a finding worth flagging

Side observation that may be useful to others using Hedge-based eval:

Running a GPTQ_CALIB_BATCHES=0 sanity check (behaviorally identical to the #1453 submission code) on a fresh pod gave:

  • roundtrip matched to 0.0002 bpb
  • sliding matched to 0.0002 bpb
  • hedge drifted by +0.008 bpb

Same model weights, same data, same code — hedge drifts ±0.008 between sessions and ±0.013 between seeds. The bf16 forward numerics + online log_weights updates compound the stochasticity. Any sub-0.01 improvement on hedge mean requires many-seed averaging to separate signal from noise; we consistently saw architectural gains get absorbed by hedge variance.
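A back-of-envelope way to see why many seeds are needed: with a between-seed std of about 0.013, the std of an n-seed mean is 0.013/√n, so resolving a 0.005 bpb effect at roughly 95% confidence takes on the order of 26 seeds. The numbers below are this standard estimate, not a measurement from the PR.

```python
from math import ceil

sigma = 0.013   # between-seed hedge std quoted above
delta = 0.005   # effect size we want to resolve on the mean
z = 1.96        # ~95% confidence

# Require z * sigma / sqrt(n) <= delta  =>  n >= (z * sigma / delta)^2
n = ceil((z * sigma / delta) ** 2)
print(n)        # 26 seeds, versus the 3 used for the submission
```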

Experiments that did NOT improve 3-seed hedge mean

| Technique | Effect | Why we think it failed |
|---|---|---|
| BigramHash 2048×112 | +0.005 | Too few buckets, hash collisions dominate |
| BigramHash 3072×112 | +0.005 | Single-seed looked great (−0.003), but 3-seed mean worse: stabilizes hedge on middle seeds but cuts peak on seed 7 (1.1193 → 1.1444) |
| BigramHash 4096×112 | +0.004 | Past the sweet spot, sparse buckets degrade |
| Noisy QAT (Geiping-style calibrated uniform noise during warmdown) | +0.011 | Int5 MLP noise amplitude too large for our setup; SWA collects pre-QAT checkpoints, so averaging dilutes the effect |
| LoRA rank-2 per-repeat on attn.proj / mlp.proj | +0.013 | Per-repeat signal already saturated by loop_embed + cross_repeat_scales + value_scales; adding weight-level variance on top hurts |
| XSA on all 12 effective layers | worse | Optimum is last-4; early XSA disrupts the residual stream |
| Inter-repeat RMSNorm | worse | Breaks scaling balance |
| EMA (τ=0.997) | +22 ms/step | CPU overhead > benefit at our 600s budget |
| MuonEq-R optimizer | diverged | Incompatible with our Muon config |
| Partial RoPE + VRL + LN-Scale (combined) | +12 ms/step | Too many interacting changes |

Companion PR (still open)

#895 — same architecture, 4-hour non-record track, val_bpb 1.0889. As far as I can see, the only depth-recurrence entry in the 4-hour non-record section.

Bonus: live run-monitoring dashboard

While iterating I built a small fastapi + plotly.js dashboard for watching runs live over SSH — tails the remote log, parses val_bpb / step_avg / peak_mem, renders side-by-side comparisons of two runs. Single-file, no build step. Issue #1455 describes it in case it's useful to others.
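For anyone wanting the gist without opening the issue: the core of such a dashboard is just tailing the log over SSH and regex-extracting the metrics. The sketch below assumes log lines containing key:value pairs like val_bpb:1.1324; the hostname, log path, and exact format are placeholders, and the real dashboard code lives with Issue #1455.

```python
import re
import subprocess

# Hypothetical log-line format; adjust the pattern to the real training log.
METRIC_RE = re.compile(r"(val_bpb|step_avg|peak_mem):\s*([0-9.]+)")

def tail_metrics(host: str, logfile: str):
    """Yield {metric: value} dicts as new lines appear in a remote log."""
    proc = subprocess.Popen(
        ["ssh", host, "tail", "-n", "+1", "-F", logfile],
        stdout=subprocess.PIPE, text=True,
    )
    for line in proc.stdout:
        metrics = {k: float(v) for k, v in METRIC_RE.findall(line)}
        if metrics:
            yield metrics   # e.g. hand these points to the plotting frontend

# for point in tail_metrics("h100-pod", "/workspace/train.log"):
#     print(point)
```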

Full summary in fork

Detailed journey + reproduction config + full experiments table: SUMMARY.md on the submission branch.

Happy to hear feedback on the submission, the hedge-variance observation, or whether the non-record format here is still the right fit. Thanks for running the challenge — the depth-recurrence angle turned out to be a genuinely interesting direction even if it didn't match the flat-layer SOTA.

iverbovoy added a commit to iverbovoy/parameter-golf that referenced this pull request Apr 20, 2026
3 shared blocks with progressive depth (2->3->4->5 repeats, 15 effective layers),
132K steps on 8xH100, 38 SWA checkpoints, Hedge Mixer eval.

Architecture is the same recurrent design as 10-min submission openai#1453 (val_bpb 1.1324).
This PR is the 4-hour companion exploring how shared-weight recurrence scales with
extended compute.

Beats existing non-record 4-hour entries:
- Will DePue 4-hour flat baseline (1.2074): -0.119 better
- Ciprian-Florin Ifrim 2-hour 1-bit (1.1239): -0.035 better
