Non-record: Depth Recurrence + Int7 Mixed Quant — val_bpb 1.1324 (3-seed mean)#1453
iverbovoy wants to merge 3 commits into openai:main
Conversation
3 shared blocks × 4 repeats (12 effective layers), MLP 3× (d=880), int7 attention (63 levels) + int5 MLP (16 levels) mixed quantization, 8-GPU parallel Hedge Mixer eval (164s). Key finding: int7 is the sweet spot for attention quantization — recovers 98% of int8 hedge quality while saving 2MB for a wider model. Improves on PR openai#1384 (1.1441) by −0.012 bpb.
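For context on the eval: Hedge Mixer is an online ensemble over checkpoints. Below is a minimal sketch of a textbook Hedge (multiplicative-weights) mixer; the class name, learning rate, and loss interface are illustrative assumptions, not this repo's API:

```python
import math

class HedgeMixer:
    """Textbook Hedge (multiplicative weights) over K expert models.

    Minimal sketch, not the submission's implementation: each expert
    reports a per-token loss (e.g. -log p of the target), and weights
    shrink exponentially with incurred loss.
    """

    def __init__(self, n_experts: int, eta: float = 0.1):
        self.w = [1.0 / n_experts] * n_experts
        self.eta = eta

    def mix(self, expert_probs: list[float]) -> float:
        # Predict with the weight-averaged probability of the experts.
        return sum(w * p for w, p in zip(self.w, expert_probs))

    def update(self, expert_losses: list[float]) -> None:
        # Hedge update: w_i <- w_i * exp(-eta * loss_i), then renormalize.
        self.w = [w * math.exp(-self.eta * l) for w, l in zip(self.w, expert_losses)]
        z = sum(self.w)
        self.w = [w / z for w in self.w]
```

Because the weights evolve online during evaluation, the final score is path-dependent, which is relevant to the hedge-variance observation discussed further down the thread.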
Community Review — Non-record: Depth Recurrence + Int7 Mixed Quant — val_bpb 1.1324 (3-seed mean)

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

## Summary

PR #1453 implements a Progressive Depth Recurrence model with an Int7 mixed-quantization scheme and a HedgeMixer online ensemble at eval time. The submission is clean.

## Key Checks

### N-gram / Hash Bug (ILLEGAL pattern: target XOR'd into hash key)

NOT PRESENT. The trigram hash key is computed at line 40 (update) and line 62 (scoring) as:
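The expression itself is not preserved in this thread. As a minimal sketch of the legal pattern the check looks for (all names hypothetical), the key must be derived from context tokens only, never the target:

```python
def trigram_key(prev2: int, prev1: int, n_buckets: int = 1 << 20) -> int:
    # LEGAL: the hash key depends only on the two context tokens.
    return (prev2 * 1_000_003 + prev1) % n_buckets

# ILLEGAL variant the review screens for (not present in this PR):
#   key = (prev2 * 1_000_003 + prev1) ^ target   # target leaks into the key
```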
32-day journey: architecture, experiments catalog (what worked / did not), GPTQ with Hessian error compensation results (3-seed validated), hedge-variance finding, reproduction config.
## Research update & summary — 32-day exploration

Wanted to share a retrospective of the work around this PR, in case it is useful for anyone exploring parameter-constrained recurrent architectures or for challenge post-mortems. The submission holds at a 3-seed mean val_bpb of 1.1324 (seeds 1337/42/7, sliding 1.1834, roundtrip 1.2168, 15.40 MB).

### Architecture recap (shared-block recurrence 3×4)

3 shared transformer blocks × 4 repeats = 12 effective layers, d=880, MLP 3×, 23.7M params.
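A minimal PyTorch sketch of that recurrence, assuming a standard pre-norm block; the submission's actual block internals, causal masking, and embedding layers are not shown here:

```python
import torch.nn as nn

class Block(nn.Module):
    # Assumed pre-norm transformer block with a 3x-wide MLP (d=880).
    def __init__(self, d: int = 880, n_heads: int = 8, mlp_ratio: int = 3):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d)
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # mask omitted for brevity
        return x + self.mlp(self.ln2(x))

class RecurrentDepth(nn.Module):
    # 3 shared blocks applied 4 times = 12 effective layers from only
    # 3 blocks' worth of parameters.
    def __init__(self, n_shared: int = 3, n_repeats: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(Block() for _ in range(n_shared))
        self.n_repeats = n_repeats

    def forward(self, x):
        for _ in range(self.n_repeats):
            for block in self.blocks:
                x = block(x)
        return x
```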
### Evolution across our PRs
### GPTQ with Hessian error compensation (new, 3-seed validated)

Added column-wise GPTQ (Frantar et al.) on top of this PR's config; results below.
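For reference, a minimal sketch of the column-wise loop with Hessian error compensation; the per-row symmetric scales, damping constant, and `qmax` are assumptions, since the PR's exact quantizer settings are not shown here:

```python
import torch

def gptq_quantize(W: torch.Tensor, H: torch.Tensor, qmax: int = 15, damp: float = 0.01):
    """Column-wise GPTQ (Frantar et al.), heavily simplified.

    W: (out_features, in_features) weight matrix.
    H: (in, in) Hessian proxy, ~ 2 * X @ X.T over calibration inputs.
    """
    W = W.clone()
    scale = W.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / qmax

    # Dampen and invert the Hessian; the upper Cholesky factor of H^-1
    # supplies the error-propagation coefficients for each column.
    H = H + damp * H.diagonal().mean() * torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
    U = torch.linalg.cholesky(torch.linalg.inv(H)).T

    Q = torch.zeros_like(W)
    for i in range(W.shape[1]):
        w = W[:, i]
        q = torch.clamp(torch.round(w / scale[:, 0]), -qmax, qmax) * scale[:, 0]
        Q[:, i] = q
        # Compensate: spread this column's quantization error over the
        # not-yet-quantized columns, weighted by inverse-Hessian terms.
        err = (w - q) / U[i, i]
        W[:, i + 1:] -= torch.outer(err, U[i, i + 1:])
    return Q
```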
Deterministic metrics (sliding, roundtrip) improve by −0.002 consistently. Hedge 3-seed mean is +0.010 worse — seed 7 hedge came in at 1.1427 versus the lucky 1.1193 in the original PR #1453 run (see the next section on hedge variance). Not replacing the submission, since scoring is on the hedge mean.

### Hedge Mixer variance — a finding worth flagging

Side observation that may be useful to others using Hedge-based eval: running the same eval repeatedly with the same model weights, same data, same code — hedge drifts ±0.008 between sessions and ±0.013 between seeds. The bf16 forward numerics + online ensemble updates make the metric path-dependent, so single-run hedge numbers deserve an error bar of roughly that size.

### Experiments that did NOT improve 3-seed hedge mean
### Companion PR (still open)

#895 — same architecture, 4-hour non-record track, val_bpb 1.0889. As far as I can see, the only depth-recurrence entry in the 4-hour non-record section.

### Bonus: live run-monitoring dashboard

While iterating I built a small dashboard for live run monitoring.

### Full summary in fork

Detailed journey + reproduction config + full experiments table:

Happy to hear feedback on the submission, the hedge-variance observation, or whether the non-record format here is still the right fit. Thanks for running the challenge — the depth-recurrence angle turned out to be a genuinely interesting direction even if it didn't match the flat-layer SOTA.
3 shared blocks with progressive depth (2→3→4→5 repeats, 15 effective layers at the final stage), 132K steps on 8×H100, 38 SWA checkpoints, Hedge Mixer eval. The architecture is the same recurrent design as the 10-min submission openai#1453 (val_bpb 1.1324); this PR is the 4-hour companion exploring how shared-weight recurrence scales with extended compute. A hypothetical sketch of the progressive-depth schedule follows the list below.

Beats existing non-record 4-hour entries:

- Will DePue 4-hour flat baseline (1.2074): −0.119 better
- Ciprian-Florin Ifrim 2-hour 1-bit (1.1239): −0.035 better
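Purely illustrative, since the PR text does not state its stage boundaries: a progressive-depth schedule of that shape could be as simple as the following, with made-up thresholds.

```python
def repeats_at(step: int, total_steps: int = 132_000) -> int:
    # Grow shared-block repeats 2 -> 3 -> 4 -> 5 over training.
    # Stage boundaries here are hypothetical, for illustration only.
    frac = step / total_steps
    if frac < 0.25:
        return 2
    if frac < 0.50:
        return 3
    if frac < 0.75:
        return 4
    return 5  # final stage: 3 blocks x 5 repeats = 15 effective layers
```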
## Summary

## Key Finding
Int7 (63 quantization levels) for attention is the sweet spot between int6 (31 levels) and int8 (127 levels). It recovers 98% of int8's hedge mixer quality while saving ~2MB — enough to widen the model from d=832 with a 2× MLP to d=880 with a 3× MLP.
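A minimal sketch of a symmetric fake-quantizer with the level counts quoted above; the per-tensor scales, exact level bookkeeping, and module-name matching are all assumptions:

```python
import torch

def fake_quant(w: torch.Tensor, levels: int) -> torch.Tensor:
    # Symmetric uniform quantizer with roughly `levels` representable
    # values (qmax integer steps on each side of zero).
    qmax = (levels - 1) // 2
    scale = w.abs().max().clamp_min(1e-12) / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

def apply_mixed_quant_(model: torch.nn.Module) -> None:
    # Mixed scheme per the PR description: int7 (63 levels) for attention
    # weights, int5 (16 levels) for MLP weights. The 'attn'/'mlp' name
    # matching is a guess at the module naming convention.
    for name, p in model.named_parameters():
        if p.dim() != 2:
            continue
        if "attn" in name:
            p.data = fake_quant(p.data, 63)
        elif "mlp" in name:
            p.data = fake_quant(p.data, 16)
```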
## Evolution

## Test plan