Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA#1204

Merged
cocohearts merged 20 commits into openai:main from msisovic:hyperconnections_submission
Apr 9, 2026

Conversation

@msisovic
Contributor

msisovic commented Apr 1, 2026

Record: Parallel Residuals + Mini Depth Recurrence

val_bpb: 1.1063 (3-seed mean, std 0.0017) | 1.8679 nats | ~15.94 MB | 8×H100 SXM, 600s | No TTT

I started this submission from PR #1179, which gave me the base training stack I wanted to iterate on here. On top of that, I ported over the mixed-quantization and autoregressive GPTQ path from PR #1105. That was partly a modeling choice and partly a practical one: AR self-generated GPTQ calibration was already a known acceptable path for this challenge, and it let me avoid having the quantization step depend on last-minute training-data access in a way that makes the 10-minute budget awkward to manage.

Results (8×H100 80GB SXM, 600s, no TTT)

| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|-------------|-----------------|-------------------|
| 1337 | 6,242 | 96.1 | 1.1232 | 1.1066 | 1.8684 | 15,942,395 |
| 42 | 6,248 | 96.0 | 1.1235 | 1.1077 | 1.8704 | 15,919,617 |
| 2024 | 6,240 | 96.2 | 1.1216 | 1.1044 | 1.8648 | 15,946,657 |
| Mean | 6,243 | 96.1 | 1.1228 | 1.1063 | 1.8679 | 15,936,223 |

Comparison baseline PR #1179: 1.11053346 BPB (1.87508426 nats).
This run's exact 3-seed mean: 1.10625353 BPB (1.86785780 nats).
Delta vs PR #1179: -0.00722646 nats (-0.00427993 BPB).

Current merged SOTA (2026-03-25 AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112): 1.11473509 BPB (1.88217853 nats).
Delta vs current merged SOTA: -0.01432073 nats (-0.00848156 BPB).
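The deltas follow directly from the reported means; a quick pure-Python check (numbers copied verbatim from the figures above):

```python
# Sanity-check the reported deltas (all numbers copied from this PR).
ours_bpb, ours_nats = 1.10625353, 1.86785780
pr1179_bpb, pr1179_nats = 1.11053346, 1.87508426
sota_bpb, sota_nats = 1.11473509, 1.88217853

assert round(ours_nats - pr1179_nats, 8) == -0.00722646
assert round(ours_bpb - pr1179_bpb, 8) == -0.00427993
assert round(ours_nats - sota_nats, 8) == -0.01432073
assert round(ours_bpb - sota_bpb, 8) == -0.00848156

# nats/BPB should be a single conversion factor shared by all entries:
assert abs(ours_nats / ours_bpb - pr1179_nats / pr1179_bpb) < 1e-3
```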

Parallel residuals

I took this idea from my modded-nanogpt record in KellerJordan/modded-nanogpt PR #230 and adapted it to this codebase.

Chronologically, this change actually came last. I am putting it first here because it ended up being the single biggest gain on top of the base + mini-depth-recurrence stack: relative to the under-budget mini-DR baseline (1.8705 val loss / 1.1078 BPB in sliding-window eval), it improved things by roughly another 0.0037 nats and 0.0022 BPB, landing around 1.8668 / 1.1056. But this is still a one-sample observation, so I do not want to overstate the precision of that delta.

Starting from layer 7, attention and MLP read from different residual lanes, and each sublayer learns how strongly to write back into both lanes.
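As a sketch of what this means in code (illustrative only: the attention module, MLP shape, and the 2×2 `write` matrix are my stand-ins for the submission's actual names, e.g. parallel_post_lambdas):

```python
# Illustrative sketch of the dual-lane residual, NOT the submission's code.
import torch
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Learned write strengths: rows = source sublayer, cols = target lane,
        # corresponding to attn_to_attn / attn_to_mlp / mlp_to_attn / mlp_to_mlp.
        self.write = nn.Parameter(torch.eye(2))

    def forward(self, lane_attn, lane_mlp):
        a, _ = self.attn(lane_attn, lane_attn, lane_attn, need_weights=False)
        m = self.mlp(lane_mlp)
        # Each sublayer reads its own lane but writes into both lanes.
        new_attn = lane_attn + self.write[0, 0] * a + self.write[1, 0] * m
        new_mlp = lane_mlp + self.write[0, 1] * a + self.write[1, 1] * m
        return new_attn, new_mlp

x = torch.randn(2, 16, 64)          # (batch, seq, dim)
blk = ParallelResidualBlock(64)
la, lm = blk(x.clone(), x.clone())  # both lanes start from the same stream
```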

One interesting pattern is that the learned routing is quite asymmetric, which is also what I saw in the modded-nanogpt run: MLP barely writes back into attention's residual stream, especially in the deeper partitioned layers.

| Virtual layer | Physical layer | attn_to_attn | attn_to_mlp | mlp_to_attn | mlp_to_mlp |
|---------------|----------------|--------------|-------------|-------------|------------|
| 9 | 7 | 1.3030 | 0.8484 | 0.3851 | 1.3043 |
| 10 | 8 | 2.0972 | 0.8114 | 0.0557 | 1.7884 |
| 11 | 9 | 0.4523 | 0.9251 | 0.0098 | 0.2692 |
| 12 | 10 | 1.0153 | -0.0160 | 0.0844 | 0.0844 |

Despite that pattern, I also tried the follow-up optimization from modded-nanogpt PR #241, where the MLP simply does not write to the attention lane at all, in order to get a speedup. In this repo that brought a slight regression, so I kept the original parallel-residual formulation instead.

Mini Depth Recurrence

Note: Most of the recurrence sweeps under this section were run on an older baseline, and I later transferred the final recipe over to the newer baseline used for this submission.

After some early failed attempts at full recurrence, I backed off to a much smaller version of the idea: instead of recurring the whole stack, I only repeated a couple of middle layers. I had already convinced myself from over-budget probes that extra depth was real, so the question became how much of that gain I could recover with minimal weight sharing.

The main sweeps were simple but informative. Repeating one layer helped, repeating two consecutive layers helped more, and repeating three already lost to the step-time penalty. I also swept the position of the repeated pair and found a clear sweet spot at layers 4,5, right around the U-Net hinge point. So the useful regime here was not “add recurrence everywhere” but “reuse a very small part of the middle of the stack.”

The next improvement was to turn recurrence on only mid-training. Since repeated layers slow every step down, I trained the cheaper non-recurrent model first and only activated recurrence later. In the earlier sweep, always-on recurrence reached about 1.1163 BPB post-TTT, while delayed recurrence improved that to about 1.1153, with RECUR_START_STEP=3000 working well.

Finally, because mixed precision left me some parameter budget headroom, I found that the best place to spend it was untying the repeated MLPs while leaving the rest of the recurrent block shared. That gave another small but real improvement. Roughly speaking, mini depth recurrence was worth about 0.003-0.004 nats and 0.002-0.003 BPB over the best under-budget non-recurrent depth probe I had at the time.
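Putting the recipe together, a minimal sketch (the layer internals are toy stand-ins; only the schedule mirrors the submission's RECUR_LAYERS=4,5, RECUR_START_STEP=3000, and untied-MLP settings):

```python
# Minimal sketch of mini depth recurrence with delayed activation and
# untied second-pass weights; NOT the submission's actual code.
import copy
import torch
import torch.nn as nn

RECUR_LAYERS = (4, 5)
RECUR_START_STEP = 3000

class TinyStack(nn.Module):
    def __init__(self, dim=32, n_layers=11):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU())
             for _ in range(n_layers)])
        # Untied copies used only on the second pass over the repeated pair;
        # in the real model only the MLPs are untied, attention stays shared.
        self.recur_untied = nn.ModuleList(
            [copy.deepcopy(self.layers[i]) for i in RECUR_LAYERS])

    def forward(self, x, step):
        for i, layer in enumerate(self.layers):
            x = x + layer(x)
            # After the second repeated layer, run the pair again, but only
            # once recurrence has been switched on mid-training.
            if step >= RECUR_START_STEP and i == RECUR_LAYERS[-1]:
                for untied in self.recur_untied:
                    x = x + untied(x)
        return x

x = torch.randn(2, 8, 32)
model = TinyStack()
y_pre, y_post = model(x, step=0), model(x, step=RECUR_START_STEP)
```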

Reproducibility

The main training runs for this submission used the following command:

SEED=$SEED POST_GPTQ_EVAL_ONLY=0 BIGRAM_DIM=112 MIXED_QUANT=1 N_INT6_LAYERS=32 NUM_LAYERS=11 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 REPEAT_UNTIE_MLP=full REPEAT_UNTIE_MLP_LAYERS=4,5 DISABLE_LAYER0_ATTN=1 PARALLEL_RESIDUAL=1 PARALLEL_START_LAYER=7 torchrun --standalone --nproc_per_node=8 train_gpt.py

brotli also needs to be installed for the final artifact path. It is included in the copied requirements.txt.

PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 1, 2026
Architectural innovations from PR openai#1204 (1.1063 BPB record):
- QK_GAIN_INIT=4.0 (from PR openai#1125 sweep, -0.006 BPB)
- Parallel Residuals: dual-lane from physical layer 7+
  - Attn reads lane0, MLP reads lane1, learned cross-lane writes
  - parallel_post_lambdas [N,2,2], parallel_resid_lambdas [N,2]
- Mini Depth Recurrence: repeat layers 4,5 between encoder/decoder
  - Delayed activation at step 3000 (avoids disrupting early training)
  - Tied MLP weights (no extra params, keeps model within 16MB)
- Bigram dim reduced 128->112 for budget headroom
- Refactored forward into _run_backbone() for DRY encoder/decoder/parallel
@msisovic
Contributor Author

msisovic commented Apr 1, 2026

PS: One additional thing worth noting: to save a bit more space for untying the MLPs in recurrence, apart from mixed quant, I also dropped layer-0 attention. This was inspired by looking at the learned weights with which each sublayer writes back into the residual stream. Layer-0 attention's weight was about an order of magnitude lower than the rest, and dropping it caused only a very minor degradation while saving parameters.
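That kind of diagnostic can be sketched with forward hooks (toy model; the module names are mine, not the submission's):

```python
# Toy illustration of measuring how strongly each attention sublayer writes
# into the residual stream; a sublayer writing far less than its peers is a
# cheap pruning candidate. Model and names are illustrative stand-ins.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, dim)  # stand-in for an attention sublayer
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):
        x = x + self.attn(x)
        return x + self.mlp(x)

model = nn.Sequential(*[Block(32) for _ in range(4)])
write_norms = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Mean L2 norm of what this sublayer adds to the residual stream.
        write_norms[name] = output.norm(dim=-1).mean().item()
    return hook

for i, blk in enumerate(model):
    blk.attn.register_forward_hook(make_hook(f"layer{i}.attn"))

with torch.no_grad():
    model(torch.randn(8, 16, 32))
print(sorted(write_norms.items()))
```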

@valerio-oai
Contributor

Hi! This looks potentially interesting: could you add train_gpt.py to make this reviewable?

@msisovic
Contributor Author

msisovic commented Apr 2, 2026

@valerio-oai Hi, thanks, glad you found it interesting! I don't know how I missed that; train_gpt.py is now included, and you can verify it is identical to the code prepended to the log files.

Edit: The base I was working from was a minified training script designed to save space, so the code doesn't have much room to breathe (multiple statements are merged onto single lines), but I don't think it's too unreadable. LMK if this is a problem.

gptq:collecting hessians from autoregressive data...
gptq:collected hessians for 70 layers (AR self-gen)
gptq:done in 241.8s
wallclock:post_gptq total_elapsed:858.2s train_budget:600.0s
Contributor Author


Note for reviewers: this is a leftover log from when GPTQ calibration was done on training data, so I had to double-check that everything up to this point fit within the training budget. I have since switched to AR generation of the calibration sequences, so this log is no longer relevant.

@nestamidavaine

nestamidavaine commented Apr 2, 2026

@msisovic Hi, this is a very nice approach. I came to a similar conclusion: the step-time penalty is too high for a large number of recurrence passes, but the capacity increase is there. So I also grow the number of recurrence passes over time in #1231. I actually found that, with the stabilizations I added to reduce error build-up when training a quantized model, TTT can be quite effective; I added a regularization on the magnitudes of the hidden states of the recurring block. Maybe our approaches can be combined in some way.

Maybe with only the regularization added, you could do 3 or 4 recurrence passes.

ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Apr 14, 2026
Porting the full merged SOTA stack from bigbag/parameter-golf PR openai#1493:
- SP8192 tokenizer (replaces SP1024)
- 3-layer depth recurrence (L3-5, activate at 0.35 × iter)
- Parallel residuals (GPT-J style) on L>=7
- QK-Gain 5.0 (default) / 5.25 (SOTA config)
- Score-first TTT: SGD lr=0.005, momentum=0.9, 3 epochs
- GPTQ SDClip: int6 matrices (k=12.85), int8 embeddings (k=20.0)
- LZMA+b85 code wrapper pattern
- Tuned hyperparams: WD=0.095, MLR=0.022, EMA=0.9965, warmdown=0.72

This is the clean, legal, compliant baseline. All 4 Issue openai#1017 conditions
satisfied. Next: validate reproduction on 3 seeds, then add VarLen attention.

Source: records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/
from upstream/main, decompressed from the lzma+b85 wrapper.

Credits: @bigbag (PR openai#1493), @clarkkev (PR openai#1394), @dexhunter (PR openai#1413),
         @abaybektursun (PR openai#549), @Robby955 (PR openai#1412), @msisovic (PR openai#1204)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request Apr 17, 2026
Change Block.forward() so attention and MLP read the same pre-residual input
and sum into one residual update (GPT-J style), instead of the sequential
form where MLP reads the post-attention state.

  Before: z1 = x + attn(x);  z2 = z1 + mlp(z1)
  After:  z2 = x + attn(x) + mlp(x)

In DEQ terms this replaces f(z) = attn(z) + mlp(attn(z) + z) with
f(z) = attn(z) + mlp(z). The parallel form has a more isotropic Jacobian
(no sequential composition of the two branches) and is typically a tighter
contraction for the solver, which is what we want given the baseline's
deq_iter_conv_rel degradation over training.

RevDEQ reversibility is preserved: the residual update is still a pure linear
combination z_next = (1-gg)*z_in + gg*z2, and the fp64-accumulated backward
that reverses it is structurally unchanged. CPU forward+backward passes a
finite-grad sanity check.

Also updates ortho_aux() so the mu_mlp diagnostic reads x (not z1),
keeping it aligned with forward().

Reference: records/2026-04-08_SP8192_ParallelResid_ScoreFirstTTT (PR openai#1412
@Robby955), PR openai#1204 @msisovic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
potatonyliu added a commit to potatonyliu/parameter-golf that referenced this pull request Apr 30, 2026
…ting

The previous d4a2208 fix (bundle local modules/) made the submission
runnable but introduced a byte-accounting hole: the harness reported
code_bytes=104,676 (just train_gpt.py) while the artifact actually
shipped 146,844 bytes of code (train_gpt.py + 3,674-byte
modules/bitlinear.py + 38,494-byte modules/trigram_side_memory.py). The
upstream rule is "All counted code should live in the train_gpt.py
script."

This commit makes the submission single-file and re-aligns the numbers:

  - Inline BitLinear + pack_ternary + unpack_ternary into train_gpt.py
    near line 705 (the original BitLinear import site). All four import
    sites that previously read `from modules.bitlinear import ...`
    (lines 502, 575, 705, 2178) now resolve in-module.

  - Delete modules/trigram_side_memory.py entirely. It was 38,494 bytes
    of dead code under default TRIGRAM_SIDE_MEMORY=0; all three lazy
    imports inside train_gpt.py were behind conditional guards that
    don't fire under this submission's config. The three import sites
    (lines 1482, 1519, 2000) now raise NotImplementedError so a future
    reviewer who flips the flag gets a clear error instead of a silent
    ModuleNotFoundError.

  - env.sh switches back to the repo-root run convention (no more
    DATA_PATH=../../../...). DATA_PATH/TOKENIZER_PATH defaults already
    resolve from repo root, matching every other records-folder
    submission.

  - README "Command", "Key metrics", "Files" sections rewritten:
      code_bytes:   104,676 -> 106,722  (single-file train_gpt.py)
      payload:      11,969,746          (unchanged; same trained checkpoint)
      total:        12,074,422 -> 12,076,468  (~12.08 MB)
      headroom:     ~3.92 MB under 16 MB cap

  - README "Comparison" + submission.json comparison_baseline now
    explicitly call out the naive records-track baseline (1.2244) and
    note +0.076 BPB worse despite ~9x compute. Previous wording called
    1.1063 "the records baseline" which was wrong (it's a mid-tier
    PR openai#1204 entry).

  - submission.json adds payload_bytes alongside code_bytes/bytes_total
    so the accounting is self-explanatory; notes field acknowledges the
    train_seed1337.log carries the pre-cleanup numbers from the
    original three-file run while the shipped artifact is single-file.

Trained model checkpoint unchanged — only code-side accounting moved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
