Record: ImprovedParallelResiduals, 1.0758 BPB / 2.7789 nats, -0.0020 BPB / -0.0052 nats vs PR #1523 by msisovic · Pull Request #1529 · openai/parameter-golf

msisovic · 2026-04-11T00:04:56Z

Record: Improved Parallel Residuals

val_bpb: 1.07578747 (3-seed mean, std 0.0007) | 2.77887078 nats | ~15.98 MB | 8xH100 SXM, 600s | Legal TTT

This submission starts from PR #1523. Most of the newer submissions moved away from my fuller parallel-residual formulation and settled on a simpler GPT-J-style split-lane decoder. This version keeps the strong parts of that newer baseline and reintroduces the useful parts of my parallel residual implementation.

The key architectural change relative to PR #1523 is in the decoder after the split point. Attention and MLP read from different lanes, but neither sublayer writes back immediately. Instead, both outputs are accumulated into the two lanes together at the end of the block:

next_lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out
next_lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out

That keeps the GPT-J-style parallel-in-time update, while restoring the richer learned routing between the attention and MLP lanes. The other important part is that decoder U-Net skips are still written only into lane0, which preserves the cheaper and more stable skip path from the newer baseline. Attention reads the mixed lane0/x0 path, while MLP reads raw lane1. Final output uses the mean of the two lanes.

In practice, that is pretty much the only modeling change here versus PR #1523, together with moving PARALLEL_RESIDUAL_START from the baseline's 7 to 8. I ablated that start-layer change separately on top of the plain PR #1523 baseline, without my fuller parallel residual routing changes, and it gave a mild regression on its own. The other notable requirement is that I needed the CUTLASS EVT path to recover the full throughput. In this iteration the CUDA/C++ source is inlined into the training script itself and built against a standard /opt/cutlass checkout rather than shipping a separate prebuilt .so.

Results (8xH100 80GB SXM, 600s)

Seed	Steps	ms/step	Post-EMA BPB	Legal TTT BPB	val_loss (nats)	Artifact
1337	4,655	126.13	1.0830	1.0751	2.7770	15,983,095
2024	4,689	125.20	1.0843	1.0765	2.7806	15,987,382
42	4,696	125.04	1.0837	1.0759	2.7790	15,982,563
Mean	4680.00	125.46	1.0837	1.07578747	2.77887078	15984347

Reproducibility

pip install brotli sentencepiece
git clone https://github.com/NVIDIA/cutlass.git /opt/cutlass
cd /opt/cutlass
git checkout 08185b9c3e90510ee2b656662ed0d53b06d28157
cd -
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
for SEED in 1337 2024 42; do
    SEED=$SEED TTT_ENABLED=1 HASH_EMBED_ENABLED=1 TTT_LR=0.01 MUON_MOMENTUM=0.97 PARALLEL_RESIDUAL_START=8 GPTQ_RESERVE_SECONDS=13 \
    torchrun --standalone --nproc_per_node=8 train_gpt.py
done

mikeapedia · 2026-04-11T01:48:09Z

Awesome submission! I was just looking through your seed1337 log and I noticed your gptq_reserve_seconds is set to 12.0 in row 21 but the hessian collection took 12.6 seconds (row 204). So you may want to bump the reserve seconds up to 13 just to be safe. I didn't look through the other logs to see if it happened in all 3 runs.

msisovic · 2026-04-11T10:23:46Z

Awesome submission! I was just looking through your seed1337 log and I noticed your gptq_reserve_seconds is set to 12.0 in row 21 but the hessian collection took 12.6 seconds (row 204). So you may want to bump the reserve seconds up to 13 just to be safe. I didn't look through the other logs to see if it happened in all 3 runs.

Awesome, thanks for noticing! I reran it with 13s reserved, and as expected it didn't noticeably change the score. However, I have noticed that I have on accident ran all three runs with seed 1337, so I have corrected that as well, which was a bit of a hit on the score, but it still clears the bar.

Casefold v2 vocabulary on PR openai#1529 parallel residuals architecture. Eliminates case-duplicate tokens (21.1% of SP8192 vocab), refills with BPB-optimized subwords for 10.38% better compression. Byte counting verified correct on 15.4M FineWeb docs (0 mismatches).

…1.0752 Systems-level optimizations (fused Muon, EMA foreach, loader prealloc) on PR openai#1529's dual-lane parallel residual architecture. Identical ML; faster step time yields extra training steps. 3-seed mean: 1.0752 BPB / 2.7773 nats. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…0639 Systems-level optimizations (fused Muon, EMA foreach, loader prealloc) on PR openai#1578's casefold tokenizer + PR openai#1529's parallel residuals. Identical ML; faster step time yields extra training steps. 3-seed mean: 1.0639 BPB / 3.0705 nats. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

msisovic · 2026-04-13T14:28:24Z

I've inlined the custom cutlass kernel from the base PR into my train_gpt script to avoid any compliance issues. The core logic remains unchanged, to keep it fair to submissions that came in the meantime, the results are even a bit worse with the new run, probably down to GPU cluster variance.

simonbissonnette · 2026-04-13T14:30:37Z

I've inlined the custom cutlass kernel from the base PR into my train_gpt script to avoid any compliance issues. The core logic remains unchanged, to keep it fair to submissions that came in the meantime, the results are even a bit worse with the new run, probably down to GPU cluster variance.

I hope it helps to get it approved. Good luck !

First lever layered on the new openai#1736 baseline. Hadamard rotation of weight matrices before GPTQ quantization, hotstarted off spec 008's pre_gptq.pt FP checkpoint. No retraining. Witnessed at claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on a openai#1529-adjacent base; expected to compose cleanly with openai#1736 since the quant stage is orthogonal to CaseOps / attention gates / phased TTT. Rotation is a post-training transform with three classes (residual-stream, per-layer attn, per-layer MLP); FP forward pass is invariant by construction, only quantization error drops. Cost ~$6 (hotstart off spec 008 checkpoint), vs ~$30 for a full retrain. Same hotstart checkpoint reused by future quant experiments (per-group bit, AR-selfgen calib, AWQ). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Co-authored-by: Codex <[email protected]>

* Update parameter golf leaderboard with BOS fix Co-authored-by: Codex <[email protected]> * Credit PR 1797 in leaderboard update Co-authored-by: Codex <[email protected]> * Credit CaseOps and PR 1787 leaderboard rows Co-authored-by: Codex <[email protected]> * Apply p-value progression leaderboard cutoff Co-authored-by: Codex <[email protected]> * Address leaderboard review comments Co-authored-by: Codex <[email protected]> * Clarify BOS fix leaderboard evidence Co-authored-by: Codex <[email protected]> * Shorten leaderboard p-value notes Co-authored-by: Codex <[email protected]> * Remove non-frontier leaderboard rows Co-authored-by: Codex <[email protected]> * Clarify SmearGate BOS fix attribution Co-authored-by: Codex <[email protected]> * Exclude #1518 from chronological frontier Co-authored-by: Codex <[email protected]> * Use submitted #1855 score Co-authored-by: Codex <[email protected]> * Restore #1529 chronological frontier Co-authored-by: Codex <[email protected]> * Restore #1529 chronological frontier Co-authored-by: Codex <[email protected]> --------- Co-authored-by: Codex <[email protected]>

cocohearts

This submission is accepted on substance, but please do a format-only cleanup before we merge it. The record directory is currently records/track_10min_16mb/2026-04-11_ImrpovedParallelResiduals; please rename it to fix the typo and use a descriptive standard name such as records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT_LegalTTT. Please also remove train_gpt_human.py unless it is required for the submitted artifact; the mergeable record package should be the scored train_gpt.py, README.md, submission.json, requirements if needed, and the seed logs.

cocohearts

This submission is accepted on substance, but please do a format-only cleanup before we merge it. The record directory is currently records/track_10min_16mb/2026-04-11_ImrpovedParallelResiduals; please rename it to fix the typo and use a descriptive standard name such as records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT_LegalTTT. Please also remove train_gpt_human.py unless it is required for the submitted artifact; the mergeable record package should be the scored train_gpt.py, README.md, submission.json, requirements if needed, and the seed logs.

Duplicate requested-changes review submitted by automation; keeping the earlier requested-changes review active.

msisovic · 2026-04-29T20:38:15Z

This submission is accepted on substance, but please do a format-only cleanup before we merge it. The record directory is currently records/track_10min_16mb/2026-04-11_ImrpovedParallelResiduals; please rename it to fix the typo and use a descriptive standard name such as records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT_LegalTTT. Please also remove train_gpt_human.py unless it is required for the submitted artifact; the mergeable record package should be the scored train_gpt.py, README.md, submission.json, requirements if needed, and the seed logs.

Thanks for the review, will address this shortly.

* Update parameter golf leaderboard with BOS fix Co-authored-by: Codex <[email protected]> * Credit PR 1797 in leaderboard update Co-authored-by: Codex <[email protected]> * Credit CaseOps and PR 1787 leaderboard rows Co-authored-by: Codex <[email protected]> * Apply p-value progression leaderboard cutoff Co-authored-by: Codex <[email protected]> * Address leaderboard review comments Co-authored-by: Codex <[email protected]> * Clarify BOS fix leaderboard evidence Co-authored-by: Codex <[email protected]> * Shorten leaderboard p-value notes Co-authored-by: Codex <[email protected]> * Remove non-frontier leaderboard rows Co-authored-by: Codex <[email protected]> * Clarify SmearGate BOS fix attribution Co-authored-by: Codex <[email protected]> * Exclude openai#1518 from chronological frontier Co-authored-by: Codex <[email protected]> * Use submitted openai#1855 score Co-authored-by: Codex <[email protected]> * Restore openai#1529 chronological frontier Co-authored-by: Codex <[email protected]> * Restore openai#1529 chronological frontier Co-authored-by: Codex <[email protected]> --------- Co-authored-by: Codex <[email protected]>

msisovic · 2026-04-30T01:57:43Z

@cocohearts Comment addressed, should be ready for merge now.

msisovic added 2 commits April 11, 2026 02:02

Add parallel residual CUTLASS EVT record

1ac984f

Rename record title to Parallel Residuals

5f7f3fc

msisovic changed the title ~~Record: ParallelResiduals, 1.0744 BPB / 2.7752 nats vs PR #1523~~ Record: ParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523 Apr 11, 2026

msisovic changed the title ~~Record: ParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523~~ Record: ImprovedParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523 Apr 11, 2026

msisovic added 2 commits April 11, 2026 02:07

title change

652eedb

Rename and clean improved parallel residuals record

5baba3f

Update improved parallel residuals record metrics

f93838b

msisovic changed the title ~~Record: ImprovedParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523~~ Record: ParallelResiduals, 1.0753 BPB / 2.7777 nats, -0.0025 BPB / -0.0064 nats vs PR #1523 Apr 11, 2026

Update improved parallel residuals seed logs

07a67d4

bigbag mentioned this pull request Apr 11, 2026

Record: SP8192 + Improved Parallel Residuals + Muon 0.97 + LR 0.03 + Legal TTT — val_bpb 1.07785 (3-seed mean) #1541

Open

msisovic added 2 commits April 11, 2026 16:23

Clarify CUTLASS EVT verification path in record README

a1d3303

Clarify CUTLASS artifact accounting

423eb74

samacqua mentioned this pull request Apr 11, 2026

Record: Varlen attention + fused MLP + doc-independent TTT (1.07336) #1530

Merged

mikeapedia mentioned this pull request Apr 13, 2026

Record: Custom Casefold Tokenizer — 1.0668 BPB #1578

Open

5 tasks

msisovic added 2 commits April 13, 2026 12:06

Inlined custom kernel into train_gpt

bf6d081

Update parallel residuals record logs

b98dae2

msisovic changed the title ~~Record: ImprovedParallelResiduals, 1.0753 BPB / 2.7777 nats, -0.0025 BPB / -0.0064 nats vs PR #1523~~ Record: ImprovedParallelResiduals, 1.0758 BPB / 2.7789 nats, -0.0020 BPB / -0.0052 nats vs PR #1523 Apr 13, 2026

mikeapedia mentioned this pull request Apr 15, 2026

Non-record: xIELU Piecewise Quadratic Activation + Per-Layer QK Gain Convergence #1648

Open

6 tasks

X-Abhishek-X mentioned this pull request Apr 17, 2026

[Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT — val_bpb 1.0759 #1695

Open

msisovic mentioned this pull request Apr 28, 2026

Update Parameter Golf leaderboard #1900

Open

codemath3000 mentioned this pull request Apr 28, 2026

Update Parameter Golf leaderboard with BOS fix #1902

Merged

cocohearts added a commit that referenced this pull request Apr 29, 2026

Restore #1529 chronological frontier

53ee400

Co-authored-by: Codex <[email protected]>

cocohearts added a commit that referenced this pull request Apr 29, 2026

Restore #1529 chronological frontier

ff13842

Co-authored-by: Codex <[email protected]>

cocohearts requested changes Apr 29, 2026

View reviewed changes

cocohearts previously requested changes Apr 29, 2026

View reviewed changes

Clean up improved parallel residuals record

26acc05

msisovic requested a review from cocohearts April 30, 2026 01:57

MaxIv25 mentioned this pull request May 1, 2026

Non-record: MoE Upcycling + Depth Recurrence — Quantization Gap Analysis #2102

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: ImprovedParallelResiduals, 1.0758 BPB / 2.7789 nats, -0.0020 BPB / -0.0052 nats vs PR #1523#1529

Record: ImprovedParallelResiduals, 1.0758 BPB / 2.7789 nats, -0.0020 BPB / -0.0052 nats vs PR #1523#1529
msisovic wants to merge 12 commits intoopenai:mainfrom
msisovic:record-parallel-residuals-cutlass-evt

msisovic commented Apr 11, 2026 •

edited

Loading

Uh oh!

mikeapedia commented Apr 11, 2026

Uh oh!

msisovic commented Apr 11, 2026

Uh oh!

msisovic commented Apr 13, 2026

Uh oh!

simonbissonnette commented Apr 13, 2026

Uh oh!

cocohearts left a comment

Uh oh!

cocohearts left a comment

Uh oh!

msisovic commented Apr 29, 2026

Uh oh!

msisovic commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

msisovic commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Record: Improved Parallel Residuals

Results (8xH100 80GB SXM, 600s)

Reproducibility

Uh oh!

mikeapedia commented Apr 11, 2026

Uh oh!

msisovic commented Apr 11, 2026

Uh oh!

msisovic commented Apr 13, 2026

Uh oh!

simonbissonnette commented Apr 13, 2026

Uh oh!

cocohearts left a comment

Choose a reason for hiding this comment

Uh oh!

cocohearts left a comment

Choose a reason for hiding this comment

Uh oh!

msisovic commented Apr 29, 2026

Uh oh!

msisovic commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

msisovic commented Apr 11, 2026 •

edited

Loading