
Record: ImprovedParallelResiduals, 1.0758 BPB / 2.7789 nats, -0.0020 BPB / -0.0052 nats vs PR #1523 (#1529)

Open
msisovic wants to merge 12 commits into openai:main from msisovic:record-parallel-residuals-cutlass-evt

Conversation

@msisovic (Contributor) commented Apr 11, 2026

Record: Improved Parallel Residuals

val_bpb: 1.07578747 (3-seed mean, std 0.0007) | 2.77887078 nats | ~15.98 MB | 8xH100 SXM, 600s | Legal TTT

This submission starts from PR #1523. Most of the newer submissions moved away from my fuller parallel-residual formulation and settled on a simpler GPT-J-style split-lane decoder. This version keeps the strong parts of that newer baseline and reintroduces the useful parts of my parallel residual implementation.

The key architectural change relative to PR #1523 is in the decoder after the split point. Attention and MLP read from different lanes, but neither sublayer writes back immediately. Instead, both outputs are accumulated into the two lanes together at the end of the block:

```
next_lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out
next_lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out
```

That keeps the GPT-J-style parallel-in-time update, while restoring the richer learned routing between the attention and MLP lanes. The other important part is that decoder U-Net skips are still written only into lane0, which preserves the cheaper and more stable skip path from the newer baseline. Attention reads the mixed lane0/x0 path, while MLP reads raw lane1. Final output uses the mean of the two lanes.
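The update above can be sketched as a small function. The parameter names (`attn_resid`, `mlp_resid`, `attn_post`, `mlp_post`) follow the pseudocode, but the function itself is illustrative, not the submitted train_gpt.py; the lane0/x0 mixing is shown as a plain sum and plain floats stand in for tensors:

```python
def dual_lane_block(lane0, lane1, x0, attn, mlp, p):
    # Illustrative sketch of the deferred dual-lane update. In the real
    # model lane0/lane1/x0 are tensors and p holds learned scalars.
    attn_out = attn(lane0 + x0)  # attention reads the mixed lane0/x0 path
    mlp_out = mlp(lane1)         # MLP reads raw lane1
    # Neither sublayer writes back immediately: both outputs are
    # accumulated into the two lanes together at the end of the block.
    next_lane0 = (p["attn_resid"] * lane0
                  + p["attn_post"][0] * attn_out
                  + p["mlp_post"][0] * mlp_out)
    next_lane1 = (p["mlp_resid"] * lane1
                  + p["attn_post"][1] * attn_out
                  + p["mlp_post"][1] * mlp_out)
    return next_lane0, next_lane1

# After the last block, the final output is the mean of the two lanes:
# out = (lane0 + lane1) / 2
```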

In practice, that is essentially the only modeling change versus PR #1523, together with moving PARALLEL_RESIDUAL_START from the baseline's 7 to 8. I ablated that start-layer change separately on top of the plain PR #1523 baseline, without my fuller parallel-residual routing changes, and on its own it gave a mild regression. The other notable requirement is the CUTLASS EVT path, which was needed to recover full throughput. In this iteration the CUDA/C++ source is inlined into the training script itself and built against a standard /opt/cutlass checkout, rather than shipping a separate prebuilt .so.

Results (8xH100 80GB SXM, 600s)

| Seed | Steps | ms/step | Post-EMA BPB | Legal TTT BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|---------------|-----------------|------------------|
| 1337 | 4,655 | 126.13  | 1.0830       | 1.0751        | 2.7770          | 15,983,095       |
| 2024 | 4,689 | 125.20  | 1.0843       | 1.0765        | 2.7806          | 15,987,382       |
| 42   | 4,696 | 125.04  | 1.0837       | 1.0759        | 2.7790          | 15,982,563       |
| Mean | 4,680 | 125.46  | 1.0837       | 1.07578747    | 2.77887078      | 15,984,347       |
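As a sanity check, the 3-seed mean and sample std can be recomputed from the rounded per-seed Legal TTT BPB values in the table (the reported 1.07578747 comes from unrounded values, so it differs in the fifth decimal):

```python
import statistics

legal_ttt_bpb = [1.0751, 1.0765, 1.0759]  # seeds 1337, 2024, 42

mean = statistics.mean(legal_ttt_bpb)   # ~1.0758
std = statistics.stdev(legal_ttt_bpb)   # sample std, matches the reported 0.0007
print(round(mean, 4), round(std, 4))
```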

Reproducibility

```shell
pip install brotli sentencepiece
git clone https://github.com/NVIDIA/cutlass.git /opt/cutlass
cd /opt/cutlass
git checkout 08185b9c3e90510ee2b656662ed0d53b06d28157
cd -
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
for SEED in 1337 2024 42; do
    SEED=$SEED TTT_ENABLED=1 HASH_EMBED_ENABLED=1 TTT_LR=0.01 MUON_MOMENTUM=0.97 PARALLEL_RESIDUAL_START=8 GPTQ_RESERVE_SECONDS=13 \
    torchrun --standalone --nproc_per_node=8 train_gpt.py
done
```

@msisovic msisovic changed the title Record: ParallelResiduals, 1.0744 BPB / 2.7752 nats vs PR #1523 Record: ParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523 Apr 11, 2026
@msisovic msisovic changed the title Record: ParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523 Record: ImprovedParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523 Apr 11, 2026
@mikeapedia

Awesome submission! I was just looking through your seed1337 log and I noticed your gptq_reserve_seconds is set to 12.0 in row 21 but the hessian collection took 12.6 seconds (row 204). So you may want to bump the reserve seconds up to 13 just to be safe. I didn't look through the other logs to see if it happened in all 3 runs.

@msisovic msisovic changed the title Record: ImprovedParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523 Record: ParallelResiduals, 1.0753 BPB / 2.7777 nats, -0.0025 BPB / -0.0064 nats vs PR #1523 Apr 11, 2026
@msisovic
Contributor Author

> Awesome submission! I was just looking through your seed1337 log and I noticed your gptq_reserve_seconds is set to 12.0 in row 21 but the hessian collection took 12.6 seconds (row 204). So you may want to bump the reserve seconds up to 13 just to be safe. I didn't look through the other logs to see if it happened in all 3 runs.

Awesome, thanks for noticing! I reran it with 13 s reserved and, as expected, it didn't noticeably change the score. However, I noticed that I had accidentally run all three runs with seed 1337, so I corrected that as well. That cost a bit of score, but it still clears the bar.

mikeapedia added a commit to mikeapedia/parameter-golf-1 that referenced this pull request Apr 13, 2026
Casefold v2 vocabulary on PR openai#1529 parallel residuals architecture.
Eliminates case-duplicate tokens (21.1% of SP8192 vocab), refills with
BPB-optimized subwords for 10.38% better compression. Byte counting
verified correct on 15.4M FineWeb docs (0 mismatches).
codemath3000 added a commit to codemath3000/parameter-golf that referenced this pull request Apr 13, 2026
…1.0752

Systems-level optimizations (fused Muon, EMA foreach, loader prealloc)
on PR openai#1529's dual-lane parallel residual architecture. Identical ML;
faster step time yields extra training steps. 3-seed mean: 1.0752 BPB
/ 2.7773 nats.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
codemath3000 added a commit to codemath3000/parameter-golf that referenced this pull request Apr 13, 2026
…0639

Systems-level optimizations (fused Muon, EMA foreach, loader prealloc)
on PR openai#1578's casefold tokenizer + PR openai#1529's parallel residuals.
Identical ML; faster step time yields extra training steps. 3-seed mean:
1.0639 BPB / 3.0705 nats.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@msisovic msisovic changed the title Record: ImprovedParallelResiduals, 1.0753 BPB / 2.7777 nats, -0.0025 BPB / -0.0064 nats vs PR #1523 Record: ImprovedParallelResiduals, 1.0758 BPB / 2.7789 nats, -0.0020 BPB / -0.0052 nats vs PR #1523 Apr 13, 2026
@msisovic
Contributor Author

I've inlined the custom CUTLASS kernel from the base PR into my train_gpt script to avoid any compliance issues. The core logic remains unchanged to keep it fair to submissions that came in the meantime; the results are even a bit worse with the new run, probably down to GPU cluster variance.

@simonbissonnette

> I've inlined the custom CUTLASS kernel from the base PR into my train_gpt script to avoid any compliance issues. The core logic remains unchanged to keep it fair to submissions that came in the meantime; the results are even a bit worse with the new run, probably down to GPU cluster variance.

I hope it helps to get it approved. Good luck !

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
First lever layered on the new openai#1736 baseline. Hadamard rotation of
weight matrices before GPTQ quantization, hotstarted off spec 008's
pre_gptq.pt FP checkpoint. No retraining.

Witnessed at claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on a
openai#1529-adjacent base; expected to compose cleanly with openai#1736 since
the quant stage is orthogonal to CaseOps / attention gates / phased
TTT. Rotation is a post-training transform with three classes
(residual-stream, per-layer attn, per-layer MLP); FP forward pass is
invariant by construction, only quantization error drops.

Cost ~$6 (hotstart off spec 008 checkpoint), vs ~$30 for a full
retrain. Same hotstart checkpoint reused by future quant
experiments (per-group bit, AR-selfgen calib, AWQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
cocohearts added a commit that referenced this pull request Apr 29, 2026
cocohearts added a commit that referenced this pull request Apr 29, 2026
cocohearts added a commit that referenced this pull request Apr 29, 2026
* Update parameter golf leaderboard with BOS fix

Co-authored-by: Codex <[email protected]>

* Credit PR 1797 in leaderboard update

Co-authored-by: Codex <[email protected]>

* Credit CaseOps and PR 1787 leaderboard rows

Co-authored-by: Codex <[email protected]>

* Apply p-value progression leaderboard cutoff

Co-authored-by: Codex <[email protected]>

* Address leaderboard review comments

Co-authored-by: Codex <[email protected]>

* Clarify BOS fix leaderboard evidence

Co-authored-by: Codex <[email protected]>

* Shorten leaderboard p-value notes

Co-authored-by: Codex <[email protected]>

* Remove non-frontier leaderboard rows

Co-authored-by: Codex <[email protected]>

* Clarify SmearGate BOS fix attribution

Co-authored-by: Codex <[email protected]>

* Exclude #1518 from chronological frontier

Co-authored-by: Codex <[email protected]>

* Use submitted #1855 score

Co-authored-by: Codex <[email protected]>

* Restore #1529 chronological frontier

Co-authored-by: Codex <[email protected]>

* Restore #1529 chronological frontier

Co-authored-by: Codex <[email protected]>

---------

Co-authored-by: Codex <[email protected]>
Collaborator

@cocohearts cocohearts left a comment


This submission is accepted on substance, but please do a format-only cleanup before we merge it. The record directory is currently records/track_10min_16mb/2026-04-11_ImrpovedParallelResiduals; please rename it to fix the typo and use a descriptive standard name such as records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT_LegalTTT. Please also remove train_gpt_human.py unless it is required for the submitted artifact; the mergeable record package should be the scored train_gpt.py, README.md, submission.json, requirements if needed, and the seed logs.


@cocohearts cocohearts dismissed their stale review April 29, 2026 19:20

Duplicate requested-changes review submitted by automation; keeping the earlier requested-changes review active.

@msisovic
Contributor Author

> This submission is accepted on substance, but please do a format-only cleanup before we merge it. The record directory is currently records/track_10min_16mb/2026-04-11_ImrpovedParallelResiduals; please rename it to fix the typo and use a descriptive standard name such as records/track_10min_16mb/2026-04-11_ImprovedParallelResiduals_CUTLASS_EVT_LegalTTT. Please also remove train_gpt_human.py unless it is required for the submitted artifact; the mergeable record package should be the scored train_gpt.py, README.md, submission.json, requirements if needed, and the seed logs.

Thanks for the review, will address this shortly.

hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
@msisovic msisovic requested a review from cocohearts April 30, 2026 01:57
@msisovic
Contributor Author

@cocohearts Comment addressed, should be ready for merge now.

