
Record: 1.1558 BPB — 11L U-Net + Catalytic + SwiGLU + SW64 #507

Open
skarakulak wants to merge 1 commit into openai:main from skarakulak:submission/pr-11L-unet-catalytic

Conversation

@skarakulak

Summary

  • val_bpb: 1.1558 (sliding window, stride=64)
  • Artifact: 15.1 MB (15,192,709 bytes)
  • 8×H100 SXM, 6,898 steps in 600s (87ms/step)

Techniques

  • 11 transformer layers with gated U-Net skip connections (sigmoid-gated encoder→decoder blending; see the sketch after this list)
  • Catalytic residuals (PR #450, "Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)"): learned per-dim gates on attn/MLP outputs, init=1.0
  • SwiGLU MLP with 3× expansion
  • Value residual (ResFormer): first-layer V blended into all subsequent layers
  • LN scale dampening: 1/√(layer_idx+1) on RMSNorm inputs
  • Decoder LR multiplier (2×) for Muon and Adam
  • Int5/Int6 mixed quantization + zstd-22 compression
  • Sliding window eval (stride=64, seq_len=1024)
  • BigramHash (4096 buckets), partial RoPE (25%), XSA last 4 layers, gated attention
  • EMA (decay=0.9985), Muon (momentum=0.99, WD=0.04)
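
Of the items above, the gated U-Net skips and the catalytic residual gates are the easiest to picture in code. Below is a minimal PyTorch sketch of how such gating is typically wired; it is not the PR's train_gpt.py, and the module names, norm placement, and skip-gate initialization are assumptions.

```python
import torch
import torch.nn as nn

class CatalyticBlock(nn.Module):
    """Transformer block with learned per-dim gates on the attn/MLP outputs."""
    def __init__(self, dim: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)
        self.attn_gate = nn.Parameter(torch.ones(dim))  # catalytic gate, init=1.0
        self.mlp_gate = nn.Parameter(torch.ones(dim))   # catalytic gate, init=1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn_gate * self.attn(self.norm1(x))
        x = x + self.mlp_gate * self.mlp(self.norm2(x))
        return x

class GatedUNetSkip(nn.Module):
    """Sigmoid-gated blend of a saved encoder activation into the matching decoder block."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # sigmoid(0)=0.5; init is an assumption

    def forward(self, decoder_x: torch.Tensor, encoder_x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate)
        return g * decoder_x + (1.0 - g) * encoder_x
```

In an 11-layer stack, early-layer activations would be stashed on the way in and blended back into the corresponding late layers through these gates.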

Results

Seed   val_loss   val_bpb   Steps   ms/step
1337   1.9516     1.1558    6898    87.0

Pre-quant EMA: 1.1606 → Post-quant int5/6+zstd: 1.1723 → Sliding window: 1.1558
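
The sliding-window figure re-scores the quantized model so that nearly every token is predicted with close to a full 1024-token left context (at stride 64, at least 960 tokens). A minimal sketch of this style of eval, assuming a model that maps [1, T] token ids to [1, T, vocab] next-token logits; the function name and interface are illustrative, not the PR's eval_val_sliding_window:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, n_bytes, seq_len=1024, stride=64):
    """Score every target token exactly once, giving it up to seq_len-1 tokens
    of left context by advancing a seq_len window in steps of `stride` and
    counting only the targets that earlier windows have not yet scored."""
    n = tokens.numel()
    nll_sum, prev_end = 0.0, 0
    for begin in range(0, n, stride):
        end = min(begin + seq_len, n)
        window = tokens[begin:end].unsqueeze(0)       # [1, W]
        logits = model(window[:, :-1])                # assumed [1, W-1, vocab] next-token logits
        targets = window[:, 1:].clone()
        n_new = end - max(prev_end, begin + 1)        # targets not covered by earlier windows
        targets[:, :-n_new] = -100                    # mask positions already scored
        nll_sum += F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
            ignore_index=-100, reduction="sum").item()
        prev_end = end
        if end == n:
            break
    return (nll_sum / math.log(2)) / n_bytes          # total bits / byte length of the scored text
```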

Files

  • train_gpt.py — self-contained training + eval script
  • submission.json — structured results
  • train_seed1337.log — full training log

11 layers with gated U-Net skip connections, catalytic residuals,
SwiGLU MLP, value residual, sliding window eval (stride=64).
Int5/Int6 mixed quantization + zstd-22. 15.1MB artifact.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@MatoTeziTanka

Community Review — Record: 1.1558 BPB — 11L U-Net + Catalytic + SwiGLU + SW64

Compliance flag: Pre-Quant TTT violation


PR #507 — 11L U-Net + Catalytic + SwiGLU + SW64

Author: skarakulak
Head SHA: da436e0
Submitted BPB: 1.1558 (sliding window eval)


Check 1: N-gram Family Bug (target token in hash key)

CLEAN. compute_bigram_hash(tokens) at lines 801–809 builds the key as hash(tokens[i], tokens[i-1]) applied to x_batch (the input sequence). At position i the key is (current input token, previous input token) — this conditions on tokens already seen, not on the target. This is the legal BigramHash pattern, identical to the PR #1413 reference, sketched below. No violation.
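
To make the distinction concrete, here is a minimal sketch of the legal key construction described above; the hash mixing and constants are placeholders, not the PR's compute_bigram_hash:

```python
import torch

def bigram_hash_keys(input_tokens: torch.Tensor, n_buckets: int = 4096) -> torch.Tensor:
    """Bucket key at position i depends only on input tokens at i and i-1,
    i.e. tokens the model has already seen, never on the target at i."""
    prev = torch.roll(input_tokens, shifts=1, dims=-1)
    prev[..., 0] = 0                              # no previous token at position 0
    mixed = input_tokens * 1000003 + prev         # placeholder bigram mix
    return mixed % n_buckets

# The illegal variant Check 1 rules out would key on the *target* token
# (the input shifted left, i.e. the token being predicted), leaking the answer.
```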


Check 2: Pre-Quant TTT — multi-epoch AdamW on val_tokens without score-first

VIOLATION. eval_val_ttt_sgd (line 1375) runs 10 epochs of AdamW over the full validation token stream before evaluating. The loop at line 1440 (for epoch in range(args.ttt_sgd_epochs)) trains on all val chunks unconditionally, then calls eval_val_sliding_window on the adapted model. There is no score-first gate — training happens on all chunks including the final scored positions before any score is recorded.

TTT_SGD_ENABLED defaults to "1" (line 199), so this path executes by default.

The submitted BPB (1.1558) comes from final_sliding_window (pre-TTT, plain sliding-window eval), not from final_ttt_sgd (1.1922, which is worse), so the author appears to have used the sliding-window score for the record. However, the code as shipped runs Pre-Quant TTT by default, and the training log confirms it executed (final_ttt_sgd val_bpb: 1.1922). The violation is in the code, not in which metric was reported for the record: the illegal eval path is live and runs by default.
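
Schematically, the flagged structure is: adapt on the whole validation stream first, score afterwards. This is a paraphrase of the description above, not the PR's eval_val_ttt_sgd, and the model/optimizer/scoring interface is assumed:

```python
import torch
import torch.nn.functional as F

def preq_ttt_flow(model, val_chunks, optimizer, epochs, score_fn):
    """Train on every validation chunk for several epochs, then score the adapted model."""
    model.train()
    for _ in range(epochs):                           # 10 AdamW epochs in the submitted code
        for chunk in val_chunks:                      # every val chunk, unconditionally
            logits = model(chunk[:, :-1])             # assumed [1, T-1, vocab] logits
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   chunk[:, 1:].reshape(-1))
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
    return score_fn(model)                            # scoring happens only after adaptation
```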


Check 3: Legal TTT (score-first-per-chunk)

eval_val_ttt_lora implements proper score-first LoRA TTT (scores chunk before training on it). This path is legal. However it is gated behind ttt_enabled AND NOT ttt_sgd_enabled, and since ttt_sgd_enabled=True by default, the LoRA path does not run in the default configuration. Moot for this submission.
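
For contrast, a minimal sketch of the score-first-per-chunk discipline (the PR #1413 pattern as summarized here); the helper and the plain full-parameter update below are illustrative, not the PR's LoRA-based eval_val_ttt_lora:

```python
import torch
import torch.nn.functional as F

def chunk_nll(model, chunk):
    """Summed next-token NLL over one [1, T] chunk of token ids
    (assumes the model returns [1, T-1, vocab] logits)."""
    logits = model(chunk[:, :-1])
    targets = chunk[:, 1:]
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1), reduction="sum")
    return nll, targets.numel()

def score_first_ttt(model, val_chunks, optimizer):
    """Score each chunk with the current weights before adapting on it,
    so no chunk is ever scored after the model has trained on it."""
    total_nll, total_tokens = 0.0, 0
    for chunk in val_chunks:
        model.eval()
        with torch.no_grad():                         # 1) score first
            nll, n = chunk_nll(model, chunk)
        total_nll += nll.item()
        total_tokens += n
        model.train()                                 # 2) then adapt on the scored chunk
        loss, _ = chunk_nll(model, chunk)
        (loss / max(n, 1)).backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    return total_nll / total_tokens                   # mean NLL under score-first discipline
```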

By contrast, eval_val_ttt_sgd (the default path, per Check 2) trains on the full validation token stream for 10 AdamW epochs before any scoring occurs, violating score-first discipline.

Verdict: CLOSE — Pre-Quant TTT violation (10-epoch AdamW on val_tokens before scoring).

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE unless the author disables TTT_SGD_ENABLED and resubmits with the clean sliding-window score, or restructures the eval to score-first-per-chunk (the PR #1413 pattern).


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.

