
Record: 1.1558 BPB — 11L U-Net + Catalytic + SwiGLU + SW64 #507

Open
skarakulak wants to merge 1 commit into openai:main from skarakulak:submission/pr-11L-unet-catalytic

Conversation

@skarakulak

Summary

  • val_bpb: 1.1558 (sliding window, stride=64)
  • Artifact: 15.1 MB (15,192,709 bytes)
  • 8×H100 SXM, 6,898 steps in 600s (87ms/step)

Techniques

  • 11 transformer layers with gated U-Net skip connections (sigmoid-gated encoder→decoder blending; see the sketch after this list)
  • Catalytic residuals (PR #450, "Record: 12L + Catalytic Residuals + BigramHash(10240) + SWA + Late QAT (val_bpb=1.1466, mean 3 seeds)"): learned per-dim gates on attn/MLP outputs, init=1.0
  • SwiGLU MLP with 3× expansion
  • Value residual (ResFormer): first-layer V blended into all subsequent layers
  • LN scale dampening: 1/√(layer_idx+1) on RMSNorm inputs
  • Decoder LR multiplier (2×) for Muon and Adam
  • Int5/Int6 mixed quantization + zstd-22 compression
  • Sliding window eval (stride=64, seq_len=1024)
  • BigramHash (4096 buckets), partial RoPE (25%), XSA last 4 layers, gated attention
  • EMA (decay=0.9985), Muon (momentum=0.99, WD=0.04)
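
Of the items above, the gated U-Net skips and the catalytic residual gates are the easiest to picture in code. Below is a minimal PyTorch sketch of how such gating is typically wired; it is not the PR's train_gpt.py, and the module names, norm placement, and skip-gate initialization are assumptions.

```python
import torch
import torch.nn as nn

class CatalyticBlock(nn.Module):
    """Transformer block with learned per-dim gates on the attn/MLP outputs."""
    def __init__(self, dim: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)
        self.attn_gate = nn.Parameter(torch.ones(dim))  # catalytic gate, init=1.0
        self.mlp_gate = nn.Parameter(torch.ones(dim))   # catalytic gate, init=1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn_gate * self.attn(self.norm1(x))
        x = x + self.mlp_gate * self.mlp(self.norm2(x))
        return x

class GatedUNetSkip(nn.Module):
    """Sigmoid-gated blend of a saved encoder activation into the matching decoder block."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # sigmoid(0)=0.5; init is an assumption

    def forward(self, decoder_x: torch.Tensor, encoder_x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate)
        return g * decoder_x + (1.0 - g) * encoder_x
```

In an 11-layer stack, early-layer activations would be stashed on the way in and blended back into the corresponding late layers through these gates.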

Results

Seed   val_loss   val_bpb   Steps   ms/step
1337   1.9516     1.1558    6898    87.0

Pre-quant EMA: 1.1606 → Post-quant int5/6+zstd: 1.1723 → Sliding window: 1.1558
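
The sliding-window figure re-scores the quantized model so that nearly every token is predicted with close to a full 1024-token left context (at stride 64, at least 960 tokens). A minimal sketch of this style of eval, assuming a model that maps [1, T] token ids to [1, T, vocab] next-token logits; the function name and interface are illustrative, not the PR's eval_val_sliding_window:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, n_bytes, seq_len=1024, stride=64):
    """Score every target token exactly once, giving it up to seq_len-1 tokens
    of left context by advancing a seq_len window in steps of `stride` and
    counting only the targets that earlier windows have not yet scored."""
    n = tokens.numel()
    nll_sum, prev_end = 0.0, 0
    for begin in range(0, n, stride):
        end = min(begin + seq_len, n)
        window = tokens[begin:end].unsqueeze(0)       # [1, W]
        logits = model(window[:, :-1])                # assumed [1, W-1, vocab] next-token logits
        targets = window[:, 1:].clone()
        n_new = end - max(prev_end, begin + 1)        # targets not covered by earlier windows
        targets[:, :-n_new] = -100                    # mask positions already scored
        nll_sum += F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
            ignore_index=-100, reduction="sum").item()
        prev_end = end
        if end == n:
            break
    return (nll_sum / math.log(2)) / n_bytes          # total bits / byte length of the scored text
```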

Files

  • train_gpt.py — self-contained training + eval script
  • submission.json — structured results
  • train_seed1337.log — full training log

11 layers with gated U-Net skip connections, catalytic residuals,
SwiGLU MLP, value residual, sliding window eval (stride=64).
Int5/Int6 mixed quantization + zstd-22. 15.1MB artifact.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@MatoTeziTanka

Community Review — Record: 1.1558 BPB — 11L U-Net + Catalytic + SwiGLU + SW64

Compliance flag: Pre-Quant TTT violation


PR #507 — 11L U-Net + Catalytic + SwiGLU + SW64

Author: skarakulak
Head SHA: da436e0
Submitted BPB: 1.1558 (sliding window eval)


Check 1: N-gram Family Bug (target token in hash key)

CLEAN. compute_bigram_hash(tokens) at lines 801–809 builds the key as hash(tokens[i], tokens[i-1]) applied to x_batch (the input sequence). At position i the key is (current input token, previous input token) — this conditions on tokens already seen, not on the target. This is the legal BigramHash pattern, identical to the PR #1413 reference, sketched below. No violation.
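
To make the distinction concrete, here is a minimal sketch of the legal key construction described above; the hash mixing and constants are placeholders, not the PR's compute_bigram_hash:

```python
import torch

def bigram_hash_keys(input_tokens: torch.Tensor, n_buckets: int = 4096) -> torch.Tensor:
    """Bucket key at position i depends only on input tokens at i and i-1,
    i.e. tokens the model has already seen, never on the target at i."""
    prev = torch.roll(input_tokens, shifts=1, dims=-1)
    prev[..., 0] = 0                              # no previous token at position 0
    mixed = input_tokens * 1000003 + prev         # placeholder bigram mix
    return mixed % n_buckets

# The illegal variant Check 1 rules out would key on the *target* token
# (the input shifted left, i.e. the token being predicted), leaking the answer.
```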


Check 2: Pre-Quant TTT — multi-epoch AdamW on val_tokens without score-first

VIOLATION. eval_val_ttt_sgd (line 1375) runs 10 epochs of AdamW over the full validation token stream before evaluating. The loop at line 1440 (for epoch in range(args.ttt_sgd_epochs)) trains on all val chunks unconditionally, then calls eval_val_sliding_window on the adapted model. There is no score-first gate — training happens on all chunks including the final scored positions before any score is recorded.

TTT_SGD_ENABLED defaults to "1" (line 199), so this path executes by default.

The submitted BPB (1.1558) comes from final_sliding_window (pre-TTT, plain sliding-window eval), not from final_ttt_sgd (1.1922, which is worse), so the author appears to have used the sliding-window score for the record. However, the code as shipped runs Pre-Quant TTT by default, and the training log confirms it executed (final_ttt_sgd val_bpb: 1.1922). The violation is in the code, not in which metric was reported for the record: the illegal eval path is live and runs by default.
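
Schematically, the flagged structure is: adapt on the whole validation stream first, score afterwards. This is a paraphrase of the description above, not the PR's eval_val_ttt_sgd, and the model/optimizer/scoring interface is assumed:

```python
import torch
import torch.nn.functional as F

def preq_ttt_flow(model, val_chunks, optimizer, epochs, score_fn):
    """Train on every validation chunk for several epochs, then score the adapted model."""
    model.train()
    for _ in range(epochs):                           # 10 AdamW epochs in the submitted code
        for chunk in val_chunks:                      # every val chunk, unconditionally
            logits = model(chunk[:, :-1])             # assumed [1, T-1, vocab] logits
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   chunk[:, 1:].reshape(-1))
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
    return score_fn(model)                            # scoring happens only after adaptation
```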


Check 3: Legal TTT (score-first-per-chunk)

eval_val_ttt_lora implements proper score-first LoRA TTT (scores chunk before training on it). This path is legal. However it is gated behind ttt_enabled AND NOT ttt_sgd_enabled, and since ttt_sgd_enabled=True by default, the LoRA path does not run in the default configuration. Moot for this submission.
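
For contrast, a minimal sketch of the score-first-per-chunk discipline (the PR #1413 pattern as summarized here); the helper and the plain full-parameter update below are illustrative, not the PR's LoRA-based eval_val_ttt_lora:

```python
import torch
import torch.nn.functional as F

def chunk_nll(model, chunk):
    """Summed next-token NLL over one [1, T] chunk of token ids
    (assumes the model returns [1, T-1, vocab] logits)."""
    logits = model(chunk[:, :-1])
    targets = chunk[:, 1:]
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1), reduction="sum")
    return nll, targets.numel()

def score_first_ttt(model, val_chunks, optimizer):
    """Score each chunk with the current weights before adapting on it,
    so no chunk is ever scored after the model has trained on it."""
    total_nll, total_tokens = 0.0, 0
    for chunk in val_chunks:
        model.eval()
        with torch.no_grad():                         # 1) score first
            nll, n = chunk_nll(model, chunk)
        total_nll += nll.item()
        total_tokens += n
        model.train()                                 # 2) then adapt on the scored chunk
        loss, _ = chunk_nll(model, chunk)
        (loss / max(n, 1)).backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    return total_nll / total_tokens                   # mean NLL under score-first discipline
```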

By contrast, eval_val_ttt_sgd (the default path, per Check 2) trains on the full validation token stream for 10 AdamW epochs before any scoring occurs, violating score-first discipline.

Verdict: CLOSE — Pre-Quant TTT violation (10-epoch AdamW on val_tokens before scoring).

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE unless the author disables TTT_SGD_ENABLED and resubmits with the clean sliding-window score, or restructures the eval to score-first-per-chunk (the PR #1413 pattern).


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.

