RECORD: SmearGate + Attention Output Gate + Legal TTT | val_bpb=1.07139 (PR openai#1667)
Conversation
Removed unused arg from the running command
MarioPaerle
left a comment
Updated Readme
…Output Gate; PR openai#1670 dexhunter 1.05970 casefold pending; PR openai#1647 SLOT-4 risky; Session 15 https://claude.ai/code/session_01VS9iDJJ7C5Qqpk8AAd1Avv
MarioPaerle
left a comment
Readme now includes more details on the submission code size
…TTT — val_bpb 1.05733 (3-seed mean)

Stacks the per-head Attention Output Gate (PR openai#1667 @MarioPaerle) and SmearGate on top of PR openai#1670's Casefold V4 + Multi-Phase Global SGD TTT base. Zero-init gates (identity at init) add 1,056 + 13 parameters total.

- Seed 42: val_bpb=1.05693, val_loss=3.04604, artifact=15,936,269 B
- Seed 0: val_bpb=1.05730, val_loss=3.04712, artifact=15,937,514 B
- Seed 1234: val_bpb=1.05777, val_loss=3.04846, artifact=15,938,772 B
- 3-seed mean val_bpb=1.05733 (std 0.00035), val_loss=3.04721 nats
- Delta vs casefold leader (PR openai#1585): -0.00657 BPB / -0.01697 nats (>3x the 0.005-nat bar)
- Delta vs PR openai#1670 casefold base: -0.00237 BPB / -0.00680 nats

Casefold legality is pending organizer review at Issue openai#1604. AttnOutGate and SmearGate are pure architectural additions and comply with all Issue openai#1017 conditions (causality, normalized distribution, score-before-update, single pass).
The public PR body for openai#1667 claims a run with , , and , but the shipped default surface leaves the gates OFF and qk_gain at 5.0. This branch bakes the claimed settings into code defaults so the reproduction run actually tests the claimed surface rather than the inert default one.

Constraint: Must preserve the rest of the public PR surface exactly; only claimed env settings are baked into defaults.
Rejected: Reproduce with env vars only | the current evaluator path does not forward arbitrary env vars to remote jobs
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Any public frontier PR used as a base must pass a self-containment/defaults-vs-claim check before being treated as a serious candidate surface
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
The claimed openai#1667 surface currently keeps , and the live reproduction showed a full mid-run validation at step 4000 inside a 600-second wallclock budget. This lane disables periodic validation by default so the same family can spend those cycles on training instead.

Constraint: Must remain a systems-only exploitation of the same claimed surface; no mechanism or scorer changes.
Rejected: Leave periodic validation on | wastes wallclock on a non-essential mid-run diagnostic in the competition regime
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: In wallclock-capped rounds, periodic validation should never remain on by accident in a serious score lane
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
MarioPaerle
left a comment
Added dataset download command on readme and summary
The claimed openai#1667 surface reproduces cleanly enough to show real score signal, but the current lane is still failing at the tail. This branch removes compile from the final quantized eval and the TTT eval path, and skips the TTT compile warmup, so we can distinguish score quality from eval-compile fragility.

Constraint: Must preserve the claimed train-time surface and only alter final-eval execution strategy.
Rejected: Disable all compile everywhere | would change the train-time systems regime more than necessary
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: If this lane succeeds cleanly with a similar score, treat eval compile as an optional optimization rather than a required part of the candidate surface
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
Stacks 4-layer x 4-pass depth recurrence (23 virtual layers) on PR openai#1667's SmearGate + Attention Output Gate + legal TTT base (1.0714 BPB).

Changes vs PR openai#1667:
- LOOP_END: 5 -> 6 (includes layer 6 in loop)
- NUM_LOOPS: 2 -> 3 (4 passes total)
- Gate defaults flipped on so reproduction needs no env vars
Less aggressive than the 4Lx4Pass variant: 19 virtual layers from loop_end=6 x 3 passes. +12% compute/step vs the PR openai#1667 base; expected ~4330 steps in 600s.

Motivation: the prior 4Lx4Pass (23 virt) landed at 1.07306; step loss ate the capacity gain. This variant keeps the wider loop but reduces pass count.

Changes vs PR openai#1667:
- LOOP_END: 5 -> 6 (includes layer 6 in loop)
- NUM_LOOPS: 2 (unchanged)
- Gate defaults flipped on
already ships an MLP output gate path behind , but the best reproduced line so far () still leaves it off. This branch enables the gate by default on the same claimed-surface/no-mid-run-validation line to test the cheapest remaining same-family architectural tweak.

Constraint: Must stay inside the openai#1667 family and avoid changing TTT, scorer, or packaging semantics.
Rejected: Touch the TTT protocol again | current evidence says tail cleanliness, not the training recipe, is the more immediate blocker
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep this lane focused on the tiny gate toggle only; do not mix in new systems changes before it is measured cleanly
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
…i#1667 line

The claimed openai#1667 surface reproduces well on our infra, but we still do not know whether SmearGate is helping or just riding along with the attention output gate. This lane keeps the better claimed-surface/no-mid-run-validation stack and turns SmearGate back off so we can measure the attention-output-gate contribution in isolation.

Constraint: Must stay in the same family and avoid changing TTT, scorer, or systems path.
Rejected: Turn off the attention gate instead | the PR body and earlier signal both suggest the attention output gate is the more central mechanism
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as a family ablation, not a new novelty thesis
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
W72 and W73 showed that adding the MLP gate regresses and that keeping only the attention output gate collapses the score. This branch keeps the W69 control surface but disables the attention output gate so we can measure whether SmearGate itself carries the gain.

Constraint: Keep the change to a single mechanism toggle so W69 remains the control
Rejected: Hybrid multi-toggle follow-up | would confound attribution after W72/W73
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as a mechanism attribution run, not a tuned candidate surface
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote training/eval on Lepton
Our round26 reproductions reach the post-EMA diagnostic score and then die in serialize() because _compressed_code_size() unconditionally shells out to a pyminify CLI that is not present in the worker environment. Falling back to the raw source keeps the code-size estimate conservative while allowing the actual quantization and TTT tail to run.

Constraint: Keep model behavior unchanged and only harden the packaging/tail path
Rejected: Add a guessed pip dependency for pyminify | CLI/provider mismatch is unclear and slower to validate remotely
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as an operational tail fix for round26 reproductions, not as a model improvement
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote serialize/quantize/TTT completion on Lepton
Layout: [3,4,5,6] -> [3,4,5] -> [3,4] (16 virt, 9 looped passes). Matches PR openai#1667 compute exactly but breaks uniform-loop symmetry so LoRA TTT sees distinguishable per-layer gradient paths. ASYMMETRIC_LOOP env toggle added; default ON for this experiment. Gates stay on (SMEAR_GATE=1, GATE_ATTN_OUT=1, QK_GAIN_INIT=5.25).
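The tapered layout can be made concrete as an unrolled layer schedule. A minimal sketch, assuming an 11-layer base model (consistent with the 16 virtual layers quoted above); the helper name and list representation are illustrative, not the actual train_gpt.py structure:

```python
def build_layer_schedule(n_layers=11,
                         loop_passes=([3, 4, 5, 6], [3, 4, 5], [3, 4])):
    """Unroll an asymmetric depth-recurrence layout into the sequence of
    layer indices executed in one forward pass. The looped passes replace
    the plain traversal of those layers, so the first pass through
    [3,4,5,6] is not double-counted."""
    first = loop_passes[0]
    pre = list(range(first[0]))                   # layers before the loop
    post = list(range(first[-1] + 1, n_layers))   # layers after the loop
    looped = [i for p in loop_passes for i in p]  # 4 + 3 + 2 = 9 looped passes
    return pre + looped + post

sched = build_layer_schedule()
# [0,1,2] + [3,4,5,6, 3,4,5, 3,4] + [7,8,9,10] -> 16 virtual layers
```

With a uniform loop every pass sees identical layer sequences; the taper makes each pass's gradient path distinguishable, which is the stated motivation for the LoRA TTT experiment.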
W75 proved that the code-default openai#1667 surface reaches a real quantized_ttt_lora result, but it lands at 1.1106 rather than the PR's claimed 1.07139. The public PR body explicitly describes score-first TTT as SGD with 0.005 LR and 3 epochs per chunk, while the shipped defaults still use Adam, 1e-4 LR, and one grad step. This commit bakes the claimed TTT settings into the surface so we can test whether that mismatch explains the reproduction gap.

Constraint: Keep the model/training surface fixed and change only the TTT defaults
Rejected: More architecture ablations first | the dominant unresolved gap is now the public TTT surface mismatch
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Judge W76 only as a claimed-surface reproduction test, not as a tuned new candidate
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote train/eval on Lepton
The PR body, code defaults, and attached train logs disagree. W75 showed the code-default surface reaches a real quantized_ttt_lora result, but it lands far from the claimed score. This branch moves toward the actual attached-log surface by restoring VAL_LOSS_EVERY=4000 and limiting the effective training shard set to 80, both of which are explicitly printed in the PR's bundled logs.

Constraint: Preserve the W75 tail-fix and the logged TTT defaults while changing only the surface mismatches proven by the attached logs
Rejected: Combine with README-claimed SGD TTT settings | that would mix the PR body surface with the attached-log surface and lose attribution
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Use this branch only as an exact-log reproduction probe, not as a tuned candidate line
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote train/eval on Lepton
W78 showed that the raw default surface is nowhere near the claimed score, but openai#1700 differs from openai#1667 because its attached train logs and README do agree on the eval-time mechanism. This branch bakes in the surfaced settings from the PR materials: phased TTT enabled with 3 phases, int7 embeddings, tighter MLP/embed clip sigmas, and an 80-shard training view matching the attached logs.

Constraint: Keep the architecture fixed and change only the public surface defaults needed to match the PR's own materials
Rejected: Jump straight to new architecture tuning | the unresolved question is still whether openai#1700's claimed public surface is reproducible
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as a claimed/log-aligned reproduction lane, not as an original tuning line
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote train/eval on Lepton
…base

Stage 1 of cross-stack port: minimal model-level additions on top of PR openai#1700 (Multi-Phase Global SGD + Phased TTT + VarLen + DepthRec, 1.07219 mean) without touching the weight-bank attention path.

Changes:
- QK_GAIN_INIT default 5.0 -> 5.25
- SmearGate (modded-nanogpt forward-1 token smear) added at model level, inserted between tok_emb and rms_norm in both forward_logits and forward_ttt. New params (smear_gate.weight, smear_lambda) auto-passthrough quant via the numel <= 65536 rule and are registered with the scalar AdamW optimizer.

AttnOutGate (the larger of the two gates from PR openai#1667) is deferred to Stage 2 since it needs surgery inside the attention/bank forward. If Stage 1 lands <= 1.0710, it validates the port and motivates Stage 2.
Sliding-window eval only (TTT_ENABLED=0 by default) on standard SP8192. Beats merged non-casefold SOTA PR openai#1493 (1.0810) by 0.00394 BPB without any test-time adaptation. Single seed 1337; compute-constrained non-record submission. The VM went down before the run log could be pushed, so it is not attached; the metrics were observed during the session.

Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop injection, Gemma-style global/local attention, Gram Newton-Schulz) + PR openai#1530 (@samacqua) varlen attention + fused MLP Triton kernel + AttnOutGate (PR openai#1667 @MarioPaerle) + SmearGate (@KellerJordan concept, @MarioPaerle reintroduction) + new layered local sliding windows (512 on early/loop layers, 1024 on post-loop layers, split at index 6). KV-tying on globals dropped vs PR openai#1674.

TTT scaffolding (phased global-SGD + per-doc LoRA, from the PR openai#1693 lineage) remains in the file for experiments but is disabled by default for this submission.
…base

Builds on Stage 1 (SmearGate + QK-Gain 5.25, seed 42 = 1.07219). Adds a per-head multiplicative gate inside attention (g = 2*sigmoid(W x[:,:12]), broadcast across head_dim, applied between the flash_attn output and out_proj). Zero-init projection so the gate starts at ~1.0: Stage 2 is numerically identical to Stage 1 at step 0.

Wired into:
- CausalSelfAttention.forward (forward_logits path)
- _block_with_lora (sequential TTT path)
- _parallel_block_with_lora (parallel TTT path, layers >= parallel_start_layer)

Param footprint: 96 floats per layer (8 heads x 12 width), 1152 total across 12 layers. Auto-passthrough via the numel <= 65536 quant rule. Routed to scalar AdamW via the attn_gate_proj entry in CONTROL_TENSOR_NAME_PATTERNS.

Hypothesis: AttnOutGate adds ~0.0010-0.0015 BPB on top of Stage 1. Combined with the Stage 1 gain (0.0011 over PR openai#1700), the full PR openai#1667 -> PR openai#1700 cross-stack port should reach ~1.0707-1.0710 (seed 42).
Drops RoPE entirely from the SmearGate + AttnOutGate + QK525 frontier to test whether absolute-position bias is bottlenecking the PR openai#1700 TTT + PR openai#1667 gates stack at ~1.071 BPB. PR openai#1718 explicitly flagged relative-position attention as the next architectural axis, and no PR has tried NoPE at the frontier.

ALiBi was the first choice, but FA3 (Dao-AILab/flash-attention/hopper/flash_attn_interface.py) has no alibi_slopes parameter, and the FA2 fallback breaks the 600s budget under TTT. NoPE is the cheapest position-axis test under FA3.

The NOPE env knob (default 1) gates apply_rotary_emb in three attn paths: forward(), _block_with_lora(), _parallel_block_with_lora(). The rotary module is still constructed, so warmup calls remain harmless and the diff is reversible with NOPE=0 (reproduces Stage 2 numerics). Zero new params; submission size unchanged.
Per PR openai#1667/openai#1693: a tiny linear (gate_width x num_heads, default 12x8 = 96 weights per layer) projects the first 12 dims of the input into per-head gate values. Scaled to (0, 2) via 2*sigmoid for symmetric pass-through at zero-init. Total: 1056 extra params (8 heads x 12 width x 11 layers) — ~1KB at fp16. Zero-init = identity at start (transparent). Lets each head dynamically suppress noise per-token. Compatible with depth recurrence, parallel residuals, XSA, and GPTQ (gate weights pass through as fp16, numel < 65536).
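In isolation, the gate math reduces to a few lines. The numpy sketch below is illustrative only (shapes, the function name, and the standalone form are assumptions; in the submission the gate lives inside the attention forward and its weights are trained):

```python
import numpy as np

def attn_out_gate(attn_out, x, W):
    """Per-head output gate: a tiny linear reads the first gate_width dims
    of the layer input and produces one multiplier per head, scaled into
    (0, 2) by 2*sigmoid so that a zero-initialized W yields exactly 1.0
    (identity / transparent pass-through at the start of training).

    attn_out: (T, n_heads, head_dim) attention output before out_proj
    x:        (T, d_model) layer input
    W:        (n_heads, gate_width) gate projection, zero-initialized
    """
    gate_width = W.shape[1]
    logits = x[:, :gate_width] @ W.T        # (T, n_heads)
    g = 2.0 / (1.0 + np.exp(-logits))       # 2*sigmoid, in (0, 2)
    return attn_out * g[:, :, None]         # broadcast across head_dim

# Zero-init => gate is exactly 1 everywhere, so the output is unchanged.
T, n_heads, head_dim, gate_width = 5, 8, 64, 12
x = np.random.randn(T, n_heads * head_dim)
attn = np.random.randn(T, n_heads, head_dim)
W0 = np.zeros((n_heads, gate_width))
```

With the default 8 heads x 12 width this is 96 weights per layer, matching the parameter count quoted above.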
Forward-1-token residual mixer at embedding lane:
x_t <- x_t + lambda * sigmoid(W * x_t[:12]) * x_{t-1}
The model gets a learnable bias toward bigram features without needing
attention to discover it. Tiny (13 params total: 12-wide linear + scalar lambda).
Zero-init lambda = transparent at start.
BOS-fix prevents cross-document leakage during packed training: gate is
masked to 0 at positions where input_ids == BOS_TOKEN_ID (default 1).
Both smear_gate.weight and smear_lambda match 'smear' pattern -> route to
scalar AdamW, not Muon. Both at GPT-level (not blocks), so explicitly
appended to scalar_params in Optimizers.
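The update rule and BOS mask above can be sketched end to end in numpy. This is a minimal sketch under stated assumptions (function name, standalone shapes, and the explicit `bos_id` argument are illustrative; the real gate is a trained module inside the GPT forward):

```python
import numpy as np

def smear_gate(x, input_ids, w, lam, bos_id=1):
    """Forward-1-token residual mixer:
        x_t <- x_t + lam * sigmoid(w . x_t[:12]) * x_{t-1}
    The gate is masked to 0 wherever input_ids == bos_id, so the last
    token of one packed document cannot leak into the BOS of the next.
    Zero-init lam makes the whole op the identity at the start of training.

    x: (T, d_model), input_ids: (T,), w: (12,), lam: scalar
    """
    prev = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)  # x_{t-1}
    g = 1.0 / (1.0 + np.exp(-(x[:, :12] @ w)))                     # (T,)
    g = np.where(input_ids == bos_id, 0.0, g)                      # BOS mask
    return x + lam * g[:, None] * prev
```

The 13-parameter count quoted above is the 12-wide vector `w` plus the scalar `lam`.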
cocohearts
left a comment
Accepted on substance, but please reformat the record directory before merge. The current directory uses 2026_04_16_...; please rename it to the standard YYYY-MM-DD_description form, e.g. records/track_10min_16mb/2026-04-16_SmearGate_AttentionOutputGate_ScoreFirstTTT. No ML/result change needed.
Duplicate requested-changes review submitted by automation; keeping the earlier requested-changes review active.
…feedback

Renamed 2026_04_16_SmearGate_Attention_Output_Gate_Score-First_TTT -> 2026-04-16_SmearGate_AttentionOutputGate_ScoreFirstTTT. No file content changes.
@cocohearts done.
MarioPaerle
left a comment
renamed folders
…lone openai#1851

Part 1: BOS-fixed SmearGate + per-head attn output gate ported onto the PR1493 wd_strong_paired baseline (15+/-6 lines in train_pr1493.py). 5 new env vars: SMEARGATE_{ENABLED,BOS_ID,INIT}, ATTN_GATE_{ENABLED,INIT}.

SmearGate is causal previous-token mixing with the BOS document-boundary mask from PR openai#1851: at positions where input_ids == bos_id, the smear contribution is forced to zero so the final token of doc N cannot leak into the BOS of doc N+1. Verified by a focused unit test. The per-head attn_gate is added inside CausalSelfAttention, applied to the flash_attn output before XSA. smeargate.smear_gate is a top-level GPT parameter, so it gets explicitly appended to Optimizers.scalar_params (not picked up by the blocks-only loop). CONTROL_TENSOR_NAME_PATTERNS extended; 100% optimizer coverage verified.

Real-run results (single seed s42, 8xH100):

| variant | pre | q | q_sw | q_ttt | d_qttt |
| --- | --- | --- | --- | --- | --- |
| baseline (wd_strong_paired) | 1.08573 | 1.09874 | 1.08194 | 1.07971 | -- |
| smear+attn_gate1d (sigmoid) | 1.08663 | 1.09887 | 1.08220 | 1.08052 | +0.00081 |
| smearonly (gate off) | 1.08601 | 1.09834 | 1.08170 | 1.07998 | +0.00027 |
| smear_gate2d (additive) | killed mid-train (~step 4000, val 1.1051) | | | | |

The 1D per-head sigmoid gate (8 params/layer) is under-capacity vs upstream PR openai#1667's 96 params/layer, and is +0.00090 worse pre-quant: a real regression in the trained model. SmearGate alone improves q (-0.00040) and q_sw (-0.00024) but disrupts our SGD TTT lift (0.0017 vs 0.0022 baseline); net q_ttt is within seed noise. The artifact stays >16 MB (the added code costs ~7 KB; still a bust like baseline).

Conclusion: the port is mechanically correct; it just doesn't help on the PR1493 base without the rest of the top stack (LQER, phased TTT, CaseOps).

Part 2: Critical leaderboard analysis. PR openai#1855 and PR openai#1851 are both verified-merged by maintainer cocohearts and listed on the README.
PR openai#1855 has an OPEN val_docs=10_000 vs canonical 50_000 dispute (jfc43, 2026-04-30, unresolved) that affects the entire CaseOps chain (PRs 1736/1769/1787/1851/1855/1868). If the ruling lands against it, all six fall and the PR1493 family returns to the top, so building on PR1493 is a hedged investment.

Real pre/q/q_ttt comparison vs the openai#1855 seed-42 log: their pre=1.06396 vs ours 1.08573 (+0.022 BPB gap at the trained-model level), bigger than the total 0.020 gap. The leaderboard wedge is dominated by training-level wins (CaseOps + SparseAttnGate + 9-knob hparam stack), not LQER/phased-TTT.

Part 3: Pivot decision. Clone openai#1851's train_gpt.py (152 KB, 3,574 lines) as the new base rather than porting their 2,500+ lines into our 553-line file. openai#1851 was picked over openai#1855 because: same q_ttt within noise (1.06128 vs 1.06108), no lrzip system dep, fewer disputes. Layer only our small PR1493 differentiators (paired-head Muon NS, wd_schedule, gptq_all_reduce). CaseOps shards are already published at romeerp/parameter-golf-caseops-v1 (80 train + val + val_bytes sidecar + tokenizer), which saves 1-2 hr of CPU retokenization. Background download in progress at session end.

Plan for next session: reproduce openai#1851 unmodified at s42 (target q_ttt 1.06128 +/- 0.0005); if reproduced, layer paired-head Muon then wd_schedule one at a time; if not reproduced, stop and debug.

Files added:
- pr1493_smeargate_to_top_stack_session.md: full session writeup
- _top_ref/: cached openai#1851 reference files (train_gpt.py, lossless_caps.py, prepare_caseops_data.py, README.md)
- run_smear_*.sh: smear experiment runners
- run_chain_smear_experiments.sh: chain runner
- run_mom97.sh: drafted but superseded
- logs/smear_*.txt + .stdout: full run logs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RECORD: SmearGate + Attention Output Gate + Legal TTT
mean val_bpb = 1.07139 | std = 0.00082 | 15.927 MB
Key Results
Smear Gate
Reintroduces the Smear Gate, now with input dependence, in the modded-nanogpt style.
Attention Output Gate (Per-Head Output Modulation)
A lightweight per-head multiplicative gate on the attention output.
GATE_ATTN_OUT=1 GATE_ATTN_SRC=proj GATE_WIDTH=12

Training Configuration
Installing packages
sp8192 Dataset Download
Run command
Note
Note on code size: train_gpt.py is shipped as raw source for readability (125 KB), but _compressed_code_size() reports the theoretical on-disk size of the same source after pyminify + LZMA + base85 wrapping (~30 KB).
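The reported ~30 KB figure is the length of the base85-wrapped LZMA stream. A stdlib-only sketch of that measurement (the pyminify step is elided here, so on real source this gives an upper bound on the reported size):

```python
import base64
import lzma

def wrapped_size(source_text: str) -> int:
    """Length in bytes of the source after LZMA compression and base85
    wrapping -- the quantity _compressed_code_size() is described as
    reporting (minification omitted in this sketch)."""
    compressed = lzma.compress(source_text.encode("utf-8"))
    return len(base64.b85encode(compressed))
```

base85 expands the compressed stream by a factor of 5/4, so the wrapped size is 1.25x the raw LZMA payload.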
Training completes in ~587s (wallclock-capped), reaching 4836-4843 steps depending on seed. The gate overhead is ~1.5% of step throughput (from ~8,200 tok/s to ~8,080 tok/s at step 1000, widening slightly with layer looping after step ~2141).
Full Architecture Stack
1/sqrt(layer_idx+1)

Compliance
This submission satisfies all Track B requirements:
Acknowledgments
Built on the work of the parameter-golf community:
This work was also made possible by the support of Paradigma ([link](https://paradigma.inc/)) and the use of Flywheel ([link](https://flywheel.paradigma.inc/)), whose infrastructure supports this research.
Our Team: me, @CerovazS, @GabrieleCirillo