
RECORD: SmearGate + Attention Output Gate + Legal TTT | val_bpb=1.07139 #1667

Merged
cocohearts merged 5 commits into openai:main from MarioPaerle:record/2026_04_16_SmearGate_Attention_Output_Gate_Score-First_TTT
Apr 30, 2026

Conversation


@MarioPaerle MarioPaerle commented Apr 16, 2026

RECORD: SmearGate + Attention Output Gate + Legal TTT

mean val_bpb = 1.07139 | std = 0.00082 | 15.927 MB

Key Results

| Seed | Steps | Pre-Quant val_bpb | Quant val_bpb | TTT val_bpb | Artifact Size |
|------|-------|-------------------|---------------|-------------|---------------|
| 42   | 4843  | 1.07227 | 1.08262 | 1.07221 | 15.94 MB |
| 1337 | 4843  | 1.07074 | 1.08109 | 1.07057 | 15.91 MB |
| 0    | 4836  | 1.07151 | 1.08183 | 1.07139 | 15.93 MB |
| Mean | 4840  | 1.07159 | 1.08184 | 1.07139 | 15.927 MB |

Smear Gate

Reintroduces the Smear Gate, now with input-dependent gating in the Modded-NanoGPT style.

Attention Output Gate (Per-Head Output Modulation)

A lightweight per-head multiplicative gate applied to the attention output.

  • Weight-initialized to zero: at init, all heads pass through at scale 1.0
  • Total new parameters: 12 x 8 = 96 weights per layer x 11 layers = 1,056 parameters
  • Activated by GATE_ATTN_OUT=1 GATE_ATTN_SRC=proj GATE_WIDTH=12
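
The bullet points above can be sketched as a small module. This is an illustrative reconstruction, not the PR's actual code: the class and attribute names are hypothetical, and the 2*sigmoid scaling is taken from the follow-up ports described later in this thread (so zero-initialized weights give a pass-through gain of exactly 1.0 per head).

```python
import torch
import torch.nn as nn

class AttnOutputGate(nn.Module):
    """Per-head multiplicative gate on the attention output (illustrative sketch)."""

    def __init__(self, gate_width: int = 12, num_heads: int = 8):
        super().__init__()
        # 12 x 8 = 96 weights per layer; zero-init so 2*sigmoid(0) = 1.0 at init.
        self.proj = nn.Linear(gate_width, num_heads, bias=False)
        nn.init.zeros_(self.proj.weight)
        self.gate_width = gate_width

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model); attn_out: (B, T, num_heads, head_dim)
        # Gate values in (0, 2), computed from the first gate_width input dims.
        g = 2.0 * torch.sigmoid(self.proj(x[..., : self.gate_width]))
        return attn_out * g.unsqueeze(-1)  # broadcast gate across head_dim
```

With GATE_WIDTH=12 and 8 heads this is 96 parameters per layer, consistent with the 1,056-parameter total over 11 layers quoted above.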

Training Configuration

Installing packages

pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
pip install brotli sentencepiece python-minifier numpy

sp8192 Dataset Download

MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

Run command

SEED=<SEED> RUN_ID=<RUN_ID> \
  SMEAR_GATE=1 SMEAR_GATE_WIDTH=12 \
  GATE_ATTN_OUT=1 GATE_ATTN_SRC=proj GATE_WIDTH=12 \
  QK_GAIN_INIT=5.25 \
  TTT_ENABLED=1 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Note

Note on code size: train_gpt.py is shipped as raw source (125 KB) for readability, but _compressed_code_size() reports the theoretical on-disk size of the same source
after pyminify + LZMA + base85 wrapping (~30 KB).
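
A rough sketch of what such an estimate could look like, assuming only the LZMA and base85 stages (the pyminify pass is omitted here, and the repo's actual `_compressed_code_size` may differ):

```python
import base64
import lzma

def compressed_code_size(source: str) -> int:
    """Rough estimate of on-disk size after LZMA compression + base85 wrapping.

    The PR additionally minifies the source first; that stage is omitted in
    this sketch, so the estimate here is conservative (slightly larger).
    """
    compressed = lzma.compress(source.encode("utf-8"), preset=9)
    return len(base64.b85encode(compressed))
```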

Training completes in ~587s (wallclock-capped), reaching 4836-4843 steps depending on seed. The gate overhead is ~1.5% of step throughput (from ~8,200 tok/s to ~8,080 tok/s at step 1000, widening slightly with layer looping after step ~2141).

Full Architecture Stack

  • 11L x 512d x 8H / 4KV heads (GQA)
  • MLP 4x expansion with LeakyReLU(0.5)^2 activation
  • Partial RoPE (16/64 dims)
  • Layerwise LN scale: 1/sqrt(layer_idx+1)
  • Tied embeddings, logit softcap = 30.0
  • SmearGate (width=12, learned lambda) -- NEW
  • Attention Output Gate (width=12, per-head, all 11 layers) -- NEW
  • Skip gates (sigmoid-gated U-Net connections)
  • 3-layer depth recurrence (layers 3,4,5, activated at frac=0.35)
  • Parallel residuals (layer 7+)
  • QK-Gain 5.25 (per-head, per-layer)
  • MuonEq-R optimizer (WD=0.095, MLR=0.026, EMA=0.9965)
  • GPTQ quantization: int6 matrices (clip=12.85), int7 embeddings (clip=20.0)
  • Brotli-11 compression with byte-shuffle
  • Score-first TTT (SGD, LR=0.005, 3 epochs per chunk)
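
The logit softcap listed in the stack above is, in the standard formulation (the PR's exact code may differ), a smooth squashing of logits into (-30, 30):

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly bounds logits to (-cap, cap); near-identity for |logits| << cap.
    return cap * torch.tanh(logits / cap)
```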

Compliance

This submission satisfies all Track B requirements:

  1. Causality: Sliding-window TTT evaluation maintains strict token ordering. Each position is scored from its prefix only.
  2. Distribution integrity: Standard softmax over complete vocabulary without post-hoc modifications or logit biasing.
  3. Score-before-update: TTT parameters update exclusively after scoring the relevant data chunks (score-first methodology inherited from PR #1586: Per-Layer Adaptive GPTQ Clip + int7 Embeddings + MLR 0.026, val_bpb 1.07493, 3-seed mean).
  4. Single evaluation: Each token receives exactly one score without rescoring passes.
  5. Artifact size: All seeds produce artifacts under 16,000,000 bytes (max: 15,936,229 bytes for seed 42).
  6. Training time: All seeds complete within 600s wallclock.
  7. TTT eval time: All seeds complete TTT eval within 600s.

Acknowledgments

Built on the work of the parameter-golf community:

This work was also made possible by the support of Paradigma ([link](https://paradigma.inc/)) and the use of Flywheel ([link](https://flywheel.paradigma.inc/)), whose infrastructure supported this research.

Our Team: me, @CerovazS, @GabrieleCirillo


@MarioPaerle MarioPaerle left a comment


Updated Readme

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 16, 2026

@MarioPaerle MarioPaerle left a comment


Readme now includes more details on the submission code size

dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 17, 2026
…TTT — val_bpb 1.05733 (3-seed mean)

Stacks per-head Attention Output Gate (PR openai#1667 @MarioPaerle) and SmearGate
on top of PR openai#1670's Casefold V4 + Multi-Phase Global SGD TTT base.
Zero-init gates (identity at init) add 1,056 + 13 parameters total.

- Seed 42:   val_bpb=1.05693, val_loss=3.04604, artifact=15,936,269 B
- Seed 0:    val_bpb=1.05730, val_loss=3.04712, artifact=15,937,514 B
- Seed 1234: val_bpb=1.05777, val_loss=3.04846, artifact=15,938,772 B
- 3-seed mean val_bpb=1.05733 (std 0.00035), val_loss=3.04721 nats
- Delta vs casefold leader (PR openai#1585): -0.00657 BPB / -0.01697 nats (>3x the 0.005-nat bar)
- Delta vs PR openai#1670 casefold base: -0.00237 BPB / -0.00680 nats

Casefold legality pending organizer review at Issue openai#1604.
AttnOutGate and SmearGate are pure architectural additions and comply with
all Issue openai#1017 conditions (causality, normalized distribution, score-before-
update, single pass).
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
The public PR body for openai#1667 claims a run with , , and , but the shipped default surface leaves the gates OFF and qk_gain at 5.0. This branch bakes the claimed settings into code defaults so the reproduction run actually tests the claimed surface rather than the inert default one.

Constraint: Must preserve the rest of the public PR surface exactly; only claimed env settings are baked into defaults.
Rejected: Reproduce with env vars only | the current evaluator path does not forward arbitrary env vars to remote jobs
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Any public frontier PR used as a base must pass a self-containment/defaults-vs-claim check before being treated as a serious candidate surface
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
The claimed openai#1667 surface currently keeps , and the live reproduction showed a full mid-run validation at step 4000 inside a 600-second wallclock budget. This lane disables periodic validation by default so the same family can spend those cycles on training instead.

Constraint: Must remain a systems-only exploitation of the same claimed surface; no mechanism or scorer changes.
Rejected: Leave periodic validation on | wastes wallclock on a non-essential mid-run diagnostic in the competition regime
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: In wallclock-capped rounds, periodic validation should never remain on by accident in a serious score lane
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall

@MarioPaerle MarioPaerle left a comment


Added dataset download command on readme and summary

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
The claimed openai#1667 surface reproduces cleanly enough to show real score signal, but the current lane is still failing at the tail. This branch removes compile from the final quantized eval and the TTT eval path, and skips the TTT compile warmup, so we can distinguish score quality from eval-compile fragility.

Constraint: Must preserve the claimed train-time surface and only alter final-eval execution strategy.
Rejected: Disable all compile everywhere | would change the train-time systems regime more than necessary
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: If this lane succeeds cleanly with similar score, treat eval compile as an optional optimization rather than a required part of the candidate surface
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 17, 2026
Stacks 4-layer x 4-pass depth recurrence (23 virtual layers) on PR openai#1667's
SmearGate + Attention Output Gate + legal TTT base (1.0714 BPB).

Changes vs PR openai#1667:
- LOOP_END: 5 -> 6 (includes layer 6 in loop)
- NUM_LOOPS: 2 -> 3 (4 passes total)
- Gate defaults flipped on so reproduction needs no env vars
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 17, 2026
Less aggressive than 4Lx4Pass variant. 19 virtual layers from loop_end=6 x 3 passes.
+12% compute/step vs PR openai#1667 base, expected ~4330 steps in 600s.

Motivation: prior 4Lx4Pass (23 virt) landed at 1.07306 - step loss ate capacity gain.
This variant keeps wider loop but reduces pass count.

Changes vs PR openai#1667:
- LOOP_END: 5 -> 6 (includes layer 6 in loop)
- NUM_LOOPS: 2 (unchanged)
- Gate defaults flipped on
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
 already ships an MLP output gate path behind , but the best reproduced line so far () still leaves it off. This branch enables the gate by default on the same claimed-surface/no-mid-run-validation line to test the cheapest remaining same-family architectural tweak.

Constraint: Must stay inside the openai#1667 family and avoid changing TTT, scorer, or packaging semantics.
Rejected: Touch the TTT protocol again | current evidence says tail cleanliness, not the training recipe, is the more immediate blocker
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep this lane focused on the tiny gate toggle only; do not mix in new systems changes before it is measured cleanly
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
…i#1667 line

The claimed openai#1667 surface reproduces well on our infra, but we still do not know whether SmearGate is helping or just riding along with the attention output gate. This lane keeps the better claimed-surface/no-mid-run-validation stack and turns SmearGate back off so we can measure the attention-output-gate contribution in isolation.

Constraint: Must stay in the same family and avoid changing TTT, scorer, or systems path.
Rejected: Turn off the attention gate instead | the PR body and earlier signal both suggest the attention output gate is the more central mechanism
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as a family ablation, not a new novelty thesis
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
W72 and W73 showed that adding the MLP gate regresses and that keeping only
the attention output gate collapses the score. This branch keeps the W69
control surface but disables the attention output gate so we can measure
whether SmearGate itself carries the gain.

Constraint: Keep the change to a single mechanism toggle so W69 remains the control
Rejected: Hybrid multi-toggle follow-up | would confound attribution after W72/W73
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as a mechanism attribution run, not a tuned candidate surface
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote training/eval on Lepton
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
Our round26 reproductions reach the post-EMA diagnostic score and then die in
serialize() because _compressed_code_size() unconditionally shells out to a
pyminify CLI that is not present in the worker environment. Falling back to the
raw source keeps the code-size estimate conservative while allowing the actual
quantization and TTT tail to run.

Constraint: Keep model behavior unchanged and only harden the packaging/tail path
Rejected: Add a guessed pip dependency for pyminify | CLI/provider mismatch is unclear and slower to validate remotely
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as an operational tail fix for round26 reproductions, not as a model improvement
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote serialize/quantize/TTT completion on Lepton
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 17, 2026
Layout: [3,4,5,6] -> [3,4,5] -> [3,4] (16 virt, 9 looped passes).
Matches PR openai#1667 compute exactly but breaks uniform-loop symmetry
so LoRA TTT sees distinguishable per-layer gradient paths.

ASYMMETRIC_LOOP env toggle added; default ON for this experiment.
Gates stay on (SMEAR_GATE=1, GATE_ATTN_OUT=1, QK_GAIN_INIT=5.25).
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
W75 proved that the code-default openai#1667 surface reaches a real quantized_ttt_lora
result, but it lands at 1.1106 rather than the PR's claimed 1.07139. The public
PR body explicitly describes score-first TTT as SGD with 0.005 LR and 3 epochs per
chunk, while the shipped defaults still use Adam, 1e-4 LR, and one grad step.
This commit bakes the claimed TTT settings into the surface so we can test whether
that mismatch explains the reproduction gap.

Constraint: Keep the model/training surface fixed and change only the TTT defaults
Rejected: More architecture ablations first | the dominant unresolved gap is now the public TTT surface mismatch
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Judge W76 only as a claimed-surface reproduction test, not as a tuned new candidate
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote train/eval on Lepton
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 17, 2026
The PR body, code defaults, and attached train logs disagree. W75 showed the
code-default surface reaches a real quantized_ttt_lora result, but it lands far
from the claimed score. This branch moves toward the actual attached-log surface
by restoring VAL_LOSS_EVERY=4000 and limiting the effective training shard set
to 80, which are both explicitly printed in the PR's bundled logs.

Constraint: Preserve the W75 tail-fix and the logged TTT defaults while changing only the surface mismatches proven by the attached logs
Rejected: Combine with README-claimed SGD TTT settings | that would mix the PR body surface with the attached-log surface and lose attribution
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Use this branch only as an exact-log reproduction probe, not as a tuned candidate line
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote train/eval on Lepton
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 18, 2026
W78 showed that the raw default surface is nowhere near the claimed score, but
openai#1700 differs from openai#1667 because its attached train logs and README do agree on
the eval-time mechanism. This branch bakes in the surfaced settings from the PR
materials: phased TTT enabled with 3 phases, int7 embeddings, tighter MLP/embed
clip sigmas, and an 80-shard training view matching the attached logs.

Constraint: Keep the architecture fixed and change only the public surface defaults needed to match the PR's own materials
Rejected: Jump straight to new architecture tuning | the unresolved question is still whether openai#1700's claimed public surface is reproducible
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as a claimed/log-aligned reproduction lane, not as an original tuning line
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote train/eval on Lepton
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 18, 2026
…base

Stage 1 of cross-stack port: minimal model-level additions on top of
PR openai#1700 (Multi-Phase Global SGD + Phased TTT + VarLen + DepthRec,
1.07219 mean) without touching the weight-bank attention path.

Changes:
- QK_GAIN_INIT default 5.0 -> 5.25
- SmearGate (modded-nanogpt forward-1 token smear) added at model
  level, inserted between tok_emb and rms_norm in both forward_logits
  and forward_ttt. New params (smear_gate.weight, smear_lambda) auto
  passthrough quant via numel<=65536 rule and registered with the
  scalar AdamW optimizer.

AttnOutGate (the larger of the two gates from PR openai#1667) is deferred
to Stage 2 since it needs surgery inside the attention/bank forward.

If Stage 1 lands <=1.0710 it validates the port + motivates Stage 2.
mikeapedia added a commit to mikeapedia/parameter-golf-1 that referenced this pull request Apr 19, 2026
Sliding-window eval only (TTT_ENABLED=0 by default) on standard SP8192.
Beats merged non-casefold SOTA PR openai#1493 (1.0810) by 0.00394 BPB without
any test-time adaptation. Single seed 1337; compute-constrained
non-record submission — VM went down before the run log could be pushed
so it is not attached. Metrics were observed during the session.

Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop
injection, Gemma-style global/local attention, Gram Newton-Schulz) +
PR openai#1530 (@samacqua) varlen attention + fused MLP triton kernel +
AttnOutGate (PR openai#1667 @MarioPaerle) + SmearGate (@KellerJordan concept,
@MarioPaerle reintroduction) + new layered local sliding windows
(512 on early/loop layers, 1024 on post-loop layers, split at index 6).

KV-tying on globals dropped vs PR openai#1674. TTT scaffolding (phased
global-SGD + per-doc LoRA, from PR openai#1693 lineage) remains in the file
for experiments but is disabled by default for this submission.
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 19, 2026
…base

Builds on Stage 1 (SmearGate + QK-Gain 5.25, seed 42 = 1.07219). Adds
per-head multiplicative gate inside attention (g = 2*sigmoid(W x[:,:12])
broadcast across head_dim, applied between flash_attn output and
out_proj). Zero-init projection so gate starts at ~1.0 — Stage 2 is
numerically identical to Stage 1 at step 0.

Wired into:
- CausalSelfAttention.forward (forward_logits path)
- _block_with_lora (sequential TTT path)
- _parallel_block_with_lora (parallel TTT path, layers >= parallel_start_layer)

Param footprint: 96 floats per layer (8 heads x 12 width), 1152 total
across 12 layers. Auto-passthrough via numel <= 65536 quant rule.
Routed to scalar AdamW via attn_gate_proj entry in
CONTROL_TENSOR_NAME_PATTERNS.

Hypothesis: AttnOutGate adds ~0.0010-0.0015 BPB on top of Stage 1.
Combined with Stage 1 gain (0.0011 over PR openai#1700), full PR openai#1667 ->
PR openai#1700 cross-stack port should reach ~1.0707-1.0710 (seed 42).
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 20, 2026
Drops RoPE entirely from the SmearGate + AttnOutGate + QK525 frontier
to test whether absolute-position bias is bottlenecking the PR openai#1700
TTT + PR openai#1667 gates stack at ~1.071 BPB. PR openai#1718 explicitly flagged
relative-position attention as the next architectural axis, and no PR
has tried NoPE at frontier.

ALiBi was the first choice, but FA3
(Dao-AILab/flash-attention/hopper/flash_attn_interface.py) has no
alibi_slopes parameter, and FA2 fallback breaks the 600s budget under
TTT. NoPE is the cheapest position-axis test under FA3.

NOPE env knob (default 1) gates apply_rotary_emb in three attn paths:
forward(), _block_with_lora(), _parallel_block_with_lora(). Rotary
module is still constructed so warmup calls remain harmless and the
diff is reversible by NOPE=0 (reproduces Stage 2 numerics). Zero new
params, submission size unchanged.
Meirzhan05 added a commit to Meirzhan05/parameter-golf that referenced this pull request Apr 27, 2026
Per PR openai#1667/openai#1693: a tiny linear (gate_width x num_heads, default 12x8 = 96
weights per layer) projects the first 12 dims of the input into per-head gate
values. Scaled to (0, 2) via 2*sigmoid for symmetric pass-through at zero-init.

Total: 1056 extra params (8 heads x 12 width x 11 layers), ~2 KB at fp16.
Zero-init = identity at start (transparent). Lets each head dynamically
suppress noise per-token. Compatible with depth recurrence, parallel residuals,
XSA, and GPTQ (gate weights pass through as fp16, numel < 65536).
Meirzhan05 added a commit to Meirzhan05/parameter-golf that referenced this pull request Apr 28, 2026
Forward-1-token residual mixer at embedding lane:
  x_t <- x_t + lambda * sigmoid(W * x_t[:12]) * x_{t-1}

The model gets a learnable bias toward bigram features without needing
attention to discover it. Tiny (13 params total: 12-wide linear + scalar lambda).
Zero-init lambda = transparent at start.

BOS-fix prevents cross-document leakage during packed training: gate is
masked to 0 at positions where input_ids == BOS_TOKEN_ID (default 1).

Both smear_gate.weight and smear_lambda match 'smear' pattern -> route to
scalar AdamW, not Muon. Both at GPT-level (not blocks), so explicitly
appended to scalar_params in Optimizers.
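
The formula above can be sketched as a minimal standalone module (names are illustrative; in the PR the smear_gate/smear_lambda parameters live at the GPT level rather than in a separate class):

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Forward-1-token residual mixer with BOS masking (illustrative sketch)."""

    def __init__(self, width: int = 12, bos_token_id: int = 1):
        super().__init__()
        self.w = nn.Linear(width, 1, bias=False)   # 12-wide linear
        self.lam = nn.Parameter(torch.zeros(1))    # scalar lambda; zero-init => identity
        self.width = width
        self.bos_token_id = bos_token_id

    def forward(self, x: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d) embeddings; input_ids: (B, T) token ids
        prev = torch.zeros_like(x)
        prev[:, 1:] = x[:, :-1]                    # x_{t-1}, zero at position 0
        gate = torch.sigmoid(self.w(x[..., : self.width]))  # (B, T, 1)
        # Mask BOS positions so the last token of doc N cannot leak into doc N+1.
        mask = (input_ids != self.bos_token_id).unsqueeze(-1).to(x.dtype)
        return x + self.lam * gate * mask * prev
```

With zero-initialized lambda the gate is transparent at the start of training, and the parameter count matches the 13 params (12-wide linear + scalar) quoted above.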

@cocohearts cocohearts left a comment


Accepted on substance, but please reformat the record directory before merge. The current directory uses 2026_04_16_...; please rename it to the standard YYYY-MM-DD_description form, e.g. records/track_10min_16mb/2026-04-16_SmearGate_AttentionOutputGate_ScoreFirstTTT. No ML/result change needed.


@cocohearts cocohearts left a comment


Accepted on substance, but please reformat the record directory before merge. The current directory uses 2026_04_16_...; please rename it to the standard YYYY-MM-DD_description form, e.g. records/track_10min_16mb/2026-04-16_SmearGate_AttentionOutputGate_ScoreFirstTTT. No ML/result change needed.

@cocohearts cocohearts dismissed their stale review April 29, 2026 19:20

Duplicate requested-changes review submitted by automation; keeping the earlier requested-changes review active.

…feedback

- 2026_04_16_SmearGate_Attention_Output_Gate_Score-First_TTT
+ 2026-04-16_SmearGate_AttentionOutputGate_ScoreFirstTTT

No file content changes.
@MarioPaerle

@cocohearts done.
Folder is now records/track_10min_16mb/2026-04-16_SmearGate_AttentionOutputGate_ScoreFirstTTT


@MarioPaerle MarioPaerle left a comment


renamed folders

@cocohearts cocohearts merged commit e8eeb62 into openai:main Apr 30, 2026
Itssshikhar added a commit to Itssshikhar/parameter-golf that referenced this pull request Apr 30, 2026
…lone openai#1851

Part 1: BOS-fixed SmearGate + per-head attn output gate ported onto PR1493
wd_strong_paired baseline (15+/-6 lines in train_pr1493.py). 5 new env vars:
SMEARGATE_{ENABLED,BOS_ID,INIT}, ATTN_GATE_{ENABLED,INIT}.

SmearGate is causal previous-token mixing with the BOS document-boundary mask
from PR openai#1851: at positions where input_ids == bos_id, the smear contribution
is forced to zero so the final token of doc N cannot leak into BOS of doc N+1.
Verified by a focused unit test. Per-head attn_gate added inside
CausalSelfAttention applied to flash_attn output before XSA.
smeargate.smear_gate is a top-level GPT parameter so it gets explicitly
appended to Optimizers.scalar_params (not picked up by the blocks-only loop).
CONTROL_TENSOR_NAME_PATTERNS extended; 100% optimizer coverage verified.

Real-run results (single seed s42, 8xH100):

  variant                       pre        q          q_sw      q_ttt     d_qttt
  baseline (wd_strong_paired)   1.08573    1.09874    1.08194   1.07971   --
  smear+attn_gate1d (sigmoid)   1.08663    1.09887    1.08220   1.08052   +0.00081
  smearonly (gate off)          1.08601    1.09834    1.08170   1.07998   +0.00027
  smear_gate2d (additive)       killed mid-train (~step 4000, val 1.1051)

The 1D per-head sigmoid gate (8 params/layer) is undercapacity vs upstream
PR openai#1667's 96 params/layer, and is +0.00090 worse pre-quant -- a real
regression in the trained model. SmearGate alone improves q (-0.00040) and
q_sw (-0.00024) but disrupts our SGD TTT lift (0.0017 vs 0.0022 baseline);
net q_ttt within seed noise. The artifact stays >16 MB (added code costs
~7 KB; still bust like baseline).

Conclusion: port is mechanically correct, just doesn't help on PR1493 base
without the rest of the top stack (LQER, phased TTT, CaseOps).

Part 2: Critical leaderboard analysis. PR openai#1855 and PR openai#1851 are both
verified-merged by maintainer cocohearts and listed on README. PR openai#1855
has an OPEN val_docs=10_000 vs canonical 50_000 dispute (jfc43, 2026-04-30,
unresolved) that affects the entire CaseOps chain (PRs 1736/1769/1787/
1851/1855/1868). If ruling lands against, all six fall and PR1493 family
returns to the top -- so building on PR1493 is a hedged investment.

Real pre/q/q_ttt comparison vs openai#1855 seed 42 log: their pre=1.06396 vs
ours 1.08573 (+0.022 BPB gap at the trained-model level), bigger than the
total 0.020 gap. The leaderboard wedge is dominated by training-level
wins (CaseOps + SparseAttnGate + 9-knob hparam stack), not LQER/phased-TTT.

Part 3: Pivot decision. Clone openai#1851's train_gpt.py (152 KB, 3,574 lines)
as the new base rather than porting their 2,500+ lines into our 553-line
file. openai#1851 picked over openai#1855 because: same q_ttt within noise (1.06128
vs 1.06108), no lrzip system dep, fewer disputes. Layer only our small
PR1493 differentiators (paired-head Muon NS, wd_schedule, gptq_all_reduce).

CaseOps shards already published at romeerp/parameter-golf-caseops-v1
(80 train + val + val_bytes sidecar + tokenizer); saves 1-2 hr CPU
retokenization. Background download in progress at session-end.

Plan for next session: reproduce openai#1851 unmodified at s42 (target q_ttt
1.06128 +/- 0.0005); if reproduced, layer paired-head Muon then wd_schedule
one-at-a-time; if not reproduced, stop and debug.

Files added:
  pr1493_smeargate_to_top_stack_session.md   full session writeup
  _top_ref/                                  cached openai#1851 reference files
                                             (train_gpt.py, lossless_caps.py,
                                              prepare_caseops_data.py, README.md)
  run_smear_*.sh                             smear experiment runners
  run_chain_smear_experiments.sh             chain runner
  run_mom97.sh                               drafted but superseded
  logs/smear_*.txt + .stdout                 full run logs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>