RECORD: SmearGate + Attention Output Gate + Legal TTT | val_bpb=1.07139 (PR openai#1667)
Conversation
Removed unused arg from the running command
MarioPaerle
left a comment
Updated Readme
…Output Gate; PR openai#1670 dexhunter 1.05970 casefold pending; PR openai#1647 SLOT-4 risky; Session 15 https://claude.ai/code/session_01VS9iDJJ7C5Qqpk8AAd1Avv
MarioPaerle
left a comment
Readme now includes more details on the submission code size
…TTT — val_bpb 1.05733 (3-seed mean)

Stacks the per-head Attention Output Gate (PR openai#1667 @MarioPaerle) and SmearGate on top of PR openai#1670's Casefold V4 + Multi-Phase Global SGD TTT base. Zero-init gates (identity at init) add 1,056 + 13 parameters total.

- Seed 42: val_bpb=1.05693, val_loss=3.04604, artifact=15,936,269 B
- Seed 0: val_bpb=1.05730, val_loss=3.04712, artifact=15,937,514 B
- Seed 1234: val_bpb=1.05777, val_loss=3.04846, artifact=15,938,772 B
- 3-seed mean val_bpb=1.05733 (std 0.00035), val_loss=3.04721 nats
- Delta vs casefold leader (PR openai#1585): -0.00657 BPB / -0.01697 nats (>3x the 0.005-nat bar)
- Delta vs PR openai#1670 casefold base: -0.00237 BPB / -0.00680 nats

Casefold legality is pending organizer review at Issue openai#1604. AttnOutGate and SmearGate are pure architectural additions and comply with all Issue openai#1017 conditions (causality, normalized distribution, score-before-update, single pass).
The public PR body for openai#1667 claims a run with , , and , but the shipped default surface leaves the gates OFF and qk_gain at 5.0. This branch bakes the claimed settings into code defaults so the reproduction run actually tests the claimed surface rather than the inert default one.

Constraint: Must preserve the rest of the public PR surface exactly; only claimed env settings are baked into defaults.
Rejected: Reproduce with env vars only | the current evaluator path does not forward arbitrary env vars to remote jobs
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Any public frontier PR used as a base must pass a self-containment/defaults-vs-claim check before being treated as a serious candidate surface
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
The claimed openai#1667 surface currently keeps , and the live reproduction showed a full mid-run validation at step 4000 inside a 600-second wallclock budget. This lane disables periodic validation by default so the same family can spend those cycles on training instead.

Constraint: Must remain a systems-only exploitation of the same claimed surface; no mechanism or scorer changes.
Rejected: Leave periodic validation on | wastes wallclock on a non-essential mid-run diagnostic in the competition regime
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: In wallclock-capped rounds, periodic validation should never remain on by accident in a serious score lane
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
MarioPaerle
left a comment
Added dataset download command on readme and summary
The claimed openai#1667 surface reproduces cleanly enough to show real score signal, but the current lane is still failing at the tail. This branch removes compile from the final quantized eval and the TTT eval path, and skips the TTT compile warmup, so we can distinguish score quality from eval-compile fragility.

Constraint: Must preserve the claimed train-time surface and only alter final-eval execution strategy.
Rejected: Disable all compile everywhere | would change the train-time systems regime more than necessary
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: If this lane succeeds cleanly with a similar score, treat eval compile as an optional optimization rather than a required part of the candidate surface
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
Stacks 4-layer x 4-pass depth recurrence (23 virtual layers) on PR openai#1667's SmearGate + Attention Output Gate + legal TTT base (1.0714 BPB).

Changes vs PR openai#1667:
- LOOP_END: 5 -> 6 (includes layer 6 in loop)
- NUM_LOOPS: 2 -> 3 (4 passes total)
- Gate defaults flipped on so reproduction needs no env vars
Less aggressive than the 4Lx4Pass variant: 19 virtual layers from loop_end=6 x 3 passes. +12% compute/step vs the PR openai#1667 base; expected ~4330 steps in 600s.

Motivation: the prior 4Lx4Pass (23 virt) landed at 1.07306; step loss ate the capacity gain. This variant keeps the wider loop but reduces pass count.

Changes vs PR openai#1667:
- LOOP_END: 5 -> 6 (includes layer 6 in loop)
- NUM_LOOPS: 2 (unchanged)
- Gate defaults flipped on
already ships an MLP output gate path behind , but the best reproduced line so far () still leaves it off. This branch enables the gate by default on the same claimed-surface/no-mid-run-validation line to test the cheapest remaining same-family architectural tweak.

Constraint: Must stay inside the openai#1667 family and avoid changing TTT, scorer, or packaging semantics.
Rejected: Touch the TTT protocol again | current evidence says tail cleanliness, not the training recipe, is the more immediate blocker
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Keep this lane focused on the tiny gate toggle only; do not mix in new systems changes before it is measured cleanly
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
…i#1667 line

The claimed openai#1667 surface reproduces well on our infra, but we still do not know whether SmearGate is helping or just riding along with the attention output gate. This lane keeps the better claimed-surface/no-mid-run-validation stack and turns SmearGate back off so we can measure the attention-output-gate contribution in isolation.

Constraint: Must stay in the same family and avoid changing TTT, scorer, or systems path.
Rejected: Turn off the attention gate instead | the PR body and earlier signal both suggest the attention output gate is the more central mechanism
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as a family ablation, not a new novelty thesis
Tested: python3 -m py_compile train_gpt.py evaluate.py
Not-tested: GPU execution on Heimdall
W72 and W73 showed that adding the MLP gate regresses and that keeping only the attention output gate collapses the score. This branch keeps the W69 control surface but disables the attention output gate so we can measure whether SmearGate itself carries the gain.

Constraint: Keep the change to a single mechanism toggle so W69 remains the control
Rejected: Hybrid multi-toggle follow-up | would confound attribution after W72/W73
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as a mechanism attribution run, not a tuned candidate surface
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote training/eval on Lepton
Our round26 reproductions reach the post-EMA diagnostic score and then die in serialize() because _compressed_code_size() unconditionally shells out to a pyminify CLI that is not present in the worker environment. Falling back to the raw source keeps the code-size estimate conservative while allowing the actual quantization and TTT tail to run.

Constraint: Keep model behavior unchanged and only harden the packaging/tail path
Rejected: Add a guessed pip dependency for pyminify | CLI/provider mismatch is unclear and slower to validate remotely
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as an operational tail fix for round26 reproductions, not as a model improvement
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote serialize/quantize/TTT completion on Lepton
Layout: [3,4,5,6] -> [3,4,5] -> [3,4] (16 virt, 9 looped passes). Matches PR openai#1667 compute exactly but breaks uniform-loop symmetry so LoRA TTT sees distinguishable per-layer gradient paths. ASYMMETRIC_LOOP env toggle added; default ON for this experiment. Gates stay on (SMEAR_GATE=1, GATE_ATTN_OUT=1, QK_GAIN_INIT=5.25).
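The tapered layout can be made concrete as an unrolled layer schedule. A minimal sketch, assuming an 11-layer base model (consistent with the 16 virtual layers quoted above); the helper name and list representation are illustrative, not the actual train_gpt.py structure:

```python
def build_layer_schedule(n_layers=11,
                         loop_passes=([3, 4, 5, 6], [3, 4, 5], [3, 4])):
    """Unroll an asymmetric depth-recurrence layout into the sequence of
    layer indices executed in one forward pass. The looped passes replace
    the plain traversal of those layers, so the first pass through
    [3,4,5,6] is not double-counted."""
    first = loop_passes[0]
    pre = list(range(first[0]))                   # layers before the loop
    post = list(range(first[-1] + 1, n_layers))   # layers after the loop
    looped = [i for p in loop_passes for i in p]  # 4 + 3 + 2 = 9 looped passes
    return pre + looped + post

sched = build_layer_schedule()
# [0,1,2] + [3,4,5,6, 3,4,5, 3,4] + [7,8,9,10] -> 16 virtual layers
```

With a uniform loop every pass sees identical layer sequences; the taper makes each pass's gradient path distinguishable, which is the stated motivation for the LoRA TTT experiment.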
W75 proved that the code-default openai#1667 surface reaches a real quantized_ttt_lora result, but it lands at 1.1106 rather than the PR's claimed 1.07139. The public PR body explicitly describes score-first TTT as SGD with 0.005 LR and 3 epochs per chunk, while the shipped defaults still use Adam, 1e-4 LR, and one grad step. This commit bakes the claimed TTT settings into the surface so we can test whether that mismatch explains the reproduction gap.

Constraint: Keep the model/training surface fixed and change only the TTT defaults
Rejected: More architecture ablations first | the dominant unresolved gap is now the public TTT surface mismatch
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Judge W76 only as a claimed-surface reproduction test, not as a tuned new candidate
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote train/eval on Lepton
The PR body, code defaults, and attached train logs disagree. W75 showed the code-default surface reaches a real quantized_ttt_lora result, but it lands far from the claimed score. This branch moves toward the actual attached-log surface by restoring VAL_LOSS_EVERY=4000 and limiting the effective training shard set to 80, both of which are explicitly printed in the PR's bundled logs.

Constraint: Preserve the W75 tail-fix and the logged TTT defaults while changing only the surface mismatches proven by the attached logs
Rejected: Combine with README-claimed SGD TTT settings | that would mix the PR body surface with the attached-log surface and lose attribution
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Use this branch only as an exact-log reproduction probe, not as a tuned candidate line
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote train/eval on Lepton
W78 showed that the raw default surface is nowhere near the claimed score, but openai#1700 differs from openai#1667 because its attached train logs and README do agree on the eval-time mechanism. This branch bakes in the surfaced settings from the PR materials: phased TTT enabled with 3 phases, int7 embeddings, tighter MLP/embed clip sigmas, and an 80-shard training view matching the attached logs.

Constraint: Keep the architecture fixed and change only the public surface defaults needed to match the PR's own materials
Rejected: Jump straight to new architecture tuning | the unresolved question is still whether openai#1700's claimed public surface is reproducible
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Treat this as a claimed/log-aligned reproduction lane, not as an original tuning line
Tested: python3 -m py_compile train_gpt.py
Not-tested: Remote train/eval on Lepton
…base

Stage 1 of cross-stack port: minimal model-level additions on top of PR openai#1700 (Multi-Phase Global SGD + Phased TTT + VarLen + DepthRec, 1.07219 mean) without touching the weight-bank attention path.

Changes:
- QK_GAIN_INIT default 5.0 -> 5.25
- SmearGate (modded-nanogpt forward-1 token smear) added at model level, inserted between tok_emb and rms_norm in both forward_logits and forward_ttt. New params (smear_gate.weight, smear_lambda) auto-passthrough quant via the numel <= 65536 rule and are registered with the scalar AdamW optimizer.

AttnOutGate (the larger of the two gates from PR openai#1667) is deferred to Stage 2 since it needs surgery inside the attention/bank forward. If Stage 1 lands <= 1.0710, it validates the port and motivates Stage 2.
Sliding-window eval only (TTT_ENABLED=0 by default) on standard SP8192. Beats merged non-casefold SOTA PR openai#1493 (1.0810) by 0.00394 BPB without any test-time adaptation. Single seed 1337; compute-constrained non-record submission. The VM went down before the run log could be pushed, so it is not attached; the metrics were observed during the session.

Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop injection, Gemma-style global/local attention, Gram Newton-Schulz) + PR openai#1530 (@samacqua) varlen attention + fused MLP Triton kernel + AttnOutGate (PR openai#1667 @MarioPaerle) + SmearGate (@KellerJordan concept, @MarioPaerle reintroduction) + new layered local sliding windows (512 on early/loop layers, 1024 on post-loop layers, split at index 6). KV-tying on globals dropped vs PR openai#1674.

TTT scaffolding (phased global-SGD + per-doc LoRA, from the PR openai#1693 lineage) remains in the file for experiments but is disabled by default for this submission.
…base

Builds on Stage 1 (SmearGate + QK-Gain 5.25, seed 42 = 1.07219). Adds a per-head multiplicative gate inside attention (g = 2*sigmoid(W x[:,:12]), broadcast across head_dim, applied between the flash_attn output and out_proj). Zero-init projection so the gate starts at ~1.0: Stage 2 is numerically identical to Stage 1 at step 0.

Wired into:
- CausalSelfAttention.forward (forward_logits path)
- _block_with_lora (sequential TTT path)
- _parallel_block_with_lora (parallel TTT path, layers >= parallel_start_layer)

Param footprint: 96 floats per layer (8 heads x 12 width), 1152 total across 12 layers. Auto-passthrough via the numel <= 65536 quant rule. Routed to scalar AdamW via the attn_gate_proj entry in CONTROL_TENSOR_NAME_PATTERNS.

Hypothesis: AttnOutGate adds ~0.0010-0.0015 BPB on top of Stage 1. Combined with the Stage 1 gain (0.0011 over PR openai#1700), the full PR openai#1667 -> PR openai#1700 cross-stack port should reach ~1.0707-1.0710 (seed 42).
Drops RoPE entirely from the SmearGate + AttnOutGate + QK525 frontier to test whether absolute-position bias is bottlenecking the PR openai#1700 TTT + PR openai#1667 gates stack at ~1.071 BPB. PR openai#1718 explicitly flagged relative-position attention as the next architectural axis, and no PR has tried NoPE at the frontier.

ALiBi was the first choice, but FA3 (Dao-AILab/flash-attention/hopper/flash_attn_interface.py) has no alibi_slopes parameter, and the FA2 fallback breaks the 600s budget under TTT. NoPE is the cheapest position-axis test under FA3.

The NOPE env knob (default 1) gates apply_rotary_emb in three attn paths: forward(), _block_with_lora(), _parallel_block_with_lora(). The rotary module is still constructed, so warmup calls remain harmless and the diff is reversible with NOPE=0 (reproduces Stage 2 numerics). Zero new params; submission size unchanged.
Per PR openai#1667/openai#1693: a tiny linear (gate_width x num_heads, default 12x8 = 96 weights per layer) projects the first 12 dims of the input into per-head gate values. Scaled to (0, 2) via 2*sigmoid for symmetric pass-through at zero-init. Total: 1056 extra params (8 heads x 12 width x 11 layers) — ~1KB at fp16. Zero-init = identity at start (transparent). Lets each head dynamically suppress noise per-token. Compatible with depth recurrence, parallel residuals, XSA, and GPTQ (gate weights pass through as fp16, numel < 65536).
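In isolation, the gate math reduces to a few lines. The numpy sketch below is illustrative only (shapes, the function name, and the standalone form are assumptions; in the submission the gate lives inside the attention forward and its weights are trained):

```python
import numpy as np

def attn_out_gate(attn_out, x, W):
    """Per-head output gate: a tiny linear reads the first gate_width dims
    of the layer input and produces one multiplier per head, scaled into
    (0, 2) by 2*sigmoid so that a zero-initialized W yields exactly 1.0
    (identity / transparent pass-through at the start of training).

    attn_out: (T, n_heads, head_dim) attention output before out_proj
    x:        (T, d_model) layer input
    W:        (n_heads, gate_width) gate projection, zero-initialized
    """
    gate_width = W.shape[1]
    logits = x[:, :gate_width] @ W.T        # (T, n_heads)
    g = 2.0 / (1.0 + np.exp(-logits))       # 2*sigmoid, in (0, 2)
    return attn_out * g[:, :, None]         # broadcast across head_dim

# Zero-init => gate is exactly 1 everywhere, so the output is unchanged.
T, n_heads, head_dim, gate_width = 5, 8, 64, 12
x = np.random.randn(T, n_heads * head_dim)
attn = np.random.randn(T, n_heads, head_dim)
W0 = np.zeros((n_heads, gate_width))
```

With the default 8 heads x 12 width this is 96 weights per layer, matching the parameter count quoted above.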
Forward-1-token residual mixer at embedding lane:
x_t <- x_t + lambda * sigmoid(W * x_t[:12]) * x_{t-1}
The model gets a learnable bias toward bigram features without needing
attention to discover it. Tiny (13 params total: 12-wide linear + scalar lambda).
Zero-init lambda = transparent at start.
BOS-fix prevents cross-document leakage during packed training: gate is
masked to 0 at positions where input_ids == BOS_TOKEN_ID (default 1).
Both smear_gate.weight and smear_lambda match 'smear' pattern -> route to
scalar AdamW, not Muon. Both at GPT-level (not blocks), so explicitly
appended to scalar_params in Optimizers.
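The update rule and BOS mask above can be sketched end to end in numpy. This is a minimal sketch under stated assumptions (function name, standalone shapes, and the explicit `bos_id` argument are illustrative; the real gate is a trained module inside the GPT forward):

```python
import numpy as np

def smear_gate(x, input_ids, w, lam, bos_id=1):
    """Forward-1-token residual mixer:
        x_t <- x_t + lam * sigmoid(w . x_t[:12]) * x_{t-1}
    The gate is masked to 0 wherever input_ids == bos_id, so the last
    token of one packed document cannot leak into the BOS of the next.
    Zero-init lam makes the whole op the identity at the start of training.

    x: (T, d_model), input_ids: (T,), w: (12,), lam: scalar
    """
    prev = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)  # x_{t-1}
    g = 1.0 / (1.0 + np.exp(-(x[:, :12] @ w)))                     # (T,)
    g = np.where(input_ids == bos_id, 0.0, g)                      # BOS mask
    return x + lam * g[:, None] * prev
```

The 13-parameter count quoted above is the 12-wide vector `w` plus the scalar `lam`.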
cocohearts
left a comment
Accepted on substance, but please reformat the record directory before merge. The current directory uses 2026_04_16_...; please rename it to the standard YYYY-MM-DD_description form, e.g. records/track_10min_16mb/2026-04-16_SmearGate_AttentionOutputGate_ScoreFirstTTT. No ML/result change needed.
Duplicate requested-changes review submitted by automation; keeping the earlier requested-changes review active.
…feedback

Renamed 2026_04_16_SmearGate_Attention_Output_Gate_Score-First_TTT -> 2026-04-16_SmearGate_AttentionOutputGate_ScoreFirstTTT. No file content changes.
@cocohearts done.
MarioPaerle
left a comment
renamed folders
…lone openai#1851

Part 1: BOS-fixed SmearGate + per-head attn output gate ported onto the PR1493 wd_strong_paired baseline (15+/-6 lines in train_pr1493.py). 5 new env vars: SMEARGATE_{ENABLED,BOS_ID,INIT}, ATTN_GATE_{ENABLED,INIT}.

SmearGate is causal previous-token mixing with the BOS document-boundary mask from PR openai#1851: at positions where input_ids == bos_id, the smear contribution is forced to zero so the final token of doc N cannot leak into the BOS of doc N+1. Verified by a focused unit test. The per-head attn_gate is added inside CausalSelfAttention, applied to the flash_attn output before XSA. smeargate.smear_gate is a top-level GPT parameter, so it gets explicitly appended to Optimizers.scalar_params (not picked up by the blocks-only loop). CONTROL_TENSOR_NAME_PATTERNS extended; 100% optimizer coverage verified.

Real-run results (single seed s42, 8xH100):

| variant | pre | q | q_sw | q_ttt | d_qttt |
| --- | --- | --- | --- | --- | --- |
| baseline (wd_strong_paired) | 1.08573 | 1.09874 | 1.08194 | 1.07971 | -- |
| smear+attn_gate1d (sigmoid) | 1.08663 | 1.09887 | 1.08220 | 1.08052 | +0.00081 |
| smearonly (gate off) | 1.08601 | 1.09834 | 1.08170 | 1.07998 | +0.00027 |
| smear_gate2d (additive) | killed mid-train (~step 4000, val 1.1051) | | | | |

The 1D per-head sigmoid gate (8 params/layer) is under-capacity vs upstream PR openai#1667's 96 params/layer, and is +0.00090 worse pre-quant: a real regression in the trained model. SmearGate alone improves q (-0.00040) and q_sw (-0.00024) but disrupts our SGD TTT lift (0.0017 vs 0.0022 baseline); net q_ttt is within seed noise. The artifact stays >16 MB (the added code costs ~7 KB; still a bust like baseline).

Conclusion: the port is mechanically correct; it just doesn't help on the PR1493 base without the rest of the top stack (LQER, phased TTT, CaseOps).

Part 2: Critical leaderboard analysis. PR openai#1855 and PR openai#1851 are both verified-merged by maintainer cocohearts and listed on the README.
PR openai#1855 has an OPEN val_docs=10_000 vs canonical 50_000 dispute (jfc43, 2026-04-30, unresolved) that affects the entire CaseOps chain (PRs 1736/1769/1787/1851/1855/1868). If the ruling lands against it, all six fall and the PR1493 family returns to the top, so building on PR1493 is a hedged investment.

Real pre/q/q_ttt comparison vs the openai#1855 seed-42 log: their pre=1.06396 vs ours 1.08573 (+0.022 BPB gap at the trained-model level), bigger than the total 0.020 gap. The leaderboard wedge is dominated by training-level wins (CaseOps + SparseAttnGate + 9-knob hparam stack), not LQER/phased-TTT.

Part 3: Pivot decision. Clone openai#1851's train_gpt.py (152 KB, 3,574 lines) as the new base rather than porting their 2,500+ lines into our 553-line file. openai#1851 was picked over openai#1855 because: same q_ttt within noise (1.06128 vs 1.06108), no lrzip system dep, fewer disputes. Layer only our small PR1493 differentiators (paired-head Muon NS, wd_schedule, gptq_all_reduce). CaseOps shards are already published at romeerp/parameter-golf-caseops-v1 (80 train + val + val_bytes sidecar + tokenizer), which saves 1-2 hr of CPU retokenization. Background download in progress at session end.

Plan for next session: reproduce openai#1851 unmodified at s42 (target q_ttt 1.06128 +/- 0.0005); if reproduced, layer paired-head Muon then wd_schedule one at a time; if not reproduced, stop and debug.

Files added:
- pr1493_smeargate_to_top_stack_session.md: full session writeup
- _top_ref/: cached openai#1851 reference files (train_gpt.py, lossless_caps.py, prepare_caseops_data.py, README.md)
- run_smear_*.sh: smear experiment runners
- run_chain_smear_experiments.sh: chain runner
- run_mom97.sh: drafted but superseded
- logs/smear_*.txt + .stdout: full run logs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RECORD: SmearGate + Attention Output Gate + Legal TTT
mean val_bpb = 1.07139 | std = 0.00082 | 15.927 MB
Key Results
Smear Gate
Reintroduces the Smear Gate, now with input dependence, in the modded-nanogpt style.
Attention Output Gate (Per-Head Output Modulation)
A lightweight per-head multiplicative gate on the attention output.
GATE_ATTN_OUT=1 GATE_ATTN_SRC=proj GATE_WIDTH=12

Training Configuration
Installing packages
sp8192 Dataset Download
Run command
Note
Note on code size: train_gpt.py is shipped as raw source for readability (125 KB), but _compressed_code_size() reports the theoretical on-disk size of the same source after pyminify + LZMA + base85 wrapping (~30 KB).
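The reported ~30 KB figure is the length of the base85-wrapped LZMA stream. A stdlib-only sketch of that measurement (the pyminify step is elided here, so on real source this gives an upper bound on the reported size):

```python
import base64
import lzma

def wrapped_size(source_text: str) -> int:
    """Length in bytes of the source after LZMA compression and base85
    wrapping -- the quantity _compressed_code_size() is described as
    reporting (minification omitted in this sketch)."""
    compressed = lzma.compress(source_text.encode("utf-8"))
    return len(base64.b85encode(compressed))
```

base85 expands the compressed stream by a factor of 5/4, so the wrapped size is 1.25x the raw LZMA payload.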
Training completes in ~587s (wallclock-capped), reaching 4836-4843 steps depending on seed. The gate overhead is ~1.5% of step throughput (from ~8,200 tok/s to ~8,080 tok/s at step 1000, widening slightly with layer looping after step ~2141).
Full Architecture Stack
1/sqrt(layer_idx+1)

Compliance
This submission satisfies all Track B requirements:
Acknowledgments
Built on the work of the parameter-golf community:
This work was also made possible by the support of Paradigma ([link](https://paradigma.inc/)) and the use of Flywheel ([link](https://flywheel.paradigma.inc/)), whose infrastructure supports this research.
Our Team: me, @CerovazS, @GabrieleCirillo