Record: Varlen attention + fused MLP + doc-independent TTT (1.07336) by samacqua · Pull Request #1530 · openai/parameter-golf

samacqua · 2026-04-11T00:21:22Z

Record: Varlen attention + fused MLP + TTT

val_loss: 2.77261 | val_bpb: 1.07336 | ~15.99 MB | 8×H100 SXM, 587s train + ~340s TTT eval

Seed	BPB	Loss
0	1.07258208	2.77059090
1	1.07324696	2.77230836
2	1.07426259	2.77493185
Mean	1.07336388	2.77261037
Std	0.00084633	0.00218618
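The Mean and Std rows can be re-derived from the three per-seed values. The sketch below assumes the Std row is the sample standard deviation (n - 1 in the denominator), which is the convention that reproduces the reported figures.

```python
import statistics

# Per-seed results copied from the table above.
bpb = [1.07258208, 1.07324696, 1.07426259]
loss = [2.77059090, 2.77230836, 2.77493185]

for name, values in (("BPB", bpb), ("Loss", loss)):
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    print(f"{name}: mean={mean:.8f} std={std:.8f}")
```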

Best PR bpb (PR #1529): bpb=1.0753 (delta=0.0019), loss=2.7776 (delta=0.0050)

Merged record bpb (PR #1493): bpb=1.0810 (delta=0.0076), loss=2.7923 (delta=0.0197)

Increased training speed ~5% via variable length attention, a fused MLP Triton kernel (no cutlass_evt_fusion dep), and grouping together small parameters, yielding a ~0.002 nat improvement under sliding-window eval. Re-added document-based LoRA TTT, which has no inter-sequence dependence and improves over strided evaluation by ~0.008 nats.

Based on a hackathon last weekend with @aldopareja, @sestinj, and @chrishamblin7 :)

Main changes

Applied the changes from my old PR to a recent record PR: #1523. PR #1552 beat my previous bpb before I submitted this PR, so I incorporated their (orthogonal) improvements. Most of what follows is copied from my previous PR #1354.

This involves 3 things:

1. Variable length attention (~2% faster training, ~0.001 nats)

Replaced dense causal attention with Flash Attention 3's flash_attn_varlen_func. During training, documents are packed into flat token buffers with cu_seqlens boundaries so attention is computed within documents only — the model never attends across unrelated documents that happen to be adjacent in a batch.

This does two things:

Removes the need for the model to learn to ignore pre-BOS content from unrelated documents
Reduces wasted FLOPs: e.g. 10 short (100-token) docs packed into a 1k-token buffer cost proportional to 10 * 100**2 = 100k attention FLOPs vs 1000**2 = 1M with dense attention.
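A minimal sketch of the packing idea, in pure Python rather than the FA3 kernel: it derives cu_seqlens-style boundaries from BOS positions and compares a squared-length cost proxy for within-document versus dense attention. The helper names, the BOS id, and the cost proxy are assumptions made for illustration, not code from the PR.

```python
def cu_seqlens_from_bos(tokens, bos_id):
    """Cumulative segment boundaries for a packed buffer, one segment per document."""
    starts = [i for i, t in enumerate(tokens) if t == bos_id]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # leading tokens without a BOS form their own segment
    return starts + [len(tokens)]


def attention_cost(cu_seqlens):
    """Sum of squared segment lengths, used here as a rough proxy for attention FLOPs."""
    return sum((hi - lo) ** 2 for lo, hi in zip(cu_seqlens[:-1], cu_seqlens[1:]))


BOS_ID = 1  # hypothetical BOS id for this toy example
doc = [BOS_ID] + [7] * 99   # one 100-token document
buffer = doc * 10           # 10 documents packed into a 1,000-token buffer

cu = cu_seqlens_from_bos(buffer, BOS_ID)
varlen = attention_cost(cu)                # attention restricted to each document
dense = attention_cost([0, len(buffer)])   # attention over the whole buffer
print(cu[:3], varlen, dense, dense / varlen)
```

Under this proxy the packed buffer costs ten times less with within-document attention than with dense attention.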

2. Fused MLP + grouped small params (~3% faster training, ~0.001 nats)

A custom Triton kernel (linear_leaky_relu_square_kernel) fuses the up-projection, the LeakyReLU activation (negative slope 0.5), and the squaring into a single kernel. It is based on similar kernels from modded-nanogpt. I also group the many tiny replicated scalar/control gradients into a single all-reduce to avoid a pile of tiny collectives.
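For reference, here is an unfused pure-Python sketch of the math the kernel is described as computing, plus the flatten/unflatten bookkeeping that lets many tiny gradients share one collective. The function names are hypothetical and no Triton or distributed calls are used; the single all-reduce would operate on the flat buffer.

```python
def leaky_relu_square(x, negative_slope=0.5):
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0 else negative_slope * x
    return y * y


def linear_leaky_relu_square(x, w_up):
    """Unfused reference: x is [in], w_up is [hidden][in], result is [hidden]."""
    return [leaky_relu_square(sum(w * v for w, v in zip(row, x))) for row in w_up]


def flatten(grads):
    """Concatenate many small gradient lists into one buffer and remember their sizes."""
    flat, sizes = [], []
    for g in grads:
        sizes.append(len(g))
        flat.extend(g)
    return flat, sizes


def unflatten(flat, sizes):
    """Split the flat buffer back into the original per-parameter gradients."""
    out, pos = [], 0
    for n in sizes:
        out.append(flat[pos:pos + n])
        pos += n
    return out


hidden = linear_leaky_relu_square([1.0, -2.0], [[1.0, 0.0], [0.0, 1.0]])
small_grads = [[0.1], [0.2, 0.3], [0.4]]
flat, sizes = flatten(small_grads)   # one buffer, so one collective instead of three
restored = unflatten(flat, sizes)
print(hidden, flat, restored)
```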

3. Doc-based test-time training (TTT) (~0.008 nats over sliding window)

Blog explaining LoRA-based TTT from past record

Although it is technically legal in this competition to train on tokens from previous documents in the dataset, I am spiritually opposed to this. Under the current formulation, if the eval set were bigger, the expected loss would be lower, which seems broken. So in this implementation, score-first TTT is applied to each sequence in the validation set independently (and efficiently, using batched LoRAs), which is strictly harder.

Re-adds LoRA-based TTT, based on my old implementation but > 2x faster, which allows smaller chunk sizes and, in turn, better performance. This is an instance of "Case 3" according to this classification.
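The batched-LoRA idea can be sketched as follows: every sequence in a batch shares the frozen base weight but carries its own low-rank pair (A, B), so adaptation for one sequence cannot affect another. This is a pure-Python illustration with hypothetical names and shapes, not the PR's implementation.

```python
def matvec(m, v):
    """Multiply a matrix given as a list of rows by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def batched_lora_linear(x, w, a, b):
    """
    x: [bsz][in]          one input vector per sequence
    w: [out][in]          frozen base weight shared by all sequences
    a: [bsz][rank][in]    per-sequence LoRA down-projection
    b: [bsz][out][rank]   per-sequence LoRA up-projection
    Returns [bsz][out], where row i is w @ x[i] + b[i] @ (a[i] @ x[i]).
    """
    out = []
    for x_i, a_i, b_i in zip(x, a, b):
        base = matvec(w, x_i)
        delta = matvec(b_i, matvec(a_i, x_i))
        out.append([p + q for p, q in zip(base, delta)])
    return out


w = [[1, 0], [0, 1]]
x = [[1, 2], [3, 4]]
a = [[[1, 1]], [[1, 1]]]            # rank 1 for both sequences
b = [[[0], [0]], [[1], [0]]]        # sequence 0 has a zero B, so its delta is zero
y = batched_lora_linear(x, w, a, b)
print(y)
```

A zero B matrix leaves the base output untouched, which is why zero-initialising B is a common way to start (or reset) an adapter without changing the model's predictions.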

It's interesting to note that adding test-time training improves loss more than adding ~215 steps. These 215 steps train on 786432*215=169,082,880 tokens to gain ~.002 nats. The average sequence length in the validation set is ~800 tokens, which means test-time training here gains ~.003 nats / 800 tokens on average (valid because sequences are trained independently). So, in a way, TTT is ~(.003/800) / (.002/169082880) >= 300k times more token efficient than pre-training: it helps to be in distribution :)
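The back-of-the-envelope ratio can be checked directly. The figures below are the ones quoted in the paragraph (with the 800-token count used in its formula); the variable names are only for illustration.

```python
tokens_per_step = 786_432
extra_steps = 215
pretrain_tokens = tokens_per_step * extra_steps

pretrain_gain = 0.002  # nats gained from the extra pre-training steps
ttt_gain = 0.003       # nats attributed to TTT in the formula above
ttt_tokens = 800       # tokens per sequence used in the formula above

pretrain_eff = pretrain_gain / pretrain_tokens  # nats per pre-training token
ttt_eff = ttt_gain / ttt_tokens                 # nats per TTT token
ratio = ttt_eff / pretrain_eff

print(f"{pretrain_tokens:,} pre-training tokens, ratio ~ {ratio:,.0f}x")
```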

Other small changes

Made some changes to make replication and dev based on this PR easier:

Load from a checkpoint just for eval
Didn't submit minified code, instead wrote that utility into the script when calculating file size so that it is easier for people to build off of this
Store unminified code in logs

Replicating runs + dev

# setup
uv venv
source .venv/bin/activate
uv pip install -r records/track_10min_16mb/2026-04-10_VarLenAttn/requirements.txt
uv pip install --break-system-packages flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
uv pip install torch==2.9.1+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
  python3 data/cached_challenge_fineweb.py --variant sp8192 --train-shards 128

# train + eval
SEED=0
ARTIFACT_DIR="runs/varlen${SEED}" SEED=$SEED \
    torchrun --standalone --nproc_per_node=8 \
    records/track_10min_16mb/2026-04-10_VarLenAttn/train_gpt.py

# eval saved checkpoint w/ TTT (useful for dev)
EVAL_ONLY_PATH="runs/varlen${SEED}/final_model.pt" SEED=$SEED \
    torchrun --standalone --nproc_per_node=8 \
    records/track_10min_16mb/2026-04-10_VarLenAttn/train_gpt.py

…g + Muon 0.97 — val_bpb 1.07747 (3-seed mean)
- 3-seed mean: 1.07747 BPB (std 0.00064) / 2.78321 nats
- ~15.99 MB artifact, 8×H100 SXM, 600s
- VarLen attention (within-document only), doc-independent LoRA TTT
- Parameter banking + triple depth recurrence + parallel residuals
- PyTorch MLP fallback (no Triton/CUTLASS dependency)
- Based on PR openai#1530, PR openai#1523, PR openai#1514

dexhunter · 2026-04-11T14:34:53Z

I may be missing something, but I think there is one higher-scrutiny #1017 / README issue worth clarifying.

In the TTT path, the compile warmup appears to use actual validation tokens before the main eval loop, and it also does backward() / step() inside that warmup block. The main score loop itself looks score-first, so this is not a claim about the core TTT logic; the concern is specifically the pre-eval warmup.

My read of the current guidance is:

Track B allows score-first adaptation using previously scored eval tokens
but not adaptation on validation tokens before they are scored

If that reading is right, would you be willing to switch the warmup to:

synthetic tokens / shape-only warmup, or
training tokens, or
a no-update warmup

That would make the legality story much cleaner.
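One way to picture the suggested options is a warmup that runs on synthetic tokens and restores all state afterwards, so nothing learned during warmup can reach the scored pass. This is a toy sketch with hypothetical names; a real version would run the compiled model's forward/backward instead of the stand-in step function.

```python
import copy
import random


def synthetic_batch(batch_size, seq_len, vocab_size, seed=0):
    """Random token ids with the same shape as a real batch (no validation data)."""
    rng = random.Random(seed)
    return [[rng.randrange(vocab_size) for _ in range(seq_len)] for _ in range(batch_size)]


def warmup_without_side_effects(state, step_fn, batches):
    """Run warmup steps, then restore parameters and optimizer state exactly."""
    snapshot = copy.deepcopy(state)
    for batch in batches:
        step_fn(state, batch)  # stand-in for forward + backward() + step()
    state.clear()
    state.update(snapshot)
    return state


def toy_step(state, batch):
    """Hypothetical update that mutates both 'weights' and the optimizer step count."""
    state["w"][0] += len(batch)
    state["opt_step"] += 1


state = {"w": [0.0, 0.0], "opt_step": 0}
batches = [synthetic_batch(4, 32, vocab_size=8192, seed=s) for s in range(3)]
warmup_without_side_effects(state, toy_step, batches)
print(state)
```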

samacqua · 2026-04-11T18:04:20Z

@dexhunter it could honestly just be commented out, given that warmup + eval time is still < 600s. But it shouldn't matter -- training warmup does the same thing, parameters and optimizer states are reset. As a sanity check I re-ran TTT on seed 2 w/ warmup commented out, and the loss was within expected variance between runs (actually did slightly better): quantized_ttt_lora val_loss:2.77492177 val_bpb:1.07425869 eval_time:338465ms.

But given that making a change and re-running would take an hour of 8×H100 time, I will only do it if it is a blocker.

MatoTeziTanka · 2026-04-11T18:16:23Z

Community Review — VarLen attention + fused MLP + doc-independent TTT

Thanks @samacqua. Doc-independent TTT via cu_seqlens boundary isolation is a genuinely interesting approach to the causal-dependence question the SLOT cluster has been bouncing around. One import blocker, then a deeper question on the doc-independence claim.

What I found (head SHA 161d64428159c61f2d42dd6d415ec1386599ef90, records/track_10min_16mb/2026-04-10_VarLenAttn/train_gpt.py, 116,694 bytes of actual source — not shim-compressed, directly readable):

Imports (L1-12): from flash_attn_interface import (flash_attn_varlen_func, ...) at the top — hard import, no fallback
eval_val_sliding at L1948 — uses BOS_ID to find document boundaries via (chunk_cpu[:-1] == BOS_ID).nonzero(), builds cu_seqlens from those boundaries, then calls the FA3 varlen kernel with that cu_seqlens argument. This is standard document-packed varlen attention — within a batch, attention can't cross a BOS boundary.
_build_cu_seqlens(bos_pos, total_len, device, max_doc_len=0, bucket_size=64) at L260 — helper that builds the cu_seqlens tensor with a bucket-size 64 padding
DocumentPackingLoader at L283 — the training-time document-packed loader
Standard GPT at L730 with varlen-aware blocks
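To make the helper's described behaviour concrete, here is a guess at what a bucketed cu_seqlens builder could look like. It assumes that max_doc_len splits long documents into shorter segments and that bucket_size pads the boundary list with zero-length segments so its length only takes a few distinct values; neither assumption has been checked against the PR's _build_cu_seqlens.

```python
def build_cu_seqlens(bos_pos, total_len, max_doc_len=0, bucket_size=64):
    """Hypothetical sketch: boundaries from BOS positions, optional split, bucket padding."""
    # Segment starts: position 0 plus every BOS strictly inside the buffer.
    starts = sorted(set([0] + [p for p in bos_pos if 0 < p < total_len]))
    bounds = starts + [total_len]

    cu = [0]
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        if max_doc_len > 0:
            # Assumed behaviour: cut documents longer than max_doc_len into pieces.
            pos = lo + max_doc_len
            while pos < hi:
                cu.append(pos)
                pos += max_doc_len
        cu.append(hi)

    # Assumed behaviour: repeat the final boundary (zero-length segments) until the
    # list length is a multiple of bucket_size, limiting the number of distinct shapes.
    cu.extend([total_len] * ((-len(cu)) % bucket_size))
    return cu


print(build_cu_seqlens([0, 5, 12], 20, bucket_size=8))
print(build_cu_seqlens([0, 5, 12], 20, max_doc_len=4, bucket_size=1))
```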

"Doc-independent TTT" — the interesting idea. My read is that if the LoRA (or whatever TTT-like adaptation you're running) respects the same cu_seqlens boundaries as attention, then when token t in document D is scored, the adaptation state derived from document D' ≠ D doesn't influence t's scoring through the attention path — because attention physically can't cross the boundary. That's a clean causal isolation argument IF the adaptation state also respects document boundaries.

The open question is whether the adaptation state itself is per-document or per-batch. I couldn't find an eval_val_sliding_ttt / eval_val_ttt / eval_val_slot function in the 116KB source via my structural grep — could you point me at where the TTT adaptation lives in this codebase? "doc-independent TTT" in the title suggests there's an eval-time adaptation somewhere, but the function I was expecting by name doesn't show up. Is the TTT-like adaptation integrated into the main eval path, or does it use a different function name?

Import blocker (smoke test). The CPU smoke on CT2038 hit:

IMPORT_FAIL error=ImportError("cannot import name 'flash_attn_varlen_func' from 'flash_attn_interface' (unknown location)")

My flash_attn stub covers flash_attn_func but not the varlen variant. This is a CPU-stub limitation, not a PR defect — FA3 is available on H100s where this is intended to run.

Questions

Where is the doc-independent TTT adaptation loop? eval_val_sliding at L1948 looks like standard no-grad scoring. Is the adaptation inlined into the forward pass, or does it live in a separately-named function I missed?
Per-batch vs per-doc adaptation state: if documents D_i and D_j are packed into the same batch with disjoint cu_seqlens, does the TTT state they each produce stay isolated, or does it mix before being used to score the next batch?
Cross-document information flow through DocumentPackingLoader: the training-time loader packs multiple documents. Does the TTT adapt at training time (which would be legal as a training-side technique) or at eval time (which needs the doc-independence argument to close)?

Compliance summary (partial)

N-gram family bug: not present (no full_key / ctx_hash ^ target * primes)
Scored-region SLOT: not present (no slot_loss / mask = scored region)
Pre-Quant TTT on val_tokens: not present (no prequant_ttt_adapt_adamw)
Varlen attention via FA3 cu_seqlens: legitimate hardware optimization per Issue #1409 ("Are HW optimization solutions also welcome?")
Doc-independent TTT causal argument: pending clarification on the adaptation loop location

Verdict: LOOKS INTERESTING, NEEDS AUTHOR CLARIFICATION on the TTT adaptation path.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: HOLD pending author clarification on where and how the doc-independent TTT runs. If the adaptation respects cu_seqlens boundaries and the temporal ordering is score-before-adapt at the document level, this is a genuinely clean path out of the SLOT compliance bind, and I'd flip to MERGE.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL due to flash_attn_varlen_func missing from my flash_attn stub (known stub limitation, not a PR defect). Static review of the 116,694-byte source performed against the compliance axes above — 4 of 5 clean on the standard audits, 1 needs author clarification on the TTT adaptation location. AI tooling: review drafted with Claude Code (Opus); batch-9 subagent quota exhausted mid-batch so this review was authored in the main session with reduced audit depth — the flagged questions above are places I'd want confirmation from the author rather than statements I'm making from full verification. SHA 161d64428159c61f2d42dd6d415ec1386599ef90.

samacqua · 2026-04-11T18:31:51Z

@MatoTeziTanka look at eval_val_ttt_lora here:

each sequence (split by BOS) has its own LoRA, so there is no dependence between sequences
each sequence is split into 32-token chunks. We iterate through the chunks in order, evaluating on chunk i, then training on chunk i before moving on to chunk i+1.

So yes, it respects the same document boundaries. It is strictly harder (and more valid imo) than TTT on the full validation sequence autoregressively.
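The ordering described above (fresh adapter per document, score chunk i, then train on chunk i) can be sketched with a toy stand-in for the LoRA. Here the "adapter" is just an add-one-smoothed unigram counter, which is an assumption chosen so the example runs without a model; only the control flow mirrors the description.

```python
import math
from collections import Counter

BOS_ID = 0   # hypothetical BOS id for this toy example
CHUNK = 32   # chunk size quoted in the comment above


def split_docs(tokens, bos_id=BOS_ID):
    """Split a flat token stream into documents, each starting at a BOS token."""
    docs, cur = [], []
    for t in tokens:
        if t == bos_id and cur:
            docs.append(cur)
            cur = []
        cur.append(t)
    if cur:
        docs.append(cur)
    return docs


class ToyAdapter:
    """Stand-in for a per-document LoRA: add-one-smoothed unigram counts."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def nll(self, token):
        p = (self.counts[token] + 1) / (self.total + self.vocab_size)
        return -math.log(p)

    def update(self, chunk):
        self.counts.update(chunk)
        self.total += len(chunk)


def score_first_ttt(tokens, vocab_size, chunk_size=CHUNK):
    """Mean per-token loss with per-document, score-then-train adaptation."""
    total_nll, n_tokens = 0.0, 0
    for doc in split_docs(tokens):
        adapter = ToyAdapter(vocab_size)  # fresh state: no cross-document dependence
        for start in range(0, len(doc), chunk_size):
            chunk = doc[start:start + chunk_size]
            total_nll += sum(adapter.nll(t) for t in chunk)  # 1) score chunk i
            n_tokens += len(chunk)
            adapter.update(chunk)                            # 2) then train on chunk i
    return total_nll / n_tokens


doc_a = [BOS_ID] + [1] * 63
doc_b = [BOS_ID] + [2] * 63
print(score_first_ttt(doc_a + doc_b, vocab_size=4))
```

Because each document starts from a fresh adapter, the loss on a concatenation of documents is just the token-weighted average of the per-document losses, regardless of document order.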

See the "Methods" section of this blog for clarity.

… (3-seed mean) PR openai#1530 v2 base + warmdown_frac=0.75 + TTT_CHUNK_SIZE=48 + Muon 0.97. 3-seed mean: 1.07406 (std 0.00132), 2.77441 nats. Delta vs merged SOTA (openai#1493): -0.01491 nats (clears 0.005 bar by 3.0x). All artifacts < 16 MB, train < 600s, eval < 225s.

dexhunter · 2026-04-12T09:13:21Z

The pattern structurally matches what @valerio-oai flagged as invalid in #677 ("adapt on validation before the reported eval pass"). Even though LoRA resets per batch, the compile warmup still runs backward+step on val tokens before the eval loop. Since you confirmed the fix is within variance, it would be worth switching to random/synthetic tokens to avoid any ambiguity during review.

msisovic · 2026-04-12T10:10:18Z

> The pattern structurally matches what @valerio-oai flagged as invalid in #677 ("adapt on validation before the reported eval pass"). Even though LoRA resets per batch, the compile warmup still runs backward+step on val tokens before the eval loop. Since you confirmed the fix is within variance, it would be worth switching to random/synthetic tokens to avoid any ambiguity during review.

This submission actually looks good to me. They don't "adapt on validation before the reported eval pass", as the warmup/compilation throws away the updates. The final result wouldn't change at all if they replaced those validation tokens in warmup with any other tokens. The author even notes that the result is unchanged, even when they comment out the warmup.

msisovic · 2026-04-12T18:26:50Z

Hey btw @samacqua the training script will crash without the pyminify CLI tool installed on the machine, so you might want to add a step to the README that it should be installed (maybe I have missed it though).

samacqua · 2026-04-13T15:05:07Z

@msisovic fixed. Thanks!

…nto SP8192 stack

Adds records/track_10min_16mb/2026-04-15_SP8192_VarLen/train_gpt.py (readable 1446 lines):
- flash_attn_varlen_func with cu_seqlens document packing in CausalSelfAttention
- DocumentPackingLoader replacing per-sequence shuffling for training batches
- Triton linear+LeakyReLU(0.5)^2 fused MLP kernel with two-lane output split
- cu_seqlens threaded through Block / GPT forward; max_seqlen pinned to train_seq_len to avoid torch.compile recompilation on varying ints

Retains full SP8192 stack: depth recurrence (2 loops, layers 3-5), parallel residuals from layer 7, QK-Gain 5.0, GPTQ INT6 + INT8 embed + SDClip 12.85, score-first chunk TTT, fused-softcap-ce eval kernel, SP8192 tokenizer. Eval paths unchanged (ShuffledSequenceLoader + flash_attn_3_func when cu_seqlens is None).

New knobs: USE_VARLEN, USE_FUSED_MLP, CU_BUCKET_SIZE, MAX_DOC_LEN. Requires flash_attn_3 wheels (cu128_torch291) and Triton 3.2+ for TensorDescriptor API. Compiles clean locally. Awaiting 8xH100 smoke test to validate end-to-end.

…is regressive on our SP8192 + depth recurrence stack

Three configs tested at seed 42 on 8xH100 SXM:
- VarLen + Fused MLP: 1.93 pre-quant val_bpb, 1440 steps, 2.3M tok/s (3.4x slower)
- Fused MLP only: 1.110 pre-quant val_bpb, 2581 steps, 3.4M tok/s (2.3x slower)
- Pure baseline reproduction: pod terminated mid-run before completion

Root cause: VarLen + depth recurrence + fullgraph torch.compile triggers cascading shape recompilations (combinatorial explosion of loop_iter x cu_seqlens shape) that overflow even a 64-entry compile cache. Fused MLP Triton kernel has per-call TensorDescriptor allocation overhead that doesn't amortize for our hidden_dim=2048.

Conclusion: do not ship this port. PR openai#1572 (1.07974) remains best submission. Move 2 (per-layer GPTQ from PR openai#1586) and Move 3 (LoRA TTT from PR openai#1530, eval-only so no torch.compile recompile concern) are still viable next directions.

…192 stack

Config-level changes only, no kernel/compile changes that could interact with our depth recurrence stack (unlike VarLen port in submission/sp8192-varlen-frontier):
- MLP_CLIP_SIGMAS 12.0 (tight, preserve MLP precision)
- ATTN_CLIP_SIGMAS 13.0 (looser, save bytes on attention weights)
- EMBED_BITS 8 -> 7 with EMBED_CLIP_SIGMAS 20.0 -> 15.0 (~530 KB artifact savings)
- MATRIX_LR 0.022 -> 0.026 (dexhunter 6-point sweep optimum)
- WARMDOWN_FRAC 0.72 -> 0.75 (longer peak LR window)

Dexhunter measured 1.07493 BPB (3-seed mean) applying these against PR openai#1530 base. Against our 1.07974 SP8192 baseline the expected delta is in the 0.003-0.005 BPB range; the adaptive clip is stack-independent and the embed-bits + LR tweaks are universal. Fresh branch from upstream/main per PR hygiene (PR openai#1572 untouched).

Replaces chunk-based score-first SGD TTT with doc-independent batched LoRA adaptation at eval time. Eval-only, training path unchanged, so none of the torch.compile recompile concerns from VarLen apply here.

New machinery:
- BatchedLinearLoRA: per-document LoRA factors (bsz, rank, in_features)
- BatchedTTTLoRA: module holding Q/K/V/O/MLP-up/lm_head LoRAs per block
- CausalSelfAttention.forward accepts optional lora_q/k/v/o (adds to projections)
- MLP.forward accepts optional lora_up (adds to fc projection)
- Block.forward threads the LoRA args
- GPT.forward_ttt runs the full forward stack with LoRAs injected, returns per-token loss (reshaped to input shape)
- ttt_lora_evaluate orchestrates score-first doc batches with distributed counter-based work stealing across ranks

Compliance: each doc fully scored BEFORE its LoRA adapts (score-first). Each doc gets fresh LoRA weights (doc-independent, no cross-doc leakage). Standard causal attention throughout. No SLOT, no pre-quant TTT, no ETLB, no n-gram.

Samacqua reports ~-0.008 BPB vs sliding-window eval on his stack. If it translates to our stack, would put us ~1.072-1.073, below the current 1.0728 frontier.

TTT_MODE=lora is default. Set TTT_MODE=chunk to fall back to the old chunk-based score-first TTT.

@samacqua

Adds flash_attn_varlen_func path for within-document attention during training. Attention is restricted to doc boundaries detected via BOS token positions in each batch, eliminating cross-doc attention noise.

Changes:
- Import flash_attn_varlen_func alongside flash_attn_3_func
- Add VARLEN_ENABLED and BOS_TOKEN_ID env var hyperparams
- Add _build_cu_seqlens_from_batch helper (detects BOS, builds cu_seqlens)
- Thread cu_seqlens/max_seqlen through CausalSelfAttention -> Block -> GPT
- Branch in attention: varlen when cu_seqlens provided, else flash_attn_3
- Switch torch.compile to fullgraph=False when VARLEN_ENABLED=1 (data-dep branch)
- Training step builds cu_seqlens per batch and passes to model

Eval path unchanged. When VARLEN_ENABLED=0 (default) behavior is identical to PR openai#1493 reference. Compliance unchanged (training-only change, causality preserved by causal=True flag).

Reference: PR openai#1530 @samacqua, PR openai#1536 @dexhunter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…zation

Council consensus across 3 models (Gemini, Sonnet, Nemotron) + followup analysis identified these as the high-EV targeted fixes without lineage switch.

LoRA semantics (samacqua PR openai#1530 exact match):
- mlp_loras: dim -> dim (was dim -> hidden_dim), applied as parallel residual-level bypass at Block forward (was inner tweak inside MLP.forward)
- o_lora: input is pre-attention normalized residual n (was attention output y)
- MLP.forward reverted to no-lora signature (cleaner; mlp_lora lives at Block level)
- CausalSelfAttention.forward now only takes lora_q/k/v (o_lora moved to Block)

Pod speedgate at step 20 (env var POD_SPEEDGATE_MS, default 0 = disabled):
- Measures ms/step at step 20
- RuntimeError abort if above threshold
- Saves ~$5 per bad pod per council recommendation

Looped-layer quantization (env var LOOP_CLIP_SIGMAS, default 10.0):
- Tighter clip_sigmas for blocks.3/4/5 (the NUM_LOOPS=2 recurrent layers)
- Motivation per Sonnet: quantization error compounds 2x through recurrence, and GPTQ error amplifies ~900x over 3 cycles per Issue openai#140
- Only active when NUM_LOOPS > 0

No training changes; all three fixes are eval-only behavior + a safety gate. Training path semantics unchanged from baseline.

Implements per-document Legal TTT via batched LoRA adapters trained *independently* on each val sequence, then discarded — no inter-doc leakage. Strictly causal, score-then-train within each chunk.

PR openai#1530 (samacqua, openai/parameter-golf, OPEN) reports 1.07336 BPB 3-seed mean using this technique on top of the SP8192 stack — well below the 1.0810 merged SOTA. Our base is 1.0828 (3-seed lockin mean), so doc-LoRA-TTT is the path to crossing the 1.076 record bar.

Implementation (~400 lines):
1. CausalSelfAttention.forward gains optional q_delta/v_delta tensor args added to the c_q/c_v projections.
2. Block.forward gains optional q_delta_fn/v_delta_fn callables that produce the deltas from the attn-norm output, threaded to attn.
3. BatchedLinearLoRA: per-batch independent (A, B) matrices with B zero-init so resets produce no delta.
4. BatchedTTTLoRA: lm_head_lora + per-block q_loras/v_loras.
5. forward_logits and forward gain optional `lora` arg. With looping active, the same physical block's LoRA fires on each pass through that index (one adapter per block, not per virtual slot — simpler).
6. eval_val_ttt_lora: groups val docs by BOS_ID boundaries, sorts by length-bucket, processes batch_size=64 docs in parallel, score-then-train with one Adam step per chunk, resets LoRA per batch.
7. New env vars TTT_LORA_ENABLED / TTT_LORA_RANK / TTT_LORA_LR / TTT_LORA_CHUNK_SIZE / TTT_LORA_EVAL_SEQ_LEN / TTT_LORA_BATCH_SIZE (defaults match the 2026-03-17 LoRA TTT record: rank=8, lr=0.01, chunk=256, seq_len=1024, batch=64).
8. train_and_eval runs eval_val_ttt_lora after the existing chunk-TTT when TTT_LORA_ENABLED=1; final_val_bpb prefers the LoRA-TTT result.

Compressed train_gpt.py grew 21.2KB -> 25.9KB. Submission_bytes will land near 16,000k — close to the limit. Watch for budget overruns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed adapters" This reverts commit 4200035.

LQER (PR openai#1797 / PR openai#1874 / PR openai#1530 lineage) ported into v4 verbatim from PR openai#1874's diff. The biggest single remaining lever in our stack: PR openai#1797 measured -0.009 BPB recovery from int6 quant tax at ~30 KB artifact cost. Default-OFF: LQER_ENABLED=0 returns v4 to v3 byte-for-byte.

Patch surface:
- 6 new env vars on Hyperparameters (LQER_ENABLED, LQER_RANK=4, LQER_TOP_K=3, LQER_FACTOR_BITS=4, LQER_ASYM_ENABLED=1, LQER_ASYM_GROUP=64).
- _lqer_pack (sym INT4 per-row) and _lqer_pack_asym (INT2 for A scalar, INT4 per-group-64 for B) helper functions.
- gptq_mixed_quantize: after each weight's GPTQ pass, capture residual E = W - W_quant and stash with Frobenius norm. After the main loop, if LQER_ENABLED=1, sort by norm, pick top-K, run torch.linalg.svd, take rank-r factors, pack via asym (default) or sym fallback.
- dequantize_mixed: if metadata contains 'lqer_asym' or 'lqer', dequant the factors and add A @ B to the dequantized weight.

Verified:
- AST-clean on Python 3.13 (macOS) and 3.12 (Linux/Vultr).
- CPU pack/dequant round-trip on a 512x2048 residual: confirms shape arithmetic and that asymmetric INT2/INT4 reconstruction tracks the symmetric INT4 reconstruction within 0.5%.
- Sizes: v4 raw 57,420 lzma 15,776 (+648 vs v3, +2,528 vs SOTA).
- Byte-cost projection: ~10.5 KB raw factors per 512x2048 weight, ~4-6 KB after brotli compression of redundant int8 patterns. Top-K=3 ~ 12-18 KB total. Worst-seed artifact projection ~ 9 KB OVER 16M cap; mitigated by LQER_TOP_K=2 fallback (~6 KB savings) or pre-flight serialize check.

Proposal is deliberately directive (per user request). Tells Claude:
1. Run sanity check (Step 1, $0.70) - LQER_ENABLED=0 must reproduce v3.
2. Run size pre-flight (Step 2, $3) - 30-step training to verify artifact stays under 16M with LQER enabled. Drop to TOP_K=2 if over.
3. Run single-seed full retrain (Step 3, $15) stacking LQER + Polar Express + MIN_LR + LR=0.010 + ConfTTT. Compare against Phase-2 best.
4. If <= 1.0780, run 3-seed validation (Step 4, $45) and submit.

Race awareness: PR openai#1797 and PR openai#1874 are both OPEN with LQER as a core component. Either merging tightens our threshold significantly. LQER is on the critical path either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@samacqua

…al_bpb 1.07193 (3-seed mean) Novel multi-phase global SGD during phased TTT evaluation. Builds on PR openai#1530 (@samacqua) + PR openai#1610 (@romeerp) phased TTT concept. 3-seed mean: 1.07193 BPB (2.76890 nats), std 0.00063. Seeds: 42, 0, 1234. All artifacts <16 MB.

… — val_bpb 1.06549 3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green: - artifact 15,975,120 bytes mean (≤16,000,000 DECIMAL) - train_time 596.14s mean (≤600s) - total_eval_time 397.23s mean (≤600s) Builds on PR openai#1530 SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1) bijective case preprocessing from PR openai#1729 with a per-token byte sidecar so BPB is scored on original pre-transform UTF-8 bytes. Adds a learned attention out-gate (init_std=0.005) + quant-gate scaling that recovers the ~40 KB of overhead introduced by the new control tokens, keeping every seed under the 16 MB decimal cap. Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).

samacqua added 2 commits April 11, 2026 00:18

add record

b438561

add logs

ad26861

samacqua changed the title ~~Varlen attention + fused MLP + doc-independent TTT~~ Varlen attention + fused MLP + doc-independent TTT Apr 11, 2026

samacqua changed the title ~~Varlen attention + fused MLP + doc-independent TTT~~ Varlen attention + fused MLP + doc-independent TTT (1.07643) Apr 11, 2026

samacqua changed the title ~~Varlen attention + fused MLP + doc-independent TTT (1.07643)~~ Record: Varlen attention + fused MLP + doc-independent TTT (1.07643) Apr 11, 2026

dexhunter mentioned this pull request Apr 11, 2026

Non-record: SP8192 + VarLen Attention + Doc-Independent LoRA TTT + Banking + Muon 0.97 — val_bpb 1.07747 (3-seed mean) #1536

Open

aryanbhosale mentioned this pull request Apr 11, 2026

Record: SP8192 + VarLen Attention + LoRA TTT + Fused MLP — val_bpb 1.0777 (3-seed mean) #1540

Open

update w/ improvements+logs

161d644

samacqua changed the title ~~Record: Varlen attention + fused MLP + doc-independent TTT (1.07643)~~ Record: Varlen attention + fused MLP + doc-independent TTT (1.07336) Apr 11, 2026

samacqua mentioned this pull request Apr 11, 2026

Record: varlen+fused mlp+ttt (bpb=1.1093) #1354

Closed

dexhunter mentioned this pull request Apr 12, 2026

Record: VarLen Attention + Triton Fused MLP + Doc-TTT + Warmdown 0.75 + Chunk 48 — val_bpb 1.07406 (3-seed mean) #1560

Open

11 tasks

dexhunter mentioned this pull request Apr 13, 2026

Record: Per-Layer Adaptive GPTQ Clip + int7 Embeddings + MLR 0.026 — val_bpb 1.07493 (3-seed mean) #1586

Open

11 tasks

add python-minifier dependency

7dca3de

romeerp mentioned this pull request Apr 14, 2026

Record: VarLenAttn + PhasingTTT - val_bpb 1.0728 (3-seed mean) #1610

Merged

aquariouseworkman mentioned this pull request Apr 27, 2026

Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (indirect 3 seed mean) #1851

Merged

ndokutovich mentioned this pull request Apr 27, 2026

Record: PR #1797 base + PPM-D byte mixture — val_bpb 0.90236 (3-seed mean) #1854

Open

dexhunter mentioned this pull request Apr 27, 2026

Record: PR #1787 base + PPM-D OMP byte mixture (val_bpb 1.0322 3-seed mean) #1857

Closed

8 tasks

achen2089 added a commit to achen2089/parameter-golf that referenced this pull request Apr 27, 2026

Revert "add doc-based LoRA TTT (PR openai#1530-style) - per-doc batch…

8666384

…ed adapters" This reverts commit 4200035.

renqianluo mentioned this pull request Apr 28, 2026

Record: Fused softcap CE + WD=2.0 (warm-start stability fix) — val_bpb 1.06957 (3-seed mean) #1886

Open

This was referenced Apr 28, 2026

Update Parameter Golf leaderboard #1899

Open

Update Parameter Golf leaderboard #1900

Open

Update Parameter Golf leaderboard with BOS fix #1902

Merged

Idan3011 mentioned this pull request Apr 28, 2026

val_bpb 1.0902 - 12L sp9000 + AttnOutGate + SmearGate #1565

Open

dttdrv mentioned this pull request Apr 28, 2026

{RECORD} CaseOps pre-quant TTT record (1.0354 BPB) #1911

Open

8 tasks

dexhunter mentioned this pull request Apr 29, 2026

Record: PR #1855 base + Smear + LQER + LogitCalib + Phased TTT — val_bpb 1.06080 (3-seed) #1924

Closed

simon-marcus mentioned this pull request Apr 29, 2026

Record candidate: 1.06032 CaseOps + Matrix-LR 0.028 + TTT n=1 #1925

Open

8 tasks

liujshi mentioned this pull request Apr 29, 2026

Record: SP8192 CaseOps + TTT + GPTQ + LRZIP — val_bpb 1.05993 (3-seed mean) #1934

Open

5 tasks

MarioPaerle mentioned this pull request Apr 29, 2026

Record: PR #1886 base + per-block MLP output gate (Linear, weight-learnable) — val_bpb 1.06872 (3-seed mean) #1941

Closed

cocohearts merged commit 6136a81 into openai:main Apr 29, 2026

andrewbaggio1 mentioned this pull request Apr 30, 2026

Record: PR #1945 base + 2560 long-context + no_qv TTT mask + TTT LR 0.75 + QK_GAIN 5.25 — val_bpb 1.05855 (3-seed mean) #1953

Open

10 tasks

chris-colinsky mentioned this pull request Apr 30, 2026

Record candidate: PR #1855 + Adaptive Hessian-Sensitivity GPTQ Clip — val_bpb 1.06310 (3-seed mean) #1962

Open

6 tasks

bsisduck mentioned this pull request Apr 30, 2026

SP8192 CaseOps + WiderGate32 + GPTQ-int6 — val_bpb 1.08037 (3-seed mean) #1969

Open

EthanYangTW mentioned this pull request Apr 30, 2026

Non-record: GolfParty — composable scaffolding for every Requests-for-PRs item #1978

Open

6 tasks

This was referenced Apr 30, 2026

Record: SP8192 + Sliding-Window Eval + Conditional-PPM Byte Mixer - val_bpb 1.029282 #2032

Closed

Record: SP8192 + Sliding-Window Eval + Conditional-PPM Byte Mixer Full-Val - val_bpb 1.015784 #2039

Open

This was referenced May 1, 2026

Record: SP8192 CaseOps v13 PPM tuned gate — fresh 3-seed mean 0.94175270 #2083

Open

Record: BIJEPAX-lite JEPA + SP8192 CaseOps PPM — val_bpb 0.97271 #2080

Open

PiyushDatta mentioned this pull request May 1, 2026

Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.090 (3-seed mean) - PiyushDatta #2106

Open

anmarhindi mentioned this pull request May 1, 2026

Record: SP8192 + Sliding-Window Eval + Lock-In Byte Mixer - val_bpb 1.067219 #2138

Open

simon-marcus mentioned this pull request May 1, 2026

Corrected: PR #2014 stack + LeakyReLU 0.3 + token-only in-timer n-gram TTT (val_bpb 1.0570) #2140

Open

Record: Varlen attention + fused MLP + doc-independent TTT (1.07336) #1530
cocohearts merged 4 commits into openai:main from samacqua:varlen-fused-ttt-v2

5 participants
