
Anubhav ctw submission #1011

Open
AnubhavBharadwaaj wants to merge 5 commits into openai:main from AnubhavBharadwaaj:anubhav-ctw-submission

Conversation

@AnubhavBharadwaaj

@AnubhavBharadwaaj AnubhavBharadwaaj commented Mar 28, 2026

Non-Record: CTW Eval-Time Augmentation on PR #549 SOTA Stack

val_bpb = 1.1203 (seed 1337) | 15.85 MB | 8×H100 SXM

Results

| Run | Seed | Steps | Step Avg | Pre-TTT BPB | Post-TTT BPB | TTT Time | Artifact (bytes) |
|---|---|---|---|---|---|---|---|
| Baseline (no CTW) | 1337 | 7,023 | 85.5 ms | 1.1386 | 1.1203 | 352 s | 15,854,788 |
| CTW (w=0.1, d=4) | 1337 | 7,023 | 85.5 ms | 1.1386 | 1.1252 | 2,760 s | 15,854,788 |

Novel Contribution: CTW — A Negative Result

This submission integrates Context Tree Weighting (Willems, Shtarkov, Tjalkens 1995) into the PR #549 SOTA stack as an eval-time augmentation. CTW is a provably minimax-optimal sequential probability assignment over all variable-order Markov models up to depth D. It has zero artifact cost — the suffix tree is built entirely from already-scored tokens during evaluation.
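For readers unfamiliar with CTW, here is a minimal binary-alphabet sketch of the algorithm (the PR operates over 1024 subword tokens; this toy version only illustrates the KT-estimator leaves and the halving mixture at internal nodes, and none of these names come from the PR's code):

```python
import math

class CTWNode:
    """Node of a toy binary Context Tree Weighting suffix tree."""
    def __init__(self):
        self.counts = [0, 0]          # symbol counts seen at this context
        self.log_pe = 0.0             # log KT (Krichevsky-Trofimov) estimate
        self.log_pw = 0.0             # log CTW weighted probability
        self.children = [None, None]

def ctw_update(node, context, symbol, depth):
    """Sequentially update the tree with `symbol` observed after `context`
    (most recent bit first). Afterward, root.log_pw is the weighted log
    probability of the entire sequence seen so far."""
    a, b = node.counts
    node.log_pe += math.log((node.counts[symbol] + 0.5) / (a + b + 1))  # KT update
    node.counts[symbol] += 1
    if depth == 0:
        node.log_pw = node.log_pe     # leaf: KT estimate only
        return
    bit = context[0]
    if node.children[bit] is None:
        node.children[bit] = CTWNode()
    ctw_update(node.children[bit], context[1:], symbol, depth - 1)
    # CTW recursion: P_w = 1/2 * P_e + 1/2 * (product of children's P_w),
    # computed stably in log space
    log_kids = sum(c.log_pw for c in node.children if c is not None)
    m = max(node.log_pe, log_kids)
    node.log_pw = m + math.log(0.5 * math.exp(node.log_pe - m)
                               + 0.5 * math.exp(log_kids - m))

# Usage: an alternating bit sequence is a depth-1 tree source, so CTW
# should assign it far more probability than a uniform coin.
root, depth, hist = CTWNode(), 4, [0] * 4
for s in (i % 2 for i in range(64)):
    ctw_update(root, tuple(hist[-depth:][::-1]), s, depth)
    hist.append(s)
logpw = root.log_pw
```

The halving at each internal node is what makes CTW a Bayesian mixture over all pruned context trees up to depth D, at a model cost of roughly one bit per node.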

Integration

CTW was deeply integrated into the TTT scoring loop — not as a separate eval pass. During Phase 1 (score) of each TTT chunk, neural logits from TTT-adapted weights are mixed with CTW predictions per-token via log-linear interpolation before computing NLL:

```
for each TTT chunk:
    Phase 1 (SCORE): sliding-window eval
        for each scored token:
            mixed = (1 - w) * log_softmax(neural_logits) + w * log(ctw_probs)
            nll = cross_entropy(mixed, target)
            ctw.update(target)  # backward-looking: update AFTER scoring
    Phase 2 (TRAIN): SGD on chunk (unchanged from PR #549)
```
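Concretely, the per-token mix in Phase 1 amounts to the following (a NumPy sketch; the PR's actual helper names are not shown, and note that the mixed scores must be renormalized for the NLL to be a proper log probability):

```python
import numpy as np

def mix_logprobs(neural_logits, ctw_probs, w=0.1):
    """Hypothetical helper: log-linear interpolation of neural logits with
    CTW next-token probabilities, renormalized to a log distribution."""
    neural_logp = neural_logits - np.logaddexp.reduce(neural_logits)  # log_softmax
    mixed = (1.0 - w) * neural_logp + w * np.log(ctw_probs)
    return mixed - np.logaddexp.reduce(mixed)  # renormalize in log space

def token_nll(neural_logits, ctw_probs, target, w=0.1):
    """Cross-entropy of the mixed distribution at the target index."""
    return -mix_logprobs(neural_logits, ctw_probs, w)[target]
```

With w=0 this reduces exactly to the standard neural log-softmax, which is why `CTW_WEIGHT=0` reproduces the baseline numbers.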

Finding: CTW Hurts Strong Neural Models

CTW degrades BPB by +0.005 at w=0.1, depth=4. The neural model at 1.12 BPB already captures n-gram patterns far better than any depth-4 Markov model. CTW's KT estimator over 1024 subword tokens is essentially a smoothed 4-gram model — the 11-layer transformer with 2048 context is already a strictly superior n-gram model. Mixing in a weaker predictor adds noise.
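The "smoothed 4-gram" characterization is just the KT estimator's closed form: over a K-symbol alphabet it is add-1/2 smoothing of the per-context counts (a toy illustration; CTW additionally mixes these estimates across context depths):

```python
def kt_prob(count_s, total, vocab_size):
    """KT estimate of the next symbol at a given context:
    add-1/2 smoothing over the alphabet."""
    return (count_s + 0.5) / (total + 0.5 * vocab_size)

# With vocab 1024 and no observations, every symbol gets 1/1024;
# as counts accumulate, the estimate approaches the empirical
# n-gram frequency for that context.
```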

Additionally, the per-token Python loop makes CTW catastrophically slow (2,760s vs 352s for standard TTT), exceeding the 10-minute eval limit.

Why This Matters

Other approaches to n-gram eval augmentation in Parameter Golf (PRs #727, etc.) succeed by using:

  • Much higher order (5-7 grams) with count-min sketch
  • Entropy-adaptive mixing weight (near-zero when neural model is confident)
  • Vectorized GPU lookup (adds seconds, not minutes)
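The entropy-adaptive weight in the second bullet can be sketched as follows (hypothetical; the actual scheme in PR #727 and friends may differ):

```python
import numpy as np

def adaptive_weight(neural_logp, w_max=0.1):
    """Scale the n-gram mixing weight by normalized predictive entropy:
    near zero when the neural model is confident, up to w_max when it
    is maximally uncertain."""
    p = np.exp(neural_logp)
    entropy = float(-(p * neural_logp).sum())
    return w_max * entropy / np.log(len(neural_logp))
```

Under this scheme a fixed w=0.1 is the worst case: it keeps paying the full mixing penalty even on tokens where the neural model at 1.12 BPB is already near-certain.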

CTW's theoretical optimality over all variable-order Markov sources is irrelevant when the neural model already dominates the Markov component. The provable minimax guarantee applies to the class of tree sources — but the FineWeb validation set is not well-modeled by any depth-4 tree source that a 1024-vocab CTW can represent.

Base Architecture (PR #549 by @abaybektursun)

  • 11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3×
  • Parameter Banking + Parallel Muon (FlashAttention 3)
  • BigramHash(1536), XSA4, Partial RoPE(16), LN Scale, VE128
  • EMA(0.997) + Tight SWA(50), GPTQ-lite int6 + LZMA-6
  • Legal Score-First TTT (SGD, lr=0.002, 3 epochs, 32K chunks)

Run Commands

```bash
# Baseline (reproduces PR #549)
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
CTW_WEIGHT=0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

# CTW enabled (negative result)
# Same as above but with: CTW_WEIGHT=0.1 CTW_DEPTH=4
```

Credits

@MatoTeziTanka

Community Review — Anubhav ctw submission

BPB: 1.1203 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA dbf25a40840c, file records/track_non_record_16mb/2026-03-27_VRL_LeakyReLU2_GPTQ/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=50154 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka, The Agora. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

