
Anubhav ctw submission #1011

Open
AnubhavBharadwaaj wants to merge 5 commits into openai:main from AnubhavBharadwaaj:anubhav-ctw-submission

Conversation

@AnubhavBharadwaaj

@AnubhavBharadwaaj AnubhavBharadwaaj commented Mar 28, 2026

Non-Record: CTW Eval-Time Augmentation on PR #549 SOTA Stack

val_bpb = 1.1203 (seed 1337) | 15.85 MB | 8×H100 SXM

Results

| Run | Seed | Steps | Step Avg | Pre-TTT BPB | Post-TTT BPB | TTT Time | Artifact (bytes) |
|---|---|---|---|---|---|---|---|
| Baseline (no CTW) | 1337 | 7,023 | 85.5 ms | 1.1386 | 1.1203 | 352 s | 15,854,788 |
| CTW (w=0.1, d=4) | 1337 | 7,023 | 85.5 ms | 1.1386 | 1.1252 | 2,760 s | 15,854,788 |

Novel Contribution: CTW — A Negative Result

This submission integrates Context Tree Weighting (Willems, Shtarkov, Tjalkens 1995) into the PR #549 SOTA stack as an eval-time augmentation. CTW is a provably minimax-optimal sequential probability assignment over all variable-order Markov models up to depth D. It has zero artifact cost — the suffix tree is built entirely from already-scored tokens during evaluation.
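For readers unfamiliar with CTW, here is a minimal binary-alphabet sketch of the algorithm (the PR operates over 1024 subword tokens; this toy version only illustrates the KT-estimator leaves and the halving mixture at internal nodes, and none of these names come from the PR's code):

```python
import math

class CTWNode:
    """Node of a toy binary Context Tree Weighting suffix tree."""
    def __init__(self):
        self.counts = [0, 0]          # symbol counts seen at this context
        self.log_pe = 0.0             # log KT (Krichevsky-Trofimov) estimate
        self.log_pw = 0.0             # log CTW weighted probability
        self.children = [None, None]

def ctw_update(node, context, symbol, depth):
    """Sequentially update the tree with `symbol` observed after `context`
    (most recent bit first). Afterward, root.log_pw is the weighted log
    probability of the entire sequence seen so far."""
    a, b = node.counts
    node.log_pe += math.log((node.counts[symbol] + 0.5) / (a + b + 1))  # KT update
    node.counts[symbol] += 1
    if depth == 0:
        node.log_pw = node.log_pe     # leaf: KT estimate only
        return
    bit = context[0]
    if node.children[bit] is None:
        node.children[bit] = CTWNode()
    ctw_update(node.children[bit], context[1:], symbol, depth - 1)
    # CTW recursion: P_w = 1/2 * P_e + 1/2 * (product of children's P_w),
    # computed stably in log space
    log_kids = sum(c.log_pw for c in node.children if c is not None)
    m = max(node.log_pe, log_kids)
    node.log_pw = m + math.log(0.5 * math.exp(node.log_pe - m)
                               + 0.5 * math.exp(log_kids - m))

# Usage: an alternating bit sequence is a depth-1 tree source, so CTW
# should assign it far more probability than a uniform coin.
root, depth, hist = CTWNode(), 4, [0] * 4
for s in (i % 2 for i in range(64)):
    ctw_update(root, tuple(hist[-depth:][::-1]), s, depth)
    hist.append(s)
logpw = root.log_pw
```

The halving at each internal node is what makes CTW a Bayesian mixture over all pruned context trees up to depth D, at a model cost of roughly one bit per node.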

Integration

CTW was deeply integrated into the TTT scoring loop — not as a separate eval pass. During Phase 1 (score) of each TTT chunk, neural logits from TTT-adapted weights are mixed with CTW predictions per-token via log-linear interpolation before computing NLL:

```
for each TTT chunk:
    Phase 1 (SCORE): sliding-window eval
        for each scored token:
            mixed = (1 - w) * log_softmax(neural_logits) + w * log(ctw_probs)
            nll = cross_entropy(mixed, target)
            ctw.update(target)  # backward-looking: update AFTER scoring
    Phase 2 (TRAIN): SGD on chunk (unchanged from PR #549)
```
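Concretely, the per-token mix in Phase 1 amounts to the following (a NumPy sketch; the PR's actual helper names are not shown, and note that the mixed scores must be renormalized for the NLL to be a proper log probability):

```python
import numpy as np

def mix_logprobs(neural_logits, ctw_probs, w=0.1):
    """Hypothetical helper: log-linear interpolation of neural logits with
    CTW next-token probabilities, renormalized to a log distribution."""
    neural_logp = neural_logits - np.logaddexp.reduce(neural_logits)  # log_softmax
    mixed = (1.0 - w) * neural_logp + w * np.log(ctw_probs)
    return mixed - np.logaddexp.reduce(mixed)  # renormalize in log space

def token_nll(neural_logits, ctw_probs, target, w=0.1):
    """Cross-entropy of the mixed distribution at the target index."""
    return -mix_logprobs(neural_logits, ctw_probs, w)[target]
```

With w=0 this reduces exactly to the standard neural log-softmax, which is why `CTW_WEIGHT=0` reproduces the baseline numbers.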

Finding: CTW Hurts Strong Neural Models

CTW degrades BPB by +0.005 at w=0.1, depth=4. The neural model at 1.12 BPB already captures n-gram patterns far better than any depth-4 Markov model. CTW's KT estimator over 1024 subword tokens is essentially a smoothed 4-gram model — the 11-layer transformer with 2048 context is already a strictly superior n-gram model. Mixing in a weaker predictor adds noise.
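The "smoothed 4-gram" characterization is just the KT estimator's closed form: over a K-symbol alphabet it is add-1/2 smoothing of the per-context counts (a toy illustration; CTW additionally mixes these estimates across context depths):

```python
def kt_prob(count_s, total, vocab_size):
    """KT estimate of the next symbol at a given context:
    add-1/2 smoothing over the alphabet."""
    return (count_s + 0.5) / (total + 0.5 * vocab_size)

# With vocab 1024 and no observations, every symbol gets 1/1024;
# as counts accumulate, the estimate approaches the empirical
# n-gram frequency for that context.
```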

Additionally, the per-token Python loop makes CTW catastrophically slow (2,760s vs 352s for standard TTT), exceeding the 10-minute eval limit.

Why This Matters

Other approaches to n-gram eval augmentation in Parameter Golf (PRs #727, etc.) succeed by using:

  • Much higher order (5-7 grams) with count-min sketch
  • Entropy-adaptive mixing weight (near-zero when neural model is confident)
  • Vectorized GPU lookup (adds seconds, not minutes)
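The entropy-adaptive weight in the second bullet can be sketched as follows (hypothetical; the actual scheme in PR #727 and friends may differ):

```python
import numpy as np

def adaptive_weight(neural_logp, w_max=0.1):
    """Scale the n-gram mixing weight by normalized predictive entropy:
    near zero when the neural model is confident, up to w_max when it
    is maximally uncertain."""
    p = np.exp(neural_logp)
    entropy = float(-(p * neural_logp).sum())
    return w_max * entropy / np.log(len(neural_logp))
```

Under this scheme a fixed w=0.1 is the worst case: it keeps paying the full mixing penalty even on tokens where the neural model at 1.12 BPB is already near-certain.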

CTW's theoretical optimality over all variable-order Markov sources is irrelevant when the neural model already dominates the Markov component. The provable minimax guarantee applies to the class of tree sources — but the FineWeb validation set is not well-modeled by any depth-4 tree source that a 1024-vocab CTW can represent.

Base Architecture (PR #549 by @abaybektursun)

  • 11L, 512d, 8H/4KV, LeakyReLU(0.5)² MLP 3×
  • Parameter Banking + Parallel Muon (FlashAttention 3)
  • BigramHash(1536), XSA4, Partial RoPE(16), LN Scale, VE128
  • EMA(0.997) + Tight SWA(50), GPTQ-lite int6 + LZMA-6
  • Legal Score-First TTT (SGD, lr=0.002, 3 epochs, 32K chunks)

Run Commands

```bash
# Baseline (reproduces PR #549)
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
CTW_WEIGHT=0 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

# CTW enabled (negative result)
# Same as above but with: CTW_WEIGHT=0.1 CTW_DEPTH=4
```

Credits

@MatoTeziTanka

Community Review — Anubhav ctw submission

BPB: 1.1203 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA dbf25a40840c, file records/track_non_record_16mb/2026-03-27_VRL_LeakyReLU2_GPTQ/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=11, vocab=1024, code=50154 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka, The Agora. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

