[Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation#1501
SPThole wants to merge 12 commits into openai:main
Conversation
Community Review — Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)
Summary: PR #1501 ("2026_04_09_poscond_bigram_and_ablation") submits two experiments (record_exp101 and ablation_exp105a) whose train_gpt.py files are byte-for-byte identical. The submission uses score-first-per-chunk TTT (SGD, not AdamW) on the dequantized model after GPTQ. All audit flags are CLEAR.
1. N-gram / hash XOR illegal pattern — CLEAR
This PR contains two experiments that form a complete unit:
a record run (exp101) introducing position-conditional bigram hashing, and its controlled
ablation (exp105a) proving that the inherited FOMAML meta-TTT contributes
near-zero to the result.
See also: PR #1502
which builds on the ablation finding here and attempts a theoretically-grounded
redesign of the meta-TTT training loop.
TL;DR — Key Learnings for the Community
Position-conditional bigram hashing is a zero-parameter trick that improves
legal_ttt by 0.001 bpb. If your model uses hash-based n-gram embeddings, check
whether different token classes (e.g., word-start vs within-word) are colliding
in the same buckets. Splitting the hash space by class can recover signal that
a shared hash was forced to suppress.
Always ablate inherited components. We ran 100+ experiments inheriting FOMAML
meta-TTT from an early ancestor without ever isolating its contribution. A
single-flag ablation revealed it adds +0.00036 bpb (noise) at 3% compute cost.
Those 3% translated to 206 lost training steps under a wallclock cap — a net
negative.
Same-batch FOMAML meta-TTT is equivalent to gradient noise in our setting.
It pushes the optimizer into a different local minimum (90-degree weight rotation)
but the new minimum has identical loss, identical TTT adaptation, and identical
quantization sensitivity. The rotation is a Muon optimizer artifact, not a
meaningful signal.
Weight-space cosine similarity is misleading under Muon. Two models trained
from the same seed with a 3% gradient perturbation show element-wise cosine of
0.05 (near-orthogonal) but principal-angle subspace cosine of 0.65 (partially
aligned). Use SVD-based subspace overlap for functional comparison, not raw
cosine.
Additional Info:
exp101: Position-conditional bigram hashing — splits the 4096×64 hash table's 4095 buckets into exclusive word-start [0, 2047) and within-word [2047, 4094) halves keyed on has_leading_space[current_token], eliminating the bucket contention that forced the parent model's word_start_boost gate to 0.007 → legal_ttt 1.11588 (zero extra params).

exp105a: Single-flag ablation (META_TTT_ENABLED=1 → 0, everything else byte-identical to exp101) proving FOMAML meta-TTT contributes only +0.00036 bpb at 3% compute overhead → legal_ttt 1.11624 — the meta-training was equivalent to gradient noise.

Disclaimer
Hardware: All runs use a single H100 80 GB SXM GPU with
MAX_WALLCLOCK_SECONDS=4800 (80-minute cap). This provides 4800 GPU-seconds of compute, matching the competition's
standard 8×H100 @ 10 min budget at substantially lower cost. Gradient accumulation
(factor 4) ensures per-step updates are equivalent to the 8-GPU batch.
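A minimal sketch of that accumulation setup, assuming placeholder names (model, opt, loss_fn, micro_batches are illustrative; only the factor-4 accumulation comes from this document):

```python
import torch

ACCUM_STEPS = 4  # factor-4 accumulation to match the 8-GPU per-step batch

def train_step(model, opt, micro_batches, loss_fn):
    # Average gradients over 4 micro-batches before a single optimizer step,
    # so the effective per-step batch matches the 8xH100 configuration.
    opt.zero_grad()
    for mb in micro_batches:                 # len(micro_batches) == ACCUM_STEPS
        loss = loss_fn(model, mb) / ACCUM_STEPS
        loss.backward()                      # gradients accumulate in .grad
    opt.step()
```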
Early stopping: Both experiments stopped before the configured
ITERATIONS=7500 due to the wallclock cap (exp101 at step 7020, exp105a at step 7226). This is
expected behavior, not a hardware failure — the final ~300-500 steps would be in
the deep warmdown phase with diminishing returns.
Non-record: exp105a is a non-record ablation experiment (non_record: true). It exists solely to measure meta-TTT's contribution; exp101 is the record submission.
Cost constraint: GPU time was limited (~$3/hr H100 spot). Experiments that
clearly were not meeting expectations were terminated early to preserve budget
for more promising directions. Where this affected results, missing values are
marked "—" with explanation.
Architecture Overview
Base Architecture
All experiments in this lineage share the following architecture. We describe
every component so this document is self-contained.
The design takes lots of inspiration from PR #1019 (including the tied embeddings, tok_emb = lm_head^T).
What is XSA (Cross-layer Shared Attention)?
In a standard transformer, each layer has its own Q, K, V, and output projection
matrices. In XSA, these are replaced by banked weight matrices shared across
all layers:
- qo_bank: shape (22, 512, 512) — 22 "slots" (2 per layer × 11 layers), shared query-output projection. Each layer selects its 2 slots from the bank.
- kv_bank: shape (22, 256, 512) — shared key-value projection.
- mlp_up_bank: shape (11, 1536, 512) — shared MLP input projection (one per layer).
- mlp_down_bank: shape (11, 512, 1536) — shared MLP output projection.

Why bank: The bank structure makes Test-Time Training (TTT) efficient. At eval
time, the model adapts to test data by running SGD on just these 4 bank tensors
(~24M of the 27M params). Because they're stored as contiguous 3D tensors rather
than scattered per-layer matrices, the TTT optimizer can update all layers in a
single operation.
How layers access banks: Each layer
i reads qo_bank[2*i:2*i+2] for its query/output weights and kv_bank[2*i:2*i+2] for key/value. The bank is a shared pool; the per-layer "selection" is just indexing, not learned routing.
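A minimal sketch of the banked layout under the shapes above (layer_weights and the optimizer wiring are illustrative, not the PR's actual API):

```python
import torch

# Banked weights: one contiguous 3D tensor per projection type (11 layers, 2 slots each).
n_layers = 11
qo_bank = torch.randn(2 * n_layers, 512, 512, requires_grad=True)    # query/output
kv_bank = torch.randn(2 * n_layers, 256, 512, requires_grad=True)    # key/value
mlp_up_bank = torch.randn(n_layers, 1536, 512, requires_grad=True)
mlp_down_bank = torch.randn(n_layers, 512, 1536, requires_grad=True)

def layer_weights(i):
    # "Selection" is plain indexing into the shared pool — no learned routing.
    q_w, o_w = qo_bank[2 * i], qo_bank[2 * i + 1]
    k_w, v_w = kv_bank[2 * i], kv_bank[2 * i + 1]
    return q_w, k_w, v_w, o_w, mlp_up_bank[i], mlp_down_bank[i]

# Because each bank is one contiguous tensor, TTT can update all layers at once:
ttt_opt = torch.optim.SGD([qo_bank, kv_bank, mlp_up_bank, mlp_down_bank], lr=1e-3)
```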
What is the Bigram Hash Table?
The model includes a hash-based bigram embedding table (bigram.embed.weight, shape 4096×64) that provides a fast, parameter-cheap lookup of bigram statistics:

- At each position t, compute hash(token[t-1], token[t]) mod 4095 → bucket index.
- Look up that bucket's 64-dim embedding and scale it by bigram.scale (~0.11 after training).
- Add the scaled embedding to the hidden state at position t.

This gives the model access to bigram transition statistics without any
attention computation. With 1024² ≈ 1M possible bigrams mapped to 4095 buckets,
each bucket serves ~256 bigram contexts on average (hash collision is by design —
the embeddings learn an average predictive signal across all colliding contexts).
word_start_boost: A learned scalar gate (initialized to 1.0) that scales the bigram contribution specifically at word-start positions — positions where the current token begins with a leading space (e.g., _the, _was, _and). In the parent model, this gate collapsed to 0.007, meaning the model learned
to almost completely suppress the bigram signal at word-start positions. This
suppression was the key observation that motivated exp101's innovation.
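A minimal sketch of the lookup path described above; the hash mixing constant and helper names are illustrative assumptions, not the code in train_gpt.py:

```python
import torch

embed = torch.nn.Embedding(4096, 64)   # bigram.embed.weight
bigram_scale = 0.11                    # bigram.scale after training
word_start_boost = 1.0                 # learned gate; collapsed to 0.007 in the parent

def bigram_bucket(prev_tok, curr_tok):
    # Illustrative mixing hash; the actual hash function may differ.
    return ((prev_tok * 1_000_003) ^ curr_tok) % 4095

def bigram_feature(prev_tok, curr_tok, is_word_start):
    vec = bigram_scale * embed(torch.tensor(bigram_bucket(prev_tok, curr_tok)))
    return word_start_boost * vec if is_word_start else vec  # gated at word starts
```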
Training Pipeline
MATRIX_LR=0.025 (Muon), 0.001 (AdamW)

Quantization Pipeline (for 16 MB submission)
Test-Time Training (TTT) — The Scoring Mechanism
The competition uses score-first-then-adapt evaluation (called legal_ttt or eval_val_sliding_ttt): each chunk is scored before the model adapts on it. A minimal sketch of the per-chunk loop follows.
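The sketch below assumes placeholder model, chunks, and bank_params objects; only the score-first ordering and the choice of SGD come from this document:

```python
import torch
import torch.nn.functional as F

def legal_ttt_eval(model, chunks, bank_params, lr=1e-3):
    opt = torch.optim.SGD(bank_params, lr=lr)   # eval-time TTT uses SGD
    total_nll, total_tokens = 0.0, 0
    for x, y in chunks:
        with torch.no_grad():                   # 1) score the chunk FIRST
            nll = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
        total_nll += nll.item() * y.numel()
        total_tokens += y.numel()
        opt.zero_grad()                         # 2) THEN adapt on the same chunk
        F.cross_entropy(model(x).flatten(0, 1), y.flatten()).backward()
        opt.step()                              # adapted banks persist into the next chunk
    return total_nll / total_tokens             # average NLL per token (nats)
```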
Evaluation runs on the dequantized model after GPTQ, with CastedLinear._qat_enabled = True.

Innovation — What This PR Introduces
Innovation 1: Position-Conditional Bigram Hashing (exp101)
Problem observed: In the parent model, the bigram table's 4095 buckets are
shared between all
(prev, curr) bigram contexts regardless of whether the current token is a word-start (has leading space) or within-word. Analysis of the parent checkpoint revealed a collapsed word_start_boost value (0.007).

Root cause: Word-start bigram transitions (_was→_the, _the→_quick) have enormous variance because the next word depends on semantic context that a simple bigram can't capture. Within-word transitions (qu→ick, th→e) are low-variance and highly predictable. When both types collide in the same hash
bucket, the learned embedding is a compromise that doesn't fit either well. The
model's only option is a global suppression gate.
Solution: Split the hash space by word-start class. The 4095 usable buckets
become two disjoint halves:
- [0, 2047): (prev, curr) pairs where has_leading_space[curr] = true
- [2047, 4094): (prev, curr) pairs where has_leading_space[curr] = false

The split key is has_leading_space[current_token], which is a deterministic property of the current token (already in the causal window — no future leakage). This is the same information the existing word_start_boost gate already uses, so legality is preserved.
Parameter cost: Zero. Same 4096×64 table, same parameter count. Only the hash
function changes.
Innovation 2: Trigram Lookup (exp101)
In addition to the
(t-1, t) bigram, we add a (t-2, t-1, t) trigram lookup that hashes into the same table. This doubles the number of contexts per bucket but
each context carries more specific information (trigrams are more predictive than
bigrams). The trigram hash respects the same position-conditional split.
Parameter cost: Zero. Reuses the same embedding table.
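A minimal sketch of Innovations 1–2 together, assuming a has_leading_space lookup (token id → bool) and illustrative mixing constants; the real hash lives in train_gpt.py:

```python
WS_RANGE = (0, 2047)     # word-start half  [0, 2047)
WW_RANGE = (2047, 4094)  # within-word half [2047, 4094)

def poscond_bucket(h, word_start):
    lo, hi = WS_RANGE if word_start else WW_RANGE
    return lo + h % (hi - lo)            # confine the hash to its class's half

def bigram_bucket(prev, curr, has_leading_space):
    h = (prev * 1_000_003) ^ curr
    return poscond_bucket(h, has_leading_space[curr])

def trigram_bucket(t2, t1, curr, has_leading_space):
    # The trigram reuses the same table and respects the same split.
    h = ((t2 * 999_983) ^ (t1 * 1_000_003)) ^ curr
    return poscond_bucket(h, has_leading_space[curr])
```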
Innovation 3: TTT Optimizer Correction (exp101)
The parent model configured AdamW+flat for in-training TTT but its reported
legal_ttt of 1.1169 was actually produced by a standalone SGD+cosine post-run.
We reverted to SGD+cosine during training to ensure the training-time and
eval-time TTT optimizers match. This is not novel but was a necessary correction.
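A sketch of the matched TTT optimizer (SGD with cosine decay); the learning rate and step count here are illustrative placeholders:

```python
import torch

def make_ttt_optimizer(bank_params, base_lr=1e-3, n_adapt_steps=40):
    # SGD + cosine, matching the standalone post-run that produced the
    # parent's reported legal_ttt — now also used during training.
    opt = torch.optim.SGD(bank_params, lr=base_lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=n_adapt_steps)
    return opt, sched
```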
Innovation 4: Single-Variable Meta-TTT Ablation (exp105a)
What is FOMAML meta-TTT? During training, every 4th step (META_TTT_EVERY=4), the model runs a mini meta-learning loop:

1. Run one inner SGD step on the current training batch → produces adapted banks banks'.
2. Compute the loss of banks' on the same batch, take its gradient (first-order), and accumulate it with the normal training gradient.
The idea (from MAML) is that this teaches the banks to be "pre-positioned" for
fast adaptation, so TTT at eval time will be more effective.
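A minimal first-order sketch of that loop; fomaml_accumulate, loss_fn, and the in-place bank swapping are illustrative reconstructions, not the PR's actual implementation:

```python
import torch

def fomaml_accumulate(model, banks, batch, loss_fn, inner_lr=1e-3):
    # 1) Inner SGD step on the banks -> banks'
    inner_grads = torch.autograd.grad(loss_fn(model, batch), banks)
    originals = [b.detach().clone() for b in banks]
    with torch.no_grad():
        for b, g in zip(banks, inner_grads):
            b -= inner_lr * g                      # banks now hold banks'
    # 2) Outer loss of banks' on the SAME batch. First-order MAML applies
    #    d(outer)/d(banks') directly to the original banks.
    outer_grads = torch.autograd.grad(loss_fn(model, batch), banks)
    with torch.no_grad():
        for b, orig in zip(banks, originals):
            b.copy_(orig)                          # restore pre-inner banks
    for b, g in zip(banks, outer_grads):
        b.grad = g if b.grad is None else b.grad + g  # add to training gradient
```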
The ablation: exp105a changes exactly one flag —
META_TTT_ENABLED=1 → 0 — with everything else byte-identical (same seed, same data order, same LR schedule,
same QAT timing, same SWA windows, same train_gpt.py source). This is the cleanest
single-variable ablation possible in this codebase.
Weight-space analysis: We ran 5 CPU-only analyses on the two checkpoints
(script:
ablation_exp105a/supporting_files/analysis_meta_ttt.py, ~1.3 s runtime); the findings are summarized in the weight-space story below.

Results
exp101 — Record Submission
exp105a — Meta-TTT Ablation (non-record)
Head-to-Head Comparison
Comparison with Parent Architecture
Analysis
Why Position-Conditional Hashing Works
The theoretical prediction was that word-start bigrams have exploitable structure
(after sentence-ending punctuation, the next word-start is biased toward function
words and proper nouns; within a paragraph, the next word-start depends on
syntactic role). The position-conditional split lets the model learn this structure
in clean ws-only buckets rather than being forced to suppress everything via a
global gate.
Evidence it worked: The 0.001 bpb improvement from parent to exp101 is an order of magnitude below the theoretical "realistic estimate" of ~0.01 bpb but in the predicted direction. The improvement
persists through quantization and TTT, confirming it's a genuine architectural gain
rather than an overfitting artifact.
Why the Ablation Kills the Meta-TTT Narrative
The same-batch FOMAML in exp101 has a fundamental objective mismatch:
At eval time (TTT), the model adapts on chunk
i and is scored on chunk i — but the scoring happens before adaptation (score-first-then-adapt). The
meta-gradient optimizes for "banks that recover quickly from an SGD step on
seen data" — this rewards banks that resist change, not banks that
generalize to new data.
After 7000 training steps, the banks are already well-converged. The FOMAML
inner step barely moves them (small gradient on a near-optimum), so the outer
gradient (on the same data) carries essentially zero useful signal. The meta-TTT
degenerates into gradient noise.
Weight-Space Story: Orthogonal Weights, Same Function
The weight-space analysis (5 analyses, CPU-only, 1.3s) reveals a fascinating picture:
Element-level: Bank weight cosines are 0.05–0.10 (near-orthogonal). A 3%
training perturbation caused a 90° rotation in weight space. This is a Muon
amplification effect — Muon's Newton-Schulz gradient orthogonalization transforms
small gradient differences into large basis rotations.
Function-level: Principal-angle subspace cosines average 0.65, with
kv_bank at 0.955 (nearly identical subspace). The two models learned the same
given input are identical to 3-4 decimal places.
Implication: Raw weight cosine is not a meaningful similarity metric under
Muon. Use SVD-based principal-angle analysis instead.
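A minimal sketch of the SVD-based comparison (subspace_cosines is an illustrative helper, not the analysis script's API). The right-rotation example shows how raw element cosine can sit near 0 while the left singular subspace is unchanged:

```python
import torch

def subspace_cosines(w1, w2, k=16):
    # Cosines of the principal angles between the top-k left singular
    # subspaces: the singular values of U1^T @ U2.
    u1 = torch.linalg.svd(w1, full_matrices=False).U[:, :k]
    u2 = torch.linalg.svd(w2, full_matrices=False).U[:, :k]
    return torch.linalg.svdvals(u1.T @ u2)

w = torch.randn(512, 512)
q = torch.linalg.qr(torch.randn(512, 512)).Q    # random orthogonal rotation
w_rot = w @ q                                    # same left subspace, new basis

elem_cos = torch.nn.functional.cosine_similarity(w.flatten(), w_rot.flatten(), dim=0)
print(f"element cosine:   {elem_cos:.3f}")                            # ~0, "orthogonal"
print(f"subspace cosines: {subspace_cosines(w, w_rot).mean():.3f}")   # ~1, same subspace
```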
Learnings for the Community
Hash bucket contention is analyzable and fixable. If you use hash-based
embeddings (bigram tables, feature hashing, locality-sensitive hashing), check
whether semantically different token classes are colliding in the same buckets.
A learned gate that collapses toward 0 is a strong signal of bucket pollution.
Position-conditional splitting is a zero-param fix.
Ablate before you optimize. We inherited FOMAML meta-TTT through 100+
experiments and multiple architecture changes without ever isolating its
contribution. A one-line flag change (
META_TTT_ENABLED=0) revealed it was contributing nothing. If we'd done this ablation 50 experiments earlier, we'd
have saved 3% of compute on every subsequent run.
Same-batch FOMAML is a trap for well-trained models. When the inner and
outer evaluation use the same data, the meta-gradient rewards parameter stability,
not adaptation ability. This is a known issue in meta-learning but is easy to
overlook when inheriting code from an early prototype where the model wasn't
well-trained yet.
Muon-trained models require subspace analysis, not cosine distance. The
Newton-Schulz orthogonalization in Muon amplifies small gradient perturbations
into large basis rotations. Two models from the same seed can be 90° apart in
weight space while computing the same function. Principal-angle subspace overlap
(via SVD) is the correct functional similarity metric.
The TTT delta is a property of architecture, not initialization. The ~0.023
bpb TTT improvement is identical whether meta-TTT is on or off. This implies the
TTT ceiling is set by the bank dimensionality and TTT optimizer configuration, not
by how the banks were initialized during training.
Related PRs
PR #1502 builds on the ablation finding here and tests whether a theoretically-correct redesign of FOMAML (cross-chunk
inner/outer split, delta-loss objective, learned per-layer LR scales) can move
the TTT ceiling. Spoiler: it can't — the TTT delta remains at ~0.023 bpb.
Includes a complete three-way weight-space analysis and error surface geometry
study across all three experiments.
Folder Structure