You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This PR presents a theoretically-grounded redesign of FOMAML meta-TTT that
addresses every identified flaw from #1501 ablation — and demonstrates that the
TTT ceiling is architecture-limited, not initialization-limited. Three training
procedures (original FOMAML, no meta-TTT, redesigned FOMAML) all produce the same
~0.023 bpb TTT delta, proving the ceiling is set by the bank dimensionality and TTT
optimizer, not by meta-training.
See also: [PR: Position-Conditional Bigram + Ablation] #1501 ,
which introduces the base architecture and proves FOMAML meta-TTT adds only
+0.00036 bpb in its original formulation.
TL;DR — Key Learnings for the Community
TTT adaptation ceiling is architecture-limited. Three different meta-training
objectives — same-batch FOMAML, no meta-training, and cross-chunk FOMAML with
Δ-loss — all produce the same ~0.023 bpb TTT improvement. No meta-training
objective can move this ceiling. To raise it, you need more adaptable parameters
(more bank layers, LoRA-style correctors) or a better TTT optimizer (Adam,
more epochs, higher LR).
Three different training procedures find equidistant solutions in weight space
with identical local curvature. Bank condition numbers (1.03–1.38), effective
ranks (22 for attention, 11 for MLP), and energy distributions are identical
across all three models. The loss landscape is degenerate: many equivalent
minima exist, meta-TTT selects which one you land in, but the TTT adaptation
surface looks the same from every minimum.
MetaSGD per-layer LR learning needs a stronger signal. 66 learned per-bank-
per-layer learning rate scales all converged to their 1.0 initialization. One
meta-step every 4 training steps is too infrequent, and the meta-gradient is too
weak relative to the main task gradient, to drive per-layer differentiation.
Cross-chunk FOMAML is less disruptive than same-batch FOMAML. Subspace
overlap analysis shows the no-meta model and cross-chunk model share 73%
functional subspace, vs only 62% between the no-meta and same-batch models.
The biased same-batch meta-gradient systematically rotates the MLP input
subspace; the unbiased cross-chunk variant preserves it.
Always measure the TTT delta, not just the final score. If we'd only
compared final legal_ttt numbers, we might have concluded exp106's float-path
1.11469 was better than exp101's 1.11588. But the delta tells the real story:
exp106's better float baseline (1.1377) compensates for fewer training steps,
while the TTT improvement itself is the same.
NOTE: exp101: Position-conditional bigram hashing — splits the 4096×64 hash table's 4095 buckets into exclusive word-start [0,2047) and within-word [2047,4094) halves keyed on has_leading_space[current_token], eliminating bucket contention that forced the parent model's word_start_boost gate to 0.007 → legal_ttt 1.11588 (zero extra params).
exp105a: Single-flag ablation (META_TTT_ENABLED=1→0, everything else byte-identical to exp101) proving FOMAML meta-TTT contributes only +0.00036 bpb at 3% compute overhead → legal_ttt 1.11624 — the meta-training was equivalent to gradient noise.
Disclaimer
Hardware: All runs use a single H100 80 GB SXM GPU with MAX_WALLCLOCK_SECONDS=4800
(80-minute cap). This provides 4800 GPU-seconds of compute, matching the competition's
standard 8×H100 @ 10 min budget at substantially lower cost.
Early stopping due to wallclock: exp106 completed 6686 of 7500 steps —
~11% fewer than the ablation (exp105a: 7226 steps). This is because MetaSGD's
extra gradient storage (+8.6 GB peak memory) slowed each step from ~663 ms to
~718 ms, consuming the 80-minute budget faster. The model was still in the
warmdown phase when stopped.
Int6 canonical eval crashed: After GPTQ quantization, eval_model.load_state_dict()
failed with RuntimeError: Missing key(s): "meta_sgd_qo", "meta_sgd_kv", "meta_sgd_up", "meta_sgd_down" because the 66 MetaSGD parameters were correctly excluded from the
16 MB export but GPT.__init__ still registers them. This meant the in-script int6
roundtrip evaluation and canonical legal_ttt could not run. A hotfix was applied to
the standalone ttt_from_checkpoint.py harness, which produced the float-path and
partial int6 numbers reported here. Where int6 canonical values are unavailable, they
are marked "—".
Non-record: This experiment is a non-record exploration (non_record: true). It
exists to answer the question "can a better meta-TTT formulation move the TTT ceiling?"
Cost constraint: GPU time was limited. The partial int6 TTT run (80% complete) was
terminated when the trajectory showed no convergence trend different from the baseline.
Projected final value is ~1.118, consistent with the invariant ~0.023 delta.
Architecture Overview
Base Architecture
This experiment shares the identical architecture as PR #1501 (exp101). We reproduce
the full specification here for self-containment.
lots of inspiration from : PR #1019
Component
Configuration
What it does
Model
11-layer U-Net GPT
5 encoder blocks + 6 decoder blocks with skip connections between corresponding encoder-decoder pairs. Skip connections (additive residuals) help gradient flow and allow the decoder to reference early-layer representations.
Input embedding and output projection share weights.
RoPE
Partial, 16/64 dims
Rotary Position Embeddings on 25% of head dimensions.
XSA
All 11 blocks
Cross-layer Shared Attention: Q/K/V/O and MLP weights stored as banked 3D tensors shared across all layers (see PR #1501 for full explanation). The 4 banks (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank) are the parameters adapted during TTT.
VE
Layers 7–10
Value Embeddings: additional value projection on the last 4 layers.
Bigram
4096×64, position-conditional
Hash-based bigram table with word-start/within-word bucket split (see PR #1501 for full explanation).
PR #1501's ablation (exp105a) proved that exp101's FOMAML meta-TTT adds only +0.00036
bpb. But the concept of meta-TTT — training the model to adapt faster at test
time — is theoretically sound (MAML-style learning works in the meta-learning
literature). The failure had three identifiable structural causes:
Flaw
What's wrong
How it hurts
(A) Same-batch inner/outer
Inner loop adapts on batch X, outer evaluates on batch X
Meta-gradient rewards banks that resist SGD on seen data — the opposite of "generalize to unseen test chunks"
(B) No adaptation reward
Outer loss = absolute L(banks'; X)
A bank with low initial loss that gets worse under the inner step is rewarded equally as one that improves. No term explicitly rewards the improvement from adaptation.
(C) Uniform inner LR
All 4 bank types × 11 layers use inner_lr = 0.002
The optimal adaptation speed for a shallow attention bank vs a deep MLP bank is likely different. No mechanism to learn this.
Innovation A: Cross-Chunk Split
Split the training batch B (shape [batch, seq_len]) along the batch dimension
into two halves. The first half provides the inner-loop adaptation data, the second
half provides the outer-loop evaluation data:
Inner: banks' ← banks − α · s ⊙ ∇L(banks; B_first_half)
Outer: L_meta = L(banks'; B_second_half) ← DIFFERENT documents
Because the dataloader draws independent random sequences from fineweb10B, B_first_half
and B_second_half come from different documents. This matches the TTT deployment
regime: adapt on document i, get scored on document j.
Fallback: When per-GPU batch size = 1 (not our case, but handled), falls back
to sequence-half split (first/last 1024 tokens of the same sequence).
Innovation B: Delta-Loss Outer Objective
Instead of optimizing absolute post-adaptation loss, we add a term that explicitly
rewards the improvement from the inner step:
L_meta = (w_post + w_Δ) · L_post − w_Δ · L_pre
where:
L_post = L(banks'; B_second_half) ← loss AFTER adaptation
L_pre = L(banks; B_second_half) ← loss BEFORE adaptation (detached banks)
w_post = 0.5 (META_TTT_LOSS_WEIGHT)
w_Δ = 0.3 (META_TTT_DELTA_WEIGHT)
The second term is the adaptation delta: it directly penalizes banks where the
inner step makes things worse and rewards banks where it helps. A bank that starts
with low loss but doesn't improve gets penalized by the −w_Δ · L_pre term.
Cost: One extra forward pass per meta-step (computing L_pre).
Innovation C: MetaSGD — Learned Per-Layer LR Scales
For each bank type k ∈ {qo, kv, up, down} and each layer l:
where s[k, l] ∈ R+ is a learned scalar initialized to 1.0. Shapes:
Parameter
Shape
Count
Purpose
meta_sgd_qo
(22,)
22
Per-slot LR scale for query-output bank
meta_sgd_kv
(22,)
22
Per-slot LR scale for key-value bank
meta_sgd_up
(11,)
11
Per-layer LR scale for MLP up-projection
meta_sgd_down
(11,)
11
Per-layer LR scale for MLP down-projection
Total
—
66
Excluded from 16 MB export (0 bytes in submission)
If meta-TTT works, different layers should learn different scales — e.g., shallow
attention layers might need larger inner-loop steps than deep MLP layers. The
scales are registered as nn.Parameter and receive gradients via the outer loss
backprop. They are excluded from the exported final_model.pt and final_model.int6.ptz to preserve the 16 MB budget.
Implementation detail: The inner-loop update is built as a differentiable
non-leaf tensor so a single backward pass populates both MetaSGD scale gradients
(via leaf autograd) and bank FOMAML gradients (via retain_grad + manual copy).
Results
exp106 — Meta-TTT Redesign
Metric
Value
Source
Note
Steps completed
6686 / 7500
wallclock cap
−334 vs exp101 (MetaSGD overhead)
val_bpb @ step 3000
1.2251
training log
val_bpb @ step 6000
1.1431
training log
Best of the three at matched step
Post-EMA val_bpb
1.1377
training log
Slightly worse than exp101 (fewer steps)
MetaSGD params exported
0 (66 excluded)
by design
Int6 val_bpb (roundtrip)
—
crashed
meta_sgd_* strict-load RuntimeError
Model size (int6+lzma)
15.02 MB
final artifact
Total submission size
15.14 MB
model + code
Peak GPU memory
31,695 MiB
training log
+8.6 GB vs exp101 (MetaSGD gradients)
Float baseline bpb
1.13767
ttt_from_checkpoint_float_qatoff.log
Float-path legal_ttt
1.11469
ttt_from_checkpoint_float_qatoff.log
Float TTT delta
−0.02299
computed
Int6 TTT (partial 80%)
1.11800
ttt_int6_ep4_partial.log (chunk 761/947)
Projected int6 legal_ttt
~1.118
trajectory extrapolation
Late QAT fired
step 5110
training log
SWA started
step 5300
training log
Int6 TTT Trajectory (partial, 80% complete)
Chunk progress
bpb
Source
401 / 947 (42%)
1.117622
ttt_int6_ep4_partial.log
621 / 947 (66%)
1.118994
ttt_int6_ep4_partial.log
661 / 947 (70%)
1.116769
ttt_int6_ep4_partial.log
681 / 947 (72%)
1.116469
ttt_int6_ep4_partial.log
761 / 947 (80%)
1.117976
ttt_int6_ep4_partial.log
Baseline (int6 canonical): 1.14160. Running delta at 80%: −0.02362.
The trajectory is flat in the 66–80% range. Projected final: ~1.118.
MetaSGD Scale Convergence
All 66 learned LR scales converged to values near their 1.0 initialization:
Parameter group
Mean
Std
Min
Max
meta_sgd_qo (22 scales)
~1.00
<0.04
>0.92
<1.08
meta_sgd_kv (22 scales)
~1.00
<0.04
>0.93
<1.07
meta_sgd_up (11 scales)
~1.00
<0.03
>0.94
<1.06
meta_sgd_down (11 scales)
~1.00
<0.03
>0.95
<1.05
Interpretation: No per-layer differentiation was learned. The meta-training
signal (1 meta-step per 4 training steps, at ~30% of main gradient magnitude)
is too weak to push 66 scalar parameters away from their initialization over
6686 training steps.
Analysis — Complete Meta-TTT Lineage (All Three Experiments)
This section summarizes the findings across all three experiments in this series.
A reader who sees only this PR should be able to understand the full meta-TTT story.
*Float-path TTT; int6 canonical unavailable due to strict-load crash.
The Central Finding: TTT Delta Invariance
Experiment
Baseline bpb
Post-TTT bpb
TTT delta
Source
exp101 (FOMAML, int6)
1.13930
1.11588
−0.02342
logs_seed42.txt
exp105a (no meta, int6)
1.13956
1.11624
−0.02331
logs_seed42.txt
exp106 (redesign, float)
1.13767
1.11469
−0.02299
ttt_from_checkpoint_float_qatoff.log
exp106 (redesign, int6 80%)
1.14160
~1.118
~−0.024
ttt_int6_ep4_partial.log
The TTT delta is −0.023 ± 0.001 bpb across all variants. Three different
training objectives — from "no meta-signal" to "theoretically correct cross-document
generalization reward" — produce the same adaptation improvement.
Three-Way Weight-Space Analysis
We ran 8 analyses comparing all three models pairwise (script: supporting_files/analysis_three_way.py, CPU-only, 3.6s on M2):
All three models are approximately the same distance from each other. Meta-TTT
doesn't push you in a consistent direction — it pushes you to a random neighboring
basin, and the specific basin depends on the meta-gradient formulation.
Subspace Overlap: Cross-Chunk Preserves the Natural Subspace
Pair
Avg subspace cosine
Frac dims aligned
exp101 vs exp105a (FOMAML vs no-meta)
0.615
0.411
exp101 vs exp106 (FOMAML vs redesign)
0.659
0.472
exp105a vs exp106 (no-meta vs redesign)
0.727
0.548
The redesigned cross-chunk FOMAML (exp106) produces a solution closer in
functional subspace to the no-meta baseline than the original same-batch
FOMAML (exp101) does. The biased same-batch meta-gradient rotates the subspace
more than the unbiased cross-chunk variant.
Most striking: mlp_up_bank subspace cosine is 0.949 between exp105a and
exp106 (nearly identical) but only 0.551 between exp101 and exp105a (half-
rotated). Same-batch FOMAML systematically distorts the MLP input features.
Error Surface: Identical Curvature at All Three Minima
Property
exp101
exp105a
exp106
Interpretation
Bank avg condition number
1.2
1.2
1.2
Near-isotropic — SGD works equally well in all directions
Bank avg effective rank
16.5
16.5
16.5
All bank dimensions contribute equally
Bank avg top-5 energy frac
0.37
0.37
0.37
Uniform energy distribution
Quantization MSE
8.686e-5
8.691e-5
8.686e-5
Identical sensitivity to int6
This is why the TTT delta is invariant: the local curvature of the loss
landscape — the surface that SGD navigates during TTT — is identical at all three
minima. SGD makes the same progress per step from any starting point.
Mode Connectivity: Three Distinct Basins
Pair
Midpoint norm ratio
Assessment
exp101 vs exp105a
0.786
Different basins
exp101 vs exp106
0.793
Different basins
exp105a vs exp106
0.807
Borderline same basin
3-way centroid
0.704
Clearly distinct (30% norm loss)
The three models occupy distinct local minima. exp105a and exp106 are closest
to being in the same basin (ratio 0.807, threshold ~0.8), consistent with
cross-chunk FOMAML being less disruptive than same-batch FOMAML.
Why Meta-TTT Cannot Move the Ceiling
The argument from curvature invariance: TTT improvement depends on (1) how
far SGD can move the banks in 4 epochs (fixed by TTT config) and (2) how much
loss reduction each step buys (determined by local curvature). We showed the
curvature is identical at all three minima. Therefore the TTT delta must be
identical — QED.
The argument from over-parameterization: The training loss surface has a
degenerate set of equivalent minima (the three models prove this). Meta-TTT
selects a different minimum but cannot escape the set. All minima in the set
have the same curvature and the same TTT potential. To escape, you'd need a
stronger perturbation: second-order MAML, many more inner steps, or a dedicated
meta-training phase after warmdown.
The argument from MetaSGD: If per-layer LR differentiation could help, the
66 MetaSGD scales should have diverged from their 1.0 initialization. They
didn't. The meta-gradient signal at 1 step per 4, with loss weight 0.5, is
too weak to drive 66 scalar parameters in 6686 training steps.
Learnings for the Community
The TTT adaptation ceiling is set by architecture, not initialization.
~0.023 bpb is invariant across three FOMAML variants (same-batch, none,
cross-chunk + Δ-loss + MetaSGD). To improve TTT, change the bank dimensionality
or the TTT optimizer — not the training-time meta-objective.
First-order MAML with 1 inner step on a well-trained model ≈ gradient noise.
After 6000+ training steps, the banks are near a local optimum. A single inner
SGD step barely perturbs them, so the FOMAML outer gradient carries near-zero
functional signal regardless of how the inner/outer data is split.
Cross-chunk FOMAML is less harmful than same-batch FOMAML (even though both
are useless for TTT). Same-batch FOMAML introduces a systematic directional
bias that rotates the MLP input subspace 45° from the natural optimum. Cross-
chunk FOMAML's unbiased meta-gradient preserves the natural subspace (cos 0.95).
MetaSGD needs a stronger signal to learn meaningful per-layer differentiation.
At 1 meta-step per 4 training steps with loss weight 0.5, the effective meta-
gradient energy is ~7.5% of total gradient. This is insufficient to drive 66
scalar parameters away from their initialization over 6686 steps.
Three equivalent minima with identical local curvature — the loss landscape
of a Muon-trained 27M-param transformer has a degenerate set of solutions.
Meta-learning perturbations select among them but cannot improve them. This
is consistent with overparameterization theory and with empirical results from
lottery ticket and mode connectivity research.
Measure the delta, not the score. If we'd only compared final bpb numbers,
exp106's 1.11469 looks better than exp101's 1.11588. But the TTT delta
(architecture-level metric) is the same. The per-experiment score difference comes
from different pre-TTT baselines (1.1377 vs 1.1393), which are driven by the
number of training steps completed, not by meta-TTT quality.
Related PRs
PR [Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation #1501 — Position-Conditional Bigram + Ablation (exp101 + exp105a): Introduces
the base architecture (position-conditional bigram hashing, a zero-parameter trick
that improves legal_ttt by 0.001 bpb) and the controlled ablation proving same-batch
FOMAML meta-TTT contributes only +0.00036 bpb. The ablation finding is the
motivation for this PR's redesign.
Analysis ### TTT Pattern: Score-First Per Chunk (LEGAL) The TTT implementation in eval_val_sliding_ttt (lines 1190–1330) follows the legal score-first-per-chunk pattern: 1. Score first under inference_mode (lines 1242–1272): Each chunk's windows are scored under torch.inference_mode() using base_model.eval(). Loss, token counts, and byte counts are accumulated. 2. Train after scoring, with is_last guard (lines 1273–1304): Training only occurs if not is_last (line 1274). The model is switched to .train() mode, and multi-epoch SGD runs on the chunk's val_tokens. This matches the PR #1413 legal pattern exactly. python # Line 1243: score first with torch.inference_mode(): ... loss_sum += scored_nll.sum() # accumulate scored region # Line 1273–1304: train after, guarded is_last = (ci == num_chunks - 1) if not is_last and args.ttt_epochs > 0: base_model.train() for _ep in range(args.ttt_epochs): ... optimizer.step() # SGD, not AdamW The optimizer is SGD (line 1228), not AdamW — this is relevant context but not itself a classification criterion. ### N-gram Hash: LEGAL The bigram hash at line 806: python torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod This is input XOR input (adjacent tokens), not ctx_hash XOR (target * primes[k]). The illegal pattern requires the target to be XOR'd into the lookup key. Neither bigram_hash nor trigram_hash (line 822) involves the target token — they only hash context tokens to produce an embedding for the current position. ### Meta-TTT (FOMAML): No Val-Token Violation Meta-TTT (meta_ttt_step, line 1780) operates on training micro-batches (variables x, y from the training loop at line 2160), not on...
Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.
Recommendation to @cocohearts@valerio-oai@0hq@yuzhougu-oai@notapplica:MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.
Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
SPThole
changed the title
[Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD
[Non Record/notable] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD
May 4, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR presents a theoretically-grounded redesign of FOMAML meta-TTT that
addresses every identified flaw from #1501 ablation — and demonstrates that the
TTT ceiling is architecture-limited, not initialization-limited. Three training
procedures (original FOMAML, no meta-TTT, redesigned FOMAML) all produce the same
~0.023 bpb TTT delta, proving the ceiling is set by the bank dimensionality and TTT
optimizer, not by meta-training.
See also: [PR: Position-Conditional Bigram + Ablation] #1501 ,
which introduces the base architecture and proves FOMAML meta-TTT adds only
+0.00036 bpb in its original formulation.
TL;DR — Key Learnings for the Community
TTT adaptation ceiling is architecture-limited. Three different meta-training
objectives — same-batch FOMAML, no meta-training, and cross-chunk FOMAML with
Δ-loss — all produce the same ~0.023 bpb TTT improvement. No meta-training
objective can move this ceiling. To raise it, you need more adaptable parameters
(more bank layers, LoRA-style correctors) or a better TTT optimizer (Adam,
more epochs, higher LR).
Three different training procedures find equidistant solutions in weight space
with identical local curvature. Bank condition numbers (1.03–1.38), effective
ranks (22 for attention, 11 for MLP), and energy distributions are identical
across all three models. The loss landscape is degenerate: many equivalent
minima exist, meta-TTT selects which one you land in, but the TTT adaptation
surface looks the same from every minimum.
MetaSGD per-layer LR learning needs a stronger signal. 66 learned per-bank-
per-layer learning rate scales all converged to their 1.0 initialization. One
meta-step every 4 training steps is too infrequent, and the meta-gradient is too
weak relative to the main task gradient, to drive per-layer differentiation.
Cross-chunk FOMAML is less disruptive than same-batch FOMAML. Subspace
overlap analysis shows the no-meta model and cross-chunk model share 73%
functional subspace, vs only 62% between the no-meta and same-batch models.
The biased same-batch meta-gradient systematically rotates the MLP input
subspace; the unbiased cross-chunk variant preserves it.
Always measure the TTT delta, not just the final score. If we'd only
compared final legal_ttt numbers, we might have concluded exp106's float-path
1.11469 was better than exp101's 1.11588. But the delta tells the real story:
exp106's better float baseline (1.1377) compensates for fewer training steps,
while the TTT improvement itself is the same.
NOTE: exp101: Position-conditional bigram hashing — splits the 4096×64 hash table's 4095 buckets into exclusive word-start
[0,2047)and within-word[2047,4094)halves keyed onhas_leading_space[current_token], eliminating bucket contention that forced the parent model'sword_start_boostgate to 0.007 → legal_ttt 1.11588 (zero extra params).exp105a: Single-flag ablation (
META_TTT_ENABLED=1→0, everything else byte-identical to exp101) proving FOMAML meta-TTT contributes only +0.00036 bpb at 3% compute overhead → legal_ttt 1.11624 — the meta-training was equivalent to gradient noise.Disclaimer
Hardware: All runs use a single H100 80 GB SXM GPU with
MAX_WALLCLOCK_SECONDS=4800(80-minute cap). This provides 4800 GPU-seconds of compute, matching the competition's
standard 8×H100 @ 10 min budget at substantially lower cost.
Early stopping due to wallclock: exp106 completed 6686 of 7500 steps —
~11% fewer than the ablation (exp105a: 7226 steps). This is because MetaSGD's
extra gradient storage (+8.6 GB peak memory) slowed each step from ~663 ms to
~718 ms, consuming the 80-minute budget faster. The model was still in the
warmdown phase when stopped.
Int6 canonical eval crashed: After GPTQ quantization,
eval_model.load_state_dict()failed with
RuntimeError: Missing key(s): "meta_sgd_qo", "meta_sgd_kv", "meta_sgd_up", "meta_sgd_down"because the 66 MetaSGD parameters were correctly excluded from the16 MB export but
GPT.__init__still registers them. This meant the in-script int6roundtrip evaluation and canonical legal_ttt could not run. A hotfix was applied to
the standalone
ttt_from_checkpoint.pyharness, which produced the float-path andpartial int6 numbers reported here. Where int6 canonical values are unavailable, they
are marked "—".
Non-record: This experiment is a non-record exploration (
non_record: true). Itexists to answer the question "can a better meta-TTT formulation move the TTT ceiling?"
Cost constraint: GPU time was limited. The partial int6 TTT run (80% complete) was
terminated when the trajectory showed no convergence trend different from the baseline.
Projected final value is ~1.118, consistent with the invariant ~0.023 delta.
Architecture Overview
Base Architecture
This experiment shares the identical architecture as PR #1501 (exp101). We reproduce
the full specification here for self-containment.
lots of inspiration from : PR #1019
tok_emb = lm_head^T)qo_bank,kv_bank,mlp_up_bank,mlp_down_bank) are the parameters adapted during TTT.Training Pipeline
Quantization and TTT
Same pipeline as PR #1501 : GPTQ int6 (attn+MLP) / int8 (embed) → LZMA → 16 MB.
TTT: SGD + cosine LR, momentum 0.9, 4 epochs, 947 chunks × 65K tokens.
Scoring: score-first-then-adapt (
legal_ttt).Innovation — What This PR Introduces
Motivation: Why Meta-TTT Needed a Redesign
PR #1501's ablation (exp105a) proved that exp101's FOMAML meta-TTT adds only +0.00036
bpb. But the concept of meta-TTT — training the model to adapt faster at test
time — is theoretically sound (MAML-style learning works in the meta-learning
literature). The failure had three identifiable structural causes:
L(banks'; X)inner_lr = 0.002Innovation A: Cross-Chunk Split
Split the training batch
B(shape[batch, seq_len]) along the batch dimensioninto two halves. The first half provides the inner-loop adaptation data, the second
half provides the outer-loop evaluation data:
Because the dataloader draws independent random sequences from fineweb10B,
B_first_halfand
B_second_halfcome from different documents. This matches the TTT deploymentregime: adapt on document
i, get scored on documentj.Fallback: When per-GPU batch size = 1 (not our case, but handled), falls back
to sequence-half split (first/last 1024 tokens of the same sequence).
Innovation B: Delta-Loss Outer Objective
Instead of optimizing absolute post-adaptation loss, we add a term that explicitly
rewards the improvement from the inner step:
Expanding:
L_meta = 0.5 · L_post + 0.3 · (L_post − L_pre)The second term is the adaptation delta: it directly penalizes banks where the
inner step makes things worse and rewards banks where it helps. A bank that starts
with low loss but doesn't improve gets penalized by the
−w_Δ · L_preterm.Cost: One extra forward pass per meta-step (computing
L_pre).Innovation C: MetaSGD — Learned Per-Layer LR Scales
For each bank type
k ∈ {qo, kv, up, down}and each layerl:where
s[k, l] ∈ R+is a learned scalar initialized to 1.0. Shapes:meta_sgd_qometa_sgd_kvmeta_sgd_upmeta_sgd_downIf meta-TTT works, different layers should learn different scales — e.g., shallow
attention layers might need larger inner-loop steps than deep MLP layers. The
scales are registered as
nn.Parameterand receive gradients via the outer lossbackprop. They are excluded from the exported
final_model.ptandfinal_model.int6.ptzto preserve the 16 MB budget.Implementation detail: The inner-loop update is built as a differentiable
non-leaf tensor so a single backward pass populates both MetaSGD scale gradients
(via leaf autograd) and bank FOMAML gradients (via
retain_grad+ manual copy).Results
exp106 — Meta-TTT Redesign
meta_sgd_*strict-load RuntimeErrorInt6 TTT Trajectory (partial, 80% complete)
Baseline (int6 canonical): 1.14160. Running delta at 80%: −0.02362.
The trajectory is flat in the 66–80% range. Projected final: ~1.118.
MetaSGD Scale Convergence
All 66 learned LR scales converged to values near their 1.0 initialization:
Interpretation: No per-layer differentiation was learned. The meta-training
signal (1 meta-step per 4 training steps, at ~30% of main gradient magnitude)
is too weak to push 66 scalar parameters away from their initialization over
6686 training steps.
Analysis — Complete Meta-TTT Lineage (All Three Experiments)
This section summarizes the findings across all three experiments in this series.
A reader who sees only this PR should be able to understand the full meta-TTT story.
The Three Experiments
META_TTT_ENABLED=0)*Float-path TTT; int6 canonical unavailable due to strict-load crash.
The Central Finding: TTT Delta Invariance
The TTT delta is −0.023 ± 0.001 bpb across all variants. Three different
training objectives — from "no meta-signal" to "theoretically correct cross-document
generalization reward" — produce the same adaptation improvement.
Three-Way Weight-Space Analysis
We ran 8 analyses comparing all three models pairwise (script:
supporting_files/analysis_three_way.py, CPU-only, 3.6s on M2):Triangle Geometry: Equidistant Solutions
All three models are approximately the same distance from each other. Meta-TTT
doesn't push you in a consistent direction — it pushes you to a random neighboring
basin, and the specific basin depends on the meta-gradient formulation.
Subspace Overlap: Cross-Chunk Preserves the Natural Subspace
The redesigned cross-chunk FOMAML (exp106) produces a solution closer in
functional subspace to the no-meta baseline than the original same-batch
FOMAML (exp101) does. The biased same-batch meta-gradient rotates the subspace
more than the unbiased cross-chunk variant.
Most striking:
mlp_up_banksubspace cosine is 0.949 between exp105a andexp106 (nearly identical) but only 0.551 between exp101 and exp105a (half-
rotated). Same-batch FOMAML systematically distorts the MLP input features.
Error Surface: Identical Curvature at All Three Minima
This is why the TTT delta is invariant: the local curvature of the loss
landscape — the surface that SGD navigates during TTT — is identical at all three
minima. SGD makes the same progress per step from any starting point.
Mode Connectivity: Three Distinct Basins
The three models occupy distinct local minima. exp105a and exp106 are closest
to being in the same basin (ratio 0.807, threshold ~0.8), consistent with
cross-chunk FOMAML being less disruptive than same-batch FOMAML.
Why Meta-TTT Cannot Move the Ceiling
The argument from curvature invariance: TTT improvement depends on (1) how
far SGD can move the banks in 4 epochs (fixed by TTT config) and (2) how much
loss reduction each step buys (determined by local curvature). We showed the
curvature is identical at all three minima. Therefore the TTT delta must be
identical — QED.
The argument from over-parameterization: The training loss surface has a
degenerate set of equivalent minima (the three models prove this). Meta-TTT
selects a different minimum but cannot escape the set. All minima in the set
have the same curvature and the same TTT potential. To escape, you'd need a
stronger perturbation: second-order MAML, many more inner steps, or a dedicated
meta-training phase after warmdown.
The argument from MetaSGD: If per-layer LR differentiation could help, the
66 MetaSGD scales should have diverged from their 1.0 initialization. They
didn't. The meta-gradient signal at 1 step per 4, with loss weight 0.5, is
too weak to drive 66 scalar parameters in 6686 training steps.
Learnings for the Community
The TTT adaptation ceiling is set by architecture, not initialization.
~0.023 bpb is invariant across three FOMAML variants (same-batch, none,
cross-chunk + Δ-loss + MetaSGD). To improve TTT, change the bank dimensionality
or the TTT optimizer — not the training-time meta-objective.
First-order MAML with 1 inner step on a well-trained model ≈ gradient noise.
After 6000+ training steps, the banks are near a local optimum. A single inner
SGD step barely perturbs them, so the FOMAML outer gradient carries near-zero
functional signal regardless of how the inner/outer data is split.
Cross-chunk FOMAML is less harmful than same-batch FOMAML (even though both
are useless for TTT). Same-batch FOMAML introduces a systematic directional
bias that rotates the MLP input subspace 45° from the natural optimum. Cross-
chunk FOMAML's unbiased meta-gradient preserves the natural subspace (cos 0.95).
MetaSGD needs a stronger signal to learn meaningful per-layer differentiation.
At 1 meta-step per 4 training steps with loss weight 0.5, the effective meta-
gradient energy is ~7.5% of total gradient. This is insufficient to drive 66
scalar parameters away from their initialization over 6686 steps.
Three equivalent minima with identical local curvature — the loss landscape
of a Muon-trained 27M-param transformer has a degenerate set of solutions.
Meta-learning perturbations select among them but cannot improve them. This
is consistent with overparameterization theory and with empirical results from
lottery ticket and mode connectivity research.
Measure the delta, not the score. If we'd only compared final bpb numbers,
exp106's 1.11469 looks better than exp101's 1.11588. But the TTT delta
(architecture-level metric) is the same. The per-experiment score difference comes
from different pre-TTT baselines (1.1377 vs 1.1393), which are driven by the
number of training steps completed, not by meta-TTT quality.
Related PRs
the base architecture (position-conditional bigram hashing, a zero-parameter trick
that improves legal_ttt by 0.001 bpb) and the controlled ablation proving same-batch
FOMAML meta-TTT contributes only +0.00036 bpb. The ablation finding is the
motivation for this PR's redesign.
Folder Structure