[Non Record/notable] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD by SPThole · Pull Request #1502 · openai/parameter-golf

SPThole · 2026-04-09T15:43:28Z

Track: 10min_16mb (Track B, score-first-then-adapt) | Hardware: 1×H100 80 GB SXM
Float-path legal_ttt: 1.11469 | TTT delta: −0.02299 bpb
Status: Non-record exploration (int6 canonical eval crashed; see Disclaimer)

This PR presents a theoretically-grounded redesign of FOMAML meta-TTT that
addresses every identified flaw from #1501 ablation — and demonstrates that the
TTT ceiling is architecture-limited, not initialization-limited. Three training
procedures (original FOMAML, no meta-TTT, redesigned FOMAML) all produce the same
~0.023 bpb TTT delta, proving the ceiling is set by the bank dimensionality and TTT
optimizer, not by meta-training.

See also: [PR: Position-Conditional Bigram + Ablation] #1501 ,
which introduces the base architecture and proves FOMAML meta-TTT adds only
+0.00036 bpb in its original formulation.

TL;DR — Key Learnings for the Community

TTT adaptation ceiling is architecture-limited. Three different meta-training
objectives — same-batch FOMAML, no meta-training, and cross-chunk FOMAML with
Δ-loss — all produce the same ~0.023 bpb TTT improvement. No meta-training
objective can move this ceiling. To raise it, you need more adaptable parameters
(more bank layers, LoRA-style correctors) or a better TTT optimizer (Adam,
more epochs, higher LR).
Three different training procedures find equidistant solutions in weight space
with identical local curvature. Bank condition numbers (1.03–1.38), effective
ranks (22 for attention, 11 for MLP), and energy distributions are identical
across all three models. The loss landscape is degenerate: many equivalent
minima exist, meta-TTT selects which one you land in, but the TTT adaptation
surface looks the same from every minimum.
MetaSGD per-layer LR learning needs a stronger signal. 66 learned per-bank-
per-layer learning rate scales all converged to their 1.0 initialization. One
meta-step every 4 training steps is too infrequent, and the meta-gradient is too
weak relative to the main task gradient, to drive per-layer differentiation.
Cross-chunk FOMAML is less disruptive than same-batch FOMAML. Subspace
overlap analysis shows the no-meta model and cross-chunk model share 73%
functional subspace, vs only 62% between the no-meta and same-batch models.
The biased same-batch meta-gradient systematically rotates the MLP input
subspace; the unbiased cross-chunk variant preserves it.
Always measure the TTT delta, not just the final score. If we'd only
compared final legal_ttt numbers, we might have concluded exp106's float-path
1.11469 was better than exp101's 1.11588. But the delta tells the real story:
exp106's better float baseline (1.1377) compensates for fewer training steps,
while the TTT improvement itself is the same.

NOTE: exp101: Position-conditional bigram hashing — splits the 4096×64 hash table's 4095 buckets into exclusive word-start [0,2047) and within-word [2047,4094) halves keyed on has_leading_space[current_token], eliminating bucket contention that forced the parent model's word_start_boost gate to 0.007 → legal_ttt 1.11588 (zero extra params).

exp105a: Single-flag ablation (META_TTT_ENABLED=1→0, everything else byte-identical to exp101) proving FOMAML meta-TTT contributes only +0.00036 bpb at 3% compute overhead → legal_ttt 1.11624 — the meta-training was equivalent to gradient noise.

Disclaimer

Hardware: All runs use a single H100 80 GB SXM GPU with MAX_WALLCLOCK_SECONDS=4800
(80-minute cap). This provides 4800 GPU-seconds of compute, matching the competition's
standard 8×H100 @ 10 min budget at substantially lower cost.
Early stopping due to wallclock: exp106 completed 6686 of 7500 steps —
~11% fewer than the ablation (exp105a: 7226 steps). This is because MetaSGD's
extra gradient storage (+8.6 GB peak memory) slowed each step from ~663 ms to
~718 ms, consuming the 80-minute budget faster. The model was still in the
warmdown phase when stopped.
Int6 canonical eval crashed: After GPTQ quantization, eval_model.load_state_dict()
failed with RuntimeError: Missing key(s): "meta_sgd_qo", "meta_sgd_kv", "meta_sgd_up", "meta_sgd_down" because the 66 MetaSGD parameters were correctly excluded from the
16 MB export but GPT.__init__ still registers them. This meant the in-script int6
roundtrip evaluation and canonical legal_ttt could not run. A hotfix was applied to
the standalone ttt_from_checkpoint.py harness, which produced the float-path and
partial int6 numbers reported here. Where int6 canonical values are unavailable, they
are marked "—".
Non-record: This experiment is a non-record exploration (non_record: true). It
exists to answer the question "can a better meta-TTT formulation move the TTT ceiling?"
Cost constraint: GPU time was limited. The partial int6 TTT run (80% complete) was
terminated when the trajectory showed no convergence trend different from the baseline.
Projected final value is ~1.118, consistent with the invariant ~0.023 delta.

Architecture Overview

Base Architecture

This experiment shares the identical architecture as PR #1501 (exp101). We reproduce
the full specification here for self-containment.
lots of inspiration from : PR #1019

Component	Configuration	What it does
Model	11-layer U-Net GPT	5 encoder blocks + 6 decoder blocks with skip connections between corresponding encoder-decoder pairs. Skip connections (additive residuals) help gradient flow and allow the decoder to reference early-layer representations.
Hidden dim	512	Width of the residual stream.
Attention	8Q / 4KV (GQA)	Grouped-Query Attention: 8 query heads share 4 key-value heads (2:1 ratio). Halves KV param count with minimal quality loss.
MLP	3× expansion (1536)	SwiGLU feed-forward network: 512 → 1536 → 512.
Vocabulary	1024 tokens	SentencePiece BPE on fineweb10B.
Embeddings	Tied (`tok_emb = lm_head^T`)	Input embedding and output projection share weights.
RoPE	Partial, 16/64 dims	Rotary Position Embeddings on 25% of head dimensions.
XSA	All 11 blocks	Cross-layer Shared Attention: Q/K/V/O and MLP weights stored as banked 3D tensors shared across all layers (see PR #1501 for full explanation). The 4 banks (`qo_bank`, `kv_bank`, `mlp_up_bank`, `mlp_down_bank`) are the parameters adapted during TTT.
VE	Layers 7–10	Value Embeddings: additional value projection on the last 4 layers.
Bigram	4096×64, position-conditional	Hash-based bigram table with word-start/within-word bucket split (see PR #1501 for full explanation).
Total params	26,960,991	~27M trainable parameters.

Training Pipeline

Component	Configuration
Optimizer	Muon (matrices) + AdamW (embeddings, scalars)
Schedule	Cosine warmdown, adaptive trigger
EMA	Decay 0.998
SWA	Every 50 steps during warmdown
Late QAT	Threshold 0.25
Batch	786,432 tokens (4× grad accumulation on 1 GPU)

Quantization and TTT

Same pipeline as PR #1501 : GPTQ int6 (attn+MLP) / int8 (embed) → LZMA → 16 MB.
TTT: SGD + cosine LR, momentum 0.9, 4 epochs, 947 chunks × 65K tokens.
Scoring: score-first-then-adapt (legal_ttt).

Innovation — What This PR Introduces

Motivation: Why Meta-TTT Needed a Redesign

PR #1501's ablation (exp105a) proved that exp101's FOMAML meta-TTT adds only +0.00036
bpb. But the concept of meta-TTT — training the model to adapt faster at test
time — is theoretically sound (MAML-style learning works in the meta-learning
literature). The failure had three identifiable structural causes:

Flaw	What's wrong	How it hurts
(A) Same-batch inner/outer	Inner loop adapts on batch X, outer evaluates on batch X	Meta-gradient rewards banks that resist SGD on seen data — the opposite of "generalize to unseen test chunks"
(B) No adaptation reward	Outer loss = absolute `L(banks'; X)`	A bank with low initial loss that gets worse under the inner step is rewarded equally as one that improves. No term explicitly rewards the improvement from adaptation.
(C) Uniform inner LR	All 4 bank types × 11 layers use `inner_lr = 0.002`	The optimal adaptation speed for a shallow attention bank vs a deep MLP bank is likely different. No mechanism to learn this.

Innovation A: Cross-Chunk Split

Split the training batch B (shape [batch, seq_len]) along the batch dimension
into two halves. The first half provides the inner-loop adaptation data, the second
half provides the outer-loop evaluation data:

Inner: banks' ← banks − α · s ⊙ ∇L(banks; B_first_half)
Outer: L_meta = L(banks'; B_second_half)              ← DIFFERENT documents

Because the dataloader draws independent random sequences from fineweb10B, B_first_half
and B_second_half come from different documents. This matches the TTT deployment
regime: adapt on document i, get scored on document j.

Fallback: When per-GPU batch size = 1 (not our case, but handled), falls back
to sequence-half split (first/last 1024 tokens of the same sequence).

Innovation B: Delta-Loss Outer Objective

Instead of optimizing absolute post-adaptation loss, we add a term that explicitly
rewards the improvement from the inner step:

L_meta = (w_post + w_Δ) · L_post − w_Δ · L_pre

where:
  L_post = L(banks'; B_second_half)    ← loss AFTER adaptation
  L_pre  = L(banks;  B_second_half)    ← loss BEFORE adaptation (detached banks)
  w_post = 0.5  (META_TTT_LOSS_WEIGHT)
  w_Δ    = 0.3  (META_TTT_DELTA_WEIGHT)

Expanding: L_meta = 0.5 · L_post + 0.3 · (L_post − L_pre)

The second term is the adaptation delta: it directly penalizes banks where the
inner step makes things worse and rewards banks where it helps. A bank that starts
with low loss but doesn't improve gets penalized by the −w_Δ · L_pre term.

Cost: One extra forward pass per meta-step (computing L_pre).

Innovation C: MetaSGD — Learned Per-Layer LR Scales

For each bank type k ∈ {qo, kv, up, down} and each layer l:

banks'[k, l] = banks[k, l] − α · s[k, l] · ∇L(banks[k, l]; B_inner)

where s[k, l] ∈ R+ is a learned scalar initialized to 1.0. Shapes:

Parameter	Shape	Count	Purpose
`meta_sgd_qo`	(22,)	22	Per-slot LR scale for query-output bank
`meta_sgd_kv`	(22,)	22	Per-slot LR scale for key-value bank
`meta_sgd_up`	(11,)	11	Per-layer LR scale for MLP up-projection
`meta_sgd_down`	(11,)	11	Per-layer LR scale for MLP down-projection
Total	—	66	Excluded from 16 MB export (0 bytes in submission)

If meta-TTT works, different layers should learn different scales — e.g., shallow
attention layers might need larger inner-loop steps than deep MLP layers. The
scales are registered as nn.Parameter and receive gradients via the outer loss
backprop. They are excluded from the exported final_model.pt and
final_model.int6.ptz to preserve the 16 MB budget.

Implementation detail: The inner-loop update is built as a differentiable
non-leaf tensor so a single backward pass populates both MetaSGD scale gradients
(via leaf autograd) and bank FOMAML gradients (via retain_grad + manual copy).

Results

exp106 — Meta-TTT Redesign

Metric	Value	Source	Note
Steps completed	6686 / 7500	wallclock cap	−334 vs exp101 (MetaSGD overhead)
val_bpb @ step 3000	1.2251	training log
val_bpb @ step 6000	1.1431	training log	Best of the three at matched step
Post-EMA val_bpb	1.1377	training log	Slightly worse than exp101 (fewer steps)
MetaSGD params exported	0 (66 excluded)	by design
Int6 val_bpb (roundtrip)	—	crashed	`meta_sgd_*` strict-load RuntimeError
Model size (int6+lzma)	15.02 MB	final artifact
Total submission size	15.14 MB	model + code
Peak GPU memory	31,695 MiB	training log	+8.6 GB vs exp101 (MetaSGD gradients)
Float baseline bpb	1.13767	ttt_from_checkpoint_float_qatoff.log
Float-path legal_ttt	1.11469	ttt_from_checkpoint_float_qatoff.log
Float TTT delta	−0.02299	computed
Int6 TTT (partial 80%)	1.11800	ttt_int6_ep4_partial.log (chunk 761/947)
Projected int6 legal_ttt	~1.118	trajectory extrapolation
Late QAT fired	step 5110	training log
SWA started	step 5300	training log

Int6 TTT Trajectory (partial, 80% complete)

Chunk progress	bpb	Source
401 / 947 (42%)	1.117622	ttt_int6_ep4_partial.log
621 / 947 (66%)	1.118994	ttt_int6_ep4_partial.log
661 / 947 (70%)	1.116769	ttt_int6_ep4_partial.log
681 / 947 (72%)	1.116469	ttt_int6_ep4_partial.log
761 / 947 (80%)	1.117976	ttt_int6_ep4_partial.log

Baseline (int6 canonical): 1.14160. Running delta at 80%: −0.02362.
The trajectory is flat in the 66–80% range. Projected final: ~1.118.

MetaSGD Scale Convergence

All 66 learned LR scales converged to values near their 1.0 initialization:

Parameter group	Mean	Std	Min	Max
meta_sgd_qo (22 scales)	~1.00	<0.04	>0.92	<1.08
meta_sgd_kv (22 scales)	~1.00	<0.04	>0.93	<1.07
meta_sgd_up (11 scales)	~1.00	<0.03	>0.94	<1.06
meta_sgd_down (11 scales)	~1.00	<0.03	>0.95	<1.05

Interpretation: No per-layer differentiation was learned. The meta-training
signal (1 meta-step per 4 training steps, at ~30% of main gradient magnitude)
is too weak to push 66 scalar parameters away from their initialization over
6686 training steps.

Analysis — Complete Meta-TTT Lineage (All Three Experiments)

This section summarizes the findings across all three experiments in this series.
A reader who sees only this PR should be able to understand the full meta-TTT story.

The Three Experiments

#	Name	Meta-TTT variant	Architecture changes	legal_ttt	TTT delta
exp101	Record (PR #1501 )	FOMAML, same-batch inner/outer, every 4 steps	Pos-conditional bigram, trigram, SGD+cosine TTT	1.11588	−0.02342
exp105a	Ablation (PR #1501)	Disabled (`META_TTT_ENABLED=0`)	Identical to exp101	1.11624	−0.02331
exp106	Redesign (this PR)	Cross-chunk + Δ-loss + MetaSGD, every 4 steps	Identical to exp101	1.11469*	−0.02299

*Float-path TTT; int6 canonical unavailable due to strict-load crash.

The Central Finding: TTT Delta Invariance

Experiment	Baseline bpb	Post-TTT bpb	TTT delta	Source
exp101 (FOMAML, int6)	1.13930	1.11588	−0.02342	logs_seed42.txt
exp105a (no meta, int6)	1.13956	1.11624	−0.02331	logs_seed42.txt
exp106 (redesign, float)	1.13767	1.11469	−0.02299	ttt_from_checkpoint_float_qatoff.log
exp106 (redesign, int6 80%)	1.14160	~1.118	~−0.024	ttt_int6_ep4_partial.log

The TTT delta is −0.023 ± 0.001 bpb across all variants. Three different
training objectives — from "no meta-signal" to "theoretically correct cross-document
generalization reward" — produce the same adaptation improvement.

Three-Way Weight-Space Analysis

We ran 8 analyses comparing all three models pairwise (script:
supporting_files/analysis_three_way.py, CPU-only, 3.6s on M2):

Triangle Geometry: Equidistant Solutions

                        exp101 (FOMAML)
                       /      \
              2336    /        \  2356         (bank L2 distances)
                     /          \
     exp105a (no meta) ──────── exp106 (redesign)
                        2324

All three models are approximately the same distance from each other. Meta-TTT
doesn't push you in a consistent direction — it pushes you to a random neighboring
basin, and the specific basin depends on the meta-gradient formulation.

Subspace Overlap: Cross-Chunk Preserves the Natural Subspace

Pair	Avg subspace cosine	Frac dims aligned
exp101 vs exp105a (FOMAML vs no-meta)	0.615	0.411
exp101 vs exp106 (FOMAML vs redesign)	0.659	0.472
exp105a vs exp106 (no-meta vs redesign)	0.727	0.548

The redesigned cross-chunk FOMAML (exp106) produces a solution closer in
functional subspace to the no-meta baseline than the original same-batch
FOMAML (exp101) does. The biased same-batch meta-gradient rotates the subspace
more than the unbiased cross-chunk variant.

Most striking: mlp_up_bank subspace cosine is 0.949 between exp105a and
exp106 (nearly identical) but only 0.551 between exp101 and exp105a (half-
rotated). Same-batch FOMAML systematically distorts the MLP input features.

Error Surface: Identical Curvature at All Three Minima

Property	exp101	exp105a	exp106	Interpretation
Bank avg condition number	1.2	1.2	1.2	Near-isotropic — SGD works equally well in all directions
Bank avg effective rank	16.5	16.5	16.5	All bank dimensions contribute equally
Bank avg top-5 energy frac	0.37	0.37	0.37	Uniform energy distribution
Quantization MSE	8.686e-5	8.691e-5	8.686e-5	Identical sensitivity to int6

This is why the TTT delta is invariant: the local curvature of the loss
landscape — the surface that SGD navigates during TTT — is identical at all three
minima. SGD makes the same progress per step from any starting point.

Mode Connectivity: Three Distinct Basins

Pair	Midpoint norm ratio	Assessment
exp101 vs exp105a	0.786	Different basins
exp101 vs exp106	0.793	Different basins
exp105a vs exp106	0.807	Borderline same basin
3-way centroid	0.704	Clearly distinct (30% norm loss)

The three models occupy distinct local minima. exp105a and exp106 are closest
to being in the same basin (ratio 0.807, threshold ~0.8), consistent with
cross-chunk FOMAML being less disruptive than same-batch FOMAML.

Why Meta-TTT Cannot Move the Ceiling

The argument from curvature invariance: TTT improvement depends on (1) how
far SGD can move the banks in 4 epochs (fixed by TTT config) and (2) how much
loss reduction each step buys (determined by local curvature). We showed the
curvature is identical at all three minima. Therefore the TTT delta must be
identical — QED.

The argument from over-parameterization: The training loss surface has a
degenerate set of equivalent minima (the three models prove this). Meta-TTT
selects a different minimum but cannot escape the set. All minima in the set
have the same curvature and the same TTT potential. To escape, you'd need a
stronger perturbation: second-order MAML, many more inner steps, or a dedicated
meta-training phase after warmdown.

The argument from MetaSGD: If per-layer LR differentiation could help, the
66 MetaSGD scales should have diverged from their 1.0 initialization. They
didn't. The meta-gradient signal at 1 step per 4, with loss weight 0.5, is
too weak to drive 66 scalar parameters in 6686 training steps.

Learnings for the Community

The TTT adaptation ceiling is set by architecture, not initialization.
~0.023 bpb is invariant across three FOMAML variants (same-batch, none,
cross-chunk + Δ-loss + MetaSGD). To improve TTT, change the bank dimensionality
or the TTT optimizer — not the training-time meta-objective.
First-order MAML with 1 inner step on a well-trained model ≈ gradient noise.
After 6000+ training steps, the banks are near a local optimum. A single inner
SGD step barely perturbs them, so the FOMAML outer gradient carries near-zero
functional signal regardless of how the inner/outer data is split.
Cross-chunk FOMAML is less harmful than same-batch FOMAML (even though both
are useless for TTT). Same-batch FOMAML introduces a systematic directional
bias that rotates the MLP input subspace 45° from the natural optimum. Cross-
chunk FOMAML's unbiased meta-gradient preserves the natural subspace (cos 0.95).
MetaSGD needs a stronger signal to learn meaningful per-layer differentiation.
At 1 meta-step per 4 training steps with loss weight 0.5, the effective meta-
gradient energy is ~7.5% of total gradient. This is insufficient to drive 66
scalar parameters away from their initialization over 6686 steps.
Three equivalent minima with identical local curvature — the loss landscape
of a Muon-trained 27M-param transformer has a degenerate set of solutions.
Meta-learning perturbations select among them but cannot improve them. This
is consistent with overparameterization theory and with empirical results from
lottery ticket and mode connectivity research.
Measure the delta, not the score. If we'd only compared final bpb numbers,
exp106's 1.11469 looks better than exp101's 1.11588. But the TTT delta
(architecture-level metric) is the same. The per-experiment score difference comes
from different pre-TTT baselines (1.1377 vs 1.1393), which are driven by the
number of training steps completed, not by meta-TTT quality.

Related PRs

PR [Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation #1501 — Position-Conditional Bigram + Ablation (exp101 + exp105a): Introduces
the base architecture (position-conditional bigram hashing, a zero-parameter trick
that improves legal_ttt by 0.001 bpb) and the controlled ablation proving same-batch
FOMAML meta-TTT contributes only +0.00036 bpb. The ablation finding is the
motivation for this PR's redesign.

Folder Structure

pr2_metattt_redesign/
├── pull_summary.md                          ← this file
├── experiment_exp106/                       ← META-TTT REDESIGN (non-record)
│   ├── train_gpt.py                         ← full training script with A+B+C (123K)
│   ├── submission.json                      ← metadata + results
│   ├── logs_seed42.txt                      ← condensed training metrics
│   ├── training_stdout_seed42.txt           ← full training stdout (128K)
│   └── supporting_files/
│       ├── README.md                        ← detailed experiment writeup
│       ├── run.sh                           ← training config (META_TTT_SPLIT=batch, etc.)
│       ├── Inference.ipynb                  ← model loading + eval + TTT visualization
│       ├── save_model.py                    ← checkpoint export (meta_sgd exclusion)
│       ├── ttt_eval.py                      ← TTT evaluation harness
│       ├── ttt_from_checkpoint.py           ← standalone TTT eval (hotfixed for meta_sgd)
│       ├── ttt_from_checkpoint.log          ← int6 canonical TTT attempt
│       ├── ttt_from_checkpoint_float_qatoff.log  ← complete float-path TTT run
│       ├── ttt_int6_ep4_partial.log         ← partial int6 TTT (80% complete)
│       ├── requant_mixed_precision.py       ← mixed int6/int7 attempt (over budget)
│       ├── requant_mixed_v1.log             ← mixed-precision log
│       ├── ERROR_SURFACE_ANALYSIS.md        ← three-way error surface geometry study
│       ├── META_TTT_ANALYSIS.md             ← two-way weight-space analysis (exp101 vs exp105a)
│       ├── analysis_three_way.py            ← three-way analysis script (8 analyses, 3.6s)
│       ├── analysis_three_way.json          ← three-way numerical results
│       ├── analysis_meta_ttt.py             ← two-way analysis script (5 analyses, 1.3s)
│       └── analysis_meta_ttt.json           ← two-way numerical results

MatoTeziTanka · 2026-04-12T05:02:23Z

Community Review — [Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Analysis ### TTT Pattern: Score-First Per Chunk (LEGAL) The TTT implementation in `eval_val_sliding_ttt` (lines 1190–1330) follows the legal score-first-per-chunk pattern: 1. Score first under inference_mode (lines 1242–1272): Each chunk's windows are scored under `torch.inference_mode()` using `base_model.eval()`. Loss, token counts, and byte counts are accumulated. 2. Train after scoring, with is_last guard (lines 1273–1304): Training only occurs if `not is_last` (line 1274). The model is switched to `.train()` mode, and multi-epoch SGD runs on the chunk's val_tokens. This matches the PR #1413 legal pattern exactly. `python # Line 1243: score first with torch.inference_mode(): ... loss_sum += scored_nll.sum() # accumulate scored region # Line 1273–1304: train after, guarded is_last = (ci == num_chunks - 1) if not is_last and args.ttt_epochs > 0: base_model.train() for _ep in range(args.ttt_epochs): ... optimizer.step() # SGD, not AdamW` The optimizer is SGD (line 1228), not AdamW — this is relevant context but not itself a classification criterion. ### N-gram Hash: LEGAL The bigram hash at line 806: `python torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod` This is `input XOR input` (adjacent tokens), not `ctx_hash XOR (target * primes[k])`. The illegal pattern requires the target to be XOR'd into the lookup key. Neither bigram_hash nor trigram_hash (line 822) involves the target token — they only hash context tokens to produce an embedding for the current position. ### Meta-TTT (FOMAML): No Val-Token Violation Meta-TTT (`meta_ttt_step`, line 1780) operates on training micro-batches (variables `x`, `y` from the training loop at line 2160), not on...

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

SPThole and others added 11 commits March 24, 2026 18:09

updated sub

2e1278e

updt readme

db9ea39

Update README.md

a029a31

Merge branch 'openai:main' into main

2c2d807

added qk init non record one H100

d7ec40c

removing non record

3019166

Merge branch 'openai:main' into main

940cb57

adding exps

de42f2b

Merge branch 'openai:main' into main

67e19da

updated meta ttt design

2145ccc

removed earlier submission

a8402cb

SPThole changed the title ~~[Non Record] Meta-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD~~ [Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD Apr 9, 2026

SPThole mentioned this pull request Apr 9, 2026

[Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation #1501

Open

update git id

4828468

SPThole mentioned this pull request Apr 13, 2026

[non-record/notable] Sharpness-Aware Minimization (SAM) Inner Loop for Meta-TTT #1601

Open

SPThole mentioned this pull request May 3, 2026

Leaderboard Update (Review Request) #2151

Open

SPThole changed the title ~~[Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD~~ [Non Record/notable] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD May 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Non Record/notable] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD#1502

[Non Record/notable] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD#1502
SPThole wants to merge 12 commits intoopenai:mainfrom
SPThole:non_record_4

SPThole commented Apr 9, 2026 •

edited

Loading

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

SPThole commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR — Key Learnings for the Community

Disclaimer

Architecture Overview

Base Architecture

Training Pipeline

Quantization and TTT

Innovation — What This PR Introduces

Motivation: Why Meta-TTT Needed a Redesign

Innovation A: Cross-Chunk Split

Innovation B: Delta-Loss Outer Objective

Innovation C: MetaSGD — Learned Per-Layer LR Scales

Results

exp106 — Meta-TTT Redesign

Int6 TTT Trajectory (partial, 80% complete)

MetaSGD Scale Convergence

Analysis — Complete Meta-TTT Lineage (All Three Experiments)

The Three Experiments

The Central Finding: TTT Delta Invariance

Three-Way Weight-Space Analysis

Triangle Geometry: Equidistant Solutions

Subspace Overlap: Cross-Chunk Preserves the Natural Subspace

Error Surface: Identical Curvature at All Three Minima

Mode Connectivity: Three Distinct Basins

Why Meta-TTT Cannot Move the Ceiling

Learnings for the Community

Related PRs

Folder Structure

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Community Review — [Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SPThole commented Apr 9, 2026 •

edited

Loading