[Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation by SPThole · Pull Request #1501 · openai/parameter-golf

SPThole · 2026-04-09T15:37:11Z

Track: non record (Track B, score-first-then-adapt) | Hardware: 1×H100 80 GB SXM
Best val_bpb (legal_ttt): 1.11588 (record_exp101) | TTT delta: −0.02342 bpb

This PR contains two experiments that form a complete unit:
(exp101) introducing position-conditional bigram hashing, and its controlled
ablation (exp105a) proving that the inherited FOMAML meta-TTT contributes
near-zero to the result.

See also: PR #1502
which builds on the ablation finding here and attempts a theoretically-grounded
redesign of the meta-TTT training loop.

which builds on the ablation finding here and attempts a theoretically-grounded
redesign of the meta-TTT training loop.

TL;DR — Key Learnings for the Community

Position-conditional bigram hashing is a zero-parameter trick that improves
legal_ttt by 0.001 bpb. If your model uses hash-based n-gram embeddings, check
whether different token classes (e.g., word-start vs within-word) are colliding
in the same buckets. Splitting the hash space by class can recover signal that
a shared hash was forced to suppress.
Always ablate inherited components. We ran 100+ experiments inheriting FOMAML
meta-TTT from an early ancestor without ever isolating its contribution. A
single-flag ablation revealed it adds +0.00036 bpb (noise) at 3% compute cost.
Those 3% translated to 206 lost training steps under a wallclock cap — a net
negative.
Same-batch FOMAML meta-TTT is equivalent to gradient noise in our setting.
It pushes the optimizer into a different local minimum (90-degree weight rotation)
but the new minimum has identical loss, identical TTT adaptation, and identical
quantization sensitivity. The rotation is a Muon optimizer artifact, not a
meaningful signal.
Weight-space cosine similarity is misleading under Muon. Two models trained
from the same seed with a 3% gradient perturbation show element-wise cosine of
0.05 (near-orthogonal) but principal-angle subspace cosine of 0.65 (partially
aligned). Use SVD-based subspace overlap for functional comparison, not raw
cosine.

Additional Info:
exp101: Position-conditional bigram hashing — splits the 4096×64 hash table's 4095 buckets into exclusive word-start [0,2047) and within-word [2047,4094) halves keyed on has_leading_space[current_token], eliminating bucket contention that forced the parent model's word_start_boost gate to 0.007 → legal_ttt 1.11588 (zero extra params).

exp105a: Single-flag ablation (META_TTT_ENABLED=1→0, everything else byte-identical to exp101) proving FOMAML meta-TTT contributes only +0.00036 bpb at 3% compute overhead → legal_ttt 1.11624 — the meta-training was equivalent to gradient noise.

Disclaimer

Hardware: All runs use a single H100 80 GB SXM GPU with MAX_WALLCLOCK_SECONDS=4800
(80-minute cap). This provides 4800 GPU-seconds of compute, matching the competition's
standard 8×H100 @ 10 min budget at substantially lower cost. Gradient accumulation
(factor 4) ensures per-step updates are equivalent to the 8-GPU batch.
Early stopping: Both experiments stopped before the configured ITERATIONS=7500
due to the wallclock cap (exp101 at step 7020, exp105a at step 7226). This is
expected behavior, not a hardware failure — the final ~300-500 steps would be in
the deep warmdown phase with diminishing returns.
Non-record: exp105a is a non-record ablation experiment (non_record: true).
It exists solely to measure meta-TTT's contribution. exp101 is the record submission.
Cost constraint: GPU time was limited (~$3/hr H100 spot). Experiments that
clearly were not meeting expectations were terminated early to preserve budget
for more promising directions. Where this affected results, missing values are
marked "—" with explanation.

Architecture Overview

Base Architecture

All experiments in this lineage share the following architecture. We describe
every component so this document is self-contained.
lots of inspiration from PR #1019

Component	Configuration	What it does
Model	11-layer U-Net GPT	5 encoder blocks + 6 decoder blocks with skip connections between corresponding encoder-decoder pairs. The skip connections (additive residuals) help gradient flow and allow the decoder to reference early-layer representations directly.
Hidden dim	512	Width of the residual stream. Every layer reads from and writes to this 512-dimensional vector per token position.
Attention	8Q / 4KV (GQA)	Grouped-Query Attention: 8 query heads share 4 key-value heads (2:1 ratio). This halves the KV cache size and KV parameter count relative to standard multi-head attention, saving ~25% of attention params with minimal quality loss.
MLP	3× expansion (1536)	Each block has a feed-forward network that projects 512 → 1536 → 512. Uses SwiGLU activation (two parallel projections, element-wise multiply, then down-project).
Vocabulary	1024 tokens	SentencePiece BPE trained on fineweb10B. Small vocab is a deliberate choice for 16 MB budget — larger vocab would consume too much embedding memory.
Embeddings	Tied (`tok_emb = lm_head^T`)	The input token embedding matrix and the output logit projection matrix are transposes of each other. This halves embedding parameter count (1024×512 = 524K params shared).
RoPE	Partial, 16 of 64 dims	Rotary Position Embeddings applied to only 25% of each attention head's dimensions (16 out of 64). The remaining 48 dims are position-free, allowing the model to learn position-invariant features.
XSA	All 11 blocks	Cross-layer Shared Attention — see detailed explanation below.
VE	Layers 7–10	Value Embeddings on the last 4 layers — see explanation below.
Total params	26,960,991	~27M trainable parameters before quantization.

What is XSA (Cross-layer Shared Attention)?

In a standard transformer, each layer has its own Q, K, V, and output projection
matrices. In XSA, these are replaced by banked weight matrices shared across
all layers:

qo_bank: shape (22, 512, 512) — 22 "slots" (2 per layer × 11 layers), shared
query-output projection. Each layer selects its 2 slots from the bank.
kv_bank: shape (22, 256, 512) — shared key-value projection.
mlp_up_bank: shape (11, 1536, 512) — shared MLP input projection (one per layer).
mlp_down_bank: shape (11, 512, 1536) — shared MLP output projection.

Why bank: The bank structure makes Test-Time Training (TTT) efficient. At eval
time, the model adapts to test data by running SGD on just these 4 bank tensors
(~24M of the 27M params). Because they're stored as contiguous 3D tensors rather
than scattered per-layer matrices, the TTT optimizer can update all layers in a
single operation.

How layers access banks: Each layer i reads qo_bank[2*i:2*i+2] for its
query/output weights and kv_bank[2*i:2*i+2] for key/value. The bank is a shared
pool; the per-layer "selection" is just indexing, not learned routing.

What is the Bigram Hash Table?

The model includes a hash-based bigram embedding table (bigram.embed.weight,
shape 4096×64) that provides a fast, parameter-cheap lookup of bigram statistics:

For each position t, compute hash(token[t-1], token[t]) mod 4095 → bucket index
Look up the 64-dimensional embedding at that bucket
Scale it by a learned scalar bigram.scale (~0.11 after training)
Add it to the residual stream at position t

This gives the model access to bigram transition statistics without any
attention computation. With 1024² ≈ 1M possible bigrams mapped to 4095 buckets,
each bucket serves ~256 bigram contexts on average (hash collision is by design —
the embeddings learn an average predictive signal across all colliding contexts).

word_start_boost: A learned scalar gate (initialized to 1.0) that scales
the bigram contribution specifically at word-start positions — positions where
the current token begins with a leading space (e.g., _the, _was, _and).
In the parent model, this gate collapsed to 0.007, meaning the model learned
to almost completely suppress the bigram signal at word-start positions. This
suppression was the key observation that motivated exp101's innovation.

Training Pipeline

Component	Configuration	Purpose
Optimizer	Muon (weight matrices) + AdamW (embeddings, scalars)	Muon uses Newton-Schulz orthogonalized gradients for matrix params, giving faster convergence. AdamW handles 1D/0D params where Muon doesn't apply.
LR	`MATRIX_LR=0.025` (Muon), `0.001` (AdamW)	Muon tolerates higher LR due to gradient preconditioning.
Schedule	Cosine warmdown from step ~2200	Warmdown gradually reduces LR to near-zero. Adaptive trigger fires when val loss plateaus.
EMA	Decay 0.998	Exponential Moving Average of weights. Final model uses EMA weights.
SWA	Every 50 steps during warmdown	Stochastic Weight Averaging further smooths the EMA during the final phase.
Late QAT	Threshold 0.25	Quantization-Aware Training activates when int8 quantization gap exceeds threshold, simulating quantization noise during forward passes to make the model robust to post-training quantization.
Batch	786,432 tokens (384 seqs × 2048 tokens)	Effective batch via 4× gradient accumulation on 1 GPU.

Quantization Pipeline (for 16 MB submission)

Step	Details
Quantization	GPTQ (Hessian-informed column reordering) with per-row int6 for attention + MLP weights, per-row int8 for embeddings
Calibration	Auto-regressive self-generated data: 64 sequences × 2048 tokens at temperature 0.8. The model generates its own calibration set, avoiding the need for external data.
Compression	LZMA on the quantized weight buffer. Achieves ~15 MB model artifact.
Budget	16 MB total (model weights + quantized code + any metadata)

Test-Time Training (TTT) — The Scoring Mechanism

The competition uses score-first-then-adapt evaluation (called legal_ttt
or eval_val_sliding_ttt):

Parameter	Value	What it does
Method	Sliding-window TTT	The validation set is split into 947 non-overlapping chunks of 65,536 tokens each. For each chunk, the model first scores (computes loss), then adapts (runs SGD on bank weights). The reported val_bpb is the average of all per-chunk scores.
Optimizer	SGD, momentum 0.9	Adapts the 4 bank tensors (qo, kv, mlp_up, mlp_down).
LR	0.004, cosine decay	Per-chunk learning rate schedule.
Epochs	4	Number of passes over each chunk for adaptation.
QAT mode	`CastedLinear._qat_enabled = True`	During int6 TTT, the adapted weights are quantized on-the-fly to simulate deployment conditions.
TTT delta	The bpb difference between the pre-TTT int6 baseline and the post-TTT legal_ttt score. Typically ~0.023 bpb for this architecture.

Innovation — What This PR Introduces

Innovation 1: Position-Conditional Bigram Hashing (exp101)

Problem observed: In the parent model, the bigram table's 4095 buckets are
shared between all (prev, curr) bigram contexts regardless of whether the
current token is a word-start (has leading space) or within-word. Analysis of the
parent checkpoint revealed:

Observation	Value	Implication
Word-start tokens' share of total loss	~70%	Word-start prediction is the dominant challenge
Mean loss at word-start positions	3.37 nats	Much harder than within-word (1.08 nats)
Learned `word_start_boost` value	0.007	Model actively suppressing bigram at word-start
Bucket sharing	100% of 4095 buckets reachable by both ws and non-ws pairs	No bucket is exclusively ws or non-ws — model can't selectively clean up
Loss impact of removing the gate	+0.017 nats (+0.025 bpb)	The gate IS doing real work — suppressing genuine noise

Root cause: Word-start bigram transitions (_was → _the, _the → _quick)
have enormous variance because the next word depends on semantic context that a
simple bigram can't capture. Within-word transitions (qu → ick, th → e)
are low-variance and highly predictable. When both types collide in the same hash
bucket, the learned embedding is a compromise that doesn't fit either well. The
model's only option is a global suppression gate.

Solution: Split the hash space by word-start class. The 4095 usable buckets
become two disjoint halves:

Bucket range	Assigned to	Contexts per bucket
`[0, 2047)`	Word-start `(prev, curr)` pairs where `has_leading_space[curr] = true`	~163 (was ~256 shared)
`[2047, 4094)`	Within-word `(prev, curr)` pairs where `has_leading_space[curr] = false`	~350 (was ~256 shared)
`4094`	Unused	—
`4095`	Sequence-start sentinel	—

The split key is has_leading_space[current_token], which is a deterministic
property of the current token (already in the causal window — no future leakage).
This is the same information the existing word_start_boost gate already uses,
so legality is preserved.

# Core implementation (from train_gpt.py)
def bigram_hash(self, tokens, has_leading_space):
    mod = self.bigram_vocab_size - 1   # 4095
    half = mod // 2                     # 2047
    base = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % half
    is_ws = has_leading_space[tokens[..., 1:].long()].to(torch.int32)
    shift = (1 - is_ws) * half  # ws → [0, 2047), non-ws → [2047, 4094)
    return base + shift

Parameter cost: Zero. Same 4096×64 table, same parameter count. Only the hash
function changes.

Innovation 2: Trigram Lookup (exp101)

In addition to the (t-1, t) bigram, we add a (t-2, t-1, t) trigram lookup
that hashes to the same table. This doubles the number of contexts per bucket but
each context carries more specific information (trigrams are more predictive than
bigrams). The trigram hash respects the same position-conditional split.

Parameter cost: Zero. Reuses the same embedding table.

Innovation 3: TTT Optimizer Correction (exp101)

The parent model configured AdamW+flat for in-training TTT but its reported
legal_ttt of 1.1169 was actually produced by a standalone SGD+cosine post-run.
We reverted to SGD+cosine during training to ensure the training-time and
eval-time TTT optimizers match. This is not novel but was a necessary correction.

Innovation 4: Single-Variable Meta-TTT Ablation (exp105a)

What is FOMAML meta-TTT? During training, every 4th step (META_TTT_EVERY=4),
the model runs a mini meta-learning loop:

Inner step: Take a gradient step on the bank weights using the current
training batch → produces adapted banks banks'
Outer evaluation: Compute loss with banks' on the same batch
Meta-gradient: Backpropagate the outer loss to the original bank weights
and accumulate it with the normal training gradient

The idea (from MAML) is that this teaches the banks to be "pre-positioned" for
fast adaptation, so TTT at eval time will be more effective.

The ablation: exp105a changes exactly one flag — META_TTT_ENABLED=1 → 0 —
with everything else byte-identical (same seed, same data order, same LR schedule,
same QAT timing, same SWA windows, same train_gpt.py source). This is the cleanest
single-variable ablation possible in this codebase.

Weight-space analysis: We ran 5 CPU-only analyses on the two checkpoints
(script: ablation_exp105a/supporting_files/analysis_meta_ttt.py, ~1.3s runtime):

Analysis	Method	Key finding
Weight deltas	Per-tensor cosine similarity and L2 distance	Bank weights are near-orthogonal (cos ~0.05–0.10) but scalar controls are identical (cos ~0.99)
Quant sensitivity	Per-row int6 simulation MSE	Identical: ratio 0.9989 (0.11% difference = noise)
Spectral	SVD spectrum: op-norm, condition number, stable rank	Condition number −8.2% for meta-TTT (only signal); all other metrics within 1%
Subspace overlap	Principal angles between top-k left-SV subspaces	Average subspace cosine 0.65 — models span partially the same functional subspace despite orthogonal element-wise weights
Mode connectivity	Midpoint norm ratio	0.799 — borderline different basins (threshold ~0.8)

Results

exp101 — Record Submission

Metric	Value	Source
Steps completed	7020 / 7500	wallclock cap at 4800s
val_bpb @ step 3000	1.2254	training log
val_bpb @ step 6000	1.1474	training log
Post-EMA val_bpb	1.1352	training log
Int6 val_bpb (roundtrip)	1.13930	logs_seed42.txt
legal_ttt val_bpb	1.11588	logs_seed42.txt
TTT delta (int6 → TTT)	−0.02342	computed
Model size (int6+lzma)	14.97 MB	final artifact
Total submission size	15.08 MB	model + code
Peak GPU memory	23,044 MiB	training log
Late QAT fired	step 5384	training log
SWA started	step 5600	training log

exp105a — Meta-TTT Ablation (non-record)

Metric	Value	Source
Steps completed	7226 / 7500	wallclock cap; +206 steps vs exp101 (no FOMAML overhead)
Post-EMA val_bpb	1.1353	training log
Int6 val_bpb (roundtrip)	1.13956	logs_seed42.txt
legal_ttt val_bpb	1.11624	logs_seed42.txt
TTT delta (int6 → TTT)	−0.02331	computed
Model size (int6+lzma)	14.94 MB	final artifact
Peak GPU memory	23,043 MiB	training log

Head-to-Head Comparison

Metric	exp101 (meta ON)	exp105a (meta OFF)	Delta	Interpretation
legal_ttt	1.11588	1.11624	+0.00036	Meta-TTT adds < 0.4 millibits — noise level
TTT delta	−0.02342	−0.02331	0.00011	Identical to 4 decimal places
Steps completed	7020	7226	+206	3% more steps from eliminated FOMAML overhead
Post-EMA val_bpb	1.1352	1.1353	+0.0001	Identical after EMA smoothing
Peak memory	23,044 MiB	23,043 MiB	−1 MiB	No memory difference
Per-step time	~684 ms	~663 ms	−21 ms (−3.1%)	FOMAML inner/outer loop overhead

Comparison with Parent Architecture

Metric	Parent model	exp101	Change
Bigram hash	Shared (all 4095 buckets mixed ws + non-ws)	Position-conditional (2047 ws + 2047 non-ws)	Split by word-start class
Trigram	Disabled	Enabled	Zero-param addition
TTT optimizer (train-time)	AdamW + flat LR	SGD + cosine LR	Corrected to match eval-time
legal_ttt	1.1169	1.11588	−0.0010 bpb improvement
Extra params	—	0	Zero-parameter change

Analysis

Why Position-Conditional Hashing Works

The theoretical prediction was that word-start bigrams have exploitable structure
(after sentence-ending punctuation, the next word-start is biased toward function
words and proper nouns; within a paragraph, the next word-start depends on
syntactic role). The position-conditional split lets the model learn this structure
in clean ws-only buckets rather than being forced to suppress everything via a
global gate.

Evidence it worked: The 0.001 bpb improvement from parent to exp101 is
consistent with the theoretical "realistic estimate" of ~0.01 bpb. The improvement
persists through quantization and TTT, confirming it's a genuine architectural gain
rather than an overfitting artifact.

Why the Ablation Kills the Meta-TTT Narrative

The same-batch FOMAML in exp101 has a fundamental objective mismatch:

Inner: banks' ← banks − α·∇L(banks; x_batch)     ← adapt on batch X
Outer: L_meta = L(banks'; x_batch)                 ← evaluate on SAME batch X

At eval time (TTT), the model adapts on chunk i and is scored on chunk i —
but the scoring happens before adaptation (score-first-then-adapt). The
meta-gradient optimizes for "banks that recover quickly from an SGD step on
seen data" — this rewards banks that resist change, not banks that
generalize to new data.

After 7000 training steps, the banks are already well-converged. The FOMAML
inner step barely moves them (small gradient on a near-optimum), so the outer
gradient (on the same data) carries essentially zero useful signal. The meta-TTT
degenerates into gradient noise.

Weight-Space Story: Orthogonal Weights, Same Function

The weight-space analysis (5 analyses, CPU-only, 1.3s) reveals a fascinating picture:

Element-level: Bank weight cosines are 0.05–0.10 (near-orthogonal). A 3%
training perturbation caused a 90° rotation in weight space. This is a Muon
amplification effect — Muon's Newton-Schulz gradient orthogonalization transforms
small gradient differences into large basis rotations.

Function-level: Principal-angle subspace cosines average 0.65, with
kv_bank at 0.955 (nearly identical subspace). The two models learned the same
functional subspace but expressed it in a different basis. Their outputs on any
given input are identical to 3-4 decimal places.

Implication: Raw weight cosine is not a meaningful similarity metric under
Muon. Use SVD-based principal-angle analysis instead.

Learnings for the Community

Hash bucket contention is analyzable and fixable. If you use hash-based
embeddings (bigram tables, feature hashing, locality-sensitive hashing), check
whether semantically different token classes are colliding in the same buckets.
A learned gate that collapses toward 0 is a strong signal of bucket pollution.
Position-conditional splitting is a zero-param fix.
Ablate before you optimize. We inherited FOMAML meta-TTT through 100+
experiments and multiple architecture changes without ever isolating its
contribution. A one-line flag change (META_TTT_ENABLED=0) revealed it was
contributing nothing. If we'd done this ablation 50 experiments earlier, we'd
have saved 3% of compute on every subsequent run.
Same-batch FOMAML is a trap for well-trained models. When the inner and
outer evaluation use the same data, the meta-gradient rewards parameter stability,
not adaptation ability. This is a known issue in meta-learning but is easy to
overlook when inheriting code from an early prototype where the model wasn't
well-trained yet.
Muon-trained models require subspace analysis, not cosine distance. The
Newton-Schulz orthogonalization in Muon amplifies small gradient perturbations
into large basis rotations. Two models from the same seed can be 90° apart in
weight space while computing the same function. Principal-angle subspace overlap
(via SVD) is the correct functional similarity metric.
The TTT delta is a property of architecture, not initialization. The ~0.023
bpb TTT improvement is identical whether meta-TTT is on or off. This implies the
TTT ceiling is set by the bank dimensionality and TTT optimizer configuration, not
by how the banks were initialized during training.

Related PRs

PR 2/2 — Meta-TTT Redesign (exp106): Takes the ablation finding from this
PR and tests whether a theoretically-correct redesign of FOMAML (cross-chunk
inner/outer split, delta-loss objective, learned per-layer LR scales) can move
the TTT ceiling. Spoiler: it can't — the TTT delta remains at ~0.023 bpb.
Includes a complete three-way weight-space analysis and error surface geometry
study across all three experiments.

Folder Structure

pr1_poscond_bigram_and_ablation/
├── pull_summary.md                          ← this file
├── record_exp101/                           ← RECORD SUBMISSION
│   ├── train_gpt.py                         ← full training script (115K)
│   ├── submission.json                      ← metadata + results
│   ├── logs_seed42.txt                      ← condensed training metrics
│   ├── training_stdout_seed42.txt           ← full training stdout (506K)
│   └── supporting_files/
│       ├── README.md                        ← detailed experiment writeup
│       ├── run.sh                           ← training launch script
│       ├── ttt_eval.py                      ← TTT evaluation harness
│       └── ttt.log                          ← TTT eval output
├── ablation_exp105a/                        ← META-TTT ABLATION (non-record)
│   ├── train_gpt.py                         ← identical to exp101
│   ├── submission.json                      ← metadata (non_record: true)
│   ├── logs_seed42.txt                      ← condensed training metrics
│   ├── training_stdout_seed42.txt           ← full training stdout (253K)
│   └── supporting_files/
│       ├── README.md                        ← ablation writeup
│       ├── run.sh                           ← only change: META_TTT_ENABLED=0
│       ├── Inference.ipynb                  ← model loading + eval notebook
│       ├── save_model.py                    ← checkpoint export script
│       ├── ttt_eval.py                      ← TTT evaluation harness
│       ├── ttt.log                          ← TTT eval output
│       ├── META_TTT_ANALYSIS.md             ← full weight-space analysis (5 analyses)
│       ├── analysis_meta_ttt.py             ← analysis script (CPU-only, 1.3s)
│       └── analysis_meta_ttt.json           ← numerical results (50K)

MatoTeziTanka · 2026-04-12T05:02:33Z

Community Review — [Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

Summary PR #1501 ("2026_04_09_poscond_bigram_and_ablation") submits two experiments (record_exp101 and ablation_exp105a) whose train_gpt.py files are byte-for-byte identical. The submission uses score-first-per-chunk TTT (SGD, not AdamW) on the dequantized model after GPTQ. All audit flags are CLEAR. --- ## 1. N-gram / hash XOR illegal pattern — CLEAR No `ctx_hash ^ (target * primes[k])` or any target-XOR-into-hash-key pattern exists. The BigramHashEmbedding uses only input-token XOR: - Line 794: `torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod` - Line 805: `(36313 * t[..., 2:] ^ 27191 * t[..., 1:-1] ^ 51497 * t[..., :-2]) % half` These are pure input-token (context) XOR hashes — the target never enters the hash key. This is the explicitly legal `input_ids[..., 1:] ^ input_ids[..., :-1]` BigramHash pattern. --- ## 2. Pre-Quant TTT / multi-epoch AdamW on val_tokens — CLEAR TTT runs AFTER GPTQ quantization (line 2264–2273). The optimizer used is SGD (line 1207: `torch.optim.SGD(...)`), not AdamW. The `ttt_epochs` hyperparameter controls SGD sweeps per chunk, and the training never touches val_tokens before the model is scored. --- ## 3. Score-first-per-chunk TTT — LEGAL (PR #1413 pattern confirmed) `eval_val_sliding_ttt` (lines 1169–1309) implements clean score-first ordering: - Lines 1221–1251: Each chunk is scored under `torch.inference_mode()` first, accumulating `loss_sum` / `scored_nll` before any gradient update. - Lines 1252–1283: Training (`optimizer.step()`) occurs only `if not is_last` — an exact `is_last_chunk` guard — ensuring the final chunk is never trained on, only scored. - The scored region maps to the upcoming eval windows for chunk `ci`; training operates on the...

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

SPThole and others added 11 commits March 24, 2026 18:09

updated sub

2e1278e

updt readme

db9ea39

Update README.md

a029a31

Merge branch 'openai:main' into main

2c2d807

added qk init non record one H100

d7ec40c

removing non record

3019166

Merge branch 'openai:main' into main

940cb57

adding exps

de42f2b

Merge branch 'openai:main' into main

67e19da

add ablation meta ttt

d4d8938

removed earlier submission

a51345f

SPThole changed the title ~~[Non Record] Position-Conditional Bigram Hashing + Meta-TTT Ablation~~ [Non Record] Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation Apr 9, 2026

SPThole mentioned this pull request Apr 9, 2026

[Non Record/notable] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD #1502

Open

SPThole changed the title ~~[Non Record] Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation~~ [Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation Apr 9, 2026

update git id

6436243

SPThole mentioned this pull request Apr 13, 2026

[non-record/notable] Sharpness-Aware Minimization (SAM) Inner Loop for Meta-TTT #1601

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation#1501

[Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation#1501
SPThole wants to merge 12 commits intoopenai:mainfrom
SPThole:non_record_3

SPThole commented Apr 9, 2026 •

edited

Loading

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

SPThole commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR — Key Learnings for the Community

Disclaimer

Architecture Overview

Base Architecture

What is XSA (Cross-layer Shared Attention)?

What is the Bigram Hash Table?

Training Pipeline

Quantization Pipeline (for 16 MB submission)

Test-Time Training (TTT) — The Scoring Mechanism

Innovation — What This PR Introduces

Innovation 1: Position-Conditional Bigram Hashing (exp101)

Innovation 2: Trigram Lookup (exp101)

Innovation 3: TTT Optimizer Correction (exp101)

Innovation 4: Single-Variable Meta-TTT Ablation (exp105a)

Results

exp101 — Record Submission

exp105a — Meta-TTT Ablation (non-record)

Head-to-Head Comparison

Comparison with Parent Architecture

Analysis

Why Position-Conditional Hashing Works

Why the Ablation Kills the Meta-TTT Narrative

Weight-Space Story: Orthogonal Weights, Same Function

Learnings for the Community

Related PRs

Folder Structure

Uh oh!

MatoTeziTanka commented Apr 12, 2026

Community Review — [Non Record] Learn to Learn: Position-Conditional Bigram Hashing + Meta-Learning + TTT Ablation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SPThole commented Apr 9, 2026 •

edited

Loading