[non-record / wishlist] E2E TTT — full-model SGD per chunk, val_bpb 1.07063, demonstrates "healing property" by X-Abhishek-X · Pull Request #1837 · openai/parameter-golf

X-Abhishek-X · 2026-04-26T17:15:11Z

Summary

This is a non-record / wishlist submission addressing the openai/parameter-golf README §Requests for PRs item: "State-space models, E2E TTT, super long context for evaluation or training".

This PR adds a working full-model E2E TTT implementation with distributed lockstep gradient synchronization, demonstrating the wishlist item end-to-end on top of my existing record submission, PR #1695.

Result

val_bpb 1.07063 on the same checkpoint as my record submission, PR #1695 "[Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT" (val_bpb 1.07590)
−0.00527 BPB improvement over my own PR #1695 baseline, from E2E TTT alone
Non-record because eval time (1292s) exceeds the 600s record cap by design
1-seed (consistent with other non-record submissions like Will DePue's 4-hour baseline and Ciprian Ifrim's 1-bit submission)
Artifact 15,961,787 B (under 16 MB cap)
All training/eval hyperparameters in submission.json; full proof in e2e_proof.log
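
The headline numbers above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: the constant names are made up for illustration, and reading the "16 MB" cap as 16,000,000 bytes is an assumption (the artifact also fits under 16 MiB).

```python
# Values quoted in the Result list; names are illustrative, not from submission.json.
RECORD_EVAL_CAP_S = 600          # record-track eval time cap
ARTIFACT_CAP_BYTES = 16_000_000  # assumption: "16 MB" means decimal megabytes

baseline_bpb = 1.07590   # PR #1695 baseline
e2e_ttt_bpb = 1.07063    # this submission
eval_time_s = 1292
artifact_bytes = 15_961_787

delta_bpb = e2e_ttt_bpb - baseline_bpb            # negative means better
over_eval_cap = eval_time_s > RECORD_EVAL_CAP_S   # why this is non-record
artifact_headroom = ARTIFACT_CAP_BYTES - artifact_bytes

print(f"delta: {delta_bpb:+.5f} BPB")
print(f"eval over record cap: {over_eval_cap} ({eval_time_s / RECORD_EVAL_CAP_S:.2f}x)")
print(f"artifact headroom: {artifact_headroom} bytes")
```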

My original contributions in this submission

E2E TTT implementation (eval_val_e2e_ttt + _select_e2e_ttt_params in train_gpt.py): full-model SGD per chunk, generalizing the chunk-LoRA Phased TTT of PR #1695 to the full parameter set
Distributed lockstep gradient sync — all_reduce(MEAN) across all 8 ranks before optimizer.step, ensuring every rank's E2E TTT trajectory stays byte-identical
"Healing property" empirical observation (see below) — first reported here
Companion record PR #1695: SpinQuant V1 + MP-SGD-TTT recipe, val_bpb 1.07590 (3-seed mean, std 0.00019), an improvement of −0.0051 BPB over the bigbag base, PR #1493 "Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT" (val_bpb 1.0810, 3-seed mean)
Hyperparameter tunings (WD=0.095, MLR=0.022, EMA=0.9965) shipped in PR #1445 "[Record] 3-Layer Depth Recurrence + EMA 0.9965 + WD 0.095" (val_bpb 1.0889) and PR #1471 "[Record] SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965" (val_bpb 1.0866), also credited by @PranavViswanath in PR #1809 "Record: SP8192 + Gram-NS + Polar Express + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT" (val_bpb 1.0800, 3-seed mean)
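
To make the first two items concrete, here is a minimal pure-Python simulation of per-chunk full-parameter SGD with lockstep gradient averaging. It is a sketch under stated assumptions, not the code in train_gpt.py: a toy unigram model stands in for the transformer, `all_reduce_mean` is an in-process stand-in for the collective MEAN reduction, all function names are hypothetical, and scoring each chunk before training on it is my reading of the "score-first" framework credited under Lineage credit. The toy metric is bits per token rather than bits per byte.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score_bits(logits, tokens):
    """Total code length of `tokens` in bits under the toy unigram model."""
    probs = softmax(logits)
    return -sum(math.log2(probs[t]) for t in tokens)

def local_grad(logits, shard):
    """Gradient of the mean cross-entropy w.r.t. the logits on one rank's shard."""
    grad = softmax(logits)
    for t in shard:
        grad[t] -= 1.0 / len(shard)
    return grad

def all_reduce_mean(grads):
    """Stand-in for a collective MEAN: every rank receives the same average."""
    world = len(grads)
    mean = [sum(col) / world for col in zip(*grads)]
    return [list(mean) for _ in range(world)]

def e2e_ttt_eval(chunks, vocab_size, world_size=8, lr=0.5):
    # One replica of the full parameter set per simulated rank.
    replicas = [[0.0] * vocab_size for _ in range(world_size)]
    bits, count = 0.0, 0
    for chunk in chunks:
        # 1) Score first, with weights that have not trained on this chunk.
        bits += score_bits(replicas[0], chunk)
        count += len(chunk)
        # 2) Each rank computes a gradient on its own shard of the chunk.
        grads = [local_grad(replicas[r], chunk[r::world_size])
                 for r in range(world_size)]
        # 3) Average gradients across ranks before any rank steps.
        synced = all_reduce_mean(grads)
        # 4) Every rank applies the same plain SGD step to all parameters.
        for r in range(world_size):
            replicas[r] = [w - lr * g for w, g in zip(replicas[r], synced[r])]
    return bits / count, replicas

# Skewed toy data: each chunk needs at least `world_size` tokens.
chunk = [0] * 16 + [1] * 8 + [2] * 4 + [3] * 4
bits_per_token, replicas = e2e_ttt_eval([chunk] * 10, vocab_size=4)
print(f"{bits_per_token:.4f} bits/token")
print("replicas identical:", all(rep == replicas[0] for rep in replicas))
```

Because every simulated rank starts from the same weights and applies the same averaged gradient, the replicas never diverge even though the per-rank shards differ. That is the property the lockstep sync is meant to guarantee.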

Key observation — "healing property"

SpinQuant + GPTQ degraded the model from a pre-quant val_bpb of 1.07125 to a post-quant val_bpb of 6.47968 (a 5.4 BPB regression; the model is essentially broken on cold inference). E2E TTT recovered it to 1.07063 within the eval window, fully healing the quantization damage and ending slightly better than the pre-quant score (lower BPB is better).

This suggests aggressive quantization may be more recoverable than commonly assumed when paired with full-model TTT. Worth further investigation as a wishlist research direction.

Concurrent / related work — @taka6745 #1818

@taka6745's concurrent PR #1818 ("Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT") characterizes a related effect from a different angle: GPTQ-int6 → pre-quant 1.1009, post-quant 3.4620, post-TTT 2.7663 (3-seed). Their submission documents partial TTT recovery (~30 % of the damage gap closed by sliding-window TTT) on a smaller initial damage (+2.36 BPB).

This submission is a complementary data point: a more aggressive quantization regime (SpinQuant + GPTQ, +5.4 BPB damage) paired with a stronger TTT variant (full-model E2E SGD with distributed lockstep grad-sync), yielding complete recovery (~100 % of the damage gap closed, ending slightly better than pre-quant). Together the two PRs suggest that the recoverability of post-quant damage scales meaningfully with TTT capacity.
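
The "fraction of the damage gap closed" figures can be recomputed from the three BPB values each PR reports. The helper below is an illustrative sketch (`gap_closed` is a made-up name), with the gap defined as post-quant minus pre-quant BPB.

```python
def gap_closed(pre_quant, post_quant, post_ttt):
    """Fraction of the quantization damage (post_quant - pre_quant) undone by TTT."""
    damage = post_quant - pre_quant
    recovered = post_quant - post_ttt
    return recovered / damage

# BPB values as quoted in this PR description.
this_pr = gap_closed(pre_quant=1.07125, post_quant=6.47968, post_ttt=1.07063)
pr_1818 = gap_closed(pre_quant=1.1009, post_quant=3.4620, post_ttt=2.7663)

print(f"this submission: {this_pr:.2%} of the gap closed")
print(f"PR #1818:        {pr_1818:.2%} of the gap closed")
```

A value above 100 % means the post-TTT score is better than the pre-quant score, which is the case for this submission by a small margin.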

Files

File	Purpose
README.md	Submission readme
PORTFOLIO_SUMMARY.md	Full writeup with attribution + negative-result context
submission.json	Metadata, scores, hyperparameters
train_gpt.py	Patched training/eval script (MD5 4397db0c9025478d0251434044f0df44)
e2e_proof.log	Run log proving val_bpb 1.07063 (MD5 6e6bd78df1e1acb2a1f9a0b45123865b)
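
The MD5 values in the table can be checked locally with the standard library. This is a minimal sketch that assumes the two files sit in the current directory. `md5_of` and `verify` are illustrative helper names, not part of the submission.

```python
import hashlib
from pathlib import Path

# Checksums copied from the Files table.
EXPECTED_MD5 = {
    "train_gpt.py": "4397db0c9025478d0251434044f0df44",
    "e2e_proof.log": "6e6bd78df1e1acb2a1f9a0b45123865b",
}

def md5_of(path, block_size=1 << 20):
    """Hex MD5 of a file, read in blocks so large logs need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify(directory="."):
    """Map each listed file to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results

if __name__ == "__main__":
    for name, ok in verify().items():
        status = {True: "OK", False: "MISMATCH", None: "missing"}[ok]
        print(f"{name}: {status}")
```

MD5 is used here only because it is what the table provides. It detects accidental corruption but is not a defense against deliberate tampering.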

Companion record

PR #1695: Stage 3 + SpinQuant V1 + MP-SGD-TTT, val_bpb 1.07590 (3-seed mean, std 0.00019).

Lineage credit

This submission builds on the bigbag #1493 architectural lineage (the standard parameter-golf base most top PRs fork from), and uses the legal score-first TTT framework established via @valerio-oai (Issue #402) and the original chunked TTT pattern from abaybektursun (#549). The novel contribution here is the full-model SGD generalization with distributed lockstep grad-sync, plus the healing-property observation.

Non-record (wishlist): E2E TTT — full-model SGD per chunk, val_bpb 1.…

b87b494

…07063, healing-property observation

X-Abhishek-X wants to merge 1 commit into openai:main from X-Abhishek-X:e2e-ttt-wishlist-non-record
