Skip to content

[non-record / wishlist] E2E TTT — full-model SGD per chunk, val_bpb 1.07063, demonstrates "healing property"#1837

Open
X-Abhishek-X wants to merge 1 commit intoopenai:mainfrom
X-Abhishek-X:e2e-ttt-wishlist-non-record
Open

[non-record / wishlist] E2E TTT — full-model SGD per chunk, val_bpb 1.07063, demonstrates "healing property"#1837
X-Abhishek-X wants to merge 1 commit intoopenai:mainfrom
X-Abhishek-X:e2e-ttt-wishlist-non-record

Conversation

@X-Abhishek-X
Copy link
Copy Markdown

@X-Abhishek-X X-Abhishek-X commented Apr 26, 2026

Summary

This is a non-record / wishlist submission addressing the openai/parameter-golf README §Requests for PRs item: "State-space models, E2E TTT, super long context for evaluation or training".

A working full-model E2E TTT implementation with distributed lockstep gradient synchronization, demonstrating the wishlist item end-to-end on top of my existing PR #1695 record submission.

Result

My original contributions in this submission

Key observation — "healing property"

SpinQuant + GPTQ degraded the post-quant model from a pre-quant val_bpb of 1.07125 to 6.47968 (a 5.4 BPB regression — model is essentially broken on cold inference). E2E TTT recovered it to 1.07063 within the eval window — fully healing the quantization damage and slightly exceeding the pre-quant ceiling.

This suggests aggressive quantization may be more recoverable than commonly assumed when paired with full-model TTT. Worth further investigation as a wishlist research direction.

Concurrent / related work — @taka6745 #1818

@taka6745's concurrent PR #1818 ("Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT") characterizes a related effect from a different angle: GPTQ-int6 → pre-quant 1.1009, post-quant 3.4620, post-TTT 2.7663 (3-seed). Their submission documents partial TTT recovery (~30 % of the damage gap closed by sliding-window TTT) on a smaller initial damage (+2.36 BPB).

This submission is a complementary data point: a more aggressive quantization regime (SpinQuant + GPTQ, +5.4 BPB damage) paired with a stronger TTT variant (full-model E2E SGD with distributed lockstep grad-sync), yielding near-complete recovery (~99 % of the damage gap closed, slightly exceeding pre-quant). Together the two PRs suggest that the recoverability of post-quant damage scales meaningfully with TTT capacity.

Files

File Purpose
README.md Submission readme
PORTFOLIO_SUMMARY.md Full writeup with attribution + negative-result context
submission.json Metadata, scores, hyperparameters
train_gpt.py Patched training/eval script (MD5 4397db0c9025478d0251434044f0df44)
e2e_proof.log Run log proving val_bpb 1.07063 (MD5 6e6bd78df1e1acb2a1f9a0b45123865b)

Companion record

PR #1695 — Stage3 + SpinQuant V1 + MP-SGD-TTT, val_bpb 1.07590 (3-seed mean, std 0.00019).

Lineage credit

This submission builds on the bigbag #1493 architectural lineage (the standard parameter-golf base most top PRs fork from), and uses the legal score-first TTT framework established via @valerio-oai (Issue #402) and the original chunked TTT pattern from abaybektursun (#549). The novel contribution here is the full-model SGD generalization with distributed lockstep grad-sync, plus the healing-property observation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant