[non-record / wishlist] E2E TTT — full-model SGD per chunk, val_bpb 1.07063, demonstrates "healing property" #1837
Open
X-Abhishek-X wants to merge 1 commit into openai:main
Summary
This is a non-record / wishlist submission addressing the openai/parameter-golf README §Requests for PRs item: "State-space models, E2E TTT, super long context for evaluation or training".
It provides a working full-model E2E TTT implementation with distributed lockstep gradient synchronization, demonstrating the wishlist item end-to-end on top of my existing PR #1695 record submission.
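As a rough illustration of the lockstep idea (not the code in train_gpt.py), the sketch below simulates several ranks in plain Python. Each rank scores its shard of a chunk with the current weights, computes a local gradient, and then every rank applies the same averaged gradient in an SGD step. The toy one-parameter linear model, the function names, the learning rate, and the data are all assumptions made for the sketch; a real run would average actual gradient tensors with a distributed all-reduce collective.

```python
def local_loss(w, shard):
    # Mean squared error of the toy model y_hat = w * x on one rank's shard.
    return sum((w * x - y) ** 2 for x, y in shard) / len(shard)


def local_grad(w, shard):
    # Derivative of local_loss with respect to w.
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    # Stand-in for a distributed all_reduce(MEAN):
    # every rank receives the same averaged value.
    mean = sum(values) / len(values)
    return [mean] * len(values)


def e2e_ttt_sim(chunks, world_size=8, lr=0.05, w0=0.0):
    # One weight copy per simulated rank, all starting identical.
    weights = [w0] * world_size
    scored = []
    for chunk in chunks:
        # Each rank takes a strided shard of the chunk.
        shards = [chunk[r::world_size] for r in range(world_size)]
        # Score the chunk with the pre-update weights.
        losses = [local_loss(weights[r], shards[r]) for r in range(world_size)]
        scored.append(sum(losses) / world_size)
        # Local gradients differ per rank; the averaged gradient does not.
        grads = [local_grad(weights[r], shards[r]) for r in range(world_size)]
        synced = all_reduce_mean(grads)
        # Every rank takes the same SGD step, so the copies stay in lockstep.
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights, scored


# Hypothetical data: four identical chunks of 16 points drawn from y = 3x.
chunks = [[(0.1 * k, 0.3 * k) for k in range(1, 17)] for _ in range(4)]
weights, scored = e2e_ttt_sim(chunks)
```

Because each rank applies the identical averaged gradient to an identical starting weight, the per-rank copies remain exactly equal after every chunk, which is the property the lockstep synchronization is meant to guarantee.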
Result
My original contributions in this submission
eval_val_e2e_ttt + _select_e2e_ttt_params (in train_gpt.py): full-model SGD per chunk, generalizing chunk-LoRA Phased TTT (PR #1695, [Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT, val_bpb 1.0759) to the full parameter set
Gradient all_reduce(MEAN) across all 8 ranks before optimizer.step, ensuring every rank's E2E TTT trajectory stays byte-identical
Key observation — "healing property"
SpinQuant + GPTQ degraded the model from a pre-quant val_bpb of 1.07125 to a post-quant val_bpb of 6.47968 (a 5.4 BPB regression; the model is essentially broken on cold inference). E2E TTT recovered it to 1.07063 within the eval window, fully healing the quantization damage and finishing 0.00062 BPB below the pre-quant value.
This suggests aggressive quantization may be more recoverable than commonly assumed when paired with full-model TTT. Worth further investigation as a wishlist research direction.
Concurrent / related work — @taka6745 #1818
@taka6745's concurrent PR #1818 ("Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT") characterizes a related effect from a different angle: GPTQ-int6 → pre-quant 1.1009, post-quant 3.4620, post-TTT 2.7663 (3-seed). Their submission documents partial TTT recovery (~30 % of the damage gap closed by sliding-window TTT) on a smaller initial damage (+2.36 BPB).
This submission is a complementary data point: a more aggressive quantization regime (SpinQuant + GPTQ, +5.4 BPB damage) paired with a stronger TTT variant (full-model E2E SGD with distributed lockstep grad-sync), yielding complete recovery (~100 % of the damage gap closed, ending slightly below the pre-quant val_bpb). Together the two PRs suggest that the recoverability of post-quant damage scales meaningfully with TTT capacity.
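The recovery fractions can be recomputed directly from the val_bpb figures quoted in the two PRs. The helper below is a small sketch; the function and variable names are illustrative and do not come from either submission.

```python
def recovered_fraction(pre_quant, post_quant, post_ttt):
    # Share of the quantization damage gap (post_quant - pre_quant)
    # that TTT closes; values above 1.0 mean the post-TTT score
    # ends up below the pre-quant score.
    return (post_quant - post_ttt) / (post_quant - pre_quant)


# val_bpb figures as reported in PR #1818 and in this PR.
frac_1818 = recovered_fraction(1.1009, 3.4620, 2.7663)
frac_this = recovered_fraction(1.07125, 6.47968, 1.07063)

print(f"#1818 damage gap: {3.4620 - 1.1009:.4f}, recovered: {frac_1818:.4f}")
print(f"this PR damage gap: {6.47968 - 1.07125:.5f}, recovered: {frac_this:.4f}")
```

Rounded to two decimals this gives about 0.29 for #1818 and 1.00 for this submission, with the latter marginally above 1 because the post-TTT val_bpb is lower than the pre-quant one.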
Files
Companion record
PR #1695 — Stage3 + SpinQuant V1 + MP-SGD-TTT, val_bpb 1.07590 (3-seed mean, std 0.00019).
Lineage credit
This submission builds on the bigbag #1493 architectural lineage (the standard parameter-golf base most top PRs fork from), and uses the legal score-first TTT framework established via @valerio-oai (Issue #402) and the original chunked TTT pattern from abaybektursun (#549). The novel contribution here is the full-model SGD generalization with distributed lockstep grad-sync, plus the healing-property observation.