Non-record: final frontier autopsy #2110
himanshudongre wants to merge 1 commit into openai:main from
Non-record: Final Frontier Autopsy
Track: non-record / methodology
Author: Himanshu Dongre
Date: 2026-05-01
Leaderboard claim: none
This is a non-record submission. It records my final attempt to improve the
late PR #2018 frontier, the logs behind that attempt, and the stop rule that
fell out of it.
The short version: every serious branch failed before quantization. On this
frontier, the trained model had to be competitive before GPTQ and TTT. Mine
was not.
Result Summary
The useful observation is where the runs failed. The gap was already around
+0.013 BPB before quantization. That was too large for quantization choices
or score-first TTT to rescue.
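The stop rule that fell out of this can be stated concretely. A minimal sketch, with an illustrative recoverable margin (the 0.005 threshold is my assumption for this sketch, not a measured quantity; only the +0.013 BPB gap comes from the runs above):

```python
def should_stop(candidate_bpb: float, frontier_bpb: float,
                recoverable_margin: float = 0.005) -> bool:
    """Abandon a branch when the pre-quantization BPB gap to the frontier
    is already larger than later stages (GPTQ, TTT) can plausibly recover.
    The margin is illustrative."""
    gap = candidate_bpb - frontier_bpb
    return gap > recoverable_margin

# With the roughly +0.013 BPB pre-quant gap observed in these runs,
# the rule fires and the branch is stopped before quantization.
print(should_stop(3.013, 3.000))  # True
print(should_stop(3.002, 3.000))  # True only past the margin -> False
```

The point of phrasing it this way is that the check happens strictly before quantization, which is where all three plans below were cut.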
Files
Plan C is terminal-observed only. The pod connection dropped before I could
copy the full remote output folder, so I do not treat it as a complete logged
artifact.
Plan A: Gate32 + q-aware token-only n-gram tilt
Command shape:
Key log lines:
Budget checks passed:
15,972,854 bytes, 595.974 s, 515.283 s.
The n-gram path was token-only and timed inside eval. The failure was not
timing or artifact size. The trained model was weak before quantization.
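For completeness, a hedged sketch of what the budget checks above verify. The cap values here are placeholders I chose so the logged figures pass; the actual limits come from the record rules, which this folder does not restate:

```python
def within_budget(artifact_bytes: int, train_seconds: float, eval_seconds: float,
                  max_bytes: int = 16_000_000,      # placeholder cap
                  max_train_s: float = 600.0,       # placeholder cap
                  max_eval_s: float = 600.0) -> bool:
    """Check the three logged quantities against assumed record caps."""
    return (artifact_bytes <= max_bytes
            and train_seconds <= max_train_s
            and eval_seconds <= max_eval_s)

# The Plan A log lines: 15,972,854 bytes, 595.974 s, 515.283 s.
print(within_budget(15_972_854, 595.974, 515.283))  # True
```

Under these assumed caps the Plan A run clears all three checks, which is why the failure is attributed to training quality rather than budget.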
Plan B: Gate32 + native PR #2018 n-gram
Plan B removed the q-aware patch to test whether the stricter n-gram logic was
responsible for the regression.
Command shape:
Key log lines:
That result isolated the main issue. Gate32 itself did not transfer to this
stack. The q-aware n-gram patch was not the cause of the pre-quant regression.
Plan C: exact #2018 gates + tiny BigramHash
The last branch removed Gate32 and tested a small causal input feature from my
earlier work.
Command shape:
Terminal-observed lines:
This was also stopped at pre-quant. A tiny BigramHash branch did not recover
training quality on the #2018 frontier.
Interpretation
The final runs support three narrow conclusions.
1. Gate32 did not transfer here
Gate widening had public evidence on nearby stacks, but it damaged this one.
The failure appeared before quantization and before TTT, so this was a training
dynamics problem rather than an eval-time issue.
2. The q-aware n-gram patch was not the root cause
Plan B removed that patch and still produced an even worse pre-quant result.
3. A tiny BigramHash branch was too small or mismatched
BigramHash helped my earlier PR #1716 in a different base. The 512x4 version
tested here did not transfer to #2018.
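To make "512x4" concrete, here is a minimal sketch of a BigramHash input feature as I understand it: the previous and current token ids are hashed into 512 buckets, each carrying a 4-dimensional embedding (learned in the real model, random here). The hash and all names are illustrative, not the code from PR #1716 or #2018:

```python
import numpy as np

N_BUCKETS, DIM = 512, 4
rng = np.random.default_rng(0)
# Stand-in for a learned 512x4 embedding table.
bigram_table = rng.normal(scale=0.02, size=(N_BUCKETS, DIM))

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    # Cheap mixing hash over the (prev, cur) pair; illustrative constants.
    return ((prev_id * 1000003) ^ cur_id) % N_BUCKETS

def bigram_feature(token_ids):
    # Causal by construction: position t only sees tokens t-1 and t.
    # Position 0 has no previous token; a sentinel id of 0 is used.
    feats, prev = [], 0
    for cur in token_ids:
        feats.append(bigram_table[bigram_bucket(prev, cur)])
        prev = cur
    return np.stack(feats)  # shape: (seq_len, DIM)

print(bigram_feature([5, 17, 17, 9]).shape)  # (4, 4)
```

A feature this small adds almost no capacity, which is consistent with it being "too small or mismatched" to move training quality on this frontier.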
Compliance Notes
This is a non-record package, but the executed Plan A run still satisfies the
ordinary record constraints: the artifact-size and timing budget checks logged
under Plan A all passed.
Closing Note
This folder is intentionally narrow. It is not a competition-wide synthesis.
It is the evidence package for one failed final transfer attempt.
The broader research notes are in PR #2111.