GDN-Hybrid + Legal Score-First TTT + Full-Hessian GPTQ Int6 #1749
Open
gracebml wants to merge 1 commit into openai:main from
Non-Record Submission — Unlimited Compute Track
Author: mlinh · @gracebml
Track: records/track_non_record_16mb/
Date: 2026-04-20
What this PR adds
A single new folder. No changes to train_gpt.py at the repo root or any other existing file.
Score
For context, the current leaderboard top (PR #1493) sits at 1.0810 bpb on 8×H100 for 10 minutes. This submission reaches 1.0996 bpb after completing only 28% of its planned training budget on a single H100.
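For reference, bits per byte here is the standard metric (this definition is generic, not quoted from the repo): total next-token cross-entropy in bits, divided by the number of evaluation bytes,

$$\mathrm{bpb} \;=\; \frac{1}{N_{\text{bytes}}}\sum_{t} -\log_2 p\!\left(x_t \mid x_{<t}\right) \;=\; \frac{\mathrm{NLL}_{\text{nats}}}{N_{\text{bytes}} \,\ln 2}.$$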
Three novel contributions
1 · GDN-Hybrid architecture
Replaces 2 of the 12 standard attention layers with a shared Sliding Window Attention (SWA) module and fills the remaining 10 slots with Gated DeltaNet (GDN) recurrent layers.
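A minimal sketch of that layout, with illustrative stubs standing in for the submission's real modules (the SWA slot positions 3 and 9, and the class names, are placeholders, not taken from this PR):

```python
import torch.nn as nn

# Illustrative stand-ins for the submission's real modules.
class GDNBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # real block: gated delta-rule recurrence
    def forward(self, x):
        return self.proj(x)

class SlidingWindowAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # real block: windowed self-attention
    def forward(self, x):
        return self.proj(x)

N_LAYERS, SWA_SLOTS = 12, {3, 9}  # two attention slots; positions are placeholders

class GDNHybrid(nn.Module):
    def __init__(self, dim):
        super().__init__()
        shared_swa = SlidingWindowAttention(dim)  # ONE module, reused in both slots
        self.blocks = nn.ModuleList(
            [shared_swa if i in SWA_SLOTS else GDNBlock(dim) for i in range(N_LAYERS)]
        )
    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)  # residual stream; norms omitted for brevity
        return x
```

Because both SWA slots hold the same module object, its weights are stored exactly once; that is the parameter saving described below.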
GDN's delta-rule associative memory gives the model effectively unbounded context at O(T) cost, ideal for TTT, where the model accumulates document-specific knowledge across the evaluation window. The two SWA layers share weights, saving ~4 M parameters that are reinvested into more GDN heads and a wider BigramHash projection.
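For reference, the gated delta-rule state update from the Gated DeltaNet literature (standard notation; not copied from this submission's code) is

$$S_t \;=\; \alpha_t\, S_{t-1}\!\left(I - \beta_t\, k_t k_t^{\top}\right) + \beta_t\, v_t k_t^{\top}, \qquad o_t = S_t\, q_t,$$

where $\alpha_t \in (0,1)$ is a learned forget gate and $\beta_t$ a write strength. A repeated key strengthens its memory slot on every write, which is exactly the reinforcement the TTT protocol below relies on.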
2 · Legal Score-First TTT (strictly compliant)
The TTT protocol follows the exact compliance requirement stated in the repo FAQ.
Implementation:
- torch.inference_mode(): no look-ahead bias, zero gradient accumulation.
- The GDN memory traces are naturally reinforced by TTT: repeated n-grams strengthen delta-rule memory writes, making the model document-adaptive without any training-data leakage.
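A minimal sketch of the score-first protocol under those constraints, assuming a hypothetical stateful interface model(segment, state) -> (logits, state); the function and variable names are illustrative, not the submission's actual API:

```python
import torch
import torch.nn.functional as F

@torch.inference_mode()                    # scoring only: no gradients anywhere
def score_first(model, segments):
    state = None                           # empty delta-rule memory at window start
    total_nll = 0.0
    for seg in segments:                   # seg: LongTensor of token ids, shape (T,)
        # Segment i is scored with memory built from segments 0..i-1 only,
        # so no token ever influences its own score (no look-ahead bias).
        logits, state = model(seg, state)
        total_nll += F.cross_entropy(logits[:-1], seg[1:], reduction="sum").item()
    return total_nll                       # divide by ln(2) * n_bytes to get bpb
```

The design point is that adaptation here is purely the recurrent state carried across segments, which is why inference_mode suffices and no gradient accumulation occurs.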
3 · Full-Hessian GPTQ Int6 with Cholesky compensation
Standard GPTQ quantises each weight column in isolation. This submission instead builds the full per-layer Hessian and uses its Cholesky factorisation to push each column's quantisation error onto the not-yet-quantised columns, as sketched below.
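A hedged sketch of GPTQ-style column quantisation with Cholesky compensation, assuming W is a (rows × cols) weight matrix and H the layer's damped input Hessian; the simple per-tensor int6 scale and all names are illustrative, not this submission's actual pipeline:

```python
import torch

def gptq_int6(W, H):
    lo, hi = -32, 31                                     # signed int6 range
    scale = W.abs().max() / 31
    quant = lambda w: (w / scale).round().clamp(lo, hi) * scale
    # Upper-triangular factor of H^-1 via Cholesky, as in the GPTQ paper
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True
    )
    Q = W.clone()
    for j in range(W.shape[1]):
        q = quant(Q[:, j])
        err = (Q[:, j] - q) / Hinv[j, j]                 # normalised column error
        Q[:, j] = q
        # Push this column's error onto all not-yet-quantised columns
        Q[:, j + 1:] -= err[:, None] * Hinv[j, j + 1:][None, :]
    return Q
```

Compensating across the full Hessian, rather than a diagonal approximation, is what lets later columns absorb each column's rounding error, which is the mechanism behind the small degradation reported next.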
Result: only +0.013 bpb degradation from FP32 -> Int6, compared with the typical +0.05–0.10 for naive post-training quantisation of recurrent layers. All 66 linear layers used full GPTQ; none fell back to clip-search.
Final artifact: 13.93 MB model (int6 + brotli-11) + 0.10 MB code = 14.03 MB total, 1.97 MB under the 16 MB ceiling.
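A minimal sketch of the packing arithmetic behind that number: 6-bit bit-packing followed by brotli at quality 11 (the helper names, the stand-in random weights, and the use of numpy/brotli here are assumptions, not the submission's code):

```python
import numpy as np
import brotli

def pack_int6(q):                                    # q: int8 values in [-32, 31]
    u = (q.astype(np.int16) + 32).astype(np.uint8)   # shift to [0, 63]
    bits = np.unpackbits(u[:, None], axis=1)[:, 2:]  # keep low 6 bits of each value
    return np.packbits(bits.reshape(-1))             # 6 bits per weight on disk

weights = np.random.randint(-32, 32, size=1_000_000, dtype=np.int8)  # stand-in data
blob = brotli.compress(pack_int6(weights).tobytes(), quality=11)
print(f"{len(blob) / 2**20:.2f} MiB")
```

Packing alone gives 6/8 of the int8 footprint; brotli then exploits whatever residual statistics the real (non-random) weights have.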
Training log excerpt
Why this is relevant for the unlimited-compute track
The competition README calls out several techniques it would love to see, including recurrent/state-space models and test-time adaptation. This submission directly addresses both: GDN is a state-space/linear-attention recurrent model, and the score-first TTT protocol makes it continuously adaptive during evaluation. The combination is novel: prior TTT submissions (PR #549, PR #1493) use standard Transformer attention. GDN's delta-rule memory is a natural target for TTT because the memory write is differentiable and directly encodes document-specific associations.
Compute request context
This run was bottlenecked by being limited to a single H100 GPU. It hit the wall-clock cap at step 5,610 with loss still decreasing steeply. Extrapolating the bpb curve to 20,000 steps on 8×H100 (see the sketch below) suggests a score meaningfully below 1.08 bpb.
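A sketch of that extrapolation method, assuming a saturating power-law fit; the intermediate points below are placeholders (only the final step/score pair comes from this run), so the printed projection is illustrative only:

```python
import numpy as np
from scipy.optimize import curve_fit

# (step, bpb) points: all placeholders except the final pair from this run's cap.
steps = np.array([1000, 2000, 3000, 4000, 5610], dtype=float)
bpb   = np.array([1.31, 1.21, 1.16, 1.12, 1.0996])

powerlaw = lambda t, a, b, c: a * t ** (-b) + c   # saturating power-law loss curve
(a, b, c), _ = curve_fit(powerlaw, steps, bpb, p0=(10.0, 0.5, 1.0))
print(f"projected bpb at 20,000 steps: {powerlaw(20_000, a, b, c):.4f}")
```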
The architecture, Score-First TTT protocol, and Hessian-GPTQ pipeline are all fully implemented and verified. The sole bottleneck is GPU hours.
Reproducibility
Full hyperparameter dump and training log are included in train_seed42.log.