
GDN-Hybrid + Legal Score-First TTT + Full-Hessian GPTQ Int6#1749

Open
gracebml wants to merge 1 commit into openai:main from gracebml:gdn-hybrid-ttt-int6

Conversation

@gracebml

GDN-Hybrid + Legal Score-First TTT + Full-Hessian GPTQ Int6

Non-Record Submission — Unlimited Compute Track
Author: mlinh · @gracebml
Track: records/track_non_record_16mb/
Date: 2026-04-20


What this PR adds

A single new folder:

records/track_non_record_16mb/2026-04-20_GDN_Hybrid_ScoreFirst_TTT_HessianGPTQ_Int6/
├── README.md          # full write-up
├── submission.json    # metadata
├── train_gpt.py       # complete training script
├── requirements.txt   # pip dependencies
└── train_seed42.log   # full training log (seed=42)

No changes to train_gpt.py at the repo root or any other existing file.


Score

| Metric | Value |
| --- | --- |
| val_bpb (sliding window, stride=32) | 1.0996 |
| val_bpb (single-pass, post-GPTQ) | 1.1237 |
| Artifact size | 14.03 MB |
| Steps run (compute-limited) | 5,610 / 20,000 |
| Hardware | 1× H100 GPU, wallclock-capped at 4,800 s |

For context, the current leaderboard top (PR #1493) sits at 1.0810 bpb on 8×H100 for 10 minutes. This submission reaches 1.0996 after completing only 28% of its planned training budget on a single H100.
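For reference, the two evaluation modes in the table differ as follows: the sliding-window number re-scores the validation stream in overlapping windows so that every token is graded with (nearly) a full window of left context, while the single-pass number scores each token once with only the context preceding it in its chunk. Below is a minimal sketch of the sliding-window computation, assuming a model that maps a `(1, T)` tensor of token ids to next-token logits and a `bytes_per_token` conversion factor; the names are illustrative, not the code in `train_gpt.py`.

```python
import math
import torch

def sliding_window_bpb(model, tokens, window=2048, stride=32, bytes_per_token=1.0):
    # Strided evaluation: each window only contributes the loss of its newest
    # `stride` tokens (the first window contributes everything it sees), so every
    # token is scored exactly once with up to `window` tokens of left context.
    nats, counted, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        chunk = tokens[begin:end]
        n_new = end - max(prev_end, begin + 1)                  # predictions not already scored
        if n_new <= 0:
            break
        with torch.inference_mode():
            logits = model(chunk[:-1].unsqueeze(0))[0]          # (len(chunk)-1, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        nll = -logp[torch.arange(len(chunk) - 1), chunk[1:]]    # per-position nats
        nats += nll[-n_new:].sum().item()                       # count only the new positions
        counted += n_new
        prev_end = end
        if end == len(tokens):
            break
    return nats / counted / math.log(2) / bytes_per_token       # nats/token -> bits/byte
```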


Three novel contributions

1 · GDN-Hybrid architecture

Replaces 2 of the 12 standard attention layers with a shared Sliding Window Attention module and fills the remaining 10 slots with Gated DeltaNet (GDN) recurrent layers:

[GDN×5] -> [SWA] -> [GDN×5] -> [SWA_shared]

GDN's delta-rule associative memory gives the model effectively unbounded context at O(T) cost, which makes it a natural fit for TTT: the model accumulates document-specific knowledge across the evaluation window. The two SWA slots share one set of weights, saving ~4 M parameters that are reinvested into more GDN heads and a wider BigramHash projection.
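A minimal sketch of how the shared-SWA layout can be wired up. The block constructors are passed in (for example the GatedDeltaNet layer from the flash-linear-attention package listed in requirements.txt, plus a local SWA module), and the slot indices are read off the diagram above rather than from train_gpt.py:

```python
import torch.nn as nn

def build_gdn_hybrid(make_gdn, make_swa, n_layers=12, swa_slots=(5, 11)):
    # Slots 5 and 11 hold the *same* SWA instance, so its weights are shared and
    # counted once; every other slot gets its own fresh GDN block.
    shared_swa = make_swa()
    return nn.ModuleList(
        [shared_swa if i in swa_slots else make_gdn() for i in range(n_layers)]
    )
```

Because the second SWA slot reuses the first instance, `parameters()` deduplicates it and its weights are counted only once, which is where the ~4 M parameter saving comes from.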

2 · Legal Score-First TTT (strictly compliant)

The TTT protocol follows the exact compliance requirement stated in the repo FAQ:

"you are only allowed to test-time train on validation set tokens you've already evaluated your model on, since those tokens have already been graded"

Implementation:

  1. Each 32,768-token chunk is evaluated in torch.inference_mode() — no look-ahead bias, zero gradient accumulation.
  2. After scoring, an isolated AdamW step updates model weights on the already-graded chunk.
  3. N-gram posterior tilt (PR #1437) and eval-time hash embeddings (PR #1460) are applied on top of the adapted weights.

The GDN memory traces are naturally reinforced by TTT: repeated n-grams strengthen delta-rule memory writes, making the model document-adaptive without any training-data leakage.
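A compressed sketch of the score-first loop (illustrative names; the actual loop in train_gpt.py also applies the n-gram tilt and hash embeddings on top). It assumes the model's forward pass returns the mean cross-entropy loss for an (inputs, targets) chunk:

```python
import torch

def score_first_ttt(model, val_chunks, ttt_lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=ttt_lr)
    nats, tokens = 0.0, 0
    for inputs, targets in val_chunks:           # 32,768-token chunks, in order
        model.eval()
        with torch.inference_mode():             # 1. score with frozen weights (no look-ahead)
            loss = model(inputs, targets)
        nats += loss.item() * targets.numel()
        tokens += targets.numel()
        model.train()                            # 2. one isolated AdamW step on the chunk
        opt.zero_grad(set_to_none=True)          #    that has already been graded
        model(inputs, targets).backward()
        opt.step()
    return nats / tokens                         # avg nats/token; bpb = this / ln 2 / bytes-per-token
```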

3 · Full-Hessian GPTQ Int6 with Cholesky compensation

Standard GPTQ quantises each weight column in isolation. This submission:

  • Collects a per-layer full Hessian (input outer-product, 64 calibration batches of autoregressive sequences)
  • Applies Cholesky error compensation — residual quantisation error is propagated to remaining columns along the Hessian's principal directions
  • Routes sensitive layers to bfloat16 (based on weight norm vs Hessian eigenvalue threshold)

Result: only +0.013 BPB degradation from FP32 -> Int6, compared to typical +0.05–0.10 for naive post-training quantisation of recurrent layers. All 66 linear layers used full GPTQ; 0 fell back to clip-search.
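For reference, a stripped-down version of the per-layer quantisation loop: accumulate the Hessian from calibration activations, take the upper Cholesky factor of its inverse, quantise column by column, and push each column's residual error onto the columns not yet quantised. This is a textbook sketch, not the submission's exact pipeline (which adds blocking, the bf16 routing described above, and int6 bit-packing):

```python
import torch

def gptq_int6(W, H, damp=0.01, n_bits=6):
    # W: (out, in) weight; H: (in, in) Hessian accumulated as sum of x x^T over
    # calibration inputs. Returns per-row-scaled int6 codes plus the scales.
    W, H = W.clone().float(), H.clone().float()
    H += damp * H.diag().mean() * torch.eye(H.shape[0], device=H.device)  # condition H
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)         # upper factor of H^-1
    qmax = 2 ** (n_bits - 1) - 1                                          # symmetric int6: [-31, 31]
    scale = (W.abs().amax(dim=1) / qmax).clamp_min(1e-8)                  # per-row scale
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):                                           # one column at a time
        w = W[:, j]
        q = torch.clamp(torch.round(w / scale), -qmax, qmax)
        Q[:, j] = q
        err = (w - q * scale) / Hinv[j, j]
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)   # compensate remaining cols
    return Q, scale
```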

Final artifact: 13.93 MB model (int6 + brotli-11) + 0.10 MB code = 14.03 MB total, 1.0 MB under the 16 MB ceiling.
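The size accounting can be reproduced with something like the snippet below, assuming the int6 codes have already been bit-packed into byte tensors (the `packed_state` dict is hypothetical; the exact packing format lives in train_gpt.py):

```python
import io
import brotli
import torch

def artifact_mb(packed_state):
    # Serialize the packed int6 payloads plus per-row scales, then compress with
    # brotli at its maximum quality (11), as used for the 13.93 MB figure above.
    buf = io.BytesIO()
    torch.save(packed_state, buf)
    return len(brotli.compress(buf.getvalue(), quality=11)) / 1e6
```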


Training log excerpt

  0/20000  val_bpb: 4.1097   (random init)
500/20000  train_loss: 2.2677
1000/20000 train_loss: 2.1755
2000/20000 train_loss: 2.0873
3000/20000 train_loss: 2.0338
4000/20000 val_bpb:  1.1718   (20% through budget)
5000/20000 train_loss: 1.9215
5610/20000 val_bpb:  1.1117   (single-pass, wallclock cap)

pre-quantization post-ema val_bpb:         1.1106
final_int6_roundtrip val_bpb:              1.1237     (+0.013 GPTQ degradation)
final_int6_sliding_window val_bpb:         1.0996     (stride=32)
Serialized model int6+brotli:              13,931,533 bytes
Total submission size:                     14,034,252 bytes

Why this is relevant for the unlimited-compute track

The competition README calls out several techniques it would love to see:

"State-space models, E2E TTT, super long context for evaluation or training"
"Test-time training"

This submission directly addresses both: GDN is a state-space/linear-attention recurrent model, and the score-first TTT protocol makes it continuously adaptive during evaluation. The combination is novel — prior TTT submissions (PR #549, PR #1493) use standard Transformer attention. GDN's delta-rule memory is a natural target for TTT because the memory write is differentiable and directly encodes document-specific associations.


Compute request context

This run was bottlenecked by being limited to a single H100 GPU. It hit the wallclock cap at step 5,610 with the loss still decreasing steeply. Extrapolating the bpb curve to 20,000 steps on 8×H100 suggests a final score meaningfully below 1.08 bpb.

The architecture, Score-First TTT protocol, and Hessian-GPTQ pipeline are all fully implemented and verified. The sole bottleneck is GPU hours.


Reproducibility

# Install deps
pip install torch sentencepiece zstandard brotli flash-linear-attention
pip install flash-attn --no-build-isolation

# Download data (same as all other submissions)
python3 data/cached_challenge_fineweb.py --variant sp1024

# Run (single GPU, unlimited-track wallclock)
SEED=42 ITERATIONS=20000 TRAIN_SEQ_LEN=2048 \
TTT_ENABLED=1 MAX_WALLCLOCK_SECONDS=4800 \
python3 records/track_non_record_16mb/2026-04-20_GDN_Hybrid_ScoreFirst_TTT_HessianGPTQ_Int6/train_gpt.py

# Run (8×H100, 10-minute leaderboard timing)
SEED=42 ITERATIONS=20000 TRAIN_SEQ_LEN=2048 \
TTT_ENABLED=1 GPTQ_ENABLED=1 MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 \
  records/track_non_record_16mb/2026-04-20_GDN_Hybrid_ScoreFirst_TTT_HessianGPTQ_Int6/train_gpt.py

Full hyperparameter dump and training log included in train_seed42.log.

