
GDN-Hybrid + Legal Score-First TTT + Full-Hessian GPTQ Int6#1749

Open
gracebml wants to merge 1 commit into openai:main from gracebml:gdn-hybrid-ttt-int6

Conversation

@gracebml

GDN-Hybrid + Legal Score-First TTT + Full-Hessian GPTQ Int6

Non-Record Submission — Unlimited Compute Track
Author: mlinh · @gracebml
Track: records/track_non_record_16mb/
Date: 2026-04-20


What this PR adds

A single new folder:

records/track_non_record_16mb/2026-04-20_GDN_Hybrid_ScoreFirst_TTT_HessianGPTQ_Int6/
├── README.md          # full write-up
├── submission.json    # metadata
├── train_gpt.py       # complete training script
├── requirements.txt   # pip dependencies
└── train_seed42.log   # full training log (seed=42)

No changes to train_gpt.py at the repo root or any other existing file.


Score

| Metric | Value |
| --- | --- |
| val_bpb (sliding window, stride=32) | 1.0996 |
| val_bpb (single-pass, post-GPTQ) | 1.1237 |
| Artifact size | 14.03 MB |
| Steps run (compute-limited) | 5,610 / 20,000 |
| Hardware | 1× H100 GPU, wallclock-capped at 4,800 s |

For context, the current leaderboard top (PR #1493) sits at 1.0810 bpb on 8×H100 for 10 minutes. This submission reaches 1.0996 after completing only 28% of its planned training budget on a single H100.
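For reference, the two evaluation modes in the table differ as follows: the sliding-window number re-scores the validation stream in overlapping windows so that every token is graded with (nearly) a full window of left context, while the single-pass number scores each token once with only the context preceding it in its chunk. Below is a minimal sketch of the sliding-window computation, assuming a model that maps a `(1, T)` tensor of token ids to next-token logits and a `bytes_per_token` conversion factor; the names are illustrative, not the code in `train_gpt.py`.

```python
import math
import torch

def sliding_window_bpb(model, tokens, window=2048, stride=32, bytes_per_token=1.0):
    # Strided evaluation: each window only contributes the loss of its newest
    # `stride` tokens (the first window contributes everything it sees), so every
    # token is scored exactly once with up to `window` tokens of left context.
    nats, counted, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        chunk = tokens[begin:end]
        n_new = end - max(prev_end, begin + 1)                  # predictions not already scored
        if n_new <= 0:
            break
        with torch.inference_mode():
            logits = model(chunk[:-1].unsqueeze(0))[0]          # (len(chunk)-1, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        nll = -logp[torch.arange(len(chunk) - 1), chunk[1:]]    # per-position nats
        nats += nll[-n_new:].sum().item()                       # count only the new positions
        counted += n_new
        prev_end = end
        if end == len(tokens):
            break
    return nats / counted / math.log(2) / bytes_per_token       # nats/token -> bits/byte
```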


Three novel contributions

1 · GDN-Hybrid architecture

Replaces 2 of the 12 standard attention layers with a shared Sliding Window Attention module and fills the remaining 10 slots with Gated DeltaNet (GDN) recurrent layers:

[GDN×5] -> [SWA] -> [GDN×5] -> [SWA_shared]

GDN's delta-rule associative memory gives the model effectively unbounded context at O(T) cost, which makes it a natural fit for TTT: the model accumulates document-specific knowledge across the evaluation window. The two SWA slots share one set of weights, saving ~4 M parameters that are reinvested into more GDN heads and a wider BigramHash projection.
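A minimal sketch of how the shared-SWA layout can be wired up. The block constructors are passed in (for example the GatedDeltaNet layer from the flash-linear-attention package listed in requirements.txt, plus a local SWA module), and the slot indices are read off the diagram above rather than from train_gpt.py:

```python
import torch.nn as nn

def build_gdn_hybrid(make_gdn, make_swa, n_layers=12, swa_slots=(5, 11)):
    # Slots 5 and 11 hold the *same* SWA instance, so its weights are shared and
    # counted once; every other slot gets its own fresh GDN block.
    shared_swa = make_swa()
    return nn.ModuleList(
        [shared_swa if i in swa_slots else make_gdn() for i in range(n_layers)]
    )
```

Because the second SWA slot reuses the first instance, `parameters()` deduplicates it and its weights are counted only once, which is where the ~4 M parameter saving comes from.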

2 · Legal Score-First TTT (strictly compliant)

The TTT protocol follows the exact compliance requirement stated in the repo FAQ:

"you are only allowed to test-time train on validation set tokens you've already evaluated your model on, since those tokens have already been graded"

Implementation:

  1. Each 32,768-token chunk is evaluated in torch.inference_mode() — no look-ahead bias, zero gradient accumulation.
  2. After scoring, an isolated AdamW step updates model weights on the already-graded chunk.
  3. N-gram posterior tilt (PR #1437) and eval-time hash embeddings (PR #1460) are applied on top of the adapted weights.

The GDN memory traces are naturally reinforced by TTT: repeated n-grams strengthen delta-rule memory writes, making the model document-adaptive without any training-data leakage.
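A compressed sketch of the score-first loop (illustrative names; the actual loop in train_gpt.py also applies the n-gram tilt and hash embeddings on top). It assumes the model's forward pass returns the mean cross-entropy loss for an (inputs, targets) chunk:

```python
import torch

def score_first_ttt(model, val_chunks, ttt_lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=ttt_lr)
    nats, tokens = 0.0, 0
    for inputs, targets in val_chunks:           # 32,768-token chunks, in order
        model.eval()
        with torch.inference_mode():             # 1. score with frozen weights (no look-ahead)
            loss = model(inputs, targets)
        nats += loss.item() * targets.numel()
        tokens += targets.numel()
        model.train()                            # 2. one isolated AdamW step on the chunk
        opt.zero_grad(set_to_none=True)          #    that has already been graded
        model(inputs, targets).backward()
        opt.step()
    return nats / tokens                         # avg nats/token; bpb = this / ln 2 / bytes-per-token
```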

3 · Full-Hessian GPTQ Int6 with Cholesky compensation

Standard GPTQ quantises each weight column in isolation. This submission:

  • Collects a per-layer full Hessian (input outer-product, 64 calibration batches of autoregressive sequences)
  • Applies Cholesky error compensation — residual quantisation error is propagated to remaining columns along the Hessian's principal directions
  • Routes sensitive layers to bfloat16 (based on weight norm vs Hessian eigenvalue threshold)

Result: only +0.013 BPB degradation from FP32 -> Int6, compared to typical +0.05–0.10 for naive post-training quantisation of recurrent layers. All 66 linear layers used full GPTQ; 0 fell back to clip-search.
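For reference, a stripped-down version of the per-layer quantisation loop: accumulate the Hessian from calibration activations, take the upper Cholesky factor of its inverse, quantise column by column, and push each column's residual error onto the columns not yet quantised. This is a textbook sketch, not the submission's exact pipeline (which adds blocking, the bf16 routing described above, and int6 bit-packing):

```python
import torch

def gptq_int6(W, H, damp=0.01, n_bits=6):
    # W: (out, in) weight; H: (in, in) Hessian accumulated as sum of x x^T over
    # calibration inputs. Returns per-row-scaled int6 codes plus the scales.
    W, H = W.clone().float(), H.clone().float()
    H += damp * H.diag().mean() * torch.eye(H.shape[0], device=H.device)  # condition H
    Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)         # upper factor of H^-1
    qmax = 2 ** (n_bits - 1) - 1                                          # symmetric int6: [-31, 31]
    scale = (W.abs().amax(dim=1) / qmax).clamp_min(1e-8)                  # per-row scale
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):                                           # one column at a time
        w = W[:, j]
        q = torch.clamp(torch.round(w / scale), -qmax, qmax)
        Q[:, j] = q
        err = (w - q * scale) / Hinv[j, j]
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)   # compensate remaining cols
    return Q, scale
```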

Final artifact: 13.93 MB model (int6 + brotli-11) + 0.10 MB code = 14.03 MB total, 1.0 MB under the 16 MB ceiling.
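The size accounting can be reproduced with something like the snippet below, assuming the int6 codes have already been bit-packed into byte tensors (the `packed_state` dict is hypothetical; the exact packing format lives in train_gpt.py):

```python
import io
import brotli
import torch

def artifact_mb(packed_state):
    # Serialize the packed int6 payloads plus per-row scales, then compress with
    # brotli at its maximum quality (11), as used for the 13.93 MB figure above.
    buf = io.BytesIO()
    torch.save(packed_state, buf)
    return len(brotli.compress(buf.getvalue(), quality=11)) / 1e6
```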


Training log excerpt

  0/20000  val_bpb: 4.1097   (random init)
500/20000  train_loss: 2.2677
1000/20000 train_loss: 2.1755
2000/20000 train_loss: 2.0873
3000/20000 train_loss: 2.0338
4000/20000 val_bpb:  1.1718   (20% through budget)
5000/20000 train_loss: 1.9215
5610/20000 val_bpb:  1.1117   (single-pass, wallclock cap)

pre-quantization post-ema val_bpb:         1.1106
final_int6_roundtrip val_bpb:              1.1237     (+0.013 GPTQ degradation)
final_int6_sliding_window val_bpb:         1.0996     (stride=32)
Serialized model int6+brotli:              13,931,533 bytes
Total submission size:                     14,034,252 bytes

Why this is relevant for the unlimited-compute track

The competition README calls out several techniques it would love to see:

"State-space models, E2E TTT, super long context for evaluation or training"
"Test-time training"

This submission directly addresses both: GDN is a state-space/linear-attention recurrent model, and the score-first TTT protocol makes it continuously adaptive during evaluation. The combination is novel — prior TTT submissions (PR #549, PR #1493) use standard Transformer attention. GDN's delta-rule memory is a natural target for TTT because the memory write is differentiable and directly encodes document-specific associations.


Compute request context

This run was bottlenecked by being limited to a single H100 GPU. It hit the wallclock cap at step 5,610 with the loss still decreasing steeply. Extrapolating the bpb curve to 20,000 steps on 8×H100 suggests a final score meaningfully below 1.08 bpb.

The architecture, Score-First TTT protocol, and Hessian-GPTQ pipeline are all fully implemented and verified. The sole bottleneck is GPU hours.


Reproducibility

# Install deps
pip install torch sentencepiece zstandard brotli flash-linear-attention
pip install flash-attn --no-build-isolation

# Download data (same as all other submissions)
python3 data/cached_challenge_fineweb.py --variant sp1024

# Run (single GPU, unlimited-track wallclock)
SEED=42 ITERATIONS=20000 TRAIN_SEQ_LEN=2048 \
TTT_ENABLED=1 MAX_WALLCLOCK_SECONDS=4800 \
python3 records/track_non_record_16mb/2026-04-20_GDN_Hybrid_ScoreFirst_TTT_HessianGPTQ_Int6/train_gpt.py

# Run (8×H100, 10-minute leaderboard timing)
SEED=42 ITERATIONS=20000 TRAIN_SEQ_LEN=2048 \
TTT_ENABLED=1 GPTQ_ENABLED=1 MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 \
  records/track_non_record_16mb/2026-04-20_GDN_Hybrid_ScoreFirst_TTT_HessianGPTQ_Int6/train_gpt.py

Full hyperparameter dump and training log included in train_seed42.log.

