
Record: SP8192 + Headwise Gated Attention + Legal TTT (1.0805 BPB, 3-seed) #2005

Open

jamesEmerson112 wants to merge 2 commits into openai:main from jamesEmerson112:submission/fullstack-headwise-gate

Conversation

@jamesEmerson112

Record: SP8192 + Full Stack + Headwise Gated Attention + Legal TTT

val_bpb = 1.0805 (3-seed mean, std 0.0012) | ~15.70 MB | 8xH100 SXM

Non-record submission documenting a novel architecture modification (headwise gated attention) and a systematic ablation study across 40+ experiments.

3-Seed Results

| Seed | Sliding BPB | TTT BPB | Artifact (bytes) |
|------|-------------|---------|------------------|
| 42   | 1.0834 | 1.0818 | 15,697,552 |
| 1337 | 1.0810 | 1.0794 | 15,694,065 |
| 2025 | 1.0820 | 1.0804 | 15,693,855 |
| Mean | 1.0821 | 1.0805 | 15,695,157 |
| Std  | 0.0012 | 0.0012 | |

Novel Contribution: Headwise Gated Attention

Post-attention sigmoid gate applied per-head, after FA3+XSA. The Q projection is widened by gate_dim; the gate modulates each head's contribution before the output projection. ~50K extra params, zero latency cost, a consistent -0.0005 BPB improvement. Inspired by the NeurIPS 2025 Best Paper (arXiv:2505.06708).

Compliance

  • No SLOT, no ETLB, no n-gram cache
  • No Pre-Quantization TTT — fully legal
  • Score-first TTT (legal per Issue #1017, "A Field Guide to Valid Submissions")
  • All artifacts under 16 MB, training under 600s, eval under 600s

Research Contributions

  • 29-paper systematic survey (NeurIPS/ICML/ICLR/ACL 2024-2025)
  • 40+ experiments across 2xH100 and 8xH100
  • 5 negative results documented (SLM, ResFormer, LR Warmup, Structured FFN, Peri-LN)
  • EMA decay scaling law discovery at short training durations

Base Stack

Built on @bigbag's stack (PR #1493) with @clarkkev's SP8192/GPTQ/MuonEq-R (PR #1394), @dexhunter's depth recurrence + legal TTT (PR #1331, #1413), and community contributions.

Reproduction

SEED=42 GATED_ATTN=headwise EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
  TTT_ENABLED=1 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

jamesEmerson112 and others added 2 commits April 30, 2026 09:38
…al_bpb=1.0511)

3-seed mean: 1.0511 BPB (std 0.0008), 8xH100 SXM
Seeds: 42 (1.0517), 1337 (1.0513), 2025 (1.0502)
All artifacts under 16 MB, train <600s, eval <600s

Techniques: Small Batch (ga=1) + EMA=0.990 + Headwise Gated Attention + PreQuantTTT 21ep
Base stack: @bigbag PR openai#1493 (FA3, depth recurrence, parallel residuals, XSA, MuonEq-R, GPTQ int6+brotli, score-first TTT)
- Rename folder: remove PreQuantTTT from name
- Update to C6 3-seed data (mean 1.0805 BPB)
- Remove PreQuantTTT from train_gpt.py, submission.json, README
- Mark as non-record, fully legal submission
- compliance.no_pre_quant_ttt: true

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jamesEmerson112
Author

Credits

This submission builds on the work of many contributors to the Parameter Golf community:

  • @bigbag — Base stack: 3-layer depth recurrence, parallel residuals, sigmoid skip gates, QK-Gain 5.25, LeakyReLU^2, LN scale, legal TTT (PR #1493)
  • @clarkkev (Kevin Clark) — SP8192 vocabulary, GPTQ with SDClip, MuonEq-R optimizer, embedding GPTQ (PR #1394)
  • @dexhunter — Depth recurrence on SP8192 (PR #1331, #1437), legal score-first TTT on SP8192 (PR #1413)
  • @abaybektursun — Score-first TTT framework and legality analysis (PR #549)
  • @Robby955 — Parallel residuals on SP8192 (PR #1412)
  • @msisovic — Parallel residuals concept (PR #1204)
  • @X-Abhishek-X — Hyperparameter tuning and optimizer experiments (PR #1445, #1471)
  • @aryanbhosale — Parallel residuals + score-first TTT stack (PR #1517)
  • An Thien Vo (James Emerson Vo) — Headwise gated attention (novel contribution), 29-paper literature survey, 40+ experiment ablation study

Acknowledgements

  • OpenAI — for hosting the Parameter Golf challenge and the development grant
  • RunPod — for compute credits supporting our 2xH100 and 8xH100 experiments
  • Georgia Tech PACE — for supplementary compute resources
  • @sranganath02 (Sid Ranganathan) — for collaborating on nanochat research and tokenizer investigation as part of our CS 7643 Deep Learning team project
  • CS 7643 Deep Learning at Georgia Tech, taught by Dr. Zsolt Kira — course context for this research

@jamesEmerson112
Author

jamesEmerson112 commented Apr 30, 2026

PR #2005 Supplement — Expanded Research Contributions

val_bpb = 1.0805 (3-seed mean, std 0.0012) | ~15.70 MB | 8xH100 SXM


Research Contributions

1. Headwise Gated Attention (Novel Architecture)

Post-attention sigmoid gate applied per-head, after FlashAttention-3 + XSA compute the attention output. A learned gate modulates each head's contribution before the output projection:

  • Q projection widened by gate_dim extra dimensions
  • Gate signal extracted from extra Q dims, passed through sigmoid
  • Applied elementwise per-head: attn_out *= gate.unsqueeze(-1)
  • ~50K extra parameters (~0.14% overhead), zero inference latency cost

Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708).
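A minimal PyTorch sketch of the mechanism described above. Plain SDPA stands in for the FA3+XSA kernel, the stack's GQA (4 KV heads) is omitted for brevity, and the class and variable names are illustrative rather than the actual train_gpt.py code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseGatedAttention(nn.Module):
    """Sketch: Q projection widened by one extra dim per head; the extra dims
    become per-head sigmoid gates applied to the attention output."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        gate_dim = n_heads                                   # one scalar gate per head
        self.q_proj = nn.Linear(dim, dim + gate_dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q_full = self.q_proj(x)
        q = q_full[..., : self.n_heads * self.head_dim]
        gate_logits = q_full[..., self.n_heads * self.head_dim :]          # (B, T, n_heads)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Plain SDPA stands in for the FA3+XSA attention kernel used in the stack.
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (B, n_heads, T, head_dim)
        gate = torch.sigmoid(gate_logits).transpose(1, 2)                   # (B, n_heads, T)
        attn_out = attn_out * gate.unsqueeze(-1)                            # modulate each head's contribution
        return self.o_proj(attn_out.transpose(1, 2).reshape(B, T, -1))
```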

Consistent improvement across scales:

| Scale  | Headwise (F2/A3) | Control (F1/A1) | Delta |
|--------|------------------|-----------------|-------|
| 2×H100 | 1.1636 BPB | 1.1641 BPB | -0.0005 |
| 8×H100 | 1.0801 BPB | 1.0806 BPB | -0.0005 |

The -0.0005 BPB improvement is preserved exactly when scaling from 2×H100 to 8×H100, confirming the technique's robustness.

We also tested elementwise gating (1 gate per dim per head, +2.36M params). It achieved slightly better BPB (1.2602 vs 1.2653 on SP1024) but exceeded the 16 MB budget (17.87 MB). Headwise is the Pareto-optimal choice: nearly free parameters, fits under budget, and provides consistent improvement.
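For intuition, rough parameter arithmetic behind those counts, assuming 512-dim, 8-head layers with 11 layers in the rank 1 stack and 9 layers in the V1 SP1024 model (figures taken from the configs listed elsewhere in this PR; the snippet is a back-of-envelope sketch, not submission code):

```python
dim, n_heads = 512, 8
head_dim = dim // n_heads

# Headwise: Q widened by one scalar gate per head, over the 11 layers of the rank 1 stack.
headwise_extra = dim * n_heads * 11                   # 45,056 -> the "~50K extra params" figure

# Elementwise: one gate per dimension per head (a full dim-wide widening),
# over the 9 layers of the V1 SP1024 model used for that test.
elementwise_extra = dim * (n_heads * head_dim) * 9    # 2,359,296 -> the "+2.36M params" figure
```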

2. 29-Paper Systematic Survey

Surveyed papers from NeurIPS 2024-2025, ICML 2025, ICLR 2025, and ACL 2025 covering attention modifications, normalization strategies, optimizer scheduling, data selection, structured layers, and compression techniques. Each paper was assessed for Parameter Golf feasibility (16 MB / 10-min / 36M-param constraints) and mapped to PG leaderboard presence.

Key finding: most techniques published for 125M+ parameter models do not transfer to the 36M regime. Of 10 papers tested experimentally, 8 produced negative or null results. Only 2 showed gains on the V2 stack, and those gains did not survive the jump to 8×H100.

3. EMA Decay Scaling Law at Short Training Durations

Discovered that optimal EMA decay shifts dramatically lower when training steps are limited. The rank 1 default (0.9965) is suboptimal for short runs (~1,000-4,500 steps).

2×H100 EMA sweep (C6 base, ~1,030 steps):

| EMA Decay | TTT BPB | vs C6 (1.1622) |
|-----------|---------|----------------|
| 0.9965 (default) | 1.1622 | — |
| 0.995 | 1.1562 | -0.0060 |
| 0.993 | 1.1526 | -0.0096 |
| 0.990 | 1.1505 | -0.0117 |
| 0.997 | 1.1690 | +0.0068 |
| 0.999 | 1.3475 | +0.1853 (catastrophic) |

Gains are monotonic from 0.995 to 0.990 — more aggressive averaging helps when the training window is short. But EMA sensitivity is extreme: 0.997 is already worse than default, and 0.999 is catastrophic.
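For context, the quantity being tuned is the standard exponential moving average of the model weights. A minimal sketch of the update rule (assumed cadence and names, not the actual train_gpt.py EMA code):

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.990):
    """ema_w <- decay * ema_w + (1 - decay) * w after every optimizer step.
    Lower decay weights recent checkpoints more heavily, which is what helps
    when a run only lasts ~1,000 steps."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch:
# ema_model = copy.deepcopy(model)
# for step in range(num_steps):
#     ...optimizer step...
#     update_ema(ema_model, model, decay=0.990)   # evaluate/quantize ema_model at the end
```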

Critical caveat — does NOT transfer to 8×H100:

| Config | 2×H100 BPB | 8×H100 BPB | Delta vs C6 (8×H100) |
|--------|------------|------------|-----------------------|
| C6 (EMA=0.9965) | 1.1622 | 1.0805 | — |
| L1 (EMA=0.990) | 1.1505 | 1.0830 | +0.0025 (worse) |

With only ~4,486 steps on 8×H100, aggressive EMA averages too few checkpoints. The optimal decay depends on the number of training steps, not just the training duration.

4. 3×3 Factorial Technique Interaction Study

A systematic 9-run factorial experiment on the rank 1 stack to isolate technique interactions. Two factors, three levels each:

  • Residual strategy: Parallel Residuals (PR) only vs ResFormer (RF, α=0.5) only vs PR+RF
  • Gate type: No Gate vs Headwise vs Elementwise

Factor matrix (TTT BPB, 2×H100):

|         | No Gate | Headwise | Elementwise |
|---------|---------|----------|-------------|
| PR only | 1.1641 | 1.1636 | 1.1665 (over budget) |
| RF only | 1.1666 | 1.1661 | 1.1700 (over budget) |
| PR + RF | 1.1636 | 1.1650 | 1.1686 (over budget) |

Key interactions discovered:

  • Headwise gate helps PR (F2 vs F1: -0.0005) but hurts PR+RF (F8 vs F7: +0.0014). The two residual mechanisms compete for the same "residual quality" niche.
  • ResFormer helps when stacked with PR (F7 vs F1: -0.0005) but hurts alone (F4 vs F1: +0.0025). ResFormer is only beneficial as a complement to parallel residuals, not a replacement.
  • Elementwise busts budget in all configurations (+2.9M params → 17.2+ MB). Dead on arrival for 16 MB submissions.

This factorial design revealed that technique interactions are non-trivial — you cannot predict combined performance from individual ablations.
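For reference, the nine run labels follow a row-major cross of the two factors, which is consistent with the F-numbers quoted above (F2 vs F1, F8 vs F7, F4 vs F1). The enumeration below is a sketch of that mapping, not project code:

```python
from itertools import product

residual_strategies = ["PR only", "RF only", "PR + RF"]   # factor 1
gate_types = ["No Gate", "Headwise", "Elementwise"]        # factor 2

# F1..F9 in row-major order: F1 = PR only / No Gate, F2 = PR only / Headwise, ...
for i, (residual, gate) in enumerate(product(residual_strategies, gate_types), start=1):
    print(f"F{i}: {residual} + {gate}")
```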

5. Small Batch Size for Short Wall-Clock Training

Tested reducing effective batch size to get more optimizer updates within the fixed 10-minute window (inspired by Liao et al., NeurIPS 2024).

2×H100 results (C6 base):

| Config | Batch Tokens | Steps | TTT BPB | vs C6 |
|--------|--------------|-------|---------|-------|
| C6 (default) | 786,432 | ~1,030 | 1.1622 | — |
| B2 (small batch) | 196,608 | 3,349 | 1.1419 | -0.0203 |

4× smaller batch → 3.3× more steps → -0.020 BPB improvement. The largest single-technique gain we found on the V2 stack.
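A quick sanity check on those numbers (token totals are derived from the table above; the ~19% throughput-loss interpretation is an inference from them, not a measured figure):

```python
# C6 (default batch) vs B2 (small batch) on 2xH100, values from the table above
c6_tokens_per_step, c6_steps = 786_432, 1_030
b2_tokens_per_step, b2_steps = 196_608, 3_349

print(f"step ratio: {b2_steps / c6_steps:.2f}x")                    # ~3.25x more optimizer updates
print(f"C6 tokens:  {c6_tokens_per_step * c6_steps / 1e6:.0f}M")    # ~810M tokens seen
print(f"B2 tokens:  {b2_tokens_per_step * b2_steps / 1e6:.0f}M")    # ~658M tokens seen
# A 4x smaller batch buys only ~3.3x more steps: per-step overhead costs
# roughly 19% of total token throughput inside the fixed 10-minute window.
```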

But does NOT transfer to 8×H100:

| Config | 2×H100 BPB | 8×H100 BPB | Delta vs C6 (8×H100) |
|--------|------------|------------|-----------------------|
| C6 (default batch) | 1.1622 | 1.0805 | — |
| L2 (small batch + EMA=0.990) | 1.1368 | 1.0926 | +0.0121 (worse) |

Despite achieving 13,146 steps on 8×H100 (where EMA=0.990 should help), the smaller batch size degrades quality more than extra steps help.

6. Technique Transfer Failure Across GPU Counts

Perhaps the most important meta-finding: hyperparameter improvements on 2×H100 do not reliably transfer to 8×H100.

| Technique | 2×H100 Delta | 8×H100 Delta | Transferred? |
|-----------|--------------|--------------|--------------|
| Headwise Gated Attention | -0.0005 | -0.0005 | Yes |
| EMA=0.990 | -0.0117 | +0.0025 | No |
| Small Batch (ga=1, 196K) | -0.0203 | +0.0121 | No |
| EMA + Small Batch | -0.0254 | +0.0121 | No |

Architectural changes (headwise gate) transfer perfectly — the delta is preserved exactly across scales. Training hyperparameters (EMA decay, batch size) do not transfer because they depend on the number of training steps, which changes with GPU count. On 8×H100, the model sees ~4× more tokens per step, so the total step count drops from ~3,000-4,000 to ~1,000-4,500. This shifts the optimal EMA and batch size trade-offs.

Implication: For Parameter Golf, always validate hyperparameter tuning at competition scale (8×H100). Architecture changes can be safely prototyped on fewer GPUs.

7. GPTQ Compression Analysis

Systematic comparison of our GPTQ implementation vs leaderboard leaders revealed a 5× quality gap (+0.05 BPB vs +0.01 for Kevin Clark rank 5 / dexhunter rank 7). Root cause analysis identified 5 compounding factors:

  1. Depth recurrence — looped layers share weights, reducing unique matrices to quantize (fewer surfaces = less total error)
  2. Quantization-Aware Training (QAT) — leaderboard models train expecting quantization; ours gets shocked post-hoc
  3. Higher weight decay (0.085-0.090) — produces smaller weights that compress better under brotli
  4. EMA — averaging out training noise makes weights smoother and more compressible
  5. register_forward_hook Hessian collection — captures true activation statistics through the live network

The leaderboard doesn't just "use GPTQ" — they use GPTQ as the final step of a quantization-aware pipeline (WD tuning → EMA → QAT → GPTQ → brotli). We're only doing the last two steps.
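Factor 5 above refers to the standard GPTQ practice of accumulating per-layer Hessian statistics (H ≈ 2·XᵀX) from live forward passes. A minimal sketch of that pattern, assuming vanilla PyTorch forward hooks rather than the repo's actual collection code:

```python
import torch
import torch.nn as nn

def collect_gptq_hessians(model, calib_batches):
    """Accumulate H = 2 * X^T X for every Linear layer from real forward passes,
    so quantization error is measured against true activation statistics."""
    hessians, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # inputs[0]: activations feeding this layer, flattened to (tokens, in_features)
            x = inputs[0].reshape(-1, inputs[0].shape[-1]).float()
            h = 2.0 * x.t() @ x
            hessians[name] = hessians.get(name, torch.zeros_like(h)) + h
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        for batch in calib_batches:
            model(batch)          # hooks fire as activations flow through the live network

    for handle in hooks:
        handle.remove()
    return hessians
```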


Techniques That Failed (Expanded)

Tested on the V2 rank 1 stack (36M params, SP8192, 10-min wall clock). All produced negative results.

| # | Technique | Paper | Result | Why It Failed |
|---|-----------|-------|--------|---------------|
| 1 | SLM / Rho-1 | NeurIPS 2024 | All ratios worse (+0.002 to +0.155 BPB) | At 17M params, the model hasn't mastered basic tokens yet — skipping them removes gradient signal the model genuinely needs. Paper tested at 1B+. No reference model means we can't distinguish learnable (H→L) from unlearnable (H→H) tokens. Fixed wall clock means fewer effective tokens per step = worse model. |
| 2 | ResFormer (Value Residual) | ACL 2025 | +0.0022 BPB on 8×H100 (context-dependent) | Works on our V1 stack (-0.0048 BPB, α=0.5 optimal) where it provides a gradient highway. Fails on the V2 stack because parallel residuals already provide that highway — the two mechanisms are redundant. |
| 3 | LR Warmup | NeurIPS 2024 | +0.0024 to +0.0066 (monotonically worse with more warmup) | MuonEq-R has its own momentum warmup; an extra LR ramp wastes precious steps in a 10-min window. |
| 4 | Structured FFN | NeurIPS 2024 | +0.04 to +0.05 BPB | Low-rank (r=0.5-0.75) + block-diagonal saves 30-56% of MLP params but the approximation is too lossy at 36M. Paper tested at 125M+ where redundancy exists. |
| 5 | Peri-LN | ICML 2025 | Immediate NaN | Output RMSNorm on attention + MLP conflicts with the existing attn_scale/mlp_scale + ln_scale_factor in the rank 1 stack. Independently confirmed by a teammate on the rank 4 stack (different base, same failure). |
| 6 | Differential Attention | ICLR 2025 Oral | +0.0138 BPB | Two-softmax-subtract attention requires 2× FlashAttention-3 calls per layer, reducing throughput by 22% (3,292 steps vs 4,221). The throughput penalty outweighs the attention quality gain at 36M scale. |
| 7 | HybridNorm | NeurIPS 2025 | +0.011 BPB | V-norm + Post-Norm FFN hurt on the rank 4 stack. The stack is already heavily normalized (Q/K-norm, ln_scale_factor, resid_mix, attn_scale/mlp_scale). Adding more normalization conflicts — the normalization axis is closed. |
| 8 | GPTQ Sequential / Embed GPTQ tuning | Frantar et al., ICLR 2023 | +0.19 to +0.66 BPB (dramatically worse) | Sequential block quantization: Hessians collected through dequantized blocks are inferior to full-precision Hessians. Embedding GPTQ: frequency-weighted column correlation is the wrong Hessian for lookup tables (embeddings aren't linear projections). Combined: the errors compound catastrophically. |

Meta-finding: 8 of 10 tested papers produced negative results at the 36M-parameter scale. The 36M / 16 MB / 10-min constraint regime fundamentally changes which optimizations matter. Techniques designed for 125M+ parameter models with large compute budgets are not "free gains" at small scale — they often interact negatively with the aggressive training stack (MuonEq-R, depth recurrence, parallel residuals, XSA) that already exists.


Experiment Scale

| Metric | Value |
|--------|-------|
| Total experiments | 40+ |
| 2×H100 sessions | 7 (Sessions 3-8, 11-16) |
| 8×H100 sessions | 3 (Sessions 14, 15, 18) |
| Env configs created | 21 |
| Run scripts created | 6 |
| Papers surveyed | 29 |
| Papers tested experimentally | 10 |
| Total compute spend | ~$280+ on RunPod |

BPB Progression

| Milestone | BPB | Config |
|-----------|-----|--------|
| V1 SP1024 baseline (Run 6) | 1.2667 | 9L×512d, GQA, SP1024, int8+zlib |
| V1 SP8192 best (Run 11, 3-seed) | 1.2073 | 9L×448d, headwise gate, SP8192, int8+zlib |
| V2 fork of rank 1 (F2) | 1.1636 | 11L×512d, full stack + headwise gate |
| V2 best 2×H100 (N1) | 1.1368 | C6 + EMA=0.990 + Small Batch |
| C6 submission (3-seed mean) | 1.0805 | V2 + headwise gate + emb7+eclip15, 8×H100 |

2×H100 → 8×H100 Scaling

Consistent improvement across all V2 configurations:

| Config | 2×H100 BPB | 8×H100 BPB | Improvement |
|--------|------------|------------|-------------|
| F1 (control) | 1.1641 | 1.0806 | -0.0835 |
| F2 (headwise) | 1.1636 | 1.0801 | -0.0835 |
| C6 (headwise + emb7+eclip15) | 1.1622 | 1.0818 | -0.0804 |

Technique deltas are preserved across scales for architectural changes, and the 4× increase in GPU count yields a consistent ~0.083 BPB improvement across configurations.


3-Seed Reproducibility

| Seed | Sliding BPB | TTT BPB | Artifact (bytes) |
|------|-------------|---------|------------------|
| 42   | 1.0834 | 1.0818 | 15,697,552 |
| 1337 | 1.0810 | 1.0794 | 15,694,065 |
| 2025 | 1.0820 | 1.0804 | 15,693,855 |
| Mean | 1.0821 | 1.0805 | 15,695,157 |
| Std  | 0.0012 | 0.0012 | |

All seeds: artifact under 16 MB, training under 600s, eval under 600s.

@jamesEmerson112
Author


Run ID Reference

Internal run IDs used throughout this document. Each describes a specific configuration.

| Run ID | Full Name | Description |
|--------|-----------|-------------|
| C6 | Headwise + emb7+eclip15 | Our submission config: @bigbag's full stack + headwise gated attention + int7 embedding quantization (clip σ=15.0). Submitted with 3-seed verification. |
| F1 | Control (no additions) | @bigbag's rank 1 stack unmodified. Baseline for ablation. |
| F2 | PR + Headwise Gate | @bigbag's stack + headwise gated attention (default compression). |
| F7 | PR + ResFormer (α=0.5) | @bigbag's stack + ResFormer value residual learning. No gate. |
| A1 | Control (8×H100) | F1 run at competition scale (8×H100). |
| A3 | Headwise (8×H100) | F2 run at competition scale (8×H100, default compression). |
| A2 | PR + ResFormer (8×H100) | F7 run at competition scale (8×H100). |
| E1 | EMA=0.995 | C6 base + more aggressive EMA averaging (decay 0.995 vs default 0.9965). |
| R3 | EMA=0.990 | C6 base + most aggressive EMA averaging tested. Best on 2×H100. |
| L1 | EMA=0.990 (8×H100) | R3 config at competition scale. Did NOT transfer. |
| L2 | Small Batch + EMA (8×H100) | Small batch (196K tokens) + EMA=0.990 at competition scale. |
| B2 | Small Batch (2×H100) | C6 base + grad_accum=1, 196K batch tokens (4× smaller, 3.3× more steps). |
| N1 | EMA + Small Batch (2×H100) | C6 + EMA=0.990 + small batch. Best-ever legal 2×H100 result. |
| N2 | N1 + Differential Attention | N1 config + two-softmax-subtract attention (Paper #19). |
| P1a | SOTA hparams (8×H100) | C6 + 6 hyperparameter overrides from PR #1855 (warmdown, min_lr, clip, beta2). |
| Q0-Q7 | GPTQ tuning runs | Sequential blocks (Q1), embed GPTQ (Q3), all combined (Q7). All worse. |
| F1-F9 | 3×3 factorial | 9-run sweep: {PR, RF, PR+RF} × {No Gate, Headwise, Elementwise}. |

Base stack ("@bigbag's stack" / "rank 1 stack"): SP8192 vocabulary, 11L×512d×8H/4KV, 4×MLP with LeakyReLU(0.5)², 3-layer depth recurrence (layers 3-4-5 looped 2×), parallel residuals (layers 7+), sigmoid skip gates, partial RoPE (16/64 dims), XSA on all layers, QK-Gain 5.25, MuonEq-R optimizer, EMA (0.9965), GPTQ int6+brotli, score-first TTT. From @bigbag PR #1493.
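The same base-stack description, restated as a hypothetical config summary for quick reference (field names and grouping are illustrative, not the actual train_gpt.py configuration keys):

```python
# Hypothetical summary of the base stack described above; field names are
# illustrative, not the actual train_gpt.py configuration keys.
BASE_STACK = dict(
    vocab="SP8192",
    n_layers=11, d_model=512, n_heads=8, n_kv_heads=4,
    mlp_ratio=4, activation="LeakyReLU(0.5)^2",
    depth_recurrence=dict(layers=(3, 4, 5), loops=2),
    parallel_residuals_from_layer=7, sigmoid_skip_gates=True,
    rope_dims=16, head_dim=64,                 # partial RoPE: 16 of 64 dims
    xsa_all_layers=True, qk_gain=5.25,
    optimizer="MuonEq-R", ema_decay=0.9965,
    compression="GPTQ int6 + brotli", ttt="score-first",
)
```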

