
Record: SP8192 + Headwise Gate + EMA 0.990 + Small Batch (1.0066 BPB, 3-seed)#2071

Open
jamesEmerson112 wants to merge 6 commits into openai:main from jamesEmerson112:submission/p3-headwise-gate-ema990-smallbatch

Conversation


@jamesEmerson112 jamesEmerson112 commented May 1, 2026

Record: SP8192 + PR#1851 Fork + Headwise Gate + EMA 0.990 + Small Batch + Emb6

val_bpb = 1.0066 (3-seed mean, std 0.0009) | ~15.97 MB | 8xH100 SXM

Record submission: beats previous SOTA (1.0611, PR #1855) by 0.0545 BPB.

No PPM. No PreQuantTTT.

3-Seed Results

| Seed | Pre-Q BPB | Quant BPB | TTT BPB | Artifact (bytes) |
|---|---|---|---|---|
| 42 | 1.0025 | 1.0205 | 1.0069 | 15,975,827 |
| 1337 | 1.0017 | 1.0190 | 1.0057 | 15,973,108 |
| 2025 | 1.0030 | 1.0206 | 1.0073 | 15,973,714 |
| Mean | 1.0024 | 1.0200 | 1.0066 | 15,974,216 |
| Std | 0.0007 | 0.0009 | 0.0009 | |

Author & Research Approach

An Thien Vo (James Emerson Vo) — Georgia Tech, CS 7643 Deep Learning.

This submission forks PR #1851 (@aquariouseworkman) and adds 4 novel contributions discovered through a systematic research effort: 29+ papers surveyed, 40+ experiments across 2×H100 and 8×H100, and careful ablation to identify which techniques transfer to the extreme compression regime of Parameter Golf.

Novel Contributions

  1. Headwise Gated Attention — Post-attention sigmoid gate applied per-head after FA3+XSA. Q projection widened by gate_dim, gate modulates each head's contribution before output projection. ~50K extra parameters, zero inference cost, consistent -0.0005 BPB improvement across scales. Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708).

  2. EMA Decay = 0.990 — Discovered optimal EMA decay shifts dramatically lower when training steps are limited. Default 0.9965 → optimal 0.990 on 8×H100: more aggressive weight averaging captures better training signal when the training window is fixed at 10 minutes.

  3. Small Batch (ga=1, 196K tokens) — Reducing effective batch size from 786K to 196K tokens yields 3.3× more optimizer steps in the same wall clock. On 8×H100, this enables 12,382 steps vs ~4,500 with default batch size, giving the optimizer more fine-grained updates.

  4. 6-bit Embedding Quantization — Reducing EMBED_BITS from 8 to 6 saves ~1 MB on the compressed artifact, enabling headwise gated attention's extra parameters to fit under the 16 MB budget. Costs ~0.013 BPB in quantization gap but enables the complete technique stack.
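
A quick back-of-the-envelope check on the ~1 MB figure (a sketch only; it assumes the tied 8192×512 embedding from the architecture section and ignores byte-shuffle/Brotli effects):

```python
# Rough estimate of the savings from dropping embeddings from 8-bit to 6-bit.
# Assumes tied embeddings of shape vocab x d_model = 8192 x 512; the real artifact
# delta also depends on byte-shuffle + Brotli compression.
vocab, d_model = 8192, 512
embed_params = vocab * d_model                 # 4,194,304 weights
saved_bytes = embed_params * (8 - 6) / 8       # 2 bits saved per weight
print(f"~{saved_bytes / 1e6:.2f} MB saved before compression")  # ~1.05 MB
```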

Key Techniques

| Technique | Source | Impact |
|---|---|---|
| Headwise Gated Attention | James Vo (novel) | -0.0005 BPB |
| EMA Decay = 0.990 | James Vo (novel finding) | Enables better weight averaging |
| Small Batch (196K tokens) | James Vo (novel finding) | 3.3× more optimizer steps |
| 6-bit Embedding Quant | James Vo (novel) | -1 MB artifact size |

Base Stack

PR #1851 (@aquariouseworkman)

Extends @bigbag's PR #1493 with:

  • LQER (asymmetric int4 error correction) — lqer_enabled=True, lqer_asym_enabled=True
  • Phased TTT (multi-phase test-time training with LoRA)
  • Fused softcapped cross-entropy kernel (Triton) — fused_ce_enabled=True
  • Brotli compression
  • SmearGate (available but disabled in our runs: smear_gate_enabled=False)
  • Sparse Attention Gate (available but replaced by our headwise gate: sparse_attn_gate_enabled=False)
  • CaseOps tokenizer (active via symlinked data — env var was caseops_enabled=False but pod data paths pointed to CaseOps-tokenized shards)

Upstream: @bigbag PR #1493

  • SP8192 vocabulary (8192-token SentencePiece BPE, CaseOps variant via symlinked data)
  • 11L × 512d × 8H/4KV, MLP 4×, LeakyReLU(0.5)²
  • 3-layer depth recurrence (layers 3-4-5 looped 2×, 17 virtual from 11 physical)
  • Parallel residuals (layers 7+), sigmoid skip gates
  • Partial RoPE (16/64 dims), layerwise LN scale
  • XSA on all 11 layers, QK-Gain 5.0, logit softcap 30.0
  • MuonEq-R optimizer, GPTQ int6+brotli, score-first TTT

Techniques That Failed

Tested on the V2 rank 1 stack. All produced negative results at the 36M-parameter scale.

| # | Technique | Paper | Result | Why It Failed |
|---|---|---|---|---|
| 1 | SLM / Rho-1 | NeurIPS 2024 | All ratios worse | 17M model needs every gradient signal; paper tested at 1B+ |
| 2 | ResFormer | ACL 2025 | +0.0022 BPB | Parallel residuals already provide the gradient highway |
| 3 | LR Warmup | NeurIPS 2024 | +0.0024 to +0.0066 | MuonEq-R has its own momentum warmup |
| 4 | Structured FFN | NeurIPS 2024 | +0.04 to +0.05 BPB | Low-rank too lossy at 36M; tested at 125M+ |
| 5 | Peri-LN | ICML 2025 | Immediate NaN | Conflicts with existing normalization stack |
| 6 | Differential Attention | ICLR 2025 Oral | +0.0138 BPB | 2× FA3 calls reduces throughput 22% |
| 7 | HybridNorm | NeurIPS 2025 | +0.011 BPB | Normalization axis already saturated |
| 8 | GPTQ Sequential/Embed | Frantar et al. | +0.19 to +0.66 | Sequential Hessians through dequantized blocks are inferior |

Architecture

11L × 512d × 8H/4KV, MLP 4×, LeakyReLU(0.5)², partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10] (enabled at frac=0.35). Parallel residuals from layer 8. Skip gates. Headwise gated attention: Q widened by gate_dim, sigmoid gate per-head after FA3+XSA. LQER asymmetric int4 error correction from PR #1851. Fused softcapped CE kernel (Triton).

Total parameters: ~35.99M.
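
A minimal sketch of how the depth-recurrence schedule and logit softcap above could be wired (illustrative module names, not the actual train_gpt.py code):

```python
import torch

ENCODER_ORDER = [0, 1, 2, 3, 4, 5, 3, 4]        # physical layer indices, some reused
DECODER_ORDER = [5, 3, 4, 5, 6, 7, 8, 9, 10]    # 8 + 9 = 17 virtual layers from 11 physical
LOGIT_SOFTCAP = 30.0

def forward_sketch(blocks, x, lm_head):
    """blocks: 11 physical transformer blocks; repeating an index reuses (ties) its weights."""
    for i in ENCODER_ORDER + DECODER_ORDER:
        x = blocks[i](x)
    logits = lm_head(x)
    # Logit softcap: smoothly bound logits to (-30, 30) before the loss.
    return LOGIT_SOFTCAP * torch.tanh(logits / LOGIT_SOFTCAP)
```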

Training

MuonEq-R optimizer for matrix params, AdamW for embeddings/scalars. GRAD_ACCUM_STEPS=1 (8 GPUs), TRAIN_BATCH_TOKENS=196,608 (small batch), ~12,382 steps in ~596s on 8×H100 SXM (PyTorch 2.11, CUDA 13.0). Linear warmdown over final 75%. EMA decay 0.990. Weight decay: Muon WD=0.095, Embed WD=0.085, Adam WD=0.02.
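
A rough consistency check on those numbers (arithmetic only, using the figures quoted above):

```python
# Implied token budget and average throughput for the small-batch run on 8xH100.
steps, tokens_per_step, train_seconds = 12_382, 196_608, 596
total_tokens = steps * tokens_per_step                       # ~2.43B training tokens
print(f"{total_tokens / 1e9:.2f}B tokens")
print(f"~{total_tokens / train_seconds / 1e6:.1f}M tokens/s average throughput")
```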

Quantization

Full-Hessian GPTQ with SDClip: clip = k * std(row).

  • int6 for attention/MLP matrices (MATRIX_CLIP_SIGMAS=12.85, ATTN_CLIP_SIGMAS=13.0)
  • int6 for token embeddings (EMBED_BITS=6, EMBED_CLIP_SIGMAS=20.0)
  • Byte-shuffle + Brotli-11 compression
  • 16 calibration batches from training data
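
A minimal sketch of the SDClip rule above (illustrative helper, not the submission's GPTQ code):

```python
import torch

def sdclip(weight: torch.Tensor, k: float) -> torch.Tensor:
    """Clip each row of a weight matrix to k standard deviations before GPTQ.
    k corresponds to the *_CLIP_SIGMAS settings above (e.g. 12.85 for matrices,
    13.0 for attention, 20.0 for embeddings)."""
    clip = k * weight.std(dim=1, keepdim=True)   # per-row clip = k * std(row)
    return weight.clamp(-clip, clip)
```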

Evaluation

Phased TTT (from PR #1851) — multi-phase test-time training with LoRA adaptation:

  • LoRA rank 96, applied to K/MLP/O projections
  • Adam optimizer, lr=0.0001, weight decay=1.0
  • Eval seq len 2048, chunk-based scoring
  • Score-first: tokens scored under torch.no_grad() BEFORE gradient updates
  • Total eval time: ~353-389s (within 600s budget)
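
For clarity, a minimal sketch of the score-first ordering (illustrative; the real phased TTT loop with LoRA lives in PR #1851's evaluation code):

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lora_optimizer):
    """Each chunk is fully scored under no_grad BEFORE any LoRA update (rule C3),
    and every token is scored exactly once (rule C4)."""
    total_nats, total_tokens = 0.0, 0
    for tokens, targets in chunks:
        with torch.no_grad():                                   # score first
            logits = model(tokens)
            total_nats += F.cross_entropy(logits.view(-1, logits.size(-1)),
                                          targets.view(-1), reduction="sum").item()
            total_tokens += targets.numel()
        lora_optimizer.zero_grad()                              # then adapt LoRA params
        logits = model(tokens)
        F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1)).backward()
        lora_optimizer.step()
    return total_nats / total_tokens                            # mean loss in nats/token
```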

Compliance

Per Issue #1017 (Track B):

  • C1 (Causality): Causal eval only — each position scored from prefix tokens.
  • C2 (Normalized): Standard softmax over full vocab.
  • C3 (Score before update): Each chunk fully scored before any LoRA update.
  • C4 (Single pass): Each token scored exactly once.
  • No SLOT, No PreQuantTTT, No ETLB, No n-gram cache.
  • All artifacts under 16,000,000 bytes on all 3 seeds.
  • Training under 600s on all 3 seeds.
  • Eval under 600s on all 3 seeds.

Reproduction

git clone https://github.com/jamesEmerson112/DL-Team-Proposal.git
cd DL-Team-Proposal && git checkout James-experiment
cd parameter-golf

pip install --upgrade torch
pip install brotli sentencepiece numpy python-minifier
pip install huggingface_hub
pip install --no-cache-dir \
  "https://download.pytorch.org/whl/cu130/flash_attn_3-3.0.0-cp39-abi3-manylinux_2_28_x86_64.whl"

# Step 1: Download regular SP8192 data
python3 data/cached_challenge_fineweb.py --variant sp8192 --train-shards 80

# Step 2: Download CaseOps-tokenized data
# Due to a technical issue (regular sp8192 and CaseOps sp8192 datasets conflict
# on disk), CaseOps data was symlinked into the standard paths on our pod.
# The env var caseops_enabled=False, but training used CaseOps-tokenized shards.
MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 \
  MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=datasets \
  python3 data/cached_challenge_fineweb.py \
    --variant sp8192_lossless_caps_caseops_v1_reserved \
    --train-shards 80

# Step 3: Symlink CaseOps data into standard paths
mv data/datasets/fineweb10B_sp8192 data/datasets/fineweb10B_sp8192_regular 2>/dev/null || true
mv data/tokenizers/fineweb_8192_bpe.model data/tokenizers/fineweb_8192_bpe.model.regular 2>/dev/null || true
ln -s fineweb10B_sp8192_lossless_caps_caseops_v1_reserved data/datasets/fineweb10B_sp8192
ln -s fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model data/tokenizers/fineweb_8192_bpe.model

# Step 4: Train (CASEOPS_ENABLED=0 — byte sidecar not used, but data is CaseOps-tokenized)
SEED=42 CASEOPS_ENABLED=0 GATED_ATTN_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=0 \
  EMA_DECAY=0.990 GRAD_ACCUM_STEPS=1 TRAIN_BATCH_TOKENS=196608 EMBED_BITS=6 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Repeat with SEED=1337 and SEED=2025 for 3-seed verification.

Credits

Acknowledgements

  • OpenAI — for hosting the Parameter Golf challenge and the development grant
  • RunPod — for compute credits supporting our 2×H100 and 8×H100 experiments
  • Georgia Tech PACE — for supplementary compute resources
  • CS 7643 Deep Learning at Georgia Tech, taught by Dr. Zsolt Kira
  • @sranganath2 (Sid Ranganathan) — contributed to discussion about tokenizer investigation, nanochat research, fused CE kernel insights, and research papers
  • @Ashray14 — contributed to discussion about research papers
  • @ialeksic3 — contributed to discussion about research papers

Total personal compute cost: ~$1,165 ($640 out-of-pocket + $525 OpenAI development grant) across 130+ experiments on RunPod.

In memory of Moomoo, my cat.

Included Files

  • README.md (this file)
  • submission.json
  • train_gpt.py
  • train_seed42.log
  • train_seed1337.log
  • train_seed2025.log

…ch (1.0066 BPB, 3-seed)

3-seed mean val_bpb = 1.0066 (std 0.0009) on 8xH100 SXM.
Beats SOTA (1.0611, PR openai#1855) by 0.0545 BPB.

Novel contributions:
- Headwise Gated Attention (post-FA3+XSA sigmoid gate per-head)
- EMA Decay = 0.990 (vs default 0.9965)
- Small Batch (ga=1, 196K tokens, 3.3x more optimizer steps)
- 6-bit Embedding Quantization (EMBED_BITS=6)

Base: PR openai#1851 (@aquariouseworkman) + @bigbag PR openai#1493 upstream.
All seeds under 16 MB, training < 600s, eval < 600s.
Score-first TTT, no PreQuantTTT, fully compliant.
@someone114514

I’m having trouble verifying the 1.0066 claim from the submitted logs.

The raw train_seed42.log I can see stops after:

diagnostic pre-quantization post-ema val_bpb:1.00247081
GPTQ:collecting Hessians from calibration data...

I can’t find the final post-GPTQ / artifact reload / TTT eval lines, e.g.:

  • GPTQ:collected ...
  • Serialized model quantized+brotli ...
  • Total submission size quantized+brotli ...
  • diagnostic quantized ...
  • quantized_ttt_phased val_bpb=... eval_time=...

The submission.json reports per-seed TTT BPB and artifact bytes, but I don’t see those values supported by the logs currently attached.

Could you upload the complete per-seed logs showing the final quantized artifact evaluated with TTT on the full validation split?

@andrewbaggio1

What's the MD5 of your data/datasets/fineweb10B_sp8192/fineweb_val_000000.bin?

@jamesEmerson112
Author

jamesEmerson112 commented May 1, 2026

@andrewbaggio1 Thank you for pointing it out. I knew something was missing. That's actually the SP8192 + CaseOps variant mentioned in #1868 and #1855 (and perhaps others).

I'm updating the report now

@jamesEmerson112
Author

@someone114514 Thanks for asking. My logs actually follow the format from #1868. Please refer to that PR for further details.

CaseOps-tokenized data was used via pod symlinks even though
CASEOPS_ENABLED=0 was set. Updated README and submission.json
to accurately reflect this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andrewbaggio1

Also, I'd edit your teammates' names out of the submission; the rules explicitly say "result submission is limited to individuals".

@jamesEmerson112
Author

@andrewbaggio1 Much appreciated, Andrew. We mainly discussed papers, so I've updated the submission accordingly.

@jamesEmerson112
Author

jamesEmerson112 commented May 1, 2026

PR Supplement — Expanded Research Contributions

P3 SOTA: val_bpb = 1.0066 (3-seed mean, std 0.0009) | ~15.97 MB | 8xH100 SXM
Previous C6: val_bpb = 1.0805 (3-seed mean, std 0.0012) | ~15.70 MB


Attempted techniques and benchmarks


Run ID Reference

Internal run IDs used throughout this document, starting from my previous "SOTA" config, C6 (see the README and findings.md for full details). Each ID describes a specific configuration.

| Run ID | Full Name | Description |
|---|---|---|
| C6 | Headwise + emb7+eclip15 | Previous submission config (superseded by P3): @bigbag's full stack + headwise gated attention + int7 embedding quantization (clip σ=15.0). Submitted with 3-seed verification. |
| F1 | Control (no additions) | @bigbag's stack (PR #1493) unmodified. Baseline for ablation. |
| F2 | PR + Headwise Gate | @bigbag's stack + headwise gated attention (default compression). |
| F7 | PR + ResFormer (α=0.5) | @bigbag's stack + ResFormer value residual learning. No gate. |
| A1 | Control (8×H100) | F1 run at competition scale (8×H100). |
| A3 | Headwise (8×H100) | F2 run at competition scale (8×H100, default compression). |
| A2 | PR + ResFormer (8×H100) | F7 run at competition scale (8×H100). |
| E1 | EMA=0.995 | C6 base + more aggressive EMA averaging (decay 0.995 vs default 0.9965). |
| R3 | EMA=0.990 | C6 base + most aggressive EMA averaging tested. Best on 2×H100. |
| L1 | EMA=0.990 (8×H100) | R3 config at competition scale. Did NOT transfer. |
| L2 | Small Batch + EMA (8×H100) | Small batch (196K tokens) + EMA=0.990 at competition scale. |
| B2 | Small Batch (2×H100) | C6 base + grad_accum=1, 196K batch tokens (4× smaller, 3.3× more steps). |
| N1 | EMA + Small Batch (2×H100) | C6 + EMA=0.990 + small batch. Best-ever legal 2×H100 result. |
| N2 | N1 + Differential Attention | N1 config + two-softmax-subtract attention (Paper #19). |
| P1a | SOTA hparams (8×H100) | C6 + 6 hyperparameter overrides from PR #1855 (warmdown, min_lr, clip, beta2). |
| Q0-Q7 | GPTQ tuning runs | Sequential blocks (Q1), embed GPTQ (Q3), all combined (Q7). All worse. |
| F1-F9 | 3×3 factorial | 9-run sweep: {PR, RF, PR+RF} × {No Gate, Headwise, Elementwise}. |
| P3-s42 | PR#1851 + gate + EMA990 + smallbatch + emb6 (seed 42) | SOTA run, 1.0069 TTT BPB, 15,975,827 bytes |
| P3-s1337 | PR#1851 + gate + EMA990 + smallbatch + emb6 (seed 1337) | SOTA run, 1.0057 TTT BPB, 15,973,108 bytes |
| P3-s2025 | PR#1851 + gate + EMA990 + smallbatch + emb6 (seed 2025) | SOTA run, 1.0073 TTT BPB, 15,973,714 bytes |

Base stack ("@bigbag's stack" / "PR #1493 stack"): SP8192 vocabulary, 11L×512d×8H/4KV, 4×MLP with LeakyReLU(0.5)², 3-layer depth recurrence (layers 3-4-5 looped 2×), parallel residuals (layers 7+), sigmoid skip gates, partial RoPE (16/64 dims), XSA on all layers, QK-Gain 5.25, MuonEq-R optimizer, EMA (0.9965), GPTQ int6+brotli, score-first TTT. From @bigbag PR #1493.


Research Contributions

1. Headwise Gated Attention (Novel Architecture)

Post-attention sigmoid gate applied per-head, after FlashAttention-3 + XSA compute the attention output. A learned gate modulates each head's contribution before the output projection:

  • Q projection widened by gate_dim extra dimensions
  • Gate signal extracted from extra Q dims, passed through sigmoid
  • Applied elementwise per-head: attn_out *= gate.unsqueeze(-1)
  • ~50K extra parameters (~0.14% overhead), zero inference latency cost

Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708).
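
A minimal sketch of the mechanism (illustrative shapes and names; gate_dim is assumed to be one logit per head here, while the actual PR uses a wider gate projection, and FA3+XSA is stubbed with scaled_dot_product_attention):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseGatedAttention(nn.Module):
    """Post-attention sigmoid gate per head, fed from extra Q-projection channels."""
    def __init__(self, dim=512, n_heads=8, n_kv=4):
        super().__init__()
        self.n_heads, self.n_kv, self.head_dim = n_heads, n_kv, dim // n_heads
        gate_dim = n_heads                                      # one gate logit per head (assumption)
        self.q_proj = nn.Linear(dim, dim + gate_dim, bias=False)    # Q widened by gate_dim
        self.kv_proj = nn.Linear(dim, 2 * n_kv * self.head_dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q_full = self.q_proj(x)
        q, gate_logits = q_full[..., :-self.n_heads], q_full[..., -self.n_heads:]
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).view(B, T, 2 * self.n_kv, self.head_dim).transpose(1, 2).chunk(2, dim=1)
        k = k.repeat_interleave(self.n_heads // self.n_kv, dim=1)    # GQA: expand KV heads
        v = v.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # stand-in for FA3+XSA
        gate = torch.sigmoid(gate_logits).transpose(1, 2).unsqueeze(-1)  # (B, H, T, 1)
        out = out * gate                                # gate each head's contribution
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```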

Consistent improvement across scales:

| Scale | Headwise (F2/A3) | Control (F1/A1) | Delta |
|---|---|---|---|
| 2×H100 | 1.1636 BPB | 1.1641 BPB | -0.0005 |
| 8×H100 | 1.0801 BPB | 1.0806 BPB | -0.0005 |

The -0.0005 BPB improvement is preserved exactly when scaling from 2×H100 to 8×H100, confirming the technique's robustness.

I also tested elementwise gating (1 gate per dim per head, +2.36M params). It achieved slightly better BPB (1.2602 vs 1.2653 on SP1024) but exceeded the 16 MB budget (17.87 MB). Headwise is the Pareto-optimal choice: nearly free parameters, fits under budget, and provides consistent improvement.

2. 29-Paper Systematic Survey

Surveyed papers from NeurIPS 2024-2025, ICML 2025, ICLR 2025, and ACL 2025 covering attention modifications, normalization strategies, optimizer scheduling, data selection, structured layers, and compression techniques. Each paper was assessed for Parameter Golf feasibility (16 MB / 10-min / 36M-param constraints) and mapped to PG leaderboard presence.

Key finding: most techniques published for 125M+ parameter models do not transfer to the 36M regime. Of 10 papers tested experimentally, 8 produced negative or null results. Only 2 showed gains on the V2 @bigbag stack, and those gains did not survive the jump to 8×H100 on that base. However, the same techniques (EMA=0.990, small batch) transferred successfully on PR #1851's LQER base, producing P3 SOTA (1.0066 BPB). Transfer is stack-dependent.

3. EMA Decay Scaling Law at Short Training Durations

Discovered that optimal EMA decay shifts dramatically lower when training steps are limited. The @bigbag default (0.9965) is suboptimal for short runs (~1,000-4,500 steps).

2×H100 EMA sweep (C6 base, ~1,030 steps):

| EMA Decay | TTT BPB | vs C6 (1.1622) |
|---|---|---|
| 0.9965 (default) | 1.1622 | — |
| 0.995 | 1.1562 | -0.0060 |
| 0.993 | 1.1526 | -0.0096 |
| 0.990 | 1.1505 | -0.0117 |
| 0.997 | 1.1690 | +0.0068 |
| 0.999 | 1.3475 | +0.1853 (catastrophic) |

Gains are monotonic from 0.995 to 0.990 — more aggressive averaging helps when the training window is short. But EMA sensitivity is extreme: 0.997 is already worse than default, and 0.999 is catastrophic.
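
For reference, the EMA update itself is one line; a sketch with the decay values above (not the actual training code):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.990):
    """ema = decay * ema + (1 - decay) * current weights, applied after each step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)

# Rough intuition: the effective averaging window is ~1 / (1 - decay) steps,
# i.e. ~100 steps at decay=0.990 vs ~286 at the 0.9965 default, a much more
# recent average when the whole run is only ~1,030 steps.
```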

Critical caveat — stack-dependent transfer:

| Config | 2×H100 BPB | 8×H100 BPB | Delta vs baseline (8×H100) |
|---|---|---|---|
| C6 (EMA=0.9965, @bigbag stack) | 1.1622 | 1.0805 | — |
| L1 (EMA=0.990, @bigbag stack) | 1.1505 | 1.0830 | +0.0025 (worse) |
| P3 (EMA=0.990, PR #1851 stack) | — | 1.0066 | NEW SOTA |

On @bigbag stack with ~4,486 steps, aggressive EMA averages too few checkpoints. But on PR #1851's LQER base with small batch (12,382 steps), EMA=0.990 contributes to SOTA. The optimal decay depends on both step count and base stack dynamics.

4. 3×3 Factorial Technique Interaction Study

Systematic 9-run factorial experiment on the @bigbag stack to isolate technique interactions. Three factors, three levels each:

  • Residual strategy: Parallel Residuals (PR) only vs ResFormer (RF, α=0.5) only vs PR+RF
  • Gate type: No Gate vs Headwise vs Elementwise

Factor matrix (TTT BPB, 2×H100):

| | No Gate | Headwise | Elementwise |
|---|---|---|---|
| PR only | 1.1641 | 1.1636 | 1.1665 (over budget) |
| RF only | 1.1666 | 1.1661 | 1.1700 (over budget) |
| PR + RF | 1.1636 | 1.1650 | 1.1686 (over budget) |

Key interactions discovered:

  • Headwise gate helps PR (F2 vs F1: -0.0005) but hurts PR+RF (F8 vs F7: +0.0014). The two residual mechanisms compete for the same "residual quality" niche.
  • ResFormer helps when stacked with PR (F7 vs F1: -0.0005) but hurts alone (F4 vs F1: +0.0025). ResFormer is only beneficial as a complement to parallel residuals, not a replacement.
  • Elementwise busts budget in all configurations (+2.9M params → 17.2+ MB). Dead on arrival for 16 MB submissions.

This factorial design revealed that technique interactions are non-trivial — you cannot predict combined performance from individual ablations.

5. Small Batch Size for Short Wall-Clock Training

Tested reducing effective batch size to get more optimizer updates within the fixed 10-minute window (inspired by Liao et al., NeurIPS 2024).

2×H100 results (C6 base):

| Config | Batch Tokens | Steps | TTT BPB | vs C6 |
|---|---|---|---|---|
| C6 (default) | 786,432 | ~1,030 | 1.1622 | — |
| B2 (small batch) | 196,608 | 3,349 | 1.1419 | -0.0203 |

4× smaller batch → 3.3× more steps → -0.020 BPB improvement. The largest single-technique gain I found on the V2 stack.
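
The step arithmetic behind that row (per-step overhead is why the ratio is ~3.3× rather than the full 4×):

```python
# 2xH100 numbers from the table above.
default_tokens, small_tokens = 786_432, 196_608
default_steps, small_steps = 1_030, 3_349
print(default_tokens / small_tokens)    # 4.0   -> batch is 4x smaller
print(small_steps / default_steps)      # ~3.25 -> ~3.3x more optimizer steps per wall clock
```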

Stack-dependent transfer to 8×H100:

| Config | 2×H100 BPB | 8×H100 BPB | Delta vs baseline (8×H100) |
|---|---|---|---|
| C6 (default batch, @bigbag stack) | 1.1622 | 1.0805 | — |
| L2 (small batch + EMA, @bigbag stack) | 1.1368 | 1.0926 | +0.0121 (worse) |
| P3 (small batch + EMA, PR #1851 stack) | — | 1.0066 | NEW SOTA |

On @bigbag stack, small batch hurts at 8×H100 scale. But on PR #1851's LQER base, small batch + EMA=0.990 combine to produce SOTA (12,382 steps, 1.0066 BPB). The transfer depends on the base stack — LQER's quantization-aware residuals create a regime where frequent updates compound rather than conflict.

6. Technique Transfer Across GPU Counts and Stacks

Perhaps the most important meta-finding: hyperparameter improvements on 2×H100 do not reliably transfer to 8×H100 — transfer depends on both GPU count and base stack.

| Technique | 2×H100 Delta | 8×H100 Delta (@bigbag) | 8×H100 Delta (PR #1851) | Transferred? |
|---|---|---|---|---|
| Headwise Gated Attention | -0.0005 | -0.0005 | part of P3 SOTA | Yes (always) |
| EMA=0.990 | -0.0117 | +0.0025 | part of P3 SOTA | Stack-dependent |
| Small Batch (ga=1, 196K) | -0.0203 | +0.0121 | part of P3 SOTA | Stack-dependent |
| EMA + Small Batch | -0.0254 | +0.0121 | 1.0066 (SOTA) | Stack-dependent |

Architectural changes (headwise gate) transfer perfectly — the delta is preserved across scales and stacks. Training hyperparameters (EMA decay, batch size) are stack-dependent: they failed on the @bigbag stack at 8×H100 (L1, L2) but succeeded on PR #1851's LQER base (P3 = 1.0066 SOTA). The difference is that PR #1851's LQER creates a training dynamic where aggressive EMA and frequent updates compound — the same hyperparameters that conflict on one stack can synergize on another.

Updated implication: Hyperparameter transfer depends on both GPU count AND base stack. The @bigbag → PR #1851 transition changed the optimal EMA/batch regime. Architecture changes remain safe to prototype on fewer GPUs.

7. GPTQ Compression Analysis

Systematic comparison of our GPTQ implementation vs leaderboard leaders revealed a 5× quality gap (+0.05 BPB vs +0.01 for Kevin Clark (then-rank 5) / dexhunter (then-rank 7)). Root cause analysis identified 5 compounding factors:

  1. Depth recurrence — looped layers share weights, reducing unique matrices to quantize (fewer surfaces = less total error)
  2. Quantization-Aware Training (QAT) — leaderboard models train expecting quantization; ours gets shocked post-hoc
  3. Higher weight decay (0.085-0.090) — produces smaller weights that compress better under brotli
  4. EMA — averaging out training noise makes weights smoother and more compressible
  5. register_forward_hook Hessian collection — captures true activation statistics through the live network

The leaderboard doesn't just "use GPTQ" — they use GPTQ as the final step of a quantization-aware pipeline (WD tuning → EMA → QAT → GPTQ → brotli). I am only doing the last two steps.
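
A sketch of what factor 5 (hook-based Hessian collection) looks like in practice (illustrative; not the submission's calibration code):

```python
import torch
import torch.nn as nn

def attach_hessian_hooks(model: nn.Module, target_names):
    """Accumulate GPTQ Hessians H = sum x x^T via forward hooks on live Linear layers,
    so the statistics reflect the true activations of the full-precision network."""
    hessians, handles = {}, []
    for name, module in model.named_modules():
        if name in target_names and isinstance(module, nn.Linear):
            hessians[name] = torch.zeros(module.in_features, module.in_features)

            def hook(mod, inputs, output, name=name):
                x = inputs[0].reshape(-1, mod.in_features).float()  # (tokens, in_features)
                hessians[name] += x.t() @ x                         # accumulate X^T X
            handles.append(module.register_forward_hook(hook))
    return hessians, handles  # run calibration batches, then .remove() each handle
```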


Techniques That Failed (Expanded)

Tested on the V2 @bigbag stack (36M params, SP8192, 10-min wall clock). All produced negative results.

| # | Technique | Paper | Result | Why It Failed |
|---|---|---|---|---|
| 1 | SLM / Rho-1 | NeurIPS 2024 | All ratios worse (+0.002 to +0.155 BPB) | At 17M params, the model hasn't mastered basic tokens yet — skipping them removes gradient signal the model genuinely needs. Paper tested at 1B+. No reference model means I can't distinguish learnable (H→L) from unlearnable (H→H) tokens. Fixed wall clock means fewer effective tokens per step = worse model. |
| 2 | ResFormer (Value Residual) | ACL 2025 | +0.0022 BPB on 8×H100 (context-dependent) | Works on our V1 stack (-0.0048 BPB, α=0.5 optimal) where it provides a gradient highway. Fails on V2 stack because parallel residuals already provide that highway — the two mechanisms are redundant. |
| 3 | LR Warmup | NeurIPS 2024 | +0.0024 to +0.0066 (monotonically worse with more warmup) | MuonEq-R has its own momentum warmup; extra LR ramp wastes precious steps in a 10-min window. |
| 4 | Structured FFN | NeurIPS 2024 | +0.04 to +0.05 BPB | Low-rank (r=0.5-0.75) + block-diagonal saves 30-56% MLP params but the approximation is too lossy at 36M. Paper tested at 125M+ where redundancy exists. |
| 5 | Peri-LN | ICML 2025 | Immediate NaN | Output RMSNorm on attention + MLP conflicts with existing attn_scale/mlp_scale + ln_scale_factor in the @bigbag stack. |
| 6 | Differential Attention | ICLR 2025 Oral | +0.0138 BPB | Two-softmax-subtract attention requires 2× FlashAttention-3 calls per layer, reducing throughput by 22% (3,292 steps vs 4,221). The throughput penalty outweighs attention quality at 36M scale. |
| 7 | HybridNorm | NeurIPS 2025 | +0.011 BPB | V-norm + Post-Norm FFN hurt on rank 4 stack. The stack is already heavily normalized (Q/K-norm, ln_scale_factor, resid_mix, attn_scale/mlp_scale). Adding more normalization conflicts — the normalization axis is closed. |
| 8 | GPTQ Sequential / Embed | GPTQ tuning (Frantar et al., ICLR 2023) | +0.19 to +0.66 BPB (dramatically worse) | Sequential block quantization: Hessians collected through dequantized blocks are inferior to full-precision Hessians. Embedding GPTQ: frequency-weighted column correlation is the wrong Hessian for lookup tables (embeddings aren't linear projections). Combined: errors compound catastrophically. |

Meta-finding: 8 of 10 tested papers produced negative results at the 36M-parameter scale. The 36M / 16 MB / 10-min constraint regime fundamentally changes which optimizations matter. Techniques designed for 125M+ parameter models with large compute budgets are not "free gains" at small scale — they often interact negatively with the aggressive training stack (MuonEq-R, depth recurrence, parallel residuals, XSA) that already exists.


Experiment Scale

| Metric | Value |
|---|---|
| Total experiments | 130+ |
| 2×H100 sessions | 8 (Sessions 3-8, 11-17) |
| 8×H100 sessions | 4 (Sessions 14, 15, 18-19) |
| Env configs created | 21 |
| Run scripts created | 6 |
| Papers surveyed | 29 |
| Papers tested experimentally | 10 |
| Total compute spend | ~$1,165 on RunPod |

BPB Progression

| Milestone | BPB | Config |
|---|---|---|
| V1 SP1024 baseline (Run 6) | 1.2667 | 9L×512d, GQA, SP1024, int8+zlib |
| V1 SP8192 best (Run 11, 3-seed) | 1.2073 | 9L×448d, headwise gate, SP8192, int8+zlib |
| V2 fork of @bigbag (F2) | 1.1636 | 11L×512d, full stack + headwise gate |
| V2 best 2×H100 (N1) | 1.1368 | C6 + EMA=0.990 + Small Batch |
| C6 submission (3-seed mean) | 1.0805 | V2 + headwise gate + emb7+eclip15, 8×H100 |
| P3 SOTA (3-seed mean) | 1.0066 | PR #1851 fork + headwise gate + EMA=0.990 + small batch + emb6, 8×H100 |

2×H100 → 8×H100 Scaling

Consistent improvement across all V2 configurations:

| Config | 2×H100 BPB | 8×H100 BPB | Improvement |
|---|---|---|---|
| F1 (control) | 1.1641 | 1.0806 | -0.0835 |
| F2 (headwise) | 1.1636 | 1.0801 | -0.0835 |
| C6 (headwise + emb7+eclip15) | 1.1622 | 1.0818 | -0.0804 |

Technique deltas are preserved across scales for architectural changes. ~0.083 BPB consistent scaling factor from 4× GPU count increase.


3-Seed Reproducibility

P3 (SOTA — PR #1851 fork + headwise gate + EMA=0.990 + small batch + emb6)

| Seed | Pre-Q BPB | Quant BPB | TTT BPB | Artifact |
|---|---|---|---|---|
| 42 | 1.0025 | 1.0205 | 1.0069 | 15,975,827 bytes |
| 1337 | 1.0017 | 1.0190 | 1.0057 | 15,973,108 bytes |
| 2025 | 1.0030 | 1.0206 | 1.0073 | 15,973,714 bytes |
| Mean | 1.0024 | 1.0200 | 1.0066 | 15,974,216 bytes |
| Std | 0.0007 | 0.0009 | 0.0009 | |

C6 (Previous — @bigbag fork + headwise gate + emb7+eclip15)

| Seed | Sliding BPB | TTT BPB | Artifact |
|---|---|---|---|
| 42 | 1.0834 | 1.0818 | 15,697,552 bytes |
| 1337 | 1.0810 | 1.0794 | 15,694,065 bytes |
| 2025 | 1.0820 | 1.0804 | 15,693,855 bytes |
| Mean | 1.0821 | 1.0805 | 15,695,157 bytes |
| Std | 0.0012 | 0.0012 | |

All seeds: artifact under 16 MB, training under 600s, eval under 600s.

jamesEmerson112 and others added 2 commits April 30, 2026 21:49
Due to a technical issue where regular SP8192 and CaseOps SP8192
datasets conflict on disk, CaseOps data was symlinked into the
standard paths. Added exact download + symlink commands so judges
can reproduce the same setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sharpobject

hello, the below content is AI-generated so you should probably disregard it.

PR #2071 Legality Feedback

I think the main legality concern is byte accounting, not non-normalized
next-token distributions.

PR #2071 appears to evaluate CaseOps-tokenized validation shards while
CASEOPS_ENABLED=0. The README says CaseOps data was symlinked into the
standard SP8192 paths, but the code only loads the CaseOps byte sidecar when
CASEOPS_ENABLED=1. With CASEOPS_ENABLED=0, the evaluator falls back to
tokenizer-derived byte counting via build_sentencepiece_luts(...).

That means the reported BPB is likely using bytes inferred from the transformed
CaseOps token stream, not the canonical raw FineWeb validation byte count.

Relevant log evidence from train_seed42.log:

caseops_enabled: False
datasets_dir: ./data/datasets/fineweb10B_sp8192
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
val_tokens: 47851520
quantized_ttt_phased val_loss:2.40073153 val_bpb:1.00692894

Those values imply a denominator of about 164,594,398 bytes:

bytes = val_loss / ln(2) * val_tokens / val_bpb
      = 2.40073153 / ln(2) * 47,851,520 / 1.00692894
      ~= 164,594,398

If the canonical raw eval byte count is 153,880,891, the same loss and token
count would give:

2.40073153 / ln(2) * 47,851,520 / 153,880,891 ~= 1.0770 BPB

So the claimed 1.0069 looks like it may be an artifact of an inflated byte
denominator. This would be a serious issue because the submission changes the
effective tokenizer/data representation and therefore needs to prove BPB is
computed against the correct raw-byte denominator.
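
The same arithmetic in a couple of lines, for anyone who wants to re-run it (using only the quoted log values; 153,880,891 is the canonical byte count referenced above):

```python
import math

val_loss, val_tokens, reported_bpb = 2.40073153, 47_851_520, 1.00692894
implied_bytes = val_loss / math.log(2) * val_tokens / reported_bpb
print(f"implied denominator: {implied_bytes:,.0f} bytes")                            # ~164.6M
print(f"BPB vs raw bytes: {val_loss / math.log(2) * val_tokens / 153_880_891:.4f}")  # ~1.0770
```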

I did not find an obvious probability-normalization issue. The eval path uses
F.cross_entropy(...), and the fused training CE computes a normal log-sum-exp
over the full vocab. I also did not find an obvious score-after-update leak in
the main phased TTT loop: it computes per_tok_loss, accumulates/scores it, and
then applies the LoRA update.

The byte denominator issue seems sufficient to make the submitted score
suspect unless they can show that the sidecar/canonical raw-byte accounting was
actually used or that 164,594,398 is the correct canonical denominator for this
validation set.

@jamesEmerson112
Author

Thank you, @sharpobject. Let me look into this.

@jamesEmerson112
Author

@sharpobject I'm still actively investigating this.
@someone114514 sharpobject found the issue, so I'm validating the data and tokenizer first; after that I'll improve the log format for you.

Thanks, guys

@jamesEmerson112
Author

jamesEmerson112 commented May 1, 2026

@sharpobject You are correct, Robert. The byte inflation is there. I'm rerunning multiple times to confirm.

@someone114514 Unfortunately, Robert is correct. The BPB value may not be as accurate as I thought. I'm rerunning multiple times to confirm.

P.S.: I knew it was too good to be true, and I had barely slept for two days before this submission.

@jamesEmerson112
Author

jamesEmerson112 commented May 1, 2026

Odd though... the number of steps is still absurdly high
p3_corrected_seed42.txt
p3_corrected_seed42_incomplete.txt
p3_corrected_seed1337.txt

Opus 4.7:

  The ~12,140 steps in P3-fix come entirely from our small batch override (TRAIN_BATCH_TOKENS=196608). That was James's Paper #15 contribution, not part of PR #1851's base config.

  So the step counts line up:
  - Default batch (786K): ~4,500-5,000 steps → C6, P1a, PR #1851, SOTA
  - Small batch (196K): ~12,000 steps → P3 only (our addition, didn't help)

@jamesEmerson112
Author

quantized_ttt_phased val_loss:2.37890832 val_bpb:0.99777572 eval_time:443623ms
total_eval_time:443.6s

The low val_bpb comes back again after fixing the bug. Currently looking into this.
p4_clip12.txt

@jamesEmerson112
Author

I reset the pod to rerun after the SP8192+CaseOps bug, and still got that number returned again:
2500/20000 train_loss: 2.8979 train_time: 9.7m tok/s: 3377557
2500/20000 val_loss: 2.8573 val_bpb: 1.0096
2542/20000 val_loss: 2.8535 val_bpb: 1.0082
stopping_early: wallclock_cap train_time: 602920ms step: 2542/20000
peak memory allocated: 41587 MiB reserved: 46952 MiB

Though this is BEFORE evaluation (a missing library stops the run prematurely).

p4_clip12_reset.txt

I must head to the Gemini workshop; I'll see if I can continue this remotely.

