
Record: SP8192 + Headwise Gate + EMA 0.990 + Small Batch (1.0066 BPB, 3-seed)#2071

Open
jamesEmerson112 wants to merge 6 commits into openai:main from jamesEmerson112:submission/p3-headwise-gate-ema990-smallbatch

Conversation


@jamesEmerson112 jamesEmerson112 commented May 1, 2026

Record: SP8192 + PR#1851 Fork + Headwise Gate + EMA 0.990 + Small Batch + Emb6

val_bpb = 1.0066 (3-seed mean, std 0.0009) | ~15.97 MB | 8xH100 SXM

Record submission: beats previous SOTA (1.0611, PR #1855) by 0.0545 BPB.

No PPM. No PreQuantTTT.

3-Seed Results

| Seed | Pre-Q BPB | Quant BPB | TTT BPB | Artifact (bytes) |
|---|---|---|---|---|
| 42 | 1.0025 | 1.0205 | 1.0069 | 15,975,827 |
| 1337 | 1.0017 | 1.0190 | 1.0057 | 15,973,108 |
| 2025 | 1.0030 | 1.0206 | 1.0073 | 15,973,714 |
| Mean | 1.0024 | 1.0200 | 1.0066 | 15,974,216 |
| Std | 0.0007 | 0.0009 | 0.0009 | |

Author & Research Approach

An Thien Vo (James Emerson Vo) — Georgia Tech, CS 7643 Deep Learning.

This submission forks PR #1851 (@aquariouseworkman) and adds 4 novel contributions discovered through a systematic research effort: 29+ papers surveyed, 40+ experiments across 2×H100 and 8×H100, and careful ablation to identify which techniques transfer to the extreme compression regime of Parameter Golf.

Novel Contributions

  1. Headwise Gated Attention — Post-attention sigmoid gate applied per-head after FA3+XSA. Q projection widened by gate_dim, gate modulates each head's contribution before output projection. ~50K extra parameters, zero inference cost, consistent -0.0005 BPB improvement across scales. Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708).

  2. EMA Decay = 0.990 — Discovered optimal EMA decay shifts dramatically lower when training steps are limited. Default 0.9965 → optimal 0.990 on 8×H100: more aggressive weight averaging captures better training signal when the training window is fixed at 10 minutes.

  3. Small Batch (ga=1, 196K tokens) — Reducing effective batch size from 786K to 196K tokens yields 3.3× more optimizer steps in the same wall clock. On 8×H100, this enables 12,382 steps vs ~4,500 with default batch size, giving the optimizer more fine-grained updates.

  4. 6-bit Embedding Quantization — Reducing EMBED_BITS from 8 to 6 saves ~1 MB on the compressed artifact, enabling headwise gated attention's extra parameters to fit under the 16 MB budget. Costs ~0.013 BPB in quantization gap but enables the complete technique stack.
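
A quick back-of-the-envelope check on the ~1 MB figure (a sketch only; it assumes the tied 8192×512 embedding from the architecture section and ignores byte-shuffle/Brotli effects):

```python
# Rough estimate of the savings from dropping embeddings from 8-bit to 6-bit.
# Assumes tied embeddings of shape vocab x d_model = 8192 x 512; the real artifact
# delta also depends on byte-shuffle + Brotli compression.
vocab, d_model = 8192, 512
embed_params = vocab * d_model                 # 4,194,304 weights
saved_bytes = embed_params * (8 - 6) / 8       # 2 bits saved per weight
print(f"~{saved_bytes / 1e6:.2f} MB saved before compression")  # ~1.05 MB
```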

Key Techniques

| Technique | Source | Impact |
|---|---|---|
| Headwise Gated Attention | James Vo (novel) | -0.0005 BPB |
| EMA Decay = 0.990 | James Vo (novel finding) | Enables better weight averaging |
| Small Batch (196K tokens) | James Vo (novel finding) | 3.3× more optimizer steps |
| 6-bit Embedding Quant | James Vo (novel) | -1 MB artifact size |

Base Stack

PR #1851 (@aquariouseworkman)

Extends @bigbag's PR #1493 with:

  • LQER (asymmetric int4 error correction) — lqer_enabled=True, lqer_asym_enabled=True
  • Phased TTT (multi-phase test-time training with LoRA)
  • Fused softcapped cross-entropy kernel (Triton) — fused_ce_enabled=True
  • Brotli compression
  • SmearGate (available but disabled in our runs: smear_gate_enabled=False)
  • Sparse Attention Gate (available but replaced by our headwise gate: sparse_attn_gate_enabled=False)
  • CaseOps tokenizer (active via symlinked data — env var was caseops_enabled=False but pod data paths pointed to CaseOps-tokenized shards)

Upstream: @bigbag PR #1493

  • SP8192 vocabulary (8192-token SentencePiece BPE, CaseOps variant via symlinked data)
  • 11L × 512d × 8H/4KV, MLP 4×, LeakyReLU(0.5)²
  • 3-layer depth recurrence (layers 3-4-5 looped 2×, 17 virtual from 11 physical)
  • Parallel residuals (layers 7+), sigmoid skip gates
  • Partial RoPE (16/64 dims), layerwise LN scale
  • XSA on all 11 layers, QK-Gain 5.0, logit softcap 30.0
  • MuonEq-R optimizer, GPTQ int6+brotli, score-first TTT

Techniques That Failed

Tested on the V2 rank 1 stack. All produced negative results at the 36M-parameter scale.

| # | Technique | Paper | Result | Why It Failed |
|---|---|---|---|---|
| 1 | SLM / Rho-1 | NeurIPS 2024 | All ratios worse | 17M model needs every gradient signal; paper tested at 1B+ |
| 2 | ResFormer | ACL 2025 | +0.0022 BPB | Parallel residuals already provide the gradient highway |
| 3 | LR Warmup | NeurIPS 2024 | +0.0024 to +0.0066 | MuonEq-R has its own momentum warmup |
| 4 | Structured FFN | NeurIPS 2024 | +0.04 to +0.05 BPB | Low-rank too lossy at 36M; tested at 125M+ |
| 5 | Peri-LN | ICML 2025 | Immediate NaN | Conflicts with existing normalization stack |
| 6 | Differential Attention | ICLR 2025 Oral | +0.0138 BPB | 2× FA3 calls reduces throughput 22% |
| 7 | HybridNorm | NeurIPS 2025 | +0.011 BPB | Normalization axis already saturated |
| 8 | GPTQ Sequential/Embed | Frantar et al. | +0.19 to +0.66 | Sequential Hessians through dequantized blocks are inferior |

Architecture

11L × 512d × 8H/4KV, MLP 4×, LeakyReLU(0.5)², partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10] (enabled at frac=0.35). Parallel residuals from layer 8. Skip gates. Headwise gated attention: Q widened by gate_dim, sigmoid gate per-head after FA3+XSA. LQER asymmetric int4 error correction from PR #1851. Fused softcapped CE kernel (Triton).

Total parameters: ~35.99M.
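
A minimal sketch of how the depth-recurrence schedule and logit softcap above could be wired (illustrative module names, not the actual train_gpt.py code):

```python
import torch

ENCODER_ORDER = [0, 1, 2, 3, 4, 5, 3, 4]        # physical layer indices, some reused
DECODER_ORDER = [5, 3, 4, 5, 6, 7, 8, 9, 10]    # 8 + 9 = 17 virtual layers from 11 physical
LOGIT_SOFTCAP = 30.0

def forward_sketch(blocks, x, lm_head):
    """blocks: 11 physical transformer blocks; repeating an index reuses (ties) its weights."""
    for i in ENCODER_ORDER + DECODER_ORDER:
        x = blocks[i](x)
    logits = lm_head(x)
    # Logit softcap: smoothly bound logits to (-30, 30) before the loss.
    return LOGIT_SOFTCAP * torch.tanh(logits / LOGIT_SOFTCAP)
```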

Training

MuonEq-R optimizer for matrix params, AdamW for embeddings/scalars. GRAD_ACCUM_STEPS=1 (8 GPUs), TRAIN_BATCH_TOKENS=196,608 (small batch), ~12,382 steps in ~596s on 8×H100 SXM (PyTorch 2.11, CUDA 13.0). Linear warmdown over final 75%. EMA decay 0.990. Weight decay: Muon WD=0.095, Embed WD=0.085, Adam WD=0.02.
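
A rough consistency check on those numbers (arithmetic only, using the figures quoted above):

```python
# Implied token budget and average throughput for the small-batch run on 8xH100.
steps, tokens_per_step, train_seconds = 12_382, 196_608, 596
total_tokens = steps * tokens_per_step                       # ~2.43B training tokens
print(f"{total_tokens / 1e9:.2f}B tokens")
print(f"~{total_tokens / train_seconds / 1e6:.1f}M tokens/s average throughput")
```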

Quantization

Full-Hessian GPTQ with SDClip: clip = k * std(row).

  • int6 for attention/MLP matrices (MATRIX_CLIP_SIGMAS=12.85, ATTN_CLIP_SIGMAS=13.0)
  • int6 for token embeddings (EMBED_BITS=6, EMBED_CLIP_SIGMAS=20.0)
  • Byte-shuffle + Brotli-11 compression
  • 16 calibration batches from training data
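
A minimal sketch of the SDClip rule above (illustrative helper, not the submission's GPTQ code):

```python
import torch

def sdclip(weight: torch.Tensor, k: float) -> torch.Tensor:
    """Clip each row of a weight matrix to k standard deviations before GPTQ.
    k corresponds to the *_CLIP_SIGMAS settings above (e.g. 12.85 for matrices,
    13.0 for attention, 20.0 for embeddings)."""
    clip = k * weight.std(dim=1, keepdim=True)   # per-row clip = k * std(row)
    return weight.clamp(-clip, clip)
```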

Evaluation

Phased TTT (from PR #1851) — multi-phase test-time training with LoRA adaptation:

  • LoRA rank 96, applied to K/MLP/O projections
  • Adam optimizer, lr=0.0001, weight decay=1.0
  • Eval seq len 2048, chunk-based scoring
  • Score-first: tokens scored under torch.no_grad() BEFORE gradient updates
  • Total eval time: ~353-389s (within 600s budget)
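
For clarity, a minimal sketch of the score-first ordering (illustrative; the real phased TTT loop with LoRA lives in PR #1851's evaluation code):

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lora_optimizer):
    """Each chunk is fully scored under no_grad BEFORE any LoRA update (rule C3),
    and every token is scored exactly once (rule C4)."""
    total_nats, total_tokens = 0.0, 0
    for tokens, targets in chunks:
        with torch.no_grad():                                   # score first
            logits = model(tokens)
            total_nats += F.cross_entropy(logits.view(-1, logits.size(-1)),
                                          targets.view(-1), reduction="sum").item()
            total_tokens += targets.numel()
        lora_optimizer.zero_grad()                              # then adapt LoRA params
        logits = model(tokens)
        F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1)).backward()
        lora_optimizer.step()
    return total_nats / total_tokens                            # mean loss in nats/token
```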

Compliance

Per Issue #1017 (Track B):

  • C1 (Causality): Causal eval only — each position scored from prefix tokens.
  • C2 (Normalized): Standard softmax over full vocab.
  • C3 (Score before update): Each chunk fully scored before any LoRA update.
  • C4 (Single pass): Each token scored exactly once.
  • No SLOT, No PreQuantTTT, No ETLB, No n-gram cache.
  • All artifacts under 16,000,000 bytes on all 3 seeds.
  • Training under 600s on all 3 seeds.
  • Eval under 600s on all 3 seeds.

Reproduction

git clone https://github.com/jamesEmerson112/DL-Team-Proposal.git
cd DL-Team-Proposal && git checkout James-experiment
cd parameter-golf

pip install --upgrade torch
pip install brotli sentencepiece numpy python-minifier
pip install huggingface_hub
pip install --no-cache-dir \
  "https://download.pytorch.org/whl/cu130/flash_attn_3-3.0.0-cp39-abi3-manylinux_2_28_x86_64.whl"

# Step 1: Download regular SP8192 data
python3 data/cached_challenge_fineweb.py --variant sp8192 --train-shards 80

# Step 2: Download CaseOps-tokenized data
# Due to a technical issue (regular sp8192 and CaseOps sp8192 datasets conflict
# on disk), CaseOps data was symlinked into the standard paths on our pod.
# The env var caseops_enabled=False, but training used CaseOps-tokenized shards.
MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 \
  MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=datasets \
  python3 data/cached_challenge_fineweb.py \
    --variant sp8192_lossless_caps_caseops_v1_reserved \
    --train-shards 80

# Step 3: Symlink CaseOps data into standard paths
mv data/datasets/fineweb10B_sp8192 data/datasets/fineweb10B_sp8192_regular 2>/dev/null || true
mv data/tokenizers/fineweb_8192_bpe.model data/tokenizers/fineweb_8192_bpe.model.regular 2>/dev/null || true
ln -s fineweb10B_sp8192_lossless_caps_caseops_v1_reserved data/datasets/fineweb10B_sp8192
ln -s fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model data/tokenizers/fineweb_8192_bpe.model

# Step 4: Train (CASEOPS_ENABLED=0 — byte sidecar not used, but data is CaseOps-tokenized)
SEED=42 CASEOPS_ENABLED=0 GATED_ATTN_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=0 \
  EMA_DECAY=0.990 GRAD_ACCUM_STEPS=1 TRAIN_BATCH_TOKENS=196608 EMBED_BITS=6 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Repeat with SEED=1337 and SEED=2025 for 3-seed verification.

Credits

Acknowledgements

  • OpenAI — for hosting the Parameter Golf challenge and the development grant
  • RunPod — for compute credits supporting our 2×H100 and 8×H100 experiments
  • Georgia Tech PACE — for supplementary compute resources
  • CS 7643 Deep Learning at Georgia Tech, taught by Dr. Zsolt Kira
  • @sranganath2 (Sid Ranganathan) — contributed to discussion about tokenizer investigation, nanochat research, fused CE kernel insights, and research papers
  • @Ashray14 — contributed to discussion about research papers
  • @ialeksic3 — contributed to discussion about research papers

Total personal compute cost: ~$1,165 ($640 out-of-pocket + $525 OpenAI development grant) across 130+ experiments on RunPod.

In memory of Moomoo, my cat.

Included Files

  • README.md (this file)
  • submission.json
  • train_gpt.py
  • train_seed42.log
  • train_seed1337.log
  • train_seed2025.log

…ch (1.0066 BPB, 3-seed)

3-seed mean val_bpb = 1.0066 (std 0.0009) on 8xH100 SXM.
Beats SOTA (1.0611, PR openai#1855) by 0.0545 BPB.

Novel contributions:
- Headwise Gated Attention (post-FA3+XSA sigmoid gate per-head)
- EMA Decay = 0.990 (vs default 0.9965)
- Small Batch (ga=1, 196K tokens, 3.3x more optimizer steps)
- 6-bit Embedding Quantization (EMBED_BITS=6)

Base: PR openai#1851 (@aquariouseworkman) + @bigbag PR openai#1493 upstream.
All seeds under 16 MB, training < 600s, eval < 600s.
Score-first TTT, no PreQuantTTT, fully compliant.
@someone114514

I’m having trouble verifying the 1.0066 claim from the submitted logs.

The raw train_seed42.log I can see stops after:

diagnostic pre-quantization post-ema val_bpb:1.00247081
GPTQ:collecting Hessians from calibration data...

I can’t find the final post-GPTQ / artifact reload / TTT eval lines, e.g.:

  • GPTQ:collected ...
  • Serialized model quantized+brotli ...
  • Total submission size quantized+brotli ...
  • diagnostic quantized ...
  • quantized_ttt_phased val_bpb=... eval_time=...

The submission.json reports per-seed TTT BPB and artifact bytes, but I don’t see those values supported by the logs currently attached.

Could you upload the complete per-seed logs showing the final quantized artifact evaluated with TTT on the full validation split?

@andrewbaggio1

What's the MD5 of your data/datasets/fineweb10B_sp8192/fineweb_val_000000.bin?

@jamesEmerson112
Author

jamesEmerson112 commented May 1, 2026

@andrewbaggio1 Thank you for pointing it out. I knew something was missing. That's actually the SP8192 + CaseOps variant mentioned in #1868 and #1855 (and perhaps others).

I'm updating the report now

@jamesEmerson112
Author

@someone114514 Thanks for asking. My logs actually follow the format from #1868. Please refer to that PR for further details.

CaseOps-tokenized data was used via pod symlinks even though
CASEOPS_ENABLED=0 was set. Updated README and submission.json
to accurately reflect this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andrewbaggio1

Also, I'd edit your teammates' names out of the submission; the rules explicitly say "result submission is limited to individuals".

@jamesEmerson112
Author

@andrewbaggio1 Much appreciated, Andrew. We mainly discussed papers, so I've updated the submission accordingly.

@jamesEmerson112
Author

jamesEmerson112 commented May 1, 2026

PR Supplement — Expanded Research Contributions

P3 SOTA: val_bpb = 1.0066 (3-seed mean, std 0.0009) | ~15.97 MB | 8xH100 SXM
Previous C6: val_bpb = 1.0805 (3-seed mean, std 0.0012) | ~15.70 MB


Attempted techniques and benchmarks


Run ID Reference

Internal run IDs used throughout this document, starting from my previous "SOTA" config, C6 (see the README and findings.md for full details). Each ID describes a specific configuration.

| Run ID | Full Name | Description |
|---|---|---|
| C6 | Headwise + emb7+eclip15 | Previous submission config (superseded by P3): @bigbag's full stack + headwise gated attention + int7 embedding quantization (clip σ=15.0). Submitted with 3-seed verification. |
| F1 | Control (no additions) | @bigbag's stack (PR #1493) unmodified. Baseline for ablation. |
| F2 | PR + Headwise Gate | @bigbag's stack + headwise gated attention (default compression). |
| F7 | PR + ResFormer (α=0.5) | @bigbag's stack + ResFormer value residual learning. No gate. |
| A1 | Control (8×H100) | F1 run at competition scale (8×H100). |
| A3 | Headwise (8×H100) | F2 run at competition scale (8×H100, default compression). |
| A2 | PR + ResFormer (8×H100) | F7 run at competition scale (8×H100). |
| E1 | EMA=0.995 | C6 base + more aggressive EMA averaging (decay 0.995 vs default 0.9965). |
| R3 | EMA=0.990 | C6 base + most aggressive EMA averaging tested. Best on 2×H100. |
| L1 | EMA=0.990 (8×H100) | R3 config at competition scale. Did NOT transfer. |
| L2 | Small Batch + EMA (8×H100) | Small batch (196K tokens) + EMA=0.990 at competition scale. |
| B2 | Small Batch (2×H100) | C6 base + grad_accum=1, 196K batch tokens (4× smaller, 3.3× more steps). |
| N1 | EMA + Small Batch (2×H100) | C6 + EMA=0.990 + small batch. Best-ever legal 2×H100 result. |
| N2 | N1 + Differential Attention | N1 config + two-softmax-subtract attention (Paper #19). |
| P1a | SOTA hparams (8×H100) | C6 + 6 hyperparameter overrides from PR #1855 (warmdown, min_lr, clip, beta2). |
| Q0-Q7 | GPTQ tuning runs | Sequential blocks (Q1), embed GPTQ (Q3), all combined (Q7). All worse. |
| F1-F9 | 3×3 factorial | 9-run sweep: {PR, RF, PR+RF} × {No Gate, Headwise, Elementwise}. |
| P3-s42 | PR#1851 + gate + EMA990 + smallbatch + emb6 (seed 42) | SOTA run, 1.0069 TTT BPB, 15,975,827 bytes |
| P3-s1337 | PR#1851 + gate + EMA990 + smallbatch + emb6 (seed 1337) | SOTA run, 1.0057 TTT BPB, 15,973,108 bytes |
| P3-s2025 | PR#1851 + gate + EMA990 + smallbatch + emb6 (seed 2025) | SOTA run, 1.0073 TTT BPB, 15,973,714 bytes |

Base stack ("@bigbag's stack" / "PR #1493 stack"): SP8192 vocabulary, 11L×512d×8H/4KV, 4×MLP with LeakyReLU(0.5)², 3-layer depth recurrence (layers 3-4-5 looped 2×), parallel residuals (layers 7+), sigmoid skip gates, partial RoPE (16/64 dims), XSA on all layers, QK-Gain 5.25, MuonEq-R optimizer, EMA (0.9965), GPTQ int6+brotli, score-first TTT. From @bigbag PR #1493.


Research Contributions

1. Headwise Gated Attention (Novel Architecture)

Post-attention sigmoid gate applied per-head, after FlashAttention-3 + XSA compute the attention output. A learned gate modulates each head's contribution before the output projection:

  • Q projection widened by gate_dim extra dimensions
  • Gate signal extracted from extra Q dims, passed through sigmoid
  • Applied elementwise per-head: attn_out *= gate.unsqueeze(-1)
  • ~50K extra parameters (~0.14% overhead), zero inference latency cost

Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708).
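
A minimal sketch of the mechanism (illustrative shapes and names; gate_dim is assumed to be one logit per head here, while the actual PR uses a wider gate projection, and FA3+XSA is stubbed with scaled_dot_product_attention):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseGatedAttention(nn.Module):
    """Post-attention sigmoid gate per head, fed from extra Q-projection channels."""
    def __init__(self, dim=512, n_heads=8, n_kv=4):
        super().__init__()
        self.n_heads, self.n_kv, self.head_dim = n_heads, n_kv, dim // n_heads
        gate_dim = n_heads                                      # one gate logit per head (assumption)
        self.q_proj = nn.Linear(dim, dim + gate_dim, bias=False)    # Q widened by gate_dim
        self.kv_proj = nn.Linear(dim, 2 * n_kv * self.head_dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q_full = self.q_proj(x)
        q, gate_logits = q_full[..., :-self.n_heads], q_full[..., -self.n_heads:]
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).view(B, T, 2 * self.n_kv, self.head_dim).transpose(1, 2).chunk(2, dim=1)
        k = k.repeat_interleave(self.n_heads // self.n_kv, dim=1)    # GQA: expand KV heads
        v = v.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # stand-in for FA3+XSA
        gate = torch.sigmoid(gate_logits).transpose(1, 2).unsqueeze(-1)  # (B, H, T, 1)
        out = out * gate                                # gate each head's contribution
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```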

Consistent improvement across scales:

| Scale | Headwise (F2/A3) | Control (F1/A1) | Delta |
|---|---|---|---|
| 2×H100 | 1.1636 BPB | 1.1641 BPB | -0.0005 |
| 8×H100 | 1.0801 BPB | 1.0806 BPB | -0.0005 |

The -0.0005 BPB improvement is preserved exactly when scaling from 2×H100 to 8×H100, confirming the technique's robustness.

I also tested elementwise gating (1 gate per dim per head, +2.36M params). It achieved slightly better BPB (1.2602 vs 1.2653 on SP1024) but exceeded the 16 MB budget (17.87 MB). Headwise is the Pareto-optimal choice: nearly free parameters, fits under budget, and provides consistent improvement.

2. 29-Paper Systematic Survey

Surveyed papers from NeurIPS 2024-2025, ICML 2025, ICLR 2025, and ACL 2025 covering attention modifications, normalization strategies, optimizer scheduling, data selection, structured layers, and compression techniques. Each paper was assessed for Parameter Golf feasibility (16 MB / 10-min / 36M-param constraints) and mapped to PG leaderboard presence.

Key finding: most techniques published for 125M+ parameter models do not transfer to the 36M regime. Of 10 papers tested experimentally, 8 produced negative or null results. Only 2 showed gains on the V2 @bigbag stack, and those gains did not survive the jump to 8×H100 on that base. However, the same techniques (EMA=0.990, small batch) transferred successfully on PR #1851's LQER base, producing P3 SOTA (1.0066 BPB). Transfer is stack-dependent.

3. EMA Decay Scaling Law at Short Training Durations

Discovered that optimal EMA decay shifts dramatically lower when training steps are limited. The @bigbag default (0.9965) is suboptimal for short runs (~1,000-4,500 steps).

2×H100 EMA sweep (C6 base, ~1,030 steps):

| EMA Decay | TTT BPB | vs C6 (1.1622) |
|---|---|---|
| 0.9965 (default) | 1.1622 | — |
| 0.995 | 1.1562 | -0.0060 |
| 0.993 | 1.1526 | -0.0096 |
| 0.990 | 1.1505 | -0.0117 |
| 0.997 | 1.1690 | +0.0068 |
| 0.999 | 1.3475 | +0.1853 (catastrophic) |

Gains are monotonic from 0.995 to 0.990 — more aggressive averaging helps when the training window is short. But EMA sensitivity is extreme: 0.997 is already worse than default, and 0.999 is catastrophic.
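
For reference, the EMA update itself is one line; a sketch with the decay values above (not the actual training code):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.990):
    """ema = decay * ema + (1 - decay) * current weights, applied after each step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)

# Rough intuition: the effective averaging window is ~1 / (1 - decay) steps,
# i.e. ~100 steps at decay=0.990 vs ~286 at the 0.9965 default, a much more
# recent average when the whole run is only ~1,030 steps.
```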

Critical caveat — stack-dependent transfer:

| Config | 2×H100 BPB | 8×H100 BPB | Delta vs baseline (8×H100) |
|---|---|---|---|
| C6 (EMA=0.9965, @bigbag stack) | 1.1622 | 1.0805 | — |
| L1 (EMA=0.990, @bigbag stack) | 1.1505 | 1.0830 | +0.0025 (worse) |
| P3 (EMA=0.990, PR #1851 stack) | — | 1.0066 | NEW SOTA |

On @bigbag stack with ~4,486 steps, aggressive EMA averages too few checkpoints. But on PR #1851's LQER base with small batch (12,382 steps), EMA=0.990 contributes to SOTA. The optimal decay depends on both step count and base stack dynamics.

4. 3×3 Factorial Technique Interaction Study

Systematic 9-run factorial experiment on the @bigbag stack to isolate technique interactions. Three factors, three levels each:

  • Residual strategy: Parallel Residuals (PR) only vs ResFormer (RF, α=0.5) only vs PR+RF
  • Gate type: No Gate vs Headwise vs Elementwise

Factor matrix (TTT BPB, 2×H100):

| | No Gate | Headwise | Elementwise |
|---|---|---|---|
| PR only | 1.1641 | 1.1636 | 1.1665 (over budget) |
| RF only | 1.1666 | 1.1661 | 1.1700 (over budget) |
| PR + RF | 1.1636 | 1.1650 | 1.1686 (over budget) |

Key interactions discovered:

  • Headwise gate helps PR (F2 vs F1: -0.0005) but hurts PR+RF (F8 vs F7: +0.0014). The two residual mechanisms compete for the same "residual quality" niche.
  • ResFormer helps when stacked with PR (F7 vs F1: -0.0005) but hurts alone (F4 vs F1: +0.0025). ResFormer is only beneficial as a complement to parallel residuals, not a replacement.
  • Elementwise busts budget in all configurations (+2.9M params → 17.2+ MB). Dead on arrival for 16 MB submissions.

This factorial design revealed that technique interactions are non-trivial — you cannot predict combined performance from individual ablations.

5. Small Batch Size for Short Wall-Clock Training

Tested reducing effective batch size to get more optimizer updates within the fixed 10-minute window (inspired by Liao et al., NeurIPS 2024).

2×H100 results (C6 base):

| Config | Batch Tokens | Steps | TTT BPB | vs C6 |
|---|---|---|---|---|
| C6 (default) | 786,432 | ~1,030 | 1.1622 | — |
| B2 (small batch) | 196,608 | 3,349 | 1.1419 | -0.0203 |

4× smaller batch → 3.3× more steps → -0.020 BPB improvement. The largest single-technique gain I found on the V2 stack.
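
The step arithmetic behind that row (per-step overhead is why the ratio is ~3.3× rather than the full 4×):

```python
# 2xH100 numbers from the table above.
default_tokens, small_tokens = 786_432, 196_608
default_steps, small_steps = 1_030, 3_349
print(default_tokens / small_tokens)    # 4.0   -> batch is 4x smaller
print(small_steps / default_steps)      # ~3.25 -> ~3.3x more optimizer steps per wall clock
```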

Stack-dependent transfer to 8×H100:

| Config | 2×H100 BPB | 8×H100 BPB | Delta vs baseline (8×H100) |
|---|---|---|---|
| C6 (default batch, @bigbag stack) | 1.1622 | 1.0805 | — |
| L2 (small batch + EMA, @bigbag stack) | 1.1368 | 1.0926 | +0.0121 (worse) |
| P3 (small batch + EMA, PR #1851 stack) | — | 1.0066 | NEW SOTA |

On @bigbag stack, small batch hurts at 8×H100 scale. But on PR #1851's LQER base, small batch + EMA=0.990 combine to produce SOTA (12,382 steps, 1.0066 BPB). The transfer depends on the base stack — LQER's quantization-aware residuals create a regime where frequent updates compound rather than conflict.

6. Technique Transfer Across GPU Counts and Stacks

Perhaps the most important meta-finding: hyperparameter improvements on 2×H100 do not reliably transfer to 8×H100 — transfer depends on both GPU count and base stack.

| Technique | 2×H100 Delta | 8×H100 Delta (@bigbag) | 8×H100 Delta (PR #1851) | Transferred? |
|---|---|---|---|---|
| Headwise Gated Attention | -0.0005 | -0.0005 | part of P3 SOTA | Yes (always) |
| EMA=0.990 | -0.0117 | +0.0025 | part of P3 SOTA | Stack-dependent |
| Small Batch (ga=1, 196K) | -0.0203 | +0.0121 | part of P3 SOTA | Stack-dependent |
| EMA + Small Batch | -0.0254 | +0.0121 | 1.0066 (SOTA) | Stack-dependent |

Architectural changes (headwise gate) transfer perfectly — the delta is preserved across scales and stacks. Training hyperparameters (EMA decay, batch size) are stack-dependent: they failed on the @bigbag stack at 8×H100 (L1, L2) but succeeded on PR #1851's LQER base (P3 = 1.0066 SOTA). The difference is that PR #1851's LQER creates a training dynamic where aggressive EMA and frequent updates compound — the same hyperparameters that conflict on one stack can synergize on another.

Updated implication: Hyperparameter transfer depends on both GPU count AND base stack. The @bigbag → PR #1851 transition changed the optimal EMA/batch regime. Architecture changes remain safe to prototype on fewer GPUs.

7. GPTQ Compression Analysis

Systematic comparison of our GPTQ implementation vs leaderboard leaders revealed a 5× quality gap (+0.05 BPB vs +0.01 for Kevin Clark (then-rank 5) / dexhunter (then-rank 7)). Root cause analysis identified 5 compounding factors:

  1. Depth recurrence — looped layers share weights, reducing unique matrices to quantize (fewer surfaces = less total error)
  2. Quantization-Aware Training (QAT) — leaderboard models train expecting quantization; ours gets shocked post-hoc
  3. Higher weight decay (0.085-0.090) — produces smaller weights that compress better under brotli
  4. EMA — averaging out training noise makes weights smoother and more compressible
  5. register_forward_hook Hessian collection — captures true activation statistics through the live network

The leaderboard doesn't just "use GPTQ" — they use GPTQ as the final step of a quantization-aware pipeline (WD tuning → EMA → QAT → GPTQ → brotli). I am only doing the last two steps.
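
A sketch of what factor 5 (hook-based Hessian collection) looks like in practice (illustrative; not the submission's calibration code):

```python
import torch
import torch.nn as nn

def attach_hessian_hooks(model: nn.Module, target_names):
    """Accumulate GPTQ Hessians H = sum x x^T via forward hooks on live Linear layers,
    so the statistics reflect the true activations of the full-precision network."""
    hessians, handles = {}, []
    for name, module in model.named_modules():
        if name in target_names and isinstance(module, nn.Linear):
            hessians[name] = torch.zeros(module.in_features, module.in_features)

            def hook(mod, inputs, output, name=name):
                x = inputs[0].reshape(-1, mod.in_features).float()  # (tokens, in_features)
                hessians[name] += x.t() @ x                         # accumulate X^T X
            handles.append(module.register_forward_hook(hook))
    return hessians, handles  # run calibration batches, then .remove() each handle
```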


Techniques That Failed (Expanded)

Tested on the V2 @bigbag stack (36M params, SP8192, 10-min wall clock). All produced negative results.

| # | Technique | Paper | Result | Why It Failed |
|---|---|---|---|---|
| 1 | SLM / Rho-1 | NeurIPS 2024 | All ratios worse (+0.002 to +0.155 BPB) | At 17M params, the model hasn't mastered basic tokens yet — skipping them removes gradient signal the model genuinely needs. Paper tested at 1B+. No reference model means I can't distinguish learnable (H→L) from unlearnable (H→H) tokens. Fixed wall clock means fewer effective tokens per step = worse model. |
| 2 | ResFormer (Value Residual) | ACL 2025 | +0.0022 BPB on 8×H100 (context-dependent) | Works on our V1 stack (-0.0048 BPB, α=0.5 optimal) where it provides a gradient highway. Fails on V2 stack because parallel residuals already provide that highway — the two mechanisms are redundant. |
| 3 | LR Warmup | NeurIPS 2024 | +0.0024 to +0.0066 (monotonically worse with more warmup) | MuonEq-R has its own momentum warmup; extra LR ramp wastes precious steps in a 10-min window. |
| 4 | Structured FFN | NeurIPS 2024 | +0.04 to +0.05 BPB | Low-rank (r=0.5-0.75) + block-diagonal saves 30-56% MLP params but the approximation is too lossy at 36M. Paper tested at 125M+ where redundancy exists. |
| 5 | Peri-LN | ICML 2025 | Immediate NaN | Output RMSNorm on attention + MLP conflicts with existing attn_scale/mlp_scale + ln_scale_factor in the @bigbag stack. |
| 6 | Differential Attention | ICLR 2025 Oral | +0.0138 BPB | Two-softmax-subtract attention requires 2× FlashAttention-3 calls per layer, reducing throughput by 22% (3,292 steps vs 4,221). The throughput penalty outweighs attention quality at 36M scale. |
| 7 | HybridNorm | NeurIPS 2025 | +0.011 BPB | V-norm + Post-Norm FFN hurt on rank 4 stack. The stack is already heavily normalized (Q/K-norm, ln_scale_factor, resid_mix, attn_scale/mlp_scale). Adding more normalization conflicts — the normalization axis is closed. |
| 8 | GPTQ Sequential / Embed | GPTQ tuning (Frantar et al., ICLR 2023) | +0.19 to +0.66 BPB (dramatically worse) | Sequential block quantization: Hessians collected through dequantized blocks are inferior to full-precision Hessians. Embedding GPTQ: frequency-weighted column correlation is the wrong Hessian for lookup tables (embeddings aren't linear projections). Combined: errors compound catastrophically. |

Meta-finding: 8 of 10 tested papers produced negative results at the 36M-parameter scale. The 36M / 16 MB / 10-min constraint regime fundamentally changes which optimizations matter. Techniques designed for 125M+ parameter models with large compute budgets are not "free gains" at small scale — they often interact negatively with the aggressive training stack (MuonEq-R, depth recurrence, parallel residuals, XSA) that already exists.


Experiment Scale

| Metric | Value |
|---|---|
| Total experiments | 130+ |
| 2×H100 sessions | 8 (Sessions 3-8, 11-17) |
| 8×H100 sessions | 4 (Sessions 14, 15, 18-19) |
| Env configs created | 21 |
| Run scripts created | 6 |
| Papers surveyed | 29 |
| Papers tested experimentally | 10 |
| Total compute spend | ~$1,165 on RunPod |

BPB Progression

| Milestone | BPB | Config |
|---|---|---|
| V1 SP1024 baseline (Run 6) | 1.2667 | 9L×512d, GQA, SP1024, int8+zlib |
| V1 SP8192 best (Run 11, 3-seed) | 1.2073 | 9L×448d, headwise gate, SP8192, int8+zlib |
| V2 fork of @bigbag (F2) | 1.1636 | 11L×512d, full stack + headwise gate |
| V2 best 2×H100 (N1) | 1.1368 | C6 + EMA=0.990 + Small Batch |
| C6 submission (3-seed mean) | 1.0805 | V2 + headwise gate + emb7+eclip15, 8×H100 |
| P3 SOTA (3-seed mean) | 1.0066 | PR #1851 fork + headwise gate + EMA=0.990 + small batch + emb6, 8×H100 |

2×H100 → 8×H100 Scaling

Consistent improvement across all V2 configurations:

| Config | 2×H100 BPB | 8×H100 BPB | Improvement |
|---|---|---|---|
| F1 (control) | 1.1641 | 1.0806 | -0.0835 |
| F2 (headwise) | 1.1636 | 1.0801 | -0.0835 |
| C6 (headwise + emb7+eclip15) | 1.1622 | 1.0818 | -0.0804 |

Technique deltas are preserved across scales for architectural changes. ~0.083 BPB consistent scaling factor from 4× GPU count increase.


3-Seed Reproducibility

P3 (SOTA — PR #1851 fork + headwise gate + EMA=0.990 + small batch + emb6)

| Seed | Pre-Q BPB | Quant BPB | TTT BPB | Artifact |
|---|---|---|---|---|
| 42 | 1.0025 | 1.0205 | 1.0069 | 15,975,827 bytes |
| 1337 | 1.0017 | 1.0190 | 1.0057 | 15,973,108 bytes |
| 2025 | 1.0030 | 1.0206 | 1.0073 | 15,973,714 bytes |
| Mean | 1.0024 | 1.0200 | 1.0066 | 15,974,216 bytes |
| Std | 0.0007 | 0.0009 | 0.0009 | |

C6 (Previous — @bigbag fork + headwise gate + emb7+eclip15)

| Seed | Sliding BPB | TTT BPB | Artifact |
|---|---|---|---|
| 42 | 1.0834 | 1.0818 | 15,697,552 bytes |
| 1337 | 1.0810 | 1.0794 | 15,694,065 bytes |
| 2025 | 1.0820 | 1.0804 | 15,693,855 bytes |
| Mean | 1.0821 | 1.0805 | 15,695,157 bytes |
| Std | 0.0012 | 0.0012 | |

All seeds: artifact under 16 MB, training under 600s, eval under 600s.

jamesEmerson112 and others added 2 commits April 30, 2026 21:49
Due to a technical issue where regular SP8192 and CaseOps SP8192
datasets conflict on disk, CaseOps data was symlinked into the
standard paths. Added exact download + symlink commands so judges
can reproduce the same setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sharpobject

hello, the below content is AI-generated so you should probably disregard it.

PR #2071 Legality Feedback

I think the main legality concern is byte accounting, not non-normalized
next-token distributions.

PR #2071 appears to evaluate CaseOps-tokenized validation shards while
CASEOPS_ENABLED=0. The README says CaseOps data was symlinked into the
standard SP8192 paths, but the code only loads the CaseOps byte sidecar when
CASEOPS_ENABLED=1. With CASEOPS_ENABLED=0, the evaluator falls back to
tokenizer-derived byte counting via build_sentencepiece_luts(...).

That means the reported BPB is likely using bytes inferred from the transformed
CaseOps token stream, not the canonical raw FineWeb validation byte count.

Relevant log evidence from train_seed42.log:

caseops_enabled: False
datasets_dir: ./data/datasets/fineweb10B_sp8192
tokenizer_path: ./data/tokenizers/fineweb_8192_bpe.model
val_tokens: 47851520
quantized_ttt_phased val_loss:2.40073153 val_bpb:1.00692894

Those values imply a denominator of about 164,594,398 bytes:

bytes = val_loss / ln(2) * val_tokens / val_bpb
      = 2.40073153 / ln(2) * 47,851,520 / 1.00692894
      ~= 164,594,398

If the canonical raw eval byte count is 153,880,891, the same loss and token
count would give:

2.40073153 / ln(2) * 47,851,520 / 153,880,891 ~= 1.0770 BPB

So the claimed 1.0069 looks like it may be an artifact of an inflated byte
denominator. This would be a serious issue because the submission changes the
effective tokenizer/data representation and therefore needs to prove BPB is
computed against the correct raw-byte denominator.
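
The same arithmetic in a couple of lines, for anyone who wants to re-run it (using only the quoted log values; 153,880,891 is the canonical byte count referenced above):

```python
import math

val_loss, val_tokens, reported_bpb = 2.40073153, 47_851_520, 1.00692894
implied_bytes = val_loss / math.log(2) * val_tokens / reported_bpb
print(f"implied denominator: {implied_bytes:,.0f} bytes")                            # ~164.6M
print(f"BPB vs raw bytes: {val_loss / math.log(2) * val_tokens / 153_880_891:.4f}")  # ~1.0770
```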

I did not find an obvious probability-normalization issue. The eval path uses
F.cross_entropy(...), and the fused training CE computes a normal log-sum-exp
over the full vocab. I also did not find an obvious score-after-update leak in
the main phased TTT loop: it computes per_tok_loss, accumulates/scores it, and
then applies the LoRA update.

The byte denominator issue seems sufficient to make the submitted score
suspect unless they can show that the sidecar/canonical raw-byte accounting was
actually used or that 164,594,398 is the correct canonical denominator for this
validation set.

@jamesEmerson112
Author

Thank you, @sharpobject. Let me look into this.

@jamesEmerson112
Author

@sharpobject I'm still actively investigating this.
@someone114514 sharpobject found the issue, so I'm validating the data and tokenizer first; after that I'll improve the log format for you.

Thanks, guys

@jamesEmerson112
Author

jamesEmerson112 commented May 1, 2026

@sharpobject You are correct, Robert. The byte inflation is there. I'm rerunning multiple times to confirm.

@someone114514 Unfortunately, Robert is correct. The BPB value may not be as accurate as I thought. I'm rerunning multiple times to confirm.

P.S.: I knew it was too good to be true, and I had barely slept for two days before this submission.

@jamesEmerson112
Author

jamesEmerson112 commented May 1, 2026

Odd though... the number of steps is still absurdly high
p3_corrected_seed42.txt
p3_corrected_seed42_incomplete.txt
p3_corrected_seed1337.txt

Opus 4.7:

  The ~12,140 steps in P3-fix come entirely from our small batch override (TRAIN_BATCH_TOKENS=196608). That was James's Paper #15 contribution, not part of PR #1851's base config.

  So the step counts line up:
  - Default batch (786K): ~4,500-5,000 steps → C6, P1a, PR #1851, SOTA
  - Small batch (196K): ~12,000 steps → P3 only (our addition, didn't help)

@jamesEmerson112
Author

quantized_ttt_phased val_loss:2.37890832 val_bpb:0.99777572 eval_time:443623ms
total_eval_time:443.6s

The low val_bpb comes back again after fixing the bug. Currently looking into this.
p4_clip12.txt

@jamesEmerson112
Author

I reset the pod to rerun after the SP8192+CaseOps bug, and still got that number returned again:
2500/20000 train_loss: 2.8979 train_time: 9.7m tok/s: 3377557
2500/20000 val_loss: 2.8573 val_bpb: 1.0096
2542/20000 val_loss: 2.8535 val_bpb: 1.0082
stopping_early: wallclock_cap train_time: 602920ms step: 2542/20000
peak memory allocated: 41587 MiB reserved: 46952 MiB

Though this is BEFORE evaluation (a missing library stops the run prematurely).

p4_clip12_reset.txt

I must head to the Gemini workshop; I'll see if I can continue this remotely.

