Record: SP8192 + Headwise Gate + EMA 0.990 + Small Batch (1.0066 BPB, 3-seed) #2071
Conversation
3-seed mean val_bpb = 1.0066 (std 0.0009) on 8xH100 SXM. Beats SOTA (1.0611, PR openai#1855) by 0.0545 BPB.

Novel contributions:
- Headwise Gated Attention (post-FA3+XSA sigmoid gate per-head)
- EMA Decay = 0.990 (vs default 0.9965)
- Small Batch (ga=1, 196K tokens, 3.3× more optimizer steps)
- 6-bit Embedding Quantization (EMBED_BITS=6)

Base: PR openai#1851 (@aquariouseworkman) + @bigbag PR openai#1493 upstream. All seeds under 16 MB, training < 600s, eval < 600s. Score-first TTT, no PreQuantTTT, fully compliant.
|
I’m having trouble verifying the 1.0066 claim from the submitted logs. The raw train_seed42.log I can see stops after:

```
diagnostic pre-quantization post-ema val_bpb:1.00247081
```

I can’t find the final post-GPTQ / artifact reload / TTT eval lines, e.g.:
The submission.json reports per-seed TTT BPB and artifact bytes, but I don’t see those values supported by the logs currently attached. Could you upload the complete per-seed logs showing the final quantized artifact evaluated with TTT on the full validation split? |
|
what's your data/datasets/fineweb10B_sp8192/fineweb_val_000000.bin MD5? |
|
@andrewbaggio1 Thank you for pointing it out. I knew something was missing. That's actually the SP8192 + CaseOps issue mentioned by #1868 and #1855 (and perhaps more). I'm updating the report now
|
@someone114514 Thanks for asking. My logs actually follow the format from #1868. Please refer to that PR for further details
CaseOps-tokenized data was used via pod symlinks even though CASEOPS_ENABLED=0 was set. Updated README and submission.json to accurately reflect this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
also i'd edit your teammates' names out of the submission, the rules explicitly say "result submission is limited to individuals"
|
@andrewbaggio1 Much appreciated, Andrew. We mainly discussed papers, so I've updated the submission accordingly
PR Supplement — Expanded Research Contributions
P3 SOTA: val_bpb = 1.0066 (3-seed mean, std 0.0009) | ~15.97 MB | 8xH100 SXM

Attempted techniques and benchmarks

Run ID Reference
Internal run IDs are used throughout this document (starting from my previous "SOTA", which is C6; see the README and findings.md for details). Each ID describes a specific configuration.
Base stack ("@bigbag's stack" / "PR #1493 stack"): SP8192 vocabulary, 11L×512d×8H/4KV, 4×MLP with LeakyReLU(0.5)², 3-layer depth recurrence (layers 3-4-5 looped 2×), parallel residuals (layers 7+), sigmoid skip gates, partial RoPE (16/64 dims), XSA on all layers, QK-Gain 5.25, MuonEq-R optimizer, EMA (0.9965), GPTQ int6+brotli, score-first TTT. From @bigbag PR #1493.

Research Contributions

1. Headwise Gated Attention (Novel Architecture)
Post-attention sigmoid gate applied per head, after FlashAttention-3 + XSA compute the attention output. A learned gate modulates each head's contribution before the output projection:
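A minimal PyTorch sketch of the idea (illustrative only: module and variable names are mine, standard scaled-dot-product attention stands in for the FA3+XSA kernel, the gate logits are folded into a fused QKV projection rather than the README's widened Q projection, and the 8H/4KV grouped-query detail is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadwiseGatedAttention(nn.Module):
    """Attention with a learned per-head sigmoid gate on the output."""

    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        # Widen the fused QKV projection by n_heads channels: one gate logit
        # per head (on the order of dim * n_heads extra parameters per layer).
        self.qkv = nn.Linear(dim, 3 * dim + n_heads, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v, g = self.qkv(x).split([C, C, C, self.n_heads], dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Stand-in for the FA3+XSA attention kernel.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Per-head sigmoid gate applied before the output projection.
        gate = torch.sigmoid(g).transpose(1, 2).unsqueeze(-1)  # (B, H, T, 1)
        y = (y * gate).transpose(1, 2).reshape(B, T, C)
        return self.out(y)
```

With dim=512, 8 heads, and 11 layers, this widening adds roughly 512 × 8 × 11 ≈ 45K parameters, consistent with the ~50K figure quoted later in this PR.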
Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708). Consistent improvement across scales:
The -0.0005 BPB improvement is preserved exactly when scaling from 2×H100 to 8×H100, confirming the technique's robustness. I also tested elementwise gating (1 gate per dim per head, +2.36M params). It achieved slightly better BPB (1.2602 vs 1.2653 on SP1024) but exceeded the 16 MB budget (17.87 MB). Headwise is the Pareto-optimal choice: nearly free parameters, fits under budget, and provides consistent improvement.

2. 29-Paper Systematic Survey
Surveyed papers from NeurIPS 2024-2025, ICML 2025, ICLR 2025, and ACL 2025 covering attention modifications, normalization strategies, optimizer scheduling, data selection, structured layers, and compression techniques. Each paper was assessed for Parameter Golf feasibility (16 MB / 10-min / 36M-param constraints) and mapped to PG leaderboard presence.
Key finding: most techniques published for 125M+ parameter models do not transfer to the 36M regime. Of 10 papers tested experimentally, 8 produced negative or null results. Only 2 showed gains on the V2 @bigbag stack, and those gains did not survive the jump to 8×H100 on that base. However, the same techniques (EMA=0.990, small batch) transferred successfully on PR #1851's LQER base, producing P3 SOTA (1.0066 BPB). Transfer is stack-dependent.

3. EMA Decay Scaling Law at Short Training Durations
I discovered that the optimal EMA decay shifts dramatically lower when training steps are limited. The @bigbag default (0.9965) is suboptimal for short runs (~1,000-4,500 steps).
2×H100 EMA sweep (C6 base, ~1,030 steps):
Gains are monotonic from 0.995 to 0.990 — more aggressive averaging helps when the training window is short. But EMA sensitivity is extreme: 0.997 is already worse than default, and 0.999 is catastrophic. Critical caveat — stack-dependent transfer:
On the @bigbag stack with ~4,486 steps, aggressive EMA averages too few checkpoints. But on PR #1851's LQER base with small batch (12,382 steps), EMA=0.990 contributes to SOTA. The optimal decay depends on both step count and base-stack dynamics.
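For concreteness, a minimal sketch of the EMA update assumed throughout this section (the stack's actual EMA hook may differ):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.990):
    # Effective averaging window is roughly 1 / (1 - decay):
    # ~100 steps at decay=0.990 vs ~286 steps at the default 0.9965,
    # which is why a lower decay suits short (~1,000-4,500 step) runs.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```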
4. 3×3 Factorial Technique Interaction Study
A systematic 9-run factorial experiment on the @bigbag stack to isolate technique interactions. Three factors, three levels each:

Factor matrix (TTT BPB, 2×H100):
Key interactions discovered:
This factorial design revealed that technique interactions are non-trivial — you cannot predict combined performance from individual ablations.

5. Small Batch Size for Short Wall-Clock Training
I tested reducing the effective batch size to get more optimizer updates within the fixed 10-minute window (inspired by Liao et al., NeurIPS 2024).
2×H100 results (C6 base):
4× smaller batch → 3.3× more steps → -0.020 BPB improvement. The largest single-technique gain I found on the V2 stack.

Stack-dependent transfer to 8×H100:
On the @bigbag stack, small batch hurts at 8×H100 scale. But on PR #1851's LQER base, small batch + EMA=0.990 combine to produce SOTA (12,382 steps, 1.0066 BPB). The transfer depends on the base stack — LQER's quantization-aware residuals create a regime where frequent updates compound rather than conflict.

6. Technique Transfer Across GPU Counts and Stacks
Perhaps the most important meta-finding: hyperparameter improvements on 2×H100 do not reliably transfer to 8×H100 — transfer depends on both GPU count and base stack.
Architectural changes (headwise gate) transfer perfectly — the delta is preserved across scales and stacks. Training hyperparameters (EMA decay, batch size) are stack-dependent: they failed on the @bigbag stack at 8×H100 (L1, L2) but succeeded on PR #1851's LQER base (P3 = 1.0066 SOTA). The difference is that PR #1851's LQER creates a training dynamic where aggressive EMA and frequent updates compound — the same hyperparameters that conflict on one stack can synergize on another. Updated implication: hyperparameter transfer depends on both GPU count AND base stack. The @bigbag → PR #1851 transition changed the optimal EMA/batch regime. Architecture changes remain safe to prototype on fewer GPUs.

7. GPTQ Compression Analysis
A systematic comparison of our GPTQ implementation vs leaderboard leaders revealed a 5× quality gap (+0.05 BPB vs +0.01 for Kevin Clark (then-rank 5) / dexhunter (then-rank 7)). Root cause analysis identified 5 compounding factors:
The leaderboard leaders don't just "use GPTQ" — they use GPTQ as the final step of a quantization-aware pipeline (WD tuning → EMA → QAT → GPTQ → brotli). I am only doing the last two steps.

Techniques That Failed (Expanded)
Tested on the V2 @bigbag stack (36M params, SP8192, 10-min wall clock). All produced negative results.
Meta-finding: 8 of 10 tested papers produced negative results at the 36M-parameter scale. The 36M / 16 MB / 10-min constraint regime fundamentally changes which optimizations matter. Techniques designed for 125M+ parameter models with large compute budgets are not "free gains" at small scale — they often interact negatively with the aggressive training stack (MuonEq-R, depth recurrence, parallel residuals, XSA) that already exists.

Experiment Scale
BPB Progression
2×H100 → 8×H100 Scaling
Consistent improvement across all V2 configurations:
Technique deltas are preserved across scales for architectural changes, with a consistent ~0.083 BPB shift from the 4× GPU-count increase.

3-Seed Reproducibility
P3 (SOTA — PR #1851 fork + headwise gate + EMA=0.990 + small batch + emb6)
C6 (Previous — @bigbag fork + headwise gate + emb7+eclip15)
All seeds: artifact under 16 MB, training under 600s, eval under 600s. |
Due to a technical issue where regular SP8192 and CaseOps SP8192 datasets conflict on disk, CaseOps data was symlinked into the standard paths. Added exact download + symlink commands so judges can reproduce the same setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
hello, the below content is AI-generated so you should probably disregard it.

PR #2071 Legality Feedback
I think the main legality concern is byte accounting, not non-normalized probabilities.

PR #2071 appears to evaluate CaseOps-tokenized validation shards while reporting BPB against the byte count of those shards. That means the reported BPB is likely using bytes inferred from the transformed data rather than the canonical raw validation bytes.

The relevant log evidence implies a denominator of about 164,594,398 bytes. If the canonical raw eval byte count is 153,880,891, the same loss and token counts would give a noticeably higher BPB, so the claimed 1.0066 understates the true score.

I did not find an obvious probability-normalization issue in the eval path. The byte denominator issue alone seems sufficient to make the submitted score invalid as reported. |
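A back-of-envelope check of the arithmetic in the comment above (a sketch: it uses only the two byte counts quoted there and assumes the summed loss is unchanged, so BPB scales inversely with the byte denominator):

```python
# If the same summed loss is divided by the canonical byte count instead of
# the inflated one, BPB scales by the ratio of the two denominators.
reported_bpb = 1.0066
inflated_bytes = 164_594_398    # denominator implied by the logs (per the comment)
canonical_bytes = 153_880_891   # canonical raw eval byte count (per the comment)
corrected_bpb = reported_bpb * inflated_bytes / canonical_bytes
print(round(corrected_bpb, 4))  # ~1.0767, above the previous 1.0611 SOTA
```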
|
Thank you, @sharpobject. Let me look into this. |
|
@sharpobject I'm still actively investigating this. Thanks, guys |
|
@sharpobject You are correct, Robert. The byte inflation is there. I'm running multiple times to confirm.

@someone114514 Unfortunately, Robert is correct. The BPB value may not be as accurate as I thought. I'm running multiple times to confirm.

P/s: I knew it was too good, and I had barely slept for 2 days before this submission
|
Odd though... the number of steps is still absurdly high. Opus 4.7:
|
```
quantized_ttt_phased val_loss:2.37890832 val_bpb:0.99777572 eval_time:443623ms
```

Low val_bpb comes back again after fixing the bug. Currently looking into this
|
I reset the pod to rerun after the SP8192+CaseOps bug, and still got that number returned again:

Though this is BEFORE evaluation (a missing library stops the run prematurely). I must head to the Gemini workshop.
Record: SP8192 + PR#1851 Fork + Headwise Gate + EMA 0.990 + Small Batch + Emb6
val_bpb = 1.0066 (3-seed mean, std 0.0009) | ~15.97 MB | 8xH100 SXM
Record submission — beats previous SOTA (1.0611, PR #1855) by 0.0545 BPB.
No PPM. No PreQuantTTT.
3-Seed Results
Author & Research Approach
An Thien Vo (James Emerson Vo) — Georgia Tech, CS 7643 Deep Learning.
This submission forks PR #1851 (@aquariouseworkman) and adds 4 novel contributions discovered through a systematic research effort: 29+ papers surveyed, 40+ experiments across 2×H100 and 8×H100, and careful ablation to identify which techniques transfer to the extreme compression regime of Parameter Golf.
Novel Contributions
Headwise Gated Attention — Post-attention sigmoid gate applied per head after FA3+XSA. The Q projection is widened by `gate_dim`; the gate modulates each head's contribution before the output projection. ~50K extra parameters, zero inference cost, consistent -0.0005 BPB improvement across scales. Inspired by NeurIPS 2025 Best Paper (arxiv:2505.06708).
EMA Decay = 0.990 — The optimal EMA decay shifts dramatically lower when training steps are limited. Default 0.9965 → optimal 0.990 on 8×H100: more aggressive weight averaging captures better training signal when the training window is fixed at 10 minutes.
Small Batch (ga=1, 196K tokens) — Reducing effective batch size from 786K to 196K tokens yields 3.3× more optimizer steps in the same wall clock. On 8×H100, this enables 12,382 steps vs ~4,500 with default batch size, giving the optimizer more fine-grained updates.
6-bit Embedding Quantization — Reducing `EMBED_BITS` from 8 to 6 saves ~1 MB on the compressed artifact, enabling headwise gated attention's extra parameters to fit under the 16 MB budget. Costs ~0.013 BPB in quantization gap but enables the complete technique stack.

Key Techniques
Base Stack
PR #1851 (@aquariouseworkman)
Extends @bigbag's PR #1493 with:
- LQER asymmetric error correction (`lqer_enabled=True`, `lqer_asym_enabled=True`)
- Fused softcapped CE kernel (`fused_ce_enabled=True`)
- Smear gate disabled (`smear_gate_enabled=False`)
- Sparse attention gate disabled (`sparse_attn_gate_enabled=False`)
- CaseOps nominally disabled (`caseops_enabled=False`, but pod data paths pointed to CaseOps-tokenized shards)

Upstream: @bigbag PR #1493
Techniques That Failed
Tested on the V2 rank 1 stack. All produced negative results at the 36M-parameter scale.
Architecture
11L × 512d × 8H/4KV, MLP 4×, LeakyReLU(0.5)², partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4], decoder [5,3,4,5,6,7,8,9,10] (enabled at frac=0.35). Parallel residuals from layer 8. Skip gates. Headwise gated attention: Q widened by gate_dim, sigmoid gate per-head after FA3+XSA. LQER asymmetric int4 error correction from PR #1851. Fused softcapped CE kernel (Triton).
Total parameters: ~35.99M.
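A minimal sketch of how such a depth-recurrence schedule can be applied (illustrative only; the real train_gpt.py wiring, including the frac=0.35 enablement, is more involved):

```python
# Layer-index schedules from the architecture description above; layers 3-5
# are revisited, which implements the depth recurrence via weight sharing.
ENCODER_SCHEDULE = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER_SCHEDULE = [5, 3, 4, 5, 6, 7, 8, 9, 10]

def forward_with_recurrence(x, layers, schedule):
    # `layers` is a list/ModuleList of transformer blocks; parameters are
    # reused whenever an index repeats in the schedule.
    for i in schedule:
        x = layers[i](x)
    return x
```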
Training
MuonEq-R optimizer for matrix params, AdamW for embeddings/scalars.
`GRAD_ACCUM_STEPS=1` (8 GPUs), `TRAIN_BATCH_TOKENS=196,608` (small batch), ~12,382 steps in ~596s on 8×H100 SXM (PyTorch 2.11, CUDA 13.0). Linear warmdown over final 75%. EMA decay 0.990. Weight decay: Muon WD=0.095, Embed WD=0.085, Adam WD=0.02.
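A sketch of this configuration (env-var names as quoted above; the warmdown helper is my own illustrative reconstruction of "linear warmdown over final 75%"):

```python
GRAD_ACCUM_STEPS = 1          # ga=1: no gradient accumulation across the 8 GPUs
TRAIN_BATCH_TOKENS = 196_608  # small batch: ~196K tokens vs the ~786K default
EMA_DECAY = 0.990             # vs the upstream default 0.9965

def warmdown_lr_scale(step, total_steps=12_382, warmdown_frac=0.75):
    # Constant LR for the first 25% of steps, then linear decay to zero
    # over the final 75% of training.
    start = int(total_steps * (1.0 - warmdown_frac))
    if step < start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - start))
```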
Full-Hessian GPTQ with SDClip:
- Per-row clipping: `clip = k * std(row)`
- Matrix weights (`MATRIX_CLIP_SIGMAS=12.85`, `ATTN_CLIP_SIGMAS=13.0`)
- Embeddings (`EMBED_BITS=6`, `EMBED_CLIP_SIGMAS=20.0`)
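A sketch of the SDClip rule and integer grid (illustrative only: it omits the full-Hessian GPTQ error propagation and shows just the per-row clipping plus symmetric rounding):

```python
import torch

def sdclip_quantize(w: torch.Tensor, k: float, bits: int):
    # SDClip: clip each row at k standard deviations (clip = k * std(row)).
    clip = k * w.std(dim=1, keepdim=True)
    qmax = 2 ** (bits - 1) - 1                   # e.g. 31 for 6-bit
    scale = clip.clamp_min(1e-8) / qmax          # per-row quantization step
    q = torch.round(w.clamp(-clip, clip) / scale).clamp(-qmax - 1, qmax)
    return q.to(torch.int8), scale               # dequantize: q.float() * scale
```

For example, `sdclip_quantize(embed_weight, k=20.0, bits=6)` would correspond to the embedding settings listed above.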
Phased TTT (from PR #1851) — multi-phase test-time training with LoRA adaptation:
Each chunk is scored under `torch.no_grad()` BEFORE gradient updates (score-first)
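A minimal sketch of that score-first ordering (a hypothetical loop; the LoRA adapters, phase structure, and chunking details from PR #1851 are omitted):

```python
import torch

def score_first_ttt(model, chunks, optimizer, loss_fn):
    total_loss, total_tokens = 0.0, 0
    for x, y in chunks:
        with torch.no_grad():              # score BEFORE any update on this chunk
            total_loss += loss_fn(model(x), y).item() * y.numel()
            total_tokens += y.numel()
        loss_fn(model(x), y).backward()    # then adapt on the chunk just scored
        optimizer.step()
        optimizer.zero_grad()
    return total_loss / total_tokens       # mean loss; convert to BPB downstream
```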
Per Issue #1017 (Track B):
Reproduction
Repeat with `SEED=1337` and `SEED=2025` for 3-seed verification.
Acknowledgements
Total personal compute cost: ~$1,165 ($640 out-of-pocket + $525 OpenAI development grant) across 130+ experiments on RunPod.
In memory of Moomoo, my cat.
Included Files
- README.md (this file)
- submission.json
- train_gpt.py
- train_seed42.log
- train_seed1337.log
- train_seed2025.log