
Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.090 (3-seed mean) - PiyushDatta#2106

Open
PiyushDatta wants to merge 3 commits into openai:main from PiyushDatta:submission-piyushdatta

Conversation


@PiyushDatta PiyushDatta commented May 1, 2026

Overview

Author: @PiyushDatta
Submission date: April 30, 2026

This probably does not beat the current number 1 spot, but I wanted to share it regardless. The code is not thoroughly reviewed (sorry!), but it runs on 4xA100 and 8xH100 SXM. You should (most likely and hopefully) be able to reproduce the results, though the code may not be 100% correct or reviewed (apologies).

Dependencies: sentencepiece, brotli

To reproduce, read
records/track_10min_16mb/2026-04-30_PiyushDatta_SP8192_DepthRecur_PolarNS_LoRATTT/runpod_8xh100_parameter_golf_setup.md

Command to run all 3 seeds (individual seed numbers can also be specified):

bash records/track_10min_16mb/2026-04-30_PiyushDatta_SP8192_DepthRecur_PolarNS_LoRATTT/run_final_submission.sh --nproc 8

I ran the reproduction steps myself.

Log Summary

The staged seed logs do not contain a completed quantized_ttt_phased result.
All three runs were interrupted during TTT compile/eval, so the last completed
validation metric is quantized_sliding_window val_bpb.

Final Metrics

| Seed | Train stop step | Last scheduled val step | Final completed val_bpb | Artifact size (bytes) |
|------|-----------------|-------------------------|-------------------------|-----------------------|
| 42   | 8597            | 8597                    | 1.08934733              | 15,999,684            |
| 314  | 8631            | 8631                    | 1.09035192              | 15,997,730            |
| 999  | 8620            | 8620                    | 1.09285937              | 15,998,747            |

Mean

  • quantized_sliding_window val_bpb mean: 1.09085287

Source Logs

  • logs/seed_42.log
  • logs/seed_314.log
  • logs/seed_999.log

Log summary

Full summary: records/track_10min_16mb/2026-04-30_PiyushDatta_SP8192_DepthRecur_PolarNS_LoRATTT/logs_summary.md

Feature Uniqueness Analysis

I analyzed all 2,015 PRs (open, closed, and merged) on openai/parameter-golf as of 2026-05-01.

Unique (I believe no one else has tried these)

  1. Multi-Trajectory SWA — Each GPU rank follows an independent trajectory during warmdown (gradient sync off), then the per-rank SWA averages are combined across ranks.
  2. Scale Tuning Post-GPTQ — Freeze the quantized integer weights and fine-tune only the per-row scales via CE-loss backprop (Adam, 20 steps); see the sketch after this list.
  3. Two-Pass GPTQ — Run GPTQ, dequantize, re-collect Hessians on the quantized model, then run GPTQ again.
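
For concreteness, here is a minimal sketch of the scale-tuning idea from item 2. This is not the submission's code: `QuantLinear`, `tune_scales`, and `calib_batches` are illustrative names, and the sketch assumes per-output-row quantization (W ≈ scale[:, None] * W_int) with dequantization done inside forward so gradients reach only the scales.

```python
# Hypothetical sketch of post-GPTQ per-row scale tuning; not the submission's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuantLinear(nn.Module):
    """Frozen int weights, trainable per-output-row scales (illustrative layer)."""

    def __init__(self, w_int: torch.Tensor, scale: torch.Tensor):
        super().__init__()
        self.register_buffer("w_int", w_int)   # frozen integer weights
        self.scale = nn.Parameter(scale)       # per-output-row scale

    def forward(self, x):
        w = self.scale.unsqueeze(1) * self.w_int.float()  # dequantize on the fly
        return F.linear(x, w)


def tune_scales(model: nn.Module, calib_batches, steps: int = 20, lr: float = 1e-3):
    # Collect the per-row scales, freeze everything else, and train only the
    # scales with a CE loss (Adam, `steps` updates), as described above.
    scales = [m.scale for m in model.modules() if isinstance(m, QuantLinear)]
    for p in model.parameters():
        p.requires_grad_(False)
    for s in scales:
        s.requires_grad_(True)
    opt = torch.optim.Adam(scales, lr=lr)
    batches = iter(calib_batches)  # assumed to yield at least `steps` (input, target) pairs
    for _ in range(steps):
        inputs, targets = next(batches)
        logits = model(inputs)                     # (B, T, vocab)
        loss = F.cross_entropy(logits.flatten(0, -2), targets.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
```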

Partially Unique (our variant may be novel, but the concept has been explored)

  1. Selective 2:4 Sparsity (training-time) — Mid-training one-shot 2:4 pruning on MLP weights; see the sketch below. PR Non-Record: CAT, Sparsity (Structured and Hessian-Guided), MoE, KAN Negative Results #1537 tried post-training 2:4 (negative). PR Non-record: Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT (3-seed, 8xH100) #1818 tried it as a compression codec (catastrophic).
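
For illustration, a minimal sketch of a one-shot 2:4 magnitude-pruning step like the one described above. The function name `prune_2_to_4`, the grouping along the input dimension, and the example shapes are assumptions for this example, not the submission's actual implementation.

```python
# Rough sketch of one-shot 2:4 structured pruning on an MLP weight matrix.
# Assumption: groups of 4 along the input dimension, keep the 2 largest magnitudes.
import torch


def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity needs the pruned dim divisible by 4"
    groups = weight.view(out_features, in_features // 4, 4)
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    drop = groups.abs().topk(2, dim=-1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop, 0.0)
    return (groups * mask).view(out_features, in_features)


# Example: prune a 4x-MLP projection once mid-training (hypothetical shapes).
w = torch.randn(4 * 768, 768)
w_sparse = prune_2_to_4(w)
assert (w_sparse.view(-1, 4) != 0).sum(dim=-1).max() <= 2  # at most 2 nonzeros per group
```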

Not Unique (tried by others)

  1. Mixture of Softmax — PRs Non-record: Mixture of Softmax K=2 R=64 (1xH100, 10min, 1.3932 bpb) #266, 5 novel architecture ablations on SOTA baseline #584, Non-record: Higher-Rank Output Heads — Standard Tied Head Wins on a Frontier 11L Baseline #908, Non-record: 28 Experiments in 5 Days — What Works, What Fails, and Why Small-Scale Tests Lie #1227, Non-record: 1x H100 SXM5 Explorations #1608, Record: Parcae px43 embed7 clip1300 (val_bpb = 1.0878) #1995. All neutral-to-harmful.
  2. Hourglass Downsampling — PRs Add MLX heavy-share research harness for Parameter Golf #133, Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization #831, Non-record: Asymmetric 1/10 Split — 1.1492 pre-quant BPB on 8xH100 (one-line change) #1275, Record: 12L RecycledCore Int5 — val_bpb 1.1464 (seed 1337) #1573, MirrorLoop HRC + LexLoRE non-record submission #2004. PR Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization #831 called it "catastrophic."
  3. Loop Gate — PRs Record: sliding eval, FP16 tied embeddings, 10 layers, Muon WD 0.02, overtone init, and phase-transition residual mixing. (val_bpb 1.1876) #155, Nightcrawler — 1.176bpb 10mb  #1208, Submission/fastattn mtp dr #1691, Non-record: Mixed-Temperature Self-Generated GPTQ Calibration on V6 #1996. Well-explored by 4-5 teams.
  4. Gated MLP / SwiGLU — 20+ PRs, 2 merged. Most widely tried feature in competition.
  5. Knowledge Distillation — PRs GPTQ + Early QAT + Legal TTT — 3-seed mean val_bpb 1.1215 #578, Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745) #687, [Non-Record] JEPA Self-Distillation with EMA Target Encoder for Autoregressive LM | Controlled A/B Shows No Gain Over Vanilla CE (val_bpb: 1.19) #896, Non-record: Knowledge Distillation - A Negative Result (val_bpb=1.1529) #1029, Non-record: knowledge distillation teacher-student submission #1034, Bandit: ClownCar Crawler x Cubric Ngram9 — 0.4961 BPB, 9.9mb #1083, [10min_16mb] 0.9641 BPB: LeakyReLU² + Score-First TTT + N-gram Backoff Cache #1185, Add non-record SP8192 pass-gated recurrence submission #1697. All negative.
  6. Hard Token Mining / Focal Loss — PRs Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745) #687, Non-record: sin² activation + causal screening pipeline #877, Non-record: Focal Loss (gamma=2.0) — val_bpb=1.1460 #1233, Record Submission: Poly5 Softcap + Z-Loss + YaRN + Zstd-22 + Stride-16 (on PR #549 stack) #1325, Non-record: Gaussian per-token loss reweighting — what goes wrong and why (+0.014 bpb) #1360, Non-record: Cosine LR Schedule — -0.070 BPB improvement + Focal Loss Investigation (corrected) #1380, Verifily Data Quality: Three-Tier Token Weighting + DCLS Salience #1402, ANS weight compression: 1.6 MB (13.9%) lossless savings over LZMA #1510, Add non-record SP8192 tempered BPB-weighted loss submission #1702. All negative.
  7. Byte-Weighted CE — PRs Record: 11L MLP3x + SmearGate + Error Correction Table #108, Record: 0.4311 BPB - Complementary Training + Backoff N-gram Mixer + TTT #1033, Record: 0.4188 BPB mixed quant ngram #1359, BPB-weighted training loss: align training objective with eval metric #1519. None merged.
  8. Momentum Cooldown — PRs Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804) #534, [Non-Record] LegendreGPT: Legendre polynomial depth parameterization #1337. Neither merged.

Shared (in other merged submissions)

  1. Fused Softcapped CE (Triton) — PR Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335 #1787
  2. Batch Size Schedule — Ternary PR Record: Scylla + Full GPTQ + XSA-all + FA3 — val_bpb 0.9485 (3-seed mean) #1184
  3. Auxiliary CE / Deep Supervision — Ternary PR Record: Scylla + Full GPTQ + XSA-all + FA3 — val_bpb 0.9485 (3-seed mean) #1184
  4. Phased LoRA TTT + Global SGD — PRs Record: Varlen attention + fused MLP + doc-independent TTT (1.07336) #1530, Record: VarLenAttn + PhasingTTT - val_bpb 1.0728 (3-seed mean) #1610, Record: VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT — val_bpb 1.07193 (3-seed mean) #1626
  5. LQER Asymmetric Quantization — PR Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (indirect 3 seed mean) #1851
  6. Value Residual Mixing — msisovic (2026-03-31), SOTA (2026-04-09)
  7. Warmup State Reset — msisovic (2026-03-31)

@PiyushDatta PiyushDatta changed the title Record submission: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb <> Record submission: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.0909 (3-seed mean) May 1, 2026
@PiyushDatta PiyushDatta changed the title Record submission: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.0909 (3-seed mean) Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.0909 (3-seed mean) May 1, 2026
@PiyushDatta PiyushDatta changed the title Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.0909 (3-seed mean) Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.090 (3-seed mean) - PiyushDatta May 1, 2026
@PiyushDatta PiyushDatta force-pushed the submission-piyushdatta branch from 88f4241 to e22b72c on May 1, 2026 07:20
11-layer GPT with SP8192, MLP 4x, depth recurrence (layers 3-5 looped),
parallel residuals, Polar Express Newton-Schulz optimizer, SDClip GPTQ
(int6 + int8 embed), brotli compression, SWA, and phased LoRA TTT.

3-seed quantized_sliding_window val_bpb mean: 1.09085

PR: openai#2106
@PiyushDatta PiyushDatta force-pushed the submission-piyushdatta branch from f600ccf to 91ca41c on May 1, 2026 07:32
Piyush Datta added 2 commits May 1, 2026 00:35
- Fill in val_bpb 1.09085 (3-seed mean) and per-seed results
- Fix README reproduction paths to match actual directory name
- Add pytorch_version 2.7
PiyushDatta added a commit to PiyushDatta/parameter-golf-fork that referenced this pull request May 1, 2026