
Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.090 (3-seed mean) - PiyushDatta#2106

Open
PiyushDatta wants to merge 3 commits into openai:main from PiyushDatta:submission-piyushdatta

Conversation


@PiyushDatta PiyushDatta commented May 1, 2026

Overview

Author: @PiyushDatta
Submission date: April 30, 2026

This probably does not beat the current number 1 spot, but I wanted to share it regardless. The code is not thoroughly reviewed (sorry!), but it runs on 4xA100 and 8xH100 SXM. You should (most likely and hopefully) be able to reproduce the results, though the code may not be 100% correct or reviewed (apologies).

Dependencies: sentencepiece, brotli

To reproduce, read
records/track_10min_16mb/2026-04-30_PiyushDatta_SP8192_DepthRecur_PolarNS_LoRATTT/runpod_8xh100_parameter_golf_setup.md

Command to run all 3 seeds (individual seed numbers can also be specified):

bash records/track_10min_16mb/2026-04-30_PiyushDatta_SP8192_DepthRecur_PolarNS_LoRATTT/run_final_submission.sh --nproc 8

I ran the reproduction steps myself.

Log Summary

The staged seed logs do not contain a completed quantized_ttt_phased result.
All three runs were interrupted during TTT compile/eval, so the last completed
validation metric is quantized_sliding_window val_bpb.

Final Metrics

| Seed | Train stop step | Last scheduled val step | Final completed val_bpb | Artifact size (bytes) |
|------|-----------------|-------------------------|-------------------------|-----------------------|
| 42   | 8597            | 8597                    | 1.08934733              | 15,999,684            |
| 314  | 8631            | 8631                    | 1.09035192              | 15,997,730            |
| 999  | 8620            | 8620                    | 1.09285937              | 15,998,747            |

Mean

  • quantized_sliding_window val_bpb mean: 1.09085287

Source Logs

  • logs/seed_42.log
  • logs/seed_314.log
  • logs/seed_999.log

Log summary

Full summary: records/track_10min_16mb/2026-04-30_PiyushDatta_SP8192_DepthRecur_PolarNS_LoRATTT/logs_summary.md

Feature Uniqueness Analysis

I analyzed all 2,015 PRs (open, closed, and merged) on openai/parameter-golf as of 2026-05-01.

Unique (I believe no one else has tried these)

  1. Multi-Trajectory SWA — Each GPU rank follows an independent trajectory during warmdown (gradient sync off), then the per-rank SWA averages are combined across ranks.
  2. Scale Tuning Post-GPTQ — Freeze the quantized integer weights and fine-tune only the per-row scales via CE-loss backprop (Adam, 20 steps); see the sketch after this list.
  3. Two-Pass GPTQ — Run GPTQ, dequantize, re-collect Hessians on the quantized model, then run GPTQ again.
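
For concreteness, here is a minimal sketch of the scale-tuning idea from item 2. This is not the submission's code: `QuantLinear`, `tune_scales`, and `calib_batches` are illustrative names, and the sketch assumes per-output-row quantization (W ≈ scale[:, None] * W_int) with dequantization done inside forward so gradients reach only the scales.

```python
# Hypothetical sketch of post-GPTQ per-row scale tuning; not the submission's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuantLinear(nn.Module):
    """Frozen int weights, trainable per-output-row scales (illustrative layer)."""

    def __init__(self, w_int: torch.Tensor, scale: torch.Tensor):
        super().__init__()
        self.register_buffer("w_int", w_int)   # frozen integer weights
        self.scale = nn.Parameter(scale)       # per-output-row scale

    def forward(self, x):
        w = self.scale.unsqueeze(1) * self.w_int.float()  # dequantize on the fly
        return F.linear(x, w)


def tune_scales(model: nn.Module, calib_batches, steps: int = 20, lr: float = 1e-3):
    # Collect the per-row scales, freeze everything else, and train only the
    # scales with a CE loss (Adam, `steps` updates), as described above.
    scales = [m.scale for m in model.modules() if isinstance(m, QuantLinear)]
    for p in model.parameters():
        p.requires_grad_(False)
    for s in scales:
        s.requires_grad_(True)
    opt = torch.optim.Adam(scales, lr=lr)
    batches = iter(calib_batches)  # assumed to yield at least `steps` (input, target) pairs
    for _ in range(steps):
        inputs, targets = next(batches)
        logits = model(inputs)                     # (B, T, vocab)
        loss = F.cross_entropy(logits.flatten(0, -2), targets.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
```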

Partially Unique (our variant may be novel, but the concept has been explored)

  1. Selective 2:4 Sparsity (training-time) — Mid-training one-shot 2:4 pruning on MLP weights; see the sketch below. PR Non-Record: CAT, Sparsity (Structured and Hessian-Guided), MoE, KAN Negative Results #1537 tried post-training 2:4 (negative). PR Non-record: Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT (3-seed, 8xH100) #1818 tried it as a compression codec (catastrophic).
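
For illustration, a minimal sketch of a one-shot 2:4 magnitude-pruning step like the one described above. The function name `prune_2_to_4`, the grouping along the input dimension, and the example shapes are assumptions for this example, not the submission's actual implementation.

```python
# Rough sketch of one-shot 2:4 structured pruning on an MLP weight matrix.
# Assumption: groups of 4 along the input dimension, keep the 2 largest magnitudes.
import torch


def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity needs the pruned dim divisible by 4"
    groups = weight.view(out_features, in_features // 4, 4)
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    drop = groups.abs().topk(2, dim=-1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop, 0.0)
    return (groups * mask).view(out_features, in_features)


# Example: prune a 4x-MLP projection once mid-training (hypothetical shapes).
w = torch.randn(4 * 768, 768)
w_sparse = prune_2_to_4(w)
assert (w_sparse.view(-1, 4) != 0).sum(dim=-1).max() <= 2  # at most 2 nonzeros per group
```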

Not Unique (tried by others)

  1. Mixture of Softmax — PRs Non-record: Mixture of Softmax K=2 R=64 (1xH100, 10min, 1.3932 bpb) #266, 5 novel architecture ablations on SOTA baseline #584, Non-record: Higher-Rank Output Heads — Standard Tied Head Wins on a Frontier 11L Baseline #908, Non-record: 28 Experiments in 5 Days — What Works, What Fails, and Why Small-Scale Tests Lie #1227, Non-record: 1x H100 SXM5 Explorations #1608, Record: Parcae px43 embed7 clip1300 (val_bpb = 1.0878) #1995. All neutral-to-harmful.
  2. Hourglass Downsampling — PRs Add MLX heavy-share research harness for Parameter Golf #133, Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization #831, Non-record: Asymmetric 1/10 Split — 1.1492 pre-quant BPB on 8xH100 (one-line change) #1275, Record: 12L RecycledCore Int5 — val_bpb 1.1464 (seed 1337) #1573, MirrorLoop HRC + LexLoRE non-record submission #2004. PR Research: Why Novel Architectures Fail at 16MB — Throughput-Quantization Co-optimization #831 called it "catastrophic."
  3. Loop Gate — PRs Record: sliding eval, FP16 tied embeddings, 10 layers, Muon WD 0.02, overtone init, and phase-transition residual mixing. (val_bpb 1.1876) #155, Nightcrawler — 1.176bpb 10mb  #1208, Submission/fastattn mtp dr #1691, Non-record: Mixed-Temperature Self-Generated GPTQ Calibration on V6 #1996. Well-explored by 4-5 teams.
  4. Gated MLP / SwiGLU — 20+ PRs, 2 merged. Most widely tried feature in competition.
  5. Knowledge Distillation — PRs GPTQ + Early QAT + Legal TTT — 3-seed mean val_bpb 1.1215 #578, Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745) #687, [Non-Record] JEPA Self-Distillation with EMA Target Encoder for Autoregressive LM | Controlled A/B Shows No Gain Over Vanilla CE (val_bpb: 1.19) #896, Non-record: Knowledge Distillation - A Negative Result (val_bpb=1.1529) #1029, Non-record: knowledge distillation teacher-student submission #1034, Bandit: ClownCar Crawler x Cubric Ngram9 — 0.4961 BPB, 9.9mb #1083, [10min_16mb] 0.9641 BPB: LeakyReLU² + Score-First TTT + N-gram Backoff Cache #1185, Add non-record SP8192 pass-gated recurrence submission #1697. All negative.
  6. Hard Token Mining / Focal Loss — PRs Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745) #687, Non-record: sin² activation + causal screening pipeline #877, Non-record: Focal Loss (gamma=2.0) — val_bpb=1.1460 #1233, Record Submission: Poly5 Softcap + Z-Loss + YaRN + Zstd-22 + Stride-16 (on PR #549 stack) #1325, Non-record: Gaussian per-token loss reweighting — what goes wrong and why (+0.014 bpb) #1360, Non-record: Cosine LR Schedule — -0.070 BPB improvement + Focal Loss Investigation (corrected) #1380, Verifily Data Quality: Three-Tier Token Weighting + DCLS Salience #1402, ANS weight compression: 1.6 MB (13.9%) lossless savings over LZMA #1510, Add non-record SP8192 tempered BPB-weighted loss submission #1702. All negative.
  7. Byte-Weighted CE — PRs Record: 11L MLP3x + SmearGate + Error Correction Table #108, Record: 0.4311 BPB - Complementary Training + Backoff N-gram Mixer + TTT #1033, Record: 0.4188 BPB mixed quant ngram #1359, BPB-weighted training loss: align training objective with eval metric #1519. None merged.
  8. Momentum Cooldown — PRs Non-record: 11L Partial RoPE + XSA4 + VE128 + Tight SWA + GPTQ-lite (val_bpb=1.1804) #534, [Non-Record] LegendreGPT: Legendre polynomial depth parameterization #1337. Neither merged.

Shared (in other merged submissions)

  1. Fused Softcapped CE (Triton) — PR Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335 #1787
  2. Batch Size Schedule — Ternary PR Record: Scylla + Full GPTQ + XSA-all + FA3 — val_bpb 0.9485 (3-seed mean) #1184
  3. Auxiliary CE / Deep Supervision — Ternary PR Record: Scylla + Full GPTQ + XSA-all + FA3 — val_bpb 0.9485 (3-seed mean) #1184
  4. Phased LoRA TTT + Global SGD — PRs Record: Varlen attention + fused MLP + doc-independent TTT (1.07336) #1530, Record: VarLenAttn + PhasingTTT - val_bpb 1.0728 (3-seed mean) #1610, Record: VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT — val_bpb 1.07193 (3-seed mean) #1626
  5. LQER Asymmetric Quantization — PR Record: val_bpb = 1.06128 SmearGate BOS Fix + PR #1787 Base + Smear Gate + LQER Asymmetric + Phased TTT (indirect 3 seed mean) #1851
  6. Value Residual Mixing — msisovic (2026-03-31), SOTA (2026-04-09)
  7. Warmup State Reset — msisovic (2026-03-31)

@PiyushDatta PiyushDatta changed the title Record submission: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb <> Record submission: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.0909 (3-seed mean) May 1, 2026
@PiyushDatta PiyushDatta changed the title Record submission: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.0909 (3-seed mean) Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.0909 (3-seed mean) May 1, 2026
@PiyushDatta PiyushDatta changed the title Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.0909 (3-seed mean) Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.090 (3-seed mean) - PiyushDatta May 1, 2026
@PiyushDatta PiyushDatta force-pushed the submission-piyushdatta branch from 88f4241 to e22b72c on May 1, 2026 07:20
11-layer GPT with SP8192, MLP 4x, depth recurrence (layers 3-5 looped),
parallel residuals, Polar Express Newton-Schulz optimizer, SDClip GPTQ
(int6 + int8 embed), brotli compression, SWA, and phased LoRA TTT.

3-seed quantized_sliding_window val_bpb mean: 1.09085

PR: openai#2106
@PiyushDatta PiyushDatta force-pushed the submission-piyushdatta branch from f600ccf to 91ca41c on May 1, 2026 07:32
Piyush Datta added 2 commits May 1, 2026 00:35
- Fill in val_bpb 1.09085 (3-seed mean) and per-seed results
- Fix README reproduction paths to match actual directory name
- Add pytorch_version 2.7
PiyushDatta added a commit to PiyushDatta/parameter-golf-fork that referenced this pull request May 1, 2026