Record Submission: Poly5 Softcap + Z-Loss + YaRN + Zstd-22 + Stride-16 (on PR #549 stack) by monisha-max · Pull Request #1325 · openai/parameter-golf

monisha-max · 2026-04-04T06:14:58Z

Summary

Six orthogonal improvements on the current SOTA (PR #549, 1.1194 BPB):

Polynomial degree-5 softcap replacing tanh: sharper gradients, proven in ternary PR #640 (Record Submission: 1.1570 BPB - 73.7M Ternary U-Net + NeoMuon + 4x relu²MLP + Factored Tied Emb + Poly5 Softcap + YaRN2048 + 8192BPE + FP8QAT + Bitmask-LZMA + Stride-16 Sliding)
Z-loss regularization (1e-4 * logsumexp²): anchors logits near zero through quantization
YaRN positional encoding: better frequency interpolation for seq_len=2048
zstd-22 compression: 7.0 MB artifact vs ~16 MB with LZMA-6 (massive headroom)
Sliding eval stride=16 (was 64): 4x more context overlap per token
FA3/FA2/SDPA fallback: enables testing on non-Hopper GPUs
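
The Z-loss item in the list above can be sketched in plain Python. This is a minimal illustration, assuming the penalty is added to the usual cross-entropy and that logsumexp is taken over the vocabulary logits of a single token; the function names are hypothetical and are not taken from train_gpt.py.

```python
import math

def logsumexp(logits):
    """Numerically stable log(sum(exp(x))) over one token's logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def cross_entropy_with_z_loss(logits, target, z_weight=1e-4):
    """Return (total, cross_entropy, z_penalty) for one token.

    z_weight=1e-4 matches the coefficient quoted in the summary;
    everything else here is an illustrative assumption.
    """
    lse = logsumexp(logits)
    ce = lse - logits[target]      # standard softmax cross-entropy
    z = z_weight * lse ** 2        # Z-loss: penalize a large log-partition
    return ce + z, ce, z
```

Adding the same constant to every logit leaves the cross-entropy unchanged but changes logsumexp, so the penalty is what pulls the overall logit scale back toward zero.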

Conservative estimated gain: -0.007 to -0.014 BPB → target ~1.105-1.112 BPB

Smoke Test (1xH100, Modal)

| Metric | Value |
| --- | --- |
| Steps | 890 / 9000 (single GPU, wallclock-limited) |
| val_bpb @ step 890 | 1.3868 (expected for ~10% of training) |
| Loss curve | 6.93 → 2.34 (healthy convergence) |
| Artifact (zstd-22) | 7.0 MB |
| All features | Verified working |

Status

Smoke-tested on 1xH100 via Modal. Awaiting 8xH100 SXM verification; I currently have no RunPod/8xH100 access. Happy to collaborate with anyone who can run the 3-seed verification.

All techniques are individually proven in merged PRs (#549, #640, #414). This submission combines them for the first time.

Run Command

SEED=1337 SOFTCAP_TYPE=poly Z_LOSS_WEIGHT=1e-4 ROPE_TYPE=yarn \
YARN_MAX_LEN=2048 EVAL_STRIDE=16 TTT_ENABLED=1 TTT_FREEZE_BLOCKS=0 \
BIGRAM_VOCAB_SIZE=1536 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

Built on the work of @abaybektursun (PR #549, #399), @signalrush (PR #414), @CiprianFlorin-Ifrim (PR #640), @Christopher-Lee-McClendon (PR #461), @parinzee (PR #493), @jfprincz (PR #315, #287), @unnir (PR #265).

Test plan

Verify 3-seed training on 8xH100 SXM (seeds: 42, 1337, 2025)
Confirm all artifacts under 16 MB
Confirm training completes within 600s
Report sliding window BPB (stride=16) + TTT BPB
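
To make the stride setting in the test plan concrete, here is a small sketch of how a sliding-window evaluation can be laid out. It assumes each token is scored exactly once and that every window after the first scores only its last `stride` tokens; the helper name and the exact scheme are assumptions, not the evaluation code in train_gpt.py.

```python
def sliding_windows(n_tokens, seq_len, stride):
    """Return (start, end, first_scored) triples covering n_tokens.

    Each window feeds tokens[start:end] to the model, but only
    tokens[first_scored:end] contribute to the reported loss.
    """
    windows = []
    scored_upto = 0   # every token before this index is already scored
    start = 0
    while scored_upto < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows

# Compare the two strides on a hypothetical 4096-token validation slice.
w64 = sliding_windows(4096, seq_len=2048, stride=64)
w16 = sliding_windows(4096, seq_len=2048, stride=16)
```

Under this layout, stride 16 needs four times as many windows beyond the first as stride 64, and each newly scored token sees at least seq_len - stride tokens of preceding context (2032 instead of 1984).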

…6 (on PR openai#549 stack)

Six orthogonal improvements on the current SOTA (PR openai#549, 1.1194 BPB):

- Polynomial degree-5 softcap replacing tanh (from ternary PR openai#640)
- Z-loss regularization (1e-4 * logsumexp^2) for sharper gradients
- YaRN positional encoding for better long-context handling
- zstd-22 compression (7MB artifact vs 16MB with LZMA-6)
- Sliding eval stride=16 (4x more context overlap)
- FlashAttention 3/2/SDPA graceful fallback

Smoke-tested on 1xH100 (Modal): 890 steps, healthy convergence. Awaiting 8xH100 SXM verification for official scoring.
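
The FlashAttention 3/2/SDPA fallback mentioned above can be pictured as a preference-ordered probe. The sketch below is an assumption about how such a fallback might be selected, not the PR's code: the module names `flash_attn_interface` (FlashAttention-3) and `flash_attn` (FlashAttention-2) are the names those packages are commonly installed under, and the function names are hypothetical.

```python
import importlib.util

# Candidate backends in order of preference. Module names are assumptions
# (see the paragraph above); "sdpa" stands for PyTorch's built-in
# scaled_dot_product_attention and needs no extra package.
_CANDIDATES = (("fa3", "flash_attn_interface"), ("fa2", "flash_attn"))

def _is_importable(module_name):
    """True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None

def pick_attention_backend(is_importable=_is_importable):
    """Return the first available backend name, falling back to 'sdpa'."""
    for backend, module_name in _CANDIDATES:
        if is_importable(module_name):
            return backend
    return "sdpa"
```

This only checks that a package is importable. A real fallback would also need to confirm that the GPU supports the chosen kernel, since a FlashAttention-3 install can be present on hardware that cannot run it.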

1. Adaptive Focal Cross-Entropy Loss
   - Dynamically upweights hard tokens using model confidence
   - Focuses limited training budget on informative tokens
   - (1-p_t)^gamma weighting, normalized to preserve LR scale
2. Residual Vector Quantization (RVQ)
   - Two-pass: int6 base + int4 residual = ~10-bit effective precision
   - Exploits 9MB artifact headroom from zstd-22 compression
   - First application of RVQ to LLM weight compression in this challenge
3. Progressive Depth Warmup
   - Train layers bottom-up in 3 stages (bottom 1/3 -> 2/3 -> all)
   - Zero gradients for frozen layer banks after backward
   - Novel application of gradual unfreezing to from-scratch training
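
The two-pass quantization in item 2 can be illustrated with a small pure-Python sketch. It assumes symmetric per-tensor scales and round-to-nearest for both passes; the commit message does not specify these details, and all function names are hypothetical.

```python
def quantize(values, bits):
    """Symmetric quantization to signed integer codes in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0   # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def rvq_encode(weights, base_bits=6, residual_bits=4):
    """Pass 1: int6 base. Pass 2: int4 codes for what the base missed."""
    base_codes, base_scale = quantize(weights, base_bits)
    base = dequantize(base_codes, base_scale)
    residual = [w - b for w, b in zip(weights, base)]
    res_codes, res_scale = quantize(residual, residual_bits)
    return (base_codes, base_scale), (res_codes, res_scale)

def rvq_decode(base, residual):
    """Reconstruct weights as base + residual."""
    return [b + r for b, r in zip(dequantize(*base), dequantize(*residual))]
```

With these assumptions the residual stays within half a base step, so a 15-level residual grid refines the base step by a factor of up to about 14, which is roughly 3.8 extra bits and in line with the quoted ~10-bit effective precision.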

Apple added 2 commits April 4, 2026 11:40

PiyushDatta mentioned this pull request May 1, 2026

Record: SP8192+DepthRec+Half batch SWA+Polar NS+Phased LoRa TTT - val_bpb 1.089 (best), val_bpb 1.090 (3-seed mean) - PiyushDatta #2106

Open

monisha-max wants to merge 2 commits into openai:main from monisha-max:submission/poly-softcap-zloss-yarn-zstd22-stride16
