
Record: SP8192 PR #1874 + Optimized Hyperparameters — val_bpb 1.06844 (3-seed mean)#1926

Open
bigbag wants to merge 2 commits into openai:main from bigbag:submission/pr1874-chunk32

Conversation


bigbag commented Apr 29, 2026

Summary

val_bpb = 1.06844 (3-seed mean, std 0.00058) | val_loss = 2.75955 nats/token | artifact ~15.95 MB | 8×H100 SXM, 600s train / 600s eval

Runs the code of PR #1874 (AjAnubolu) verbatim, with optimized hyperparameters supplied via environment variables. No code modifications: all improvements come from properly activating PR #1874's intended settings.

Key env vars: MIN_LR=0.10 (prevents the LR from collapsing to zero during warmdown), QK_GAIN_INIT=5.25, GATE_ATTN_WIDTH=24, GPTQ_RESERVE_SECONDS=0.5 (maximizes training steps), and VAL_LOSS_EVERY=0 (eliminates mid-train eval overhead); a sketch of how such overrides are read is shown below.
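For orientation, here is a minimal sketch of how environment-variable overrides like these are typically read in a train_gpt.py-style script. The variable names match this PR; among the fallback defaults, only GPTQ_RESERVE_SECONDS=4.0, GATE_ATTN_WIDTH=12 (half the doubled width), and MIN_LR=0.0 follow from the text above, and the rest are assumptions:

```python
import os

# Hyperparameter overrides read from the environment (sketch, not the PR's code).
MIN_LR = float(os.environ.get("MIN_LR", "0.0"))                # 0.0 = no floor (implied default)
QK_GAIN_INIT = float(os.environ.get("QK_GAIN_INIT", "1.0"))    # default assumed
GATE_ATTN_WIDTH = int(os.environ.get("GATE_ATTN_WIDTH", "12")) # 24 doubles this
GPTQ_RESERVE_SECONDS = float(os.environ.get("GPTQ_RESERVE_SECONDS", "4.0"))
VAL_LOSS_EVERY = int(os.environ.get("VAL_LOSS_EVERY", "125"))  # default assumed; 0 disables mid-train eval
```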

No CaseOps, no casefold, no PPM — standard SP8192 UTF-8 byte counting throughout.
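Bits-per-byte here is the validation loss in nats converted to bits and normalized by UTF-8 byte counts. A quick self-consistency check on the headline numbers (the bytes-per-token figure is implied by the two reported values, not reported itself):

```python
import math

# val_bpb = val_loss [nats/token] / ln(2) / (UTF-8 bytes per token)
val_loss_nats = 2.75955
val_bpb = 1.06844
bits_per_token = val_loss_nats / math.log(2)   # ~3.98 bits/token
bytes_per_token = bits_per_token / val_bpb     # ~3.73 UTF-8 bytes/token (implied)
print(f"implied bytes per token: {bytes_per_token:.3f}")
```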

3-Seed Results (8×H100 80GB SXM, phased TTT, 10-min train / 10-min eval budgets)

| Seed | Steps | Pre-quant (post-EMA) | Quantized | Post-TTT | Artifact (bytes) | Train time | Eval time |
|------|-------|----------------------|-----------|----------|------------------|------------|-----------|
| 1337 | 4834  | 1.06960              | 1.07925   | 1.06798  | 15,950,405       | 599.64s    | 409.6s    |
| 42   | 4844  | 1.06984              | 1.07948   | 1.06824  | 15,952,215       | 599.65s    | 421.0s    |
| 2025 | 4837  | 1.07060              | 1.08028   | 1.06909  | 15,948,755       | 599.56s    | 381.2s    |
| Mean |       | 1.07001              | 1.07967   | 1.06844  | 15,950,458       | 599.62s    | 403.9s    |
| Std  |       | 0.00053              | 0.00053   | 0.00058  | 1,730            | 0.05s      | 20.2s     |

All three seeds clear the size (< 16,000,000 bytes), train-time (< 600s), and eval-time (< 600s) budgets.

Key Techniques (all from PR #1874, activated via env vars)

  1. LQER Asymmetric Rank-4 (PR Record: PR #1787 base + Smear Gate + LQER Asym — val_bpb 1.06157 #1797) — SVD-based quantization error reduction (sketched after this list)
  2. SmearGate + Attention Output Gate (width 24) (PR Record: SP8192 + SmearGate + AttnOutGate(w24) + LoRA-TTT Improvements + Phased TTT — val_bpb 1.06991 (3-seed mean) #1790) — per-layer smoothing + attention gating
  3. Polar Express Newton-Schulz (PR Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed) #1344) — minimax-tuned Muon coefficients (sketched after this list)
  4. MIN_LR=0.10 (PR Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335 #1787) — warmdown LR floor preventing collapse to zero (sketched after this list)
  5. QK_GAIN_INIT=5.25 — per-head query-key attention scaling (sketched after this list)
  6. GATE_ATTN_WIDTH=24 — doubled attention gate capacity
  7. GPTQ_RESERVE_SECONDS=0.5 — maximizes training steps (+28 steps vs default 4.0)
  8. VAL_LOSS_EVERY=0 — eliminates mid-training eval overhead (+112 steps)
  9. Phased Score-First LoRA-TTT — 3-phase AdamW (rank 128)
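
For item 1, a minimal sketch of the LQER idea at rank 4: quantize the weight, then repair the quantization residual with a truncated-SVD low-rank correction. The quantize callable is a placeholder, and the asymmetric weighting used in the actual PR is omitted:

```python
import torch

def lqer_low_rank_correction(W: torch.Tensor, quantize, rank: int = 4):
    """Quantize W, then fit a rank-`rank` correction to the residual via SVD."""
    Q = quantize(W)                        # placeholder for the PR's quantizer
    E = (W - Q).float()                    # quantization error to be repaired
    U, S, Vh = torch.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]             # (out, rank), singular values folded in
    B = Vh[:rank, :]                       # (rank, in)
    # Deployed forward pass: x @ Q.T + (x @ B.T) @ A.T  ~  x @ W.T
    return Q, A, B
```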
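For item 3, the shape of the Newton-Schulz orthogonalization Muon applies to gradient updates. The fixed quintic coefficients below are the standard Muon triple; Polar Express replaces them with minimax-tuned per-iteration coefficients, which are not reproduced here:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the nearest (semi-)orthogonal matrix to G, Muon-style."""
    a, b, c = 3.4445, -4.7750, 2.0315      # standard Muon quintic coefficients
    X = G.bfloat16()
    transpose = X.size(-2) > X.size(-1)
    if transpose:                           # iterate on the wide orientation
        X = X.mT
    X = X / (X.norm() + 1e-7)               # bring the spectral norm near <= 1
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X                   # quintic polynomial iteration in X
    if transpose:
        X = X.mT
    return X
```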
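For item 4, a sketch of what the warmdown floor changes. The constant-then-linear-warmdown schedule shape is an assumption based on typical speedrun schedules, not taken from the PR code:

```python
def lr_multiplier(step: int, total_steps: int,
                  warmdown_frac: float = 0.4, min_lr: float = 0.10) -> float:
    """Constant LR, then a linear warmdown that ends at `min_lr` instead of 0."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    remaining = (total_steps - step) / max(total_steps - warmdown_start, 1)
    return min_lr + (1.0 - min_lr) * remaining   # floor at min_lr, not zero
```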
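For item 5, one plausible reading of per-head query-key scaling: a learnable per-head gain, initialized to QK_GAIN_INIT, multiplied into the pre-softmax attention logits. Whether the PR applies the gain to the logits or to the normalized q/k vectors themselves is not specified here, so treat this as illustrative:

```python
import torch
from torch import nn

class QKGain(nn.Module):
    """Learnable per-head multiplicative gain on attention logits (sketch)."""
    def __init__(self, n_heads: int, init: float = 5.25):
        super().__init__()
        self.gain = nn.Parameter(torch.full((n_heads, 1, 1), init))

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # logits: (batch, n_heads, seq_q, seq_k) pre-softmax attention scores
        return logits * self.gain
```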

How to reproduce

SEED=1337 MIN_LR=0.10 QK_GAIN_INIT=5.25 GATE_ATTN_WIDTH=24 \
  GPTQ_RESERVE_SECONDS=0.5 VAL_LOSS_EVERY=0 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Rule Compliance

  • Score-first phased TTT (no re-scoring)
  • No pre-quant TTT on validation data
  • No n-gram cache, no PPM
  • No CaseOps, no casefold
  • Artifact ≤ 16,000,000 bytes (max 15,952,215)
  • Train ≤ 600s (max 599.65s), eval ≤ 600s (max 421.0s)

Attribution

Built on PR #1874 (AjAnubolu), PR #1790 (miaoyuxun), PR #1344 (Polar Express), PR #1787 (nprime06), PR #1797 (dexhunter).

Test plan

  • Reproduce any seed with env vars from "How to reproduce" section
  • Verify artifact < 16,000,000 bytes in each seed log
  • Verify eval_time < 600s in each seed log
  • Verify score-first TTT ordering in code

🤖 Generated with Claude Code

Pavel Liashkov and others added 2 commits April 29, 2026 15:39
  • …3-seed mean)
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  • …1.06990)
    Added MIN_LR=0.10, QK_GAIN_INIT=5.25, GATE_ATTN_WIDTH=24,
    GPTQ_RESERVE_SECONDS=0.5, VAL_LOSS_EVERY=0. Same code, env vars only.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bigbag commented Apr 29, 2026

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) for making this work possible.

