Record: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938 (val_bpb = 1.06242) #1948
Open
TimS-ml wants to merge 5 commits into openai:main from
Conversation
…enai#1938 (val_bpb=1.06242)

3-seed sweep on seeds 1334, 42, 999 of the v2b training script from PR openai#1867, extending Billy Li's PR openai#1938 stack with two algorithmically free wins:
- LeakyReLU squared slope 0.5 -> 0.3 (Stage 4 ablation: -0.00073 BPB, size-neutral, wallclock-neutral; 4-point sweep confirms 0.3 is the minimum).
- GPTQ Hinv path: cholesky_inverse + chol(upper) -> reverse Cholesky + triangular solve (Stage 7 ablation: mathematically equivalent within fp32 ULP, 2.07-2.24x faster on RTX 4090 cuSOLVER microbench at the GPTQ workload range n=512..4096).

Plus compliance-tuned defaults baked into train_gpt.py's Hyperparameters: LQER_TOP_K=1, GATED_ATTN_QUANT_GATE=1, TTT_BATCH_SIZE=16, PHASED_TTT_NUM_PHASES=3, GPTQ_RESERVE_SECONDS=16.

Result: val_bpb (3-seed mean) = 1.06242, sigma ~ 0.00013, ~15.95 MB artifact. Delta vs current SOTA (PR openai#1493, 1.0810): -0.0186 BPB, well past the 0.005-nat significance threshold.

Joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16). Compute sponsored by Prof. Hao Lin (Fordham University). Concurrent ablation infrastructure by Hang Zhou (@greyjoeyzhou).
aerosta pushed a commit to aerosta/parameter-golf that referenced this pull request on Apr 30, 2026:
3-seed mean val_bpb = 1.05851479 (std 0.000762, seeds 42/0/1234) on track_10min_16mb.

Stack:
- PR openai#1945 (alertcat) V21 base = PR openai#1908 + AWQ-Lite + AsymLogit Rescale
- PR openai#1953 (andrewbaggio1) TTT/QK env knobs (TTT_LR=0.75, QK_GAIN=5.25, no_qv mask)
- PR openai#1948 (TimS-ml + lijuncheng16) LeakyReLU squared slope 0.3
- PR openai#1145 (AnirudhRahul, valerio-endorsed) closed-form n-gram tilt with Σ P=1 Z renormalization

Compliance: causal hints, single-pass, Σ P=1 by construction, no SLOT, no n-gram cache, no Pre-Quant TTT.
System deps: gcc + lrzip auto-installed by setup.sh; PyTorch 2.9.1 + Triton + Flash Attn 3.
One-command reproduction:
bash setup.sh
SEED={42,0,1234} bash run.sh
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request on Apr 30, 2026:
… breakthrough

NULL/NEUTRAL RESULTS (within ±0.0005 noise):
- S37 GPTQ_BATCHES=32: 1.05884 (null)
- S38 TTT_BETA2=0.995: 1.05884 (null)
- S44 GLOBAL_TTT_LR=0.01: 1.05913 (within noise)
- S46 GLOBAL_TTT_EPOCHS=2: 1.05902 (null)

NEGATIVE RESULTS:
- S36 lzma compressor: rejected
- S36v2 LQER_TOP_K=2: 1.05912
- S41 openai#1965 bundle: 1.05916
- S42 LQER 8/5 + EMA 0.997: 1.05912 (EMA contaminated)
- S43 LQER 8/5 isolated: 1.05925
- S52 LeakyReLU 0.3: 1.05977 (PR openai#1948 doesn't transfer to PR openai#1797)
- S53 WARMDOWN_FRAC=0.95 + MIN_LR=0.05: 1.05950 (best pre-quant 1.06061 but bigger quant tax)

INFRASTRUCTURE FIXES:
- S39 lrzip -k flag bug, S40 SSH disconnect, S45 NCCL crash
- S47/S49/S51 LeakyReLU integration bugs

BREAKTHROUGH:
- S54 n-gram tilt port from PR openai#1145/openai#1967: 1.05692 single seed (seed 314)
- Pre-quant: 1.06057, Quantized: 1.06917, Final: 1.05692
- Eval: 503.4 s under 600 s cap, Size: 15,944,666 bytes under 16 MB cap
- Hint precompute outside timer: 173 s (legal path)
- Mode B with fused_log_softmax_dual_gather kernel
- Hints fired on 13M of 47M tokens (27%)
- Delta from current-env baseline: -0.00208 BPB

Validating seeds 42, 1234 next.
Record: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938 (val_bpb = 1.06242)
val_bpb (3-seed mean) = 1.06242 | σ ≈ 0.00013 | ~15.95 MB | 8×H100 SXM | 600 s training + 600 s eval
A joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16), with thanks to Prof. Hao Lin (Fordham University) for sponsoring the 8×H100 SXM and 4×RTX 4090 compute used in this submission, Xingyuan Ding for additional experiments, Bill (Yiyuan) Li for meaningful discussions on tokenizers, Lijun Yu (@Lijun-Yu) for his invaluable insights, and Hang Zhou (@greyjoeyzhou) for project discussions.
TL;DR
Extends PR #1938 (Billy Li & Tim Shen's S0/PR1851 + Cap Tokenizer + LQER + Global TTT, val_bpb=1.0713) with two algorithmically free wins:
- LeakyReLU² slope 0.5 → 0.3: −0.00073 BPB free win; size-neutral, wallclock-neutral. (4-point sweep confirms 0.3 is the minimum — see Key Change 1.)
- GPTQ Hinv path: `chol → cholesky_inverse → chol(upper)` replaced by reverse Cholesky + triangular solve — mathematically equivalent within fp32 ULP, 2.07–2.24× faster on RTX 4090 cuSOLVER microbench at the GPTQ workload range. (Key Change 2.)

Both are hardcoded inside `train_gpt.py` (the variant from PR #1867), which also ships this PR's compliance-tuned defaults on top of PR #1938: `LQER_TOP_K=1`, `GATED_ATTN_QUANT_GATE=1`, `TTT_BATCH_SIZE=16`, `PHASED_TTT_NUM_PHASES=3`, `GPTQ_RESERVE_SECONDS=16`.

Result
GPTQ reserve-time accounting: `GPTQ_RESERVE_SECONDS=16` reserves the tail of the 600 s training budget for quantization, so train + GPTQ stays ≤ 600 s. A hedged sketch of the accounting follows.
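A minimal sketch of the idea, assuming a monotonic wallclock check once per step; `train_step` and `run_gptq` are hypothetical stand-ins, not `train_gpt.py`'s actual functions:

```python
import time

GPTQ_RESERVE_SECONDS = 16   # default shipped in this PR
TRAIN_BUDGET_SECONDS = 600  # the competition's 600 s training cap

def train_step() -> None:
    time.sleep(0.01)  # stand-in for one optimizer step

def run_gptq() -> None:
    time.sleep(1.0)   # stand-in for the post-training GPTQ pass

start = time.monotonic()
# Stop training early so GPTQ still fits inside the single 600 s budget.
while time.monotonic() - start < TRAIN_BUDGET_SECONDS - GPTQ_RESERVE_SECONDS:
    train_step()
run_gptq()  # runs in the reserved tail, keeping train + GPTQ <= 600 s
```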
Key Change 1: Leaky ReLU² slope = 0.3
4-point sweep at fixed seed=42 / 1.0× batch / 600 s wallclock: shallow V minimum at 0.3, size-neutral, no wallclock cost. Hardcoded in `train_gpt.py` lines 694-695 (Triton kernel) and line 910 (eager fallback); a hedged sketch of the eager form follows.
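A minimal sketch of the eager fallback, assuming the activation is the plain square of `F.leaky_relu` (the Triton kernel fuses the same math); `leaky_relu_sq` is a name chosen here, not necessarily the one in `train_gpt.py`:

```python
import torch
import torch.nn.functional as F

LEAKY_RELU_SQ_SLOPE = 0.3  # this PR; PR #1938 used 0.5

def leaky_relu_sq(x: torch.Tensor, slope: float = LEAKY_RELU_SQ_SLOPE) -> torch.Tensor:
    """LeakyReLU(x, slope)**2: x**2 for x >= 0, (slope*x)**2 for x < 0.

    Lowering the slope from 0.5 to 0.3 shrinks the negative branch's
    contribution from 25% to 9% of the positive curvature.
    """
    return F.leaky_relu(x, negative_slope=slope).square()
```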
Key Change 2: GPTQ reverse-Cholesky Hinv path

Replaces the three-kernel `chol → cholesky_inverse → chol(upper)` sequence with the mathematically equivalent single-pass reverse Cholesky + triangular solve; both paths are sketched below.
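A standalone sketch of the two paths (function names are ours; the in-tree version is the hardcoded block at `train_gpt.py` lines 1870-1874). The flip identity: with J the index-reversal permutation, `chol(JHJ) = Lf` gives `H = R Rᵀ` for the upper-triangular `R = J Lf J`, so `chol(H⁻¹, upper) = R⁻¹` by positive-diagonal uniqueness:

```python
import torch

def hinv_baseline(H: torch.Tensor) -> torch.Tensor:
    """PR #1938 path: three cuSOLVER kernels, materializes H^-1 explicitly."""
    L = torch.linalg.cholesky(H)                    # H = L L^T
    Hinv = torch.cholesky_inverse(L)                # explicit H^-1
    return torch.linalg.cholesky(Hinv, upper=True)  # U with H^-1 = U^T U

def hinv_reverse_cholesky(H: torch.Tensor) -> torch.Tensor:
    """This PR: factor the flipped matrix, then one triangular solve."""
    Lf = torch.linalg.cholesky(H.flip([0, 1]))      # J H J = Lf Lf^T
    R = Lf.flip([0, 1])                             # upper triangular, H = R R^T
    eye = torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
    # R^{-1} is upper triangular with positive diagonal, hence chol(H^-1, upper)
    return torch.linalg.solve_triangular(R, eye, upper=True)

# Equivalence check on a random SPD matrix:
n = 512
A = torch.randn(n, n, dtype=torch.float64)
H = A @ A.T + n * torch.eye(n, dtype=torch.float64)
assert torch.allclose(hinv_baseline(H), hinv_reverse_cholesky(H))
```

The saving comes from skipping the explicit H⁻¹ materialization and the second Cholesky factorization; the triangular solve against the identity is the only extra kernel.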
(The proof uses `chol(H^{-1}, upper)` uniqueness under the positive-diagonal constraint; full derivation in the authors' Stage 7 ablation note.)

RTX 4090 cuSOLVER fp32 microbench: 2.07–2.24× faster across the GPTQ workload range n=512..4096 (a hedged timing harness follows the numerics note).
Numerics: max relative error ≤ 5.3e-7 across n=64..2048; artifact bytes byte-equivalent within brotli noise. Hardcoded in `train_gpt.py` lines 1870-1874.
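A hedged harness in the spirit of the authors' microbench (the 2.07–2.24× figures above are theirs; the function and variable names here are ours):

```python
import torch

def hinv_baseline(H):
    return torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)

def hinv_reverse_cholesky(H):
    R = torch.linalg.cholesky(H.flip([0, 1])).flip([0, 1])
    eye = torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
    return torch.linalg.solve_triangular(R, eye, upper=True)

def median_ms(fn, H, iters=50):
    fn(H)                        # warmup
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        t0 = torch.cuda.Event(enable_timing=True)
        t1 = torch.cuda.Event(enable_timing=True)
        t0.record(); fn(H); t1.record()
        torch.cuda.synchronize()
        times.append(t0.elapsed_time(t1))
    return sorted(times)[len(times) // 2]

for n in (512, 1024, 2048, 4096):         # the GPTQ workload range
    A = torch.randn(n, n, device="cuda")  # fp32, matching the authors' bench
    H = A @ A.T + n * torch.eye(n, device="cuda")
    base = median_ms(hinv_baseline, H)
    fast = median_ms(hinv_reverse_cholesky, H)
    print(f"n={n}: {base:.3f} ms -> {fast:.3f} ms ({base / fast:.2f}x)")
```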
Compliance-tuned defaults (this PR vs PR #1938)

| Knob | Value | Note |
| --- | --- | --- |
| `LQER_TOP_K` | 1 | top-1 (`tok_emb`) only; −0.00044 BPB, saves bytes |
| `GATED_ATTN_QUANT_GATE` | 1 | quantizes `attn_gate_w`; −0.00011 BPB |
| `TTT_BATCH_SIZE` | 16 | |
| `PHASED_TTT_NUM_PHASES` | 3 | |
| `GPTQ_RESERVE_SECONDS` | 16 | keeps train + GPTQ ≤ 600 s |
| `LEAKY_RELU_SQ_SLOPE` | 0.3 (in script) | was 0.5 (Key Change 1) |
| GPTQ Hinv path | reverse Cholesky (in script) | was `cholesky_inverse + chol(upper)` (Key Change 2) |

All other hparams inherit from `train_gpt.py`'s `Hyperparameters` defaults, which match the PR #1938 envelope. A hypothetical sketch of how these defaults sit in that class follows.
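A hedged sketch only: field names mirror the knob names above, and the real `Hyperparameters` class in `train_gpt.py` has many more fields.

```python
from dataclasses import dataclass

@dataclass
class Hyperparameters:
    # Compliance-tuned defaults from the table above; everything else
    # inherits the PR #1938 envelope. Field names here are illustrative.
    LQER_TOP_K: int = 1
    GATED_ATTN_QUANT_GATE: int = 1
    TTT_BATCH_SIZE: int = 16
    PHASED_TTT_NUM_PHASES: int = 3
    GPTQ_RESERVE_SECONDS: int = 16
    LEAKY_RELU_SQ_SLOPE: float = 0.3  # Key Change 1 (hardcoded in-script)

args = Hyperparameters()  # defaults already encode the compliance envelope
```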
Architecture

11L × 512d × 8H / 4KV, MLP 4× (2048 hidden), LeakyReLU(0.3)². Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings (vocab 8192, caseops-augmented), logit softcap = 30.0. Depth recurrence (loops layers 3-5, ×2, activated at frac=0.35). Parallel residuals from layer 8. Skip gates. SmearGate with BOS mask. Sparse attention gates. `model_params = 35,945,671`. (A hedged sketch of the softcap is below.)
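The logit softcap, assuming the usual tanh form (inferred from the cap value; the PR only states "softcap=30.0"):

```python
import torch

LOGIT_SOFTCAP = 30.0

def softcap_logits(logits: torch.Tensor, cap: float = LOGIT_SOFTCAP) -> torch.Tensor:
    # Smoothly bounds logits to (-cap, cap) while staying differentiable;
    # near zero it is approximately the identity.
    return cap * torch.tanh(logits / cap)
```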
Quantization

Full-Hessian GPTQ + SDClip, on the reverse-Cholesky Hinv path:
- Attention (`c_q`, `c_k`, `c_v`, `proj`) and MLP (`fc`, `proj`) weights
- `tok_emb.weight` only for LQER (`LQER_TOP_K=1`)
- `attn_gate_w` (`GATED_ATTN_QUANT_GATE=1`)

TTT
Phased TTT, 3 phases × 2000 prefix docs, score-first, Adam optimizer, cosine LR (peak 1e-4). LoRA rank=96 over K, MLP, O projections.
`TTT_BATCH_SIZE=16`. The script's `total_eval_time` is the canonical eval timer (matches the convention used by past SOTA records). A sketch of the per-phase cosine schedule follows.
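A minimal sketch assuming each phase decays from the peak toward zero with no warmup (both assumptions; the PR only states "cosine LR (peak 1e-4)"):

```python
import math

TTT_PEAK_LR = 1e-4
PHASED_TTT_NUM_PHASES = 3

def ttt_cosine_lr(step: int, steps_per_phase: int, peak: float = TTT_PEAK_LR) -> float:
    # Restarted cosine: each of the 3 phases decays from peak toward 0.
    phase_step = step % steps_per_phase
    return peak * 0.5 * (1.0 + math.cos(math.pi * phase_step / steps_per_phase))
```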
Compliance

- train + GPTQ ≤ 600 s (enforced by the `GPTQ_RESERVE_SECONDS` accounting)
- `total_eval_time` ≤ 600 s (the canonical eval timer)

Dataset
This submission uses the pre-built case-op augmented FineWeb-10B tokenization from `romeerp/parameter-golf-caseops-v1` (pre-built shards), the same dataset that PR #1729 / PR #1736 / PR #1851 use.

The bijective case-op tokenizer (`fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model`, shipped in `tokenizers/`) and the build script (`prepare_caseops_data.py` + `lossless_caps.py`) are included for byte-exact rebuild, but using the pre-built shards from `romeerp/parameter-golf-caseops-v1` is the recommended path.
Reproducing
The `Hyperparameters` defaults already encode this PR's compliance-tuned envelope (this PR + b-series, on top of PR #1938); no other env exports are needed.

Builds On
Acknowledgments
A joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16).
With thanks to:
Additional credits (technique stack):