
Record: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938 (val_bpb = 1.06242)#1948

Open
TimS-ml wants to merge 5 commits into openai:main from TimS-ml:submission-2026-04-29-f1-free-3seed-tbd

Conversation


@TimS-ml TimS-ml commented Apr 29, 2026

Record: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938 (val_bpb = 1.06242)

Note: This README captures only the bare submission record. The full set of
insights from our parameter-golf run — every PR iteration we tried, the
hyperparameter-tuning experiments behind each design choice, and the ablation
results that drove our decisions — is being compiled into a more detailed
write-up at: https://www.junchengbillyli.com/llm-notes.html

val_bpb (3-seed mean) = 1.06242 | σ ≈ 0.00013 | ~15.95 MB | 8×H100 SXM | 600 s training + 600 s eval

A joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16), with thanks to Prof. Lin Hao (Fordham University) for sponsoring the 8×H100 SXM and 4×RTX 4090 compute used in this submission, Xingyuan Ding for additional experiments, Bill (Yiyuan) Li for meaningful discussions on tokenizers, Lijun Yu (@Lijun-Yu) for his invaluable insights, and Hang Zhou (@greyjoeyzhou) for project discussions.

TL;DR

Extends PR #1938 (Billy Li & Tim Shen's S0/PR1851 + Cap Tokenizer + LQER + Global TTT, val_bpb=1.0713) with two algorithmically free wins:

  1. Leaky ReLU squared slope 0.5 → 0.3: a −0.00073 BPB free win; size-neutral and wallclock-neutral. (A 4-point sweep confirms 0.3 is the minimum — see Key Change 1.)
  2. GPTQ reverse-Cholesky + triangular solve instead of the standard chol → cholesky_inverse → chol(upper) — mathematically equivalent within fp32 ULP, 2.07–2.24× faster on RTX 4090 cuSOLVER microbench at the GPTQ workload range. (Key Change 2.)

Both are hardcoded inside train_gpt.py (the variant from PR #1867), which also ships this PR's compliance-tuned defaults on top of PR #1938: LQER_TOP_K=1, GATED_ATTN_QUANT_GATE=1, TTT_BATCH_SIZE=16, PHASED_TTT_NUM_PHASES=3, GPTQ_RESERVE_SECONDS=16.

Result

| Seed | Post-TTT val_bpb (final) | Artifact bytes |
| --- | --- | --- |
| 1334 | 1.06257 | 15,947,664 |
| 42 | 1.06232 | 15,945,920 |
| 999 | 1.06237 | 15,946,532 |
| Mean | 1.06242 (σ ≈ 0.00013) | 15,946,705 |

GPTQ reserve-time accounting

(Update 04-30): We've noticed that several leaderboard submissions appear to
exceed the 10-minute training cap once the full GPTQ pipeline (Hessian
collection, quantization, serialize, compress) is accounted for. From our own
measurements, a gptq_reserve_seconds of 0.5 s is far too low: GPTQ Hessian
collection takes ~3.5-4 s (depending on calibration batch size), GPTQ
quantization itself ~10 s, and the serialize+compress step adds another
~60-70 s for Brotli or ~90-100 s for lrzip per group. Among the top leaderboard
PRs we surveyed, observed gptq_reserve_seconds values range across 0.5 / 4 /
8 s; this submission uses 16 s so that the full pipeline completes inside the
600 s training cap with margin. The few-second discrepancy is unlikely to be
large enough to materially change the leaderboard score or ranking, but we
think it's worth flagging.
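
A back-of-the-envelope check of that budget, using the approximate component timings quoted above (illustrative arithmetic only, not new measurements):

# Rough GPTQ pipeline budget check (numbers are the approximate figures above).
hessian_s  = 4.0    # Hessian collection: ~3.5-4 s
quantize_s = 10.0   # GPTQ quantization: ~10 s
brotli_s   = 65.0   # serialize + compress: ~60-70 s (Brotli)
lrzip_s    = 95.0   # serialize + compress: ~90-100 s (lrzip)

print("in-timer GPTQ work (Hessian + quantize):", hessian_s + quantize_s, "s")
print("full pipeline with Brotli:", hessian_s + quantize_s + brotli_s, "s")
print("full pipeline with lrzip:", hessian_s + quantize_s + lrzip_s, "s")
# A 0.5 s reserve covers none of this; the 16 s reserve used here covers
# Hessian collection plus quantization with a few seconds of margin.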

Key Change 1: Leaky ReLU² slope = 0.3

4-point sweep at fixed seed=42 / 1.0× batch / 600 s wallclock:

| slope | TTT BPB | Δ vs 0.30 |
| --- | --- | --- |
| 0.25 | 1.06151 | +0.00012 |
| 0.30 | 1.06139 | 0 |
| 0.35 | 1.06192 | +0.00053 |
| 0.50 (prior baseline) | 1.06212 | +0.00073 |
| 0.70 | 1.06267 | +0.00128 |

A shallow V-shaped minimum at 0.3; size-neutral, no wallclock cost. Hardcoded in train_gpt.py lines 694-695 (Triton kernel) and line 910 (eager fallback).
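
For reference, the activation the slope controls can be sketched in a few lines of eager PyTorch (illustrative only; the shipped version is the Triton kernel at lines 694-695 with the eager fallback at line 910, and the function name here is made up):

import torch
import torch.nn.functional as F

def leaky_relu_sq(x: torch.Tensor, slope: float = 0.3) -> torch.Tensor:
    # Squared leaky ReLU: negatives are scaled by `slope` before squaring.
    return F.leaky_relu(x, negative_slope=slope).square()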

Key Change 2: GPTQ reverse-Cholesky Hinv path

Replaces

Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))   # 1 chol + 2 tri-solve
Hinv = torch.linalg.cholesky(Hinv, upper=True)            # 1 chol on dense H^{-1}

with the mathematically equivalent single-pass

H_flip = torch.flip(H, dims=(0, 1))                         # reverse row/col order
L_flip = torch.linalg.cholesky(H_flip)                      # 1 chol on flipped H
U      = torch.flip(L_flip, dims=(0, 1))                    # upper factor: H = U Uᵀ
Hinv   = torch.linalg.solve_triangular(U, eye, upper=True)  # 1 tri-solve vs identity

(The proof uses chol(H^{-1}, upper) uniqueness under the positive-diagonal constraint; full derivation in the authors' Stage 7 ablation note.)
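
A compact way to see the equivalence (the Stage 7 note has the full bookkeeping): let P be the order-reversing permutation that torch.flip applies, so P² = I and conjugating a lower-triangular matrix by P yields an upper-triangular one. With L_flip = chol(P H P) and U = P L_flip P, we get U Uᵀ = P L_flip L_flipᵀ P = P (P H P) P = H, hence H^{-1} = U^{-T} U^{-1} = (U^{-1})ᵀ (U^{-1}). Since U^{-1} is upper triangular with positive diagonal, uniqueness of the Cholesky factorization gives U^{-1} = chol(H^{-1}, upper), which is exactly what the triangular solve against the identity returns.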

RTX 4090 cuSOLVER fp32 microbench:

| n | baseline | reverse_cholesky | speedup |
| --- | --- | --- | --- |
| 512 | 0.78 ms | 0.38 ms | 2.07× |
| 1024 | 1.80 ms | 0.82 ms | 2.18× |
| 2048 | 3.91 ms | 1.75 ms | 2.23× |
| 4096 | 12.99 ms | 5.81 ms | 2.24× |

Numerics: max relative error ≤ 5.3e-7 across n=64..2048; artifacts byte-equivalent to within Brotli noise. Hardcoded in train_gpt.py lines 1870-1874.
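
The equivalence and speedup claims can be sanity-checked with a small standalone script; a minimal sketch, assuming a CUDA device and fp32 (not the exact harness behind the table above, and the function names are illustrative):

import time
import torch

def baseline_hinv(H):
    # Standard GPTQ path: dense inverse of H, then its upper Cholesky factor.
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    return torch.linalg.cholesky(Hinv, upper=True)

def reverse_cholesky_hinv(H):
    # Reverse-Cholesky path: one factorization plus one triangular solve.
    U = torch.flip(torch.linalg.cholesky(torch.flip(H, dims=(0, 1))), dims=(0, 1))
    eye = torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
    return torch.linalg.solve_triangular(U, eye, upper=True)

def bench_ms(fn, H, iters=20):
    for _ in range(3):          # warmup
        fn(H)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(H)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

torch.manual_seed(0)
for n in (512, 1024, 2048, 4096):
    A = torch.randn(n, n, device="cuda", dtype=torch.float32)
    H = A @ A.T + n * torch.eye(n, device="cuda")   # SPD stand-in for a GPTQ Hessian
    ref, new = baseline_hinv(H), reverse_cholesky_hinv(H)
    rel_err = ((ref - new).abs().max() / ref.abs().max()).item()
    print(f"n={n}: rel_err={rel_err:.1e}, "
          f"baseline={bench_ms(baseline_hinv, H):.2f} ms, "
          f"reverse={bench_ms(reverse_cholesky_hinv, H):.2f} ms")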

Compliance-tuned defaults (this PR vs PR #1938)

| Hparam | PR #1938 | This PR | Reason |
| --- | --- | --- | --- |
| LQER_TOP_K | 3 | 1 | top-error matrix (tok_emb) only; −0.00044 BPB, saves bytes |
| GATED_ATTN_QUANT_GATE | 0 | 1 | int8 row-quant for attn_gate_w; −0.00011 BPB |
| TTT_BATCH_SIZE | 64 | 16 | smaller phased batch |
| PHASED_TTT_NUM_PHASES | 1 | 3 | −0.00118 BPB |
| GPTQ_RESERVE_SECONDS | 4 | 16 | observed Hessian (3.5 s) + quantize (12.2 s) ≈ 16 s; required for train + GPTQ ≤ 600 s |
| LEAKY_RELU_SQ_SLOPE (in script) | 0.5 | 0.3 | Key Change 1 |
| GPTQ Hinv path (in script) | cholesky_inverse + chol(upper) | reverse Cholesky + tri-solve | Key Change 2 |

All other hparams inherit from train_gpt.py's Hyperparameters defaults, which match the PR #1938 envelope.

Architecture

11L × 512d × 8H / 4KV, MLP 4× (2048 hidden), LeakyReLU(0.3)². Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings (vocab 8192, caseops-augmented), logit softcap=30.0. Depth recurrence (loops layers 3-5, ×2, activated at frac=0.35). Parallel residuals from layer 8. Skip gates. SmearGate with BOS mask. Sparse attention gates. model_params = 35,945,671.
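
For quick skimming, the same envelope written out as a plain config sketch (field names are illustrative, not the actual Hyperparameters attributes in train_gpt.py):

from dataclasses import dataclass

@dataclass
class ArchSketch:
    # Illustrative restatement of the architecture above; not the real
    # Hyperparameters class from train_gpt.py.
    n_layer: int = 11
    d_model: int = 512
    n_head: int = 8
    n_kv_head: int = 4
    mlp_hidden: int = 2048            # 4x d_model
    leaky_relu_sq_slope: float = 0.3
    rope_dims: int = 16               # partial RoPE: 16 of 64 head dims
    vocab_size: int = 8192            # caseops-augmented, tied embeddings
    logit_softcap: float = 30.0
    recur_layers: tuple = (3, 4, 5)   # depth recurrence over layers 3-5, x2
    recur_start_frac: float = 0.35    # recurrence activated at frac = 0.35
    parallel_residual_from: int = 8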

Quantization

Full-Hessian GPTQ + SDClip, on the reverse-Cholesky Hinv path:

  • GPTQ int6 (clip_sigmas=12.85): all attn (c_q, c_k, c_v, proj) and MLP (fc, proj) weights
  • GPTQ int7 + LQER asymmetric (rank=4, factor int4, group_size=64): tok_emb.weight only (LQER_TOP_K=1)
  • Dedicated int8 row-quant: attn_gate_w (GATED_ATTN_QUANT_GATE=1); a minimal sketch follows this list
  • fp16 passthrough: scalar params + small parameter weights
  • Brotli-11 final compression → artifact ≈ 15.95 MB
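
As flagged in the attn_gate_w item above, the dedicated gate path is plain symmetric per-row int8; a minimal sketch of that scheme (illustrative; the shipped implementation in train_gpt.py may differ in rounding and storage details):

import torch

def int8_row_quant(w: torch.Tensor):
    # One scale per output row; symmetric, no zero-point.
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale.to(torch.float16)

def int8_row_dequant(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale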

TTT

Phased TTT, 3 phases × 2000 prefix docs, score-first, Adam optimizer, cosine LR (peak 1e-4). LoRA rank=96 over K, MLP, O projections. TTT_BATCH_SIZE=16. The script's total_eval_time is the canonical eval timer (matches the convention used by past SOTA records).
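
The LoRA side of TTT is the standard low-rank adapter pattern; a minimal sketch of a rank-96 wrapper over a frozen linear projection (illustrative class, not the adapter code in train_gpt.py):

import torch

class LoRALinear(torch.nn.Module):
    # Frozen base projection plus a trainable rank-96 update: y = W x + (A B) x.
    # A starts at zero, so the adapter contributes nothing until TTT updates it.
    def __init__(self, base: torch.nn.Linear, rank: int = 96):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.B = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + torch.nn.functional.linear(x, self.A @ self.B)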

Compliance

| Cap | Limit | Observed | Margin |
| --- | --- | --- | --- |
| Artifact (decimal) | 16,000,000 bytes | 15,947,664 (max of 3 seeds) | 52,336 bytes |
| train + GPTQ | 600 s | 584.1 s + 15.6 s ≈ 599.7 s | ~0.3 s |
| total_eval_time | 600 s | 482.6 s / 485.6 s / 587.7 s | 12–118 s |

Dataset

This submission uses the pre-built case-op augmented FineWeb-10B tokenization from
romeerp/parameter-golf-caseops-v1
(pre-built shards), the same dataset that PR #1729 / PR #1736 / PR #1851 use.
The bijective case-op tokenizer (fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model,
shipped in tokenizers/) and the build script (prepare_caseops_data.py +
lossless_caps.py) are included for byte-exact rebuild, but using the
pre-built shards from romeerp/parameter-golf-caseops-v1 is the recommended
path
.

Reproducing

# Option A (recommended): use pre-built shards from HF.
huggingface-cli download romeerp/parameter-golf-caseops-v1 \
  --repo-type dataset \
  --local-dir ./data/datasets/fineweb10B_sp8192_caseops/

# Option B: rebuild locally with the shipped scripts (prepare_caseops_data.py + lossless_caps.py).

# Either way, the script expects shards at
# ./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/
# (the path layout is preserved across both options).

export RUN_ID=repro_seed42
export SEED=42
torchrun --nproc_per_node=8 --standalone train_gpt.py

Hyperparameters defaults already encode this PR's compliance-tuned envelope (this PR + b-series, on top of PR #1938); no other env exports are needed.

Builds On

| Layer | Origin |
| --- | --- |
| PR #1938 (@lijuncheng16 & @TimS-ml — S0/PR1851 + Cap Tokenizer + LQER + Global TTT, val_bpb=1.0713) | base submission stack |
| PR #1867 (@lijuncheng16 & @TimS-ml) | training script |
| PR #1851 (@aquariouseworkman — SmearGate BOS fix + LQER asymmetric + phased TTT) | architecture / quantization |
| PR #1797 (@dexhunter, audit by @cocohearts) | SmearGate, LQER asym |
| PR #1787 (@nprime06) | SparseAttnGate, FusedCE, MIN_LR |
| PR #1729 / PR #1736 (@romeerp) | CaseOps tokenizer + phased TTT |
| PR #1394 (@clarkkev) | GPTQ + SDClip + SP8192 |
| PR #549 (@abaybektursun) | Score-first TTT framework |

Acknowledgments

A joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16).

With thanks to:

  • Prof. Lin Hao (Fordham University) — for sponsoring the 8×H100 SXM and 4×RTX 4090 compute used to produce all sweep, training, and microbench results in this record.
  • Xingyuan Ding — for experiments and A100 support.
  • Bill (Yiyuan) Li — for meaningful discussions on tokenizers.
  • Lijun Yu (@Lijun-Yu) — for his invaluable insights.
  • Hang Zhou (@greyjoeyzhou) — for project discussions and for the concurrent auto-research agent infrastructure.

Additional credits (technique stack):

TimS-ml and others added 3 commits April 29, 2026 17:28
…enai#1938 (val_bpb=1.06242)

3-seed sweep on seeds 1334, 42, 999 of the v2b training script from PR openai#1867,
extending Billy Li's PR openai#1938 stack with two algorithmically free wins:

- LeakyReLU squared slope 0.5 -> 0.3 (Stage 4 ablation: -0.00073 BPB,
  size-neutral, wallclock-neutral; 4-point sweep confirms 0.3 is the minimum).
- GPTQ Hinv path: cholesky_inverse + chol(upper) -> reverse Cholesky +
  triangular solve (Stage 7 ablation: mathematically equivalent within fp32
  ULP, 2.07-2.24x faster on RTX 4090 cuSOLVER microbench at the GPTQ workload
  range n=512..4096).

Plus compliance-tuned defaults baked into train_gpt.py's Hyperparameters:
LQER_TOP_K=1, GATED_ATTN_QUANT_GATE=1, TTT_BATCH_SIZE=16,
PHASED_TTT_NUM_PHASES=3, GPTQ_RESERVE_SECONDS=16.

Result: val_bpb (3-seed mean) = 1.06242, sigma ~ 0.00013, ~15.95 MB artifact.
Delta vs current SOTA (PR openai#1493, 1.0810): -0.0186 BPB, well past the 0.005-nat
significance threshold.

Joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16). Compute
sponsored by Prof. Hao Lin (Fordham University). Concurrent ablation
infrastructure by Hang Zhou (@greyjoeyzhou).
aerosta pushed a commit to aerosta/parameter-golf that referenced this pull request Apr 30, 2026
3-seed mean val_bpb = 1.05851479 (std 0.000762, seeds 42/0/1234) on track_10min_16mb.

Stack:
- PR openai#1945 (alertcat) V21 base = PR openai#1908 + AWQ-Lite + AsymLogit Rescale
- PR openai#1953 (andrewbaggio1) TTT/QK env knobs (TTT_LR=0.75, QK_GAIN=5.25, no_qv mask)
- PR openai#1948 (TimS-ml + lijuncheng16) LeakyReLU squared slope 0.3
- PR openai#1145 (AnirudhRahul, valerio-endorsed) closed-form n-gram tilt with Σ P=1 Z renormalization

Compliance: causal hints, single-pass, Σ P=1 by construction, no SLOT,
no n-gram cache, no Pre-Quant TTT.

System deps: gcc + lrzip auto-installed by setup.sh; PyTorch 2.9.1 + Triton + Flash Attn 3.

One-command reproduction:
  bash setup.sh
  SEED={42,0,1234} bash run.sh
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request Apr 30, 2026
… breakthrough

NULL/NEUTRAL RESULTS (within ±0.0005 noise):
- S37 GPTQ_BATCHES=32: 1.05884 (null)
- S38 TTT_BETA2=0.995: 1.05884 (null)
- S44 GLOBAL_TTT_LR=0.01: 1.05913 (within noise)
- S46 GLOBAL_TTT_EPOCHS=2: 1.05902 (null)

NEGATIVE RESULTS:
- S36 lzma compressor: rejected
- S36v2 LQER_TOP_K=2: 1.05912
- S41 openai#1965 bundle: 1.05916
- S42 LQER 8/5 + EMA 0.997: 1.05912 (EMA contaminated)
- S43 LQER 8/5 isolated: 1.05925
- S52 LeakyReLU 0.3: 1.05977 (PR openai#1948 doesn't transfer to PR openai#1797)
- S53 WARMDOWN_FRAC=0.95 + MIN_LR=0.05: 1.05950 (best pre-quant 1.06061 but bigger quant tax)

INFRASTRUCTURE FIXES:
- S39 lrzip -k flag bug, S40 SSH disconnect, S45 NCCL crash
- S47/S49/S51 LeakyReLU integration bugs

BREAKTHROUGH:
- S54 n-gram tilt port from PR openai#1145/openai#1967: 1.05692 single seed (seed 314)
  - Pre-quant: 1.06057, Quantized: 1.06917, Final: 1.05692
  - Eval: 503.4s under 600s cap, Size: 15,944,666 bytes under 16MB cap
  - Hint precompute outside timer: 173s (legal path)
  - Mode B with fused_log_softmax_dual_gather kernel
  - Hints fired on 13M of 47M tokens (27%)
  - Delta from current-env baseline: -0.00208 BPB

Validating seeds 42, 1234 next.
