
Record: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938 (val_bpb = 1.06242)#1948

Open
TimS-ml wants to merge 5 commits into openai:main from TimS-ml:submission-2026-04-29-f1-free-3seed-tbd

Conversation


@TimS-ml TimS-ml commented Apr 29, 2026

Record: Leaky ReLU Slope + GPTQ Reverse-Cholesky Speedup + PR #1938 (val_bpb = 1.06242)

Note: This README captures only the bare submission record. The full set of
insights from our parameter-golf run — every PR iteration we tried, the
hyperparameter-tuning experiments behind each design choice, and the ablation
results that drove our decisions — is being compiled into a more detailed
write-up at: https://www.junchengbillyli.com/llm-notes.html

val_bpb (3-seed mean) = 1.06242 | σ ≈ 0.00013 | ~15.95 MB | 8×H100 SXM | 600 s training + 600 s eval

A joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16), with thanks to Prof. Lin Hao (Fordham University) for sponsoring the 8×H100 SXM and 4×RTX 4090 compute used in this submission, Xingyuan Ding for additional experiments, Bill (Yiyuan) Li for meaningful discussions on tokenizers, Lijun Yu (@Lijun-Yu) for his invaluable insights, and Hang Zhou (@greyjoeyzhou) for project discussions.

TL;DR

Extends PR #1938 (Billy Li & Tim Shen's S0/PR1851 + Cap Tokenizer + LQER + Global TTT, val_bpb=1.0713) with two algorithmically free wins:

  1. Leaky ReLU squared slope 0.5 → 0.3: a −0.00073 BPB free win; size-neutral and wallclock-neutral. (A 4-point sweep confirms 0.3 is the minimum — see Key Change 1.)
  2. GPTQ reverse-Cholesky + triangular solve instead of the standard chol → cholesky_inverse → chol(upper) — mathematically equivalent within fp32 ULP, 2.07–2.24× faster on RTX 4090 cuSOLVER microbench at the GPTQ workload range. (Key Change 2.)

Both are hardcoded inside train_gpt.py (the variant from PR #1867), which also ships this PR's compliance-tuned defaults on top of PR #1938: LQER_TOP_K=1, GATED_ATTN_QUANT_GATE=1, TTT_BATCH_SIZE=16, PHASED_TTT_NUM_PHASES=3, GPTQ_RESERVE_SECONDS=16.

Result

| Seed | Post-TTT val_bpb (final) | Artifact bytes |
| --- | --- | --- |
| 1334 | 1.06257 | 15,947,664 |
| 42 | 1.06232 | 15,945,920 |
| 999 | 1.06237 | 15,946,532 |
| Mean | 1.06242 (σ ≈ 0.00013) | 15,946,705 |

GPTQ reserve-time accounting

(Update 04-30): We've noticed that several leaderboard submissions appear to
exceed the 10-minute training cap once the full GPTQ pipeline (Hessian
collection, quantization, serialize, compress) is accounted for. From our own
measurements, a gptq_reserve_seconds of 0.5 s is far too low: GPTQ Hessian
collection takes ~3.5-4 s (depending on calibration batch size), GPTQ
quantization itself ~10 s, and the serialize+compress step adds another
~60-70 s for Brotli or ~90-100 s for lrzip per group. Among the top leaderboard
PRs we surveyed, observed gptq_reserve_seconds values range across 0.5 / 4 /
8 s; this submission uses 16 s so that the full pipeline completes inside the
600 s training cap with margin. The few-second discrepancy is unlikely to be
large enough to materially change the leaderboard score or ranking, but we
think it's worth flagging.
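
A back-of-the-envelope check of that budget, using the approximate component timings quoted above (illustrative arithmetic only, not new measurements):

# Rough GPTQ pipeline budget check (numbers are the approximate figures above).
hessian_s  = 4.0    # Hessian collection: ~3.5-4 s
quantize_s = 10.0   # GPTQ quantization: ~10 s
brotli_s   = 65.0   # serialize + compress: ~60-70 s (Brotli)
lrzip_s    = 95.0   # serialize + compress: ~90-100 s (lrzip)

print("in-timer GPTQ work (Hessian + quantize):", hessian_s + quantize_s, "s")
print("full pipeline with Brotli:", hessian_s + quantize_s + brotli_s, "s")
print("full pipeline with lrzip:", hessian_s + quantize_s + lrzip_s, "s")
# A 0.5 s reserve covers none of this; the 16 s reserve used here covers
# Hessian collection plus quantization with a few seconds of margin.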

Key Change 1: Leaky ReLU² slope = 0.3

4-point sweep at fixed seed=42 / 1.0× batch / 600 s wallclock:

| slope | TTT BPB | Δ vs 0.30 |
| --- | --- | --- |
| 0.25 | 1.06151 | +0.00012 |
| 0.30 | 1.06139 | 0 |
| 0.35 | 1.06192 | +0.00053 |
| 0.50 (prior baseline) | 1.06212 | +0.00073 |
| 0.70 | 1.06267 | +0.00128 |

A shallow V-shaped minimum at 0.3; size-neutral, no wallclock cost. Hardcoded in train_gpt.py lines 694-695 (Triton kernel) and line 910 (eager fallback).
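
For reference, the activation the slope controls can be sketched in a few lines of eager PyTorch (illustrative only; the shipped version is the Triton kernel at lines 694-695 with the eager fallback at line 910, and the function name here is made up):

import torch
import torch.nn.functional as F

def leaky_relu_sq(x: torch.Tensor, slope: float = 0.3) -> torch.Tensor:
    # Squared leaky ReLU: negatives are scaled by `slope` before squaring.
    return F.leaky_relu(x, negative_slope=slope).square()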

Key Change 2: GPTQ reverse-Cholesky Hinv path

Replaces

Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))   # 1 chol + 2 tri-solve
Hinv = torch.linalg.cholesky(Hinv, upper=True)            # 1 chol on dense H^{-1}

with the mathematically equivalent single-pass

H_flip = torch.flip(H, dims=(0, 1))                         # reverse row/col order
L_flip = torch.linalg.cholesky(H_flip)                      # 1 chol on flipped H
U      = torch.flip(L_flip, dims=(0, 1))                    # upper factor: H = U Uᵀ
Hinv   = torch.linalg.solve_triangular(U, eye, upper=True)  # 1 tri-solve vs identity

(The proof uses chol(H^{-1}, upper) uniqueness under the positive-diagonal constraint; full derivation in the authors' Stage 7 ablation note.)
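
A compact way to see the equivalence (the Stage 7 note has the full bookkeeping): let P be the order-reversing permutation that torch.flip applies, so P² = I and conjugating a lower-triangular matrix by P yields an upper-triangular one. With L_flip = chol(P H P) and U = P L_flip P, we get U Uᵀ = P L_flip L_flipᵀ P = P (P H P) P = H, hence H^{-1} = U^{-T} U^{-1} = (U^{-1})ᵀ (U^{-1}). Since U^{-1} is upper triangular with positive diagonal, uniqueness of the Cholesky factorization gives U^{-1} = chol(H^{-1}, upper), which is exactly what the triangular solve against the identity returns.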

RTX 4090 cuSOLVER fp32 microbench:

| n | baseline | reverse_cholesky | speedup |
| --- | --- | --- | --- |
| 512 | 0.78 ms | 0.38 ms | 2.07× |
| 1024 | 1.80 ms | 0.82 ms | 2.18× |
| 2048 | 3.91 ms | 1.75 ms | 2.23× |
| 4096 | 12.99 ms | 5.81 ms | 2.24× |

Numerics: max relative error ≤ 5.3e-7 across n=64..2048; artifacts byte-equivalent to within Brotli noise. Hardcoded in train_gpt.py lines 1870-1874.
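
The equivalence and speedup claims can be sanity-checked with a small standalone script; a minimal sketch, assuming a CUDA device and fp32 (not the exact harness behind the table above, and the function names are illustrative):

import time
import torch

def baseline_hinv(H):
    # Standard GPTQ path: dense inverse of H, then its upper Cholesky factor.
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    return torch.linalg.cholesky(Hinv, upper=True)

def reverse_cholesky_hinv(H):
    # Reverse-Cholesky path: one factorization plus one triangular solve.
    U = torch.flip(torch.linalg.cholesky(torch.flip(H, dims=(0, 1))), dims=(0, 1))
    eye = torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
    return torch.linalg.solve_triangular(U, eye, upper=True)

def bench_ms(fn, H, iters=20):
    for _ in range(3):          # warmup
        fn(H)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(H)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

torch.manual_seed(0)
for n in (512, 1024, 2048, 4096):
    A = torch.randn(n, n, device="cuda", dtype=torch.float32)
    H = A @ A.T + n * torch.eye(n, device="cuda")   # SPD stand-in for a GPTQ Hessian
    ref, new = baseline_hinv(H), reverse_cholesky_hinv(H)
    rel_err = ((ref - new).abs().max() / ref.abs().max()).item()
    print(f"n={n}: rel_err={rel_err:.1e}, "
          f"baseline={bench_ms(baseline_hinv, H):.2f} ms, "
          f"reverse={bench_ms(reverse_cholesky_hinv, H):.2f} ms")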

Compliance-tuned defaults (this PR vs PR #1938)

| Hparam | PR #1938 | This PR | Reason |
| --- | --- | --- | --- |
| LQER_TOP_K | 3 | 1 | top-error matrix (tok_emb) only; −0.00044 BPB, saves bytes |
| GATED_ATTN_QUANT_GATE | 0 | 1 | int8 row-quant for attn_gate_w; −0.00011 BPB |
| TTT_BATCH_SIZE | 64 | 16 | smaller phased batch |
| PHASED_TTT_NUM_PHASES | 1 | 3 | −0.00118 BPB |
| GPTQ_RESERVE_SECONDS | 4 | 16 | observed Hessian (3.5 s) + quantize (12.2 s) ≈ 16 s; required for train + GPTQ ≤ 600 s |
| LEAKY_RELU_SQ_SLOPE (in script) | 0.5 | 0.3 | Key Change 1 |
| GPTQ Hinv path (in script) | cholesky_inverse + chol(upper) | reverse Cholesky + tri-solve | Key Change 2 |

All other hparams inherit from train_gpt.py's Hyperparameters defaults, which match the PR #1938 envelope.

Architecture

11L × 512d × 8H / 4KV, MLP 4× (2048 hidden), LeakyReLU(0.3)². Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings (vocab 8192, caseops-augmented), logit softcap=30.0. Depth recurrence (loops layers 3-5, ×2, activated at frac=0.35). Parallel residuals from layer 8. Skip gates. SmearGate with BOS mask. Sparse attention gates. model_params = 35,945,671.
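
For quick skimming, the same envelope written out as a plain config sketch (field names are illustrative, not the actual Hyperparameters attributes in train_gpt.py):

from dataclasses import dataclass

@dataclass
class ArchSketch:
    # Illustrative restatement of the architecture above; not the real
    # Hyperparameters class from train_gpt.py.
    n_layer: int = 11
    d_model: int = 512
    n_head: int = 8
    n_kv_head: int = 4
    mlp_hidden: int = 2048            # 4x d_model
    leaky_relu_sq_slope: float = 0.3
    rope_dims: int = 16               # partial RoPE: 16 of 64 head dims
    vocab_size: int = 8192            # caseops-augmented, tied embeddings
    logit_softcap: float = 30.0
    recur_layers: tuple = (3, 4, 5)   # depth recurrence over layers 3-5, x2
    recur_start_frac: float = 0.35    # recurrence activated at frac = 0.35
    parallel_residual_from: int = 8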

Quantization

Full-Hessian GPTQ + SDClip, on the reverse-Cholesky Hinv path:

  • GPTQ int6 (clip_sigmas=12.85): all attn (c_q, c_k, c_v, proj) and MLP (fc, proj) weights
  • GPTQ int7 + LQER asymmetric (rank=4, factor int4, group_size=64): tok_emb.weight only (LQER_TOP_K=1)
  • Dedicated int8 row-quant: attn_gate_w (GATED_ATTN_QUANT_GATE=1); a minimal sketch follows this list
  • fp16 passthrough: scalar params + small parameter weights
  • Brotli-11 final compression → artifact ≈ 15.95 MB
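
As flagged in the attn_gate_w item above, the dedicated gate path is plain symmetric per-row int8; a minimal sketch of that scheme (illustrative; the shipped implementation in train_gpt.py may differ in rounding and storage details):

import torch

def int8_row_quant(w: torch.Tensor):
    # One scale per output row; symmetric, no zero-point.
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale.to(torch.float16)

def int8_row_dequant(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale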

TTT

Phased TTT, 3 phases × 2000 prefix docs, score-first, Adam optimizer, cosine LR (peak 1e-4). LoRA rank=96 over K, MLP, O projections. TTT_BATCH_SIZE=16. The script's total_eval_time is the canonical eval timer (matches the convention used by past SOTA records).
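
The LoRA side of TTT is the standard low-rank adapter pattern; a minimal sketch of a rank-96 wrapper over a frozen linear projection (illustrative class, not the adapter code in train_gpt.py):

import torch

class LoRALinear(torch.nn.Module):
    # Frozen base projection plus a trainable rank-96 update: y = W x + (A B) x.
    # A starts at zero, so the adapter contributes nothing until TTT updates it.
    def __init__(self, base: torch.nn.Linear, rank: int = 96):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.B = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + torch.nn.functional.linear(x, self.A @ self.B)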

Compliance

| Cap | Limit | Observed | Margin |
| --- | --- | --- | --- |
| Artifact (decimal) | 16,000,000 bytes | 15,947,664 (max of 3 seeds) | 52,336 bytes |
| train + GPTQ | 600 s | 584.1 s + 15.6 s ≈ 599.7 s | ~0.3 s |
| total_eval_time | 600 s | 482.6 s / 485.6 s / 587.7 s | 12–118 s |

Dataset

This submission uses the pre-built case-op augmented FineWeb-10B tokenization from
romeerp/parameter-golf-caseops-v1
(pre-built shards), the same dataset that PR #1729 / PR #1736 / PR #1851 use.
The bijective case-op tokenizer (fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model,
shipped in tokenizers/) and the build script (prepare_caseops_data.py +
lossless_caps.py) are included for byte-exact rebuild, but using the
pre-built shards from romeerp/parameter-golf-caseops-v1 is the recommended
path
.

Reproducing

# Option A (recommended): use pre-built shards from HF.
huggingface-cli download romeerp/parameter-golf-caseops-v1 \
  --repo-type dataset \
  --local-dir ./data/datasets/fineweb10B_sp8192_caseops/

# Option B: rebuild locally with the shipped scripts (prepare_caseops_data.py + lossless_caps.py).

# Either way, the script expects shards at
# ./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/
# (the path layout is preserved across both options).

export RUN_ID=repro_seed42
export SEED=42
torchrun --nproc_per_node=8 --standalone train_gpt.py

Hyperparameters defaults already encode this PR's compliance-tuned envelope (this PR + b-series, on top of PR #1938); no other env exports are needed.

Builds On

| Layer | Origin |
| --- | --- |
| PR #1938 (@lijuncheng16 & @TimS-ml — S0/PR1851 + Cap Tokenizer + LQER + Global TTT, val_bpb=1.0713) | base submission stack |
| PR #1867 (@lijuncheng16 & @TimS-ml) | training script |
| PR #1851 (@aquariouseworkman — SmearGate BOS fix + LQER asymmetric + phased TTT) | architecture / quantization |
| PR #1797 (@dexhunter, audit by @cocohearts) | SmearGate, LQER asym |
| PR #1787 (@nprime06) | SparseAttnGate, FusedCE, MIN_LR |
| PR #1729 / PR #1736 (@romeerp) | CaseOps tokenizer + phased TTT |
| PR #1394 (@clarkkev) | GPTQ + SDClip + SP8192 |
| PR #549 (@abaybektursun) | Score-first TTT framework |

Acknowledgments

A joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16).

With thanks to:

  • Prof. Lin Hao (Fordham University) — for sponsoring the 8×H100 SXM and 4×RTX 4090 compute used to produce all sweep, training, and microbench results in this record.
  • Xingyuan Ding — for experiments and A100 support.
  • Bill (Yiyuan) Li — for meaningful discussions on tokenizers.
  • Lijun Yu (@Lijun-Yu) — for his invaluable insights.
  • Hang Zhou (@greyjoeyzhou) — for project discussions and for the concurrent auto-research agent infrastructure.

Additional credits (technique stack):

TimS-ml and others added 3 commits April 29, 2026 17:28
…enai#1938 (val_bpb=1.06242)

3-seed sweep on seeds 1334, 42, 999 of the v2b training script from PR openai#1867,
extending Billy Li's PR openai#1938 stack with two algorithmically free wins:

- LeakyReLU squared slope 0.5 -> 0.3 (Stage 4 ablation: -0.00073 BPB,
  size-neutral, wallclock-neutral; 4-point sweep confirms 0.3 is the minimum).
- GPTQ Hinv path: cholesky_inverse + chol(upper) -> reverse Cholesky +
  triangular solve (Stage 7 ablation: mathematically equivalent within fp32
  ULP, 2.07-2.24x faster on RTX 4090 cuSOLVER microbench at the GPTQ workload
  range n=512..4096).

Plus compliance-tuned defaults baked into train_gpt.py's Hyperparameters:
LQER_TOP_K=1, GATED_ATTN_QUANT_GATE=1, TTT_BATCH_SIZE=16,
PHASED_TTT_NUM_PHASES=3, GPTQ_RESERVE_SECONDS=16.

Result: val_bpb (3-seed mean) = 1.06242, sigma ~ 0.00013, ~15.95 MB artifact.
Delta vs current SOTA (PR openai#1493, 1.0810): -0.0186 BPB, well past the 0.005-nat
significance threshold.

Joint effort by Tim Shen (@TimS-ml) and Billy Li (@lijuncheng16). Compute
sponsored by Prof. Hao Lin (Fordham University). Concurrent ablation
infrastructure by Hang Zhou (@greyjoeyzhou).
aerosta pushed a commit to aerosta/parameter-golf that referenced this pull request Apr 30, 2026
3-seed mean val_bpb = 1.05851479 (std 0.000762, seeds 42/0/1234) on track_10min_16mb.

Stack:
- PR openai#1945 (alertcat) V21 base = PR openai#1908 + AWQ-Lite + AsymLogit Rescale
- PR openai#1953 (andrewbaggio1) TTT/QK env knobs (TTT_LR=0.75, QK_GAIN=5.25, no_qv mask)
- PR openai#1948 (TimS-ml + lijuncheng16) LeakyReLU squared slope 0.3
- PR openai#1145 (AnirudhRahul, valerio-endorsed) closed-form n-gram tilt with Σ P=1 Z renormalization

Compliance: causal hints, single-pass, Σ P=1 by construction, no SLOT,
no n-gram cache, no Pre-Quant TTT.

System deps: gcc + lrzip auto-installed by setup.sh; PyTorch 2.9.1 + Triton + Flash Attn 3.

One-command reproduction:
  bash setup.sh
  SEED={42,0,1234} bash run.sh
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request Apr 30, 2026
… breakthrough

NULL/NEUTRAL RESULTS (within ±0.0005 noise):
- S37 GPTQ_BATCHES=32: 1.05884 (null)
- S38 TTT_BETA2=0.995: 1.05884 (null)
- S44 GLOBAL_TTT_LR=0.01: 1.05913 (within noise)
- S46 GLOBAL_TTT_EPOCHS=2: 1.05902 (null)

NEGATIVE RESULTS:
- S36 lzma compressor: rejected
- S36v2 LQER_TOP_K=2: 1.05912
- S41 openai#1965 bundle: 1.05916
- S42 LQER 8/5 + EMA 0.997: 1.05912 (EMA contaminated)
- S43 LQER 8/5 isolated: 1.05925
- S52 LeakyReLU 0.3: 1.05977 (PR openai#1948 doesn't transfer to PR openai#1797)
- S53 WARMDOWN_FRAC=0.95 + MIN_LR=0.05: 1.05950 (best pre-quant 1.06061 but bigger quant tax)

INFRASTRUCTURE FIXES:
- S39 lrzip -k flag bug, S40 SSH disconnect, S45 NCCL crash
- S47/S49/S51 LeakyReLU integration bugs

BREAKTHROUGH:
- S54 n-gram tilt port from PR openai#1145/openai#1967: 1.05692 single seed (seed 314)
  - Pre-quant: 1.06057, Quantized: 1.06917, Final: 1.05692
  - Eval: 503.4s under 600s cap, Size: 15,944,666 bytes under 16MB cap
  - Hint precompute outside timer: 173s (legal path)
  - Mode B with fused_log_softmax_dual_gather kernel
  - Hints fired on 13M of 47M tokens (27%)
  - Delta from current-env baseline: -0.00208 BPB

Validating seeds 42, 1234 next.
