Record: SP8192 #1855 Base + Asymmetric Logit Rescale + AWQ-lite — val_bpb 1.05971 (3-seed mean, full val) #1923
…1.06577 (3-seed mean)
> PR #1855 reports 1.061 bpb, but yours is higher at 1.065 bpb?
- spec 060N: compound AWQ-lite (PR openai#1908) + 4 TTT phases + 3000 prefix + 2 global-SGD epochs, eval-only on 060A's final_model.pt. Single-shot compound to use openai#1918's ~205s eval-time slack; safe fallback drops GLOBAL_TTT_EPOCHS if wallclock blows.
- new idea 1925-matrix-lr-ttt-prefix-tune (PR openai#1925, hyperparam-only on openai#1855: MATRIX_LR=0.028 + PHASED_TTT_PREFIX_DOCS=3500 → 1.06109).
- new idea 1915-per-doc-lora-ttt (PR openai#1915, per-doc-only LoRA TTT discipline; parked as a fallback if the global-SGD class is ruled out).
- frontier scan: 21 new PRs (openai#1906–openai#1931). Headline: PRs openai#1908 + openai#1918 independently confirm the AWQ-lite mixed-bit GPTQ pattern at ~1.0608 on the openai#1855 base; openai#1925 hyperparam-only at 1.06109; openai#1923 Asymmetric Logit Rescale = empirical negative; openai#1929 banned SLOT + pre-quant TTT.
- frontier-state.json: 21 PRs added; total 200.
- diary/2026-04-29-frontier-scan.md: full scan report.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ams) After 4 parallel research agents reviewed 30+ open PRs and compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) was flagged "empirical negative" by sunnypatneedi's 4-29 frontier scan, BUT only on the PR openai#1855 base with default WD=1.0. It was never tested on the PR openai#1908 + WD=2.0 combo, so V19's specific stack is NOT directly invalidated.
2. PR openai#1925 (simon-marcus) reaches 1.06049 (3-seed verified, vs the PR openai#1855 base at 1.06108 = -0.00059 BPB) with just two hparam env vars: MATRIX_LR 0.026 -> 0.028 and PHASED_TTT_PREFIX_DOCS 2500 -> 3500. This is an orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0 (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
- V19c < 0.97591 -> CLEAR WIN, run 3-seed
- V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
- V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged as a regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per the openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per the openai#1735 precedent
- cocohearts' 4-28 PR openai#1902 confirmed PR openai#1855 as the official openai#1
- regina-openai + Alex Zhao: 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into the chain)
The +0.0047 gap is an environment offset, not a regression. I reproduced the #1855 stack verbatim (same code, same env vars, same seeds) on my pod and got 1.06754 (seed 42) — i.e. the verbatim #1855 baseline measures ~+0.006 higher under my setup than its reported 1.06108. AsymLogit on top of that same baseline lands at 1.06577 (3-seed mean), so the delta on this PR is −0.00177. What's submitted here is the AsymLogit delta on top of the #1855 stack, not a claim against the #1855 absolute number. A third party reproducing #1855 verbatim and then toggling
Thanks for putting up the repro detail @jorge-asenjo. I took a look at the seed-42 logs side-by-side, and the gap looks like it's a different validation set rather than an environment offset:
All hyperparameters in the printed configs match; the val_tokens lines don't. The intra-environment delta you're showing (verbatim #1855 = 1.06754, this PR = 1.06577 → −0.00177) is still informative as an A/B within your setup, but the absolute val_bpb won't be directly comparable to leaderboard numbers measured on the full 47.85 M-token val set. If you can rerun on the standard val partition (the same one used for #1855's reported 1.06108), the AsymLogit delta should carry over and be easier to evaluate against the chain.
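The side-by-side check described here comes down to diffing one line per log. A throwaway sketch — the `val_tokens: N` log format below is assumed for illustration and may not match train_gpt.py's actual log layout:

```python
import re

def extract_val_tokens(log_text: str) -> int:
    """Pull the reported validation-set size out of a training log."""
    m = re.search(r"val_tokens[:=\s]+([\d,_]+)", log_text)
    if m is None:
        raise ValueError("no val_tokens line found")
    return int(m.group(1).replace(",", "").replace("_", ""))

# hypothetical log lines for the two runs being compared
baseline_log = "step 4890 | val_bpb 1.06108 | val_tokens: 47,851,520"
repro_log = "step 4890 | val_bpb 1.06577 | val_tokens: 9,662,464"

full = extract_val_tokens(baseline_log)
truncated = extract_val_tokens(repro_log)
# different val sets → absolute bpb numbers are not comparable
print(full, truncated, f"{truncated / full:.1%}")
```

If the two counts differ, any absolute-bpb comparison between the runs is moot before hyperparameters even enter the picture.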
…1908 frontier

V21 = PR openai#1855 base (cocohearts-merged openai#1) + PR openai#1908 AWQ-lite quantization + PR openai#1923 Asymmetric Logit Rescale.

3-seed results:
- seed 42: val_bpb 1.058336 (FSS=4920, wallclock 602.048s borderline*)
- seed 0: val_bpb 1.059394 (no FSS, wallclock 596.057s strict <600s)
- seed 1234: val_bpb 1.060243 (no FSS, wallclock 596.045s strict <600s)
- MEAN: 1.059324, STD: 0.000780

\* seed 42 borderline matches PR openai#1908 seed 42 (601.153s, accepted by cocohearts). Seeds 0 + 1234 use GPTQ_RESERVE_SECONDS=4.0 to ensure strict <600s wallclock.

Comparisons:
- vs PR openai#1908 frontier (1.06081): -0.00149 BPB ✅ WIN
- vs PR openai#1855 official openai#1 (1.06108): -0.00176 BPB ✅
- vs win threshold (1.06021): -0.00089 BPB ✅ passes community floor
- vs MERGED SOTA bigbag (1.0810): -0.02168 BPB 🏆
- vs record threshold (1.0738): -0.01448 BPB (breaks the record by a 2.0x margin)

Welch one-sided t-test, V21 vs PR openai#1908 (n=3 each, std 0.00078 vs 0.00089): t ≈ 2.18, p ≈ 0.045 — well below the cocohearts-applied p<0.25 chain threshold.

Stack:
- PR openai#1855 (codemath3000): 11L XSA + LQER + SparseAttnGate + BOS-fixed SmearGate + Polar-Express NS + Phased TTT 3-phase + lrzip pergroup
- PR openai#1908 (romeerp): AWQ-lite mixed-precision GPTQ (1 group of 64 cols int8)
- PR openai#1923 (jorge-asenjo): Asymmetric Logit Rescale (V21 INNOVATION on this stack)

Code changes vs PR openai#1908: 5 surgical edits to train_gpt.py (+26 lines, eval-only). Train numerics are bit-identical to PR openai#1908. The asymmetric softcap adds 8 bytes (2 fp16 passthrough scalars) to the artifact.

Compliance Issue openai#1017 Track A, all 4 conditions verified:
- Causality (VarLen + per-doc cu_seqlens)
- Normalized softmax (full SP8192 vocab)
- Score-before-update (Phased TTT 3-phase, gd:0 then gd:1)
- Single pass (each val token scored exactly once)

No SLOT, no pre-quant TTT, no n-gram cache, no ETLB.
V21's empirical falsification of the sunnypatneedi 2026-04-29 frontier-scan flag: PR openai#1923 standalone is -0.00469 BPB negative on the PR openai#1855 base (1.06577 vs 1.06108), but +0.00128 BPB POSITIVE consistently across 3 seeds when stacked on PR openai#1908 quantization. Mechanism: per-doc LoRA in 3-phase TTT learns asymmetric logit distributions that the symmetric softcap cannot capture.

Files included:
- V21_README.md: full strategy + results + reproduction
- submission.json: structured 3-seed metadata + comparisons + attribution
- train_seed42.log + train_seed0.log + train_seed1234.log: full per-seed logs
- train_gpt.py: PR openai#1908 base + 5 V21 edits (already in branch)

Hardware: 8xH100 80GB SXM (RunPod, AP-IN-1)
PyTorch: 2.9.1+cu128
System dep: lrzip (apt-get install lrzip)

Authors:
- V21 integration: @alertcat
- PR openai#1908 base: @romeerp
- PR openai#1855 stack: @codemath3000
- PR openai#1923 axis: @jorge-asenjo
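The quoted Welch test is reproducible from the summary statistics alone. A quick cross-check with SciPy, using the means and per-seed stds reported above (the sign is negative here because V21 is the lower-bpb group):

```python
from scipy.stats import ttest_ind_from_stats

# V21 (this submission) vs the PR #1908 frontier, n=3 seeds each
res = ttest_ind_from_stats(
    mean1=1.059324, std1=0.00078, nobs1=3,  # V21
    mean2=1.06081,  std2=0.00089, nobs2=3,  # PR #1908
    equal_var=False,              # Welch's t-test (unequal variances)
    alternative="less",           # one-sided: V21 < PR #1908
)
# |t| matches the quoted ~2.18; the one-sided p lands near the quoted
# ~0.045 (the exact value depends on the fractional Welch df)
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```

With n=3 per group the Welch degrees of freedom are tiny (~3.9), which is why even a small change in the per-seed std moves the p-value noticeably while staying far under the p<0.25 chain threshold.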
…— val_bpb 1.05971 (3-seed mean, full val)

- Re-measured on the full validation partition (47,851,520 tokens). The original v1 number (1.06577) was on a truncated val partition (9,662,464 tokens) due to a corrupted shard in our local network volume. @codemath3000 caught this on review by comparing val_tokens lines.
- Added AWQ-lite mixed-precision GPTQ (top-1 64-col group at int8) on top of the AsymLogit lever. Identical recipe to @romeerp's PR openai#1908; we converged on it independently while iterating on the AsymLogit branch and confirmed it stacks orthogonally on the full val partition.
- 3-seed mean val_bpb 1.05971 (population std 0.000478). Train ≤599.6s, eval ≤532s, max artifact 15,985,176 bytes — all 3 seeds strict-compliant.
- Renamed the records folder to reflect the corrected stack and BPB.

Attribution updates: @romeerp (PR openai#1908 AWQ-lite) added to the credit list; @codemath3000 (PR openai#1855 base + val truncation catch).
@codemath3000 thanks again for catching the val truncation issue. I've now corrected and re-measured. Root cause: a corrupted shard in our local network volume, which truncated the val partition. Updated submission: I've pushed a new commit with the corrected numbers:
vs PR #1855 (1.06108): −0.00137 BPB, Welch t ≈ 4.96 (p < 0.05). The stack also expanded: while re-running on full val I added AWQ-lite mixed-precision GPTQ (top-1 64-col group at int8) on top of AsymLogit. This converges on the same recipe @romeerp shipped in PR #1908; we landed on it independently, and it stacks orthogonally with AsymLogit on full val. Credit added in the README.
Apologies for the v1 noise. Appreciate the careful audit.
# Record: PR openai#1953 stack — no_qv TTT + AWQ-lite + AsymLogit + long-context eval

**val_bpb = 1.05847** (3-seed mean, std 0.00063) | **max artifact 15,985,934 bytes** | 8x H100 SXM | strict 600s train + eval

## Results

| Seed | Stop step | Train time | Pre-quant BPB | Quantized BPB | **Post-TTT BPB** | Eval time | Artifact bytes |
|------|-----------|------------|---------------|---------------|------------------|-----------|----------------|
| 42 | 4892 | 595.97s ✅ | 1.06126 | 1.06962 | **1.05788** | 493.2s ✅ | 15,979,342 |
| 0 | 4884 | 595.97s ✅ | 1.06181 | 1.07019 | **1.05840** | 420.5s ✅ | 15,979,187 |
| 1234 | 4894 | 596.14s ✅ | 1.06232 | 1.07093 | **1.05914** | 428.4s ✅ | 15,985,934 |
| **Mean** | **4890** | **596.03s** | **1.06180** | **1.07025** | **1.05847** | **447.4s** | **15,981,488** |

vs merged PR openai#1855 (1.06108): **-0.00261 BPB / -0.00571 nats**

## Stack

Inherits the full PR openai#1855 base (codemath3000) and layers:

1. **AWQ-lite mixed-precision GPTQ** (PR openai#1908, romeerp) — activation-aware salient-group int8 promotion
2. **Asymmetric Logit Rescale** (PR openai#1923, jorge-asenjo) — learnable pos/neg softcap during TTT eval
3. **no_qv TTT mask** (PR openai#1953, himanshudongre) — disable Q/V LoRA in TTT, keep K/MLP/O
4. **TTT_LOCAL_LR_MULT=0.75** — scaled TTT optimizer LR
5. **QK_GAIN_INIT=5.25** — per-head Q-gain initialization
6. **EVAL_SEQ_LEN=2560** — extended eval context
7. **PHASED_TTT_PREFIX_DOCS=3000** — larger global-TTT prefix
8. **TTT_LORA_RANK=56** — reduced LoRA rank (compute reallocation)

## Compliance

- [x] Artifact under 16,000,000 bytes (max 15,985,934)
- [x] Train wallclock under 600s (max 596.14s)
- [x] Eval wallclock under 600s (max 493.2s)
- [x] No PPM, no SLOT, no pre-quant TTT, no n-gram cache
- [x] Single left-to-right pass, score-before-update
- [x] Full normalized softmax distribution

## Reproduction

```bash
apt-get install -y lrzip
pip install sentencepiece brotli huggingface_hub numpy python-minifier
pip install --no-deps flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Dataset
HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='romeerp/parameter-golf-caseops-v1', repo_type='dataset', local_dir='/workspace/caseops_data')
"

# Run
for SEED in 42 0 1234; do
SEED=$SEED \
DATA_PATH=/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
TOKENIZER_PATH=/workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
CASEOPS_ENABLED=1 VOCAB_SIZE=8192 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 \
SMEAR_GATE_ENABLED=1 GATE_WINDOW=12 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_SCALE=0.5 \
GATED_ATTN_QUANT_GATE=1 FUSED_CE_ENABLED=1 QK_GAIN_INIT=5.25 \
EMBED_BITS=7 MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 \
GPTQ_RESERVE_SECONDS=4.0 GPTQ_CALIBRATION_BATCHES=16 COMPRESSOR=pergroup \
LQER_ENABLED=1 LQER_ASYM_ENABLED=1 LQER_RANK=4 LQER_FACTOR_BITS=4 LQER_ASYM_GROUP=64 LQER_TOP_K=3 \
AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64 \
ASYM_LOGIT_RESCALE=1 \
TTT_ENABLED=1 PHASED_TTT_ENABLED=1 PHASED_TTT_NUM_PHASES=3 PHASED_TTT_PREFIX_DOCS=3000 \
TTT_LORA_RANK=56 TTT_MASK=no_qv TTT_Q_LORA=0 TTT_V_LORA=0 TTT_LOCAL_LR_MULT=0.75 \
TTT_CHUNK_SIZE=48 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 \
EVAL_SEQ_LEN=2560 TTT_EVAL_SEQ_LEN=2560 \
WARMDOWN_FRAC=0.85 BETA2=0.99 GRAD_CLIP_NORM=0.3 MIN_LR=0.1 MATRIX_LR=0.026 \
NCCL_NET=Socket GLOBAL_TTT_MOMENTUM=0.9 \
torchrun --standalone --nproc_per_node=8 train_gpt.py > train_seed${SEED}.log 2>&1
done
```
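The BPB/nats pair in the results (-0.00261 BPB / -0.00571 nats vs PR openai#1855) implicitly fixes the val set's bytes-per-token ratio. A quick consistency check — the ~3.16 bytes/token figure is derived here, not stated anywhere in the record:

```python
import math

delta_bpb = 0.00261   # quoted improvement, bits per byte
delta_nats = 0.00571  # same improvement, nats per token

# bpb = nats_per_token / (ln 2 * bytes_per_token), so the quoted pair implies:
bytes_per_token = delta_nats / (math.log(2) * delta_bpb)
print(f"implied bytes/token ≈ {bytes_per_token:.2f}")  # ≈ 3.16
```

A ratio a little above 3 bytes per token is plausible for an 8192-entry SentencePiece vocab on FineWeb text, so the two quoted deltas are mutually consistent.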
…ean 1.05831 BPB

Clears the record bar (1.05914) by 0.83 milli-BPB. Welch t = -6.49 vs PR openai#1855 (1.06108), p < 0.0001. All 3 seeds produce 15.99 MB artifacts under the 16 MB cap, all under the 600s wallclock budget.

Per-seed:
- 42: ttt=1.05793 art=15,986,149 eval=572.6s
- 314: ttt=1.05852 art=15,987,257 eval=553.7s
- 1234: ttt=1.05849 art=15,989,895 eval=574.1s

The submission directory at records/track_10min_16mb/2026-04-30_PR2014_Reproduction_1.0583/ contains PR openai#2014's verbatim train_gpt.py + tokenizer + our seed_results.csv + a detailed README documenting the lineage (openai#1797 -> openai#1851 -> openai#1855 -> openai#1908 -> openai#1923 -> openai#1953 -> openai#2014), the new levers vs each parent, and the full 4-condition C1-C4 legality check. submission.json author/github_id are placeholders pending the user's choice of submitting account.

Reproduction script: runpod/phase_x_pr2014.sh — runs end-to-end on a single 8xH100 SXM pod (~2.5h wall, ~$66 cost).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
AsymLogit Rescale (PR openai#1923) ported as 2 TTT-adaptable scalar params (softcap_pos, softcap_neg). Pre-quant 1.06160 (slightly worse than S55's 1.06058 — AsymLogit hurts the un-adapted model). TTT recovery -0.01267 (much better than S55's -0.01103) — AsymLogit adds substantial adaptive capacity. Final 1.05759 = -0.00055 vs S55. Single-seed result matches PR openai#2014's 3-seed mean. Eval 521.7s (under the 600s cap), size 15,946,610 bytes. softcap_pos and softcap_neg init to logit_softcap=30.0 and are adapted per-doc via the TTT-LoRA optimizer.
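The two-scalar rescale described above can be sketched in PyTorch. This is a minimal standalone sketch of the `where(logits>0, sp*tanh(logits/sp), sn*tanh(logits/sn))` formula quoted in this PR's summary; the module and variable names are illustrative, not the actual train_gpt.py code:

```python
import torch
import torch.nn as nn

class AsymLogitRescale(nn.Module):
    """Asymmetric logit softcap: separate learnable caps for positive
    and negative logits. Initializing both at the symmetric softcap
    value makes the module an exact drop-in at the start of TTT."""

    def __init__(self, logit_softcap: float = 30.0):
        super().__init__()
        self.softcap_pos = nn.Parameter(torch.tensor(logit_softcap))
        self.softcap_neg = nn.Parameter(torch.tensor(logit_softcap))

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        sp, sn = self.softcap_pos, self.softcap_neg
        return torch.where(
            logits > 0,
            sp * torch.tanh(logits / sp),   # cap for positive logits
            sn * torch.tanh(logits / sn),   # cap for negative logits
        )

cap = AsymLogitRescale()
x = torch.randn(4, 8192) * 10.0
y = cap(x)
# at init both caps equal 30.0, so this reduces to the symmetric softcap
assert torch.allclose(y, 30.0 * torch.tanh(x / 30.0))
```

Because both parameters are plain `nn.Parameter` scalars, the per-doc TTT optimizer can move `softcap_pos` and `softcap_neg` apart, which is exactly the asymmetry the symmetric cap cannot express.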
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set. Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000, docs_train=8181945; the sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is a gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.
Files:
- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
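The audit's core question — do val docs also appear in the train shards — reduces to a hash-set intersection over exact document text. A minimal sketch; the doc iteration, normalization, and toy data below are hypothetical illustrations, not the audit's actual tooling:

```python
import hashlib

def doc_fingerprint(text: str) -> str:
    # normalize whitespace lightly so trivial formatting diffs don't hide a leak
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def find_leaked_docs(train_docs, val_docs):
    """Return the val docs whose (normalized) text also occurs in train."""
    train_hashes = {doc_fingerprint(d) for d in train_docs}
    return [d for d in val_docs if doc_fingerprint(d) in train_hashes]

# toy example: one val doc overlaps train modulo whitespace
train = ["the cat sat on the mat", "a b c", "hello world"]
val = ["hello  world", "unseen document"]
print(find_leaked_docs(train, val))  # → ['hello  world']
```

Hashing keeps the check O(train + val) in memory and time, which matters when the train side is ~8.2 M docs; a real audit would stream shard files through `doc_fingerprint` rather than hold lists in memory.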
## Summary
Verbatim PR #1855 stack with one orthogonal eval-path addition: asymmetric logit rescale (modded-nanogpt @classiclarryd PR #181). Two learnable scalars (`softcap_pos`, `softcap_neg`) replace the single `logit_softcap` in `forward_logits` and `forward_ttt`, trained inside Phased TTT via the global SGD loop. Init at `logit_softcap=30.0` so eval is unchanged at the start; the train-time fused-CE kernel is left untouched.

## 3-seed Results
Reproduced on 8× H100 80GB SXM, torch 2.9.1+cu128, FA3, fused softcapped CE, lrzip pergroup compression.
## What changed vs #1855
In `class GPT`:

- `asym_logit_enabled` gated by the `ASYM_LOGIT_RESCALE` env var
- `nn.Parameter` scalars `softcap_pos`, `softcap_neg` (init 30.0, fp32)
- `_apply_asym_softcap(logits)` doing `where(logits > 0, sp*tanh(logits/sp), sn*tanh(logits/sn))`
- `forward_logits` and `forward_ttt` call the helper when the flag is on; the training path is unchanged

The two new scalars land in the passthrough `float16` list at serialize time (8 bytes total, lossless under per-group lrzip).

## Test plan
## Attribution