Record: 1.05847 no_qv TTT + AWQ-lite + AsymLogit + long-context eval (3-seed)#2019
Open
aquariouseworkman wants to merge 1 commit into openai:main from
Conversation
# Record: PR openai#1953 stack — no_qv TTT + AWQ-lite + AsymLogit + long-context eval

**val_bpb = 1.05847** (3-seed mean, std 0.00063) | **max artifact 15,985,934 bytes** | 8x H100 SXM | strict 600s train + eval

## Results

| Seed | Stop step | Train time | Pre-quant BPB | Quantized BPB | **Post-TTT BPB** | Eval time | Artifact bytes |
|------|-----------|------------|---------------|---------------|------------------|-----------|----------------|
| 42 | 4892 | 595.97s ✅ | 1.06126 | 1.06962 | **1.05788** | 493.2s ✅ | 15,979,342 |
| 0 | 4884 | 595.97s ✅ | 1.06181 | 1.07019 | **1.05840** | 420.5s ✅ | 15,979,187 |
| 1234 | 4894 | 596.14s ✅ | 1.06232 | 1.07093 | **1.05914** | 428.4s ✅ | 15,985,934 |
| **Mean** | **4890** | **596.03s** | **1.06180** | **1.07025** | **1.05847** | **447.4s** | **15,981,488** |

vs merged PR openai#1855 (1.06108): **-0.00261 BPB / -0.00571 nats**

## Stack

Inherits the full PR openai#1855 base (codemath3000) and layers:

1. **AWQ-lite mixed-precision GPTQ** (PR openai#1908, romeerp) — activation-aware salient-group int8 promotion
2. **Asymmetric Logit Rescale** (PR openai#1923, jorge-asenjo) — learnable pos/neg softcap during TTT eval
3. **no_qv TTT mask** (PR openai#1953, himanshudongre) — disable Q/V LoRA in TTT, keep K/MLP/O
4. **TTT_LOCAL_LR_MULT=0.75** — scaled TTT optimizer LR
5. **QK_GAIN_INIT=5.25** — per-head Q-gain initialization
6. **EVAL_SEQ_LEN=2560** — extended eval context
7. **PHASED_TTT_PREFIX_DOCS=3000** — larger global-TTT prefix
8. **TTT_LORA_RANK=56** — reduced LoRA rank (compute reallocation)

## Compliance

- [x] Artifact under 16,000,000 bytes (max 15,985,934)
- [x] Train wallclock under 600s (max 596.14s)
- [x] Eval wallclock under 600s (max 493.2s)
- [x] No PPM, no SLOT, no pre-quant TTT, no n-gram cache
- [x] Single left-to-right pass, score-before-update
- [x] Full normalized softmax distribution

## Reproduction

```bash
apt-get install -y lrzip
pip install sentencepiece brotli huggingface_hub numpy python-minifier
pip install --no-deps flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Dataset
HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='romeerp/parameter-golf-caseops-v1', repo_type='dataset',
                  local_dir='/workspace/caseops_data')
"

# Run
for SEED in 42 0 1234; do
  SEED=$SEED \
  DATA_PATH=/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
  TOKENIZER_PATH=/workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  CASEOPS_ENABLED=1 VOCAB_SIZE=8192 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 \
  SMEAR_GATE_ENABLED=1 GATE_WINDOW=12 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_SCALE=0.5 \
  GATED_ATTN_QUANT_GATE=1 FUSED_CE_ENABLED=1 QK_GAIN_INIT=5.25 \
  EMBED_BITS=7 MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 \
  GPTQ_RESERVE_SECONDS=4.0 GPTQ_CALIBRATION_BATCHES=16 COMPRESSOR=pergroup \
  LQER_ENABLED=1 LQER_ASYM_ENABLED=1 LQER_RANK=4 LQER_FACTOR_BITS=4 LQER_ASYM_GROUP=64 LQER_TOP_K=3 \
  AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64 \
  ASYM_LOGIT_RESCALE=1 \
  TTT_ENABLED=1 PHASED_TTT_ENABLED=1 PHASED_TTT_NUM_PHASES=3 PHASED_TTT_PREFIX_DOCS=3000 \
  TTT_LORA_RANK=56 TTT_MASK=no_qv TTT_Q_LORA=0 TTT_V_LORA=0 TTT_LOCAL_LR_MULT=0.75 \
  TTT_CHUNK_SIZE=48 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 \
  EVAL_SEQ_LEN=2560 TTT_EVAL_SEQ_LEN=2560 \
  WARMDOWN_FRAC=0.85 BETA2=0.99 GRAD_CLIP_NORM=0.3 MIN_LR=0.1 MATRIX_LR=0.026 \
  NCCL_NET=Socket GLOBAL_TTT_MOMENTUM=0.9 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py > train_seed${SEED}.log 2>&1
done
```
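As a sanity check on the headline numbers, the 3-seed statistics reduce to a few lines (a quick sketch: the per-seed Post-TTT BPB values and the openai#1855 baseline are taken from the tables above, and the reported std is the sample standard deviation):

```python
import statistics

# Post-TTT BPB per seed, from the Results table (seeds 42, 0, 1234)
post_ttt_bpb = [1.05788, 1.05840, 1.05914]

mean_bpb = statistics.mean(post_ttt_bpb)   # 3-seed mean
std_bpb = statistics.stdev(post_ttt_bpb)   # sample std (ddof=1)
print(f"mean={mean_bpb:.5f} std={std_bpb:.5f}")  # mean=1.05847 std=0.00063

# Delta vs the merged PR openai#1855 baseline
baseline = 1.06108
print(f"delta={mean_bpb - baseline:+.5f} BPB")   # delta=-0.00261 BPB
```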
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000 and docs_train=8181945; the sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is a gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.

Files:
- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
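The "--val-docs" search mentioned in the evidence list could look something like the following sketch (illustrative only; the function name is hypothetical and the actual audit tooling is not shipped with this commit — it simply scans shell scripts for `prepare_caseops_data.py` invocations and records any explicit `--val-docs` override):

```python
import re
from pathlib import Path

def find_prep_invocations(repo_root: str):
    """Scan all .sh files under repo_root for prepare_caseops_data.py
    invocations; record the --val-docs override, or None for the default."""
    hits = []
    for sh in sorted(Path(repo_root).rglob("*.sh")):
        for line in sh.read_text(errors="replace").splitlines():
            if "prepare_caseops_data.py" in line:
                m = re.search(r"--val-docs[= ](\d+)", line)
                hits.append((str(sh), line.strip(),
                             int(m.group(1)) if m else None))
    return hits
```

Under the audit's criterion, an invocation reported with `None` falls back to the script's 10,000-doc default; only an explicit `--val-docs=50000` is consistent with a clean split.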
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 1, 2026
…ence

After user feedback that LEAK calls relied too heavily on lineage-inheritance and path heuristics, applied a stricter criterion: a LEAK verdict requires at least one of
(a) explicit shell-script invocation of prepare_caseops_data.py without --val-docs=50000,
(b) a README "Data setup" matching the actual train-log path,
(c) audit/submission.json admission text,
(d) a train-log path with `_caseops/datasets/datasets/<name>` triple-nesting OR single `<root>/datasets/<name>` (which only local prep produces; HF always gives double-nesting).

Records that previously got LEAK by lineage-inheritance alone are now AMBIGUOUS unless they meet at least one of those tests.

Changes:
- openai#1945 LEAK → CLEAN (finalize_v18.sh has snapshot_download from HF; actual run path matches the HF target; README's prepare_caseops_data.py section is stale documentation)
- openai#1953 LEAK → AMBIGUOUS (PR ships only train_gpt.py + logs; no prep evidence; path matches the HF target; parent openai#1945 confirmed CLEAN — leans CLEAN but no direct PR evidence)
- openai#2041 LEAK → AMBIGUOUS (no prep invocation; double-nested path consistent with EITHER HF or local prep)
- openai#2075 LEAK → AMBIGUOUS (ships the prep file but no explicit invocation; path matches the HF target)

Updated tally: CLEAN 9, LEAK 21, AMBIGUOUS 3, INHERIT 1 (was 8/25/0/1).

Headline impact: the realistic clean SOTA is at most ~0.012 bpb below the claimed frontier openai#2118 (1.04350). Best clean BPB candidates, in order:
- openai#2019 1.05847 (HF, confirmed)
- openai#1953 1.05855 (AMBIGUOUS, leans CLEAN)
- openai#1945 1.05943 (HF, confirmed via re-audit)
- openai#2031 1.05985 (HF, confirmed)
- openai#1908 1.06081 (HF, confirmed)
- openai#1851 1.06128 (HF, MERGED SOTA)
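The path test in criterion (d) above can be sketched as a small classifier (an illustrative reading of the commit message, not shipped tooling; the function name and labels are hypothetical — the rule is that a `_caseops/datasets/datasets/<name>` or bare `<root>/datasets/<name>` layout only comes from local prep, while the HF snapshot lands in a plain double-nested `datasets/datasets/<name>` layout):

```python
def classify_data_path(path: str) -> str:
    """Criterion (d) sketch: classify a train-log DATA_PATH by nesting."""
    if "_caseops/datasets/datasets/" in path:
        return "local-prep"     # triple-nested: only local prep produces this
    if "/datasets/datasets/" in path:
        return "HF-consistent"  # double-nested: matches snapshot_download layout
    if "/datasets/" in path:
        return "local-prep"     # single-nested: only local prep produces this
    return "unknown"

# The DATA_PATH from this PR's reproduction script is double-nested:
print(classify_data_path(
    "/workspace/caseops_data/datasets/datasets/"
    "fineweb10B_sp8192_lossless_caps_caseops_v1_reserved"
))  # HF-consistent
```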