Record: 1.05847 no_qv TTT + AWQ-lite + AsymLogit + long-context eval (3-seed)#2019
Open
aquariouseworkman wants to merge 1 commit into openai:main from
Conversation
# Record: PR openai#1953 stack — no_qv TTT + AWQ-lite + AsymLogit + long-context eval

**val_bpb = 1.05847** (3-seed mean, std 0.00063) | **max artifact 15,985,934 bytes** | 8x H100 SXM | strict 600s train + eval

## Results

| Seed | Stop step | Train time | Pre-quant BPB | Quantized BPB | **Post-TTT BPB** | Eval time | Artifact bytes |
|------|-----------|------------|---------------|---------------|------------------|-----------|----------------|
| 42 | 4892 | 595.97s ✅ | 1.06126 | 1.06962 | **1.05788** | 493.2s ✅ | 15,979,342 |
| 0 | 4884 | 595.97s ✅ | 1.06181 | 1.07019 | **1.05840** | 420.5s ✅ | 15,979,187 |
| 1234 | 4894 | 596.14s ✅ | 1.06232 | 1.07093 | **1.05914** | 428.4s ✅ | 15,985,934 |
| **Mean** | **4890** | **596.03s** | **1.06180** | **1.07025** | **1.05847** | **447.4s** | **15,981,488** |

vs merged PR openai#1855 (1.06108): **-0.00261 BPB / -0.00571 nats**

## Stack

Inherits the full PR openai#1855 base (codemath3000) and layers:

1. **AWQ-lite mixed-precision GPTQ** (PR openai#1908, romeerp) — activation-aware salient-group int8 promotion
2. **Asymmetric Logit Rescale** (PR openai#1923, jorge-asenjo) — learnable pos/neg softcap during TTT eval
3. **no_qv TTT mask** (PR openai#1953, himanshudongre) — disable Q/V LoRA in TTT, keep K/MLP/O
4. **TTT_LOCAL_LR_MULT=0.75** — scaled TTT optimizer LR
5. **QK_GAIN_INIT=5.25** — per-head Q-gain initialization
6. **EVAL_SEQ_LEN=2560** — extended eval context
7. **PHASED_TTT_PREFIX_DOCS=3000** — larger global-TTT prefix
8. **TTT_LORA_RANK=56** — reduced LoRA rank (compute reallocation)

## Compliance

- [x] Artifact under 16,000,000 bytes (max 15,985,934)
- [x] Train wallclock under 600s (max 596.14s)
- [x] Eval wallclock under 600s (max 493.2s)
- [x] No PPM, no SLOT, no pre-quant TTT, no n-gram cache
- [x] Single left-to-right pass, score-before-update
- [x] Full normalized softmax distribution

## Reproduction

```bash
apt-get install -y lrzip
pip install sentencepiece brotli huggingface_hub numpy python-minifier
pip install --no-deps flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Dataset
HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='romeerp/parameter-golf-caseops-v1', repo_type='dataset',
                  local_dir='/workspace/caseops_data')
"

# Run
for SEED in 42 0 1234; do
  SEED=$SEED \
  DATA_PATH=/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
  TOKENIZER_PATH=/workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  CASEOPS_ENABLED=1 VOCAB_SIZE=8192 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 \
  SMEAR_GATE_ENABLED=1 GATE_WINDOW=12 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_SCALE=0.5 \
  GATED_ATTN_QUANT_GATE=1 FUSED_CE_ENABLED=1 QK_GAIN_INIT=5.25 \
  EMBED_BITS=7 MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 \
  GPTQ_RESERVE_SECONDS=4.0 GPTQ_CALIBRATION_BATCHES=16 COMPRESSOR=pergroup \
  LQER_ENABLED=1 LQER_ASYM_ENABLED=1 LQER_RANK=4 LQER_FACTOR_BITS=4 LQER_ASYM_GROUP=64 LQER_TOP_K=3 \
  AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64 \
  ASYM_LOGIT_RESCALE=1 \
  TTT_ENABLED=1 PHASED_TTT_ENABLED=1 PHASED_TTT_NUM_PHASES=3 PHASED_TTT_PREFIX_DOCS=3000 \
  TTT_LORA_RANK=56 TTT_MASK=no_qv TTT_Q_LORA=0 TTT_V_LORA=0 TTT_LOCAL_LR_MULT=0.75 \
  TTT_CHUNK_SIZE=48 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 \
  EVAL_SEQ_LEN=2560 TTT_EVAL_SEQ_LEN=2560 \
  WARMDOWN_FRAC=0.85 BETA2=0.99 GRAD_CLIP_NORM=0.3 MIN_LR=0.1 MATRIX_LR=0.026 \
  NCCL_NET=Socket GLOBAL_TTT_MOMENTUM=0.9 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py > train_seed${SEED}.log 2>&1
done
```
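As a sanity check on the headline numbers, the 3-seed statistics reduce to a few lines (a quick sketch: the per-seed Post-TTT BPB values and the openai#1855 baseline are taken from the tables above, and the reported std is the sample standard deviation):

```python
import statistics

# Post-TTT BPB per seed, from the Results table (seeds 42, 0, 1234)
post_ttt_bpb = [1.05788, 1.05840, 1.05914]

mean_bpb = statistics.mean(post_ttt_bpb)   # 3-seed mean
std_bpb = statistics.stdev(post_ttt_bpb)   # sample std (ddof=1)
print(f"mean={mean_bpb:.5f} std={std_bpb:.5f}")  # mean=1.05847 std=0.00063

# Delta vs the merged PR openai#1855 baseline
baseline = 1.06108
print(f"delta={mean_bpb - baseline:+.5f} BPB")   # delta=-0.00261 BPB
```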
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000 and docs_train=8181945; the sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is a gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.

Files:
- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
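The "--val-docs" search mentioned in the evidence list could look something like the following sketch (illustrative only; the function name is hypothetical and the actual audit tooling is not shipped with this commit — it simply scans shell scripts for `prepare_caseops_data.py` invocations and records any explicit `--val-docs` override):

```python
import re
from pathlib import Path

def find_prep_invocations(repo_root: str):
    """Scan all .sh files under repo_root for prepare_caseops_data.py
    invocations; record the --val-docs override, or None for the default."""
    hits = []
    for sh in sorted(Path(repo_root).rglob("*.sh")):
        for line in sh.read_text(errors="replace").splitlines():
            if "prepare_caseops_data.py" in line:
                m = re.search(r"--val-docs[= ](\d+)", line)
                hits.append((str(sh), line.strip(),
                             int(m.group(1)) if m else None))
    return hits
```

Under the audit's criterion, an invocation reported with `None` falls back to the script's 10,000-doc default; only an explicit `--val-docs=50000` is consistent with a clean split.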
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 1, 2026
…ence

After user feedback that LEAK calls relied too heavily on lineage-inheritance and path heuristics, applied a stricter criterion: a LEAK verdict requires at least one of
(a) explicit shell-script invocation of prepare_caseops_data.py without --val-docs=50000,
(b) a README "Data setup" matching the actual train-log path,
(c) audit/submission.json admission text,
(d) a train-log path with `_caseops/datasets/datasets/<name>` triple-nesting OR single `<root>/datasets/<name>` (which only local prep produces; HF always gives double-nesting).

Records that previously got LEAK by lineage-inheritance alone are now AMBIGUOUS unless they meet at least one of those tests.

Changes:
- openai#1945 LEAK → CLEAN (finalize_v18.sh has snapshot_download from HF; actual run path matches the HF target; README's prepare_caseops_data.py section is stale documentation)
- openai#1953 LEAK → AMBIGUOUS (PR ships only train_gpt.py + logs; no prep evidence; path matches the HF target; parent openai#1945 confirmed CLEAN — leans CLEAN but no direct PR evidence)
- openai#2041 LEAK → AMBIGUOUS (no prep invocation; double-nested path consistent with EITHER HF or local prep)
- openai#2075 LEAK → AMBIGUOUS (ships the prep file but no explicit invocation; path matches the HF target)

Updated tally: CLEAN 9, LEAK 21, AMBIGUOUS 3, INHERIT 1 (was 8/25/0/1).

Headline impact: the realistic clean SOTA is at most ~0.012 bpb below the claimed frontier openai#2118 (1.04350). Best clean BPB candidates, in order:
- openai#2019 1.05847 (HF, confirmed)
- openai#1953 1.05855 (AMBIGUOUS, leans CLEAN)
- openai#1945 1.05943 (HF, confirmed via re-audit)
- openai#2031 1.05985 (HF, confirmed)
- openai#1908 1.06081 (HF, confirmed)
- openai#1851 1.06128 (HF, MERGED SOTA)
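The path test in criterion (d) above can be sketched as a small classifier (an illustrative reading of the commit message, not shipped tooling; the function name and labels are hypothetical — the rule is that a `_caseops/datasets/datasets/<name>` or bare `<root>/datasets/<name>` layout only comes from local prep, while the HF snapshot lands in a plain double-nested `datasets/datasets/<name>` layout):

```python
def classify_data_path(path: str) -> str:
    """Criterion (d) sketch: classify a train-log DATA_PATH by nesting."""
    if "_caseops/datasets/datasets/" in path:
        return "local-prep"     # triple-nested: only local prep produces this
    if "/datasets/datasets/" in path:
        return "HF-consistent"  # double-nested: matches snapshot_download layout
    if "/datasets/" in path:
        return "local-prep"     # single-nested: only local prep produces this
    return "unknown"

# The DATA_PATH from this PR's reproduction script is double-nested:
print(classify_data_path(
    "/workspace/caseops_data/datasets/datasets/"
    "fineweb10B_sp8192_lossless_caps_caseops_v1_reserved"
))  # HF-consistent
```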