
Record: 1.05847 no_qv TTT + AWQ-lite + AsymLogit + long-context eval (3-seed)#2019

Open
aquariouseworkman wants to merge 1 commit into openai:main from aquariouseworkman:wubba_Lubba_dub_Dub_1

Conversation

@aquariouseworkman
Contributor


# Record: PR openai#1953 stack — no_qv TTT + AWQ-lite + AsymLogit + long-context eval

**val_bpb = 1.05847** (3-seed mean, std 0.00063) | **max artifact 15,985,934 bytes** | 8x H100 SXM | strict 600s train + eval

## Results

| Seed | Stop step | Train time | Pre-quant BPB | Quantized BPB | **Post-TTT BPB** | Eval time | Artifact bytes |
|------|-----------|------------|---------------|---------------|------------------|-----------|----------------|
| 42   | 4892      | 595.97s ✅ | 1.06126       | 1.06962       | **1.05788**      | 493.2s ✅ | 15,979,342     |
| 0    | 4884      | 595.97s ✅ | 1.06181       | 1.07019       | **1.05840**      | 420.5s ✅ | 15,979,187     |
| 1234 | 4894      | 596.14s ✅ | 1.06232       | 1.07093       | **1.05914**      | 428.4s ✅ | 15,985,934     |
| **Mean** | **4890** | **596.03s** | **1.06180** | **1.07025** | **1.05847** | **447.4s** | **15,981,488** |

vs merged PR openai#1855 (1.06108): **-0.00261 BPB / -0.00571 nats**

## Stack

Inherits the full PR openai#1855 base (codemath3000) and layers the following on top (illustrative sketches of items 1–3 appear right after the list):

1. **AWQ-lite mixed-precision GPTQ** (PR openai#1908, romeerp) — activation-aware salient-group int8 promotion
2. **Asymmetric Logit Rescale** (PR openai#1923, jorge-asenjo) — learnable pos/neg softcap during TTT eval
3. **no_qv TTT mask** (PR openai#1953, himanshudongre) — disable Q/V LoRA in TTT, keep K/MLP/O
4. **TTT_LOCAL_LR_MULT=0.75** — scaled TTT optimizer LR
5. **QK_GAIN_INIT=5.25** — per-head Q-gain initialization
6. **EVAL_SEQ_LEN=2560** — extended eval context
7. **PHASED_TTT_PREFIX_DOCS=3000** — larger global-TTT prefix
8. **TTT_LORA_RANK=56** — reduced LoRA rank (compute reallocation)
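
The three inherited techniques are only named above, so rough sketches follow. None of this code is from the referenced PRs; the module names, signatures, and defaults are illustrative assumptions. First, one plausible reading of AWQ-lite's "activation-aware salient-group int8 promotion": rank each row's weight groups by activation-scaled magnitude and promote the `top_k` most salient groups to int8.

```python
import torch

def awq_lite_promotion_mask(weight, act_scale, group_size=64, top_k=1):
    """Hypothetical salience scoring: groups whose weights multiply
    high-magnitude activations matter most, so rank each row's groups
    by mean|W| * mean(act_scale) and promote the top_k of them to int8;
    all other groups stay at the base quantization width."""
    out_dim, in_dim = weight.shape
    n_groups = in_dim // group_size
    w_mag = weight.abs().reshape(out_dim, n_groups, group_size).mean(-1)
    a_mag = act_scale.reshape(n_groups, group_size).mean(-1)
    salience = w_mag * a_mag                        # (out_dim, n_groups)
    top = salience.topk(top_k, dim=-1).indices
    promote = torch.zeros_like(salience, dtype=torch.bool)
    promote.scatter_(1, top, True)                  # True -> keep group int8
    return promote
```

Second, a minimal sketch of the asymmetric logit rescale, assuming "learnable pos/neg softcap" means a tanh-style softcap with separate learnable caps for positive and negative logits, updated during TTT eval:

```python
import torch
import torch.nn as nn

class AsymLogitRescale(nn.Module):
    """Hypothetical asymmetric softcap: logits saturate toward +cap_pos
    above zero and -cap_neg below it; both caps are TTT-learnable."""
    def __init__(self, cap_pos: float = 30.0, cap_neg: float = 30.0):
        super().__init__()
        self.cap_pos = nn.Parameter(torch.tensor(cap_pos))
        self.cap_neg = nn.Parameter(torch.tensor(cap_neg))

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        cap = torch.where(logits >= 0, self.cap_pos, self.cap_neg)
        return cap * torch.tanh(logits / cap)
```

Third, the no_qv mask reduces to a gate on which projections receive TTT LoRA adapters; a sketch assuming adapters are discoverable by parameter name (the actual TTT_MASK plumbing in train_gpt.py is not shown in this PR):

```python
# Hypothetical gating for TTT_MASK=no_qv: K, O and MLP adapters stay
# trainable at test time, while Q and V adapters are excluded.
TTT_MASKS = {
    "all":   {"q", "k", "v", "o", "mlp"},
    "no_qv": {"k", "o", "mlp"},
}

def ttt_trainable_params(model, mask_name="no_qv"):
    enabled = TTT_MASKS[mask_name]
    for name, param in model.named_parameters():
        # assumes adapter params carry names like "...k_lora.weight"
        if any(f".{proj}_lora." in name for proj in enabled):
            yield param
```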

## Compliance

- [x] Artifact under 16,000,000 bytes (max 15,985,934)
- [x] Train wallclock under 600s (max 596.14s)
- [x] Eval wallclock under 600s (max 493.2s)
- [x] No PPM, no SLOT, no pre-quant TTT, no n-gram cache
- [x] Single left-to-right pass, score-before-update (schematic after this list)
- [x] Full normalized softmax distribution
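
The last two compliance items define the eval protocol, so here is a schematic of what "score-before-update" implies, in illustrative PyTorch rather than the PR's actual loop: every chunk contributes to the reported loss under the current weights before any TTT step adapts on it.

```python
import torch
import torch.nn.functional as F

def ttt_eval(model, optimizer, chunks):
    """Schematic single left-to-right pass with score-before-update:
    a chunk is scored first, and only afterwards may the model adapt
    (the TTT step) on that same chunk."""
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in chunks:                 # document order, one pass
        with torch.no_grad():                      # 1) score first
            logits = model(inputs)                 # (batch, seq, vocab) assumed
            total_nll += F.cross_entropy(
                logits.flatten(0, -2), targets.flatten(),
                reduction="sum").item()
            total_tokens += targets.numel()
        optimizer.zero_grad()                      # 2) only then update
        loss = F.cross_entropy(
            model(inputs).flatten(0, -2), targets.flatten())
        loss.backward()
        optimizer.step()
    return total_nll / total_tokens                # nats/token, not bpb
```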

## Reproduction

```bash
apt-get install -y lrzip
pip install sentencepiece brotli huggingface_hub numpy python-minifier
pip install --no-deps flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Dataset
HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='romeerp/parameter-golf-caseops-v1', repo_type='dataset', local_dir='/workspace/caseops_data')
"

# Run
for SEED in 42 0 1234; do
  SEED=$SEED \
  DATA_PATH=/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
  TOKENIZER_PATH=/workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  CASEOPS_ENABLED=1 VOCAB_SIZE=8192 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 \
  SMEAR_GATE_ENABLED=1 GATE_WINDOW=12 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_SCALE=0.5 \
  GATED_ATTN_QUANT_GATE=1 FUSED_CE_ENABLED=1 QK_GAIN_INIT=5.25 \
  EMBED_BITS=7 MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 \
  GPTQ_RESERVE_SECONDS=4.0 GPTQ_CALIBRATION_BATCHES=16 COMPRESSOR=pergroup \
  LQER_ENABLED=1 LQER_ASYM_ENABLED=1 LQER_RANK=4 LQER_FACTOR_BITS=4 LQER_ASYM_GROUP=64 LQER_TOP_K=3 \
  AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64 \
  ASYM_LOGIT_RESCALE=1 \
  TTT_ENABLED=1 PHASED_TTT_ENABLED=1 PHASED_TTT_NUM_PHASES=3 PHASED_TTT_PREFIX_DOCS=3000 \
  TTT_LORA_RANK=56 TTT_MASK=no_qv TTT_Q_LORA=0 TTT_V_LORA=0 TTT_LOCAL_LR_MULT=0.75 \
  TTT_CHUNK_SIZE=48 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 \
  EVAL_SEQ_LEN=2560 TTT_EVAL_SEQ_LEN=2560 \
  WARMDOWN_FRAC=0.85 BETA2=0.99 GRAD_CLIP_NORM=0.3 MIN_LR=0.1 MATRIX_LR=0.026 \
  NCCL_NET=Socket GLOBAL_TTT_MOMENTUM=0.9 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py > train_seed${SEED}.log 2>&1
done
```
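
As a quick sanity check, the headline mean and std recompute from the Post-TTT column of the results table:

```python
import statistics

post_ttt_bpb = {42: 1.05788, 0: 1.05840, 1234: 1.05914}      # from the table
print(f"mean={statistics.mean(post_ttt_bpb.values()):.5f}")  # 1.05847
print(f"std={statistics.stdev(post_ttt_bpb.values()):.5f}")  # 0.00063
```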
@aquariouseworkman changed the title from "Record: 1.05847 no_qv TTT + AWQ-lite + AsymLogit + long-context eval" to "Record: 1.05847 no_qv TTT + AWQ-lite + AsymLogit + long-context eval (3-seed)" on Apr 30, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
…ence

After user feedback that LEAK calls relied too heavily on lineage-inheritance
and path heuristics, applied a stricter criterion: a LEAK verdict requires at
least one of (a) explicit shell-script invocation of prepare_caseops_data.py
without --val-docs=50000, (b) README "Data setup" matching actual train log
path, (c) audit/submission.json admission text, (d) train log path with
`_caseops/datasets/datasets/<name>` triple-nesting OR single `<root>/datasets/<name>`
(which only local prep produces; HF always gives double-nesting).
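
A rough code rendering of criterion (d), with the path patterns transcribed from the description above (the helper itself is hypothetical, not from the audit repo):

```python
import re

# Layouts that, per criterion (d), only local prep produces; the HF
# snapshot yields double-nesting, which on its own stays AMBIGUOUS.
LOCAL_PREP_PATTERNS = [
    r"_caseops/datasets/datasets/[^/]+",    # triple-nesting
    r"^[^/]+/datasets/[^/]+$",              # single <root>/datasets/<name>
]

def meets_criterion_d(train_log_path: str) -> bool:
    return any(re.search(p, train_log_path) for p in LOCAL_PREP_PATTERNS)
```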

Records that previously got LEAK by lineage-inheritance alone are now AMBIGUOUS
unless they meet at least one of those tests.

Changes:
  - openai#1945 LEAK → CLEAN  (finalize_v18.sh has snapshot_download from HF;
    actual run path matches HF target; README's prepare_caseops_data.py
    section is stale documentation)
  - openai#1953 LEAK → AMBIGUOUS  (PR ships only train_gpt.py + logs; no prep
    evidence; path matches HF target; parent openai#1945 confirmed CLEAN —
    leans CLEAN but no direct PR evidence)
  - openai#2041 LEAK → AMBIGUOUS  (no prep invocation; double-nested path
    consistent with EITHER HF or local prep)
  - openai#2075 LEAK → AMBIGUOUS  (ships prep file but no explicit invocation;
    path matches HF target)

Updated tally: CLEAN 9, LEAK 21, AMBIGUOUS 3, INHERIT 1 (was 8/25/0/1).
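
The new tally is just the four verdict changes applied to the old counts, which checks out:

```python
from collections import Counter

tally = Counter(CLEAN=8, LEAK=25, AMBIGUOUS=0, INHERIT=1)  # previous audit
changes = [("LEAK", "CLEAN"),       # openai#1945
           ("LEAK", "AMBIGUOUS"),   # openai#1953
           ("LEAK", "AMBIGUOUS"),   # openai#2041
           ("LEAK", "AMBIGUOUS")]   # openai#2075
for old, new in changes:
    tally[old] -= 1
    tally[new] += 1
print(dict(tally))  # {'CLEAN': 9, 'LEAK': 21, 'AMBIGUOUS': 3, 'INHERIT': 1}
```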

Headline impact: realistic clean SOTA is at most ~0.012 bpb below the
claimed frontier openai#2118 (1.04350). Best clean BPB candidates in order:
  openai#2019 1.05847 (HF, confirmed)
  openai#1953 1.05855 (AMBIGUOUS, leans CLEAN)
  openai#1945 1.05943 (HF, confirmed via re-audit)
  openai#2031 1.05985 (HF, confirmed)
  openai#1908 1.06081 (HF, confirmed)
  openai#1851 1.06128 (HF, MERGED SOTA)