
{RECORD} CaseOps pre-quant TTT record (1.0354 BPB) #1911

Open
dttdrv wants to merge 3 commits into openai:main from dttdrv:record/caseops-prequant-ttt-103459



@dttdrv dttdrv commented Apr 28, 2026

Summary

This PR submits the CaseOps V15 record stack for track_10min_16mb.

  • 3-seed mean val_bpb: 1.03540487 (std 0.00056684)
  • Seeds: 1337, 42, 999
  • Artifact range: 15,994,993 to 15,996,195 bytes
  • Independent reproduction: seed 1337 reached 1.03459029 BPB with a 15,996,563 byte artifact on 2026-04-28/29
  • Title change: this is marked {RECORD} because it clears the threshold versus the 1.04290 BPB result of PR #1735 (Record: SP8192 + Parallel Pre-Quant TTT, val_bpb 1.0429, 3-seed mean)

I am being explicit about the provenance here: this is a community stack, not a "one weird trick" claim. The core move is combining PR #1735's parallel pre-quant TTT stack with PR #1729's CaseOps tokenizer/byte-sidecar path, as integrated in PR #1738, and independently reproducing it.

Why I Did This

The frontier PRs pointed to two large, mostly orthogonal levers:

  1. Pre-quant TTT was the biggest optimization lever. Instead of trying to make post-quant TTT work after GPTQ has already crushed the degrees of freedom, PR #1364 (Record: Pre-quant AdamW TTT + QK-Gain 4.0, val_bpb 1.1025, 3-seed mean) and then PR #1735 (Record: SP8192 + Parallel Pre-Quant TTT, val_bpb 1.0429, 3-seed mean) adapt the full-precision EMA model first, then quantize the adapted model into a fixed artifact.
  2. CaseOps was the cleanest data/tokenizer lever. PR #1729 (Record: CaseOps Tokenizer + Tapered WD, val_bpb 1.0678, 3-seed mean) showed that capitalization can be represented as a reversible side channel over a lower-case lexical stream. That reduces avoidable case fragmentation while still evaluating BPB against the original raw bytes.
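The reversibility claim in lever 2 can be sketched in a few lines. This is a toy illustration only, not PR #1729's actual operator scheme: real CaseOps emits case-control operators inside the token stream, while this sketch just carries an uppercase mask alongside a lowercased string. It round-trips for text whose per-character case mapping is 1:1 (plain ASCII).

```python
# Toy sketch of a reversible case side channel (NOT the CaseOps
# implementation): lowercase the lexical stream and keep a parallel
# mask recording which characters were uppercase.

def caseops_encode(text: str):
    """Split text into a lowercase stream plus an uppercase mask."""
    lower = text.lower()
    mask = [c.isupper() for c in text]
    return lower, mask

def caseops_decode(lower: str, mask) -> str:
    """Reapply capitalization; exact round trip for simple ASCII text."""
    return "".join(c.upper() if up else c for c, up in zip(lower, mask))

text = "The CaseOps Tokenizer"
lower, mask = caseops_encode(text)
assert caseops_decode(lower, mask) == text  # lossless round trip
```

The point is only that the case channel is recoverable, so the model can be trained on the cleaner lowercase stream without giving up exact reconstruction.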

Those two ideas should compose. CaseOps makes the sequence modeling problem cleaner; pre-quant TTT spends the remaining time budget adapting the full-precision model to that cleaner target before export. The non-trivial integration work is that CaseOps cannot use naive decoded-token byte counting. It needs byte sidecars for honest BPB.
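The byte-sidecar point can be made concrete. A hedged sketch (the constants below are made up; only the normalization is the point): BPB divides total loss in nats by ln(2) times the count of original raw bytes, and that count has to come from a sidecar because the CaseOps-transformed stream's decoded byte count differs from the raw bytes.

```python
import math

def bpb(total_loss_nats: float, original_byte_count: int) -> float:
    """Bits per byte measured against the ORIGINAL raw bytes."""
    return total_loss_nats / (math.log(2) * original_byte_count)

# Sanity check: a total loss of ln(2) nats over one byte is exactly 1 BPB.
assert abs(bpb(math.log(2), 1) - 1.0) < 1e-12

# Made-up numbers: the same total loss normalized by a wrong byte count
# reports a different (here, flattering) BPB -- hence the sidecar.
loss = 7.0e6
assert bpb(loss, 9_750_000) != bpb(loss, 9_500_000)
```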

What This PR Adds

This record folder is based on PR #1738's CaseOps V15 integration.

Results

| Seed | Sliding val_bpb | Artifact bytes |
| --- | --- | --- |
| 1337 | 1.03484145 | 15,996,061 |
| 42 | 1.03618043 | 15,996,195 |
| 999 | 1.03519273 | 15,994,993 |
| **Mean** | 1.03540487 | 15,995,750 |
| **Std** | 0.00056684 | |

The previous frontier stack I am comparing against is PR #1735 at 1.04290 BPB. This improves it by about 0.00750 BPB, just over the 0.005 nats / 0.00721 BPB threshold.
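The threshold arithmetic is easy to verify directly (numbers taken from this PR and PR #1735):

```python
import math

prev_bpb = 1.04290        # PR #1735, 3-seed mean
this_bpb = 1.03540487     # this PR, 3-seed mean

# The record threshold is quoted as 0.005 nats, i.e. 0.005 / ln(2) BPB.
threshold_bpb = 0.005 / math.log(2)   # ~0.00721 BPB

improvement = prev_bpb - this_bpb     # ~0.00750 BPB
assert improvement > threshold_bpb    # clears the record threshold
```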

Independent reproduction from this same record folder:

| Date | Seed | Sliding val_bpb | Artifact bytes |
| --- | --- | --- | --- |
| 2026-04-28/29 | 1337 | 1.03459029 | 15,996,563 |

Reproduction checkpoints:

  • Training stopped at 588132ms, step 4568/20000
  • Pre-quantization post-EMA: val_bpb=1.08389912
  • After 21 pre-quant TTT epochs: post-prequant-ttt val_bpb=1.02819756
  • Quantized non-sliding eval: val_bpb=1.04801825
  • Quantized sliding eval: val_bpb=1.03459029
  • Total submission size: 15,996,563 bytes
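The ordering implied by the checkpoints above is the whole point of pre-quant TTT. A hedged sketch with toy stand-in functions (nothing here is the actual train_gpt.py code): adaptation runs on the full-precision EMA weights first, and quantization happens once, afterward, so the scorer only ever sees a single fixed quantized artifact.

```python
def prequant_ttt(weights, epochs):
    # Toy stand-in for the AdamW TTT phase: nudge full-precision weights.
    return [w - 0.01 * epochs * w for w in weights]

def quantize(weights, step=0.25):
    # Toy stand-in for GPTQ int6 export: snap weights to a fixed grid.
    return [round(w / step) * step for w in weights]

ema = [0.9, -1.3, 0.4]
artifact = quantize(prequant_ttt(ema, epochs=21))      # pre-quant TTT
wrong_order = prequant_ttt(quantize(ema), epochs=21)   # post-quant TTT

# The exported artifact sits exactly on the quantization grid; adapting
# AFTER quantization would drift weights off-grid, i.e. the predictor
# would no longer be the fixed exported artifact.
on_grid = lambda ws: all(abs(w / 0.25 - round(w / 0.25)) < 1e-9 for w in ws)
assert on_grid(artifact)
assert not on_grid(wrong_order)
```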

Technique Stack

  • SP8192 CaseOps tokenizer with reversible case-control operators
  • Original-byte validation sidecars for correct BPB accounting
  • 11-layer, 512d, 8-head / 4-KV-head transformer
  • XSA on all layers
  • 3-layer depth recurrence over layers 3-5
  • Parallel residual path from layer 7 onward
  • QK-Gain 5.25
  • LeakyReLU(0.5)^2 MLP, mlp_mult=4.0
  • EMA/SWA, Muon-family optimization, high-WD compression pressure, warmdown scheduling
  • 8-GPU parallel pre-quant AdamW TTT, 21 epochs
  • Full-Hessian GPTQ with SDClip-style row clipping
  • Int6 model matrices, int8 embeddings
  • Brotli-compressed model and LZMA-wrapped code under 16 MB
  • Sliding-window eval, stride 64
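The sliding-window eval in the last bullet can be sketched as a window schedule. Toy shapes only (the window size here is illustrative; stride 64 matches the bullet), not the record's eval code; it just shows the stride/window bookkeeping: each step scores only the fresh `stride` positions, with up to `window` tokens of causal left context.

```python
def sliding_eval_positions(n_tokens: int, window: int = 1024, stride: int = 64):
    """Yield (context_start, score_start, score_end) spans that score
    every token exactly once with bounded causal left context."""
    spans = []
    pos = 0
    while pos < n_tokens:
        score_end = min(pos + stride, n_tokens)
        context_start = max(0, score_end - window)
        spans.append((context_start, pos, score_end))
        pos = score_end
    return spans

spans = sliding_eval_positions(200, window=128, stride=64)
# Every token is scored exactly once and context never exceeds the window.
assert [s[1:] for s in spans] == [(0, 64), (64, 128), (128, 192), (192, 200)]
assert all(end - ctx <= 128 for ctx, _, end in spans)
```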

Full Lineage / Credits

I read the upstream PR chain and am intentionally not reducing this to a short credit list. The exact runtime stack is a compressed script, so not every ancestor appears as a neat isolated function anymore, but these are the PRs I traced as leading to this record's components or to the parent PRs used here.

| PR | Contributor | Why it matters here |
| --- | --- | --- |
| #1738 | @alertcat | Exact CaseOps V15 integration of PR #1735 plus PR #1729. This PR is based on that record folder. |
| #1735 | @AjAnubolu | Parallel pre-quant AdamW TTT, 21 epochs, federated averaging, epoch-level cosine LR, torch.compile speedup. |
| #1729 | @romeerp | CaseOps tokenizer/data export, reversible capitalization operators, validation byte sidecars. |
| #1626 | @dexhunter | Multi-phase score-first TTT lineage used by the CaseOps PR. |
| #1530 | @samacqua | VarLen attention / fused MLP / doc-TTT base referenced by #1626. |
| #1610 | @romeerp | Phased TTT concept referenced by #1626. |
| #1493 | @bigbag | QK-Gain 5.25 and consolidation of SP8192 + recurrence + residuals + legal TTT. |
| #1445 | @X-Abhishek-X | Tuned WD, matrix LR, EMA, warmdown settings cited by #1493. |
| #1412 | @Robby955 | Parallel residuals from layer 7 onward; Hessian-aware SDClip analysis. |
| #1331 | @dexhunter | 3-layer depth recurrence over layers 3-5 and WD/LR compression tradeoff. |
| #1285 | @dexhunter | Earlier recurrence and WD-quantization synergy extended by #1331. |
| #1394 | @clarkkev | SP8192, GPTQ embedding quantization, SDClip, Brotli packaging, simplified recurrence. |
| #1218 | @clarkkev | Larger vocab/model stack, high-WD compression logic, GPTQ Hessian-aware path, skip-gate and QK-gain adoption. |
| #1217 | @bigbag | MuonEq-R and QK-gain sweep context. |
| #1204 | @msisovic | Mini depth recurrence and parallel residual formulation. |
| #1179 | @dexhunter | Base stack used by the #1204 / #1217 lineage. |
| #1125 | @jainpranjal97 | XSA-all and QK-Gain 4.0 findings that pushed the later attention-gain sweeps. |
| #1105 | @abaybektursun | Mixed-quantization / AR GPTQ path referenced by #1204. |
| #1089 | @clarkkev | Byte-shuffle/Brotli compression and sigmoid-gated skip connection lineage. |
| #1060 | @clarkkev | GPTQ Hessian-aware quantization implementation referenced by #1218. |
| #1019 | @abaybektursun | AR self-generated GPTQ calibration, XSA-all, architecture documentation, prior merged SOTA baseline. |
| #756 | @abaybektursun | Negative post-quant TTT experiments that helped motivate pre-quant adaptation. |
| #726 | @clarkkev | Coprime-stride loader lineage that preceded the simplified loader in #1394. |
| #609 | @saml212 | BigramHash / selective pruning / GPTQ calibration lineage referenced by #1019. |
| #593 | prior contributors | GPTQ calibration legality context referenced by #1019. |
| #569 | prior contributors | GPTQ calibration legality context referenced by #1019. |
| #549 | @abaybektursun | LeakyReLU^2, legal score-first TTT, Parallel Muon record line. |
| #535 | @raahilshah | Full-Hessian GPTQ and QAT/export alignment lineage. |
| #518 | @sofiabod | LeakyReLU^2 follow-up credit in the #549 lineage. |
| #493 | @parinzee | 11-layer model, XSA, LeakyReLU(0.5)^2, EMA, int6 quantization, partial RoPE. |
| #478 | @gowtham0992 | XSA on all 11 layers, GPTQ-lite, EMA, late-QAT record line. |
| #461 | @Christopher-Lee-McClendon | Score-first TTT framework used by earlier legal TTT records. |
| #414 | @signalrush | Base model lineage credited by #549. |
| #401 | @newjordan | EMA/SWA weight-averaging lineage. |
| #399 | @abaybektursun | Parallel Muon optimizer lineage. |
| #364 | @shikhar1729 | Warmdown schedule lineage. |
| #315 | @jfprincz | Partial RoPE and layer-scale lineage. |
| #289 | contributor in #1019 lineage | U-Net skip connection lineage documented by #1019. |
| #286 | @chris-buckley | Late QAT / STE lineage documented by #1019. |
| #180 | @thwu1 | Early SOTA baseline credited by #493. |
| #162 | @raahilshah | BigramHash concept lineage documented by #1019. |
| #160 | @ChaseWNorton | Compression lineage documented by #1019. |
| #122 | @mtybadger | Flash Attention 3 / Hopper kernel dependency lineage documented by #1019. |
| #65 | @aquariouseworkman | SmearGate lineage documented by #1019; later SP8192 stacks simplified parts away. |

Compliance Notes

This is submitted under the same Track A interpretation as PR #1735 and PR #1738:

  • Final evaluation uses a fixed quantized artifact.
  • Pre-quant TTT happens before export, not during final scoring.
  • No SLOT, RLS, ETLB, n-gram cache, or eval-time cache.
  • No two-pass rescoring.
  • Sliding-window eval is causal with stride 64.
  • The softmax distribution is normalized.
  • CaseOps is reversible and uses original-byte sidecars for BPB.
  • Artifact size is below 16,000,000 bytes.
  • Training and eval stay below the 10-minute limits.

The sensitive part is pre-quant TTT on validation chunks. I am not hiding that. I am submitting this consistently with the Track A framing used by PR #1735 / PR #1738: adaptation is part of producing the fixed artifact, and the scorer sees a fixed predictor. If maintainers decide that interpretation is not allowed, this line should be judged consistently with those PRs.

Dependencies / External Data

The challenge README allows packages/imports as long as they do not violate the evaluation, compute, training-time, code-size, or other restrictions, and asks record folders to include dependency/setup notes. I added a requirements.txt to this record folder for manual setup.

For clarity:

  • The final submitted artifact is self-contained: counted code bytes plus compressed model bytes.
  • There are no network calls or external downloads during final evaluation.
  • romeerp/parameter-golf-caseops-v1 is used as the public CaseOps tokenizer/data export for training setup, before train_gpt.py runs.
  • The train script imports torch, numpy, sentencepiece, and brotli; it tries FlashAttention 3 when available in the official H100 image and otherwise falls back to the PyTorch attention path.
  • huggingface-hub and hf_transfer are only for fetching the public CaseOps dataset/tokenizer during setup.

So no, I am not relying on an external service at eval time. The only external piece is the documented public dataset/tokenizer setup needed to reproduce the training run, in the same spirit as the repository's normal data download flow.

Reproduction

```shell
pip install -r requirements.txt
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='romeerp/parameter-golf-caseops-v1',
    repo_type='dataset',
    local_dir='/workspace/caseops_data',
)
"

cd /workspace/caseops_data/datasets/datasets/
ln -sf fineweb10B_sp8192_lossless_caps_caseops_v1_reserved fineweb10B_sp8192
cd /workspace/caseops_data/datasets/tokenizers/
ln -sf fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model fineweb_8192_bpe.model

SEED=1337 \
  DATA_DIR=/workspace/caseops_data/datasets/ \
  TTT_EMA_ENABLED=0 \
  PREQUANT_TTT_ENABLED=1 \
  PREQUANT_TTT_EPOCHS=21 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Test Plan

  • 3-seed validation: 1337, 42, 999
  • Independent seed 1337 reproduction on 2026-04-28/29
  • Artifacts below 16,000,000 bytes
  • Training below 600s
  • Eval below 600s
  • Fixed predictor for final scoring
  • Full-Hessian GPTQ int6 + Brotli
  • CaseOps byte-sidecar BPB accounting

@dttdrv dttdrv changed the title Add CaseOps pre-quant TTT record (1.0354 BPB) {RECORD} CaseOps pre-quant TTT record (1.0354 BPB) Apr 28, 2026
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…ams)

After 4 parallel research agents reviewed 30+ open PRs and
compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) flagged "empirical negative" by
   sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base
   with default WD=1.0. Never tested on PR openai#1908 + WD=2.0 combo.
   V19's specific stack is NOT directly invalidated.

2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855
   base 1.06108 = -0.00059 BPB). Just 2 hparam env vars:
     MATRIX_LR 0.026 -> 0.028
     PHASED_TTT_PREFIX_DOCS 2500 -> 3500
   Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus
  + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0
  (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
  V19c < 0.97591 -> CLEAR WIN, run 3-seed
  V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
  V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent
- cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)
