
{RECORD} CaseOps pre-quant TTT record (1.0354 BPB) #1911

Open
dttdrv wants to merge 3 commits into openai:main from dttdrv:record/caseops-prequant-ttt-103459



@dttdrv dttdrv commented Apr 28, 2026

Summary

This PR submits the CaseOps V15 record stack for track_10min_16mb.

  • 3-seed mean val_bpb: 1.03540487 (std 0.00056684)
  • Seeds: 1337, 42, 999
  • Artifact range: 15,994,993 to 15,996,195 bytes
  • Independent reproduction: seed 1337 reached 1.03459029 BPB with a 15,996,563 byte artifact on 2026-04-28/29
  • Title change: this is marked {RECORD} because it clears the threshold versus the 1.04290 BPB result of PR #1735 (Record: SP8192 + Parallel Pre-Quant TTT, val_bpb 1.0429, 3-seed mean)

I am being explicit about the provenance here: this is a community stack, not a "one weird trick" claim. The core move is combining PR #1735's parallel pre-quant TTT stack with PR #1729's CaseOps tokenizer/byte-sidecar path, as integrated in PR #1738, and independently reproducing it.

Why I Did This

The frontier PRs pointed to two large, mostly orthogonal levers:

  1. Pre-quant TTT was the biggest optimization lever. Instead of trying to make post-quant TTT work after GPTQ has already crushed the degrees of freedom, PR #1364 (Record: Pre-quant AdamW TTT + QK-Gain 4.0, val_bpb 1.1025, 3-seed mean) and then PR #1735 (Record: SP8192 + Parallel Pre-Quant TTT, val_bpb 1.0429, 3-seed mean) adapt the full-precision EMA model first, then quantize the adapted model into a fixed artifact.
  2. CaseOps was the cleanest data/tokenizer lever. PR #1729 (Record: CaseOps Tokenizer + Tapered WD, val_bpb 1.0678, 3-seed mean) showed that capitalization can be represented as a reversible side channel over a lower-case lexical stream. That reduces avoidable case fragmentation while still evaluating BPB against the original raw bytes.
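The reversibility claim in lever 2 can be sketched in a few lines. This is a toy illustration only, not PR #1729's actual operator scheme: real CaseOps emits case-control operators inside the token stream, while this sketch just carries an uppercase mask alongside a lowercased string. It round-trips for text whose per-character case mapping is 1:1 (plain ASCII).

```python
# Toy sketch of a reversible case side channel (NOT the CaseOps
# implementation): lowercase the lexical stream and keep a parallel
# mask recording which characters were uppercase.

def caseops_encode(text: str):
    """Split text into a lowercase stream plus an uppercase mask."""
    lower = text.lower()
    mask = [c.isupper() for c in text]
    return lower, mask

def caseops_decode(lower: str, mask) -> str:
    """Reapply capitalization; exact round trip for simple ASCII text."""
    return "".join(c.upper() if up else c for c, up in zip(lower, mask))

text = "The CaseOps Tokenizer"
lower, mask = caseops_encode(text)
assert caseops_decode(lower, mask) == text  # lossless round trip
```

The point is only that the case channel is recoverable, so the model can be trained on the cleaner lowercase stream without giving up exact reconstruction.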

Those two ideas should compose. CaseOps makes the sequence modeling problem cleaner; pre-quant TTT spends the remaining time budget adapting the full-precision model to that cleaner target before export. The non-trivial integration work is that CaseOps cannot use naive decoded-token byte counting. It needs byte sidecars for honest BPB.
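The byte-sidecar point can be made concrete. A hedged sketch (the constants below are made up; only the normalization is the point): BPB divides total loss in nats by ln(2) times the count of original raw bytes, and that count has to come from a sidecar because the CaseOps-transformed stream's decoded byte count differs from the raw bytes.

```python
import math

def bpb(total_loss_nats: float, original_byte_count: int) -> float:
    """Bits per byte measured against the ORIGINAL raw bytes."""
    return total_loss_nats / (math.log(2) * original_byte_count)

# Sanity check: a total loss of ln(2) nats over one byte is exactly 1 BPB.
assert abs(bpb(math.log(2), 1) - 1.0) < 1e-12

# Made-up numbers: the same total loss normalized by a wrong byte count
# reports a different (here, flattering) BPB -- hence the sidecar.
loss = 7.0e6
assert bpb(loss, 9_750_000) != bpb(loss, 9_500_000)
```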

What This PR Adds

This record folder is based on PR #1738's CaseOps V15 integration.

Results

| Seed | Sliding val_bpb | Artifact bytes |
| --- | --- | --- |
| 1337 | 1.03484145 | 15,996,061 |
| 42 | 1.03618043 | 15,996,195 |
| 999 | 1.03519273 | 15,994,993 |
| **Mean** | 1.03540487 | 15,995,750 |
| **Std** | 0.00056684 | |

The previous frontier stack I am comparing against is PR #1735 at 1.04290 BPB. This improves it by about 0.00750 BPB, just over the 0.005 nats / 0.00721 BPB threshold.
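The threshold arithmetic is easy to verify directly (numbers taken from this PR and PR #1735):

```python
import math

prev_bpb = 1.04290        # PR #1735, 3-seed mean
this_bpb = 1.03540487     # this PR, 3-seed mean

# The record threshold is quoted as 0.005 nats, i.e. 0.005 / ln(2) BPB.
threshold_bpb = 0.005 / math.log(2)   # ~0.00721 BPB

improvement = prev_bpb - this_bpb     # ~0.00750 BPB
assert improvement > threshold_bpb    # clears the record threshold
```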

Independent reproduction from this same record folder:

| Date | Seed | Sliding val_bpb | Artifact bytes |
| --- | --- | --- | --- |
| 2026-04-28/29 | 1337 | 1.03459029 | 15,996,563 |

Reproduction checkpoints:

  • Training stopped at 588132ms, step 4568/20000
  • Pre-quantization post-EMA: val_bpb=1.08389912
  • After 21 pre-quant TTT epochs: post-prequant-ttt val_bpb=1.02819756
  • Quantized non-sliding eval: val_bpb=1.04801825
  • Quantized sliding eval: val_bpb=1.03459029
  • Total submission size: 15,996,563 bytes
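The ordering implied by the checkpoints above is the whole point of pre-quant TTT. A hedged sketch with toy stand-in functions (nothing here is the actual train_gpt.py code): adaptation runs on the full-precision EMA weights first, and quantization happens once, afterward, so the scorer only ever sees a single fixed quantized artifact.

```python
def prequant_ttt(weights, epochs):
    # Toy stand-in for the AdamW TTT phase: nudge full-precision weights.
    return [w - 0.01 * epochs * w for w in weights]

def quantize(weights, step=0.25):
    # Toy stand-in for GPTQ int6 export: snap weights to a fixed grid.
    return [round(w / step) * step for w in weights]

ema = [0.9, -1.3, 0.4]
artifact = quantize(prequant_ttt(ema, epochs=21))      # pre-quant TTT
wrong_order = prequant_ttt(quantize(ema), epochs=21)   # post-quant TTT

# The exported artifact sits exactly on the quantization grid; adapting
# AFTER quantization would drift weights off-grid, i.e. the predictor
# would no longer be the fixed exported artifact.
on_grid = lambda ws: all(abs(w / 0.25 - round(w / 0.25)) < 1e-9 for w in ws)
assert on_grid(artifact)
assert not on_grid(wrong_order)
```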

Technique Stack

  • SP8192 CaseOps tokenizer with reversible case-control operators
  • Original-byte validation sidecars for correct BPB accounting
  • 11-layer, 512d, 8-head / 4-KV-head transformer
  • XSA on all layers
  • 3-layer depth recurrence over layers 3-5
  • Parallel residual path from layer 7 onward
  • QK-Gain 5.25
  • LeakyReLU(0.5)^2 MLP, mlp_mult=4.0
  • EMA/SWA, Muon-family optimization, high-WD compression pressure, warmdown scheduling
  • 8-GPU parallel pre-quant AdamW TTT, 21 epochs
  • Full-Hessian GPTQ with SDClip-style row clipping
  • Int6 model matrices, int8 embeddings
  • Brotli-compressed model and LZMA-wrapped code under 16 MB
  • Sliding-window eval, stride 64
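The sliding-window eval in the last bullet can be sketched as a window schedule. Toy shapes only (the window size here is illustrative; stride 64 matches the bullet), not the record's eval code; it just shows the stride/window bookkeeping: each step scores only the fresh `stride` positions, with up to `window` tokens of causal left context.

```python
def sliding_eval_positions(n_tokens: int, window: int = 1024, stride: int = 64):
    """Yield (context_start, score_start, score_end) spans that score
    every token exactly once with bounded causal left context."""
    spans = []
    pos = 0
    while pos < n_tokens:
        score_end = min(pos + stride, n_tokens)
        context_start = max(0, score_end - window)
        spans.append((context_start, pos, score_end))
        pos = score_end
    return spans

spans = sliding_eval_positions(200, window=128, stride=64)
# Every token is scored exactly once and context never exceeds the window.
assert [s[1:] for s in spans] == [(0, 64), (64, 128), (128, 192), (192, 200)]
assert all(end - ctx <= 128 for ctx, _, end in spans)
```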

Full Lineage / Credits

I read the upstream PR chain and am intentionally not reducing this to a short credit list. The exact runtime stack is a compressed script, so not every ancestor appears as a neat isolated function anymore, but these are the PRs I traced as leading to this record's components or to the parent PRs used here.

| PR | Contributor | Why it matters here |
| --- | --- | --- |
| #1738 | @alertcat | Exact CaseOps V15 integration of PR #1735 plus PR #1729. This PR is based on that record folder. |
| #1735 | @AjAnubolu | Parallel pre-quant AdamW TTT, 21 epochs, federated averaging, epoch-level cosine LR, torch.compile speedup. |
| #1729 | @romeerp | CaseOps tokenizer/data export, reversible capitalization operators, validation byte sidecars. |
| #1626 | @dexhunter | Multi-phase score-first TTT lineage used by the CaseOps PR. |
| #1530 | @samacqua | VarLen attention / fused MLP / doc-TTT base referenced by #1626. |
| #1610 | @romeerp | Phased TTT concept referenced by #1626. |
| #1493 | @bigbag | QK-Gain 5.25 and consolidation of SP8192 + recurrence + residuals + legal TTT. |
| #1445 | @X-Abhishek-X | Tuned WD, matrix LR, EMA, warmdown settings cited by #1493. |
| #1412 | @Robby955 | Parallel residuals from layer 7 onward; Hessian-aware SDClip analysis. |
| #1331 | @dexhunter | 3-layer depth recurrence over layers 3-5 and WD/LR compression tradeoff. |
| #1285 | @dexhunter | Earlier recurrence and WD-quantization synergy extended by #1331. |
| #1394 | @clarkkev | SP8192, GPTQ embedding quantization, SDClip, Brotli packaging, simplified recurrence. |
| #1218 | @clarkkev | Larger vocab/model stack, high-WD compression logic, GPTQ Hessian-aware path, skip-gate and QK-gain adoption. |
| #1217 | @bigbag | MuonEq-R and QK-gain sweep context. |
| #1204 | @msisovic | Mini depth recurrence and parallel residual formulation. |
| #1179 | @dexhunter | Base stack used by the #1204 / #1217 lineage. |
| #1125 | @jainpranjal97 | XSA-all and QK-Gain 4.0 findings that pushed the later attention-gain sweeps. |
| #1105 | @abaybektursun | Mixed-quantization / AR GPTQ path referenced by #1204. |
| #1089 | @clarkkev | Byte-shuffle/Brotli compression and sigmoid-gated skip connection lineage. |
| #1060 | @clarkkev | GPTQ Hessian-aware quantization implementation referenced by #1218. |
| #1019 | @abaybektursun | AR self-generated GPTQ calibration, XSA-all, architecture documentation, prior merged SOTA baseline. |
| #756 | @abaybektursun | Negative post-quant TTT experiments that helped motivate pre-quant adaptation. |
| #726 | @clarkkev | Coprime-stride loader lineage that preceded the simplified loader in #1394. |
| #609 | @saml212 | BigramHash / selective pruning / GPTQ calibration lineage referenced by #1019. |
| #593 | prior contributors | GPTQ calibration legality context referenced by #1019. |
| #569 | prior contributors | GPTQ calibration legality context referenced by #1019. |
| #549 | @abaybektursun | LeakyReLU^2, legal score-first TTT, Parallel Muon record line. |
| #535 | @raahilshah | Full-Hessian GPTQ and QAT/export alignment lineage. |
| #518 | @sofiabod | LeakyReLU^2 follow-up credit in the #549 lineage. |
| #493 | @parinzee | 11-layer model, XSA, LeakyReLU(0.5)^2, EMA, int6 quantization, partial RoPE. |
| #478 | @gowtham0992 | XSA on all 11 layers, GPTQ-lite, EMA, late-QAT record line. |
| #461 | @Christopher-Lee-McClendon | Score-first TTT framework used by earlier legal TTT records. |
| #414 | @signalrush | Base model lineage credited by #549. |
| #401 | @newjordan | EMA/SWA weight-averaging lineage. |
| #399 | @abaybektursun | Parallel Muon optimizer lineage. |
| #364 | @shikhar1729 | Warmdown schedule lineage. |
| #315 | @jfprincz | Partial RoPE and layer-scale lineage. |
| #289 | contributor in #1019 lineage | U-Net skip connection lineage documented by #1019. |
| #286 | @chris-buckley | Late QAT / STE lineage documented by #1019. |
| #180 | @thwu1 | Early SOTA baseline credited by #493. |
| #162 | @raahilshah | BigramHash concept lineage documented by #1019. |
| #160 | @ChaseWNorton | Compression lineage documented by #1019. |
| #122 | @mtybadger | Flash Attention 3 / Hopper kernel dependency lineage documented by #1019. |
| #65 | @aquariouseworkman | SmearGate lineage documented by #1019; later SP8192 stacks simplified parts away. |

Compliance Notes

This is submitted under the same Track A interpretation as PR #1735 and PR #1738:

  • Final evaluation uses a fixed quantized artifact.
  • Pre-quant TTT happens before export, not during final scoring.
  • No SLOT, RLS, ETLB, n-gram cache, or eval-time cache.
  • No two-pass rescoring.
  • Sliding-window eval is causal with stride 64.
  • The softmax distribution is normalized.
  • CaseOps is reversible and uses original-byte sidecars for BPB.
  • Artifact size is below 16,000,000 bytes.
  • Training and eval stay below the 10-minute limits.

The sensitive part is pre-quant TTT on validation chunks. I am not hiding that. I am submitting this consistently with the Track A framing used by PR #1735 / PR #1738: adaptation is part of producing the fixed artifact, and the scorer sees a fixed predictor. If maintainers decide that interpretation is not allowed, this line should be judged consistently with those PRs.

Dependencies / External Data

The challenge README allows packages/imports as long as they do not violate the evaluation, compute, training-time, code-size, or other restrictions, and asks record folders to include dependency/setup notes. I added a requirements.txt to this record folder for manual setup.

For clarity:

  • The final submitted artifact is self-contained: counted code bytes plus compressed model bytes.
  • There are no network calls or external downloads during final evaluation.
  • romeerp/parameter-golf-caseops-v1 is used as the public CaseOps tokenizer/data export for training setup, before train_gpt.py runs.
  • The train script imports torch, numpy, sentencepiece, and brotli; it tries FlashAttention 3 when available in the official H100 image and otherwise falls back to the PyTorch attention path.
  • huggingface-hub and hf_transfer are only for fetching the public CaseOps dataset/tokenizer during setup.

So no, I am not relying on an external service at eval time. The only external piece is the documented public dataset/tokenizer setup needed to reproduce the training run, in the same spirit as the repository's normal data download flow.

Reproduction

```shell
pip install -r requirements.txt
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='romeerp/parameter-golf-caseops-v1',
    repo_type='dataset',
    local_dir='/workspace/caseops_data',
)
"

cd /workspace/caseops_data/datasets/datasets/
ln -sf fineweb10B_sp8192_lossless_caps_caseops_v1_reserved fineweb10B_sp8192
cd /workspace/caseops_data/datasets/tokenizers/
ln -sf fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model fineweb_8192_bpe.model

SEED=1337 \
  DATA_DIR=/workspace/caseops_data/datasets/ \
  TTT_EMA_ENABLED=0 \
  PREQUANT_TTT_ENABLED=1 \
  PREQUANT_TTT_EPOCHS=21 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Test Plan

  • 3-seed validation: 1337, 42, 999
  • Independent seed 1337 reproduction on 2026-04-28/29
  • Artifacts below 16,000,000 bytes
  • Training below 600s
  • Eval below 600s
  • Fixed predictor for final scoring
  • Full-Hessian GPTQ int6 + Brotli
  • CaseOps byte-sidecar BPB accounting

@dttdrv dttdrv changed the title Add CaseOps pre-quant TTT record (1.0354 BPB) {RECORD} CaseOps pre-quant TTT record (1.0354 BPB) Apr 28, 2026
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…ams)

After 4 parallel research agents reviewed 30+ open PRs and
compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) flagged "empirical negative" by
   sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base
   with default WD=1.0. Never tested on PR openai#1908 + WD=2.0 combo.
   V19's specific stack is NOT directly invalidated.

2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855
   base 1.06108 = -0.00059 BPB). Just 2 hparam env vars:
     MATRIX_LR 0.026 -> 0.028
     PHASED_TTT_PREFIX_DOCS 2500 -> 3500
   Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus
  + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0
  (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
  V19c < 0.97591 -> CLEAR WIN, run 3-seed
  V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
  V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent
- cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)
