Record: CaseOps Gated XSA NgramTilt LQER | val_bpb=1.05933439 #2123

Closed

vaibhavmishra1 wants to merge 2 commits into openai:main from vaibhavmishra1:final_sub

Conversation

@vaibhavmishra1

Gated XSA + CaseOps + LQER g32/top4 + In-Timer N-gram Tilt

The default run is SEED=42 ./run.sh; see the Reproduction section below.

Results

The folder includes three successful 8xH100 logs.

Final result: 1.05933439 mean TTT BPB across three logs.
Final file size: max observed submission size 15,991,624 B (8,376 B under the 16,000,000-byte cap).
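
For reference, BPB is read here as the standard bits-per-byte metric: bpb = nll_nats / (ln 2 × n_raw_bytes), i.e. total cross-entropy in nats divided by ln 2 and by the number of raw validation bytes scored. The byte counts presumably come from the lossless-caps byte sidecar mentioned below; this definition is an assumption, not a quote from the code.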

| Seed | Config | Post-EMA BPB | Quant BPB | TTT BPB | TTT loss | TTT eval time | Train stop | Total submission size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 42 | g32/top4, rank2, prefix1000 | 1.06357301 | 1.07110057 | 1.05872644 | 2.31688588 | 521.1s | 596.047s | 15,984,321 B |
| 0 | g32/top4, rank2, prefix1000 | 1.06531335 | 1.07291783 | 1.06035228 | 2.32044383 | 475.6s | 596.146s | 15,987,537 B |
| 1234 | g64/top1, rank4, prefix1000 | 1.06335328 | 1.07134445 | 1.05892444 | 2.31731918 | 467.9s | 598.091s | 15,991,624 B |
| Mean | all three logs | 1.06407988 | 1.07178762 | 1.05933439 | 2.31821630 | 488.2s | 596.761s | 15,987,827 B |

What Changed

Architecture and training

The architecture follows the PR #2018 lineage:

  • 11-layer, 512-dim, 8-head GQA transformer with 4 KV heads.
  • CaseOps SP8192 lossless-caps tokenizer and byte sidecar validation scoring.
  • Gated XSA on all layers with zero-init per-head gates (sketched just after this list).
  • Looping layers 3-5 after 35% of training.
  • Parallel final lane from layer 8.
  • SmearGate and Sparse Attention Gate enabled.
  • Fused CE training path, FA3 attention, partial RoPE, logit softcap.
  • EMA applied before quantization.
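
For intuition, here is a minimal PyTorch sketch of the zero-init per-head gating idea. The class name and tensor layout are illustrative assumptions, not the actual train_gpt.py implementation:

import torch
import torch.nn as nn

class PerHeadGate(nn.Module):
    # Hypothetical sketch of zero-init per-head output gating: each
    # attention head's output is scaled by tanh(g_h). With g initialized
    # to zero, the gated branch contributes nothing at step 0 and opens
    # smoothly as training proceeds.
    def __init__(self, num_heads: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(num_heads))  # one gate per head

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq_len, num_heads, head_dim)
        return attn_out * torch.tanh(self.gate).view(1, 1, -1, 1)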

The training loop and optimizer routing are otherwise kept close to the PR #2018 base. The run stops at the 600s wallclock training cap.

Quantization

The final quantization path is the PR #2018 GPTQ/LQER/AWQ-lite path with PR #2060's portable LQER retune:

| Setting | Value |
| --- | --- |
| MATRIX_BITS | 6 |
| EMBED_BITS | 7 |
| LQER_RANK | 2 |
| LQER_ASYM_GROUP | 32 |
| LQER_TOP_K | 4 |
| AWQ_LITE_ENABLED | 1 |
| AWQ_LITE_GROUP_SIZE | 64 |
| COMPRESSOR | pergroup |

The largest included g32/top4 run was 15,987,537 bytes, leaving 12,463 bytes of headroom under the 16,000,000-byte cap.
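
To make the rank/group settings concrete, here is a hedged PyTorch sketch of the LQER idea under the table's values: asymmetric per-group quantization (group 32) followed by a rank-2 SVD of the quantization error. The function is an illustration only; the submission's actual path also runs GPTQ and AWQ-lite, and the top4 selection logic is not reproduced here:

import torch

def lqer_sketch(w: torch.Tensor, bits: int = 6, group: int = 32, rank: int = 2):
    # Hypothetical LQER sketch, not the submission's exact pipeline.
    # Step 1: asymmetric per-group quantization of a 2-D weight matrix.
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group, group)
    lo = wg.amin(dim=-1, keepdim=True)
    hi = wg.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / (2 ** bits - 1)
    q = torch.round((wg - lo) / scale).clamp(0, 2 ** bits - 1)
    w_q = (q * scale + lo).reshape(out_f, in_f)
    # Step 2: approximate the quantization error E = W - Q(W) with a
    # rank-`rank` factor pair from a truncated SVD, stored alongside w_q.
    u, s, vh = torch.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (out_f, rank)
    b = vh[:rank, :]             # (rank, in_f)
    # At inference the layer uses w_q + a @ b in place of w.
    return w_q, a, b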

Evaluation

Evaluation keeps the conservative PR #2018 timing recipe:

| Setting | Value |
| --- | --- |
| EVAL_SEQ_LEN | 2560 |
| TTT_EVAL_SEQ_LEN | 2560 |
| PHASED_TTT_NUM_PHASES | 1 |
| PHASED_TTT_PREFIX_DOCS | 1000 |
| TTT_LORA_RANK | 80 |
| TTT_LORA_LR | 0.0001 |
| TTT_CHUNK_SIZE | 48 |
| NGRAM_TILT_ENABLED | 1 |
| NGRAM_HINT_PRECOMPUTE_OUTSIDE | 0 |
| TOKEN_ORDER | 16 |
| TOKEN_THRESHOLD | 0.800 |
| TOKEN_BOOST | 2.625 |
| WITHIN_BOOST | 0 |
| WORD_BOOST | 0 |
The n-gram helper is inlined into train_gpt.py, so the evaluation logic is self-contained and counted with the submitted code. The helper is token-only, prefix-only, and its hint precompute happens inside the measured TTT eval timer.
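
For readers unfamiliar with the mechanism, here is a hedged, single-order sketch of a causal, token-only n-gram tilt. The function names are invented, and the real helper presumably backs off across context lengths up to TOKEN_ORDER=16 rather than using one fixed order:

def tilt_logits(logits, tokens, pos, counts, order=16,
                threshold=0.800, boost=2.625):
    # Hypothetical sketch (not the inlined train_gpt.py helper): `counts`
    # maps a context tuple of up to order-1 preceding tokens to next-token
    # frequencies, built only from the already-scored prefix tokens[:pos].
    ctx = tuple(tokens[max(0, pos - (order - 1)):pos])
    nxt = counts.get(ctx)
    if nxt:
        tok, cnt = max(nxt.items(), key=lambda kv: kv[1])
        if cnt / sum(nxt.values()) >= threshold:
            logits[tok] += boost  # tilt toward the dominant continuation
    return logits

def observe_token(tokens, pos, counts, order=16):
    # Called only AFTER position `pos` has been scored, so the hint for a
    # position never depends on its own target (causal, score-time only).
    ctx = tuple(tokens[max(0, pos - (order - 1)):pos])
    bucket = counts.setdefault(ctx, {})
    bucket[tokens[pos]] = bucket.get(tokens[pos], 0) + 1

Because counts are updated strictly after scoring, every hint depends only on prefix state, which is what the causality claim in the Compliance Notes below refers to.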

Compliance Notes

  • Train time: all logs stop under the 600s training cap.
  • Eval time: all logs finish under the 600s eval cap.
  • Artifact size: all logs are below 16,000,000 bytes.
  • No network during train/eval: runtime uses local dataset shards, local tokenizer, and local Python/C helpers only.
  • Validation data is not accessed during training.
  • N-gram tilt is causal and score-time only; it reads the validation stream left-to-right and emits each hint from prefix state before observing the target.
  • TTT is score-first: each chunk is scored before its loss is used for adapter updates (see the sketch below).
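
To make the score-first ordering concrete, here is a hedged sketch of a chunked TTT eval loop using TTT_CHUNK_SIZE=48. The model and optimizer are placeholders for the LoRA-adapted model and its adapter optimizer, not the submission's actual interfaces:

import torch
import torch.nn.functional as F

def score_first_ttt(model, optimizer, tokens, chunk_size=48):
    # Hypothetical sketch of score-first test-time training: each chunk
    # is scored with the current adapter weights BEFORE its loss is used
    # to update them, so no token is evaluated by a model that has
    # already trained on it. `tokens` is a 1-D LongTensor.
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens) - 1, chunk_size):
        chunk = tokens[start:start + chunk_size + 1]
        inputs, targets = chunk[:-1].unsqueeze(0), chunk[1:]
        logits = model(inputs).squeeze(0)           # (len, vocab)
        loss = F.cross_entropy(logits, targets)
        total_nll += loss.item() * targets.numel()  # record the score first
        n_scored += targets.numel()
        optimizer.zero_grad()
        loss.backward()                             # then adapt on this chunk
        optimizer.step()
    return total_nll / n_scored  # per-token NLL in nats; bpb needs byte counts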

Reproduction

Run from this record folder or from the repository root. The run script sets the final default hyperparameters.

# optional: install Python dependencies
pip install -r requirements.txt

The CaseOps dataset can be downloaded with:

huggingface-cli download romeerp/parameter-golf-caseops-v1 \
  --repo-type dataset \
  --local-dir ./data/datasets/fineweb10B_sp8192_caseops/

Then run the final configuration:

SEED=42 ./run.sh

The expected data root is:

./data/datasets/fineweb10B_sp8192_caseops/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved

run.sh also supports overriding:

DATA_PATH=/path/to/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
TOKENIZER_PATH=/path/to/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
SEED=42 ./run.sh

Files

  • train_gpt.py: self-contained training, quantization, serialization, TTT eval, and in-timer n-gram tilt logic.
  • run.sh: final record-candidate command wrapper.
  • submission.json: leaderboard metadata.
  • requirements.txt: Python dependencies beyond the base environment.
  • tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model: tokenizer used by the CaseOps dataset.
  • lossless_caps.py and prepare_caseops_data.py: CaseOps transform and dataset-prep helpers.
  • train_seed42.log, train_seed0.log: the final g32/top4 logs backing the claimed record.
  • train_seed1234.log: a closely related seed-1234 run included in the results table; it used the earlier LQER top1/rank4/group64 settings.

Lineage and Credits

This submission is a recombination/tuning entry. Key public sources:

  • PR #2018 by simon-marcus: Gated XSA stack, token-only in-timer n-gram tilt, and the main code lineage.
  • PR #2060 by S0urC10ud: LQER/GPTQ retune ported here where supported.
  • PR #2007 by Elubrazione: Long-context/QK/AsymLogit lineage used by later stacks.
  • PR #1967 by ndokutovich: V21 + n-gram tilt + LeakyReLU base lineage.
  • PR #1948 by TimS-ml: LeakyReLU squared slope sweep.
  • PR #1953 by andrewbaggio1: TTT/QK tuning lineage.
  • PR #1923 by jorge-asenjo: asymmetric logit rescale.
  • PR #1908 by romeerp: AWQ-lite mixed precision.
  • PR #1855 by codemath3000: per-group compression and strong SP8192 stack.
  • PR #1514 by codemath3000: strict token-only n-gram precedent.
  • PR #1145 by AnirudhRahul: n-gram tilt with closed-form probability renormalization.
