
Non-record: Eval-time lever ablations on SP8192 absolute-RoPE stack (companion to PR #1716)#1718

Open
himanshudongre wants to merge 1 commit into openai:main from himanshudongre:non-record/2026-04-18-eval-time-ablations

Conversation

@himanshudongre

Structured documentation of the ablation path behind my record PR #1716 (3-seed mean 1.07882 bpb). Covers what was tested, what worked, what was killed, and the architectural reason behind each null result. The point of this submission is signal quality, not a leaderboard number — filing structured evidence so that the next person exploring this branch of the design space does not re-learn the same lessons at full training cost.

TL;DR

| Lever | Result | Verdict |
|---|---|---|
| `BIGRAM_DIM = 32` | pre-quant −0.0002 bpb, consistent | ✅ in record |
| Path A v3 passthrough quantization | 40 KB artifact savings at 0 bpb cost | ✅ record mechanism |
| `TTT_EPOCHS = 4` | Δ −0.00009 bpb (0.06σ, noise) | ❌ saturated |
| `EVAL_SEQ_LEN = 4096`, stride=128 | pre-quant −0.00509, sliding +0.00555 | ❌ OOD scored-tail positions |
| SWA w=1024 training (complete 2×2 factorial) | all configs strictly worse than baseline | ❌ context cap dominates |
| `TRAIN_SEQ_LEN = 4096` | pre-quant −0.00196, sliding +0.00435, TTT +0.00428 | ❌ position-depth vs. breadth tradeoff |
| QAT v3 (matrices-only int6 fake-quant) | TTT diverged to 1.48 | ❌ QAT × score-first-TTT catastrophic interaction |
| Adaptive Hadamard GPTQ | null on Muon sub-Gaussian weights | ❌ literature benefit doesn't transfer |
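The passthrough-quantization row above is the record mechanism. As a minimal sketch of what "passthrough" int8 storage with an LZMA wrapper looks like (illustrative names only; this is an assumed mechanism, not the repo's Path A v3 code): weights are stored as int8 plus a per-tensor scale in the artifact, dequantized at load, and the serialized bytes are further shrunk with stdlib LZMA.

```python
import lzma
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: store int8 values + one float scale."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip a small weight matrix and LZMA-compress the int8 payload.
w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
q, s = quantize_int8(w)
payload = lzma.compress(q.tobytes())  # the "LZMA wrapper" on the serialized bytes
max_err = float(np.abs(dequantize(q, s) - w).max())  # bounded by scale / 2
```

The int8 payload is already 4× smaller than fp32 before LZMA; the per-weight round-trip error is at most half the quantization step, which is consistent with a ~0 bpb cost when the step is small relative to weight scale.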

Structural finding

Every attempt to exploit the eval-time working-memory asymmetry (16 MB artifact cap vs 640 GB eval HBM) via the absolute-position RoPE architecture fails for the same architectural reason: sliding eval specifically scores tokens at a narrow tail-position band, and any technique that trades position depth (more samples per position) for position breadth (more positions covered) regresses there. Three independent experiments (EVAL_SEQ_LEN=4096, SWA training, TRAIN_SEQ_LEN=4096) produced the same pre-quant-wins-but-sliding-loses inversion.
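The tail-band effect can be made concrete with a toy helper (hypothetical; the repo's eval harness is not shown here). Assuming the usual sliding-window semantics, where a window of length `window` advances by `stride` and only the newly entered tokens are scored, every step after the first scores a fixed narrow band of in-window positions:

```python
def scored_band(window: int, stride: int) -> range:
    """In-window positions whose tokens are scored at each sliding step.

    With a sliding window of length `window` advancing by `stride`, every
    step after the first scores only the last `stride` tokens, i.e. the
    absolute in-window positions [window - stride, window).
    """
    return range(window - stride, window)

# EVAL_SEQ_LEN=4096 with stride=128: all steady-state scoring happens at
# in-window positions 3968..4095, a 128-position tail band.
band = scored_band(4096, 128)
```

Any training change that spreads capacity across more positions (breadth) instead of deepening these tail positions improves pre-quant loss but regresses exactly where sliding eval looks, matching the inversion reported across the three experiments.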

The architecturally correct next step is relative-position attention (ALiBi / NoPE / retrieval) where scored-token position becomes irrelevant. That is my next experiment.

Rule compliance

Every probe satisfies C1-C4 (causality, normalized softmax, score-before-update, single-pass) per Issue #1017. No SLOT, no ETLB, no pre-quant TTT on val, no n-gram cache (only prototyped in patches/ngram_cache_eval.py, not submitted).

Files

  • README.md — full writeup with mechanism analysis for each null result
  • submission.json — structured findings metadata
  • adaptive_hadamard_gptq.py — unit-tested implementation validating the null from §3.4
  • kv_cache_chain.py — early prototype (superseded by findings in §5)
  • patches/ — patches documenting each experiment (Path A v3, QAT v3, SWA, n-gram cache prototype, LZMA wrapper)
  • logs/ — 10 training and eval logs across the ablations

Record path: records/track_non_record_16mb/2026-04-18_EvalTimeAblations_SP8192/

…companion to PR openai#1716)

Structured documentation of the ablation path behind my record PR openai#1716 (3-seed mean
1.07882 bpb). Covers what was tested, what worked, what was killed, and the
architectural reason behind each null result.

TL;DR:
- CONFIRMED: BIGRAM_DIM=32 + Path A v3 aggressive passthrough quantization (int8
  control tensors + int8 small matrices + LZMA code wrapper) fits 15.99 MB at 0
  bpb cost; this is the record mechanism.
- NULL: TTT_EPOCHS=4 (Δ -0.00009 bpb, saturated), EVAL_SEQ_LEN=4096 + stride=128
  (sliding +0.00555 due to OOD scored-tail positions), SWA w=1024 training 2x2
  factorial (all configs strictly worse than baseline), TRAIN_SEQ_LEN=4096
  direct training (position-depth-vs-breadth tradeoff: pre-quant better but
  sliding +0.00435), QAT v3 (pre-quant +0.015, TTT diverged to 1.48 via
  QAT x score-first-TTT lattice interaction), Adaptive Hadamard GPTQ (null on
  Muon sub-Gaussian weights).

Structural finding: every attempt to exploit the eval-time working-memory
asymmetry via the absolute-position RoPE architecture fails because sliding
specifically measures position-depth at a narrow tail-band. The architecturally
correct fix is relative-position attention (ALiBi / NoPE), not more aggressive
positional extension. That is the next experiment.

Includes 10 logs, 5 patches, 2 prototype modules, and full mechanism analysis
for each null so the next person exploring this branch does not re-learn the
same lessons at full training cost.
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 20, 2026
Drops RoPE entirely from the SmearGate + AttnOutGate + QK525 frontier
to test whether absolute-position bias is bottlenecking the PR openai#1700
TTT + PR openai#1667 gates stack at ~1.071 BPB. PR openai#1718 explicitly flagged
relative-position attention as the next architectural axis, and no PR
has tried NoPE at frontier.

ALiBi was the first choice, but FA3
(Dao-AILab/flash-attention/hopper/flash_attn_interface.py) has no
alibi_slopes parameter, and FA2 fallback breaks the 600s budget under
TTT. NoPE is the cheapest position-axis test under FA3.

NOPE env knob (default 1) gates apply_rotary_emb in three attn paths:
forward(), _block_with_lora(), _parallel_block_with_lora(). Rotary
module is still constructed so warmup calls remain harmless and the
diff is reversible by NOPE=0 (reproduces Stage 2 numerics). Zero new
params, submission size unchanged.
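A minimal sketch of the env-knob gating the commit describes (names and call shape are assumptions from the commit text; the actual attention code is not in this PR). The knob is read once, defaults to NoPE on, and `NOPE=0` restores the rotary path:

```python
import os

# Read the knob once at import; default "1" means NoPE (rotary disabled),
# NOPE=0 restores RoPE and reproduces the original numerics.
NOPE = int(os.environ.get("NOPE", "1"))

def maybe_rotary(q, k, apply_rotary_emb, *rotary_args):
    """Gate RoPE application on the NOPE env knob.

    Sketch of the pattern at the three attention call sites mentioned
    above; the rotary module stays constructed either way, so this is a
    zero-parameter, fully reversible diff.
    """
    if NOPE:
        return q, k  # NoPE: skip rotation; the causal mask alone orders tokens
    return apply_rotary_emb(q, k, *rotary_args)
```

Because the rotary module is still built, warmup calls and checkpoints are unaffected; only the application of the rotation is skipped.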
