
Non-record: Eval-time lever ablations on SP8192 absolute-RoPE stack (companion to PR #1716)#1718

Open
himanshudongre wants to merge 1 commit into openai:main from himanshudongre:non-record/2026-04-18-eval-time-ablations

Conversation

@himanshudongre

Structured documentation of the ablation path behind my record PR #1716 (3-seed mean 1.07882 bpb). Covers what was tested, what worked, what was killed, and the architectural reason behind each null result. The point of this submission is signal quality, not a leaderboard number — filing structured evidence so that the next person exploring this branch of the design space does not re-learn the same lessons at full training cost.

TL;DR

| Lever | Result | Verdict |
|---|---|---|
| `BIGRAM_DIM = 32` | pre-quant −0.0002 bpb, consistent | ✅ in record |
| Path A v3 passthrough quantization | 40 KB artifact savings at 0 bpb cost | ✅ record mechanism |
| `TTT_EPOCHS = 4` | Δ −0.00009 bpb (0.06σ, noise) | ❌ saturated |
| `EVAL_SEQ_LEN = 4096`, stride=128 | pre-quant −0.00509, sliding +0.00555 | ❌ OOD scored-tail positions |
| SWA w=1024 training (complete 2×2 factorial) | all configs strictly worse than baseline | ❌ context cap dominates |
| `TRAIN_SEQ_LEN = 4096` | pre-quant −0.00196, sliding +0.00435, TTT +0.00428 | ❌ position-depth vs. breadth tradeoff |
| QAT v3 (matrices-only int6 fake-quant) | TTT diverged to 1.48 | ❌ QAT × score-first-TTT catastrophic interaction |
| Adaptive Hadamard GPTQ | null on Muon sub-Gaussian weights | ❌ literature benefit doesn't transfer |
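The passthrough-quantization row above is the record mechanism. As a minimal sketch of what "passthrough" int8 storage with an LZMA wrapper looks like (illustrative names only; this is an assumed mechanism, not the repo's Path A v3 code): weights are stored as int8 plus a per-tensor scale in the artifact, dequantized at load, and the serialized bytes are further shrunk with stdlib LZMA.

```python
import lzma
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: store int8 values + one float scale."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip a small weight matrix and LZMA-compress the int8 payload.
w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
q, s = quantize_int8(w)
payload = lzma.compress(q.tobytes())  # the "LZMA wrapper" on the serialized bytes
max_err = float(np.abs(dequantize(q, s) - w).max())  # bounded by scale / 2
```

The int8 payload is already 4× smaller than fp32 before LZMA; the per-weight round-trip error is at most half the quantization step, which is consistent with a ~0 bpb cost when the step is small relative to weight scale.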

Structural finding

Every attempt to exploit the eval-time working-memory asymmetry (16 MB artifact cap vs 640 GB eval HBM) via the absolute-position RoPE architecture fails for the same architectural reason: sliding eval specifically scores tokens at a narrow tail-position band, and any technique that trades position depth (more samples per position) for position breadth (more positions covered) regresses there. Three independent experiments (EVAL_SEQ_LEN=4096, SWA training, TRAIN_SEQ_LEN=4096) produced the same pre-quant-wins-but-sliding-loses inversion.
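The tail-band effect can be made concrete with a toy helper (hypothetical; the repo's eval harness is not shown here). Assuming the usual sliding-window semantics, where a window of length `window` advances by `stride` and only the newly entered tokens are scored, every step after the first scores a fixed narrow band of in-window positions:

```python
def scored_band(window: int, stride: int) -> range:
    """In-window positions whose tokens are scored at each sliding step.

    With a sliding window of length `window` advancing by `stride`, every
    step after the first scores only the last `stride` tokens, i.e. the
    absolute in-window positions [window - stride, window).
    """
    return range(window - stride, window)

# EVAL_SEQ_LEN=4096 with stride=128: all steady-state scoring happens at
# in-window positions 3968..4095, a 128-position tail band.
band = scored_band(4096, 128)
```

Any training change that spreads capacity across more positions (breadth) instead of deepening these tail positions improves pre-quant loss but regresses exactly where sliding eval looks, matching the inversion reported across the three experiments.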

The architecturally correct next step is relative-position attention (ALiBi / NoPE / retrieval) where scored-token position becomes irrelevant. That is my next experiment.

Rule compliance

Every probe satisfies C1-C4 (causality, normalized softmax, score-before-update, single-pass) per Issue #1017. No SLOT, no ETLB, no pre-quant TTT on val, no n-gram cache (only prototyped in patches/ngram_cache_eval.py, not submitted).

Files

  • README.md — full writeup with mechanism analysis for each null result
  • submission.json — structured findings metadata
  • adaptive_hadamard_gptq.py — unit-tested implementation validating the null from §3.4
  • kv_cache_chain.py — early prototype (superseded by findings in §5)
  • patches/ — patches documenting each experiment (Path A v3, QAT v3, SWA, n-gram cache prototype, LZMA wrapper)
  • logs/ — 10 training and eval logs across the ablations

Record path: records/track_non_record_16mb/2026-04-18_EvalTimeAblations_SP8192/

…companion to PR openai#1716)

Structured documentation of the ablation path behind my record PR openai#1716 (3-seed mean
1.07882 bpb). Covers what was tested, what worked, what was killed, and the
architectural reason behind each null result.

TL;DR:
- CONFIRMED: BIGRAM_DIM=32 + Path A v3 aggressive passthrough quantization (int8
  control tensors + int8 small matrices + LZMA code wrapper) fits 15.99 MB at 0
  bpb cost; this is the record mechanism.
- NULL: TTT_EPOCHS=4 (Δ -0.00009 bpb, saturated), EVAL_SEQ_LEN=4096 + stride=128
  (sliding +0.00555 due to OOD scored-tail positions), SWA w=1024 training 2x2
  factorial (all configs strictly worse than baseline), TRAIN_SEQ_LEN=4096
  direct training (position-depth-vs-breadth tradeoff: pre-quant better but
  sliding +0.00435), QAT v3 (pre-quant +0.015, TTT diverged to 1.48 via
  QAT x score-first-TTT lattice interaction), Adaptive Hadamard GPTQ (null on
  Muon sub-Gaussian weights).

Structural finding: every attempt to exploit the eval-time working-memory
asymmetry via the absolute-position RoPE architecture fails because sliding
specifically measures position-depth at a narrow tail-band. The architecturally
correct fix is relative-position attention (ALiBi / NoPE), not more aggressive
positional extension. That is the next experiment.

Includes 10 logs, 5 patches, 2 prototype modules, and full mechanism analysis
for each null so the next person exploring this branch does not re-learn the
same lessons at full training cost.
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 20, 2026
Drops RoPE entirely from the SmearGate + AttnOutGate + QK525 frontier
to test whether absolute-position bias is bottlenecking the PR openai#1700
TTT + PR openai#1667 gates stack at ~1.071 BPB. PR openai#1718 explicitly flagged
relative-position attention as the next architectural axis, and no PR
has tried NoPE at frontier.

ALiBi was the first choice, but FA3
(Dao-AILab/flash-attention/hopper/flash_attn_interface.py) has no
alibi_slopes parameter, and FA2 fallback breaks the 600s budget under
TTT. NoPE is the cheapest position-axis test under FA3.

NOPE env knob (default 1) gates apply_rotary_emb in three attn paths:
forward(), _block_with_lora(), _parallel_block_with_lora(). Rotary
module is still constructed so warmup calls remain harmless and the
diff is reversible by NOPE=0 (reproduces Stage 2 numerics). Zero new
params, submission size unchanged.
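A minimal sketch of the env-knob gating the commit describes (names and call shape are assumptions from the commit text; the actual attention code is not in this PR). The knob is read once, defaults to NoPE on, and `NOPE=0` restores the rotary path:

```python
import os

# Read the knob once at import; default "1" means NoPE (rotary disabled),
# NOPE=0 restores RoPE and reproduces the original numerics.
NOPE = int(os.environ.get("NOPE", "1"))

def maybe_rotary(q, k, apply_rotary_emb, *rotary_args):
    """Gate RoPE application on the NOPE env knob.

    Sketch of the pattern at the three attention call sites mentioned
    above; the rotary module stays constructed either way, so this is a
    zero-parameter, fully reversible diff.
    """
    if NOPE:
        return q, k  # NoPE: skip rotation; the causal mask alone orders tokens
    return apply_rotary_emb(q, k, *rotary_args)
```

Because the rotary module is still built, warmup calls and checkpoints are unaffected; only the application of the rotation is skipped.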
