Non-record: Eval-time lever ablations on SP8192 absolute-RoPE stack (companion to PR #1716)#1718
Open
himanshudongre wants to merge 1 commit into openai:main from
Conversation
…companion to PR openai#1716)

Structured documentation of the ablation path behind my record PR openai#1716 (3-seed mean 1.07882 bpb). Covers what was tested, what worked, what was killed, and the architectural reason behind each null result.

TL;DR:
- CONFIRMED: BIGRAM_DIM=32 + Path A v3 aggressive passthrough quantization (int8 control tensors + int8 small matrices + LZMA code wrapper) fits 15.99 MB at 0 bpb cost; this is the record mechanism.
- NULL: TTT_EPOCHS=4 (Δ -0.00009 bpb, saturated); EVAL_SEQ_LEN=4096 + stride=128 (sliding +0.00555 due to OOD scored-tail positions); SWA w=1024 training 2x2 factorial (all configs strictly worse than baseline); TRAIN_SEQ_LEN=4096 direct training (position-depth-vs-breadth tradeoff: pre-quant better but sliding +0.00435); QAT v3 (pre-quant +0.015, TTT diverged to 1.48 via QAT x score-first-TTT lattice interaction); Adaptive Hadamard GPTQ (null on Muon sub-Gaussian weights).

Structural finding: every attempt to exploit the eval-time working-memory asymmetry via the absolute-position RoPE architecture fails because sliding specifically measures position-depth at a narrow tail-band. The architecturally correct fix is relative-position attention (ALiBi / NoPE), not more aggressive positional extension. That is the next experiment.

Includes 10 logs, 5 patches, 2 prototype modules, and full mechanism analysis for each null so the next person exploring this branch does not re-learn the same lessons at full training cost.
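The int8 + LZMA passthrough mechanism described above can be sketched roughly as follows. This is a minimal illustration of symmetric per-tensor int8 quantization plus an LZMA payload wrapper; the tensor shapes, function names, and layout here are illustrative assumptions, not the actual submission format:

```python
import lzma
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 values + one fp32 scale."""
    scale = np.abs(w).max() / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q.astype(np.float32) * scale

# Illustrative small matrix, standing in for a control tensor / small weight.
rng = np.random.default_rng(0)
w = rng.normal(size=(32, 64)).astype(np.float32)
q, s = quantize_int8(w)

# Round-trip error of symmetric quantization is bounded by half a step.
err = np.abs(dequantize(q, s) - w).max()

# LZMA wrapper: the compressed int8 payload is what counts against the
# submission size cap, and decompresses losslessly at load time.
payload = lzma.compress(q.tobytes(), preset=9)
```

The key property this sketch demonstrates is why the scheme can be "0 bpb cost": the quantization error stays below half a quantization step per element, and the LZMA wrapper is lossless on the int8 payload.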
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request on Apr 20, 2026:
Drops RoPE entirely from the SmearGate + AttnOutGate + QK525 frontier to test whether absolute-position bias is bottlenecking the PR openai#1700 TTT + PR openai#1667 gates stack at ~1.071 BPB. PR openai#1718 explicitly flagged relative-position attention as the next architectural axis, and no PR has tried NoPE at frontier. ALiBi was the first choice, but FA3 (Dao-AILab/flash-attention/hopper/flash_attn_interface.py) has no alibi_slopes parameter, and FA2 fallback breaks the 600s budget under TTT. NoPE is the cheapest position-axis test under FA3. NOPE env knob (default 1) gates apply_rotary_emb in three attn paths: forward(), _block_with_lora(), _parallel_block_with_lora(). Rotary module is still constructed so warmup calls remain harmless and the diff is reversible by NOPE=0 (reproduces Stage 2 numerics). Zero new params, submission size unchanged.
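The env-knob gating described in that commit can be sketched as follows. `NOPE` and `apply_rotary_emb` are named in the commit message; the helper function and module structure here are assumptions for illustration, not the actual diff:

```python
import os

# NOPE=1 (the stated default) skips rotary application in the attention
# paths; NOPE=0 restores RoPE and should reproduce the baseline numerics.
NOPE = int(os.environ.get("NOPE", "1"))

def maybe_rotary(q, k, rotary, nope=NOPE):
    """Apply rotary embeddings only when the NOPE knob is off.

    The rotary module is still constructed and passed in regardless, so
    warmup calls that touch it remain harmless and the change is fully
    reversible via the env var, with zero new parameters.
    """
    if nope:
        return q, k  # NoPE: no explicit positional encoding on q/k
    return rotary(q), rotary(k)
```

A call site inside each of the three attention paths would then route through this helper instead of calling the rotary module directly.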
Structured documentation of the ablation path behind my record PR #1716 (3-seed mean 1.07882 bpb). Covers what was tested, what worked, what was killed, and the architectural reason behind each null result. The point of this submission is signal quality, not a leaderboard number — filing structured evidence so that the next person exploring this branch of the design space does not re-learn the same lessons at full training cost.
TL;DR
- BIGRAM_DIM = 32 — CONFIRMED: with Path A v3 aggressive passthrough quantization (int8 control tensors + int8 small matrices + LZMA code wrapper), fits 15.99 MB at 0 bpb cost; this is the record mechanism.
- TTT_EPOCHS = 4 — NULL: Δ -0.00009 bpb, saturated.
- EVAL_SEQ_LEN = 4096, stride=128 — NULL: sliding +0.00555 due to OOD scored-tail positions.
- TRAIN_SEQ_LEN = 4096 — NULL: position-depth-vs-breadth tradeoff; pre-quant better but sliding +0.00435.

Structural finding
Every attempt to exploit the eval-time working-memory asymmetry (16 MB artifact cap vs 640 GB eval HBM) via the absolute-position RoPE architecture fails for the same architectural reason: sliding eval specifically scores tokens at a narrow tail-position band, and any technique that trades position depth (more samples per position) for position breadth (more positions covered) regresses there. Three independent experiments (EVAL_SEQ_LEN=4096, SWA training, TRAIN_SEQ_LEN=4096) produced the same pre-quant-wins-but-sliding-loses inversion.
The architecturally correct next step is relative-position attention (ALiBi / NoPE / retrieval) where scored-token position becomes irrelevant. That is my next experiment.
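Why strided sliding eval concentrates scoring at a narrow tail band can be seen in a minimal sketch of the scoring schedule (window and stride values taken from the null result above; the token stream and function are synthetic illustrations):

```python
def scored_positions(n_tokens: int, seq_len: int, stride: int):
    """Within-window positions at which strided sliding eval scores tokens.

    The first window scores every position 0..seq_len-1. Each subsequent
    window advances by `stride` and scores only its newly revealed tokens,
    which sit at the tail band [seq_len - stride, seq_len) of the window.
    """
    positions = list(range(min(seq_len, n_tokens)))  # first window
    start = stride
    while start + seq_len <= n_tokens:
        positions.extend(range(seq_len - stride, seq_len))  # tail band only
        start += stride
    return positions

pos = scored_positions(n_tokens=16384, seq_len=4096, stride=128)
# Beyond the first window, every scored token occupies one of the last 128
# window positions, so the metric is dominated by model quality at those
# late (deep) positions rather than by breadth of position coverage.
tail = [p for p in pos if p >= 4096 - 128]
```

This is the inversion mechanism: a change that spreads training signal across more positions (breadth) at the cost of fewer samples per late position (depth) can improve pre-quant loss while regressing the sliding metric, which samples almost exclusively from the tail band.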
Rule compliance
Every probe satisfies C1-C4 (causality, normalized softmax, score-before-update, single-pass) per Issue #1017. No SLOT, no ETLB, no pre-quant TTT on val, no n-gram cache (only prototyped in patches/ngram_cache_eval.py, not submitted).

Files
- README.md — full writeup with mechanism analysis for each null result
- submission.json — structured findings metadata
- adaptive_hadamard_gptq.py — unit-tested implementation validating the null from §3.4
- kv_cache_chain.py — early prototype (superseded by findings in §5)
- patches/ — patches documenting each experiment (Path A v3, QAT v3, SWA, n-gram cache prototype, LZMA wrapper)
- logs/ — 10 training and eval logs across the ablations

Record path:
records/track_non_record_16mb/2026-04-18_EvalTimeAblations_SP8192/