
Non-record: #1610 reproduction (Δ=+1.9e-5 BPB), n-gram posterior corrector negative result, quantized-eval-only path fix #1741

Open
amrayach wants to merge 2 commits into openai:main from amrayach:submission/pr1610-nonrecord-clean

Conversation

@amrayach

This package is intentionally narrow: it does not remix multiple frontier submissions into a new record claim. Instead, it reproduces one current frontier line to near-exact fidelity, tests one new adaptive corrector path against that reproduced baseline, and reports both the measured negative result and the eval-only fix required to obtain it.

Prior context

Previous submissions in this line: #1101 (pre-TTT anchor, 1.1290 BPB), #1307 (07c1 strict base proof vs merged #1019), #1598 (SP8192-D 5-seed evidence package).

Contributions

  1. Reproduction of #1610 (Record: VarLenAttn + PhasingTTT, val_bpb 1.0728 3-seed mean) on independent infrastructure. Seed-0 BPB 1.07218477 vs #1610's published seed-0 1.07216564 → Δ = +1.913×10⁻⁵. Run on 8× H100 80GB HBM3 SXM5 (RunPod) at this branch's commit 1765afc (which pins #1610 at upstream ca19195).
  2. Bounded negative result for a score-first n-gram posterior corrector layered on #1610's phased LoRA TTT eval path. All three tested (alpha, orders) configs degrade BPB, monotonically in alpha. Multi-order backoff provides no measurable benefit over single-order at the same blend weight.
  3. Bug fix in train_gpt.py's quantized-eval-only branch (two guards, at lines 3204 and 3259). Without them, EVAL_ONLY_QUANTIZED_PATH crashes on a None-model dereference. Surfaced while running the ablations in Contribution 2.

The reproduction is a credibility prerequisite for the negative-result claim, not a contribution in itself. The corrector formulation and its Section-III-compliance engineering are the only novel content. The bug fix is incidental.

Reproduction result

| Metric | Value |
| --- | --- |
| Our seed-0 BPB | 1.07218477 |
| Published #1610 seed-0 BPB | 1.07216564 |
| Δ vs published seed-0 | +1.913×10⁻⁵ |
| Eval wall-clock | 455.9 s |
| Artifact size | 15,999,394 bytes (606 B under the 16 MB cap) |

Training stopped at step 4,879 of 20,000 when the wall-clock budget MAX_WALLCLOCK_SECONDS=600 - GPTQ_RESERVE_SECONDS=13 ran out (by design in #1610). The GATE_A: FAIL line in the training log reflects our internal pipeline's 15,997,520-byte safety threshold, which sits below the cap to absorb code-size drift; the artifact itself passes the competition rule.

Corrector ablation

All three configurations run in eval-only mode against the reproduced seed-0 checkpoint; no retraining is involved.

| Run | α | Orders | BPB | Δ BPB (run − baseline; positive = worse) | Eval (s) |
| --- | --- | --- | --- | --- | --- |
| Baseline | 0.0 | — | 1.07218477 | 0 | 455.9 |
| 1a | 0.3 | [8] | 1.08876294 | +0.01658 | 462.8 |
| 1b | 0.3 | [5, 8, 12] | 1.08891256 | +0.01673 | 472.4 |
| 1c | 0.1 | [5, 8, 12] | 1.07430360 | +0.00212 | 465.8 |

The degradation at α=0.1 is ~1/8 of that at α=0.3, close to the (0.1/0.3)² = 1/9 ratio a locally quadratic penalty would predict, with no inflection toward improvement anywhere in the tested range. Structurally, TTT-LoRA and the n-gram corrector are both deterministic functions of the scored prefix x_{1..t-1}; adding alpha * log(q_prefix_ngram(v)) on top of logits that already encode P(x_t | x_{1..t-1}) under TTT adaptation over-counts the prefix evidence. This predicts the monotonic-in-α degradation and suggests a non-TTT eval pipeline might behave differently; the latter was not tested.
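For concreteness, here is a minimal sketch of the blend as described, assuming Laplace-smoothed counts over the V=8192 vocabulary (the function and variable names are illustrative, not the actual train_gpt.py identifiers):

```python
import torch

def blend_logits(logits: torch.Tensor, ngram_counts: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend model logits with a Laplace-smoothed n-gram log-posterior.

    logits       -- [V] scores that already encode P(x_t | x_{1..t-1})
                    under TTT adaptation
    ngram_counts -- [V] continuation counts for the current prefix context,
                    accumulated only from already-scored tokens
    alpha        -- blend weight (the ablation tested 0.1 and 0.3)
    """
    # Laplace smoothing keeps q(v) > 0 for every v, so log(q) stays finite and
    # the blend remains a full distribution over all V=8192 tokens (C2).
    q = (ngram_counts + 1.0) / (ngram_counts.sum() + ngram_counts.numel())
    # Full-[V] tensor add, not a gathered single-index correction.
    return logits + alpha * torch.log(q)
```

Because both terms condition on the same prefix, raising alpha only re-weights evidence the logits already contain, which is the over-counting mechanism blamed above for the monotonic degradation.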

This PR rules out one tested posterior-corrector path on a reproduced #1610-class phased-TTT stack; it does not claim that all n-gram or posterior correctors are ineffective.

Eval-only bug fix

In EVAL_ONLY_QUANTIZED_PATH mode, base_model, compiled_model, and compiled_forward_logits are all None (line 3188), but two downstream paths dereferenced them:

  1. The pre-quantization diagnostic timed_eval("diagnostic pre-quantization post-ema", ...) dereferenced compiled_model.forward_logits → AttributeError.
  2. The TTT-branch del eval_model, compiled_model cleanup referenced eval_model which was never bound in this mode → UnboundLocalError.

Fix: if not quantized_eval_only: guard on the diagnostic (line 3204), and extend the existing cleanup guard to cover this branch (line 3259). The post-quantization diagnostic still runs because it calls deserialize(h, device) directly and does not touch the None locals.
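In condensed form, the two guards look roughly like this (quantized_eval_only, timed_eval, compiled_model, and eval_model are the existing train_gpt.py names; the surrounding code is elided, so this is a sketch rather than the literal diff):

```python
# Around line 3204: in eval-only mode base_model / compiled_model /
# compiled_forward_logits are all None, so the pre-quantization diagnostic
# would dereference None and raise AttributeError without this guard.
if not quantized_eval_only:
    timed_eval("diagnostic pre-quantization post-ema", ...)

# Around line 3259: eval_model is only bound on the training path, so an
# unguarded cleanup raises UnboundLocalError in eval-only mode.
if not quantized_eval_only:
    del eval_model, compiled_model
```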

Compliance with Issue #1017 Section III

The compliance argument is walked line-by-line in the folder README under "Compliance with Issue #1017 Section III". Summary:

  • C1 (causal): PrefixNgramCorrector state (lines 15-58) populated only via update(x_t), which runs after scoring.
  • C2 (full distribution): Blend is logits + alpha * log(q_t) over full V=8192 (line 1122). Laplace init (line 23) guarantees q_t(v) > 0 for all v. Full [V] tensor add, not gathered single-index.
  • C3 (score-before-update): Bias collected (line 2564), score forward pass (line 2567), BPB accumulation (lines 2568-2582), then update(_tok) (line 2591). Explicit inline comment at line 2583: # Corrector: update state with scored tokens (score-before-update).
  • C4 (single pass): One forward pass over validation. Global SGD steps between chunks do not re-score prior positions. Corrector state is reset after global SGD.

Warmup uses synthetic tokens only, via a device-local RNG generator (lines 3324-3365). Timer starts at torch.cuda.synchronize(); t_ttt = time.perf_counter() (lines 3370-3371) after warmup closes.
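A minimal sketch of that warmup/timing pattern follows; model, the token shape, and the seed are illustrative stand-ins, while the device-local generator and the timer placement mirror the description above:

```python
import time
import torch

# Device-local generator: warmup randomness never touches the global RNG stream.
gen = torch.Generator(device="cuda")
gen.manual_seed(0)  # illustrative seed

# Synthetic warmup tokens only -- no validation data is consumed before timing.
warmup_tokens = torch.randint(0, 8192, (1, 1024), generator=gen, device="cuda")
_ = model(warmup_tokens)  # hypothetical warmup forward (warms kernels/compile caches)

# The TTT timer opens only once warmup work has fully drained on-device.
torch.cuda.synchronize()
t_ttt = time.perf_counter()
```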

The chunk-static bias approximation is a deliberate engineering choice: per-position bias would cost 32× more GPU forwards or a ~2 GB [B, S, V] dense tensor per batch per rank, both of which break the time/memory budget. It satisfies score-before-update at chunk granularity rather than per-position: the bias inside chunk c uses only tokens from chunks [0, c). This is documented explicitly in the corrector's docstring, and sketched below.
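Schematically, the chunk-granularity ordering looks like this (val_chunks, model, corrector, bits_per_byte, and the SGD helpers are stand-ins for the actual train_gpt.py machinery):

```python
bpb_sum = 0.0
for c, chunk in enumerate(val_chunks):
    bias = corrector.bias()          # frozen snapshot built from chunks [0, c) only
    logits = model(chunk) + bias     # score every position of chunk c
    bpb_sum += bits_per_byte(logits, chunk)
    corrector.update(chunk)          # fold chunk c in only after scoring (C3)
    if is_global_sgd_boundary(c):
        global_sgd_step()            # prior positions are never re-scored (C4)
        corrector.reset()            # corrector state reset after global SGD (C4)
```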

Scope

Single-seed (seed 0). The reproduction is compared against #1610's published seed-0 number (1.07216564), not the 3-seed mean. Multi-seed validation was descoped: given a +1.9×10⁻⁵ BPB delta against the matched seed and a monotonic +0.002 to +0.017 degradation across the corrector grid, additional seeds would refine variance estimates but are unlikely to flip either conclusion. The negative-result claim is therefore bounded to seed 0 of the reproduced checkpoint.

Out of scope in this package: α < 0.1, orders > 12, logistic-domain blends, non-TTT eval pipelines.

Artifacts

Self-contained in records/track_non_record_16mb/2026-04-19_pr1610_reproduction_corrector_negative/: train_gpt.py, submission.json, requirements.txt, raw train_seed0.log + three ablation_1[abc].log, machine-readable reproduction_summary.json and ablation_summary.json, plus provenance/ (commit SHA, env fingerprint, nvidia-smi). Training logs are raw; the training script writes compact metrics-only output by design.

Supplementary external archive: https://huggingface.co/amay01/parameter-golf-pr1610-reproduction-artifacts (141 MB tarball, MD5 caf8adf63d8c80965f6671beba95d7aa). Contains preserved checkpoints (final_model.int6.ptz, final_model.pt) and full intermediate artifacts. Not required to reproduce the headline number.

