
Non-record: #1610 reproduction (Δ=+1.9e-5 BPB), n-gram posterior corrector negative result, quantized-eval-only path fix #1741

Open
amrayach wants to merge 2 commits into openai:main from amrayach:submission/pr1610-nonrecord-clean

Conversation

@amrayach

This package is intentionally narrow: it does not remix multiple frontier submissions into a new record claim. Instead, it reproduces one current frontier line to near-exact fidelity, tests one new adaptive corrector path against that reproduced baseline, and reports both the measured negative result and the eval-only fix required to obtain it.

Prior context

Previous submissions in this line: #1101 (pre-TTT anchor, 1.1290 BPB), #1307 (07c1 strict base proof vs merged #1019), #1598 (SP8192-D 5-seed evidence package).

Contributions

  1. Reproduction of #1610 (Record: VarLenAttn + PhasingTTT, val_bpb 1.0728 3-seed mean) on independent infrastructure. Seed-0 BPB 1.07218477 vs #1610's published seed-0 1.07216564 → Δ = +1.913×10⁻⁵. Run on 8× H100 80GB HBM3 SXM5 (RunPod) at this branch's commit 1765afc (which pins #1610 at upstream ca19195).
  2. Bounded negative result for a score-first n-gram posterior corrector layered on #1610's phased LoRA TTT eval path. All three tested (alpha, orders) configs degrade BPB, monotonically in alpha. Multi-order backoff provides no measurable benefit over single-order at the same blend weight.
  3. Bug fix in train_gpt.py's quantized-eval-only branch (two guards, at lines 3204 and 3259). Without them, EVAL_ONLY_QUANTIZED_PATH crashes on a None-model dereference. Surfaced while running the ablations in Contribution 2.

The reproduction is a credibility prerequisite for the negative-result claim, not a contribution in itself. The corrector formulation and its Section-III-compliance engineering are the only novel content. The bug fix is incidental.

Reproduction result

| Metric | Value |
| --- | --- |
| Our seed-0 BPB | 1.07218477 |
| Published #1610 seed-0 BPB | 1.07216564 |
| Δ vs published seed-0 | +1.913×10⁻⁵ |
| Eval wall-clock | 455.9 s |
| Artifact size | 15,999,394 bytes (606 B under the 16 MB cap) |

Training stopped at step 4,879 of 20,000 when the wall-clock budget MAX_WALLCLOCK_SECONDS=600 - GPTQ_RESERVE_SECONDS=13 ran out (by design in #1610). The GATE_A: FAIL line in the training log reflects our internal pipeline's 15,997,520-byte safety threshold, which sits below the cap to absorb code-size drift; the artifact itself passes the competition rule.

Corrector ablation

All three configurations run in eval-only mode against the reproduced seed-0 checkpoint; no retraining is involved.

| Run | α | Orders | BPB | Δ BPB (run − baseline; positive = worse) | Eval (s) |
| --- | --- | --- | --- | --- | --- |
| Baseline | 0.0 | — | 1.07218477 | 0 | 455.9 |
| 1a | 0.3 | [8] | 1.08876294 | +0.01658 | 462.8 |
| 1b | 0.3 | [5, 8, 12] | 1.08891256 | +0.01673 | 472.4 |
| 1c | 0.1 | [5, 8, 12] | 1.07430360 | +0.00212 | 465.8 |

The degradation at α=0.1 is ~1/8 of that at α=0.3, close to the (0.1/0.3)² = 1/9 ratio a locally quadratic penalty would predict, with no inflection toward improvement anywhere in the tested range. Structurally, TTT-LoRA and the n-gram corrector are both deterministic functions of the scored prefix x_{1..t-1}; adding alpha * log(q_prefix_ngram(v)) on top of logits that already encode P(x_t | x_{1..t-1}) under TTT adaptation over-counts the prefix evidence. This predicts the monotonic-in-α degradation and suggests a non-TTT eval pipeline might behave differently; the latter was not tested.
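For concreteness, here is a minimal sketch of the blend as described, assuming Laplace-smoothed counts over the V=8192 vocabulary (the function and variable names are illustrative, not the actual train_gpt.py identifiers):

```python
import torch

def blend_logits(logits: torch.Tensor, ngram_counts: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend model logits with a Laplace-smoothed n-gram log-posterior.

    logits       -- [V] scores that already encode P(x_t | x_{1..t-1})
                    under TTT adaptation
    ngram_counts -- [V] continuation counts for the current prefix context,
                    accumulated only from already-scored tokens
    alpha        -- blend weight (the ablation tested 0.1 and 0.3)
    """
    # Laplace smoothing keeps q(v) > 0 for every v, so log(q) stays finite and
    # the blend remains a full distribution over all V=8192 tokens (C2).
    q = (ngram_counts + 1.0) / (ngram_counts.sum() + ngram_counts.numel())
    # Full-[V] tensor add, not a gathered single-index correction.
    return logits + alpha * torch.log(q)
```

Because both terms condition on the same prefix, raising alpha only re-weights evidence the logits already contain, which is the over-counting mechanism blamed above for the monotonic degradation.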

This PR rules out one tested posterior-corrector path on a reproduced #1610-class phased-TTT stack; it does not claim that all n-gram or posterior correctors are ineffective.

Eval-only bug fix

In EVAL_ONLY_QUANTIZED_PATH mode, base_model, compiled_model, and compiled_forward_logits are all None (line 3188), but two downstream paths dereferenced them:

  1. The pre-quantization diagnostic timed_eval("diagnostic pre-quantization post-ema", ...) dereferenced compiled_model.forward_logits → AttributeError.
  2. The TTT-branch del eval_model, compiled_model cleanup referenced eval_model which was never bound in this mode → UnboundLocalError.

Fix: if not quantized_eval_only: guard on the diagnostic (line 3204), and extend the existing cleanup guard to cover this branch (line 3259). The post-quantization diagnostic still runs because it calls deserialize(h, device) directly and does not touch the None locals.
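In condensed form, the two guards look roughly like this (quantized_eval_only, timed_eval, compiled_model, and eval_model are the existing train_gpt.py names; the surrounding code is elided, so this is a sketch rather than the literal diff):

```python
# Around line 3204: in eval-only mode base_model / compiled_model /
# compiled_forward_logits are all None, so the pre-quantization diagnostic
# would dereference None and raise AttributeError without this guard.
if not quantized_eval_only:
    timed_eval("diagnostic pre-quantization post-ema", ...)

# Around line 3259: eval_model is only bound on the training path, so an
# unguarded cleanup raises UnboundLocalError in eval-only mode.
if not quantized_eval_only:
    del eval_model, compiled_model
```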

Compliance with Issue #1017 Section III

The compliance argument is walked line-by-line in the folder README under "Compliance with Issue #1017 Section III". Summary:

  • C1 (causal): PrefixNgramCorrector state (lines 15-58) populated only via update(x_t), which runs after scoring.
  • C2 (full distribution): Blend is logits + alpha * log(q_t) over full V=8192 (line 1122). Laplace init (line 23) guarantees q_t(v) > 0 for all v. Full [V] tensor add, not gathered single-index.
  • C3 (score-before-update): Bias collected (line 2564), score forward pass (line 2567), BPB accumulation (lines 2568-2582), then update(_tok) (line 2591). Explicit inline comment at line 2583: # Corrector: update state with scored tokens (score-before-update).
  • C4 (single pass): One forward pass over validation. Global SGD steps between chunks do not re-score prior positions. Corrector state is reset after global SGD.

Warmup uses synthetic tokens only, via a device-local RNG generator (lines 3324-3365). Timer starts at torch.cuda.synchronize(); t_ttt = time.perf_counter() (lines 3370-3371) after warmup closes.
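A minimal sketch of that warmup/timing pattern follows; model, the token shape, and the seed are illustrative stand-ins, while the device-local generator and the timer placement mirror the description above:

```python
import time
import torch

# Device-local generator: warmup randomness never touches the global RNG stream.
gen = torch.Generator(device="cuda")
gen.manual_seed(0)  # illustrative seed

# Synthetic warmup tokens only -- no validation data is consumed before timing.
warmup_tokens = torch.randint(0, 8192, (1, 1024), generator=gen, device="cuda")
_ = model(warmup_tokens)  # hypothetical warmup forward (warms kernels/compile caches)

# The TTT timer opens only once warmup work has fully drained on-device.
torch.cuda.synchronize()
t_ttt = time.perf_counter()
```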

The chunk-static bias approximation is a deliberate engineering choice: per-position bias would cost 32× more GPU forwards or a ~2 GB [B, S, V] dense tensor per batch per rank, both of which break the time/memory budget. It satisfies score-before-update at chunk granularity rather than per-position: the bias inside chunk c uses only tokens from chunks [0, c). This is documented explicitly in the corrector's docstring, and sketched below.
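Schematically, the chunk-granularity ordering looks like this (val_chunks, model, corrector, bits_per_byte, and the SGD helpers are stand-ins for the actual train_gpt.py machinery):

```python
bpb_sum = 0.0
for c, chunk in enumerate(val_chunks):
    bias = corrector.bias()          # frozen snapshot built from chunks [0, c) only
    logits = model(chunk) + bias     # score every position of chunk c
    bpb_sum += bits_per_byte(logits, chunk)
    corrector.update(chunk)          # fold chunk c in only after scoring (C3)
    if is_global_sgd_boundary(c):
        global_sgd_step()            # prior positions are never re-scored (C4)
        corrector.reset()            # corrector state reset after global SGD (C4)
```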

Scope

Single-seed (seed 0). The reproduction is compared against #1610's published seed-0 number (1.07216564), not the 3-seed mean. Multi-seed validation was descoped: given a +1.9×10⁻⁵ BPB delta against the matched seed and a monotonic +0.002 to +0.017 degradation across the corrector grid, additional seeds would refine variance estimates but are unlikely to flip either conclusion. The negative-result claim is therefore bounded to seed 0 of the reproduced checkpoint.

Out of scope in this package: α < 0.1, orders > 12, logistic-domain blends, non-TTT eval pipelines.

Artifacts

Self-contained in records/track_non_record_16mb/2026-04-19_pr1610_reproduction_corrector_negative/: train_gpt.py, submission.json, requirements.txt, raw train_seed0.log + three ablation_1[abc].log, machine-readable reproduction_summary.json and ablation_summary.json, plus provenance/ (commit SHA, env fingerprint, nvidia-smi). Training logs are raw; the training script writes compact metrics-only output by design.

Supplementary external archive: https://huggingface.co/amay01/parameter-golf-pr1610-reproduction-artifacts (141 MB tarball, MD5 caf8adf63d8c80965f6671beba95d7aa). Contains preserved checkpoints (final_model.int6.ptz, final_model.pt) and full intermediate artifacts. Not required to reproduce the headline number.

