
Non-record: #1610 reproduction (Δ=+1.9e-5 BPB), n-gram posterior corrector negative result, quantized-eval-only path fix #1740

Closed
amrayach wants to merge 74 commits into openai:main from amrayach:submission/pr1610-corrector

Conversation

@amrayach

This package is intentionally narrow: it does not remix multiple frontier submissions into a new record claim. Instead, it reproduces one current frontier line to near-exact fidelity, tests one new adaptive corrector path against that reproduced baseline, and reports both the measured negative result and the eval-only fix required to obtain it.

Prior context

Previous submissions in this line: #1101 (pre-TTT anchor, 1.1290 BPB), #1307 (07c1 strict base proof vs merged #1019), #1598 (SP8192-D 5-seed evidence package).

Contributions

  1. Reproduction of PR #1610 (Record: VarLenAttn + PhasingTTT, val_bpb 1.0728 3-seed mean) on independent infrastructure. Seed-0 BPB 1.07218477 vs #1610's published seed-0 1.07216564 → Δ = +1.913×10⁻⁵. Run on 8× H100 80GB HBM3 SXM5 (RunPod) at this branch's commit 1765afc (which pins #1610 at upstream ca19195).
  2. Bounded negative result for a score-first n-gram posterior corrector layered on #1610's phased LoRA TTT eval path. All three tested (alpha, orders) configs degrade BPB, monotonically in alpha. Multi-order backoff provides no measurable benefit over single-order at the same blend weight.
  3. Bug fix in train_gpt.py's quantized-eval-only branch (two guards, at lines 3204 and 3259). Without them, any run with EVAL_ONLY_QUANTIZED_PATH set crashes on a None-model dereference. Surfaced while running the ablations in Contribution 2.

The reproduction is a credibility prerequisite for the negative-result claim, not a contribution in itself. The corrector formulation and its Section-III-compliance engineering are the only novel content. The bug fix is incidental.

Reproduction result

| Metric | Value |
| --- | --- |
| Our seed-0 BPB | 1.07218477 |
| Published #1610 seed-0 BPB | 1.07216564 |
| Δ vs published seed-0 | +1.913×10⁻⁵ |
| Eval wall-clock | 455.9 s |
| Artifact size | 15,999,394 bytes (606 B under the 16 MB cap) |

Training stopped at step 4,879 of 20,000 when the wall-clock budget expired (MAX_WALLCLOCK_SECONDS=600 minus GPTQ_RESERVE_SECONDS=13, by design in #1610). The GATE_A: FAIL line in the training log reflects our internal pipeline's stricter 15,997,520-byte safety threshold (intended to absorb code-size drift); the artifact passes the competition's 16 MB rule.
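
For reference, BPB above is read as the standard bits-per-byte cross-entropy over the validation stream (an assumption about the harness's exact definition, stated only to fix notation for the ablation below):

$$\mathrm{BPB} = \frac{\sum_{t} -\log_2 p\!\left(x_t \mid x_{<t}\right)}{N_{\text{bytes}}}$$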

Corrector ablation

All three run in eval-only mode against the reproduced seed-0 checkpoint — no retraining.

| Run | α | Orders | BPB | Δ BPB (run − baseline; positive = worse) | Eval (s) |
| --- | --- | --- | --- | --- | --- |
| Baseline | 0.0 | n/a | 1.07218477 | 0 | 455.9 |
| 1a | 0.3 | [8] | 1.08876294 | +0.01658 | 462.8 |
| 1b | 0.3 | [5, 8, 12] | 1.08891256 | +0.01673 | 472.4 |
| 1c | 0.1 | [5, 8, 12] | 1.07430360 | +0.00212 | 465.8 |

The effect at α=0.1 is ~1/8 of the effect at α=0.3 — first-order linear in α, no inflection toward improvement. Structurally, TTT-LoRA and the n-gram corrector are both deterministic functions of the scored prefix x_{1..t-1}; adding alpha * log(q_prefix_ngram(v)) on top of logits that already encode P(x_t | x_{1..t-1}) under TTT adaptation over-counts the prefix evidence. This predicts the monotonic-in-α result and predicts a non-TTT eval pipeline might behave differently. The latter was not tested.
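
For concreteness, a minimal sketch of the path being ruled out, assuming a single-order corrector with Laplace smoothing and no backoff (class and function names here are hypothetical stand-ins, not the train_gpt.py implementation):

```python
import torch

class PrefixNgramCorrectorSketch:
    """Toy single-order variant (assumes order >= 2): Laplace-smoothed
    counts over the scored prefix only, exposing a dense [V] posterior."""

    def __init__(self, vocab_size: int, order: int = 8, laplace: float = 1.0):
        self.V, self.order, self.laplace = vocab_size, order, laplace
        self.counts: dict = {}   # (order-1)-gram context tuple -> [V] counts
        self.prefix: list = []   # tokens already scored

    def _ctx(self) -> tuple:
        return tuple(self.prefix[-(self.order - 1):])

    def posterior(self) -> torch.Tensor:
        # q_t over the full vocabulary; Laplace smoothing keeps q_t(v) > 0
        # for every v (the C2 requirement).
        c = self.counts.get(self._ctx(), torch.zeros(self.V))
        return (c + self.laplace) / (c.sum() + self.laplace * self.V)

    def update(self, x_t: int) -> None:
        # Called only AFTER x_t has been scored (C1/C3: score-before-update).
        ctx = self._ctx()
        if ctx not in self.counts:
            self.counts[ctx] = torch.zeros(self.V)
        self.counts[ctx][x_t] += 1
        self.prefix.append(x_t)

def blended_logits(logits: torch.Tensor, corrector: PrefixNgramCorrectorSketch,
                   alpha: float) -> torch.Tensor:
    # Full-[V] additive blend, not a gathered single index (C2). Both terms
    # are deterministic functions of x_{1..t-1}, which is the over-counting
    # argument above: under TTT, `logits` already adapted to the same prefix
    # that populated `counts`.
    return logits + alpha * torch.log(corrector.posterior())
```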

This PR rules out one tested posterior-corrector path on a reproduced #1610-class phased-TTT stack; it does not claim that all n-gram or posterior correctors are ineffective.

Eval-only bug fix

In EVAL_ONLY_QUANTIZED_PATH mode, base_model, compiled_model, and compiled_forward_logits are all None (line 3188), but two downstream paths dereferenced them:

  1. The pre-quantization diagnostic timed_eval("diagnostic pre-quantization post-ema", ...) dereferenced compiled_model.forward_logits → AttributeError.
  2. The TTT-branch del eval_model, compiled_model cleanup referenced eval_model which was never bound in this mode → UnboundLocalError.

Fix: if not quantized_eval_only: guard on the diagnostic (line 3204), and extend the existing cleanup guard to cover this branch (line 3259). The post-quantization diagnostic still runs because it calls deserialize(h, device) directly and does not touch the None locals.
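
A minimal, self-contained sketch of the guard shape (timed_eval and the model locals are stand-ins, not excerpts from train_gpt.py):

```python
import os
from types import SimpleNamespace

def timed_eval(label: str, forward_fn) -> None:
    # Stand-in for the real timed_eval helper; only the call shape matters.
    print(f"{label}: {forward_fn}")

quantized_eval_only = os.environ.get("EVAL_ONLY_QUANTIZED_PATH") == "1"

if quantized_eval_only:
    # As at line 3188: the model locals are never built in this mode.
    base_model = compiled_model = compiled_forward_logits = None
else:
    compiled_model = SimpleNamespace(forward_logits=lambda: None)  # stub model
    eval_model = compiled_model  # bound only on this path, as in the TTT branch

# Guard 1 (line 3204): unguarded, compiled_model.forward_logits raised
# AttributeError on None in quantized-eval-only mode.
if not quantized_eval_only:
    timed_eval("diagnostic pre-quantization post-ema", compiled_model.forward_logits)

# Guard 2 (line 3259): unguarded, `del eval_model` raised UnboundLocalError,
# since eval_model is never bound in quantized-eval-only mode.
if not quantized_eval_only:
    del eval_model, compiled_model
```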

Compliance with Issue #1017 Section III

The folder README walks compliance line-by-line under "Compliance with Issue #1017 Section III". Summary:

  • C1 (causal): PrefixNgramCorrector state (lines 15-58) populated only via update(x_t), which runs after scoring.
  • C2 (full distribution): Blend is logits + alpha * log(q_t) over full V=8192 (line 1122). Laplace init (line 23) guarantees q_t(v) > 0 for all v. Full [V] tensor add, not gathered single-index.
  • C3 (score-before-update): Bias collected (line 2564), score forward pass (line 2567), BPB accumulation (lines 2568-2582), then update(_tok) (line 2591). Explicit inline comment at line 2583: # Corrector: update state with scored tokens (score-before-update). See the loop sketch after this list.
  • C4 (single pass): One forward pass over validation. Global SGD steps between chunks do not re-score prior positions. Corrector state is reset after global SGD.
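
A hypothetical skeleton of how C1-C4 compose at chunk granularity, reusing the corrector interface sketched earlier (stand-in names and signatures; the real loop lives at the cited train_gpt.py lines):

```python
import math
import torch

def eval_with_corrector(chunks, forward_logits, corrector, alpha: float) -> float:
    """chunks: iterable of (tokens [S] LongTensor, n_bytes); forward_logits
    is assumed to return [S, V] logits with row i scoring tokens[i]."""
    total_nats, total_bytes = 0.0, 0
    for tokens, n_bytes in chunks:                        # one pass only (C4)
        # Chunk-static bias: frozen from state built on chunks [0, c).
        bias = alpha * torch.log(corrector.posterior())   # dense [V] (C2)
        logprobs = torch.log_softmax(forward_logits(tokens, bias), dim=-1)
        total_nats -= logprobs[torch.arange(len(tokens)), tokens].sum().item()
        total_bytes += n_bytes
        for tok in tokens.tolist():                       # update strictly after
            corrector.update(tok)                         # scoring (C1/C3)
    return total_nats / (math.log(2.0) * total_bytes)     # nats -> bits per byte
```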

Warmup uses synthetic tokens only, via a device-local RNG generator (lines 3324-3365). Timer starts at torch.cuda.synchronize(); t_ttt = time.perf_counter() (lines 3370-3371) after warmup closes.
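
A minimal sketch of that pattern (sizes and seed are placeholders):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Warmup legality: synthetic tokens from a device-local generator, so warmup
# never touches val_data.val_tokens before the official timer starts.
gen = torch.Generator(device=device)
gen.manual_seed(1234)  # placeholder seed
warmup_tokens = torch.randint(0, 8192, (1024,), generator=gen, device=device)
# ... warmup forward passes over warmup_tokens would run here ...

# Timer starts only after warmup closes, with queued kernels drained first.
if device == "cuda":
    torch.cuda.synchronize()
t_ttt = time.perf_counter()
```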

The chunk-static bias approximation is a deliberate engineering choice (per-position bias would cost 32× more GPU forwards or a ~2 GB [B, S, V] dense tensor per batch per rank, both breaking the time/memory budget). It satisfies score-before-update at chunk granularity rather than per-position — the bias inside chunk c uses only tokens from chunks [0, c). Explicit in the corrector's docstring.

Scope

Single-seed (seed 0). Reproduction is compared against #1610's published seed-0 number (1.07216564), not their 3-seed mean. Multi-seed validation was descoped: given a +1.9×10⁻⁵ BPB delta against the matched seed and monotonic +0.002 to +0.017 degradation across the corrector grid, additional seeds would refine variance but are unlikely to flip either direction. The negative-result claim is bounded to seed 0 of the reproduced checkpoint.

Out of scope in this package: α < 0.1, orders > 12, logistic-domain blends, non-TTT eval pipelines.

Artifacts

Self-contained in records/track_non_record_16mb/2026-04-19_pr1610_reproduction_corrector_negative/: train_gpt.py, submission.json, requirements.txt, raw train_seed0.log + three ablation_1[abc].log, machine-readable reproduction_summary.json and ablation_summary.json, plus provenance/ (commit SHA, env fingerprint, nvidia-smi). Training logs are raw; the training script writes compact metrics-only output by design.

Supplementary external archive: https://huggingface.co/amay01/parameter-golf-pr1610-reproduction-artifacts (141 MB tarball, MD5 caf8adf63d8c80965f6671beba95d7aa). Contains preserved checkpoints (final_model.int6.ptz, final_model.pt) and full intermediate artifacts. Not required to reproduce the headline number.

amrayach and others added 30 commits March 28, 2026 04:13
- 5-run A100 evidence package (baseline, seed42, LowerLR, warmdown, smoke)
- Campaign sessions 01-07 with templates and runbook
- Pre-TTT anchor diff analysis and root port-gap audit
- V100/fp16 compatibility via AMP_DTYPE auto-detection
- Agent coordination files (CLAUDE.md, AGENTS.md, AGENT_SYNC.md)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…te, pre-run

Self-contained anchor script porting the 2026-03-21 donor features onto the
root train_gpt.py skeleton. Features: SmearGate, BigramHash, XSA-last-4,
partial RoPE (16/64) with NTK scaling, LN scale, EMA, Muon/Adam weight decay,
mixed int6+zstd export, stride-64 sliding eval. SDPA attention backend
(no flash_attn_3 dependency). All 15 code review checks passed.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…constructors

The parameter was defined in Hyperparameters but hardcoded as 1024 in
CausalSelfAttention.__init__ instead of being passed through GPT → Block →
CausalSelfAttention. This would have caused a TypeError on model construction.
Found by Codex local validation (CPU forward pass).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ning

NGC PyTorch containers don't ship sentencepiece. Install via Pegasus
PyPI cache (with fallback to public PyPI) at container launch time.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The oom_kill was a Slurm memory limit (not GPU OOM). Default Slurm
memory allocation is too low for the anchor model's grad accumulation
buffer on a single GPU. Request 64G explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
8 tasks simultaneously pip-installing inside NGC container caused Slurm
OOM kill. Fix: only rank 0 installs, others wait 5s. Also request all
available node memory with --mem=0 in salloc mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The math backend is a slow fallback that PyTorch may pick for certain
shapes. The donor disables it. This was flagged as observation I1 in the
code review but left as-is for safety. Now that the smoke test confirms
flash SDP works, disable math for better throughput.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Sliding s64 val_bpb: 1.12904446 (target 1.123-1.128, landed at edge)
Steps: 6564, step_avg: 91.37ms, artifact: 15,751,324 bytes
Gap from donor (1.1248): +0.0042 BPB — entirely throughput (487 fewer steps)
Bottleneck: SDPA (FA2) vs donor's FA3, not model fidelity

Environment: 8xH100 SXM5, serv-3342, NGC 26.03 container, /fscratch data
All coordination docs updated for Session 04 handoff.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…thesis + doc sync

Delta 1 (GPTQ-lite percentile clip search):
- Sliding s64: 1.12941 (worse than anchor 1.12904 by +0.0004)
- Artifact: 16,219,752 bytes — OVER 16MB cap by 219KB
- Conclusion: clip search hurts zstd compressibility, export gap not clip-related

Novel approach synthesis (04b):
- 50 ideas from 8 parallel AI research queries (ChatGPT/Claude/Gemini/Perplexity)
- Top convergence signal: n-gram cache (4+ models)
- RFN thesis connection: low-rank residual correction (W ≈ Q + UV^T)

All coordination docs updated for Delta 2 (LeakyReLU²) pivot.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…re-run

Single change vs measured Session 03 anchor: MLP activation from relu^2 to
leaky_relu(0.5)^2. enable_math_sdp restored to True to match anchor measured
state (pre-563700f). All other code identical to anchor.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Delta 2 measured on 8xH100: sliding s64 val_bpb 1.12904123 (effectively
identical to anchor 1.12904446). Slightly better quantization metrics and
168KB smaller artifact, but +0.72ms/step slower throughput cancels gains.
Not a standalone graduating delta — keepable as a stack component.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…audit

Session 04 micro-delta sweep closed at 1 failed (GPTQ-lite) + 1 neutral
(LeakyReLU²). Session 05 opens as a three-part audit: throughput gap
decomposition (FA3 portability), pre-TTT stack-gap analysis vs the local
1.1194 record, and TTT correctness/integration planning.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Comprehensive prompt covering throughput audit, pre-TTT stack-gap audit,
and TTT correctness audit. Includes git, Pegasus, documentation, and
memory update conventions for session continuity.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ify FA3 scope

Make tool/MCP section "prefer if available" instead of hard requirements.
Reframe FA3 check as verify-presence not install-package.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Concrete implementation prompt for FW-1 (Flash Attention 3) as isolated
delta on anchor. Includes pre-implementation FA3 verification check,
exact code change sites, isolation constraints, and full convention set.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Phase 1 of revised Session 05: port anchor attention from SDPA to direct
flash_attn_interface (FA3) on NGC 25.02 container. Microbenchmark shows
11.44x kernel speedup. AGENT_SYNC updated with competitive landscape
analysis and 3-phase plan (FA3 → Full Hessian GPTQ → Novelty).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…pdate CLAUDE.md

- 05b: Full Hessian GPTQ isolated delta with strict source priority,
  preserved artifact format, calibration budget (target ≤30s)
- 05c: Training bundle (XSA-all + VE128 + SWA + warmdown3500) with bisect plan
- CLAUDE.md: add entry point protocol, Pegasus operational rules, FA3 container path
- AGENT_SYNC: updated with FA3 port results (neutral/slower), revised strategy
- FA3 README: updated with smoke results and saved container path

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Isolated quantization-only delta: replaces naive int6 per-row rounding
with Cholesky error-compensated GPTQ (block_size=128, actorder, percdamp=0.01).
Training code is identical to anchor. Post-training calibration collects
H=X^TX via 128 forward passes, then GPTQ quantizes column-by-column with
error propagation weighted by H^{-1}.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
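
(Aside, for readers following the commit trail: a toy sketch of the error-compensation idea, a hypothetical simplification with naive per-column scales, no blocking, and no actorder; it is not the PR openai#1019 code.)

```python
import torch

def gptq_quantize_sketch(W: torch.Tensor, H: torch.Tensor,
                         bits: int = 6, percdamp: float = 0.01) -> torch.Tensor:
    """W: [rows, cols] weights; H: [cols, cols] accumulated X^T X Hessian."""
    cols = W.shape[1]
    H = H + percdamp * H.diagonal().mean() * torch.eye(cols)  # diagonal damping
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    W, Q = W.clone(), torch.zeros_like(W)
    qmax = 2 ** (bits - 1) - 1
    for j in range(cols):
        scale = W[:, j].abs().max().clamp(min=1e-8) / qmax
        Q[:, j] = torch.clamp((W[:, j] / scale).round(), -qmax - 1, qmax) * scale
        # Error compensation: push this column's quantization error into the
        # not-yet-quantized columns, weighted by H^{-1}.
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Q
```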
… handoff docs

1xH100 smoke revealed 0.212 BPB roundtrip gap (27x worse than anchor).
GPTQ pipeline mechanics work (66 layers, 0 fallbacks, 4.2s) but
quantized weights reconstruct catastrophically. Must debug before 8xH100.

Updated: AGENT_SYNC, next-session, decisions, project-state.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…n path

The export-only replay showed 66/66 layers worse than naive regardless of
actorder or block_size, pointing to the upstream Hessian path as the root
cause. This patch aligns Hessian collection with PR openai#1019/openai#634 semantics:

- Divide accumulated H by num_batches (was raw sum — caused scale blowup)
- Add 1% diagonal damping in _finalize_hessians before quantization
- Run calibration forward pass under torch.autocast(bf16) to match training
- Accumulate Hessians on CPU to avoid GPU memory pressure

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Previous replay_ref_hfix still showed 66/66 worse layers with the Hessian
normalization fix. Rather than continuing to debug from symptoms, this
transplants PR openai#1019's complete GPTQ slice verbatim:

- collect_hessians: PR openai#1019 hook pattern with pre-init and param_name keys
- quantize_int6_gptq: verbatim from PR openai#1019 lines 1171-1224
- gptq_mixed_quantize_int6: direct param_name key lookup, PR openai#1019 quantizer

Source: pr-1019-gptq:records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/train_gpt.py

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Set GPTQ_AR_CALIB=1 to generate 64 autoregressive sequences (temp=0.8)
from the model itself instead of using training data for Hessian
collection. This matches PR openai#1019's actual calibration strategy.

Both paths available — training data (default) and AR self-gen (opt-in).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
amrayach and others added 28 commits April 1, 2026 15:28
…sbatch

- Brotli is now a hard import (not try/except). The lzma fallback caused
  cross-rank decompress failures when pip install raced across srun tasks.
  2/12 grid jobs (seed 42, ttt_qk5) crashed with LZMAError on brotli blobs.

- Removed dead lzma compress/decompress branches from the model export path.
  Code-wrapper self-compression (line 186) still uses lzma intentionally.

- New sbatch files with fixes:
  - 07c1_ttt_s1337_fixed: TTT seed 1337, 35min wallclock, forced pip install
  - 07c1_ttt_s2025_fixed: TTT seed 2025, 35min wallclock, forced pip install
  - 07c1_base_s42_fixed: baseline seed 42 rerun, forced pip install

  All use unconditional `pip install` instead of conditional import guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Bring AGENTS.md, AGENT_SYNC.md, project-state.md, decisions.md,
and next-session.md to the openai#1610-direct strategy. Add locked
execution plan (PLAN_PR1610_CORRECTOR.md Rev 3).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Exact copy from PR openai#1610 at SHA ca19195.
MD5: 57cfda2047b2c2a63ec10b99d704bfb0. 3379 lines, 139831 bytes.
This is the unmodified source base; corrector will be added in later commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Setup, seed-0 (Gate A), seed-1/2 (Gate B) subcommands with
published BPB verification targets and kill criteria.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Commit the current posterior-corrector working-tree state for PR openai#1610:
- train_gpt.py corrector path plus warmup legality fix
- LEGALITY_SPEC.md, DEPENDENCY_GATE.md, requirements.txt
- test_corrector.py and bench_corrector_cpu.py
- AGENT_SYNC.md closeout with audit measurements

The warmup path previously touched val_data.val_tokens before the official
eval timer. It now uses a device-local torch.Generator + torch.randint
synthetic tokens. 9/9 tests pass and the CPU bench projects 26.1s.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
… A/B

Zero-intervention 8xH100 pipeline: pod verify, SP8192 download,
Gate A seed-0 baseline, corrector ablations, 3-way decision point,
Gate B 3-seed corrector mean, fallback requant, artifact preservation.

Fixes applied (Codex review): checkpoint persisted before log parse
(Fix D), 3-way ablation decision fork with hold band (Fix G), fail-closed
fallback parse (Fix H), removed malformed S3 backend (Fix J), Gate B
rewired to coherent 3-seed corrector mean (Fix I — seed-0 re-eval added
so all three seeds use same corrector config, mean vs published 1.07280628).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
009: add logit_bias warmup pass (dummy bf16 tensor) so Dynamo traces
     the Tensor branch before the 600s eval timer starts; gated on
     h.corrector_alpha > 0

002: pass BEST_ALPHA/BEST_ORDERS as argv to Gate B summary heredoc;
     corrector_alpha/corrector_orders now populated in gate_b_summary.json

003: update 02_gate_a.sh header comment to show actual ceiling 1.07516564

004: drop hash() wrapper in PrefixNgramCorrector — use ctx tuple directly
     as dict key; Python dicts handle collision disambiguation natively

001: rewrite test_single_pass to actually exercise chunk-boundary
     invariance: same corrector fed tokens[:10] then tokens[10:] must
     match a fresh single-pass corrector fed all 20 tokens

All 9 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…t-SHA pin, align README + run_all

- 05_preserve_artifacts.sh: write commit_sha.txt, hardware_info.txt,
  env_fingerprint.txt before tarball; fix repo_type=model to match
  the amay01/parameter-golf-session3-artifacts repo type
- 00_verify_pod.sh: add optional EXPECTED_SHA exact-pin check on top
  of existing ancestry-only guard
- run_all.sh: parameterize banner SHA; warn when EXPECTED_SHA unset
  so operator knows the orchestrator is running ancestry-only
- README.md: align Gate A kill threshold (1.078 → 1.07516564); update
  Block 1 operator commands to include git checkout + EXPECTED_SHA;
  separate ancestry anchor from session launch SHA in header

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…verlay

Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds
provenance auto-capture, repo-type=model fix, exact-SHA env-var pin, and
run_all.sh/README alignment; new pin reflects the pipeline-patch commit.

Also records the live-guidance absolute-BPB overlay and 04b deprecation
driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ative result + quantized-eval fix

Non-record evidence package for PR openai#1610. Three separable contributions:

1. Faithful seed-0 reproduction of PR openai#1610 on independent infrastructure
   (8xH100 HBM3 SXM5, RunPod): our BPB 1.07218477 vs published seed-0 BPB
   1.07216564 -> delta +1.913e-5.

2. Bounded negative result for a score-first n-gram posterior corrector
   layered on PR openai#1610's phased LoRA TTT eval path. All three tested
   (alpha, orders) configs degrade BPB monotonically with alpha. The
   corrector and TTT-LoRA are both deterministic functions of the scored
   prefix, so additively combining them over-counts the prefix evidence.
   Claim is bounded to the tested grid on this stack; does not generalize
   to all posterior correctors or non-TTT eval pipelines.

3. Fix for the quantized-eval-only branch of train_gpt.py (two guards at
   lines 3204 and 3259) that previously crashed on None-model dereference
   when EVAL_ONLY_QUANTIZED_PATH was set. Surfaced while running the
   ablations in contribution 2.

Artifact: 15,999,394 bytes (606 bytes of competition-cap headroom).
Single-seed scope, acknowledged. Compliance with Issue openai#1017 Section III
walked line-by-line in README.

Also updates three internal docs to reference the renamed HF artifact
repo (amay01/parameter-golf-pr1610-reproduction-artifacts).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- decisions.md: new locked decision explaining non-record framing, scope bounds, and post-submission discipline (no self-comments for 48h)
- AGENT_SYNC.md: current objective now records PR4 submitted upstream
- next-session.md: phase updated to post-submission state; Fallback 1A framed as secondary task unblocked from PR status
- results_log.jsonl: appended four records (reproduction + three ablations) with pre/post quantization BPBs, eval times, and bounded negative-result outcomes

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@amrayach
Author

Closing to reopen against upstream/main with a clean diff scoped to the submission folder only.

@amrayach amrayach closed this Apr 19, 2026