Non-record: #1610 reproduction (Δ=+1.9e-5 BPB), n-gram posterior corrector negative result, quantized-eval-only path fix#1740
Closed
amrayach wants to merge 74 commits into openai:main from
Conversation
- 5-run A100 evidence package (baseline, seed42, LowerLR, warmdown, smoke)
- Campaign sessions 01-07 with templates and runbook
- Pre-TTT anchor diff analysis and root port-gap audit
- V100/fp16 compatibility via AMP_DTYPE auto-detection
- Agent coordination files (CLAUDE.md, AGENTS.md, AGENT_SYNC.md)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…te, pre-run Self-contained anchor script porting the 2026-03-21 donor features onto the root train_gpt.py skeleton. Features: SmearGate, BigramHash, XSA-last-4, partial RoPE (16/64) with NTK scaling, LN scale, EMA, Muon/Adam weight decay, mixed int6+zstd export, stride-64 sliding eval. SDPA attention backend (no flash_attn_3 dependency). All 15 code review checks passed. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…constructors The parameter was defined in Hyperparameters but hardcoded as 1024 in CausalSelfAttention.__init__ instead of being passed through GPT → Block → CausalSelfAttention. This would have caused a TypeError on model construction. Found by Codex local validation (CPU forward pass). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ning NGC PyTorch containers don't ship sentencepiece. Install via Pegasus PyPI cache (with fallback to public PyPI) at container launch time. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The oom_kill was a Slurm memory limit (not GPU OOM). Default Slurm memory allocation is too low for the anchor model's grad accumulation buffer on a single GPU. Request 64G explicitly. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
8 tasks simultaneously pip-installing inside NGC container caused Slurm OOM kill. Fix: only rank 0 installs, others wait 5s. Also request all available node memory with --mem=0 in salloc mode. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The math backend is a slow fallback that PyTorch may pick for certain shapes. The donor disables it. This was flagged as observation I1 in the code review but left as-is for safety. Now that the smoke test confirms flash SDP works, disable math for better throughput. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
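The toggle described in this commit is a one-liner against PyTorch's SDP backend flags. A minimal sketch (process-wide flags; the surrounding anchor code is not shown):

```python
import torch

# Disable the slow math fallback so PyTorch cannot silently pick it for
# awkward shapes; keep the fused kernels eligible. These are the standard
# torch.backends.cuda toggles, applied once at startup.
torch.backends.cuda.enable_math_sdp(False)
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
```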
Sliding s64 val_bpb: 1.12904446 (target 1.123-1.128, landed at edge)
Steps: 6564, step_avg: 91.37ms, artifact: 15,751,324 bytes
Gap from donor (1.1248): +0.0042 BPB — entirely throughput (487 fewer steps)
Bottleneck: SDPA (FA2) vs donor's FA3, not model fidelity
Environment: 8xH100 SXM5, serv-3342, NGC 26.03 container, /fscratch data

All coordination docs updated for Session 04 handoff.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…dates Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…thesis + doc sync

Delta 1 (GPTQ-lite percentile clip search):
- Sliding s64: 1.12941 (worse than anchor 1.12904 by +0.0004)
- Artifact: 16,219,752 bytes — OVER 16MB cap by 219KB
- Conclusion: clip search hurts zstd compressibility; export gap not clip-related

Novel approach synthesis (04b):
- 50 ideas from 8 parallel AI research queries (ChatGPT/Claude/Gemini/Perplexity)
- Top convergence signal: n-gram cache (4+ models)
- RFN thesis connection: low-rank residual correction (W ≈ Q + UV^T)

All coordination docs updated for Delta 2 (LeakyReLU²) pivot.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…re-run Single change vs measured Session 03 anchor: MLP activation from relu^2 to leaky_relu(0.5)^2. enable_math_sdp restored to True to match anchor measured state (pre-563700f). All other code identical to anchor. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Delta 2 measured on 8xH100: sliding s64 val_bpb 1.12904123 (effectively identical to anchor 1.12904446). Slightly better quantization metrics and 168KB smaller artifact, but +0.72ms/step slower throughput cancels gains. Not a standalone graduating delta — keepable as a stack component. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…audit Session 04 micro-delta sweep closed at 1 failed (GPTQ-lite) + 1 neutral (LeakyReLU²). Session 05 opens as a three-part audit: throughput gap decomposition (FA3 portability), pre-TTT stack-gap analysis vs the local 1.1194 record, and TTT correctness/integration planning. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Comprehensive prompt covering throughput audit, pre-TTT stack-gap audit, and TTT correctness audit. Includes git, Pegasus, documentation, and memory update conventions for session continuity. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ify FA3 scope Make tool/MCP section "prefer if available" instead of hard requirements. Reframe FA3 check as verify-presence not install-package. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Concrete implementation prompt for FW-1 (Flash Attention 3) as isolated delta on anchor. Includes pre-implementation FA3 verification check, exact code change sites, isolation constraints, and full convention set. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Phase 1 of revised Session 05: port anchor attention from SDPA to direct flash_attn_interface (FA3) on NGC 25.02 container. Microbenchmark shows 11.44x kernel speedup. AGENT_SYNC updated with competitive landscape analysis and 3-phase plan (FA3 → Full Hessian GPTQ → Novelty). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…pdate CLAUDE.md

- 05b: Full Hessian GPTQ isolated delta with strict source priority, preserved artifact format, calibration budget (target ≤30s)
- 05c: Training bundle (XSA-all + VE128 + SWA + warmdown3500) with bisect plan
- CLAUDE.md: add entry point protocol, Pegasus operational rules, FA3 container path
- AGENT_SYNC: updated with FA3 port results (neutral/slower), revised strategy
- FA3 README: updated with smoke results and saved container path

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Isolated quantization-only delta: replaces naive int6 per-row rounding
with Cholesky error-compensated GPTQ (block_size=128, actorder, percdamp=0.01).
Training code is identical to anchor. Post-training calibration collects
H=X^TX via 128 forward passes, then GPTQ quantizes column-by-column with
error propagation weighted by H^{-1}.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
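The column-by-column error compensation described above can be sketched in a few lines. This is a hedged illustration with hypothetical names (`gptq_quantize_int6`, a simple per-row symmetric scale), not the PR's exact code; the real path adds block_size=128 processing, actorder, percdamp, and the mixed int6+zstd export format.

```python
import torch

def gptq_quantize_int6(W, Hinv, qmax=31):
    """Sketch: quantize W column-by-column to int6 and push each column's
    rounding error into the not-yet-quantized columns, weighted by the
    corresponding row of H^{-1} (classic GPTQ error propagation)."""
    W = W.clone().float()
    Q = torch.zeros_like(W)
    scale = W.abs().amax(dim=1) / qmax                  # per-row scale
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    for c in range(W.shape[1]):
        w = W[:, c]
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        Q[:, c] = q
        err = (w - q * scale) / Hinv[c, c]              # normalized residual
        W[:, c + 1:] -= torch.outer(err, Hinv[c, c + 1:])  # compensate rest
    return Q, scale
```

With `Hinv` set to the identity, the compensation term vanishes and this degenerates to naive per-row rounding, which is exactly the baseline the commit replaces.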
… handoff docs 1xH100 smoke revealed 0.212 BPB roundtrip gap (27x worse than anchor). GPTQ pipeline mechanics work (66 layers, 0 fallbacks, 4.2s) but quantized weights reconstruct catastrophically. Must debug before 8xH100. Updated: AGENT_SYNC, next-session, decisions, project-state. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…n path

The export-only replay showed 66/66 layers worse than naive regardless of actorder or block_size, pointing to the upstream Hessian path as the root cause. This patch aligns Hessian collection with PR openai#1019/openai#634 semantics:

- Divide accumulated H by num_batches (was raw sum — caused scale blowup)
- Add 1% diagonal damping in _finalize_hessians before quantization
- Run calibration forward pass under torch.autocast(bf16) to match training
- Accumulate Hessians on CPU to avoid GPU memory pressure

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
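The normalization and damping fixes amount to the following accumulation discipline, sketched here with a hypothetical `HessianCollector` helper (the real code hooks layer inputs per parameter and runs under bf16 autocast):

```python
import torch

class HessianCollector:
    """Sketch of the fixed Hessian path: accumulate H = X^T X on CPU,
    then normalize by the batch count (mean, not raw sum) and add 1%
    diagonal damping before any inversion/Cholesky."""
    def __init__(self, dim):
        self.H = torch.zeros(dim, dim)   # CPU accumulator
        self.num_batches = 0

    def update(self, X):                 # X: [tokens, dim] activations
        self.H += (X.T @ X).float().cpu()
        self.num_batches += 1

    def finalize(self, percdamp=0.01):
        H = self.H / max(self.num_batches, 1)       # the scale-blowup fix
        damp = percdamp * torch.diagonal(H).mean()  # 1% diagonal damping
        H += damp * torch.eye(H.shape[0])
        return H
```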
Previous replay_ref_hfix still showed 66/66 worse layers with the Hessian normalization fix. Rather than continuing to debug from symptoms, this transplants PR openai#1019's complete GPTQ slice verbatim:

- collect_hessians: PR openai#1019 hook pattern with pre-init and param_name keys
- quantize_int6_gptq: verbatim from PR openai#1019 lines 1171-1224
- gptq_mixed_quantize_int6: direct param_name key lookup, PR openai#1019 quantizer

Source: pr-1019-gptq:records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/train_gpt.py

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Set GPTQ_AR_CALIB=1 to generate 64 autoregressive sequences (temp=0.8) from the model itself instead of using training data for Hessian collection. This matches PR openai#1019's actual calibration strategy. Both paths available — training data (default) and AR self-gen (opt-in). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
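The opt-in AR self-generation path can be sketched as plain temperature sampling from the model. Names, the `model_step` callable, and the sequence length are illustrative assumptions; the commit only specifies 64 sequences at temperature 0.8:

```python
import torch

def ar_calibration_batch(model_step, vocab=8192, n_seqs=64, seq_len=128,
                         temperature=0.8):
    """Sketch: generate calibration sequences autoregressively from the
    model itself (GPTQ_AR_CALIB=1 path) instead of sampling training data.
    `model_step` maps a token prefix [B, t] to next-token logits [B, V]."""
    toks = torch.randint(vocab, (n_seqs, 1))        # random seed tokens
    for _ in range(seq_len - 1):
        logits = model_step(toks) / temperature     # temperature 0.8
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1)           # sample one token [B, 1]
        toks = torch.cat([toks, nxt], dim=1)
    return toks
```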
…sbatch

- Brotli is now a hard import (not try/except). The lzma fallback caused cross-rank decompress failures when pip install raced across srun tasks. 2/12 grid jobs (seed 42, ttt_qk5) crashed with LZMAError on brotli blobs.
- Removed dead lzma compress/decompress branches from the model export path. Code-wrapper self-compression (line 186) still uses lzma intentionally.
- New sbatch files with fixes:
  - 07c1_ttt_s1337_fixed: TTT seed 1337, 35min wallclock, forced pip install
  - 07c1_ttt_s2025_fixed: TTT seed 2025, 35min wallclock, forced pip install
  - 07c1_base_s42_fixed: baseline seed 42 rerun, forced pip install

All use unconditional `pip install` instead of a conditional import guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Bring AGENTS.md, AGENT_SYNC.md, project-state.md, decisions.md, and next-session.md to the openai#1610-direct strategy. Add locked execution plan (PLAN_PR1610_CORRECTOR.md Rev 3). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Exact copy from PR openai#1610 at SHA ca19195. MD5: 57cfda2047b2c2a63ec10b99d704bfb0. 3379 lines, 139831 bytes. This is the unmodified source base; corrector will be added in later commits. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Setup, seed-0 (Gate A), seed-1/2 (Gate B) subcommands with published BPB verification targets and kill criteria. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Commit the current posterior-corrector working-tree state for PR openai#1610: - train_gpt.py corrector path plus warmup legality fix - LEGALITY_SPEC.md, DEPENDENCY_GATE.md, requirements.txt - test_corrector.py and bench_corrector_cpu.py - AGENT_SYNC.md closeout with audit measurements The warmup path previously touched val_data.val_tokens before the official eval timer. It now uses a device-local torch.Generator + torch.randint synthetic tokens. 9/9 tests pass and the CPU bench projects 26.1s. Co-Authored-By: Claude Opus 4.7 <[email protected]>
… A/B

Zero-intervention 8xH100 pipeline: pod verify, SP8192 download, Gate A seed-0 baseline, corrector ablations, 3-way decision point, Gate B 3-seed corrector mean, fallback requant, artifact preservation.

Fixes applied (Codex review):
- checkpoint persisted before log parse (Fix D)
- 3-way ablation decision fork with hold band (Fix G)
- fail-closed fallback parse (Fix H)
- removed malformed S3 backend (Fix J)
- Gate B rewired to coherent 3-seed corrector mean (Fix I — seed-0 re-eval added so all three seeds use the same corrector config, mean vs published 1.07280628)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
009: add logit_bias warmup pass (dummy bf16 tensor) so Dynamo traces
the Tensor branch before the 600s eval timer starts; gated on
h.corrector_alpha > 0
002: pass BEST_ALPHA/BEST_ORDERS as argv to Gate B summary heredoc;
corrector_alpha/corrector_orders now populated in gate_b_summary.json
003: update 02_gate_a.sh header comment to show actual ceiling 1.07516564
004: drop hash() wrapper in PrefixNgramCorrector — use ctx tuple directly
as dict key; Python dicts handle collision disambiguation natively
001: rewrite test_single_pass to actually exercise chunk-boundary
invariance: same corrector fed tokens[:10] then tokens[10:] must
match a fresh single-pass corrector fed all 20 tokens
All 9 tests pass.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
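The chunk-boundary invariance that fix 001 now exercises can be demonstrated with a toy prefix counter. `TinyPrefixCounter` is a hypothetical stand-in for the real corrector state, reduced to the one property under test: feeding tokens in two chunks must leave the state identical to a single pass.

```python
from collections import defaultdict

class TinyPrefixCounter:
    """Minimal stand-in for corrector state: counts of (context, token)
    pairs, keyed by the raw ctx tuple (no hash() wrapper needed — dicts
    disambiguate collisions natively, per fix 004)."""
    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(int)
        self.ctx = ()

    def update(self, tok):
        self.counts[self.ctx + (tok,)] += 1
        self.ctx = (self.ctx + (tok,))[-self.order:]  # state carries over chunks

def feed(counter, toks):
    for t in toks:
        counter.update(t)
    return counter

tokens = [1, 2, 1, 2, 3, 1, 2, 1, 3, 2] * 2          # 20 tokens
chunked = feed(feed(TinyPrefixCounter(), tokens[:10]), tokens[10:])
single = feed(TinyPrefixCounter(), tokens)
assert chunked.counts == single.counts                # boundary-invariant
```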
…t-SHA pin, align README + run_all

- 05_preserve_artifacts.sh: write commit_sha.txt, hardware_info.txt, env_fingerprint.txt before tarball; fix repo_type=model to match the amay01/parameter-golf-session3-artifacts repo type
- 00_verify_pod.sh: add optional EXPECTED_SHA exact-pin check on top of existing ancestry-only guard
- run_all.sh: parameterize banner SHA; warn when EXPECTED_SHA unset so the operator knows the orchestrator is running ancestry-only
- README.md: align Gate A kill threshold (1.078 → 1.07516564); update Block 1 operator commands to include git checkout + EXPECTED_SHA; separate ancestry anchor from session launch SHA in header

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…verlay Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds provenance auto-capture, repo-type=model fix, exact-SHA env-var pin, and run_all.sh/README alignment; new pin reflects the pipeline-patch commit. Also records the live-guidance absolute-BPB overlay and 04b deprecation driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ative result + quantized-eval fix

Non-record evidence package for PR openai#1610. Three separable contributions:

1. Faithful seed-0 reproduction of PR openai#1610 on independent infrastructure (8xH100 HBM3 SXM5, RunPod): our BPB 1.07218477 vs published seed-0 BPB 1.07216564 -> delta +1.913e-5.
2. Bounded negative result for a score-first n-gram posterior corrector layered on PR openai#1610's phased LoRA TTT eval path. All three tested (alpha, orders) configs degrade BPB monotonically with alpha. The corrector and TTT-LoRA are both deterministic functions of the scored prefix, so additively combining them over-counts the prefix evidence. The claim is bounded to the tested grid on this stack; it does not generalize to all posterior correctors or non-TTT eval pipelines.
3. Fix for the quantized-eval-only branch of train_gpt.py (two guards at lines 3204 and 3259), which previously crashed on a None-model dereference when EVAL_ONLY_QUANTIZED_PATH was set. Surfaced while running the ablations in contribution 2.

Artifact: 15,999,394 bytes (606 bytes of competition-cap headroom). Single-seed scope, acknowledged. Compliance with Issue openai#1017 Section III walked line-by-line in README.

Also updates three internal docs to reference the renamed HF artifact repo (amay01/parameter-golf-pr1610-reproduction-artifacts).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- decisions.md: new locked decision explaining non-record framing, scope bounds, and post-submission discipline (no self-comments for 48h)
- AGENT_SYNC.md: current objective now records PR4 submitted upstream
- next-session.md: phase updated to post-submission state; Fallback 1A framed as secondary task unblocked from PR status
- results_log.jsonl: appended four records (reproduction + three ablations) with pre/post quantization BPBs, eval times, and bounded negative-result outcomes

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Author

Closing to reopen against upstream/main with a clean diff scoped to the submission folder only.
This package is intentionally narrow: it does not remix multiple frontier submissions into a new record claim. Instead, it reproduces one current frontier line to near-exact fidelity, tests one new adaptive corrector path against that reproduced baseline, and reports both the measured negative result and the eval-only fix required to obtain it.
Prior context
Previous submissions in this line: #1101 (pre-TTT anchor, 1.1290 BPB), #1307 (07c1 strict base proof vs merged #1019), #1598 (SP8192-D 5-seed evidence package).
Contributions
1. Faithful seed-0 reproduction of PR #1610 (Record: VarLenAttn + PhasingTTT - val_bpb 1.0728 (3-seed mean)), pinned by commit `1765afc` at upstream `ca19195`.
2. Bounded negative result for the n-gram posterior corrector: all tested `(alpha, orders)` configs degrade BPB, monotonically in `alpha`. Multi-order backoff provides no measurable benefit over single-order at the same blend weight.
3. Fix for `train_gpt.py`'s quantized-eval-only branch (two guards at lines 3204 and 3259). Without these, `EVAL_ONLY_QUANTIZED_PATH` crashes on a `None`-model dereference. Surfaced while running the ablations in Contribution 2.

The reproduction is a credibility prerequisite for the negative-result claim, not a contribution in itself. The corrector formulation and its Section-III-compliance engineering are the only novel content. The bug fix is incidental.
Reproduction result
Training stopped at step 4,879 of 20,000 due to `MAX_WALLCLOCK_SECONDS=600 - GPTQ_RESERVE_SECONDS=13` (by design in #1610). The training log's `GATE_A: FAIL` line reflects our internal pipeline's 15,997,520-byte safety threshold (intended to absorb code-size drift); the artifact passes the competition rule.

Corrector ablation
All three ablations run in eval-only mode against the reproduced seed-0 checkpoint — no retraining.
The effect at α=0.1 is ~1/8 of the effect at α=0.3 — first-order linear in α, no inflection toward improvement. Structurally, TTT-LoRA and the n-gram corrector are both deterministic functions of the scored prefix `x_{1..t-1}`; adding `alpha * log(q_prefix_ngram(v))` on top of logits that already encode `P(x_t | x_{1..t-1})` under TTT adaptation over-counts the prefix evidence. This predicts the monotonic-in-α result, and predicts that a non-TTT eval pipeline might behave differently; the latter was not tested.

This PR rules out one tested posterior-corrector path on a reproduced #1610-class phased-TTT stack; it does not claim that all n-gram or posterior correctors are ineffective.
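The blend under discussion has a simple shape. A self-contained sketch of a score-before-update prefix n-gram corrector follows; the class name, defaults, and structure are illustrative, not the PR's exact code:

```python
import torch
from collections import defaultdict

class PrefixNgramCorrectorSketch:
    """Sketch of the score-first corrector: a Laplace-smoothed prefix
    n-gram posterior q_t, blended as logits + alpha * log(q_t) over the
    full vocab, with update() called only AFTER the token is scored."""
    def __init__(self, vocab, order=3, alpha=0.1):
        self.vocab, self.order, self.alpha = vocab, order, alpha
        self.counts = defaultdict(lambda: torch.ones(vocab))  # Laplace: q_t(v) > 0
        self.ctx = ()

    def bias(self):
        q = self.counts[self.ctx]
        q = q / q.sum()                   # posterior over all v
        return self.alpha * q.log()       # full [V] tensor add, never gathered

    def update(self, tok):                # runs after scoring tok
        self.counts[self.ctx][tok] += 1
        self.ctx = (self.ctx + (tok,))[-(self.order - 1):]

corr = PrefixNgramCorrectorSketch(vocab=8)
biased = torch.zeros(8) + corr.bias()     # score first...
corr.update(3)                            # ...then fold the token into state
```

The over-counting argument applies because the logits this bias is added to were themselves produced by TTT adaptation over the same prefix that populated `counts`.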
Eval-only bug fix
In `EVAL_ONLY_QUANTIZED_PATH` mode, `base_model`, `compiled_model`, and `compiled_forward_logits` are all `None` (line 3188), but two downstream paths dereferenced them:

- `timed_eval("diagnostic pre-quantization post-ema", ...)` dereferenced `compiled_model.forward_logits` → `AttributeError`.
- `del eval_model, compiled_model` cleanup referenced `eval_model`, which was never bound in this mode → `UnboundLocalError`.

Fix: an `if not quantized_eval_only:` guard on the diagnostic (line 3204), and an extension of the existing cleanup guard to cover this branch (line 3259). The post-quantization diagnostic still runs because it calls `deserialize(h, device)` directly and does not touch the `None` locals.

Compliance with Issue #1017 Section III
Walked line-by-line in the folder README under "Compliance with Issue #1017 Section III". Summary:
- `PrefixNgramCorrector` state (lines 15-58) is populated only via `update(x_t)`, which runs after scoring.
- The bias is `logits + alpha * log(q_t)` over the full V=8192 (line 1122). Laplace init (line 23) guarantees `q_t(v) > 0` for all v. Full `[V]` tensor add, not a gathered single index.
- Scored tokens feed `update(_tok)` (line 2591). Explicit inline comment at line 2583: `# Corrector: update state with scored tokens (score-before-update)`.
- Warmup uses synthetic tokens only, via a device-local RNG generator (lines 3324-3365). The timer starts at `torch.cuda.synchronize(); t_ttt = time.perf_counter()` (lines 3370-3371) after warmup closes.
- The chunk-static bias approximation is a deliberate engineering choice (a per-position bias would cost 32× more GPU forwards or a ~2 GB `[B, S, V]` dense tensor per batch per rank, both breaking the time/memory budget). It satisfies score-before-update at chunk granularity rather than per-position: the bias inside chunk `c` uses only tokens from chunks `[0, c)`. Explicit in the corrector's docstring.

Scope
Single-seed (seed 0). Reproduction is compared against #1610's published seed-0 number (1.07216564), not their 3-seed mean. Multi-seed validation was descoped: given a +1.9×10⁻⁵ BPB delta against the matched seed and monotonic +0.002 to +0.017 degradation across the corrector grid, additional seeds would refine variance but are unlikely to flip either direction. The negative-result claim is bounded to seed 0 of the reproduced checkpoint.
Out of scope in this package: α < 0.1, orders > 12, logistic-domain blends, non-TTT eval pipelines.
Artifacts
Self-contained in `records/track_non_record_16mb/2026-04-19_pr1610_reproduction_corrector_negative/`: `train_gpt.py`, `submission.json`, `requirements.txt`, raw `train_seed0.log` + three `ablation_1[abc].log`, machine-readable `reproduction_summary.json` and `ablation_summary.json`, plus `provenance/` (commit SHA, env fingerprint, nvidia-smi). Training logs are raw; the training script writes compact metrics-only output by design.

Supplementary external archive: https://huggingface.co/amay01/parameter-golf-pr1610-reproduction-artifacts (141 MB tarball, MD5 `caf8adf63d8c80965f6671beba95d7aa`). Contains preserved checkpoints (`final_model.int6.ptz`, `final_model.pt`) and full intermediate artifacts. Not required to reproduce the headline number.