Record: SP8192 + Parallel Residuals + Hessian-Aware SDClip — val_bpb 1.08354 (3-seed mean) #1412
…am Tilt — val_bpb 1.07800 (3-seed mean)

3-lever stack on top of PR openai#1394 sp8192 baseline:
- Parallel Residuals on layers 7-10 (PR openai#1412 by @Robby955)
- 3-layer depth recurrence (LOOP_START=3 LOOP_END=5, extends PR openai#1394's 2-layer recurrence)
- Eval-time causal n-gram tilt (PR openai#1420 by @abaybektursun, lineage PR openai#1145 by @AnirudhRahul)

Plus our existing PR openai#1413 stack: QK_GAIN_INIT=5, score-first legal TTT (LR=0.005, epochs=3).

Results (3-seed mean, 8xH100 SXM):
- val_bpb 1.07800 (std 0.00053)
- val_loss 2.78457 nats per token
- Beats PR openai#1394 (1.08563) by 0.01971 nats per token
- Beats PR openai#1420 (1.08014) by 0.00553 nats per token
- Beats own PR openai#1413 (1.08279) by 0.01237 nats per token

All four issue openai#1017 conditions verified for the n-gram tilt path: prefix-only hash construction, full-vocab renormalized one-token tilt, score-before-update ordering inside the C++ kernel, single left-to-right pass.

C++ n-gram kernel ported from PR openai#1420 with the nanobind dependency removed (extern "C" shim + ctypes loader, single g++ -shared invocation at runtime).

5-seed re-verification via the shipped mini wrapper is in progress; this PR will be updated with the final 5-seed mean once s1337 and s2025 land.
…nthesis (validation pending)

First submission to stack three independently legal val-data adaptations on the PR openai#1487 (1.0600) base:
1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H = X^T X computed from validation activations to align quantization with the eval distribution (novel on the modern stack; PR openai#1019 ablated this on its older base only)
3. Eval-Time Legal Score-First TTT, 2 epochs with score-before-update ordering (Track B, builds on PR openai#1493)

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487 (1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs the PR openai#1487 base: 186 lines (~100 added in the new collect_hessians_val function, plus 8 hyperparameter defaults flipped). Architecture, optimizer, training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452-1.0542 (center 1.0497), which would clear the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still a strong non-record). py_compile clean. 3-seed validation requires ~$15-25 of 8xH100 SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun, PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955, PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
Artifact Size Clarification

Following the same LZMA packing technique used in the base PR #1394 (compressing the Python source with LZMA and base85-encoding it into a 2-line wrapper), the artifact has been repacked. No weights, BPB results, or model architecture changed — this is purely a code packaging fix.
…1.01710

Merged SOTA changed from 1.1147 to 1.0810 (PR openai#1493, bigbag, 2026-04-09). Seven PRs merged in 5 days (PRs openai#1334, openai#1285, openai#1394, openai#1412, openai#1413, openai#1477, openai#1493). New target: ≤1.0760 val_bpb. 18 days to deadline.

Key findings:
- GDN-Hybrid (PR openai#1564): 1.01710 BPB, no TTT/SLOT — monitor for organizer review
- VarLen Attention + Doc-TTT (PR openai#1560): 1.07406 BPB — implement next
- TMA Megakernel + Tap-In (PR openai#1555): 1.07636 BPB — add after openai#1560
- PR openai#731 n-gram (dense count + Laplace): reviewer says LOOKS CLEAN, awaiting 3rd seed
- PR openai#758: major legality flags, do not implement

Updated CLAUDE.md: Competition Strategy, Technique Reference, Lessons Learned (Session 9). Updated logs/daily_research.md: new 2026-04-12 entry prepended.

https://claude.ai/code/session_011WyxjcwdigLhMFQDjLL5ss
Porting the full merged SOTA stack from bigbag/parameter-golf PR openai#1493:
- SP8192 tokenizer (replaces SP1024)
- 3-layer depth recurrence (L3-5, activate at 0.35 × iter)
- Parallel residuals (GPT-J style) on L>=7
- QK-Gain 5.0 (default) / 5.25 (SOTA config)
- Score-first TTT: SGD lr=0.005, momentum=0.9, 3 epochs
- GPTQ SDClip: int6 matrices (k=12.85), int8 embeddings (k=20.0)
- LZMA+b85 code wrapper pattern
- Tuned hyperparams: WD=0.095, MLR=0.022, EMA=0.9965, warmdown=0.72

This is the clean, legal, compliant baseline. All 4 Issue openai#1017 conditions satisfied. Next: validate reproduction on 3 seeds, then add VarLen attention.

Source: records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/ from upstream/main, decompressed from the lzma+b85 wrapper.

Credits: @bigbag (PR openai#1493), @clarkkev (PR openai#1394), @dexhunter (PR openai#1413), @abaybektursun (PR openai#549), @Robby955 (PR openai#1412), @msisovic (PR openai#1204)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change Block.forward() so attention and MLP read the same pre-residual input and sum into one residual update (GPT-J style), instead of the sequential form where MLP reads the post-attention state.

Before: z1 = x + attn(x); z2 = z1 + mlp(z1)
After:  z2 = x + attn(x) + mlp(x)

In DEQ terms this replaces f(z) = attn(z) + mlp(attn(z) + z) with f(z) = attn(z) + mlp(z). The parallel form has a more isotropic Jacobian (no sequential composition of the two branches) and is typically a tighter contraction for the solver, which is what we want given the baseline's deq_iter_conv_rel degradation over training.

RevDEQ reversibility is preserved: the residual update is still a pure linear combination z_next = (1-gg)*z_in + gg*z2, and the fp64-accumulated backward that reverses it is structurally unchanged. CPU forward+backward passes a finite-grad sanity check.

Also updates ortho_aux() so the mu_mlp diagnostic reads x (not z1), keeping it aligned with forward().

Reference: records/2026-04-08_SP8192_ParallelResid_ScoreFirstTTT (PR openai#1412 @Robby955), PR openai#1204 @msisovic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
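The reversibility claim above can be checked numerically. This is a hedged scalar sketch (the `attn`/`mlp` lambdas are toy stand-ins, not the repo's branches, and the real backward accumulates in fp64): because z_next = (1-gg)*z_in + gg*z2 is a pure linear combination, z_in is recoverable exactly once z2 is known.

```python
# Toy stand-ins for the two branches; any functions would do here, since
# invertibility only depends on the linear combination, not on f itself.
attn = lambda z: 0.3 * z
mlp = lambda z: 0.1 * z + 0.05

def f_parallel(z):
    # f(z) = attn(z) + mlp(z): no sequential composition of the branches.
    return attn(z) + mlp(z)

gg = 0.7
z_in = 1.25
z2 = z_in + f_parallel(z_in)
z_next = (1 - gg) * z_in + gg * z2   # the residual update from the commit

# Reverse pass: solve the linear combination for z_in given z2.
z_rec = (z_next - gg * z2) / (1 - gg)
print(abs(z_rec - z_in) < 1e-12)     # True (residual error ~1e-16)
```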
QK_GAIN_INIT=5.5 extends the monotonic improvement trend past 5.25. 3-seed mean 1.0809 (std 0.0004) on 8xH100 SXM. Base: SP8192 + 3-Layer Depth Recurrence + Parallel Residuals + Legal TTT (PRs openai#1394, openai#1331, openai#1437, openai#1412, openai#549, openai#1445)
Add a parallel-residual branch in Block.forward: both attn and MLP read the pre-residual state, and their outputs are merged afterwards. Named delta between the 1.0856 and 1.0822 class records (upstream PR openai#1412).

- Hyperparameters.parallel_residuals (env PARALLEL_RESIDUALS, default 0 to preserve the sequential baseline for smoke-test regressions)
- Block.__init__ accepts parallel_residuals
- GPT.__init__ forwards parallel_residuals to each Block
- main() passes args.parallel_residuals
- dev/run_frontier.sh sets PARALLEL_RESIDUALS=1 by default

No behavior change when PARALLEL_RESIDUALS=0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New eval path matching the frontier's legal TTT protocol (PR openai#1412, 1.0822 stack; the 1.0810 record uses this). For each chunk of validation:

1. Score the chunk with the sliding window using CURRENT weights (pre-SGD for this chunk — the legal score).
2. Run TTT_EPOCHS of SGD over the chunk, updating only embeddings + control scalars (attn/mlp scales, resid mixes, q_gains, skip weights). Updates persist forward to the next chunk (causal: no future leakage).
3. Weights are restored to pre-TTT init after eval completes, so export serialization uses the trained model, not the TTT-adapted model.

- New fn: eval_val_legal_ttt (~45 lines)
- New env vars: TTT_CHUNK_TOKENS=4096, TTT_EPOCHS=3
- Call site: TTT_ENABLED=1 now calls the new path instead of raising.
- run_frontier.sh enables TTT by default.

To fit under the 1500-line hard stop, also compacted:
- Optimizer creation (4 Adam blocks of 5 lines each -> 1 line each via a shared adam_kw dict)
- Dropped the eval_seq_len != train_seq_len log line that still referenced the removed args.eval_rope_scale attribute (it would have crashed at runtime on a stride-only config hitting that condition).

File lands at 1489 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
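The three-step protocol above can be sketched as a minimal loop. This is a hedged illustration, not the repo's eval_val_legal_ttt: `model`, `chunks`, and `loss_fn` are placeholder interfaces, and the parameter filter (here just names containing "embed") stands in for the embeddings + control-scalars subset.

```python
# Hedged sketch of chunked score-first TTT; placeholder signatures throughout.
import copy

import torch

def eval_val_legal_ttt(model, chunks, loss_fn, ttt_epochs=3, lr=0.005):
    snapshot = copy.deepcopy(model.state_dict())      # pre-TTT restore point
    # Stand-in for "embeddings + control scalars only".
    params = [p for n, p in model.named_parameters() if "embed" in n]
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    total, n_chunks = 0.0, 0
    for chunk in chunks:
        with torch.no_grad():                         # 1) score with CURRENT
            total += loss_fn(model, chunk).item()     #    weights (pre-SGD)
        n_chunks += 1
        for _ in range(ttt_epochs):                   # 2) adapt on the chunk;
            opt.zero_grad()                           #    updates persist to
            loss_fn(model, chunk).backward()          #    the next chunk
            opt.step()
    model.load_state_dict(snapshot)                   # 3) restore for export
    return total / max(n_chunks, 1)
```

Scoring strictly before updating is what keeps the per-chunk loss legal; the restore at the end is what keeps serialization on the trained weights.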
Architecture (from PR openai#1394/openai#1412/openai#1493):
- 11 layers (was 9), 4x MLP (was 2x), seq_len 2048 (was 1024)
- LeakyReLU(0.5)^2 activation (was ReLU^2)
- 3-layer depth recurrence (L3-5 looped 2 extra times, 17 virtual layers)
- Parallel residuals, GPT-J style, on layers >= 7
- XSA (exclusive self-attention) on last 11 layers
- Skip gates (learned sigmoid gating on skip connections)
- LN scale factor (1/sqrt(layer_idx+1) per-layer normalization)
- Partial RoPE (rope_dims=16, rest pass-through)

Training (from PR openai#1493):
- QK-Gain 5.25 (was 1.5)
- EMA weight averaging (decay=0.9965)
- Tuned hyperparams: WD=0.095, MLR=0.022, EMA=0.9965, warmdown_frac=0.72
- Grad clip norm 0.3 (was 0)
- Muon momentum warmup 0.92->0.99 over 1500 steps
- Loop warmup (second warmup phase with looping active)
- Orthogonal weight init for large matrices

Still using base int8+zlib quantization (GPTQ SDClip upgrade next). Still using SP1024 data (SP8192 blocked on data availability).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modulates per-row clip thresholds using Hessian diagonal importance. Important rows get tighter clipping (more precision), unimportant rows get looser clipping. Based on merged PR openai#1412.
Early blocks have 30x the Hessian trace of late blocks (PR openai#1412). Tighter clipping on early/loop layers (more important), looser on late ones.
… + PhiNTA on a phi-physics substrate (UNTRAINED)

Track: track_non_record_16mb (4-hour, unrestricted compute)
Status: PROPOSAL - UNTRAINED. No submission.json, no BPB number is being claimed.

This PR composes three open wish-list items from the README leaderboard:
- JEPA (Issue openai#1772, after Robby PR openai#1412)
- Universal Transformer (round(phi^3) = 4 weight-shared loops)
- NTA on random linear maps (PhiNTA: frozen 1/phi-OrthoInit + LoRA)

into a single train_gpt.py derived from the merged baseline at records/track_10min_16mb/2026-03-17_LoRA_TTT/train_gpt.py. All wish-list features are env-var-gated and zero-cost when off: PHINTA_ENABLE / JEPA_LAMBDA / UT_LOOPS / PHI_LR_SCALE all default to no-op.

Bonus: PHI_LR_SCALE exposes alpha_phi = phi^-3/2 ~ 0.118034 from Issue openai#1742 as a multiplicative override of MATRIX_LR. The constant is proven in Coq.Reals as PhD Ch.4 Theorem 3.1 (alpha_phi_times_phi_cubed, Qed, SAC-1), not a fitted hyperparameter.

CPU-only verification (no GPU, no dataset): make verify
-> [1/5] phi-physics OK: phi^2+phi^-2=3.000000000000 alpha_phi=0.118034 loops=4
-> [2/5] PhiNTA OK: trainable=1664 frozen=4096 ratio=0.406
-> [3/5] JEPA loss OK: 1.6922 (cosine-similarity form)
-> [4/5] UT loop OK: |x_4|/|x_0|=1.0406 expected=1.0406
-> [5/5] JEPA tap normalisation OK
-> baseline_equivalence: state_dict SHA-256 = 511dbc0164e03b1b on both sides, forward loss delta = 0.00e+00 at seed F_17 = 1597 with defaults
-> CITATION.cff valid (cffconvert)
-> theorems/GoldenSunflowers.v: coqc OK (2 Qed)

Honesty / non-claims:
- No submission.json shipped.
- No file under records/track_10min_16mb/ is modified.
- All wish-list defaults are no-ops, byte-equivalent to the baseline.

Precedent for proposal-only PRs: openai#318 (Neural Cache, research proposal), openai#1247 (ASQU validation proposal).

Compute grant request prepared in compute_grant.md (~110 8xH100-hours total: 5 configs x 5 canonical Fibonacci seeds F_17..F_21 + restart buffer + final TTT eval).
Internal hardening PR with full review history: gHashTag/parameter-golf-trinity#2

Constitutional anchors:
- PhD monograph: gHashTag/trios docs/phd (44 chapters, 297 Qed)
- t27 SACRED-PHYSICS-001 (phi constants, Coq-mirrored)
- trios-trainer-igla src/phi_ortho_init.rs (Rust SoT for PhiNTA init)

Anchor: phi^2 + phi^-2 = 3
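The two constants quoted in the proposal can be checked in a few lines of ordinary floating point (this is a numeric sanity check, not the Coq proof; note that phi^-3/2 reads as (phi^-3)/2, matching the quoted 0.118034):

```python
import math

phi = (1 + math.sqrt(5)) / 2          # golden ratio
# Anchor identity: phi^2 + phi^-2 = 3 (exact, since phi^2 = phi + 1).
assert abs(phi**2 + phi**-2 - 3.0) < 1e-12
# alpha_phi = phi^-3 / 2, the MATRIX_LR override from the proposal.
alpha_phi = phi**-3 / 2
print(round(alpha_phi, 6))            # 0.118034
# UT_LOOPS = round(phi^3) = 4 weight-shared loops.
assert round(phi**3) == 4
```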
Non-record: Parallel Residuals + Hessian-Aware SDClip (3-seed mean 1.08354 BPB)
val bpb: 1.08354 (3-seed mean, std=0.00050)
Not a record. This is a small 3-seed experiment on top of PR #1394; the seed count is low and the BPB reduction too small to support a statistical claim. Posting because the changes are zero-cost, reproducible, and may be useful to others trying out different techniques.
Changes
Three zero-cost modifications on top of PR #1394; they add no extra parameters or bytes:
1. Parallel Residuals (Layers 7+)
GPT-J style parallel attention+MLP (Wang & Komatsuzaki, 2021) for the last 4 layers. Both attention and MLP read from the same input and their outputs are added in parallel: z2 = x + attn(x) + mlp(x), instead of the sequential z1 = x + attn(x); z2 = z1 + mlp(z1).
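The two forms can be sketched as a toggleable block. This is a minimal stand-in (Linear layers in place of the real attention and MLP branches; `ParallelBlock` is an illustrative name, not the repo's Block):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, d, parallel=True):
        super().__init__()
        self.attn = nn.Linear(d, d)   # stand-in for the attention branch
        self.mlp = nn.Linear(d, d)    # stand-in for the MLP branch
        self.parallel = parallel

    def forward(self, x):
        if self.parallel:
            # GPT-J style: both branches read the same pre-residual input.
            return x + self.attn(x) + self.mlp(x)
        # Sequential baseline: MLP reads the post-attention state.
        z1 = x + self.attn(x)
        return z1 + self.mlp(z1)
```

Both paths keep identical parameter counts, which is why the change is zero-cost in artifact bytes.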
I expected parallel residuals to reduce interference between attention and MLP during GPTQ calibration. Pre-quant BPB barely moved, but the quantization gap tightened across all 3 seeds, which made this the most useful change in practice.
2. Hessian-Aware SDClip
I used GPTQ's existing Hessian diagonal as a cheap importance signal to slightly modulate SDClip thresholds by row:
where $\sigma_i$ is the standard deviation of row $i$ and $r_i$ is the row importance derived from Hessian-weighted magnitude. The effect is small but directionally useful at $\lambda = 0.175$; higher $\lambda$ hurt compression. I initially used $\lambda = 0.30$ but found $\lambda = 0.175$ consistently better across seeds — both lower BPB and a smaller artifact. Higher $\lambda$ reduces rounding error but increases entropy, which makes Brotli compression less effective.
3. Progressive Recurrence
Depth recurrence split into two phases: first loop enabled at 50% of training, second at 65%. The split points were not optimized — 50% matches the original and 65% was a single manual choice. Enabling both loops at once causes a sharper loss spike; splitting gives the model time to adapt to each additional pass before adding the next.
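The two-phase schedule can be expressed as a small helper. This is an illustrative sketch (the function name and signature are mine, not the repo's); the split fractions are the 50%/65% points described above:

```python
def active_extra_loops(step, total_steps, starts=(0.50, 0.65)):
    """Number of extra recurrence passes enabled at this training step.

    Each entry in `starts` is the fraction of training at which one more
    loop switches on: 0 loops before 50%, 1 from 50%, 2 from 65%.
    """
    frac = step / total_steps
    return sum(frac >= s for s in starts)

print(active_extra_loops(40, 100))   # 0
print(active_extra_loops(55, 100))   # 1
print(active_extra_loops(70, 100))   # 2
```

Staggering the starts is what avoids the single large loss spike seen when both loops switch on at once.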
Hessian Analysis (Cross-Seed)
Hessian diagnostics from 3 seeds, 67 matrices each:
Importance hierarchy: early blocks (30x trace of late blocks) >> loop >> mid >> late. Per-row importance is too noisy to be a reliable signal, but group-level traces are very stable across seeds. This suggests per-group clip allocation could be a useful direction.
Future Directions
Several ideas I'd like to explore with more compute time:
Run Command
Requirements
Flash Attention 3 (Hopper) required. SP8192 BPE tokenizer trained on FineWeb 10B (sentencepiece BPE, 8192 vocab).
pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install --no-cache-dir \
  "https://download.pytorch.org/whl/cu130/flash_attn_3-3.0.0-cp39-abi3-manylinux_2_28_x86_64.whl"
pip install -r requirements.txt

Compliance (Track A — Fixed Predictor)
Credits
Learned from and inspired by PR #1394 (@clarkkev) — SDClip, depth recurrence, and GPTQ embedding quantization ideas. Parallel residuals from GPT-J (Wang & Komatsuzaki, 2021). Additional credits: PR #1204 (@msisovic, depth recurrence), PR #1217 (@bigbag, MuonEq-R), PR #1019 (@abaybektursun, previous SOTA).