Record: SP8192 + Parallel Residuals + Hessian-Aware SDClip — val_bpb 1.08354 (3-seed mean) #1412
…am Tilt — val_bpb 1.07800 (3-seed mean)

3-lever stack on top of PR openai#1394 sp8192 baseline:
- Parallel Residuals on layers 7-10 (PR openai#1412 by @Robby955)
- 3-layer depth recurrence (LOOP_START=3 LOOP_END=5, extends PR openai#1394's 2-layer recurrence)
- Eval-time causal n-gram tilt (PR openai#1420 by @abaybektursun, lineage PR openai#1145 by @AnirudhRahul)

Plus our existing PR openai#1413 stack: QK_GAIN_INIT=5, score-first legal TTT (LR=0.005, epochs=3).

Results (3-seed mean, 8xH100 SXM):
- val_bpb 1.07800 (std 0.00053)
- val_loss 2.78457 nats per token
- Beats PR openai#1394 (1.08563) by 0.01971 nats per token
- Beats PR openai#1420 (1.08014) by 0.00553 nats per token
- Beats own PR openai#1413 (1.08279) by 0.01237 nats per token

All four issue openai#1017 conditions verified for the n-gram tilt path: prefix-only hash construction, full-vocab renormalized one-token tilt, score-before-update ordering inside the C++ kernel, single left-to-right pass.

C++ n-gram kernel ported from PR openai#1420 with the nanobind dependency removed (extern "C" shim + ctypes loader, single g++ -shared invocation at runtime).

5-seed re-verification via the shipped mini wrapper is in progress; this PR will be updated with the final 5-seed mean once s1337 and s2025 land.
…nthesis (validation pending)

First submission to stack three independently legal val-data adaptations on the PR openai#1487 (1.0600) base:
1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H = X^T X computed from validation activations to align quantization with the eval distribution (novel on the modern stack; PR openai#1019 ablated this on its older base only)
3. Eval-Time Legal Score-First TTT, 2 epochs with score-before-update ordering (Track B, builds on PR openai#1493)

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487 (1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs the PR openai#1487 base: 186 lines (~100 added in the new collect_hessians_val function, plus 8 hyperparameter defaults flipped). Architecture, optimizer, training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452-1.0542 (center 1.0497), which would clear the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still a strong non-record). py_compile clean. 3-seed validation requires ~$15-25 of 8xH100 SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun, PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955, PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
Artifact Size Clarification

Following the same LZMA packing technique used in the base PR #1394 (compressing the Python source with LZMA and base85-encoding it into a 2-line wrapper), the artifact has been repacked. No weights, BPB results, or model architecture changed — this is purely a code packaging fix.
…1.01710

Merged SOTA changed from 1.1147 to 1.0810 (PR openai#1493, bigbag, 2026-04-09). Seven PRs merged in 5 days (PRs openai#1334, openai#1285, openai#1394, openai#1412, openai#1413, openai#1477, openai#1493). New target: ≤1.0760 val_bpb. 18 days to deadline.

Key findings:
- GDN-Hybrid (PR openai#1564): 1.01710 BPB, no TTT/SLOT — monitor for organizer review
- VarLen Attention + Doc-TTT (PR openai#1560): 1.07406 BPB — implement next
- TMA Megakernel + Tap-In (PR openai#1555): 1.07636 BPB — add after openai#1560
- PR openai#731 n-gram (dense count + Laplace): reviewer says LOOKS CLEAN, awaiting 3rd seed
- PR openai#758: major legality flags, do not implement

Updated CLAUDE.md: Competition Strategy, Technique Reference, Lessons Learned (Session 9). Updated logs/daily_research.md: new 2026-04-12 entry prepended.

https://claude.ai/code/session_011WyxjcwdigLhMFQDjLL5ss
Porting the full merged SOTA stack from bigbag/parameter-golf PR openai#1493:
- SP8192 tokenizer (replaces SP1024)
- 3-layer depth recurrence (L3-5, activate at 0.35 × iter)
- Parallel residuals (GPT-J style) on L>=7
- QK-Gain 5.0 (default) / 5.25 (SOTA config)
- Score-first TTT: SGD lr=0.005, momentum=0.9, 3 epochs
- GPTQ SDClip: int6 matrices (k=12.85), int8 embeddings (k=20.0)
- LZMA+b85 code wrapper pattern
- Tuned hyperparams: WD=0.095, MLR=0.022, EMA=0.9965, warmdown=0.72

This is the clean, legal, compliant baseline. All 4 Issue openai#1017 conditions satisfied. Next: validate reproduction on 3 seeds, then add VarLen attention.

Source: records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/ from upstream/main, decompressed from the lzma+b85 wrapper.

Credits: @bigbag (PR openai#1493), @clarkkev (PR openai#1394), @dexhunter (PR openai#1413), @abaybektursun (PR openai#549), @Robby955 (PR openai#1412), @msisovic (PR openai#1204)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change Block.forward() so attention and MLP read the same pre-residual input and sum into one residual update (GPT-J style), instead of the sequential form where MLP reads the post-attention state.

Before: z1 = x + attn(x); z2 = z1 + mlp(z1)
After:  z2 = x + attn(x) + mlp(x)

In DEQ terms this replaces f(z) = attn(z) + mlp(attn(z) + z) with f(z) = attn(z) + mlp(z). The parallel form has a more isotropic Jacobian (no sequential composition of the two branches) and is typically a tighter contraction for the solver, which is what we want given the baseline's deq_iter_conv_rel degradation over training.

RevDEQ reversibility is preserved: the residual update is still a pure linear combination z_next = (1-gg)*z_in + gg*z2, and the fp64-accumulated backward that reverses it is structurally unchanged. CPU forward+backward passes a finite-grad sanity check.

Also updates ortho_aux() so the mu_mlp diagnostic reads x (not z1), keeping it aligned with forward().

Reference: records/2026-04-08_SP8192_ParallelResid_ScoreFirstTTT (PR openai#1412 @Robby955), PR openai#1204 @msisovic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
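The reversibility claim above can be checked numerically. This is a hedged scalar sketch (the `attn`/`mlp` lambdas are toy stand-ins, not the repo's branches, and the real backward accumulates in fp64): because z_next = (1-gg)*z_in + gg*z2 is a pure linear combination, z_in is recoverable exactly once z2 is known.

```python
# Toy stand-ins for the two branches; any functions would do here, since
# invertibility only depends on the linear combination, not on f itself.
attn = lambda z: 0.3 * z
mlp = lambda z: 0.1 * z + 0.05

def f_parallel(z):
    # f(z) = attn(z) + mlp(z): no sequential composition of the branches.
    return attn(z) + mlp(z)

gg = 0.7
z_in = 1.25
z2 = z_in + f_parallel(z_in)
z_next = (1 - gg) * z_in + gg * z2   # the residual update from the commit

# Reverse pass: solve the linear combination for z_in given z2.
z_rec = (z_next - gg * z2) / (1 - gg)
print(abs(z_rec - z_in) < 1e-12)     # True (residual error ~1e-16)
```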
QK_GAIN_INIT=5.5 extends the monotonic improvement trend past 5.25. 3-seed mean 1.0809 (std 0.0004) on 8xH100 SXM. Base: SP8192 + 3-Layer Depth Recurrence + Parallel Residuals + Legal TTT (PRs openai#1394, openai#1331, openai#1437, openai#1412, openai#549, openai#1445)
Add a parallel-residual branch in Block.forward: both attn and MLP read the pre-residual state, and their outputs are merged afterwards. Named delta between the 1.0856 and 1.0822 class records (upstream PR openai#1412).

- Hyperparameters.parallel_residuals (env PARALLEL_RESIDUALS, default 0 to preserve the sequential baseline for smoke-test regressions)
- Block.__init__ accepts parallel_residuals
- GPT.__init__ forwards parallel_residuals to each Block
- main() passes args.parallel_residuals
- dev/run_frontier.sh sets PARALLEL_RESIDUALS=1 by default

No behavior change when PARALLEL_RESIDUALS=0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New eval path matching the frontier's legal TTT protocol (PR openai#1412, 1.0822 stack; the 1.0810 record uses this). For each chunk of validation:

1. Score the chunk with the sliding window using CURRENT weights (pre-SGD for this chunk — the legal score).
2. Run TTT_EPOCHS of SGD over the chunk, updating only embeddings + control scalars (attn/mlp scales, resid mixes, q_gains, skip weights). Updates persist forward to the next chunk (causal: no future leakage).
3. Weights are restored to pre-TTT init after eval completes, so export serialization uses the trained model, not the TTT-adapted model.

- New fn: eval_val_legal_ttt (~45 lines)
- New env vars: TTT_CHUNK_TOKENS=4096, TTT_EPOCHS=3
- Call site: TTT_ENABLED=1 now calls the new path instead of raising.
- run_frontier.sh enables TTT by default.

To fit under the 1500-line hard stop, also compacted:
- Optimizer creation (4 Adam blocks of 5 lines each -> 1 line each via a shared adam_kw dict)
- Dropped the eval_seq_len != train_seq_len log line that still referenced the removed args.eval_rope_scale attribute (it would have crashed at runtime on a stride-only config hitting that condition).

File lands at 1489 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
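The three-step protocol above can be sketched as a minimal loop. This is a hedged illustration, not the repo's eval_val_legal_ttt: `model`, `chunks`, and `loss_fn` are placeholder interfaces, and the parameter filter (here just names containing "embed") stands in for the embeddings + control-scalars subset.

```python
# Hedged sketch of chunked score-first TTT; placeholder signatures throughout.
import copy

import torch

def eval_val_legal_ttt(model, chunks, loss_fn, ttt_epochs=3, lr=0.005):
    snapshot = copy.deepcopy(model.state_dict())      # pre-TTT restore point
    # Stand-in for "embeddings + control scalars only".
    params = [p for n, p in model.named_parameters() if "embed" in n]
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    total, n_chunks = 0.0, 0
    for chunk in chunks:
        with torch.no_grad():                         # 1) score with CURRENT
            total += loss_fn(model, chunk).item()     #    weights (pre-SGD)
        n_chunks += 1
        for _ in range(ttt_epochs):                   # 2) adapt on the chunk;
            opt.zero_grad()                           #    updates persist to
            loss_fn(model, chunk).backward()          #    the next chunk
            opt.step()
    model.load_state_dict(snapshot)                   # 3) restore for export
    return total / max(n_chunks, 1)
```

Scoring strictly before updating is what keeps the per-chunk loss legal; the restore at the end is what keeps serialization on the trained weights.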
Architecture (from PR openai#1394/openai#1412/openai#1493):
- 11 layers (was 9), 4x MLP (was 2x), seq_len 2048 (was 1024)
- LeakyReLU(0.5)^2 activation (was ReLU^2)
- 3-layer depth recurrence (L3-5 looped 2 extra times, 17 virtual layers)
- Parallel residuals, GPT-J style, on layers >= 7
- XSA (exclusive self-attention) on last 11 layers
- Skip gates (learned sigmoid gating on skip connections)
- LN scale factor (1/sqrt(layer_idx+1) per-layer normalization)
- Partial RoPE (rope_dims=16, rest pass-through)

Training (from PR openai#1493):
- QK-Gain 5.25 (was 1.5)
- EMA weight averaging (decay=0.9965)
- Tuned hyperparams: WD=0.095, MLR=0.022, EMA=0.9965, warmdown_frac=0.72
- Grad clip norm 0.3 (was 0)
- Muon momentum warmup 0.92->0.99 over 1500 steps
- Loop warmup (second warmup phase with looping active)
- Orthogonal weight init for large matrices

Still using base int8+zlib quantization (GPTQ SDClip upgrade next). Still using SP1024 data (SP8192 blocked on data availability).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modulates per-row clip thresholds using Hessian diagonal importance. Important rows get tighter clipping (more precision), unimportant rows get looser clipping. Based on merged PR openai#1412.
Early blocks have 30x the Hessian trace of late blocks (PR openai#1412). Tighter clipping on early/loop layers (more important), looser on late ones.
… + PhiNTA on a phi-physics substrate (UNTRAINED)

Track: track_non_record_16mb (4-hour, unrestricted compute)
Status: PROPOSAL - UNTRAINED. No submission.json, no BPB number is being claimed.

This PR composes three open wish-list items from the README leaderboard:
- JEPA (Issue openai#1772, after Robby PR openai#1412)
- Universal Transformer (round(phi^3) = 4 weight-shared loops)
- NTA on random linear maps (PhiNTA: frozen 1/phi-OrthoInit + LoRA)

into a single train_gpt.py derived from the merged baseline at records/track_10min_16mb/2026-03-17_LoRA_TTT/train_gpt.py. All wish-list features are env-var-gated and zero-cost when off: PHINTA_ENABLE / JEPA_LAMBDA / UT_LOOPS / PHI_LR_SCALE all default to no-op.

Bonus: PHI_LR_SCALE exposes alpha_phi = phi^-3/2 ~ 0.118034 from Issue openai#1742 as a multiplicative override of MATRIX_LR. The constant is proven in Coq.Reals as PhD Ch.4 Theorem 3.1 (alpha_phi_times_phi_cubed, Qed, SAC-1), not a fitted hyperparameter.

CPU-only verification (no GPU, no dataset): make verify
-> [1/5] phi-physics OK: phi^2+phi^-2=3.000000000000 alpha_phi=0.118034 loops=4
-> [2/5] PhiNTA OK: trainable=1664 frozen=4096 ratio=0.406
-> [3/5] JEPA loss OK: 1.6922 (cosine-similarity form)
-> [4/5] UT loop OK: |x_4|/|x_0|=1.0406 expected=1.0406
-> [5/5] JEPA tap normalisation OK
-> baseline_equivalence: state_dict SHA-256 = 511dbc0164e03b1b on both sides, forward loss delta = 0.00e+00 at seed F_17 = 1597 with defaults
-> CITATION.cff valid (cffconvert)
-> theorems/GoldenSunflowers.v: coqc OK (2 Qed)

Honesty / non-claims:
- No submission.json shipped.
- No file under records/track_10min_16mb/ is modified.
- All wish-list defaults are no-ops, byte-equivalent to the baseline.

Precedent for proposal-only PRs: openai#318 (Neural Cache, research proposal), openai#1247 (ASQU validation proposal).

Compute grant request prepared in compute_grant.md (~110 8xH100-hours total: 5 configs x 5 canonical Fibonacci seeds F_17..F_21 + restart buffer + final TTT eval).
Internal hardening PR with full review history: gHashTag/parameter-golf-trinity#2

Constitutional anchors:
- PhD monograph: gHashTag/trios docs/phd (44 chapters, 297 Qed)
- t27 SACRED-PHYSICS-001 (phi constants, Coq-mirrored)
- trios-trainer-igla src/phi_ortho_init.rs (Rust SoT for PhiNTA init)

Anchor: phi^2 + phi^-2 = 3
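The two constants quoted in the proposal can be checked in a few lines of ordinary floating point (this is a numeric sanity check, not the Coq proof; note that phi^-3/2 reads as (phi^-3)/2, matching the quoted 0.118034):

```python
import math

phi = (1 + math.sqrt(5)) / 2          # golden ratio
# Anchor identity: phi^2 + phi^-2 = 3 (exact, since phi^2 = phi + 1).
assert abs(phi**2 + phi**-2 - 3.0) < 1e-12
# alpha_phi = phi^-3 / 2, the MATRIX_LR override from the proposal.
alpha_phi = phi**-3 / 2
print(round(alpha_phi, 6))            # 0.118034
# UT_LOOPS = round(phi^3) = 4 weight-shared loops.
assert round(phi**3) == 4
```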
Non-record: Parallel Residuals + Hessian-Aware SDClip (3-seed mean 1.08354 BPB)
val bpb: 1.08354 (3-seed mean, std=0.00050)
Not a record. This is a small 3-seed experiment on top of PR #1394; the seed count is low and the BPB reduction too small to support a statistical claim. Posting because the changes are zero-cost, reproducible, and may be useful to others trying out different techniques.
Changes
Three zero-cost modifications on top of PR #1394; they add no extra parameters or bytes:
1. Parallel Residuals (Layers 7+)
GPT-J style parallel attention+MLP (Wang & Komatsuzaki, 2021) for the last 4 layers. Both attention and MLP read from the same input and their outputs are added in parallel: z2 = x + attn(x) + mlp(x), instead of the sequential z1 = x + attn(x); z2 = z1 + mlp(z1).
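The two forms can be sketched as a toggleable block. This is a minimal stand-in (Linear layers in place of the real attention and MLP branches; `ParallelBlock` is an illustrative name, not the repo's Block):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, d, parallel=True):
        super().__init__()
        self.attn = nn.Linear(d, d)   # stand-in for the attention branch
        self.mlp = nn.Linear(d, d)    # stand-in for the MLP branch
        self.parallel = parallel

    def forward(self, x):
        if self.parallel:
            # GPT-J style: both branches read the same pre-residual input.
            return x + self.attn(x) + self.mlp(x)
        # Sequential baseline: MLP reads the post-attention state.
        z1 = x + self.attn(x)
        return z1 + self.mlp(z1)
```

Both paths keep identical parameter counts, which is why the change is zero-cost in artifact bytes.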
I expected parallel residuals to reduce interference between attention and MLP during GPTQ calibration. Pre-quant BPB barely moved, but the quantization gap tightened across all 3 seeds, which made this the most useful change in practice.
2. Hessian-Aware SDClip
I used GPTQ's existing Hessian diagonal as a cheap importance signal to slightly modulate SDClip thresholds by row:
where $\sigma_i$ is the standard deviation of row $i$ and $r_i$ is the row importance derived from Hessian-weighted magnitude. The effect is small but directionally useful at $\lambda = 0.175$; higher $\lambda$ hurt compression. I initially used $\lambda = 0.30$ but found $\lambda = 0.175$ consistently better across seeds — both lower BPB and a smaller artifact. Higher $\lambda$ reduces rounding error but increases entropy, which makes Brotli compression less effective.
3. Progressive Recurrence
Depth recurrence split into two phases: first loop enabled at 50% of training, second at 65%. The split points were not optimized — 50% matches the original and 65% was a single manual choice. Enabling both loops at once causes a sharper loss spike; splitting gives the model time to adapt to each additional pass before adding the next.
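The two-phase schedule can be expressed as a small helper. This is an illustrative sketch (the function name and signature are mine, not the repo's); the split fractions are the 50%/65% points described above:

```python
def active_extra_loops(step, total_steps, starts=(0.50, 0.65)):
    """Number of extra recurrence passes enabled at this training step.

    Each entry in `starts` is the fraction of training at which one more
    loop switches on: 0 loops before 50%, 1 from 50%, 2 from 65%.
    """
    frac = step / total_steps
    return sum(frac >= s for s in starts)

print(active_extra_loops(40, 100))   # 0
print(active_extra_loops(55, 100))   # 1
print(active_extra_loops(70, 100))   # 2
```

Staggering the starts is what avoids the single large loss spike seen when both loops switch on at once.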
Hessian Analysis (Cross-Seed)
Hessian diagnostics from 3 seeds, 67 matrices each:
Importance hierarchy: early blocks (30x trace of late blocks) >> loop >> mid >> late. Per-row importance is too noisy to be a reliable signal, but group-level traces are very stable across seeds. This suggests per-group clip allocation could be a useful direction.
Future Directions
Several ideas I'd like to explore with more compute time:
Run Command
Requirements
Flash Attention 3 (Hopper) required. SP8192 BPE tokenizer trained on FineWeb 10B (sentencepiece BPE, 8192 vocab).
pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install --no-cache-dir \
  "https://download.pytorch.org/whl/cu130/flash_attn_3-3.0.0-cp39-abi3-manylinux_2_28_x86_64.whl"
pip install -r requirements.txt

Compliance (Track A — Fixed Predictor)
Credits
Learned from and inspired by PR #1394 (@clarkkev) — SDClip, depth recurrence, and GPTQ embedding quantization ideas. Parallel residuals from GPT-J (Wang & Komatsuzaki, 2021). Additional credits: PR #1204 (@msisovic, depth recurrence), PR #1217 (@bigbag, MuonEq-R), PR #1019 (@abaybektursun, previous SOTA).