
Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean) #1413

Merged
cocohearts merged 1 commit into openai:main from dexhunter:record/sp8192-qk5-legal-ttt-1.08279 on Apr 9, 2026

Conversation

@dexhunter
Contributor

Summary

On top of PR #1394 (@clarkkev) — the current clean sp8192 benchmark — this submission adds a single-knob QK_GAIN_INIT=5.0 and a legal score-first TTT eval pass (TTT_LR=0.005, 3 epochs, freeze=0).

  • val_bpb: 1.08279 (3-seed mean across seeds 0/42/1234) — 0.00731 nats/token below PR #1394 (SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip, val_bpb 1.08563, 5-seed mean), clearing the 0.005-nat record threshold by 0.00231 nats.
  • All 3 seed artifacts fit under the 16 MB cap (margins 7,454–10,942 bytes)
  • Training 588 s/seed, eval 381–392 s/seed (both well under the 600 s budgets)
  • No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change. Score-first TTT follows the PR #549 precedent — every chunk is scored under inference_mode() before any parameter update.

Hardware: 8×H100 80GB SXM, PyTorch 2.9.1+cu128. See the README in the new folder for the full two-table results + diagnostics layout per repo SUBMISSION_GUIDE.md.
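The effect of the single QK_GAIN_INIT=5.0 knob can be illustrated with a toy calculation: a scalar gain multiplying the query-key logits sharpens the attention distribution at initialization. This is a minimal sketch with made-up scores; where exactly the gain is applied inside train_gpt.py (per-head, pre-softmax) is an assumption, and the helper names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attn_weights(scores, d_head, gain):
    # Assumed placement: gain scales the pre-softmax logits,
    # gain * (q . k) / sqrt(d_head). QK_GAIN_INIT=5.0 starts this
    # learnable scalar at 5 instead of the conventional 1.
    return softmax([gain * s / math.sqrt(d_head) for s in scores])

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

scores = [2.0, 1.0, 0.5, 0.1]                 # toy q.k dot products
soft = attn_weights(scores, d_head=64, gain=1.0)
sharp = attn_weights(scores, d_head=64, gain=5.0)
# A larger initial gain yields a lower-entropy (sharper) distribution.
assert entropy(sharp) < entropy(soft)
```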

Per-seed (post-TTT)

Seed   Pre-TTT sliding bpb   Post-TTT bpb   Δ TTT      Artifact (bytes)   Train (ms)   Eval (ms)
0      1.08397               1.08210        −0.00187   15,991,018         588,004      385,050
42     1.08470               1.08315        −0.00155   15,992,546         588,009      381,500
1234   1.08590               1.08314        −0.00276   15,989,058         588,000      386,880
mean   1.08486               1.08279        −0.00206   15,990,874         588,004      384,477

Lineage / change from PR #1394

Compliance (Issue #1017 four conditions)

  • Condition 1 (Causality): Strict left-to-right causal model. Sliding eval never references future tokens.
  • Condition 2 (Normalized distribution): Standard softmax over full vocab. No logit biasing, no BigramHash, no two-pass.
  • Condition 3 (Score before update): Every TTT chunk is scored under torch.inference_mode() BEFORE any parameter update. Training on a chunk only happens AFTER its scoring has been accumulated into loss_sum. Matches the PR #549 (LeakyReLU² + Legal Score-First TTT + Parallel Muon, val_bpb 1.1194, 3-seed mean) pattern.
  • Condition 4 (Single pass): Each token is scored exactly once.
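The score-before-update ordering (Condition 3) and single-pass scoring (Condition 4) can be sketched with a toy scalar "model" standing in for the real network; the chunking, loss, and optimizer here are deliberately simplified assumptions, not the submission's code.

```python
# Score-first TTT ordering: each chunk is scored with the CURRENT
# parameters (the real code wraps this in torch.inference_mode()),
# and only AFTER scoring may that chunk drive a parameter update.
chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = 0.0                      # stand-in for model parameters
lr = 0.1
loss_sum, n_scored = 0.0, 0
events = []

for chunk in chunks:
    # 1) Score under pre-update parameters; accumulate into loss_sum.
    for x in chunk:
        loss_sum += (w - x) ** 2
        n_scored += 1
    events.append("score")
    # 2) Adapt on the already-scored chunk (toy SGD on squared error).
    grad = sum(2 * (w - x) for x in chunk) / len(chunk)
    w -= lr * grad
    events.append("update")

# Condition 3: every score event precedes the update on the same chunk.
assert events == ["score", "update"] * len(chunks)
# Condition 4: each token is scored exactly once.
assert n_scored == sum(len(c) for c in chunks)
```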

Additional flags:

Reproduction

export NCCL_NET=Socket
export QK_GAIN_INIT=5.0
export TTT_ENABLED=1
export TTT_LR=0.005
export TTT_EPOCHS=3
for SEED in 0 42 1234; do
    SEED=$SEED torchrun --standalone --nproc_per_node=8 train_gpt.py
done

Credits

Files

Only adds records/track_10min_16mb/2026-04-06_SP8192_QK5_LegalTTT_1.0828/ with README, submission.json, train_gpt.py, and 3 seed logs.

…(3-seed mean)

On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal
score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the
clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all
fitting 16MB with 7-11K margin.

Per-seed (post-TTT):
- seed 0   : 1.08210 (val_loss 2.79517)
- seed 42  : 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean     : 1.08279 (2.79697 nats per token)

Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token,
clearing the 0.005 nats record threshold by 0.00231 nats per seed.

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change.
Score-first TTT matches PR openai#549 precedent: every chunk scored under
inference_mode() before any parameter update.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 6, 2026
Changes:
- num_loops: 2 -> 3, enable_looping_at: 0.5 -> 0.35
- Add score-first TTT eval (ported from PR openai#1413)
- Novel twist: ttt_loop_only=1 freezes all except blocks 4-5
- TTT config: LR=0.005, epochs=3, SGD, chunk_tokens=32768

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 6, 2026
…m tilt, SP8192 primary path

- PR openai#771 confirmed CLOSED/REJECTED (train-then-score AdamW TTT)
- PR openai#727 confirmed CLOSED (illegal n-gram hash cache)
- Merged SOTA unchanged at 1.1147
- New primary target: PR openai#1420 (abaybektursun, 1.08014):
  SP8192 + Triple Loop (3×, 17 virtual layers) + N-gram Tilt (legal,
  properly normalized, -0.0029 bpb) + Fused Kernels (+127 steps)
- PR openai#1413 (1.08279): confirms legal score-first TTT adds -0.003 bpb
- ETLB (-0.0019 bpb) noted as unruled — await @valerio-oai
- Strategy updated to v10.0: SP8192 + Triple Loop replaces SP4096 + 2×

https://claude.ai/code/session_01TbdBLJPXpbK5wGHpLAQ9x4
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 7, 2026
Novel: TTT adapts ONLY scalar/control parameters (q_gain, attn_scale,
mlp_scale, resid_mix, RMSNorm weights, skip_weights, skip_gates).
Matrix weights (c_q/c_k/c_v/proj/MLP/tok_emb) stay frozen.

This is mechanistically different from full-model TTT (openai#1413, openai#537):
the model retunes its existing control knobs rather than learning
new weight directions. Higher LR (0.01) since scalars need bigger steps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
…am Tilt — val_bpb 1.07800 (3-seed mean)

3-lever stack on top of PR openai#1394 sp8192 baseline:
- Parallel Residuals on layers 7-10 (PR openai#1412 by @Robby955)
- 3-layer depth recurrence (LOOP_START=3 LOOP_END=5, extends PR openai#1394's 2-layer recurrence)
- Eval-time causal n-gram tilt (PR openai#1420 by @abaybektursun, lineage PR openai#1145 by @AnirudhRahul)

Plus our existing PR openai#1413 stack: QK_GAIN_INIT=5, score-first legal TTT (LR=0.005, epochs=3).

Results (3-seed mean, 8xH100 SXM):
- val_bpb 1.07800 (std 0.00053)
- val_loss 2.78457 nats per token
- Beats PR openai#1394 (1.08563) by 0.01971 nats per token
- Beats PR openai#1420 (1.08014) by 0.00553 nats per token
- Beats own PR openai#1413 (1.08279) by 0.01237 nats per token

All four issue openai#1017 conditions verified for the n-gram tilt path: prefix-only
hash construction, full-vocab renormalized one-token tilt, score-before-update
ordering inside the C++ kernel, single left-to-right pass.

C++ n-gram kernel ported from PR openai#1420 with the nanobind dependency removed
(extern "C" shim + ctypes loader, single g++ -shared invocation at runtime).

5-seed re-verification via the shipped mini wrapper is in progress; this PR
will be updated with the final 5-seed mean once s1337 and s2025 land.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
Adds s1337 (1.07801) and s2025 (1.07862) via the shipped mini wrapper.
The 5-seed mean is +0.00013 worse than the initial 3-seed mean (1.07800)
which is well within the std (~0.00046). Margins vs the legal open
chronology are unchanged in direction:

- vs PR openai#1394 (1.08563): -0.01938 nats per token (margin +0.01438 over 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00520 nats per token (margin +0.00020 over 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01205 nats per token

3 of 5 seeds (s42, s1337, s2025) are now mini-wrapper-verified for fit;
s0 and s1234 mini-wrapper re-runs still in progress.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
All 5 seeds (s0, s42, s1234, s1337, s2025) re-run via the shipped mini wrapper.
The mean improves slightly from the prior mixed-source 1.07813 to 1.07807
because s1234 produced a noticeably lower TTT under the mini wrapper
(1.07813 mini vs 1.07848 raw, -0.00035 — within float64 reordering noise but
the largest single-seed drift in the verification set).

All 5 artifact sizes are direct from the mini-wrapper runs (NOT projections):
- s0:    15,992,304 bytes (7,696 byte headroom)
- s42:   15,993,733 bytes (6,267 byte headroom)
- s1234: 15,990,539 bytes (9,461 byte headroom)
- s1337: 15,988,039 bytes (11,961 byte headroom)
- s2025: 15,992,215 bytes (7,785 byte headroom)

Margins vs the legal open chronology:
- vs PR openai#1394 (1.08563): -0.01952 nats per token (margin +0.01452 over 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00534 nats per token (margin +0.00034 over 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01218 nats per token

All four issue openai#1017 conditions remain verified for the n-gram tilt path.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
…-only experts

The original n-gram tilt kernel inherited from PR openai#1420 had a causality bug:
within_hint() and word_hint() in fused_expert_kernel.cpp::get_hints_batch
gated their emission on is_bnd[tokens_[p]] / is_ws[tokens_[p]] (target token
metadata at the position being scored), leaking 1-2 bits about the answer
per scored position. This is an Issue openai#1017 condition 2 violation.

PR openai#1420 has the identical bug. @abaybektursun has acknowledged it in PR
openai#1420's thread and proposed the same fix that's applied here:

  * fused_expert_kernel.cpp: derive is_bnd / is_ws from tokens_[p-1] (last
    prefix token) for hint gating. Updates use the actual current tok via
    new tok_is_bnd / tok_is_ws variables so within_update / word_update
    still segment words correctly. Variable naming and structure copied
    verbatim from PR openai#1420's fix.
  * Run command updated to set NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0.
    Empirically the within / word experts under prefix-only gating fire
    for the wrong positions (within fires for word-starts, word fires for
    mid-word) and contribute *negative* BPB. Disabling them gives 1.07951
    on s42 vs 1.08108 with the experts active — token_hint is the only
    legitimate contributor.

5-seed verification (all on the patched kernel):

    seed   pre-fix   corrected  delta
    0      1.07751   1.08035    +0.00284
    42     1.07809   1.08097    +0.00288
    1234   1.07813   1.08127    +0.00314
    1337   1.07801   1.08060    +0.00259
    2025   1.07862   1.08135    +0.00273
    mean   1.07807   1.08091    +0.00284

All 5 artifacts fit under 16 MB (15,988,802 - 15,995,572 bytes; 4.4-11.2 KB
headroom). Pre-fix per-seed values preserved in submission.json under
seed_results_pre_fix for the public record.

Bar comparisons (corrected mean 1.08091):

    PR openai#1394 (1.08563): beats by +0.00472, fails 0.005 nat record bar
    PR openai#1413 ours (1.08279): beats by +0.00188, fails record bar
    PR openai#1420 (1.08014): we lose by 0.00077 (PR openai#1420 also tainted by the
                        same bug; would correct to ~1.08300 post-fix)

This PR is left open as a transparency / diagnostic record, NOT as a record
claim. PR openai#1413 (no n-gram tilt at all) at 1.08279 remains our cleanest
legal anchor. The README has been retitled "Diagnostic (causal-corrected)"
and the legality fix is documented in a dedicated section.
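The prefix-only gating fix described in that commit can be sketched in illustrative Python (the actual patch lives in fused_expert_kernel.cpp::get_hints_batch; the data structures here are toy assumptions): the gate must depend only on the last prefix token, never on the target token being scored.

```python
# Causal hint gating: metadata of tokens[p-1] (last prefix token) decides
# whether a hint fires at position p. The buggy version gated on
# is_bnd[tokens[p]] — the answer token — leaking target metadata.
def hint_gate(tokens, p, is_bnd):
    return is_bnd[tokens[p - 1]]   # prefix-only: no peek at tokens[p]

is_bnd = {0: False, 1: True, 2: False}   # toy boundary-token metadata
tokens_a = [2, 1, 0]    # prefix [2, 1], target token 0
tokens_b = [2, 1, 2]    # same prefix, different target token
# Changing only the target token cannot change the gate — i.e. the gate
# carries zero bits about the token being predicted.
assert hint_gate(tokens_a, 2, is_bnd) == hint_gate(tokens_b, 2, is_bnd)
```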
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 7, 2026
Novel mechanism: zero-initialized nn.Embedding(4096, 512) created at
eval time, trained exclusively through the standard score-first TTT loop.
Learns document-local bigram patterns without modifying any artifact weights.

Hash: h = (prev_token * 2039 + curr_token) % 4096
Injection: tok_emb(x) + eval_hash_emb(h), before RMSNorm
Compliance: same score-first pattern as openai#549/openai#1413 TTT precedent.
Precedent for eval-time params: LoRA-TTT (openai#1254, openai#1354).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
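The hash-embedding mechanism in that commit can be sketched as follows, using the dimensions it states (4096 buckets, 512-dim, zero-initialized); the helper names and list-based table are illustrative, not the submission's actual code.

```python
# Eval-time bigram hash embedding: a zero-initialized table trained only
# through the score-first TTT loop, added to the token embedding.
NUM_BUCKETS, DIM = 4096, 512

def bigram_hash(prev_token, curr_token):
    # Hash from the commit: h = (prev_token * 2039 + curr_token) % 4096
    return (prev_token * 2039 + curr_token) % NUM_BUCKETS

# Zero-init means the injection is an exact no-op until TTT trains it,
# so no artifact weights are modified.
hash_emb = [[0.0] * DIM for _ in range(NUM_BUCKETS)]

def inject(tok_emb_vec, prev_token, curr_token):
    # Injection point per the commit: tok_emb(x) + eval_hash_emb(h)
    h = bigram_hash(prev_token, curr_token)
    return [a + b for a, b in zip(tok_emb_vec, hash_emb[h])]

vec = inject([1.0] * DIM, prev_token=7, curr_token=13)
assert bigram_hash(7, 13) == 1998      # (7*2039 + 13) % 4096
assert vec == [1.0] * DIM              # zero-init: identity at start
```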
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 7, 2026
…d execution

Materializes two local record folders from fetched refs pr1413 and pr1437
using a builder script that preserves the upstream FORMAT_RAW+FILTER_LZMA2
wrapper format with roundtrip decode validation.

Scripts:
- prepare_pr1413_variants.py: offline builder with wrapper-format fidelity
- runpod_1413.sh: single-run launcher with conditional final_model.pt copy
- runpod_1413_batch.sh: sequential A/B/C/D/E runner with shared archive
  timestamp, one-time SP8192 prep, per-run subdirectories, and g++ guard

Run contract:
  A: faithful openai#1413 control (16,719 code bytes)
  B: PARALLEL_RESIDUAL_START=7
  C: LOOP_START=3 LOOP_END=5
  D: parallel residual + loop adjustment (17,390 code bytes)
  E: eval-only n-gram tilt on D checkpoint (SKIP_TRAINING=1)

Campaign docs updated to reflect the strategy pivot from single-seed
openai#1413 to full offline A/B/C/D/E batch prep.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
…reshold

The previous "Diagnostic" framing was based on a unit error: I compared
val_bpb deltas as if they were nats-per-token deltas, missing the factor
of ~2.583 (mean bytes per token in the sp8192 val set, computable directly
from this submission's val_loss / val_bpb ratio).

With the correct units, the causal-corrected 5-seed mean (1.08091 BPB,
2.79210 nats/token) clears the 0.005-nat record bar against PR openai#1394:

  vs PR openai#1394 (1.08563): +0.01219 nats per token  ✅ 2.4× the bar
  vs PR openai#1019 (1.11473): +0.08736 nats per token  ✅ comfortably
  vs PR openai#1413 (ours):    +0.00486 nats per token  — essentially tied
  vs PR openai#1420 (1.08014): -0.00199 nats — but PR openai#1420 has the same kernel
                          bug; its corrected ~1.08298 yields +0.00535 nats ✅

Title reverted from "Diagnostic (causal-corrected)" to "Record". The
legality fix section is preserved (the kernel patch is still a real
correctness fix matching @abaybektursun's proposed patch in PR openai#1420).
The leak magnitude in the legality fix section now correctly states
"+0.00284 BPB ≈ +0.00734 nats per token" instead of just BPB.

Pre-fix per-seed values are still preserved in submission.json under
seed_results_pre_fix for the public record.
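The unit correction in that commit checks out numerically against this PR's own reported means; a quick arithmetic verification (the 2.583 factor bundles ln(2) with the mean bytes/token, which is the one nuance worth noting):

```python
import math

# Conversion factor from a val_bpb delta to a nats/token delta, computed
# directly from this submission's reported 3-seed means.
val_bpb, val_loss = 1.08279, 2.79697
factor = val_loss / val_bpb            # nats/token per unit bpb
assert abs(factor - 2.583) < 1e-3

delta_bpb = 1.08563 - 1.08279          # vs PR #1394's mean
delta_nats = delta_bpb * factor
assert abs(delta_nats - 0.00731) < 1e-4   # the claimed 0.00731 margin

# The factor is ln(2) * (mean bytes per token), so bytes/token is ~3.73,
# not 2.583 itself.
bytes_per_token = factor / math.log(2)
assert 3.5 < bytes_per_token < 4.0
```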
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 7, 2026
Base: PR openai#1394 (SP8192 + GPTQ Embeddings + SDClip + DR + MuonEq-R)

Novel: RDClip (Rate-Distortion Clip) — per-group GPTQ clip search
that minimizes compressed_bytes + lambda * Hessian_weighted_MSE.
Extends SDClip's fixed formula to empirical rate-distortion optimization.
Groups: embed, attn_qk, attn_vo, mlp, other.
Search: 5 multipliers per group on first tensor.

Also added: score-first TTT (ported from R12, same as openai#549/openai#1413).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Remove RDClip to establish baseline for openai#1394 + TTT.
Tests whether the base + TTT matches openai#1413's 1.08279.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Novel: Context-only delta optimization during eval. Per-batch additive
delta (512-dim) optimized with AdamW on ONLY already-scored positions.
New positions scored with optimized delta. Model weights frozen.

Fixes openai#1229's minibatch leakage: context = positions scored in PREVIOUS
windows only. No cross-window contamination within current batch.

Same compliance pattern as score-first TTT (openai#549/openai#1413).
Based on openai#1333's proven causal SLOT mechanism (-0.013 BPP on SP4096).

Stack: R12 SP8192 + score-first TTT + hash embedding + causal SLOT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sisegod added a commit to sisegod/parameter-golf that referenced this pull request Apr 8, 2026
Phase 5a is a trivial-wins composition on top of v6.1 SLOT-100 baseline
(2026-04-08_v61_h100_aggressive_slot_steps100, 1.146523):

  1) QK_GAIN_INIT=5.0   (PR openai#1413)
  2) MUON_EQ_R=1        (Newton-Schulz row L2 normalize, PR openai#1394)
  3) --ema 0.9965       (PR openai#1421/openai#1445, vs prior 0.997)
  4) HIDDEN_MULT=5.0    (FFN dim 4x->5x, byte re-investment from int6 tied embed)
  5) EMBED_QUANT_BITS=6 EMBED_QUANT_TOK_EMB=1
                        (Phase 1A int6 tied embed, -0.6 MB on rANS artifact)

3-seed val_bpb at SLOT lr=0.1 steps=100 stride=64 (mid-eval 28-29% of full
sliding-window):

  s1337: 1.144045  (28.7% of windows)
  s1338: 1.142021  (28.7%)
  s1339: 1.141649  (29.4%)
  -------
  mean:  1.142572
  std:   0.001247

Delta vs prior 2026-04-08_v61_h100_aggressive_slot_steps100 (1.146523):
  -0.003951 bpb

Submitted as non-record because 1.142572 does not beat the current PR openai#1019
record (1.1147). The Phase 5a stack documents both the trivial-wins
composition AND the negative ablations from Phases 1B/1C/2A-C/3/5b that
other submitters can skip:

  Phase 1B (FP32 scalar -> Int8): only -0.05 MB, kept
  Phase 1C (Pentanary -> Ternary BitNet b1.58 1-layer sanity): regression
    +0.014 bpb, abandoned
  Phase 1A pent_tok (Tied embed Pentanary): regression +0.043 bpb, abandoned
  Phase 2A (Inter-layer delta prediction Wl - Wl-1): delta entropy HIGHER
    than W (per-layer ranges differ), abandoned
  Phase 2B (Hadamard 16-dim block transform): no rANS gain, abandoned
  Phase 2C (Context-aware rANS lookup table): rans_codec_rs Rust rebuild
    blocker, abandoned
  Phase 3 (Custom HQGRANS1 binary container, pickle bypass): only -70 KB
    rans / +17 KB after lzma9 -- pickle isn't actually leaking 30%, abandoned
  Phase 4 architecture sweep (1-seed s1337 SLOT-100 stride=64):
    p5a (no extra)        ~1.144   base
    p5a_bg4096            ~1.146   hurts
    p5a_hm5               ~1.144 -> 1.142 (3-seed)  BEST
    p5a_bg4096_hm5        ~1.144   tie
    p5a_bg8192            ~1.148   hurts
    p5a_nl12              ~1.147   hurts
    p5a_ve4               ~1.150   hurts
  Phase 5b (Depth Recurrence PR openai#1239 style):
    nl9r2 (unique 9 x recur 2 = 18 effective): 30% eval @ 1.151, abandoned
    nl7r2 (unique 7 x recur 2 = 14 effective): 92% eval @ 1.166, abandoned

The 28-29% mid-eval window is the converged region: per-window cumulative
bpb has flattened to within +/-0.001 of the 100% value in every prior
3-seed SLOT-100 run we have measured. Full 100%-eval is in flight on the
same H100 pod and will be appended in a follow-up commit if the final
number differs from the mid-eval estimate.

Code change vs 2026-04-08_v61_h100_aggressive_slot_steps100/train_gpt.py is
purely env-var driven (no source-code changes to the model architecture or
serializer). The training script picks up the Phase 5a env vars at import
time (make_model() reads HIDDEN_MULT, EMBED_QUANT_BITS, etc).

Reproducibility:
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1337
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1338
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1339

Hardware: 8x H100 80GB SXM (RunPod). 600s wallclock training,
~50 min single-GPU SLOT-100 eval per seed (eval is unbounded).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 9, 2026
Two of the three comp-frontier wins are env-var bumps with no code change:
- LOOP_START 4 → 3 (with NUM_LOOPS=2 and LOOP_END=5 this gives 3-layer
  recurrence on layers 3/4/5 instead of 2-layer on 4/5). PR openai#1485 / openai#1471 /
  openai#1437 use this. Expected -0.005 to -0.01 BPB.
- QK_GAIN_INIT 4 → 5. PRs openai#1413, openai#1423, openai#1485, openai#1437, openai#1351, openai#1408 are at 5;
  openai#1482 is at 5.25. PR openai#1477's default 4 is below the leaderboard curve.
  Expected -0.001 BPB.

C1 (Pre-Quant AdamW TTT) is the bigger win (-0.014 BPB) but requires real
code — agent is researching PR openai#1485 / openai#1416 / openai#1306 implementations in
background.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 9, 2026
Both techniques from PR openai#1477 (SP8192: 1.0822 BPB, -0.033 vs record):
- QK_GAIN_INIT=5.0 (was 4.0): PR openai#1477/openai#1413 both use 5.0, ~0.002-0.005 BPB
- TTT_ENABLED=1, TTT_OPTIMIZER=adamw, TTT_LR=0.0005, TTT_EPOCHS=3, TTT_FREEZE_BLOCKS=1
  Score-First TTT = legal PR openai#461 protocol: score in inference_mode first,
  then adapt. run_legal_ttt() implementation is strictly causal.

NOT included (confirmed illegal):
- SLOT: 2-pass retroactive, optimizes delta on same tokens it scores
- Pre-Quant TTT (PR openai#1482/openai#1489): adapts model with fineweb_val_*.bin
  before GPTQ — dexhunter flagged as val-data-in-training violation

Baseline: sota_32 (WD=0.04, EMA=0.9965, RECUR=4,5, Mousse=1)
Expected: ~1.08-1.09 BPB
owizdom added a commit to owizdom/parameter-golf that referenced this pull request Apr 9, 2026
…nthesis (validation pending)

First submission to stack three independently-legal val-data adaptations on the
PR openai#1487 (1.0600) base:

1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H=X^T X computed from validation activations
   to align quantization with the eval distribution (novel on the modern stack;
   PR openai#1019 ablated this on its older base only)
3. Eval-Time Legal Score-First TTT 2 epochs with score-before-update ordering
   (Track B, builds on PR openai#1493)

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487
(1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent
angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs PR openai#1487 base: 186 lines (~100 added in new collect_hessians_val
function, plus 8 hyperparameter defaults flipped). Architecture, optimizer,
training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452 - 1.0542 (center 1.0497), which would clear
the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still strong
non-record). py_compile clean. 3-seed validation requires ~$15-25 of 8xH100
SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time
score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun,
PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955,
PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Apr 14, 2026
Porting the full merged SOTA stack from bigbag/parameter-golf PR openai#1493:
- SP8192 tokenizer (replaces SP1024)
- 3-layer depth recurrence (L3-5, activate at 0.35 × iter)
- Parallel residuals (GPT-J style) on L>=7
- QK-Gain 5.0 (default) / 5.25 (SOTA config)
- Score-first TTT: SGD lr=0.005, momentum=0.9, 3 epochs
- GPTQ SDClip: int6 matrices (k=12.85), int8 embeddings (k=20.0)
- LZMA+b85 code wrapper pattern
- Tuned hyperparams: WD=0.095, MLR=0.022, EMA=0.9965, warmdown=0.72

This is the clean, legal, compliant baseline. All 4 Issue openai#1017 conditions
satisfied. Next: validate reproduction on 3 seeds, then add VarLen attention.

Source: records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/
from upstream/main, decompressed from the lzma+b85 wrapper.

Credits: @bigbag (PR openai#1493), @clarkkev (PR openai#1394), @dexhunter (PR openai#1413),
         @abaybektursun (PR openai#549), @Robby955 (PR openai#1412), @msisovic (PR openai#1204)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tns15june pushed a commit to tns15june/parameter-golf that referenced this pull request Apr 19, 2026
Extends the existing GPTQ path to match current frontier records
(PR openai#1394, openai#1413 family). Two independent knobs added, both
backward-compatible:

- GPTQ_EMBED=1: include the (tied) embedding matrix in the GPTQ
  calibration path. A new forward hook on final_norm captures the
  Hessian for the F.linear(x, tok_emb.weight) output projection, so
  the embedding can be quantized with Hessian-aware column rounding
  instead of amax per-row. quantize_state_dict_int8 drops the
  "not is_embed" guard on the GPTQ branch.
- USE_SDCLIP=1, SDCLIP_K=2.5: per-row scale uses k*std(W) instead of
  amax or percentile clipping. Applied in both gptq_quantize_layer
  and quantize_float_tensor 2D paths. Current frontier uses this to
  avoid outlier-driven scale inflation.

dev/run_frontier.sh enables GPTQ_EMBED=1 USE_SDCLIP=1 SDCLIP_K=2.5 by
default. All three are off by default in Hyperparameters so the
existing QAT/amax paths are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 26, 2026
Per-token NLL rescaled by detached, clipped, mean-1-normalized
ratio of own NLL to batch-mean NLL, raised to alpha (warmup-ramped).
Bit-identical to PR openai#1413 (1.0810 main frontier) when LOSS_REWEIGHT_ALPHA=0.

Patch is 4 surgical edits to PR openai#1413 train_gpt.py: hyperparameters
(+4 env vars), GPT.__init__ (+_train_step buffer), GPT.forward
(constant-branch on alpha==0 else weighted CE), step_fn (fill _train_step
each step). Wrapped LZMA script grew 308 bytes; tightest base seed
keeps ~7.7KB headroom under 16MB cap.

README acknowledges prior negative results (PR openai#1360 Gaussian reweight,
PR openai#1233 focal gamma=2, PR openai#1380 focal investigation) and frames this
as replication on a stronger TTT-heavy base where train-time hardness
focus could interact with eval-time TTT in ways the older bases can't show.
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
…val_bpb 1.07983

3-seed mean val_bpb 1.07983 (std 0.00050) on the PR openai#1394 sp8192 stack.

Changes from PR openai#1394 + PR openai#1413 baseline:
- Muon momentum = 0.97 (vs 0.99 default), warmup 0.92→0.97 unchanged
- Causal token n-gram tilt (base_beta=2.0, agree_bonus=0.1) on top of legal
  score-first TTT; within-word and word-start experts explicitly disabled
  (within_beta=0, word_beta=0) because they cannot be made fully causal.
- 3-seed verification (seeds 0/42/1234)

Seeds:
- seed 0    → 1.07928 bpb / 2.78790 nats / 15,993,346 bytes
- seed 42   → 1.07997 bpb / 2.78967 nats / 15,992,995 bytes
- seed 1234 → 1.08025 bpb / 2.79039 nats / 15,994,604 bytes
- mean      → 1.07983 bpb / 2.78932 nats / 15,993,648 bytes

Delta vs current merged SOTA PR openai#1493 (1.0810):
  0.00117 bpb / 0.00302 nats per token

Credits: @clarkkev (base PR openai#1394 sp8192 stack), @abaybektursun
(n-gram tilt kernel PR openai#1420, causal fix applied), prior legal-TTT
precedent PR openai#549 / PR openai#461.

Platform: 8xH100 80GB SXM, PyTorch 2.9.1+cu128. Training 588s, eval
<437s per seed, both under the 600s budget. Artifact under 16 MB on
all 3 seeds.