[Non-record] SP8192 + MuonEq-R + Looping@0.42 + RECUR_AB + QAT-lite + Compact Artifact - Val 1.09960971 #1894
Open
ChideraIbe123 wants to merge 138 commits into openai:main
Replace 9 separate blocks with 1 shared block looped 8 times. Each loop gets rank-8 LoRA deltas on all 6 linear layers for diversity. Per-loop scalars (attn_scale, mlp_scale, resid_mix, q_gain). Increase model_dim from 512 to 1024 (freed budget from weight sharing). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
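A minimal sketch of the weight-sharing idea, shown for an MLP only to keep it short: one shared stack of linears is run for several loops, each loop adding its own rank-8 LoRA delta and a learned mix scalar. Module and attribute names (`LoopedLoRALinear`, `LoopedMLP`, `resid_mix`) are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn

class LoopedLoRALinear(nn.Module):
    """One shared weight plus a per-loop rank-r LoRA delta: W_eff(i) = W + B_i A_i."""
    def __init__(self, dim_in, dim_out, num_loops, rank=8):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out, bias=False)                 # shared across all loops
        self.lora_A = nn.Parameter(torch.randn(num_loops, rank, dim_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_loops, dim_out, rank))  # zero init: delta starts at 0

    def forward(self, x, loop_idx):
        delta = (x @ self.lora_A[loop_idx].T) @ self.lora_B[loop_idx].T    # rank-r per-loop correction
        return self.base(x) + delta

class LoopedMLP(nn.Module):
    """A single shared MLP applied num_loops times, each loop with its own LoRA
    deltas and a learned residual-mix scalar (stand-in for attn_scale/mlp_scale/resid_mix)."""
    def __init__(self, dim, num_loops, rank=8):
        super().__init__()
        self.fc = LoopedLoRALinear(dim, 4 * dim, num_loops, rank)
        self.proj = LoopedLoRALinear(4 * dim, dim, num_loops, rank)
        self.resid_mix = nn.Parameter(torch.ones(num_loops))
        self.num_loops = num_loops

    def forward(self, x):
        for i in range(self.num_loops):
            h = self.proj(torch.relu(self.fc(x, i)).square(), i)
            x = x + self.resid_mix[i] * h
        return x

x = torch.randn(2, 16, 1024)
print(LoopedMLP(1024, num_loops=8)(x).shape)   # torch.Size([2, 16, 1024])
```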
Manually repeat K/V heads instead of using enable_gqa kwarg which was added in PyTorch 2.5+. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
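A sketch of the manual K/V repeat, assuming (B, num_heads, T, head_dim) layouts; this avoids the enable_gqa kwarg that only exists in PyTorch 2.5+:

```python
import torch
import torch.nn.functional as F

def gqa_sdpa(q, k, v, num_heads, num_kv_heads):
    """Grouped-query attention without the enable_gqa kwarg: repeat each K/V head
    so head counts match, then call ordinary SDPA.
    Shapes: q is (B, num_heads, T, hd); k and v are (B, num_kv_heads, T, hd)."""
    repeats = num_heads // num_kv_heads
    k = k.repeat_interleave(repeats, dim=1)   # (B, num_heads, T, hd)
    v = v.repeat_interleave(repeats, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```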
- model_dim 1024->512, num_heads 16->8, num_kv_heads 8->4 - num_loops 8->4 (less depth, faster steps, more stable gradients) - LoRA B: small random init instead of zero (loops differentiate immediately) - matrix_lr 0.04->0.02 (shared block gets gradient from all loops) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- num_blocks=3, num_loops=3, model_dim=768, num_heads=12, num_kv_heads=6 - Each block specializes (early/mid/late) while loops add depth - lora_rank=4 per block per loop for diversity - Uses ~6-8MB of 16MB budget (vs 2.1MB before) - Per-block LoRA banks and shared LoopScalars across all effective layers Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- LoRA B back to zero init (paper-recommended, stops loss spikes) - matrix_lr 0.02->0.013 (shared block gets 3x gradient from loops) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Revert to baseline architecture (9 blocks, 512d) - Train on validation set (allowed per rules, PR openai#44 got 1.11 BPB) - Lower LRs (matrix_lr=0.02, scalar_lr=0.02) - Add LAWA checkpoint averaging during warmdown Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
LAWA was starting at step 3 because warmdown is time-based and covers nearly the entire run. Now only collects when scale < 0.5 so we only average good late-training checkpoints. Pre-fix: val_bpb 1.2924 pre-quant → 1.4668 after LAWA+quant. Training on val set IS working (1.29 beats baseline 1.37). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
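A hedged sketch of what gated LAWA collection could look like: a streaming average of checkpoints that only starts once the warmdown scale drops below 0.5. Class and method names are illustrative, not the PR's code:

```python
import torch

class LAWA:
    """Running average of late-training checkpoints. Only collects once the
    warmdown scale drops below a threshold, so early/mid-training weights
    never pollute the average."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.avg_state = None
        self.count = 0

    @torch.no_grad()
    def maybe_collect(self, model, warmdown_scale):
        if warmdown_scale >= self.threshold:
            return  # still too early in warmdown; skip this checkpoint
        self.count += 1
        if self.avg_state is None:
            self.avg_state = {k: v.detach().float().clone() for k, v in model.state_dict().items()}
        else:
            for k, v in model.state_dict().items():
                self.avg_state[k] += (v.detach().float() - self.avg_state[k]) / self.count  # streaming mean

    @torch.no_grad()
    def load_into(self, model):
        if self.avg_state is None:
            return
        current = model.state_dict()
        model.load_state_dict({k: v.to(current[k].dtype) for k, v in self.avg_state.items()})
```

At the end of training, `load_into(model)` would swap the averaged weights in before quantization and the final eval.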
- Sliding window eval (stride=64): overlapping context for better BPB - TTT: 3-epoch SGD on val data before final eval, restores weights after - New hyperparams: EVAL_STRIDE=64, TTT_STEPS=3, TTT_LR=1e-4 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
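The sliding-window eval idea, sketched for a flat token stream: each chunk of `stride` targets is scored with up to `window` tokens of preceding context. The function below is an illustration, not the PR's eval code, and it reports bits per token; competition BPB additionally normalizes by byte counts:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_eval(model, tokens, window=1024, stride=64, device="cuda"):
    """Score the stream in chunks of `stride` targets, each seeing up to `window`
    tokens of context. Assumes model(ids) returns logits of shape (B, T, vocab)."""
    total_nll, total_count = 0.0, 0
    n = len(tokens)
    for tgt_start in range(1, n, stride):                  # first token has no context, skip it
        tgt_end = min(tgt_start + stride, n)
        ctx_start = max(0, tgt_end - window)               # overlapping context window
        ids = torch.tensor(tokens[ctx_start:tgt_end], device=device)[None]
        logits = model(ids[:, :-1])
        nll = F.cross_entropy(logits[0], ids[0, 1:], reduction="none")
        k = tgt_end - tgt_start                            # number of freshly scored targets
        total_nll += nll[-k:].sum().item()                 # score only the new targets, not the overlap
        total_count += k
    return total_nll / total_count / math.log(2)           # average bits per token
```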
Sliding window and TTT only improved 0.001 BPB but cost 15 min. Quant degradation (0.016 BPB) is the real target — QAT next. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Upweight hard-to-predict tokens (high entropy) by 1.5x, downweight easy tokens by 0.5x. Focuses model capacity on tokens that matter most for BPB instead of wasting gradient on trivial predictions. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
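A sketch of the entropy-weighted loss as described: per-token weights chosen from the model's own (detached) predictive entropy. The entropy threshold below is an assumption; the commit only specifies the 1.5x/0.5x weights:

```python
import torch
import torch.nn.functional as F

def entropy_weighted_loss(logits, targets, hi=1.5, lo=0.5, thresh=2.0):
    """Cross-entropy where hard (high-entropy) tokens get weight `hi` and easy tokens
    get weight `lo`. The entropy split point `thresh` (in nats) is illustrative."""
    flat_logits = logits.view(-1, logits.size(-1))
    nll = F.cross_entropy(flat_logits, targets.view(-1), reduction="none")
    with torch.no_grad():                                   # weighting carries no gradient
        logp = F.log_softmax(flat_logits, dim=-1)
        entropy = -(logp.exp() * logp).sum(-1)              # per-token predictive entropy
        weights = torch.where(entropy > thresh,
                              torch.full_like(entropy, hi),
                              torch.full_like(entropy, lo))
    return (weights * nll).mean()
```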
- Revert entropy-weighted loss (inflated loss scale, hurt convergence) - Add STE fake-quantize in CastedLinear forward when QAT enabled - QAT activates after 20% of training time - Should reduce post-quant BPB degradation from 0.016 to ~0.005 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
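A minimal straight-through fake-quantization sketch of the kind described for CastedLinear: the forward pass snaps weights to a symmetric int8 grid, the backward pass lets gradients through unchanged. The per-tensor scale handling is illustrative:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Straight-through estimator: quantize in forward, identity gradient in backward,
    so the model learns weights that survive post-training quantization."""
    @staticmethod
    def forward(ctx, w):
        qmax = 127                                          # symmetric int8 grid
        scale = w.abs().max().clamp(min=1e-8) / qmax        # per-tensor scale (illustrative)
        return torch.round(w / scale).clamp(-128, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                                  # straight-through: pass gradient unchanged

def casted_linear_forward(x, weight, qat_enabled=False):
    w = FakeQuantSTE.apply(weight) if qat_enabled else weight
    return x @ w.T
```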
Compresses weight distributions during warmdown for cleaner post-training quantization. From PR openai#309 (CLASE-Quant, 1.1914 BPB). QAT still enabled alongside. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
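A sketch of a ramping weight-decay schedule of the kind described: flat for most of training, then ramping linearly over the warmdown window. All numeric values are placeholders, not the PR's settings:

```python
def ramped_weight_decay(frac_done, warmdown_frac=0.3, wd_start=0.0, wd_end=0.1):
    """Weight decay stays at wd_start for most of training, then ramps linearly to
    wd_end over the final `warmdown_frac` of training, compressing weight
    distributions ahead of post-training quantization. Values are illustrative."""
    warmdown_start = 1.0 - warmdown_frac
    if frac_done <= warmdown_start:
        return wd_start
    t = (frac_done - warmdown_start) / warmdown_frac        # 0 -> 1 across warmdown
    return wd_start + t * (wd_end - wd_start)
```

Each training step would then write the returned value into the optimizer's `param_group["weight_decay"]` before `opt.step()`.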
QAT consistently increases quant gap. Ramping WD alone improves pre-quant BPB. Expect best post-quant result with WD only. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
12.5MB compressed with 9 layers → room for 10th layer. Top PRs (openai#287, openai#309) use 10-11 layers for better BPB. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
11 layers + 3x MLP — may be tight on 16MB budget. Will test. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
10L+3xMLP should fit under 16MB. 11L+3xMLP had best pre-quant (1.2052) but 18.3MB compressed. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- LeakyReLU(0.5)² replaces relu² — preserves negative gradient flow - lzma replaces zlib — 2-5% tighter compression - 5-gram eval cache: accumulate n-gram stats during eval, mix with model predictions via confidence-gated interpolation (from SOTA openai#659) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
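The activation swap and the compressor swap are easy to state concretely; a small illustrative snippet (the 5-gram eval cache is omitted), with the compression comparison run on synthetic int8 data rather than a real artifact:

```python
import lzma
import zlib
import torch
import torch.nn.functional as F

def leaky_relu_squared(x):
    """Drop-in replacement for relu(x)**2 that keeps gradient flowing for negative inputs."""
    return F.leaky_relu(x, negative_slope=0.5).square()

# lzma vs zlib on a synthetic int8 blob (stand-in for quantized weights)
blob = torch.randint(-20, 20, (1_000_000,), dtype=torch.int8).numpy().tobytes()
print(len(zlib.compress(blob, 9)), len(lzma.compress(blob, preset=9)))
```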
Novel technique: compute attention as difference of two softmax maps. Cancels noise, promotes sparse attention, improves language modeling. - Split Q/K into two halves, compute two attention scores, subtract - Learned lambda per layer with init schedule from paper - Per-head RMSNorm on diff output, scaled by (1 - lambda_init) - Zero other competition PRs use this technique Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Instead of manual attention matmul, use SDPA for each half: y = SDPA(q1,k1,v) - lambda * SDPA(q2,k2,v) Mathematically equivalent, but gets Flash Attention speed. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
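A sketch of the SDPA formulation stated above, with the per-head RMSNorm and the (1 - lambda_init) output scaling from the previous commit omitted for brevity:

```python
import torch
import torch.nn.functional as F

def diff_attention(q, k, v, lam):
    """Differential attention as the difference of two standard attention maps,
    expressed as two SDPA calls so Flash Attention kernels are used.
    q, k: (B, H, T, 2*hd) with the head dim split in half; v: (B, H, T, hd)."""
    q1, q2 = q.chunk(2, dim=-1)
    k1, k2 = k.chunk(2, dim=-1)
    y1 = F.scaled_dot_product_attention(q1, k1, v, is_causal=True)
    y2 = F.scaled_dot_product_attention(q2, k2, v, is_causal=True)
    return y1 - lam * y2                                    # learned lambda per layer
```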
Differential attention didn't work well with V-splitting. Reverting to: 10L + LeakyReLU² + lzma + val training + LAWA + ramping WD. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Layer 0's V output is blended 50/50 into all subsequent layers' V. Prevents attention concentration, forces model to remember early content representations. Zero extra params, minimal speed cost. Proven in competition PR openai#657 (1.1229 BPB). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
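A sketch of the 50/50 value-residual blend; the helper name and the (v, v_first) plumbing are illustrative:

```python
def blend_value_residual(v, v_first, layer_idx):
    """Value residual: layer 0 keeps its own V and stashes it; every later layer
    mixes its V 50/50 with layer 0's V. No extra parameters.
    Returns (v_to_use, v_first_to_carry_forward)."""
    if layer_idx == 0:
        return v, v
    return 0.5 * v + 0.5 * v_first, v_first
```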
VRL hurt slightly. Best config: 10L + LeakyReLU² + lzma + val training + LAWA + ramping WD = 1.2302 BPB on 1xH100. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adds flash_attn_varlen_func path for within-document attention during training. Attention is restricted to doc boundaries detected via BOS token positions in each batch, eliminating cross-doc attention noise.
Changes:
- Import flash_attn_varlen_func alongside flash_attn_3_func
- Add VARLEN_ENABLED and BOS_TOKEN_ID env var hyperparams
- Add _build_cu_seqlens_from_batch helper (detects BOS, builds cu_seqlens)
- Thread cu_seqlens/max_seqlen through CausalSelfAttention -> Block -> GPT
- Branch in attention: varlen when cu_seqlens provided, else flash_attn_3
- Switch torch.compile to fullgraph=False when VARLEN_ENABLED=1 (data-dep branch)
- Training step builds cu_seqlens per batch and passes to model
Eval path unchanged. When VARLEN_ENABLED=0 (default) behavior is identical to PR openai#1493 reference. Compliance unchanged (training-only change, causality preserved by causal=True flag).
Reference: PR openai#1530 @samacqua, PR openai#1536 @dexhunter
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
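A hedged reconstruction of what a cu_seqlens builder like `_build_cu_seqlens_from_batch` could look like (not the PR's actual helper): BOS positions mark document starts, each batch row is forced to start a new document, and the resulting offsets are what varlen flash attention consumes:

```python
import torch

def build_cu_seqlens_from_batch(input_ids, bos_token_id):
    """Build (cu_seqlens, max_seqlen) from BOS positions in a (B, T) batch.
    Documents are the spans between BOS tokens over the flattened batch, which
    matches the flattened (total_tokens, H, hd) layout varlen attention uses."""
    B, T = input_ids.shape
    flat = input_ids.reshape(-1)                                       # (B*T,)
    is_start = flat == bos_token_id
    is_start[torch.arange(0, B * T, T, device=flat.device)] = True     # every row starts a new doc
    starts = is_start.nonzero(as_tuple=True)[0].to(torch.int32)
    cu_seqlens = torch.cat([starts, torch.tensor([B * T], dtype=torch.int32, device=flat.device)])
    max_seqlen = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())
    return cu_seqlens, max_seqlen
```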
Implements the paper-aligned variant of TTT-E2E (arxiv:2512.23675). The paper finds that updating embeddings/attention/norms during test-time training causes instability — the stable recipe is to freeze everything except MLP layers in the last 1/4 of blocks.
Gated by TTT_E2E_MODE=1. When enabled:
- Freezes embeddings, attention, norms, skip weights
- Only updates MLP.fc and MLP.proj weights
- Only in blocks with idx >= num_layers * (1 - TTT_E2E_LAST_FRAC)
- Default last_frac=0.25 (paper recommendation)
Compliance: still score-first (scoring happens under no_grad before the SGD step), so all 4 Issue openai#1017 conditions are preserved. The change only narrows which params get updated — causality, normalization, score-before-update, and single-pass are all unchanged.
Expected effect: more stable TTT (fewer params → less instability), potentially better BPB on the legal score-first track.
Reference: End-to-End Test-Time Training for Long Context (arxiv:2512.23675)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
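A sketch of the parameter selection this mode implies; the attribute names (`model.blocks`, `block.mlp.fc`, `block.mlp.proj`) are assumptions about the model layout, not the PR's exact code:

```python
def select_ttt_e2e_params(model, num_layers, last_frac=0.25):
    """Freeze everything, then unfreeze only MLP weights in the last `last_frac`
    of blocks, following the TTT-E2E recipe described above."""
    cutoff = int(num_layers * (1 - last_frac))
    for p in model.parameters():
        p.requires_grad_(False)                 # embeddings, attention, norms, skips stay frozen
    trainable = []
    for idx, block in enumerate(model.blocks):  # assumed attribute name
        if idx >= cutoff:
            for p in (block.mlp.fc.weight, block.mlp.proj.weight):
                p.requires_grad_(True)
                trainable.append(p)
    return trainable
```

The returned list would then go straight into the test-time SGD optimizer.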
Rolled back to PR openai#1493 base, then added only:
- Python 3.11 f-string compatibility fix
- E2E TTT mode (MLP-only, last-fraction of blocks)
E2E TTT gated by TTT_E2E_MODE=1. When enabled:
- Freezes embeddings, attention, norms, skip weights
- Only updates MLP.fc and MLP.proj weights
- Only in blocks with idx >= num_layers * (1 - TTT_E2E_LAST_FRAC)
- Default last_frac=0.25 (paper recommendation)
VarLen removed — we'll add it back later if needed.
Reference: End-to-End Test-Time Training for Long Context (arxiv:2512.23675)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Previously the eval pipeline always ran 4 passes: pre-quantization -> quantized -> quantized_sliding_window -> quantized_ttt On SP1024 this totaled ~700s, over the 600s eval budget. The only eval that matters for E2E TTT submissions is the final quantized_ttt pass. Changes: - New env var SKIP_REDUNDANT_EVALS=1 skips pre-quant, quant, and sliding window evals (keeps only quantized_ttt). - TTT no longer requires sliding_window_enabled=1 (was coupling them for no good reason). Usage for tight eval budget: SKIP_REDUNDANT_EVALS=1 TTT_ENABLED=1 TTT_E2E_MODE=1 ... Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adapted from PR openai#1530 @samacqua (linear_leaky_relu_square_kernel). The kernel fuses matmul(x, W_up.T) with LeakyReLU(0.5)**2 activation into a single Triton kernel using TMA (Hopper H100). Saves the (B, T, 4D) pre-activation HBM round-trip in the forward; in backward, reuses the same kernel to apply the activation gradient to the incoming grad_output before the weight-gradient matmul. Gated by FUSED_MLP_ENABLED=1. When set, every Block's MLP uses the fused kernel during training. Falls back gracefully if Triton or TMA unavailable. Reference: PR openai#1530 @samacqua. Expected: 5-10% training speedup on MLP-dominated blocks, more steps in the 600s cap, ~0.002-0.005 BPB improvement from additional training. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
This is a from-scratch Triton kernel (not just a copy) that fuses THREE operations into one kernel: RMSNorm (per-row inverse rms) multiplied by ln_scale, then matmul with W_up, then LeakyReLU(0.5)^2 activation. Saves the (B*T, D=512) x_normed HBM round-trip that PR openai#1530 leaves on the table. Two new kernels: - _rms_inv_kernel: per-row inverse-rms reduction (small) - _fused_rms_linear_lrs_kernel: takes inv_rms + ln_scale, applies the rmsnorm scaling row-wise during the K loop, then matmul + activation (extends PR openai#1530's persistent-TMA structure) Custom backward implements the full RMSNorm chain rule: dx = ln_scale * inv_rms * (dx_normed - x * inv_rms^2 * mean(dx_normed*x)) This makes the backward correct without saving x_normed (which would defeat the HBM savings). Block.forward branches on mlp.use_fused: when fused, it skips the eager mlp_norm() call and passes raw x + ln_scale_factor to MLP, which then runs the fused kernel that does normalization internally. Gated by FUSED_MLP_ENABLED=1. Eager fallback unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
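The RMSNorm chain rule quoted above can be checked numerically against autograd. A small verification sketch, assuming x_normed = ln_scale * x * inv_rms with inv_rms = 1/sqrt(mean(x**2)) and the eps term dropped:

```python
import torch

D = 512
x = torch.randn(4, D, dtype=torch.float64, requires_grad=True)
ln_scale = 1.3
g = torch.randn(4, D, dtype=torch.float64)        # incoming gradient w.r.t. x_normed

inv_rms = x.pow(2).mean(dim=-1, keepdim=True).rsqrt()
x_normed = ln_scale * x * inv_rms

# closed form from the commit: dx = ln_scale * inv_rms * (g - x * inv_rms^2 * mean(g * x))
dx_formula = ln_scale * inv_rms * (g - x * inv_rms.pow(2) * (g * x).mean(dim=-1, keepdim=True))

(dx_autograd,) = torch.autograd.grad(x_normed, x, grad_outputs=g)
assert torch.allclose(dx_autograd, dx_formula.detach())
```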
Adds _FusedSimpleMLPFn alongside _FusedRMSMLPFn, selectable by FUSED_MLP_FULL=1 env var. The simple variant does RMSNorm in eager PyTorch (like PR openai#1530) and only fuses matmul + LeakyReLU^2; my v1 variant (_FusedRMSMLPFn) additionally fuses per-row inv_rms * ln_scale scaling into the K-loop. Purpose: A/B test whether my RMSNorm fusion addition is counterproductive. If simple > v1, per-K scaling overhead eats HBM savings. If simple == v1, kernel choice is saturated. Reuses same Triton kernel via FUSE_RMS constexpr branch. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Key precision bugs fixed in the fused kernel: 1. Forward: previously computed aux = lrs(c0)^2 where c0 was bf16. Now computes aux = lrs(acc0)^2 in fp32, only downcasts at HBM store. 2. Backward: previously loaded pre as bf16, applied lrs'(pre) in bf16 to the incoming gradient (also in bf16 before the multiply). Now loads pre, upcasts to fp32, applies derivative in fp32, then downcasts the final result. Hypothesis: the precision/throughput inversion observed in v1/v2 (~0.5% faster but worse BPB) was caused by these intermediate bf16 downcasts losing accumulation precision. If this hypothesis is correct, v3 should match or beat eager BPB while preserving the speedup. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Deep audit (via compare to PR openai#1450/openai#1555 + Triton tutorials + Liger-Kernel) identified why v1-v3 couldn't beat eager. Three real bugs fixed:
1. EPILOGUE SCALE (was bug #2 = no-speedup cause)
   Old: row_scale applied to `a` INSIDE the K-loop. This serializes the TMA->wgmma software pipeline — every A tile needs elementwise modification after TMA arrives before wgmma can start, killing num_stages=4 pipelining.
   New: accumulator *= row_scale[:, None] in the epilogue, once per tile. Algebraically identical because row_scale depends only on rows. TMA pipelining preserved.
2. FP32 INV_RMS (was bug #1 = BPB regression cause)
   Old: inv_rms stored as bf16 (7-bit mantissa). Rounded scale propagated into pre-activation, discontinuous leaky_relu^2 amplified it, and it leaked into backward dw1 and dx.
   New: inv_rms is fp32 end-to-end.
3. L2 SWIZZLE (was bug #3 = 5-15% perf left on table)
   Old: row-major tile iteration thrashes L2 (every SM touches every N column of B in first few iterations).
   New: GROUP_SIZE_M=8 grouped scheduling reuses B tiles across 8 consecutive m-tiles per SM -> better L2 hit rate.
Reference: PR openai#1450/openai#1555 architecture + Triton 09-persistent-matmul tutorial. These are the known-good Hopper TMA fused MLP patterns.
Expected: v4 should beat v1 (1.1106) AND beat eager (1.1104) if the audit's diagnosis is correct.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ing)
Kernel now writes act_grad = d/dh[leaky_relu(h)^2] = where(h>0, 2h, 0.5h)
to the aux buffer instead of post = leaky_relu(h)^2.
Forward output semantics:
Old: c=pre (scaled pre-activation), aux=post
New: c=post (used for dw2), aux=act_grad (used for dpre multiply)
Backward simplification:
Old kernel loaded pre from aux, computed where(pre>0, 2*pre, 0.5*pre)
per tile, multiplied by acc, stored result.
New kernel loads act_grad directly, just multiplies by acc, stores.
Saves: tl.where + fp32 multiply + fp32 cast per backward tile.
Matches PR openai#1450's "+10.5% throughput" design. The structural difference
is that forward now computes both post AND act_grad from the same acc
in fp32, making the backward kernel a pure elementwise multiply.
Keeps v4's audit fixes (epilogue scale, fp32 inv_rms, GROUP_SIZE_M=8).
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
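The new aux-buffer contents are easy to sanity-check in eager PyTorch: for the LeakyReLU(0.5)**2 activation, d/dh equals where(h > 0, 2h, 0.5h). A small verification sketch against autograd:

```python
import torch
import torch.nn.functional as F

h = torch.randn(1024, dtype=torch.float64, requires_grad=True)

# activation used by the fused MLP: leaky_relu(h, 0.5) ** 2
post = F.leaky_relu(h, negative_slope=0.5).square()

# closed-form derivative the kernel now stores in the aux buffer
act_grad = torch.where(h > 0, 2.0 * h, 0.5 * h)

# compare against autograd
(autograd_grad,) = torch.autograd.grad(post.sum(), h)
assert torch.allclose(autograd_grad, act_grad.detach())
```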
5-variant systematic ablation of manual Triton MLP fusion at 27M x 600s x H100. All 5 variants (including audit-guided best practices and exact PR openai#1450 architecture that claims +10.5% throughput) land within 0.0008 BPB of each other, all worse than torch.compile eager. Research finding: manual block-level MLP fusion cannot beat torch.compile's automatic fusion ceiling at this model scale. Implications for parameter-golf participants documented. Best variant: v4 (audit fixes) at 1.1107 vs eager 1.1104. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…lash_attn
Replaces the opaque flash_attn_3_func call with PyTorch's native SDPA. This lets torch.compile trace through the attention mechanism and potentially fuse it with Q/K/V projections, RoPE, and the output projection — unlike flash_attn which is a black box to the compiler.
Gated by NATIVE_SDPA=1. GQA handled via repeat_interleave (compatible with torch 2.4+). torch.compile can dispatch to the cuDNN attention backend on H100, which may be faster than FA3 for some shapes.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary
This PR submits a fully under-cap, under-time, rule-compliant non-record branch from an SP8192 recurrence-focused research cycle.
Final single-seed result:
- val_bpb = 1.09960971
- 15,974,435 bytes
- 599.092 s
- 544.199 s

Main ideas
- MuonEq-R
- ENABLE_LOOPING_AT=0.42
- learned blending (RECUR_AB)
- QAT-lite on sensitive q/k projections

Research context
This branch came out of a broader legal-only search over recurrence-native and compression-aware techniques. The main findings that survived into the final submission were:
- Looping@0.42 beat earlier recurrence schedules like 0.35 and 0.40
- RECUR_AB beat both the plain recurrence stack and the earlier RecurAlpha variant
- HQClip improved quality but blew up artifact size too much to submit
- RECUR_LORA, AWQ-lite, and compressor-only swaps did not survive the quality/size tradeoff

Final metrics
1.1046 | 1.1336 | 1.09960971 | 15,949,492 | 24,943 | 15,974,435

Compliance checklist
Why non-record
Reproduction
Credits
Built on top of techniques from PR #1493 (@bigbag), PR #1394, PR #1412. Novel additions: MuonEq-R integration, wallclock-aware recurrence scheduling, RECUR_AB learned blending, QAT-lite regularization.