
[Non-record] SP8192 + MuonEq-R + Loop@0.42 + RECUR_AB + QAT-lite + Compact Artifact - Val 1.09960971 #1894

Open

ChideraIbe123 wants to merge 138 commits into openai:main from ChideraIbe123:submission/recurab-042-nonrecord

Conversation


@ChideraIbe123 commented Apr 28, 2026

Summary

This PR submits a fully under-cap, under-time, rule-compliant non-record branch from an SP8192 recurrence-focused research cycle.

Final single-seed result:

  • val_bpb = 1.09960971
  • total artifact size: 15,974,435 bytes
  • train time: 599.092s
  • TTT eval time: 544.199s

Main ideas

  • MuonEq-R
  • wallclock-aware depth recurrence activated at ENABLE_LOOPING_AT=0.42
  • learned recurrent alpha/beta blending (RECUR_AB)
  • targeted late QAT-lite on sensitive q/k projections
  • compact artifact engineering, including compressed control tensors / GPTQ scale storage and an LZMA code wrapper

Research context

This branch came out of a broader legal-only search over recurrence-native and compression-aware techniques. The main findings that survived into the final submission were:

  • Loop@0.42 beat earlier recurrence schedules like 0.35 and 0.40
  • RECUR_AB beat both the plain recurrence stack and the earlier RecurAlpha variant
  • broad HQClip improved quality but blew up artifact size too much to submit
  • RECUR_LORA, AWQ-lite, and compressor-only swaps did not survive the quality/size tradeoff

Final metrics

Stage           BPB
Raw pre-quant   1.1046
Quantized       1.1336
Final TTT       1.09960971

Artifact item              Bytes
Quantized model + Brotli   15,949,492
Code                       24,943
Total                      15,974,435

Compliance checklist

  • Causal left-to-right dependence
  • Full normalized softmax distribution
  • Score-before-update TTT ordering
  • Single left-to-right pass with no rescoring
  • Train under 600s
  • Eval under 600s
  • Artifact under 16,000,000 bytes

Why non-record

  • single-seed result
  • does not beat the current record stack

Reproduction

SEED=1337 \
MUON_EQR=1 \
EMA_DECAY=0 \
ENABLE_LOOPING_AT=0.42 \
MAX_WALLCLOCK_SECONDS=599.0 \
RECUR_ALPHA_ENABLED=0 \
RECUR_AB_ENABLED=1 \
RECUR_A_INIT=1.0 \
RECUR_B_INIT=0.0 \
QAT_LITE_ENABLED=1 \
QAT_LITE_START_FRAC=0.55 \
QAT_LITE_EVERY=4 \
QAT_LITE_LAMBDA=0.02 \
QAT_LITE_BITS=6 \
QAT_LITE_CLIP_SIGMAS=12.85 \
QAT_LITE_LAYER_START=7 \
QAT_LITE_TARGETS=qk \
QAT_LITE_PENALTY=mse \
QAT_LITE_DEPTH_POWER=0.0 \
COMPRESSOR=brotli \
DATA_PATH=./data/datasets/fineweb10B_sp8192 \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
VOCAB_SIZE=8192 \
torchrun --standalone --nproc_per_node=8 \
records/track_non_record_16mb/2026-04-27_SP8192_MuonEqR_Loop042_RecurAB_QATLite/train_gpt.py

Credits

Built on top of techniques from PR #1493 (@bigbag), PR #1394, PR #1412. Novel additions: MuonEq-R integration, wallclock-aware recurrence scheduling, RECUR_AB learned blending, QAT-lite regularization.

Chidera Ibe and others added 30 commits March 18, 2026 22:28
Replace 9 separate blocks with 1 shared block looped 8 times.
Each loop gets rank-8 LoRA deltas on all 6 linear layers for diversity.
Per-loop scalars (attn_scale, mlp_scale, resid_mix, q_gain).
Increase model_dim from 512 to 1024 (freed budget from weight sharing).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
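The weight-sharing idea above (one shared block, per-loop rank-r LoRA deltas for diversity) can be sketched in a few lines. This is a toy NumPy illustration, not the PR's PyTorch code; the sizes, the tanh stand-in nonlinearity, and the zero-init of the LoRA B matrices are assumptions, and the per-loop scalars are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, num_loops = 8, 2, 4  # toy sizes; the PR uses rank-8 LoRA on 6 linears

W = rng.normal(size=(d, d)) / np.sqrt(d)            # ONE shared weight
loras = [(rng.normal(size=(r, d)) * 0.01,           # per-loop A
          np.zeros((d, r)))                         # per-loop B (zero-init assumed)
         for _ in range(num_loops)]

def looped_forward(x):
    # the same W is reused every loop; only the rank-r delta B @ A differs
    for A, B in loras:
        x = np.tanh(x @ (W + B @ A).T)              # stand-in nonlinearity
    return x

y = looped_forward(rng.normal(size=(3, d)))
assert y.shape == (3, d)
```

With B zero-initialized, every loop initially applies an identical block; the deltas only differentiate the loops once training updates B.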
Manually repeat K/V heads instead of using enable_gqa kwarg which
was added in PyTorch 2.5+.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
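The manual K/V repeat can be illustrated with NumPy's `repeat`, which mirrors `torch.repeat_interleave` along the head axis (toy shapes, not the PR's actual head counts):

```python
import numpy as np

# Toy shapes: 4 query heads sharing 2 KV heads.
B, T, D = 1, 3, 4
num_heads, num_kv_heads = 4, 2
k = np.arange(B * num_kv_heads * T * D, dtype=np.float32).reshape(B, num_kv_heads, T, D)

# np.repeat on axis 1 mirrors torch.repeat_interleave(k, n_rep, dim=1):
# each KV head is duplicated consecutively to match the query heads.
n_rep = num_heads // num_kv_heads
k_rep = np.repeat(k, n_rep, axis=1)

assert k_rep.shape == (B, num_heads, T, D)
assert (k_rep[:, 0] == k_rep[:, 1]).all() and (k_rep[:, 2] == k_rep[:, 3]).all()
```

This keeps the code compatible with PyTorch versions before 2.5, which lack the `enable_gqa` kwarg on SDPA.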
- model_dim 1024->512, num_heads 16->8, num_kv_heads 8->4
- num_loops 8->4 (less depth, faster steps, more stable gradients)
- LoRA B: small random init instead of zero (loops differentiate immediately)
- matrix_lr 0.04->0.02 (shared block gets gradient from all loops)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- num_blocks=3, num_loops=3, model_dim=768, num_heads=12, num_kv_heads=6
- Each block specializes (early/mid/late) while loops add depth
- lora_rank=4 per block per loop for diversity
- Uses ~6-8MB of 16MB budget (vs 2.1MB before)
- Per-block LoRA banks and shared LoopScalars across all effective layers

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- LoRA B back to zero init (paper-recommended, stops loss spikes)
- matrix_lr 0.02->0.013 (shared block gets 3x gradient from loops)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Revert to baseline architecture (9 blocks, 512d)
- Train on validation set (allowed per rules, PR openai#44 got 1.11 BPB)
- Lower LRs (matrix_lr=0.02, scalar_lr=0.02)
- Add LAWA checkpoint averaging during warmdown

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
LAWA was starting at step 3 because warmdown is time-based and
covers nearly the entire run. Now only collects when scale < 0.5
so we only average good late-training checkpoints.

Pre-fix: val_bpb 1.2924 pre-quant → 1.4668 after LAWA+quant
Training on val set IS working (1.29 beats baseline 1.37).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Sliding window eval (stride=64): overlapping context for better BPB
- TTT: 3-epoch SGD on val data before final eval, restores weights after
- New hyperparams: EVAL_STRIDE=64, TTT_STEPS=3, TTT_LR=1e-4

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
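A pure-Python sketch of how stride-based sliding-window evaluation partitions a sequence: each token is scored exactly once (no rescoring, so the single-pass rule holds), but tokens after the first window see overlapping left context. The window size of 128 here is hypothetical; the commit only fixes the stride at 64:

```python
def sliding_window_eval_plan(seq_len, window, stride):
    """(start, end, n_scored) spans for sliding-window eval. Only the last
    `stride` tokens of each window are newly scored; the rest is reused
    left context, so no token is ever scored twice."""
    plan = [(0, min(window, seq_len), min(window, seq_len))]
    pos = plan[0][1]
    while pos < seq_len:
        end = min(pos + stride, seq_len)
        plan.append((end - window, end, end - pos))
        pos = end
    return plan

plan = sliding_window_eval_plan(seq_len=256, window=128, stride=64)
assert sum(n for _, _, n in plan) == 256          # every token scored once
assert all(end - start <= 128 for start, end, _ in plan)
```

A smaller stride buys more context per scored token at the cost of proportionally more forward passes, which is exactly the time/BPB tradeoff the follow-up commit measures.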
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Sliding window and TTT only improved 0.001 BPB but cost 15 min.
Quant degradation (0.016 BPB) is the real target — QAT next.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Upweight hard-to-predict tokens (high entropy) by 1.5x, downweight
easy tokens by 0.5x. Focuses model capacity on tokens that matter
most for BPB instead of wasting gradient on trivial predictions.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
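The weighting rule can be sketched as follows. The 1.5x/0.5x factors come from the commit message; the fixed entropy cutoff used here is a hypothetical stand-in for however the easy/hard split was actually chosen:

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weights(dists, threshold):
    """1.5x loss weight for hard (high-entropy) tokens, 0.5x for easy ones.
    `threshold` is an assumed cutoff, not from the PR."""
    return [1.5 if predictive_entropy(d) > threshold else 0.5 for d in dists]

easy = [0.97, 0.01, 0.01, 0.01]   # peaked: the model is already sure
hard = [0.25, 0.25, 0.25, 0.25]   # uniform: maximally uncertain
w = entropy_weights([easy, hard], threshold=0.5)
# → [0.5, 1.5]
```

Note the follow-up commit reverts this: scaling per-token losses inflates the overall loss scale, which interacted badly with the tuned learning rates.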
- Revert entropy-weighted loss (inflated loss scale, hurt convergence)
- Add STE fake-quantize in CastedLinear forward when QAT enabled
- QAT activates after 20% of training time
- Should reduce post-quant BPB degradation from 0.016 to ~0.005

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
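The STE fake-quantize step can be sketched with NumPy. The symmetric per-tensor scale below is an assumption (the PR's grid may be per-channel or asymmetric); in training the rounding would be wrapped so gradients pass straight through, e.g. `w + (fake_quantize(w) - w).detach()` in PyTorch:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Round weights to a symmetric n-bit grid (the forward half of an
    STE step). The backward pass would ignore the rounding entirely."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.array([0.30, -0.11, 0.02, -0.31])
q = fake_quantize(w, bits=4)
scale = np.abs(w).max() / 7
assert np.abs(q - w).max() <= scale / 2 + 1e-12   # within half a grid step
```

Exposing the model to this rounding during the last stretch of training is what shrinks the post-quant BPB gap the commit targets.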
Compresses weight distributions during warmdown for cleaner
post-training quantization. From PR openai#309 (CLASE-Quant, 1.1914 BPB).
QAT still enabled alongside.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
QAT consistently increases quant gap. Ramping WD alone improves
pre-quant BPB. Expect best post-quant result with WD only.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
12.5MB compressed with 9 layers → room for 10th layer.
Top PRs (openai#287, openai#309) use 10-11 layers for better BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
11 layers + 3x MLP — may be tight on 16MB budget. Will test.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
10L+3xMLP should fit under 16MB. 11L+3xMLP had best pre-quant
(1.2052) but 18.3MB compressed.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- LeakyReLU(0.5)² replaces relu² — preserves negative gradient flow
- lzma replaces zlib — 2-5% tighter compression
- 5-gram eval cache: accumulate n-gram stats during eval, mix with
  model predictions via confidence-gated interpolation (from SOTA openai#659)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
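For reference, the swapped-in activation is small enough to state exactly; unlike relu², it keeps a nonzero output (and gradient) on the negative side:

```python
def lrs(h, slope=0.5):
    """LeakyReLU(slope) squared."""
    leaky = h if h > 0 else slope * h
    return leaky * leaky

assert lrs(2.0) == 4.0
assert lrs(-2.0) == 1.0   # relu(-2) ** 2 would be 0: gradient signal lost
```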
Novel technique: compute attention as difference of two softmax maps.
Cancels noise, promotes sparse attention, improves language modeling.
- Split Q/K into two halves, compute two attention scores, subtract
- Learned lambda per layer with init schedule from paper
- Per-head RMSNorm on diff output, scaled by (1 - lambda_init)
- Zero other competition PRs use this technique

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Instead of manual attention matmul, use SDPA for each half:
y = SDPA(q1,k1,v) - lambda * SDPA(q2,k2,v)
Mathematically equivalent, but gets Flash Attention speed.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
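A single-head NumPy sketch of the difference-of-softmaxes idea, matching the two-SDPA formulation above. The per-head RMSNorm, the (1 - lambda_init) scaling, and the lambda init schedule are omitted, and the shapes are toys:

```python
import numpy as np

def causal_softmax_attn(q, k, v):
    """Plain single-head causal softmax attention."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ v

def diff_attn(q, k, v, lam=0.3):
    """Differential attention: split Q/K into halves, subtract the second
    attention map scaled by a learned lambda to cancel common-mode noise."""
    d = q.shape[-1] // 2
    return (causal_softmax_attn(q[:, :d], k[:, :d], v)
            - lam * causal_softmax_attn(q[:, d:], k[:, d:], v))

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
assert diff_attn(q, k, v).shape == (4, 8)
```

With lam = 0 this reduces to ordinary attention over the first Q/K half, which is why a lambda init schedule starting near zero gives a safe warm start.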
Differential attention didn't work well with V-splitting.
Reverting to: 10L + LeakyReLU² + lzma + val training + LAWA + ramping WD.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Layer 0's V output is blended 50/50 into all subsequent layers' V.
Prevents attention concentration, forces model to remember early
content representations. Zero extra params, minimal speed cost.
Proven in competition PR openai#657 (1.1229 BPB).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
VRL hurt slightly. Best config: 10L + LeakyReLU² + lzma + val training
+ LAWA + ramping WD = 1.2302 BPB on 1xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Chidera Ibe and others added 29 commits April 14, 2026 14:50
Adds flash_attn_varlen_func path for within-document attention during
training. Attention is restricted to doc boundaries detected via BOS
token positions in each batch, eliminating cross-doc attention noise.

Changes:
- Import flash_attn_varlen_func alongside flash_attn_3_func
- Add VARLEN_ENABLED and BOS_TOKEN_ID env var hyperparams
- Add _build_cu_seqlens_from_batch helper (detects BOS, builds cu_seqlens)
- Thread cu_seqlens/max_seqlen through CausalSelfAttention -> Block -> GPT
- Branch in attention: varlen when cu_seqlens provided, else flash_attn_3
- Switch torch.compile to fullgraph=False when VARLEN_ENABLED=1 (data-dep branch)
- Training step builds cu_seqlens per batch and passes to model

Eval path unchanged. When VARLEN_ENABLED=0 (default) behavior is identical
to PR openai#1493 reference. Compliance unchanged (training-only change, causality
preserved by causal=True flag).

Reference: PR openai#1530 @samacqua, PR openai#1536 @dexhunter

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
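A NumPy sketch of the BOS-based cu_seqlens construction. The BOS id and the exact edge-case handling in the PR's `_build_cu_seqlens_from_batch` are assumptions; the only invariants carried over from the description are that document starts are detected via BOS positions and that the batch start opens a document:

```python
import numpy as np

BOS_TOKEN_ID = 1  # illustrative; the PR reads this from the BOS_TOKEN_ID env var

def build_cu_seqlens(tokens, bos_id=BOS_TOKEN_ID):
    """Cumulative sequence lengths (and max doc length) in the form
    flash_attn_varlen_func expects, from a flattened token batch."""
    starts = np.flatnonzero(tokens == bos_id)
    if starts.size == 0 or starts[0] != 0:
        starts = np.concatenate(([0], starts))   # batch start opens a doc
    bounds = np.concatenate((starts, [tokens.size])).astype(np.int32)
    return bounds, int(np.diff(bounds).max())

toks = np.array([1, 5, 6, 1, 7, 1, 8, 9, 2])
cu, max_seqlen = build_cu_seqlens(toks)
assert cu.tolist() == [0, 3, 5, 9] and max_seqlen == 4
```

Restricting attention to these spans removes cross-document noise while leaving causality intact, which is why the change is training-only and compliance-neutral.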
Implements the paper-aligned variant of TTT-E2E (arxiv:2512.23675).
The paper finds that updating embeddings/attention/norms during
test-time training causes instability — the stable recipe is to
freeze everything except MLP layers in the last 1/4 of blocks.

Gated by TTT_E2E_MODE=1. When enabled:
- Freezes embeddings, attention, norms, skip weights
- Only updates MLP.fc and MLP.proj weights
- Only in blocks with idx >= num_layers * (1 - TTT_E2E_LAST_FRAC)
- Default last_frac=0.25 (paper recommendation)

Compliance: still score-first (scoring happens under no_grad before
SGD step), so all 4 Issue openai#1017 conditions are preserved. The change
only narrows which params get updated — causality, normalization,
score-before-update, and single-pass are all unchanged.

Expected effect: more stable TTT (fewer params → less instability),
potentially better BPB on the legal score-first track.

Reference: End-to-End Test-Time Training for Long Context
(arxiv:2512.23675)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
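The freezing rule reduces to a name filter over parameters. This sketch assumes a `blocks.{i}.mlp.{fc,proj}` naming layout, which may not match the actual module tree; only the rule itself (MLP-only, last fraction of blocks) comes from the commit:

```python
def e2e_ttt_trainable(param_names, num_layers, last_frac=0.25):
    """Names of params left trainable under E2E TTT: MLP fc/proj weights
    in the last `last_frac` of blocks. Everything else stays frozen."""
    cutoff = num_layers * (1.0 - last_frac)
    keep = []
    for name in param_names:
        parts = name.split(".")
        if (len(parts) >= 4 and parts[0] == "blocks"
                and parts[2] == "mlp" and parts[3] in ("fc", "proj")
                and int(parts[1]) >= cutoff):
            keep.append(name)
    return keep

names = [f"blocks.{i}.{m}.weight"
         for i in range(8) for m in ("attn.qkv", "mlp.fc", "mlp.proj")]
kept = e2e_ttt_trainable(names, num_layers=8)
assert len(kept) == 4 and all(".mlp." in n for n in kept)
```

In practice one would set `requires_grad = False` on everything else before building the TTT optimizer, so the SGD step can only touch the selected weights.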
Rolled back to PR openai#1493 base, then added only:
- Python 3.11 f-string compatibility fix
- E2E TTT mode (MLP-only, last-fraction of blocks)

E2E TTT gated by TTT_E2E_MODE=1. When enabled:
- Freezes embeddings, attention, norms, skip weights
- Only updates MLP.fc and MLP.proj weights
- Only in blocks with idx >= num_layers * (1 - TTT_E2E_LAST_FRAC)
- Default last_frac=0.25 (paper recommendation)

VarLen removed — we'll add it back later if needed.

Reference: End-to-End Test-Time Training for Long Context (arxiv:2512.23675)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Previously the eval pipeline always ran 4 passes:
  pre-quantization -> quantized -> quantized_sliding_window -> quantized_ttt

On SP1024 this totaled ~700s, over the 600s eval budget. The only eval
that matters for E2E TTT submissions is the final quantized_ttt pass.

Changes:
- New env var SKIP_REDUNDANT_EVALS=1 skips pre-quant, quant, and sliding
  window evals (keeps only quantized_ttt).
- TTT no longer requires sliding_window_enabled=1 (was coupling them
  for no good reason).

Usage for tight eval budget:
  SKIP_REDUNDANT_EVALS=1 TTT_ENABLED=1 TTT_E2E_MODE=1 ...

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adapted from PR openai#1530 @samacqua (linear_leaky_relu_square_kernel).
The kernel fuses matmul(x, W_up.T) with LeakyReLU(0.5)**2 activation
into a single Triton kernel using TMA (Hopper H100). Saves the
(B, T, 4D) pre-activation HBM round-trip in the forward; in backward,
reuses the same kernel to apply the activation gradient to the
incoming grad_output before the weight-gradient matmul.

Gated by FUSED_MLP_ENABLED=1. When set, every Block's MLP uses the
fused kernel during training. Falls back gracefully if Triton or TMA
unavailable.

Reference: PR openai#1530 @samacqua. Expected: 5-10% training speedup on
MLP-dominated blocks, more steps in the 600s cap, ~0.002-0.005 BPB
improvement from additional training.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
This is a from-scratch Triton kernel (not just a copy) that fuses
THREE operations into one kernel: RMSNorm (per-row inverse rms)
multiplied by ln_scale, then matmul with W_up, then LeakyReLU(0.5)^2
activation. Saves the (B*T, D=512) x_normed HBM round-trip that
PR openai#1530 leaves on the table.

Two new kernels:
- _rms_inv_kernel: per-row inverse-rms reduction (small)
- _fused_rms_linear_lrs_kernel: takes inv_rms + ln_scale, applies
  the rmsnorm scaling row-wise during the K loop, then matmul +
  activation (extends PR openai#1530's persistent-TMA structure)

Custom backward implements the full RMSNorm chain rule:
  dx = ln_scale * inv_rms * (dx_normed - x * inv_rms^2 * mean(dx_normed*x))
This makes the backward correct without saving x_normed (which would
defeat the HBM savings).

Block.forward branches on mlp.use_fused: when fused, it skips the
eager mlp_norm() call and passes raw x + ln_scale_factor to MLP,
which then runs the fused kernel that does normalization internally.

Gated by FUSED_MLP_ENABLED=1. Eager fallback unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adds _FusedSimpleMLPFn alongside _FusedRMSMLPFn, selectable by
FUSED_MLP_FULL=1 env var. The simple variant does RMSNorm in eager
PyTorch (like PR openai#1530) and only fuses matmul + LeakyReLU^2; my v1
variant (_FusedRMSMLPFn) additionally fuses per-row inv_rms * ln_scale
scaling into the K-loop.

Purpose: A/B test whether my RMSNorm fusion addition is counterproductive.
If simple > v1, per-K scaling overhead eats HBM savings.
If simple == v1, kernel choice is saturated.

Reuses same Triton kernel via FUSE_RMS constexpr branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Key precision bugs fixed in the fused kernel:
1. Forward: previously computed aux = lrs(c0)^2 where c0 was bf16.
   Now computes aux = lrs(acc0)^2 in fp32, only downcasts at HBM store.
2. Backward: previously loaded pre as bf16, applied lrs'(pre) in bf16
   to the incoming gradient (also in bf16 before the multiply).
   Now loads pre, upcasts to fp32, applies derivative in fp32, then
   downcasts the final result.

Hypothesis: the precision/throughput inversion observed in v1/v2
(~0.5% faster but worse BPB) was caused by these intermediate bf16
downcasts losing accumulation precision. If this hypothesis is correct,
v3 should match or beat eager BPB while preserving the speedup.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Deep audit (via compare to PR openai#1450/openai#1555 + Triton tutorials + Liger-Kernel)
identified why v1-v3 couldn't beat eager. Three real bugs fixed:

1. EPILOGUE SCALE (was bug #2 = no-speedup cause)
   Old: row_scale applied to `a` INSIDE the K-loop. This serializes the
        TMA->wgmma software pipeline — every A tile needs elementwise
        modification after TMA arrives before wgmma can start, killing
        num_stages=4 pipelining.
   New: accumulator *= row_scale[:, None] in the epilogue, once per tile.
        Algebraically identical because row_scale depends only on rows.
        TMA pipelining preserved.

2. FP32 INV_RMS (was bug #1 = BPB regression cause)
   Old: inv_rms stored as bf16 (7-bit mantissa). Rounded scale propagated
        into pre-activation, discontinuous leaky_relu^2 amplified it,
        and it leaked into backward dw1 and dx.
   New: inv_rms is fp32 end-to-end.

3. L2 SWIZZLE (was bug #3 = 5-15% perf left on table)
   Old: row-major tile iteration thrashes L2 (every SM touches every N
        column of B in first few iterations).
   New: GROUP_SIZE_M=8 grouped scheduling reuses B tiles across 8
        consecutive m-tiles per SM -> better L2 hit rate.

Reference: PR openai#1450/openai#1555 architecture + Triton 09-persistent-matmul
tutorial. These are the known-good Hopper TMA fused MLP patterns.

Expected: v4 should beat v1 (1.1106) AND beat eager (1.1104) if the
audit's diagnosis is correct.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Kernel now writes act_grad = d/dh[leaky_relu(h)^2] = where(h>0, 2h, 0.5h)
to the aux buffer instead of post = leaky_relu(h)^2.

Forward output semantics:
  Old: c=pre (scaled pre-activation), aux=post
  New: c=post (used for dw2), aux=act_grad (used for dpre multiply)

Backward simplification:
  Old kernel loaded pre from aux, computed where(pre>0, 2*pre, 0.5*pre)
      per tile, multiplied by acc, stored result.
  New kernel loads act_grad directly, just multiplies by acc, stores.
  Saves: tl.where + fp32 multiply + fp32 cast per backward tile.

Matches PR openai#1450's "+10.5% throughput" design. The structural difference
is that forward now computes both post AND act_grad from the same acc
in fp32, making the backward kernel a pure elementwise multiply.

Keeps v4's audit fixes (epilogue scale, fp32 inv_rms, GROUP_SIZE_M=8).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
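The identity the kernel relies on checks out: for slope 0.5, d/dh[leaky_relu(h)²] = 2h for h > 0 and 2·0.5²·h = 0.5h otherwise, exactly the `where(h>0, 2h, 0.5h)` the forward now stores. A finite-difference check:

```python
def lrs(h, s=0.5):
    """leaky_relu(h) squared, slope s on the negative side."""
    return (h if h > 0 else s * h) ** 2

def act_grad(h, s=0.5):
    """Closed-form derivative the kernel stores: where(h > 0, 2h, 2*s*s*h)."""
    return 2.0 * h if h > 0 else 2.0 * s * s * h

# central differences agree with the closed form on both branches
eps = 1e-6
for h in (1.3, -0.7):
    fd = (lrs(h + eps) - lrs(h - eps)) / (2 * eps)
    assert abs(fd - act_grad(h)) < 1e-4
```

Storing act_grad instead of the pre-activation is what lets the backward kernel become a pure elementwise multiply.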
5-variant systematic ablation of manual Triton MLP fusion at 27M x 600s
x H100. All 5 variants (including audit-guided best practices and
exact PR openai#1450 architecture that claims +10.5% throughput) land within
0.0008 BPB of each other, all worse than torch.compile eager.

Research finding: manual block-level MLP fusion cannot beat
torch.compile's automatic fusion ceiling at this model scale.
Implications for parameter-golf participants documented.

Best variant: v4 (audit fixes) at 1.1107 vs eager 1.1104.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Replaces the opaque flash_attn_3_func call with PyTorch's native SDPA.
This lets torch.compile trace through the attention mechanism and
potentially fuse it with Q/K/V projections, RoPE, and the output
projection — unlike flash_attn which is a black box to the compiler.

Gated by NATIVE_SDPA=1. GQA handled via repeat_interleave (compatible
with torch 2.4+). torch.compile can dispatch to cuDNN attention backend
on H100, which may be faster than FA3 for some shapes.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@ChideraIbe123 changed the title from "[Non-record] SP8192 + MuonEq-R + Loop@0.42 + RECUR_AB + QAT-lite + Compact Artifact" to "[Non-record] SP8192 + MuonEq-R + Loop@0.42 + RECUR_AB + QAT-lite + Compact Artifact - Val 1.09960971" on Apr 28, 2026
