
[Non-record] SP8192 + MuonEq-R + Loop@0.42 + RECUR_AB + QAT-lite + Compact Artifact - Val 1.09960971 #1894

Open

ChideraIbe123 wants to merge 138 commits into openai:main from ChideraIbe123:submission/recurab-042-nonrecord

Conversation


@ChideraIbe123 commented Apr 28, 2026

Summary

This PR submits a fully under-cap, under-time, rule-compliant non-record branch from an SP8192 recurrence-focused research cycle.

Final single-seed result:

  • val_bpb = 1.09960971
  • total artifact size: 15,974,435 bytes
  • train time: 599.092s
  • TTT eval time: 544.199s

Main ideas

  • MuonEq-R
  • wallclock-aware depth recurrence activated at ENABLE_LOOPING_AT=0.42
  • learned recurrent alpha/beta blending (RECUR_AB)
  • targeted late QAT-lite on sensitive q/k projections
  • compact artifact engineering, including compressed control tensors / GPTQ scale storage and an LZMA code wrapper

Research context

This branch came out of a broader legal-only search over recurrence-native and compression-aware techniques. The main findings that survived into the final submission were:

  • Loop@0.42 beat earlier recurrence schedules like 0.35 and 0.40
  • RECUR_AB beat both the plain recurrence stack and the earlier RecurAlpha variant
  • broad HQClip improved quality but blew up artifact size too much to submit
  • RECUR_LORA, AWQ-lite, and compressor-only swaps did not survive the quality/size tradeoff

Final metrics

Stage           BPB
Raw pre-quant   1.1046
Quantized       1.1336
Final TTT       1.09960971

Artifact item              Bytes
Quantized model + Brotli   15,949,492
Code                       24,943
Total                      15,974,435

Compliance checklist

  • Causal left-to-right dependence
  • Full normalized softmax distribution
  • Score-before-update TTT ordering
  • Single left-to-right pass with no rescoring
  • Train under 600s
  • Eval under 600s
  • Artifact under 16,000,000 bytes

Why non-record

  • single-seed result
  • does not beat the current record stack

Reproduction

SEED=1337 \
MUON_EQR=1 \
EMA_DECAY=0 \
ENABLE_LOOPING_AT=0.42 \
MAX_WALLCLOCK_SECONDS=599.0 \
RECUR_ALPHA_ENABLED=0 \
RECUR_AB_ENABLED=1 \
RECUR_A_INIT=1.0 \
RECUR_B_INIT=0.0 \
QAT_LITE_ENABLED=1 \
QAT_LITE_START_FRAC=0.55 \
QAT_LITE_EVERY=4 \
QAT_LITE_LAMBDA=0.02 \
QAT_LITE_BITS=6 \
QAT_LITE_CLIP_SIGMAS=12.85 \
QAT_LITE_LAYER_START=7 \
QAT_LITE_TARGETS=qk \
QAT_LITE_PENALTY=mse \
QAT_LITE_DEPTH_POWER=0.0 \
COMPRESSOR=brotli \
DATA_PATH=./data/datasets/fineweb10B_sp8192 \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
VOCAB_SIZE=8192 \
torchrun --standalone --nproc_per_node=8 \
records/track_non_record_16mb/2026-04-27_SP8192_MuonEqR_Loop042_RecurAB_QATLite/train_gpt.py

Credits

Built on top of techniques from PR #1493 (@bigbag), PR #1394, PR #1412. Novel additions: MuonEq-R integration, wallclock-aware recurrence scheduling, RECUR_AB learned blending, QAT-lite regularization.

Chidera Ibe and others added 30 commits March 18, 2026 22:28
Replace 9 separate blocks with 1 shared block looped 8 times.
Each loop gets rank-8 LoRA deltas on all 6 linear layers for diversity.
Per-loop scalars (attn_scale, mlp_scale, resid_mix, q_gain).
Increase model_dim from 512 to 1024 (freed budget from weight sharing).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
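The weight-sharing idea above (one shared block, per-loop rank-r LoRA deltas for diversity) can be sketched in a few lines. This is a toy NumPy illustration, not the PR's PyTorch code; the sizes, the tanh stand-in nonlinearity, and the zero-init of the LoRA B matrices are assumptions, and the per-loop scalars are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, num_loops = 8, 2, 4  # toy sizes; the PR uses rank-8 LoRA on 6 linears

W = rng.normal(size=(d, d)) / np.sqrt(d)            # ONE shared weight
loras = [(rng.normal(size=(r, d)) * 0.01,           # per-loop A
          np.zeros((d, r)))                         # per-loop B (zero-init assumed)
         for _ in range(num_loops)]

def looped_forward(x):
    # the same W is reused every loop; only the rank-r delta B @ A differs
    for A, B in loras:
        x = np.tanh(x @ (W + B @ A).T)              # stand-in nonlinearity
    return x

y = looped_forward(rng.normal(size=(3, d)))
assert y.shape == (3, d)
```

With B zero-initialized, every loop initially applies an identical block; the deltas only differentiate the loops once training updates B.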
Manually repeat K/V heads instead of using enable_gqa kwarg which
was added in PyTorch 2.5+.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
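The manual K/V repeat can be illustrated with NumPy's `repeat`, which mirrors `torch.repeat_interleave` along the head axis (toy shapes, not the PR's actual head counts):

```python
import numpy as np

# Toy shapes: 4 query heads sharing 2 KV heads.
B, T, D = 1, 3, 4
num_heads, num_kv_heads = 4, 2
k = np.arange(B * num_kv_heads * T * D, dtype=np.float32).reshape(B, num_kv_heads, T, D)

# np.repeat on axis 1 mirrors torch.repeat_interleave(k, n_rep, dim=1):
# each KV head is duplicated consecutively to match the query heads.
n_rep = num_heads // num_kv_heads
k_rep = np.repeat(k, n_rep, axis=1)

assert k_rep.shape == (B, num_heads, T, D)
assert (k_rep[:, 0] == k_rep[:, 1]).all() and (k_rep[:, 2] == k_rep[:, 3]).all()
```

This keeps the code compatible with PyTorch versions before 2.5, which lack the `enable_gqa` kwarg on SDPA.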
- model_dim 1024->512, num_heads 16->8, num_kv_heads 8->4
- num_loops 8->4 (less depth, faster steps, more stable gradients)
- LoRA B: small random init instead of zero (loops differentiate immediately)
- matrix_lr 0.04->0.02 (shared block gets gradient from all loops)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- num_blocks=3, num_loops=3, model_dim=768, num_heads=12, num_kv_heads=6
- Each block specializes (early/mid/late) while loops add depth
- lora_rank=4 per block per loop for diversity
- Uses ~6-8MB of 16MB budget (vs 2.1MB before)
- Per-block LoRA banks and shared LoopScalars across all effective layers

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- LoRA B back to zero init (paper-recommended, stops loss spikes)
- matrix_lr 0.02->0.013 (shared block gets 3x gradient from loops)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Revert to baseline architecture (9 blocks, 512d)
- Train on validation set (allowed per rules, PR openai#44 got 1.11 BPB)
- Lower LRs (matrix_lr=0.02, scalar_lr=0.02)
- Add LAWA checkpoint averaging during warmdown

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
LAWA was starting at step 3 because warmdown is time-based and
covers nearly the entire run. Now only collects when scale < 0.5
so we only average good late-training checkpoints.

Pre-fix: val_bpb 1.2924 pre-quant → 1.4668 after LAWA+quant
Training on val set IS working (1.29 beats baseline 1.37).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Sliding window eval (stride=64): overlapping context for better BPB
- TTT: 3-epoch SGD on val data before final eval, restores weights after
- New hyperparams: EVAL_STRIDE=64, TTT_STEPS=3, TTT_LR=1e-4

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
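A pure-Python sketch of how stride-based sliding-window evaluation partitions a sequence: each token is scored exactly once (no rescoring, so the single-pass rule holds), but tokens after the first window see overlapping left context. The window size of 128 here is hypothetical; the commit only fixes the stride at 64:

```python
def sliding_window_eval_plan(seq_len, window, stride):
    """(start, end, n_scored) spans for sliding-window eval. Only the last
    `stride` tokens of each window are newly scored; the rest is reused
    left context, so no token is ever scored twice."""
    plan = [(0, min(window, seq_len), min(window, seq_len))]
    pos = plan[0][1]
    while pos < seq_len:
        end = min(pos + stride, seq_len)
        plan.append((end - window, end, end - pos))
        pos = end
    return plan

plan = sliding_window_eval_plan(seq_len=256, window=128, stride=64)
assert sum(n for _, _, n in plan) == 256          # every token scored once
assert all(end - start <= 128 for start, end, _ in plan)
```

A smaller stride buys more context per scored token at the cost of proportionally more forward passes, which is exactly the time/BPB tradeoff the follow-up commit measures.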
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Sliding window and TTT only improved 0.001 BPB but cost 15 min.
Quant degradation (0.016 BPB) is the real target — QAT next.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Upweight hard-to-predict tokens (high entropy) by 1.5x, downweight
easy tokens by 0.5x. Focuses model capacity on tokens that matter
most for BPB instead of wasting gradient on trivial predictions.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
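The weighting rule can be sketched as follows. The 1.5x/0.5x factors come from the commit message; the fixed entropy cutoff used here is a hypothetical stand-in for however the easy/hard split was actually chosen:

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weights(dists, threshold):
    """1.5x loss weight for hard (high-entropy) tokens, 0.5x for easy ones.
    `threshold` is an assumed cutoff, not from the PR."""
    return [1.5 if predictive_entropy(d) > threshold else 0.5 for d in dists]

easy = [0.97, 0.01, 0.01, 0.01]   # peaked: the model is already sure
hard = [0.25, 0.25, 0.25, 0.25]   # uniform: maximally uncertain
w = entropy_weights([easy, hard], threshold=0.5)
# → [0.5, 1.5]
```

Note the follow-up commit reverts this: scaling per-token losses inflates the overall loss scale, which interacted badly with the tuned learning rates.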
- Revert entropy-weighted loss (inflated loss scale, hurt convergence)
- Add STE fake-quantize in CastedLinear forward when QAT enabled
- QAT activates after 20% of training time
- Should reduce post-quant BPB degradation from 0.016 to ~0.005

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
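The STE fake-quantize step can be sketched with NumPy. The symmetric per-tensor scale below is an assumption (the PR's grid may be per-channel or asymmetric); in training the rounding would be wrapped so gradients pass straight through, e.g. `w + (fake_quantize(w) - w).detach()` in PyTorch:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Round weights to a symmetric n-bit grid (the forward half of an
    STE step). The backward pass would ignore the rounding entirely."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

w = np.array([0.30, -0.11, 0.02, -0.31])
q = fake_quantize(w, bits=4)
scale = np.abs(w).max() / 7
assert np.abs(q - w).max() <= scale / 2 + 1e-12   # within half a grid step
```

Exposing the model to this rounding during the last stretch of training is what shrinks the post-quant BPB gap the commit targets.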
Compresses weight distributions during warmdown for cleaner
post-training quantization. From PR openai#309 (CLASE-Quant, 1.1914 BPB).
QAT still enabled alongside.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
QAT consistently increases quant gap. Ramping WD alone improves
pre-quant BPB. Expect best post-quant result with WD only.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
12.5MB compressed with 9 layers → room for 10th layer.
Top PRs (openai#287, openai#309) use 10-11 layers for better BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
11 layers + 3x MLP — may be tight on 16MB budget. Will test.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
10L+3xMLP should fit under 16MB. 11L+3xMLP had best pre-quant
(1.2052) but 18.3MB compressed.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- LeakyReLU(0.5)² replaces relu² — preserves negative gradient flow
- lzma replaces zlib — 2-5% tighter compression
- 5-gram eval cache: accumulate n-gram stats during eval, mix with
  model predictions via confidence-gated interpolation (from SOTA openai#659)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
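For reference, the swapped-in activation is small enough to state exactly; unlike relu², it keeps a nonzero output (and gradient) on the negative side:

```python
def lrs(h, slope=0.5):
    """LeakyReLU(slope) squared."""
    leaky = h if h > 0 else slope * h
    return leaky * leaky

assert lrs(2.0) == 4.0
assert lrs(-2.0) == 1.0   # relu(-2) ** 2 would be 0: gradient signal lost
```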
Novel technique: compute attention as difference of two softmax maps.
Cancels noise, promotes sparse attention, improves language modeling.
- Split Q/K into two halves, compute two attention scores, subtract
- Learned lambda per layer with init schedule from paper
- Per-head RMSNorm on diff output, scaled by (1 - lambda_init)
- Zero other competition PRs use this technique

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Instead of manual attention matmul, use SDPA for each half:
y = SDPA(q1,k1,v) - lambda * SDPA(q2,k2,v)
Mathematically equivalent, but gets Flash Attention speed.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
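A single-head NumPy sketch of the difference-of-softmaxes idea, matching the two-SDPA formulation above. The per-head RMSNorm, the (1 - lambda_init) scaling, and the lambda init schedule are omitted, and the shapes are toys:

```python
import numpy as np

def causal_softmax_attn(q, k, v):
    """Plain single-head causal softmax attention."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ v

def diff_attn(q, k, v, lam=0.3):
    """Differential attention: split Q/K into halves, subtract the second
    attention map scaled by a learned lambda to cancel common-mode noise."""
    d = q.shape[-1] // 2
    return (causal_softmax_attn(q[:, :d], k[:, :d], v)
            - lam * causal_softmax_attn(q[:, d:], k[:, d:], v))

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
assert diff_attn(q, k, v).shape == (4, 8)
```

With lam = 0 this reduces to ordinary attention over the first Q/K half, which is why a lambda init schedule starting near zero gives a safe warm start.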
Differential attention didn't work well with V-splitting.
Reverting to: 10L + LeakyReLU² + lzma + val training + LAWA + ramping WD.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Layer 0's V output is blended 50/50 into all subsequent layers' V.
Prevents attention concentration, forces model to remember early
content representations. Zero extra params, minimal speed cost.
Proven in competition PR openai#657 (1.1229 BPB).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
VRL hurt slightly. Best config: 10L + LeakyReLU² + lzma + val training
+ LAWA + ramping WD = 1.2302 BPB on 1xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Chidera Ibe and others added 29 commits April 14, 2026 14:50
Adds flash_attn_varlen_func path for within-document attention during
training. Attention is restricted to doc boundaries detected via BOS
token positions in each batch, eliminating cross-doc attention noise.

Changes:
- Import flash_attn_varlen_func alongside flash_attn_3_func
- Add VARLEN_ENABLED and BOS_TOKEN_ID env var hyperparams
- Add _build_cu_seqlens_from_batch helper (detects BOS, builds cu_seqlens)
- Thread cu_seqlens/max_seqlen through CausalSelfAttention -> Block -> GPT
- Branch in attention: varlen when cu_seqlens provided, else flash_attn_3
- Switch torch.compile to fullgraph=False when VARLEN_ENABLED=1 (data-dep branch)
- Training step builds cu_seqlens per batch and passes to model

Eval path unchanged. When VARLEN_ENABLED=0 (default) behavior is identical
to PR openai#1493 reference. Compliance unchanged (training-only change, causality
preserved by causal=True flag).

Reference: PR openai#1530 @samacqua, PR openai#1536 @dexhunter

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
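A NumPy sketch of the BOS-based cu_seqlens construction. The BOS id and the exact edge-case handling in the PR's `_build_cu_seqlens_from_batch` are assumptions; the only invariants carried over from the description are that document starts are detected via BOS positions and that the batch start opens a document:

```python
import numpy as np

BOS_TOKEN_ID = 1  # illustrative; the PR reads this from the BOS_TOKEN_ID env var

def build_cu_seqlens(tokens, bos_id=BOS_TOKEN_ID):
    """Cumulative sequence lengths (and max doc length) in the form
    flash_attn_varlen_func expects, from a flattened token batch."""
    starts = np.flatnonzero(tokens == bos_id)
    if starts.size == 0 or starts[0] != 0:
        starts = np.concatenate(([0], starts))   # batch start opens a doc
    bounds = np.concatenate((starts, [tokens.size])).astype(np.int32)
    return bounds, int(np.diff(bounds).max())

toks = np.array([1, 5, 6, 1, 7, 1, 8, 9, 2])
cu, max_seqlen = build_cu_seqlens(toks)
assert cu.tolist() == [0, 3, 5, 9] and max_seqlen == 4
```

Restricting attention to these spans removes cross-document noise while leaving causality intact, which is why the change is training-only and compliance-neutral.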
Implements the paper-aligned variant of TTT-E2E (arxiv:2512.23675).
The paper finds that updating embeddings/attention/norms during
test-time training causes instability — the stable recipe is to
freeze everything except MLP layers in the last 1/4 of blocks.

Gated by TTT_E2E_MODE=1. When enabled:
- Freezes embeddings, attention, norms, skip weights
- Only updates MLP.fc and MLP.proj weights
- Only in blocks with idx >= num_layers * (1 - TTT_E2E_LAST_FRAC)
- Default last_frac=0.25 (paper recommendation)

Compliance: still score-first (scoring happens under no_grad before
SGD step), so all 4 Issue openai#1017 conditions are preserved. The change
only narrows which params get updated — causality, normalization,
score-before-update, and single-pass are all unchanged.

Expected effect: more stable TTT (fewer params → less instability),
potentially better BPB on the legal score-first track.

Reference: End-to-End Test-Time Training for Long Context
(arxiv:2512.23675)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
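The freezing rule reduces to a name filter over parameters. This sketch assumes a `blocks.{i}.mlp.{fc,proj}` naming layout, which may not match the actual module tree; only the rule itself (MLP-only, last fraction of blocks) comes from the commit:

```python
def e2e_ttt_trainable(param_names, num_layers, last_frac=0.25):
    """Names of params left trainable under E2E TTT: MLP fc/proj weights
    in the last `last_frac` of blocks. Everything else stays frozen."""
    cutoff = num_layers * (1.0 - last_frac)
    keep = []
    for name in param_names:
        parts = name.split(".")
        if (len(parts) >= 4 and parts[0] == "blocks"
                and parts[2] == "mlp" and parts[3] in ("fc", "proj")
                and int(parts[1]) >= cutoff):
            keep.append(name)
    return keep

names = [f"blocks.{i}.{m}.weight"
         for i in range(8) for m in ("attn.qkv", "mlp.fc", "mlp.proj")]
kept = e2e_ttt_trainable(names, num_layers=8)
assert len(kept) == 4 and all(".mlp." in n for n in kept)
```

In practice one would set `requires_grad = False` on everything else before building the TTT optimizer, so the SGD step can only touch the selected weights.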
Rolled back to PR openai#1493 base, then added only:
- Python 3.11 f-string compatibility fix
- E2E TTT mode (MLP-only, last-fraction of blocks)

E2E TTT gated by TTT_E2E_MODE=1. When enabled:
- Freezes embeddings, attention, norms, skip weights
- Only updates MLP.fc and MLP.proj weights
- Only in blocks with idx >= num_layers * (1 - TTT_E2E_LAST_FRAC)
- Default last_frac=0.25 (paper recommendation)

VarLen removed — we'll add it back later if needed.

Reference: End-to-End Test-Time Training for Long Context (arxiv:2512.23675)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Previously the eval pipeline always ran 4 passes:
  pre-quantization -> quantized -> quantized_sliding_window -> quantized_ttt

On SP1024 this totaled ~700s, over the 600s eval budget. The only eval
that matters for E2E TTT submissions is the final quantized_ttt pass.

Changes:
- New env var SKIP_REDUNDANT_EVALS=1 skips pre-quant, quant, and sliding
  window evals (keeps only quantized_ttt).
- TTT no longer requires sliding_window_enabled=1 (was coupling them
  for no good reason).

Usage for tight eval budget:
  SKIP_REDUNDANT_EVALS=1 TTT_ENABLED=1 TTT_E2E_MODE=1 ...

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adapted from PR openai#1530 @samacqua (linear_leaky_relu_square_kernel).
The kernel fuses matmul(x, W_up.T) with LeakyReLU(0.5)**2 activation
into a single Triton kernel using TMA (Hopper H100). Saves the
(B, T, 4D) pre-activation HBM round-trip in the forward; in backward,
reuses the same kernel to apply the activation gradient to the
incoming grad_output before the weight-gradient matmul.

Gated by FUSED_MLP_ENABLED=1. When set, every Block's MLP uses the
fused kernel during training. Falls back gracefully if Triton or TMA
unavailable.

Reference: PR openai#1530 @samacqua. Expected: 5-10% training speedup on
MLP-dominated blocks, more steps in the 600s cap, ~0.002-0.005 BPB
improvement from additional training.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
This is a from-scratch Triton kernel (not just a copy) that fuses
THREE operations into one kernel: RMSNorm (per-row inverse rms)
multiplied by ln_scale, then matmul with W_up, then LeakyReLU(0.5)^2
activation. Saves the (B*T, D=512) x_normed HBM round-trip that
PR openai#1530 leaves on the table.

Two new kernels:
- _rms_inv_kernel: per-row inverse-rms reduction (small)
- _fused_rms_linear_lrs_kernel: takes inv_rms + ln_scale, applies
  the rmsnorm scaling row-wise during the K loop, then matmul +
  activation (extends PR openai#1530's persistent-TMA structure)

Custom backward implements the full RMSNorm chain rule:
  dx = ln_scale * inv_rms * (dx_normed - x * inv_rms^2 * mean(dx_normed*x))
This makes the backward correct without saving x_normed (which would
defeat the HBM savings).

Block.forward branches on mlp.use_fused: when fused, it skips the
eager mlp_norm() call and passes raw x + ln_scale_factor to MLP,
which then runs the fused kernel that does normalization internally.

Gated by FUSED_MLP_ENABLED=1. Eager fallback unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adds _FusedSimpleMLPFn alongside _FusedRMSMLPFn, selectable by
FUSED_MLP_FULL=1 env var. The simple variant does RMSNorm in eager
PyTorch (like PR openai#1530) and only fuses matmul + LeakyReLU^2; my v1
variant (_FusedRMSMLPFn) additionally fuses per-row inv_rms * ln_scale
scaling into the K-loop.

Purpose: A/B test whether my RMSNorm fusion addition is counterproductive.
If simple > v1, per-K scaling overhead eats HBM savings.
If simple == v1, kernel choice is saturated.

Reuses same Triton kernel via FUSE_RMS constexpr branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Key precision bugs fixed in the fused kernel:
1. Forward: previously computed aux = lrs(c0)^2 where c0 was bf16.
   Now computes aux = lrs(acc0)^2 in fp32, only downcasts at HBM store.
2. Backward: previously loaded pre as bf16, applied lrs'(pre) in bf16
   to the incoming gradient (also in bf16 before the multiply).
   Now loads pre, upcasts to fp32, applies derivative in fp32, then
   downcasts the final result.

Hypothesis: the precision/throughput inversion observed in v1/v2
(~0.5% faster but worse BPB) was caused by these intermediate bf16
downcasts losing accumulation precision. If this hypothesis is correct,
v3 should match or beat eager BPB while preserving the speedup.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Deep audit (via compare to PR openai#1450/openai#1555 + Triton tutorials + Liger-Kernel)
identified why v1-v3 couldn't beat eager. Three real bugs fixed:

1. EPILOGUE SCALE (was bug #2 = no-speedup cause)
   Old: row_scale applied to `a` INSIDE the K-loop. This serializes the
        TMA->wgmma software pipeline — every A tile needs elementwise
        modification after TMA arrives before wgmma can start, killing
        num_stages=4 pipelining.
   New: accumulator *= row_scale[:, None] in the epilogue, once per tile.
        Algebraically identical because row_scale depends only on rows.
        TMA pipelining preserved.

2. FP32 INV_RMS (was bug #1 = BPB regression cause)
   Old: inv_rms stored as bf16 (7-bit mantissa). Rounded scale propagated
        into pre-activation, discontinuous leaky_relu^2 amplified it,
        and it leaked into backward dw1 and dx.
   New: inv_rms is fp32 end-to-end.

3. L2 SWIZZLE (was bug #3 = 5-15% perf left on table)
   Old: row-major tile iteration thrashes L2 (every SM touches every N
        column of B in first few iterations).
   New: GROUP_SIZE_M=8 grouped scheduling reuses B tiles across 8
        consecutive m-tiles per SM -> better L2 hit rate.

Reference: PR openai#1450/openai#1555 architecture + Triton 09-persistent-matmul
tutorial. These are the known-good Hopper TMA fused MLP patterns.

Expected: v4 should beat v1 (1.1106) AND beat eager (1.1104) if the
audit's diagnosis is correct.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Kernel now writes act_grad = d/dh[leaky_relu(h)^2] = where(h>0, 2h, 0.5h)
to the aux buffer instead of post = leaky_relu(h)^2.

Forward output semantics:
  Old: c=pre (scaled pre-activation), aux=post
  New: c=post (used for dw2), aux=act_grad (used for dpre multiply)

Backward simplification:
  Old kernel loaded pre from aux, computed where(pre>0, 2*pre, 0.5*pre)
      per tile, multiplied by acc, stored result.
  New kernel loads act_grad directly, just multiplies by acc, stores.
  Saves: tl.where + fp32 multiply + fp32 cast per backward tile.

Matches PR openai#1450's "+10.5% throughput" design. The structural difference
is that forward now computes both post AND act_grad from the same acc
in fp32, making the backward kernel a pure elementwise multiply.

Keeps v4's audit fixes (epilogue scale, fp32 inv_rms, GROUP_SIZE_M=8).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
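The identity the kernel relies on checks out: for slope 0.5, d/dh[leaky_relu(h)²] = 2h for h > 0 and 2·0.5²·h = 0.5h otherwise, exactly the `where(h>0, 2h, 0.5h)` the forward now stores. A finite-difference check:

```python
def lrs(h, s=0.5):
    """leaky_relu(h) squared, slope s on the negative side."""
    return (h if h > 0 else s * h) ** 2

def act_grad(h, s=0.5):
    """Closed-form derivative the kernel stores: where(h > 0, 2h, 2*s*s*h)."""
    return 2.0 * h if h > 0 else 2.0 * s * s * h

# central differences agree with the closed form on both branches
eps = 1e-6
for h in (1.3, -0.7):
    fd = (lrs(h + eps) - lrs(h - eps)) / (2 * eps)
    assert abs(fd - act_grad(h)) < 1e-4
```

Storing act_grad instead of the pre-activation is what lets the backward kernel become a pure elementwise multiply.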
5-variant systematic ablation of manual Triton MLP fusion at 27M x 600s
x H100. All 5 variants (including audit-guided best practices and
exact PR openai#1450 architecture that claims +10.5% throughput) land within
0.0008 BPB of each other, all worse than torch.compile eager.

Research finding: manual block-level MLP fusion cannot beat
torch.compile's automatic fusion ceiling at this model scale.
Implications for parameter-golf participants documented.

Best variant: v4 (audit fixes) at 1.1107 vs eager 1.1104.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Replaces the opaque flash_attn_3_func call with PyTorch's native SDPA.
This lets torch.compile trace through the attention mechanism and
potentially fuse it with Q/K/V projections, RoPE, and the output
projection — unlike flash_attn which is a black box to the compiler.

Gated by NATIVE_SDPA=1. GQA handled via repeat_interleave (compatible
with torch 2.4+). torch.compile can dispatch to cuDNN attention backend
on H100, which may be faster than FA3 for some shapes.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@ChideraIbe123 changed the title from "[Non-record] SP8192 + MuonEq-R + Loop@0.42 + RECUR_AB + QAT-lite + Compact Artifact" to "[Non-record] SP8192 + MuonEq-R + Loop@0.42 + RECUR_AB + QAT-lite + Compact Artifact - Val 1.09960971" on Apr 28, 2026
