Record: SP8192 ParResid 3LayerLoop QK5.25 LegalTTT — 1.08083 BPB #1776

Open
anmarhindi wants to merge 1 commit into openai:main from
anmarhindi:submission/sp8192-parresid-3layerloop-qk525-ttt

Summary

val_bpb (TTT) = 1.08083 (3-seed mean, std 0.00062) | ~15.97 MB | 8xH100 SXM

3-Seed Results

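| Seed | Pre-Quant BPB | Sliding BPB | TTT BPB | Artifact (bytes) |
|------|---------------|-------------|---------|------------------|
| 1337 | 1.08609 | 1.08083 | 1.08032 | 15,971,929 |
| 42 | 1.08733 | 1.08214 | 1.08152 | 15,973,790 |
| 7 | 1.08635 | 1.08135 | 1.08064 | 15,971,863 |
| Mean | 1.08659 | 1.08144 | 1.08083 | 15,972,527 |
| Std | 0.00064 | 0.00066 | 0.00062 | |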
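
The summary statistics can be rechecked from the per-seed rows. The sketch below is not part of the submission; it assumes the reported Std is the sample standard deviation (n-1 denominator) and that the 16 MB limit means 16,000,000 bytes. Under those assumptions it reproduces the TTT mean of 1.08083 and std of 0.00062. Because the inputs are the rounded values from the table, other columns may differ from the reported Std in the last digit.

```python
import statistics

# Per-seed values copied from the results table.
ttt_bpb = {1337: 1.08032, 42: 1.08152, 7: 1.08064}
artifact_bytes = {1337: 15_971_929, 42: 15_973_790, 7: 15_971_863}

# Assumption: "Std" is the sample standard deviation (n - 1 denominator).
ttt_mean = statistics.mean(ttt_bpb.values())
ttt_std = statistics.stdev(ttt_bpb.values())

# Assumption: the 16 MB artifact limit is decimal, i.e. 16,000,000 bytes.
ARTIFACT_LIMIT_BYTES = 16_000_000
largest_artifact = max(artifact_bytes.values())

print(f"TTT mean = {ttt_mean:.5f}, std = {ttt_std:.5f}")
print(f"largest artifact = {largest_artifact:,} B, "
      f"headroom = {ARTIFACT_LIMIT_BYTES - largest_artifact:,} B")
```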

Key Techniques

  1. SP8192 + GPTQ SDClip: int6 matrices (k=12.85), int8 embeddings (k=20.0), zero pruning
  2. 3-Layer Depth Recurrence: loops layers 3,4,5 twice, activates at frac=0.35
  3. Parallel Residuals (layers 7+): GPT-J style, attention and MLP share the same
    pre-residual input
  4. QK-Gain 5.25: learnable per-head query scaling
  5. Legal Score-First TTT: SGD (lr=0.005, momentum=0.9), 3 epochs per 32K-token chunk,
    freeze first 9 blocks, cross-rank dist.all_reduce on gradients, cosine LR decay
  6. Tuned HPs: MUON_WD=0.095, MATRIX_LR=0.022, EMA=0.9965, WARMDOWN_FRAC=0.72
  7. FA3/SDPA backend switch: USE_FA3=1 uses flash_attn_3_func on Hopper; USE_FA3=0
    falls back to F.scaled_dot_product_attention(..., enable_gqa=True) for non-Hopper GPUs
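
Techniques 2 and 3 can be illustrated with a small plain-Python sketch. It is not the submission's code. `layer_schedule`, `sequential_block`, and `parallel_block` are hypothetical names, the 11-layer depth is an assumption (the description does not state the layer count), and "twice" is read here as two passes in total over layers 3 to 5. Normalization layers are omitted, and `attn` and `mlp` stand in for real sublayers.

```python
def layer_schedule(num_layers, loop_layers, passes):
    """Order in which layer indices run when loop_layers is repeated `passes` times."""
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(first))                  # layers before the looped span
    for _ in range(passes):
        order.extend(loop_layers)               # looped span, reusing the same weights
    order.extend(range(last + 1, num_layers))   # layers after the looped span
    return order


def sequential_block(x, attn, mlp):
    """Standard residual block: the MLP sees the attention output."""
    h = x + attn(x)
    return h + mlp(h)


def parallel_block(x, attn, mlp):
    """GPT-J style block: attention and MLP both read the same input x."""
    return x + attn(x) + mlp(x)


# Assumed depth of 11 layers, chosen only for illustration.
schedule = layer_schedule(num_layers=11, loop_layers=[3, 4, 5], passes=2)
print(schedule)

# Scalar stand-ins for the sublayers make the difference visible.
attn = lambda v: 2.0 * v
mlp = lambda v: 3.0 * v
print(sequential_block(1.0, attn, mlp), parallel_block(1.0, attn, mlp))
```

In the parallel form the two sublayers do not depend on each other, so they can be computed from one shared input.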

Rule Compliance (Issue #1017, Track B legal eval-time adaptation)

  • Causality: strict sliding-window causal eval
  • Normalized distribution: standard softmax over full vocab, no n-gram cache, no logit
    biasing
  • Score before update: each 32K-token chunk fully scored under torch.no_grad() BEFORE any
    SGD update
  • Single pass: each token scored exactly once, no rescoring / multi-pass selection
  • No SLOT, no pre-quant TTT, no ETLB, no n-gram cache or tilt
  • Artifact ≤ 16 MB on all 3 seeds (max 15,973,790 B)
  • Training ≤ 600 s on all 3 seeds (~588 s actual)
  • Eval (sliding + TTT) ≤ 600 s on all 3 seeds (~545 s actual)
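
The score-before-update and single-pass rules above come down to an ordering constraint, sketched below on a toy problem. This is an illustration under stated assumptions, not the submission's evaluation loop. The one-weight model `y = w * x`, the squared-error loss, and the names `cosine_lr` and `score_first_ttt` are hypothetical, and the cosine schedule is assumed to decay per update step. Block freezing and the cross-rank `dist.all_reduce` are left out.

```python
import math


def cosine_lr(base_lr, step, total_steps):
    """Cosine decay from base_lr toward 0 (assumed per-step schedule)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


def score_first_ttt(chunks, w, base_lr=0.005, momentum=0.9, epochs=3):
    """Score each chunk with the current weight, then adapt on that chunk."""
    velocity = 0.0
    total_loss, n_scored = 0.0, 0
    total_steps = len(chunks) * epochs
    step = 0
    for chunk in chunks:
        # 1) Score the whole chunk first; no weight update has seen these tokens.
        for x, y in chunk:
            total_loss += (w * x - y) ** 2
            n_scored += 1          # each token is scored exactly once
        # 2) Only then run SGD with momentum on the chunk that was just scored.
        for _ in range(epochs):
            grad = sum(2.0 * (w * x - y) * x for x, y in chunk) / len(chunk)
            lr = cosine_lr(base_lr, step, total_steps)
            velocity = momentum * velocity + grad
            w -= lr * velocity
            step += 1
    return total_loss / n_scored, w, n_scored


# Toy data following y = 2x, split into two "chunks".
chunks = [[(1.0, 2.0), (2.0, 4.0)], [(1.0, 2.0), (3.0, 6.0)]]
mean_loss, w_final, n_scored = score_first_ttt(chunks, w=0.0)
print(mean_loss, w_final, n_scored)
```

The first chunk is always scored with the unadapted weight, so any gain comes only from later chunks benefiting from updates on earlier ones.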
