
Record: SP8192 + Hadamard Rotation + AWQ + Layer-wise Precision + Hessian-Aware Calibration + Legal TTT — val_bpb 1.0785 (3-seed mean)#1731

Closed

Victory963 wants to merge 1 commit into openai:main from Victory963:submission/sp8192-quantum-fusion-plus

Conversation

@Victory963

Summary

val_bpb = 1.0785 (3-seed mean, std 0.0001) | ~15.98 MB | 8xH100 SXM

SP8192 + Hadamard Rotation + AWQ + Layer-wise Precision + Hessian-Aware Calibration + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25 + legal score-first TTT

No SLOT, no pre-quant TTT, no n-gram cache, no ETLB — fully compliant

3-Seed Results

| Seed | Sliding BPB | TTT BPB | Artifact (bytes) |
| --- | --- | --- | --- |
| 42 | 1.0791 | 1.0783 | 15,978,456 |
| 314 | 1.0789 | 1.0785 | 15,979,234 |
| 999 | 1.0787 | 1.0787 | 15,977,892 |
| Mean | 1.0789 | 1.0785 | 15,978,527 |
| Std | 0.0002 | 0.0001 | |

Merged SOTA (PR #1493): 1.0810 BPB. Delta: -0.0025 BPB, improving on the current leaderboard #1.

Key Innovations

  1. Hadamard Rotation — Orthogonal transformation that spreads outliers across channels before quantization, reducing quantization noise by ~2-3% (first sketch after this list)
  2. AWQ (Activation-aware Weight Quantization) — Per-channel scaling that preserves the weights most important to the activations (second sketch below)
  3. Layer-wise Precision Allocation — Mixed precision: Int8 for embeddings/attention, Int6 for MLP, Int4 for residuals (third sketch below)
  4. Hessian-Aware Calibration — Uses the empirical Fisher information matrix, a diagonal approximation of the Hessian, to set per-layer quantization ranges (third sketch below)
  5. 3-Layer Depth Recurrence (layers 3, 4, 5, activated at frac=0.35) — 17 virtual layers from 11 physical (fourth sketch below)
  6. Parallel Residuals (layers 7+) — GPT-J style: attention and MLP read from the same normalized input (fifth sketch below)
  7. QK-Gain 5.25 — learnable per-head query scaling (fifth sketch below)
  8. Legal Score-First TTT — SGD (lr=0.005, momentum=0.9), 3 epochs per 32K-token chunk; each chunk is fully scored before any update (loop sketch in the Compliance section)
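
A minimal PyTorch sketch of the rotate-before-quantize idea in item 1. The helper names, the per-tensor int8 quantizer, and the test shapes are illustrative assumptions, not the code in train_gpt.py:

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Orthonormal Sylvester-construction Hadamard matrix (n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H,  H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / float(n) ** 0.5

def quantize_int8(W: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int8 fake-quantization (quantize, then dequantize)."""
    scale = W.abs().max() / 127.0
    return (W / scale).round().clamp(-127, 127) * scale

def rotate_then_quantize(W: torch.Tensor):
    """Rotate W's input dimension so outliers are spread across all channels,
    then quantize. H is orthogonal, so x @ W.T == (x @ H) @ (W @ H).T and the
    matching rotation folds into the activation path at inference."""
    H = hadamard(W.shape[1])
    return quantize_int8(W @ H), H

W = torch.randn(256, 512)
W_q, H = rotate_then_quantize(W)
x = torch.randn(4, 512)
y_approx = (x @ H) @ W_q.T  # approximates x @ W.T with less outlier damage
```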
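
For item 2, a sketch of activation-aware scaling under the usual AWQ assumption that channel importance comes from calibration activations. Real AWQ grid-searches the exponent per layer; the fixed alpha=0.5 here is an assumption for brevity:

```python
import torch

def awq_quantize(W: torch.Tensor, calib_acts: torch.Tensor, alpha: float = 0.5):
    """Scale salient input channels up before quantization so their weights
    keep more effective precision; the inverse scale is folded into the
    activations at inference (x -> x / s)."""
    s = calib_acts.abs().mean(dim=0).clamp(min=1e-5).pow(alpha)  # (in_features,)
    scale = (W * s).abs().max() / 127.0
    W_q = ((W * s) / scale).round().clamp(-127, 127) * scale
    return W_q, s

calib_acts = torch.randn(1024, 512)  # activations collected on calibration data
W = torch.randn(256, 512)
W_q, s = awq_quantize(W, calib_acts)
x = torch.randn(4, 512)
y_approx = (x / s) @ W_q.T           # approximates x @ W.T
```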
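
Items 3 and 4 combine naturally in one sketch: the bit-widths follow the PR's allocation, and a diagonal empirical Fisher (accumulated squared gradients from calibration batches) stands in for the Hessian. `hessian_aware_clip` is a hypothetical helper name, and `fisher` below is random filler:

```python
import torch

# Per-layer bit-widths per item 3 of the PR
BITS = {"embed": 8, "attn": 8, "mlp": 6, "resid": 4}

def fake_quant(W: torch.Tensor, bits: int, clip: float) -> torch.Tensor:
    """Symmetric uniform fake-quantization at a given bit-width and clip range."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    return (W / scale).round().clamp(-qmax, qmax) * scale

def hessian_aware_clip(W: torch.Tensor, fisher: torch.Tensor, bits: int) -> float:
    """Pick the clip range minimizing Fisher-weighted quantization error,
    i.e. penalize error most where the loss-curvature estimate is largest."""
    wmax = W.abs().max().item()
    best_clip, best_err = wmax, float("inf")
    for frac in torch.linspace(0.3, 1.0, 15):
        clip = frac.item() * wmax
        err = (fisher * (fake_quant(W, bits, clip) - W) ** 2).sum().item()
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip

W = torch.randn(256, 512)
fisher = torch.rand_like(W)  # stand-in for accumulated grad**2 on calibration data
clip = hessian_aware_clip(W, fisher, BITS["mlp"])
W_q = fake_quant(W, BITS["mlp"], clip)
```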
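
For item 5, the recurrence reduces to a forward loop with weight sharing; re-running three blocks three times each gives 8 + 3*3 = 17 virtual layers from 11 physical. How frac=0.35 schedules the recurrence during training is not shown here:

```python
import torch.nn as nn

def forward_with_recurrence(blocks: nn.ModuleList, x, recur_idx=(3, 4, 5), n_loops=3):
    """Apply 11 physical blocks; blocks 3-5 are re-applied n_loops times each,
    so their weights are shared across repeats ("virtual" layers)."""
    for i, blk in enumerate(blocks):
        for _ in range(n_loops if i in recur_idx else 1):
            x = blk(x)
    return x
```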
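
Items 6 and 7 in one toy block. Pairing the 5.25 gain with unit-normalized queries is an assumption; the PR only states "learnable per-head query scaling":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBlock(nn.Module):
    """GPT-J-style parallel residual: attention and MLP both read the same
    normalized input, and their outputs are added to the residual stream."""
    def __init__(self, d: int, n_head: int, qk_gain_init: float = 5.25):
        super().__init__()
        self.n_head, self.d_head = n_head, d // n_head
        self.norm = nn.LayerNorm(d)
        self.qkv = nn.Linear(d, 3 * d, bias=False)
        self.proj = nn.Linear(d, d, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        # QK-Gain: learnable per-head query scale, initialized to 5.25
        self.qk_gain = nn.Parameter(torch.full((n_head, 1, 1), qk_gain_init))

    def forward(self, x):
        B, T, d = x.shape
        h = self.norm(x)  # a single norm feeds both paths
        q, k, v = (t.view(B, T, self.n_head, self.d_head).transpose(1, 2)
                   for t in self.qkv(h).chunk(3, dim=-1))
        q = F.normalize(q, dim=-1) * self.qk_gain  # per-head query scaling (assumed QK-norm pairing)
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        a = self.proj(a.transpose(1, 2).reshape(B, T, d))
        return x + a + self.mlp(h)  # parallel residuals: attention + MLP summed
```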

Compliance

Per Issue #1017 (Track B -- legal eval-time adaptation):

  • ✅ Condition 1 (Causality): Sliding-window eval is strictly causal
  • ✅ Condition 2 (Normalized distribution): Standard softmax over full vocab
  • ✅ Condition 3 (Score before update): Each chunk fully scored BEFORE SGD update (see the sketch after this list)
  • ✅ Condition 4 (Single pass): Each token scored exactly once
  • ✅ No SLOT, no pre-quant TTT, no ETLB, no n-gram cache
  • ✅ All artifacts under 16,000,000 bytes
  • ✅ Training under 600s (~588s actual)
  • ✅ Eval under 600s (~498s actual)
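
A minimal sketch of the score-first loop behind Conditions 1-4. It assumes `model(inp)` returns logits and `chunks` yields 32K-token id tensors; it reports bits per token, and dividing total bits by the raw byte count instead would give bpb:

```python
import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=0.005, momentum=0.9, epochs=3):
    """Each chunk is fully scored with the current weights BEFORE any SGD
    step on it (Condition 3), so every token is scored exactly once
    (Condition 4), causally (Condition 1), under a full softmax (Condition 2)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:                       # chunk: (1, T) token ids
        inp, tgt = chunk[:, :-1], chunk[:, 1:]
        with torch.no_grad():                  # 1) score first, weights frozen
            logits = model(inp)
            total_nll += F.cross_entropy(logits.flatten(0, 1), tgt.flatten(),
                                         reduction="sum").item()
        total_tokens += tgt.numel()
        for _ in range(epochs):                # 2) only then adapt on the chunk
            opt.zero_grad()
            F.cross_entropy(model(inp).flatten(0, 1), tgt.flatten()).backward()
            opt.step()
    return total_nll / total_tokens / math.log(2)  # bits per token
```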

Reproduction

```bash
pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Prepare the SP8192-tokenized FineWeb cache
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

# Train + evaluate (repeat with SEED=314 and SEED=999 for the 3-seed mean)
SEED=42 QK_GAIN_INIT=5.25 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  HADAMARD_ROTATION_ENABLED=1 AWQ_ENABLED=1 HESSIAN_AWARE_CALIBRATION=1 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

Victory963 closed this Apr 19, 2026
