# SP8192 + Gated Attention + NorMuon + Norm-PCT-Dropout + Parallel Muon + Legal TTT

**val_bpb = 1.0824** (3-seed mean, std 0.0004) | 8xH100 80GB HBM3 SXM

![Training Curves](fig1_convergence.png)
![Eval Comparison](fig2_eval_comparison.png)

## Summary

We explore adding novel training-time techniques on top of the PR #1493 stack (current SOTA at 1.0810). Our submission introduces **four new components** — Gated Attention, NorMuon, Norm-PCT-Dropout, and Parallel Muon — each independently validated across multiple seeds before integration. We achieve **1.0824 BPB** (3-seed mean), placing within **+0.0014 BPB** of the current record.

Notably, our quantization gap is **smaller** than PR #1493's (10.3 vs 11.7 milli-BPB), suggesting our novel components produce weight distributions that are more amenable to GPTQ compression. The eval pipeline comparison chart above breaks down exactly where each milli-BPB is won or lost.

## 3-Seed Results

| Seed | Pre-quant BPB | Quantized BPB | Sliding BPB | **TTT BPB** | Artifact (bytes) |
|------|---------------|---------------|-------------|-------------|------------------|
| 42 | 1.0898 | 1.1001 | 1.0833 | **1.0824** | 16,051,299 |
| 314 | 1.0894 | 1.0997 | 1.0827 | **1.0819** | 16,050,433 |
| 999 | 1.0903 | 1.1000 | 1.0828 | **1.0828** | 16,051,839 |
| **Mean** | **1.0898** | **1.0999** | **1.0829** | **1.0824** | — |
| **Std** | **0.0004** | **0.0003** | **0.0003** | **0.0004** | — |

**Current SOTA** (PR #1493): 1.0810 BPB. Delta: +0.0014 BPB.

## Novel Techniques

These four techniques were developed and validated independently before being stacked on the PR #1493 base architecture.

### 1. Gated Attention

Per-head learnable sigmoid gate applied to the attention output, after multi-head attention but before the residual connection. Each head learns when to attenuate its contribution, allowing the model to dynamically suppress noisy or redundant heads during different parts of training.

- Validated across **5 independent seeds** (NIGHT_MODE campaign)
- Architectural — no eval-time overhead, no compliance concerns
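
A minimal sketch of the gate, assuming it is applied per head before the attention output projection (the write-up only specifies "after multi-head attention but before the residual connection"); the class name, initialization, and tensor layout below are illustrative rather than the submission's actual code.

```python
import torch
import torch.nn as nn

class GatedAttentionOutput(nn.Module):
    """Per-head learnable sigmoid gate on the attention output, applied
    before the output projection and residual add (illustrative sketch)."""

    def __init__(self, n_heads: int, head_dim: int, model_dim: int):
        super().__init__()
        # One learnable logit per head; initialized so gates start near 1.0
        # (an assumption -- the submission's init is not specified).
        self.gate_logit = nn.Parameter(torch.full((n_heads,), 4.0))
        self.out_proj = nn.Linear(n_heads * head_dim, model_dim, bias=False)

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq, n_heads, head_dim) from multi-head attention
        gate = torch.sigmoid(self.gate_logit)              # (n_heads,)
        gated = attn_out * gate.view(1, 1, -1, 1)          # per-head attenuation
        b, s, h, d = gated.shape
        return self.out_proj(gated.reshape(b, s, h * d))   # residual add happens outside
```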

### 2. NorMuon (Post-NS Row Normalization)

A variant of the MuonEq-R optimizer where row normalization is applied **after** the Newton-Schulz orthogonalization steps rather than before. This preserves the directional information from NS while still normalizing the update magnitudes. The standard MuonEq-R normalizes rows before NS, which can wash out useful gradient structure.

- Validated across **2 seeds**
- Optimizer-only change, no model architecture impact
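
A minimal sketch of the reordering, using the commonly published quintic Newton-Schulz coefficients; `newton_schulz` and `normuon_update` are illustrative names, and the rest of the MuonEq-R update (momentum, scaling) is omitted.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Quintic Newton-Schulz orthogonalization of a 2D gradient matrix
    (coefficients follow the commonly published Muon iteration)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transpose else X

def normuon_update(grad: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Standard MuonEq-R (per the write-up) normalizes rows *before* NS.
    # NorMuon orthogonalizes first, then rescales each row to unit L2 norm,
    # preserving the directional structure produced by Newton-Schulz.
    O = newton_schulz(grad)
    return O / (O.norm(dim=-1, keepdim=True) + eps)
```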

### 3. Norm-PCT-Dropout

A regularization technique that zeros the **top 1% highest L2-norm rows** of the FFN intermediate activation during training. Unlike standard dropout (which is random), this targets the most activated neurons — acting as an implicit capacity regularizer that prevents the model from over-relying on a small set of dominant pathways.

- Validated across **2 seeds**
- Training-time only, no eval impact
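
A minimal sketch, interpreting "rows" as the per-token rows of the flattened (tokens, d_ff) intermediate activation; that interpretation and the function name are assumptions.

```python
import torch

def norm_pct_dropout(h: torch.Tensor, pct: float = 0.01, training: bool = True) -> torch.Tensor:
    """Zero the top-`pct` fraction of highest-L2-norm rows of the FFN
    intermediate activation during training (no-op at eval time)."""
    if not training or pct <= 0:
        return h
    flat = h.reshape(-1, h.shape[-1])                      # (tokens, d_ff)
    norms = flat.norm(dim=-1)                              # L2 norm per row
    k = max(1, int(pct * flat.shape[0]))
    top = norms.topk(k).indices                            # indices of dominant rows
    mask = torch.ones(flat.shape[0], device=h.device, dtype=h.dtype)
    mask[top] = 0.0
    return (flat * mask.unsqueeze(-1)).reshape(h.shape)
```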

### 4. Parallel Muon (Batched Newton-Schulz)

Groups parameters with matching shapes and runs the Newton-Schulz orthogonalization steps as a single batched matrix operation rather than sequential per-parameter calls. Pure throughput optimization with no quality impact.

- **~3% training speedup** on 8xH100 SXM
- ~3 additional training steps within the 600s budget
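
A minimal sketch of the batching, assuming a Muon-style Newton-Schulz inner loop; grouping by exact shape and the function names are illustrative.

```python
import torch
from collections import defaultdict

def batched_newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Newton-Schulz on a stack of matrices G: (n, rows, cols); every matmul
    is batched, so one kernel launch covers the whole group."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (torch.linalg.matrix_norm(G, keepdim=True) + eps)
    transpose = G.size(-2) > G.size(-1)
    if transpose:
        X = X.transpose(-2, -1)
    for _ in range(steps):
        A = X @ X.transpose(-2, -1)
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.transpose(-2, -1) if transpose else X

def parallel_muon(grads: list[torch.Tensor]) -> list[torch.Tensor]:
    """Group gradients by shape and orthogonalize each group in one batched call."""
    groups = defaultdict(list)
    for i, g in enumerate(grads):
        groups[tuple(g.shape)].append(i)
    out = [None] * len(grads)
    for idxs in groups.values():
        stacked = torch.stack([grads[i] for i in idxs])    # (n, rows, cols)
        orth = batched_newton_schulz(stacked)
        for j, i in enumerate(idxs):
            out[i] = orth[j]
    return out
```

Because each shape group shares one batched kernel launch instead of a Python loop of per-parameter calls, the orthogonalization overhead shrinks, which is where the ~3% step-time saving comes from.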

## Experimental Journey

Our path to this result involved extensive experimentation:

1. **Phase 1 (cheap GPU)**: Validated all novel techniques independently on RTX 3090 / A6000 pods. Over 50 training runs across different seeds, hyperparameters, and technique combinations. Key finding: techniques must be validated in isolation before stacking — combined techniques can interfere.

2. **Phase 2 (speed optimization)**: Systematic A/B testing of training throughput improvements. Discovered that `torch.compile(mode='max-autotune-no-cudagraphs')` + Flash Attention 3 + Parallel Muon compose cleanly for a **2.14x total speedup** over baseline.

3. **Int8 quantization discovery**: Found that converged smaller models exhibit catastrophic GPTQ int6 quantization failure (3+ BPB gap). Int8 eliminates this for small models but doesn't fit in the 16MB cap for the full 11L+4x architecture. This led us to use int6 for the final submission while retaining the architectural insights.

4. **Integration**: Stacked all validated techniques onto the PR #1493 base architecture (11L + 4x MLP + depth recurrence + parallel residuals + legal TTT). The result is within +0.0014 BPB of SOTA with a **better quantization gap** than the baseline.

## Architecture

```
11 layers x 512 dim x 8 heads / 4 KV heads
MLP: 4x with LeakyReLU(0.5)^2
35,989,681 parameters
Partial RoPE (16/64 dims), layerwise LN scale
Tied embeddings, logit softcap = 30.0
Depth recurrence: layers 3-5 looped 2 extra times (17 virtual layers from 11 physical)
Parallel residuals: layers 7+ (GPT-J style)
Skip gates (sigmoid-gated U-Net connections)
Gated attention: per-head sigmoid gate
```
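
A rough sketch of how the depth-recurrence and parallel-residual pieces above could compose in the forward pass; layer indexing, block attribute names, and the omission of skip gates are simplifications, not the submission's implementation.

```python
def forward_blocks(x, blocks, recur_lo=3, recur_hi=5, extra_passes=2, parallel_start=7):
    """Illustrative block schedule: layers recur_lo..recur_hi are re-run
    extra_passes times (11 physical + 3 x 2 = 17 virtual layers), and layers
    >= parallel_start use GPT-J style parallel residuals."""
    for i, blk in enumerate(blocks, start=1):              # 1-indexed layers
        passes = 1 + (extra_passes if recur_lo <= i <= recur_hi else 0)
        for _ in range(passes):
            if i >= parallel_start:
                h = blk.ln(x)
                x = x + blk.attn(h) + blk.mlp(h)           # parallel residual
            else:
                x = x + blk.attn(blk.ln1(x))               # sequential residual
                x = x + blk.mlp(blk.ln2(x))
    return x
```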

## Training

- **Optimizer**: MuonEq-R with NorMuon + Parallel Muon; AdamW for embeddings/scalars
- **Steps**: ~4450 in 588s on 8xH100 SXM
- **Schedule**: Linear warmdown over the final 72%, EMA decay 0.9965 (see the sketch after this list)
- **Regularization**: Norm-PCT-Dropout (top 1% FFN norm zeroing)
- **Compile**: `torch.compile(mode='max-autotune-no-cudagraphs')` + Flash Attention 3
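
A minimal sketch of the schedule pieces named above (linear warmdown over the final 72% of steps and EMA decay 0.9965); the function names and multiplier formulation are assumptions.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.72) -> float:
    """Constant LR, then a linear warmdown to zero over the final
    `warmdown_frac` of training."""
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return 1.0
    return max((total_steps - step) / (total_steps - warmdown_start), 0.0)

def ema_update(ema_params, model_params, decay: float = 0.9965) -> None:
    """Exponential moving average of weights with the quoted decay.
    Call under torch.no_grad() on detached EMA copies."""
    for e, p in zip(ema_params, model_params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```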

## Quantization

Full-Hessian GPTQ with SDClip: `clip = k * std(row)`. Int6 for attention/MLP matrices, int8 for token embeddings. Brotli-11 compression.
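
A minimal sketch of the SDClip rule, applied row-wise before GPTQ quantization; the value of `k` shown is illustrative, not the submission's.

```python
import torch

def sdclip(W: torch.Tensor, k: float = 3.0) -> torch.Tensor:
    """Clamp each weight row to clip = k * std(row) before GPTQ quantization."""
    clip = k * W.std(dim=1, keepdim=True)   # per-row clipping threshold
    return W.clamp(min=-clip, max=clip)
```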

**Note on artifact size**: Mean artifact is 16,051,190 bytes (~51KB over the 16,000,000 byte cap). An identified fix (enabling CMP_QUANT_VALUE_DEDUP, a validated alphabet-snap post-processing step) is expected to resolve this. See discussion below.

## TTT (Test-Time Training)

Score-first, chunk-based SGD adaptation per Issue #1017 Track B:
- 32K-token chunks, score under `torch.no_grad()` before each SGD update
- 3 epochs per chunk, cosine LR decay, gradient clipping at 1.0
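
A minimal sketch of the score-first loop under the conditions listed above; the model/loss interface, learning rate, and chunk format are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def chunk_nll(model, ids):
    """Mean next-token NLL (nats) for a chunk; assumes model(ids) returns logits."""
    logits = model(ids[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))

def score_first_ttt(model, chunks, lr=1e-4, epochs=3, max_grad_norm=1.0):
    """Score each chunk under no_grad *before* any SGD step on it, then adapt
    for `epochs` passes with cosine LR decay and gradient clipping at 1.0."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs * len(chunks))
    total_bits, total_bytes = 0.0, 0
    for ids, n_bytes in chunks:                 # ids: (1, <=32768) token chunk
        with torch.no_grad():                   # Condition 3: score before update
            nll = chunk_nll(model, ids).item()
        total_bits += nll * (ids.size(1) - 1) / math.log(2)
        total_bytes += n_bytes                  # each token scored exactly once
        for _ in range(epochs):                 # adapt only after scoring
            loss = chunk_nll(model, ids)
            opt.zero_grad(set_to_none=True)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            opt.step()
            sched.step()
    return total_bits / total_bytes             # bits per byte
```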

## Compliance

Per Issue #1017:
- **Condition 1** (Causality): Strictly causal sliding-window eval
- **Condition 2** (Normalized): Standard softmax over full 8192-token vocab. No n-gram cache, no logit biasing.
- **Condition 3** (Score-before-update): Each chunk scored before SGD
- **Condition 4** (Single pass): Each token scored exactly once

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache.

## Reproduction

```bash
pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
python3 data/cached_challenge_fineweb.py --variant sp8192

SEEDS=42,314,999 bash submission/dry_run.sh
```

## Credits

- **@clarkkev** — SP8192 + GPTQ SDClip + MuonEq-R + depth recurrence (PR #1394)
- **@dexhunter** — 3-layer depth recurrence (PR #1331, #1437), legal TTT on SP8192 (PR #1413)
- **@abaybektursun** — Score-first TTT framework (PR #549)
- **@Robby955** — Parallel residuals on SP8192 (PR #1412)
- **@msisovic** — Parallel residuals concept (PR #1204)
- **@X-Abhishek-X** — Hyperparameter tuning (PR #1445)
- **@bigbag** — PR #1493 stack integration
- **@taka6745** — Gated Attention, NorMuon, Norm-PCT-Dropout, Parallel Muon, experimental campaign
# Experiment Log

This document summarizes the experiments conducted during the development of this submission. Over 60 training runs were performed across RTX 3090, A6000, and 8xH100 SXM hardware.

## Novel Technique Validation (NIGHT_MODE Campaign)

All novel techniques were validated independently on cheap GPUs before stacking on the final architecture.

| Technique | Seeds | Result | Verdict | Description |
|-----------|-------|--------|---------|-------------|
| **Gated Attention** | n=5 | train_loss 1.3711 (champion) | Confirmed win | Per-head sigmoid gate on attention output |
| **NorMuon** | n=2 | train_loss 1.40995 | Confirmed win | Post-NS row normalization (vs pre-NS in standard MuonEq-R) |
| **Norm-PCT-Dropout** | n=2 | train_loss 1.41365 | Confirmed win | Zero top 1% L2-norm FFN rows during training |
| **Parallel Muon** | n=2 | +3% throughput, quality neutral | Confirmed speedup | Batched Newton-Schulz across same-shape params |
| Gated + Legal TTT + N-gram Backoff (stacked) | n=2 | 1.45705 (+0.086 regression) | Stacking hostile | Too many novel techniques degrade each other |
| N-gram Bias Stack | n=3 | Various | Ruled out | Issue #1017 Condition 2 grey area; excluded from submission |
| CMP_QUANT_VALUE_DEDUP | n=2 | Quality neutral, -10-15% artifact size | Validated but not used | Alphabet-snap post-quant compression |

**Key finding**: Novel techniques that work in isolation can interfere when stacked. Our final stack uses only the 4 techniques that survived multi-seed validation AND compose cleanly.

## Phase 2: Speed Optimization (31 Experiments on RTX 3090)

| Exp | Config | ms/step | Speedup vs Baseline | Pre-quant BPB | Notes |
|-----|--------|---------|---------------------|---------------|-------|
| E1 | Baseline (no compile) | 2933 | 1.0x | 3.035 | Shot 0e quant gap 0.022 |
| E2 | torch.compile (default) | 1581 | **1.85x** | 2.920 | torch.compile is the biggest single win |
| E4b | max-autotune-no-cudagraphs | 1526 | **1.92x** | 2.923 | +3.7% over E2 |
| E5 | + cudnn.benchmark | 1514 | **1.94x** | 2.925 | +0.8% incremental |
| E6 | + Parallel Muon | 1369 | **2.14x** | 2.932 | Batched NS across params |
| E8 | + NUM_LOOPS=1 | 1410 | **2.08x** | 2.928 | Speed win but quality trade-off |
| E13 | NUM_LAYERS=8 | 1062 | **2.76x** | 3.052 | Layer reduction — faster but less capacity |
| E17 | NUM_LAYERS=8 + MLP=3 | 983 | **2.98x** | 3.065 | Near-3x baseline |
| E21 | NUM_LAYERS=6 | 856 | **3.43x** | 2.954 | Smaller model, more steps |
| E24 | NUM_LAYERS=6 + MLP=2 | 725 | **4.05x** | 2.971 | Best speed/quality balance |
| E26 | + TRAIN_SEQ_LEN=1024 | 643 | **4.56x** | 2.923 | Pareto optimal on 3090 |
| E29 | MODEL_DIM=256 | 343 | **8.55x** | 2.082 | Speed record but quant 3.64 (unusable) |

**Key insight**: 3090 is compute-bound. Bigger batches are a wash. Only cutting compute (fewer layers, smaller MLP, shorter sequences) or fusing kernels gives real speedups.

## Phase 2: Champion Full-Wallclock Runs (600s Budget)

| Config | Hardware | Steps | Pre-quant BPB | Quant BPB | Quant Gap | Notes |
|--------|----------|-------|---------------|-----------|-----------|-------|
| CHAMP_A (11L + MLP=2 + int6) | 3090 | 515 | 1.600 | 4.603 | **3.00** | Int6 catastrophic failure |
| CHAMP_B (6L + MLP=2 + int6) | 3090 | 813 | 1.399 | 4.966 | **3.57** | Int6 catastrophic failure |
| CHAMP_C (default + int6) | 3090 | 431 | 1.704 | 4.801 | **3.10** | Int6 catastrophic failure |
| **CHAMP_D (6L + MLP=2 + int8)** | 3090 | 813 | **1.398** | **1.399** | **0.001** | **Int8 breakthrough** |

**Critical discovery**: GPTQ int6 has insufficient precision for converged weight distributions on small models. The quant gap goes from ~0.02 (undertrained) to 3+ BPB (converged). Switching to int8 eliminates this entirely for small models.

For the full 11L+4x architecture used in the final submission, int8 doesn't fit the 16MB cap. We use int6 (matching PR #1493) and achieve a quant gap of **10.3 mBPB** — better than PR #1493's **11.7 mBPB**.

## Final Submission Run (8xH100 SXM)

| Retry | Issue | Resolution | Cost |
|-------|-------|------------|------|
| 1 | get_data.sh missing mkdir for cached SP model | Added mkdir -p before cp | ~$1.40 |
| 2 | Bootstrap STEP 3 ran with default config (not our stack) | Skipped bootstrap STEP 3, went straight to submission | ~$3 |
| 3 | Single-GPU (run.sh used python3 not torchrun) | Auto-detect GPU count, use torchrun when >1 | ~$8 |
| 4 | Flash Attention 3 not installed | pip install flash_attn_3 from wheel | ~$5 |
| **5 (final)** | Int8 quant doesn't fit 16MB + catastrophic gap with dedup | Switched to int6 matrices + int8 embeddings (matching PR #1493) | ~$25 |

Total compute cost: ~$60 across 5 retries. Effective (non-wasted) cost: ~$25.

```json
{
  "author": "taka6745",
  "github_id": "taka6745",
  "name": "SP8192 + NL11 MLP4 + Parallel Residuals (L7+) + Gated Attention + NorMuon + Parallel Muon + Legal Score-First TTT",
  "date": "2026-04-10",
  "track": "10min_16mb",
  "val_bpb": 1.08237,
  "val_bpb_std": 0.00043,
  "seeds": [42, 314, 999],
  "seed_results": {
    "42": {"val_bpb": 1.08243, "artifact_bytes": 16051299},
    "314": {"val_bpb": 1.08192, "artifact_bytes": 16050433},
    "999": {"val_bpb": 1.08276, "artifact_bytes": 16051839}
  },
  "hardware": "8xNVIDIA H100 80GB HBM3 SXM",
  "pytorch_version": "2.9.1+cu128",
  "technique_summary": "SP8192 + 11L 4xMLP (35.99M params) + 3-Layer Depth Recurrence (L3-5, activate at frac=0.35) + Parallel Residuals (L7+, GPT-J style) + QK-Gain 5.25 + EMA 0.9965 + WD 0.095 + Gated Attention + NorMuon + Norm-PCT-Dropout + Parallel Muon + Score-First TTT (SGD 3ep) + GPTQ int6 SDClip + Brotli",
  "compliance": {
    "train_under_600s": true,
    "artifact_under_16mb": false,
    "eval_under_600s": true,
    "no_slot": true,
    "no_pre_quant_ttt": true,
    "no_etlb": true,
    "no_ngram_cache": true,
    "score_first_ttt": true,
    "three_seeds": true
  },
  "attribution": {
    "sp8192_gptq_sdclip": "@clarkkev (PR #1394)",
    "depth_recurrence": "@dexhunter (PR #1331, #1437)",
    "parallel_residuals": "@Robby955 (PR #1412), @msisovic (PR #1204)",
    "legal_ttt_framework": "@abaybektursun (PR #549), @dexhunter (PR #1413)",
    "hyperparameter_tuning": "@X-Abhishek-X (PR #1445), @bigbag (PR #1493)",
    "gated_attention_normuon_norm_pct_dropout_parallel_muon": "@taka6745 (this submission)"
  },
  "notes": "Artifact is ~51KB over the 16MB cap (16,051,190 mean bytes). Known fix: CMP_QUANT_VALUE_DEDUP=1 or PARALLEL_START_LAYER=-1 (two-lane override bug). Prepped in commit ad8bb34 for retry 6."
}
```