# Record: SP8192 + Adaptive Hessian-Sensitivity GPTQ Clipping

**val_bpb = 1.0822** (3-seed mean, std 0.0009) | **~15.91 MB** | 8xH100 SXM

## 3-Seed Results

| Seed     | Sliding BPB | Artifact (bytes) |
|----------|-------------|------------------|
| 1337     | **1.0811**  | 15,906,928       |
| 42       | **1.0826**  | 15,909,023       |
| 999      | **1.0828**  | 15,911,535       |
| **Mean** | **1.0822**  | **15,909,162**   |
| **Std**  | **0.0009**  |                  |

Merged SOTA (PR #1019): **1.1147 BPB**. Delta: **-0.0325 BPB**, clearing the 0.005 BPB threshold.

## Novel Contribution: Adaptive Hessian-Sensitivity GPTQ Clipping

Standard GPTQ uses a global `clip_sigmas` parameter (e.g., 12.85) for all weight matrices. This submission replaces that with **per-tensor adaptive clipping** derived from Hessian sensitivity analysis.

**Key insight:** Weight matrices with higher Hessian sensitivity (measured as `H_diag.mean() * row_variance`) suffer more from quantization error. These layers get tighter clipping windows (lower clip_sigmas), so the bulk of their weights is quantized with finer steps, while less-sensitive layers tolerate wider windows that keep outliers intact at little cost in accuracy.

**Algorithm:**
1. Collect full Hessian matrices from calibration data (same as baseline GPTQ)
2. For each weight matrix, compute sensitivity: `sens = mean(diag(H)) * mean(var(W, dim=1))`
3. Compute log-space raw clip_sigmas: `log_cs = -0.15 * log(sens)` (conservative exponent)
4. Binary search for an additive offset in log-space such that the numel-weighted mean of `log(clamp(exp(log_cs + offset), 6.0, 24.0))` equals `log(12.85) + 0.012` (baseline budget + compression margin)
5. Final per-tensor clip_sigmas: `clamp(exp(log_cs + offset), 6.0, 24.0)`
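
A minimal sketch of steps 2-5 in PyTorch, assuming the per-layer weights and full Hessians are already collected into dicts; the function name `adaptive_clip_sigmas` and its signature are illustrative, not the submission's actual code:

```python
import torch

def adaptive_clip_sigmas(weights, hessians, base=12.85, margin=0.012,
                         lo=6.0, hi=24.0, exponent=-0.15):
    """weights / hessians: dict of name -> 2-D weight / full Hessian tensor."""
    names = list(weights)
    # Step 2: sensitivity = mean Hessian diagonal * mean per-row weight variance
    sens = torch.stack([
        hessians[n].diagonal().mean() * weights[n].var(dim=1).mean()
        for n in names
    ])
    numel = torch.tensor([float(weights[n].numel()) for n in names])
    # Step 3: raw log-space clip_sigmas with the conservative exponent
    log_cs = exponent * sens.log()
    # Step 4: binary search for an additive log-space offset so that the
    # numel-weighted mean of the clamped log clip_sigmas hits the budget
    target = torch.tensor(base).log() + margin

    def budget(offset):
        cs = (log_cs + offset).exp().clamp(lo, hi)
        return (cs.log() * numel).sum() / numel.sum()

    low, high = -5.0, 5.0
    for _ in range(60):  # budget() is monotone non-decreasing in the offset
        mid = 0.5 * (low + high)
        low, high = (mid, high) if budget(mid) < target else (low, mid)
    offset = 0.5 * (low + high)
    # Step 5: final per-tensor clip_sigmas
    return {n: float((log_cs[i] + offset).exp().clamp(lo, hi))
            for i, n in enumerate(names)}
```

Because `budget()` is monotone in the offset, the search converges quickly, and the clamp to [6, 24] bounds how far any single tensor can drift from the global baseline.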

**Result:** clip_sigmas values range from ~8.5 (early layers, high sensitivity) to ~19.0 (deep decoder layers, low sensitivity), while the numel-weighted geometric mean holds the compression budget of the global baseline (up to the +0.012 log-space margin).

Example per-tensor clip_sigmas from seed 1337:
- `blocks.0.mlp.proj.weight`: 8.46 (high sensitivity, tight clipping)
- `blocks.2.attn.proj.weight`: 14.05
- `blocks.8.attn.proj.weight`: 19.02 (low sensitivity, wide clipping)

## Other Techniques

1. **SP8192 + GPTQ SDClip** -- int6 matrices, int8 embeddings, zero selective pruning (PR #1394 @clarkkev)
2. **3-Layer Depth Recurrence** (layers 3,4,5, activate at frac=0.35) -- 17 virtual layers from 11 physical (PR #1331 @dexhunter, PR #1437 @dexhunter)
3. **Parallel Residuals** (layers 7+) -- GPT-J style, attention and MLP read from same input (PR #1412 @Robby955, PR #1204 @msisovic)
4. **QK-Gain 5.25** -- learnable per-head query scaling (see the sketch after this list)
5. **Tuned Hyperparameters** -- WD=0.095, MLR=0.022, EMA=0.9965, warmdown=0.72 (PR #1445 @X-Abhishek-X)
6. **zlib+base85 code wrapper** -- 61KB source compressed to ~19.8KB
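
A minimal sketch of the QK-gain scaling from item 4; the module name, and the assumption that the gain multiplies queries before attention, are mine, with the gain initialized to the tuned 5.25:

```python
import torch
import torch.nn as nn

class QKGain(nn.Module):
    """Learnable per-head query scaling (hypothetical sketch)."""

    def __init__(self, n_heads: int, init: float = 5.25):
        super().__init__()
        # One learnable gain per head, broadcast over (seq_len, head_dim).
        self.gain = nn.Parameter(torch.full((n_heads, 1, 1), init))

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: (batch, n_heads, seq_len, head_dim)
        return q * self.gain
```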

## Architecture

11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: encoder [0,1,2,3,4,5,3,4] decoder [5,3,4,5,6,7,8,9,10] (loops layers 3-5, activated at step ~2038). Parallel residuals from layer 7: attention and MLP operate on same pre-residual input. Skip gates (sigmoid-gated U-Net connections).
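
A sketch of the virtual-layer schedule and the parallel-residual split, assuming each block exposes `ln`/`attn`/`mlp` submodules (attribute names are illustrative; skip gates are omitted for brevity):

```python
import torch

# Schedule from the README: 11 physical blocks yield 17 virtual layers once
# recurrence is active (encoder 8 + decoder 9, looping blocks 3-5).
ENCODER = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER = [5, 3, 4, 5, 6, 7, 8, 9, 10]
PARALLEL_FROM = 7  # blocks 7+ use GPT-J style parallel residuals

def run_blocks(x: torch.Tensor, blocks, recurrence_active: bool) -> torch.Tensor:
    schedule = ENCODER + DECODER if recurrence_active else list(range(len(blocks)))
    for i in schedule:
        blk = blocks[i]
        if i >= PARALLEL_FROM:
            # Parallel residual: attention and MLP read the same input.
            h = blk.ln(x)
            x = x + blk.attn(h) + blk.mlp(h)
        else:
            # Sequential pre-LN residuals for earlier blocks.
            x = x + blk.attn(blk.ln1(x))
            x = x + blk.mlp(blk.ln2(x))
    return x
```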

## Training

MuonEq-R optimizer (row-normalized Muon, Newton-Schulz 5 steps), AdamW for embeddings/scalars. ~4612 steps in 588s on 8xH100 SXM. Linear warmdown to LR=0 over final 72% of training. EMA decay 0.9965.
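
The submission's MuonEq-R code isn't shown here; below is a sketch of the 5-step quintic Newton-Schulz orthogonalization used by Muon-family optimizers, with a final row normalization added as a guess at the "-R" (row-normalized) part:

```python
import torch

@torch.no_grad()
def muon_eq_r_step(momentum: torch.Tensor, steps: int = 5,
                   eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize the momentum matrix (sketch)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # standard Muon quintic coefficients
    X = momentum / (momentum.norm() + eps)  # scale so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    # Assumed row normalization; the "-R" detail is not spelled out above.
    return X / (X.norm(dim=1, keepdim=True) + eps)
```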

## Quantization

Full-Hessian GPTQ with **adaptive per-tensor SDClip**: clip_sigmas derived from Hessian sensitivity (see above). int6 for attention/MLP matrices, int8 for token embeddings. Byte-shuffle + Brotli-11 compression.
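
A sketch of the byte-shuffle + Brotli-11 stage, assuming the quantized values sit in a multi-byte NumPy array before packing (the submission's exact bit-packing isn't shown):

```python
import brotli
import numpy as np

def byteshuffle_brotli(arr: np.ndarray) -> bytes:
    """Group the k-th byte of every element together, then Brotli-11.

    Neighboring quantized weights have correlated high bytes, so planar
    byte order exposes longer matches to the entropy coder.
    """
    flat = np.ascontiguousarray(arr).reshape(-1)
    planes = flat.view(np.uint8).reshape(-1, flat.dtype.itemsize)
    shuffled = planes.T.copy().tobytes()  # all byte-plane 0, then plane 1, ...
    return brotli.compress(shuffled, quality=11)
```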

## Compliance

- **Training under 600s:** ~588s on all seeds
- **Artifact under 16MB:** All seeds under 15,912,000 bytes (15.91 MB)
- **Eval under 600s:** Sliding window eval ~91s per seed
- **No SLOT, no ETLB, no n-gram cache**
- **Three seeds:** 1337, 42, 999

## Reproduction

```bash
pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

- **@clarkkev** -- SP8192 + GPTQ Embeddings + SDClip + MuonEq-R + depth recurrence (PR #1394)
- **@dexhunter** -- 3-layer depth recurrence (PR #1331, #1437)
- **@Robby955** -- Parallel residuals on SP8192 (PR #1412)
- **@msisovic** -- Parallel residuals concept (PR #1204)
- **@X-Abhishek-X** -- Hyperparameter tuning: WD=0.095, MLR=0.022, EMA=0.9965 (PR #1445)

## Included Files

- `README.md` (this file)
- `submission.json`
- `train_gpt.py` (zlib+base85 compressed wrapper, 19,846 bytes)
- `train_seed1337.log`
- `train_seed42.log`
- `train_seed999.log`
{
  "author": "chris-colinsky",
  "github_id": "chris-colinsky",
  "name": "SP8192 + Adaptive Hessian-Sensitivity GPTQ Clipping + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25",
  "date": "2026-04-17",
  "track": "10min_16mb",
  "val_bpb": 1.08218,
  "val_bpb_std": 0.00090,
  "seeds": [1337, 42, 999],
  "seed_results": {
    "1337": {"val_bpb": 1.08114, "artifact_bytes": 15906928},
    "42": {"val_bpb": 1.08265, "artifact_bytes": 15909023},
    "999": {"val_bpb": 1.08275, "artifact_bytes": 15911535}
  },
  "hardware": "8xH100 80GB SXM",
  "pytorch_version": "2.9.1+cu128",
  "technique_summary": "SP8192 + Hessian-Sensitivity Adaptive GPTQ Clipping (per-tensor clip_sigmas from H_diag * row_var, exponent -0.15, binary-search offset) + 3-Layer Depth Recurrence (L3-5) + Parallel Residuals (L7+) + QK-Gain 5.25 + EMA 0.9965 + WD 0.095 + Sliding Window Eval + GPTQ SDClip + Brotli",
  "compliance": {
    "train_under_600s": true,
    "artifact_under_16mb": true,
    "eval_under_600s": true,
    "no_slot": true,
    "no_pre_quant_ttt": true,
    "no_etlb": true,
    "no_ngram_cache": true,
    "three_seeds": true
  },
  "attribution": {
    "sp8192_gptq_sdclip": "@clarkkev (PR #1394)",
    "depth_recurrence": "@dexhunter (PR #1331, #1437)",
    "parallel_residuals": "@Robby955 (PR #1412), @msisovic (PR #1204)",
    "hyperparameter_tuning": "@X-Abhishek-X (PR #1445)",
    "adaptive_hessian_clip": "Novel contribution (this submission)"
  }
}