# QAT Dead-Code Bug + 7 Untried Techniques: What I Found

**Non-Record Submission (Research Contribution)**
**Author:** [@wfproc](https://github.com/wfproc)
**Base:** PR #549 by @abaybektursun (1.1194 bpb, current SOTA)
**Hardware:** 1xH100 80GB SXM, 600s wallclock (~940 steps per run)
**Submitted pre-quant:** 1.3646 val_bpb | **Submitted post-quant (int6 sliding):** 2.3411 val_bpb

Not competitive with the 8xH100 leaderboard (7x fewer steps), but the findings should transfer to other hardware budgets.

---

## The Short Version

I spent two days running 20+ experiments on the SOTA #549 stack, testing techniques nobody else had submitted results for. The main finding: Late QAT is dead code in #315-derived submissions due to a torch.compile bug. Fixing it actually made the int6 gap *worse*, not better. I also tested 7 novel techniques from recent papers (Muon-VS, Deep Delta Learning, stable depth recurrence, anti-layer removal, NS step reduction, spectral SVD compression, wider models). All negative. The throughput tax at this scale kills everything that adds overhead.

---

## The QAT Dead-Code Bug

PR #315's README flagged this already, but I wanted to confirm it and try to fix it.

The issue: `CastedLinear._qat_enabled` is a class attribute set to `False`. When `torch.compile(fullgraph=True)` traces the model, it sees `_qat_enabled=False` and eliminates the STE branch from the compiled graph. Setting it to `True` later (when Late QAT is supposed to activate) does nothing. The compiled code is frozen.
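
Here's a minimal sketch of the failure mode. Names like `fake_quant_int6` are illustrative stand-ins, not the repo's exact helpers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant_int6(w: torch.Tensor) -> torch.Tensor:
    # Illustrative symmetric per-tensor int6 fake-quant; not the repo's exact scheme.
    scale = w.abs().max().clamp(min=1e-8) / 31.0
    return (w / scale).round().clamp(-32, 31) * scale

class CastedLinear(nn.Linear):
    _qat_enabled = False  # plain Python bool, baked into the trace as a constant

    def forward(self, x):
        w = self.weight
        if self._qat_enabled:  # constant-folded away under fullgraph=True
            w = w + (fake_quant_int6(w) - w).detach()  # straight-through estimator
        return F.linear(x, w, self.bias)

# model = torch.compile(model, fullgraph=True)  # traces with _qat_enabled == False
# CastedLinear._qat_enabled = True  # flipped later: the compiled graph never sees it
```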

This is present in the SOTA (#549). I verified by checking step times: with the original code, step time stays flat at ~625ms throughout training. If QAT were actually running, you'd see a jump when it activates, since the STE adds overhead.

I tried three ways to fix it:

| Fix | What happened |
|---|---|
| Mutable list `[False]` | `FailOnRecompileLimitHit` error, `fullgraph=True` won't recompile |
| Instance attribute per module | Same recompilation error |
| Tensor multiplier (0.0 or 1.0) | Works. Graph shape stays the same, no recompile needed |

The tensor approach: `w = w + qat_scale * (w_q - w).detach()`. When scale is 0.0, it's a no-op. When 1.0, full STE. The compiled graph handles both without recompilation.
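
A sketch of the fixed module, reusing the illustrative `fake_quant_int6` from the sketch above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CastedLinearFixed(nn.Linear):
    def __init__(self, in_features: int, out_features: int, bias: bool = False):
        super().__init__(in_features, out_features, bias=bias)
        # 0.0 = QAT off, 1.0 = full STE. Because this is a tensor rather than a
        # Python bool, the STE term survives tracing, and mutating the value
        # never changes the graph shape, so fullgraph=True stays happy.
        self.register_buffer("qat_scale", torch.zeros(()))

    def forward(self, x):
        w = self.weight
        w_q = fake_quant_int6(w)  # illustrative helper from the sketch above
        w = w + self.qat_scale * (w_q - w).detach()  # no-op while qat_scale == 0
        return F.linear(x, w, self.bias)

# Activating Late QAT without a recompile (the real run ramps the scale; the
# log shows late_qat:enabled step:416 scale:0.1499):
# for m in model.modules():
#     if isinstance(m, CastedLinearFixed):
#         m.qat_scale.fill_(1.0)
```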

With the fix, step time jumps from ~625ms to ~670ms when QAT activates. So QAT is actually running now. But the results got worse:

| Metric | Dead QAT (original) | Working QAT (tensor fix) |
|---|---|---|
| val_bpb pre-quant | 1.3631 | 1.3646 |
| val_bpb int6 sliding | 2.3107 | 2.3411 |

Why? A few theories:

1. At only ~940 steps (1xH100), the model hasn't converged enough for STE to help. The weights are still moving too fast.
2. WD=0.04 + EMA 0.997 already push weights toward int6-friendly distributions without needing explicit fake quantization.
3. The tensor multiply `qat_scale * (w_q - w).detach()` always executes (even at scale=0.0), adding a tiny overhead that costs a few training steps.

I can't rule out that QAT would help at 7000 steps (8xH100). But it's also possible that leaving QAT effectively disabled is genuinely the right call for this stack.

---

## Novel Technique Sweep

All tested on the SOTA #549 stack on 1xH100. Baseline: 1.3631 val_bpb (pre-quant, slope=0.75, 947 steps).

| # | Technique | Paper | val_bpb | Delta | Why it failed |
|---|---|---|---|---|---|
| 1 | Muon-VS (variance-adaptive) | arXiv:2601.14603 | 1.3884 | +0.025 | Variance buffer needs 100s of steps to warm up; +2.2% overhead |
| 2 | Deep Delta Learning | arXiv:2601.00417 | 1.3720 | +0.009 | Beta params stay near zero at init; within noise |
| 3 | Thinking Deeper recurrence | arXiv:2603.21676 | 1.4163 | +0.053 | 21% step overhead, 163 fewer steps |
| 4 | Anti-layer removal | arXiv:2603.19348 | n/a | n/a | No anti-layers found in the 11-layer stack |
| 5 | Newton-Schulz steps=3 | - | 1.3798 | +0.017 | Worse orthogonalization quality (sketch below); +9 extra steps don't compensate |
| 6 | Spectral SVD (640d 12L) | Novel | 1.4646 | +0.102 | 46% slower, only 649 steps, SVD can't separate signal from noise |
| 7 | Wider model (576d 11L) | - | 1.4504 | +0.087 | 33% slower, only 711 steps |

Also confirmed that LeakyReLU(0.75)^2 beats LeakyReLU(0.5)^2 by 0.006 bpb (lower val_bpb), i.e. squaring the LeakyReLU output with negative slope 0.75 rather than 0.5, matching PR #977.
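
For context on row 5: a sketch of the quintic Newton-Schulz iteration Muon uses to orthogonalize updates, with coefficients from the public modded-nanogpt implementation (the repo's version may differ in detail). Fewer iterations are cheaper per optimizer step, which is where the +9 steps came from, but the result lands further from orthogonal:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately orthogonalize G by pushing its singular values toward 1.
    # steps=3 (row 5) trades approximation quality for speed.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)  # scale so the top singular value is <= ~1
    if G.size(-2) > G.size(-1):
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(-2) > G.size(-1):
        X = X.mT
    return X
```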

### The throughput tax

This keeps coming up: at ~83ms/step on H100, each millisecond of added overhead costs roughly 0.007 bpb. Techniques 1, 3, 6, and 7 all added overhead. None of them delivered enough quality improvement per step to compensate. The only technique that worked (LeakyReLU slope change) adds zero overhead.

### Anti-layer results (full table)

Trained 943 steps, then zeroed each layer's attn_scale + mlp_scale and re-evaluated (the diagnostic loop is sketched below the table):

| Layer | Delta bpb when removed | Role |
|---|---|---|
| 0 | +1.127 | Critical (embedding interaction) |
| 1 | +0.346 | Critical |
| 2 | +0.237 | Important |
| 3 | +0.087 | Moderate |
| 4 | +0.081 | Moderate |
| 5-6 | +0.036 | Low |
| 7-8 | +0.031-0.034 | Low (layer 8 least important) |
| 9-10 | +0.040-0.044 | Low |

No layer hurts performance when present. The LN Scale (1/sqrt(L+1)) in the stack may prevent anti-layer formation by dampening deep layers.
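
A sketch of the diagnostic loop behind this table. Attribute and helper names (`model.blocks`, `evaluate_bpb`) are assumptions; the real version runs under `ANTILAYER_DIAGNOSTIC=1` in train_gpt.py:

```python
import torch

@torch.no_grad()
def antilayer_diagnostic(model, evaluate_bpb):
    # evaluate_bpb(model) -> val_bpb on the eval set (assumed helper).
    base = evaluate_bpb(model)
    for i, block in enumerate(model.blocks):
        saved = (block.attn_scale.clone(), block.mlp_scale.clone())
        block.attn_scale.zero_()  # "remove" the layer by silencing both
        block.mlp_scale.zero_()   # residual branches
        delta = evaluate_bpb(model) - base
        # A negative delta would flag an anti-layer (the model improves with
        # the layer gone); every delta in the table above is positive.
        print(f"layer {i}: delta_bpb {delta:+.3f}")
        block.attn_scale.copy_(saved[0])
        block.mlp_scale.copy_(saved[1])
```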

---

## Other Findings

**SWA during QAT:** PR #989 found SWA sabotages QAT. I added a one-line fix that stops SWA accumulation when QAT activates. Worth adopting if you use both.
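
A sketch of that guard, with assumed names (`swa_model`, `qat_active`, `swa_start_step`) since the real one-liner sits inside train_gpt.py's training loop:

```python
# Inside the training loop: once Late QAT activates, stop updating the weight
# average so it doesn't blend pre-QAT and post-QAT weight distributions.
if step >= swa_start_step and not qat_active:
    swa_model.update_parameters(model)  # torch.optim.swa_utils.AveragedModel
```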

**Prune-then-quantize:** Implemented but only partially tested. Prunes smallest-magnitude weights before int6 quantization (arXiv:2603.18426). One partial run showed neutral results. The pruned zeros don't compress better than quantized values under lzma, so the artifact size didn't shrink. Needs full evaluation.
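
A sketch of the pruning step (function name is mine; the env toggle is `PRUNE_FRACTION`):

```python
import torch

def prune_smallest(w: torch.Tensor, fraction: float = 0.05) -> torch.Tensor:
    # Zero the smallest-magnitude fraction of weights before int6 quantization.
    # The hope was that exact zeros compress better under lzma; the partial run
    # above suggests they don't once everything is quantized anyway.
    k = int(w.numel() * fraction)
    if k == 0:
        return w
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() <= threshold, torch.zeros_like(w), w)
```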

---

## Reproducing These Results

From the repo root:
```bash
# Download data (if not already present)
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10

# Main run (slope=0.75, includes QAT tensor-scale fix)
RUN_ID=main \
LEAKY_SLOPE=0.75 \
torchrun --standalone --nproc_per_node=1 \
records/track_non_record_16mb/2026-03-28_QAT_DeadCode_Analysis_NovelTechniques_1xH100/train_gpt.py

# With prune-then-quantize (5%)
RUN_ID=prune5 \
LEAKY_SLOPE=0.75 \
PRUNE_FRACTION=0.05 \
torchrun --standalone --nproc_per_node=1 \
records/track_non_record_16mb/2026-03-28_QAT_DeadCode_Analysis_NovelTechniques_1xH100/train_gpt.py

# With anti-layer diagnostic (runs layer ablation after training)
RUN_ID=antilayer \
LEAKY_SLOPE=0.75 \
ANTILAYER_DIAGNOSTIC=1 \
torchrun --standalone --nproc_per_node=1 \
records/track_non_record_16mb/2026-03-28_QAT_DeadCode_Analysis_NovelTechniques_1xH100/train_gpt.py
```

Other toggles: `MUON_VS=1`, `DEEP_DELTA=1`, `RECURRENCE_LAYERS=2 RECURRENCE_STEPS=2`, `SVD_RANK=128 MODEL_DIM=640 NUM_LAYERS=12`.

---

## What I'd try next with more compute

1. QAT fix at 7000 steps (8xH100) to see if the STE actually helps when the model is better converged
2. Spectral SVD compression at full scale (the concept needs 7000+ steps for weight matrices to develop spectral structure)
3. Prune-then-quantize sweep (5%, 10%, 15%)
4. LeakyReLU slope sweep at 0.80, 0.85, 0.90

---

## Included Files

- `train_gpt.py` - modified SOTA #549 script with all techniques as env var toggles
- `requirements.txt` - Python dependencies
- `submission.json` - metadata
- `train.log` - QAT fix run output (slope=0.75 with tensor-scale QAT, 1xH100, 943 steps)

## Acknowledgments

Built on PR #549 (@abaybektursun), PR #315 (@jfprincz, first to flag the QAT torch.compile issue), PR #977 (@michaelwinczuk, LeakyReLU 0.75), PR #989 (@alexanderaperry-arch, SWA-QAT finding). Issue #140 commentary by @notapplica was really helpful for prioritizing what to try.

---

## requirements.txt

```
torch>=2.6.0
numpy
sentencepiece
flash-attn
zstandard
```

---

## submission.json

```json
{
  "author": "wfproc",
  "github_id": "wfproc",
  "name": "QAT Dead-Code Analysis + Novel Technique Sweep (1xH100)",
  "blurb": "Non-record research submission. Found that torch.compile constant-folds Late QAT in #315-derived code (confirmed in SOTA #549). Tested a fix via tensor-scale STE, but working QAT actually made the int6 gap worse. Also swept 7 untried techniques from recent papers (Muon-VS, Deep Delta Learning, Thinking Deeper recurrence, anti-layer removal, NS step reduction, spectral SVD compression, wider models). All negative on 1xH100. Key finding: at 83ms/step, each 1ms of overhead costs about 0.007 bpb.",
  "date": "2026-03-28T00:00:00Z",
  "track": "non-record-16mb",
  "val_loss": 2.3041,
  "val_bpb": 1.3646,
  "pre_quant_val_loss": null,
  "pre_quant_val_bpb": 1.3646,
  "step_stop": 943,
  "wallclock_seconds": 600,
  "bytes_total": 7303141,
  "bytes_model_int6_lzma": 7200188,
  "bytes_code": 102953,
  "gpu": "1xH100-80GB-SXM"
}
```

---

## train.log

```
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
step:0/20000 val_loss:6.9309 val_bpb:4.1049 train_time:0ms step_avg:0.02ms
swa:start step:250
late_qat:enabled step:416 scale:0.1499
step:500/20000 train_loss:2.4300 train_time:318487ms step_avg:636.97ms
step:943/20000 val_loss:2.3041 val_bpb:1.3646 train_time:600509ms step_avg:636.81ms
stopping_early: wallclock_cap train_time:600509ms step:943/20000
DIAGNOSTIC post_ema val_loss:2.4761 val_bpb:1.4665 eval_time:15592ms
Serialized model: 106158518 bytes
Serialized model int6+lzma: 7200188 bytes
Total submission size int6+lzma: 7303141 bytes
final_int6_roundtrip val_loss:3.9539 val_bpb:2.3417 eval_time:28383ms
final_int6_roundtrip_exact val_loss:3.95388609 val_bpb:2.34171405
final_int6_sliding_window val_loss:3.9528 val_bpb:2.3411 stride:64 eval_time:565334ms
final_int6_sliding_window_exact val_loss:3.95279298 val_bpb:2.34107285
final_int8_zlib_roundtrip_exact val_loss:3.95279298 val_bpb:2.34107285
```