@@ -0,0 +1,71 @@
# 11L GPTQ-lite + Self-Distillation TTT

**val_bpb: 1.1260** (sliding window, stride=64) | **15.99 MB** | 8xH100 SXM, 600s

## Architecture

Built on PR #374's SOTA stack with two novel post-training optimizations.

- 11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
- 3x MLP expansion with relu-squared activation
- Efficient Partial XSA on last 4 layers
- Partial RoPE (16/64 dims) + NTK-aware scaling (see the sketch after this list)
- LN Scale Factor 1/sqrt(layer_idx+1)
- U-Net skip connections (5 encoder, 6 decoder)
- SmearGate + BigramHash (2048 buckets, dim=128)
- Shared Value Embedding (dim=128, layers 9,10)
- FlashAttention 3 (Hopper)
- Orthogonal init with proj scaling
- Tight SWA (scale<0.2, every 50 steps, 12 checkpoints)
- Late QAT (STE int6 at lr_scale<0.1)
- EMA not used (Tight SWA instead)
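
The partial RoPE item above rotates only 16 of the 64 head dimensions. Below is a minimal sketch of that idea, assuming PyTorch and the rotate-half convention; `partial_rope` and its arguments are illustrative names, not this repo's API, and NTK-aware scaling is indicated only as a comment on the rotary base.

```python
import torch

def partial_rope(x: torch.Tensor, positions: torch.Tensor,
                 rope_dims: int = 16, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to the first `rope_dims` dims of each head; pass the rest through unrotated."""
    # x: (batch, heads, seq, head_dim=64), positions: (seq,) token indices
    rot, keep = x[..., :rope_dims], x[..., rope_dims:]
    half = rope_dims // 2
    # An NTK-aware variant would enlarge `base` when the context window is extended.
    inv_freq = base ** (-torch.arange(half, device=x.device, dtype=torch.float32) / half)
    angles = positions.float()[:, None] * inv_freq[None, :]          # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, keep], dim=-1)
```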

## Novel Contributions

### 1. GPTQ-lite: Per-Layer Optimal Clip Percentile Search

Standard int6 quantization uses a fixed clipping strategy (row-wise amax). GPTQ-lite searches 5 clip percentiles per weight matrix (1.0, 0.999, 0.995, 0.99, 0.98) and selects the one that minimizes reconstruction error. This reduces quantization degradation at zero training-time cost, since the search runs only once during post-training int6 quantization.
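
A minimal sketch of the search, assuming PyTorch and per-row symmetric int6 quantization; `quantize_int6_with_best_clip` and the surrounding names are illustrative, not the repo's actual API.

```python
import torch

CANDIDATE_PERCENTILES = (1.0, 0.999, 0.995, 0.99, 0.98)

def quantize_int6_with_best_clip(w: torch.Tensor):
    """Pick, for one weight matrix, the clip percentile minimizing reconstruction MSE."""
    absw = w.abs().float()
    best = None
    for p in CANDIDATE_PERCENTILES:
        # Per-row clip value: amax for p=1.0, otherwise the p-quantile of |w|.
        clip = absw.amax(dim=-1, keepdim=True) if p == 1.0 else \
               torch.quantile(absw, p, dim=-1, keepdim=True)
        scale = (clip / 31.0).clamp(min=1e-8)          # symmetric int6 codes in [-31, 31]
        q = (w / scale).round().clamp(-31, 31)
        err = ((q * scale - w) ** 2).mean().item()     # reconstruction error vs. the fp weights
        if best is None or err < best[0]:
            best = (err, p, q.to(torch.int8), scale)
    _, percentile, q_int6, scale = best
    return q_int6, scale, percentile
```

Dequantization at load time is just `q_int6 * scale`, so the chosen percentile adds no inference cost; the search itself runs once per matrix at export.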

### 2. Self-Distillation TTT (Eval-Time Adaptation)

Post-training KL-divergence adaptation on validation data. A frozen teacher (a snapshot of the trained model) guides the student's adaptation, preserving the XSA attention patterns that hard-label TTT disrupts (as documented in PR #303's negative-interaction study). Settings: temperature=2.0, first 4 blocks frozen, 2 epochs of SGD (lr=0.001).

Result: SDTTT was slightly negative in this run, raising val_bpb by 0.0003 (1.1428 → 1.1431). The KL constraint may be too strong at T=2.0. Included for completeness and future tuning.
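
A minimal sketch of the adaptation loop described above, assuming PyTorch; `model.blocks` and `val_batches` are illustrative placeholders for the model's transformer blocks and an iterator over validation token batches, not the repo's actual interface.

```python
import copy
import torch
import torch.nn.functional as F

def self_distill_ttt(model, val_batches, epochs=2, lr=1e-3, T=2.0, freeze_blocks=4):
    teacher = copy.deepcopy(model).eval()              # frozen snapshot of the trained model
    for p in teacher.parameters():
        p.requires_grad_(False)
    for block in model.blocks[:freeze_blocks]:         # keep the first 4 blocks fixed
        for p in block.parameters():
            p.requires_grad_(False)
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=lr)

    for _ in range(epochs):
        for x in val_batches:                          # x: (batch, seq) token ids
            with torch.no_grad():
                t_logits = teacher(x)                  # (batch, seq, vocab)
            s_logits = model(x)
            # KL(teacher || student) at temperature T, scaled by T^2 as in standard distillation.
            loss = F.kl_div(
                F.log_softmax(s_logits / T, dim=-1),
                F.log_softmax(t_logits / T, dim=-1),
                log_target=True,
                reduction="batchmean",
            ) * (T * T)
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    return model
```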

## Training

- Muon optimizer (matrices): lr=0.025, momentum=0.99 (warmed up from 0.92 over 1500 steps), WD=0.04
- AdamW: embeddings lr=0.035, scalars lr=0.025, WD=0.04
- Gradient clip: 0.3
- Batch: 786,432 tokens/step, seq_len=2048
- Warmdown: 3000 iters (wallclock-based)
- Tight SWA: every 50 steps when scale<0.2 (12 checkpoints; see the sketch after this list)
- Late QAT: STE int6 when LR scale<0.1
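
A minimal sketch of the Tight SWA schedule referenced above, assuming PyTorch; the class and its bookkeeping are illustrative, not the repo's implementation.

```python
import torch

class TightSWA:
    """Fold in a snapshot every `interval` steps once the LR scale drops below
    `scale_threshold`, keeping at most `max_checkpoints` contributions."""
    def __init__(self, interval: int = 50, scale_threshold: float = 0.2, max_checkpoints: int = 12):
        self.interval, self.scale_threshold, self.max_checkpoints = interval, scale_threshold, max_checkpoints
        self.avg_state, self.count = None, 0

    @torch.no_grad()
    def maybe_update(self, model: torch.nn.Module, step: int, lr_scale: float) -> None:
        if lr_scale >= self.scale_threshold or step % self.interval != 0 or self.count >= self.max_checkpoints:
            return
        state = model.state_dict()
        if self.avg_state is None:
            self.avg_state = {k: (v.detach().float().clone() if v.is_floating_point() else v.clone())
                              for k, v in state.items()}
        else:
            for k, v in state.items():
                if v.is_floating_point():              # running mean over floating-point tensors only
                    self.avg_state[k] += (v.float() - self.avg_state[k]) / (self.count + 1)
        self.count += 1

    @torch.no_grad()
    def apply_to(self, model: torch.nn.Module) -> None:
        if self.avg_state is None:
            return
        current = model.state_dict()
        model.load_state_dict({k: v.to(current[k].dtype) for k, v in self.avg_state.items()})
```

After the wallclock cap the averaged weights replace the live ones (the log's `swa:applying averaged 12 checkpoints` line).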

## Results

| Metric | Value |
|--------|-------|
| Steps | 6,701 |
| Step avg | 89.55ms |
| Pre-quant val_bpb | 1.1429 |
| Post-SWA val_bpb | 1.1428 |
| Post-SDTTT val_bpb | 1.1431 |
| Int6 roundtrip val_bpb | 1.1497 |
| **Sliding window val_bpb (s64)** | **1.1260** |
| Artifact size | 15,989,300 bytes |
| Peak memory | 20,680 MiB/GPU |
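
The sliding-window number is lower than the chunked roundtrip number largely because of the evaluation protocol: with stride 64, overlapping windows are re-scored so that each counted token sees close to a full window of left context. A minimal sketch of that standard protocol, assuming PyTorch; it is not necessarily the repo's exact implementation, and converting the summed NLL to bits per byte additionally divides by ln(2) times the byte count of the eval text.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens: torch.Tensor, seq_len: int = 2048, stride: int = 64) -> float:
    """Sum of next-token NLL (nats); each counted token gets up to `seq_len` of left context."""
    total_nll, prev_end, n = 0.0, 0, tokens.numel()
    for start in range(0, n - 1, stride):
        end = min(start + seq_len, n - 1)              # last target index covered by this window
        window = tokens[start:end + 1].unsqueeze(0)    # (1, window_len + 1)
        logits = model(window[:, :-1])                 # (1, window_len, vocab)
        losses = F.cross_entropy(logits[0], window[0, 1:], reduction="none")
        new = end - max(prev_end, start)               # targets not already scored by an earlier window
        total_nll += losses[-new:].sum().item()
        prev_end = end
        if end == n - 1:
            break
    return total_nll
```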

## Run

```bash
SDTTT_ENABLED=1 SDTTT_EPOCHS=2 SDTTT_LR=0.001 SDTTT_TEMPERATURE=2.0 \
SDTTT_FREEZE_BLOCKS=4 GPTQ_ENABLED=1 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

All other hyperparameters use PR #374 defaults (NUM_LAYERS=11, XSA_LAST_N=4, SWA_ENABLED=1, etc.).

## Code

- Full source and experiment history: https://github.com/dannywillowliu-uchi/parameter-golf-entry
@@ -0,0 +1,16 @@
{
"author": "Danny Willow Liu",
"github_id": "dannywillowliu-uchi",
"name": "11L GPTQ-lite Int6 MLP3x",
"blurb": "PR #374 SOTA stack (11L XSA4, Tight SWA, Partial RoPE 16/64, LN Scale, Late QAT, Value Embedding) plus GPTQ-lite: per-layer optimal clip percentile search during int6 quantization. FA3 Hopper attention. Int6 per-row + zstd-22. 8xH100 SXM.",
"date": "2026-03-22T02:00:00Z",
"val_loss": 1.90068380,
"val_bpb": 1.12569499,
"roundtrip_val_loss": null,
"roundtrip_val_bpb": null,
"step_stop": 6733,
"wallclock_seconds": 600.024,
"bytes_total": null,
"bytes_model_int6_zstd": null,
"bytes_code": null
}
@@ -0,0 +1,83 @@
[attn_backend] flash_attn_3
[attn_backend] flash_attn_3
[attn_backend] flash_attn_3
[attn_backend] flash_attn_3
[attn_backend] flash_attn_3
[attn_backend] flash_attn_3
[attn_backend] flash_attn_3
[attn_backend] flash_attn_3
NCCL version 2.25.1+cuda12.8
logs/385e0333-636d-4fbe-a752-caf2d26b62d9.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:26993756
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_4 active_layers:[7, 8, 9, 10]
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9279 val_bpb:4.1031 train_time:0ms step_avg:0.02ms
step:1/20000 train_loss:6.9299 train_time:209ms step_avg:208.99ms
step:2/20000 train_loss:8.5550 train_time:282ms step_avg:140.98ms
step:3/20000 train_loss:7.8359 train_time:373ms step_avg:124.20ms
step:4/20000 train_loss:7.2015 train_time:463ms step_avg:115.83ms
step:5/20000 train_loss:7.0518 train_time:553ms step_avg:110.57ms
step:6/20000 train_loss:6.8319 train_time:641ms step_avg:106.85ms
step:7/20000 train_loss:6.7431 train_time:729ms step_avg:104.09ms
step:8/20000 train_loss:6.7552 train_time:815ms step_avg:101.92ms
step:9/20000 train_loss:6.4192 train_time:906ms step_avg:100.67ms
step:10/20000 train_loss:6.0776 train_time:994ms step_avg:99.36ms
step:500/20000 train_loss:2.4022 train_time:44544ms step_avg:89.09ms
step:1000/20000 train_loss:2.2693 train_time:89061ms step_avg:89.06ms
step:1500/20000 train_loss:2.2158 train_time:133657ms step_avg:89.10ms
step:2000/20000 train_loss:2.0564 train_time:178190ms step_avg:89.10ms
step:2500/20000 train_loss:2.1613 train_time:222758ms step_avg:89.10ms
step:3000/20000 train_loss:2.1512 train_time:267305ms step_avg:89.10ms
step:3500/20000 train_loss:2.1723 train_time:311815ms step_avg:89.09ms
step:4000/20000 train_loss:1.9722 train_time:356311ms step_avg:89.08ms
step:4000/20000 val_loss:2.0619 val_bpb:1.2212 train_time:356327ms step_avg:89.08ms
step:4500/20000 train_loss:2.1144 train_time:400827ms step_avg:89.07ms
step:5000/20000 train_loss:2.0986 train_time:445296ms step_avg:89.06ms
step:5500/20000 train_loss:2.0082 train_time:489836ms step_avg:89.06ms
step:6000/20000 train_loss:1.9310 train_time:534356ms step_avg:89.06ms
swa:start step:6150
late_qat:enabled step:6434 scale:0.0999
step:6500/20000 train_loss:2.0645 train_time:579215ms step_avg:89.11ms
step:6733/20000 val_loss:1.9277 val_bpb:1.1417 train_time:600024ms step_avg:89.12ms
stopping_early: wallclock_cap train_time:600024ms step:6733/20000
peak memory allocated: 20678 MiB reserved: 20782 MiB
swa:applying averaged 12 checkpoints
DIAGNOSTIC post_swa val_loss:1.9277 val_bpb:1.1417 eval_time:1980ms
Serialized model: 106178365 bytes
Code size: 81261 bytes
gptq:enabled — per-layer optimal clip search
Serialized model int6+zstd: 15850180 bytes
Total submission size int6+zstd: 15931441 bytes
final_int6_roundtrip val_loss:1.9408 val_bpb:1.1495 eval_time:220658ms
final_int6_roundtrip_exact val_loss:1.94084791 val_bpb:1.14947945
final_int6_sliding_window val_loss:1.9007 val_bpb:1.1257 stride:64 eval_time:190304ms
final_int6_sliding_window_exact val_loss:1.90068380 val_bpb:1.12569499