Record: int5 GPTQ + Soft-Round QAT (3-seed mean 1.1162) #606
`records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/README.md` (75 additions)
# Record: int5 GPTQ + 33.6M model + Soft-Round QAT + Legal Score-First TTT

## Summary

**3-seed mean val_bpb = 1.1162 (std 0.0006)**

int5 GPTQ quantization (values in [-15, 15], 31 unique levels) with Hessian-aware error compensation enables a 33.6M-parameter model to fit under 16MB. Soft-Round QAT replaces STE hard rounding with differentiable tanh-based rounding (alpha annealed 1→16) for better training quality at zero extra cost. This is combined with early QAT at threshold 0.5, EMA 0.997, and legal score-first AdamW TTT with cosine LR decay across chunks.
## Key Innovations

1. **int5 quantization** — 31 unique values ([-15, 15]) stored as int8, ~0.46 bytes/param after zstd. Lower entropy means a better compression ratio than int6.
2. **GPTQ error compensation** — Hessian-aware column reordering + Cholesky error redistribution, calibrated on 256 training samples.
3. **33.6M-param model** — MHA 8/8 (full attention), BigramHash 8192, MLP 3.5x (1792), enabled by int5 compression.
4. **Soft-Round QAT** — differentiable rounding `s_α(y) = floor(y) + 0.5 · tanh(α·r) / tanh(α/2) + 0.5` replaces STE. Alpha anneals from 1→16 during QAT steps; better gradient flow improves training quality at zero computational cost.
5. **Early QAT 0.5** — QAT clipping matched to the int5 range (0.9995 percentile / 15.0), ~1750 QAT steps.
6. **EMA 0.997** — exponential moving average of weights, tuned up from 0.9985.
7. **Legal score-first TTT** — every token is scored BEFORE any gradient update uses it. Cosine LR decay across chunks.
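The soft-round formula in item 4 can be sketched in a few lines of plain Python (a scalar sketch; the actual QAT path would apply it elementwise to weight tensors):

```python
import math

def soft_round(y: float, alpha: float) -> float:
    """Differentiable surrogate for round(): as alpha grows, this approaches
    hard rounding, while small alpha keeps gradients flowing during QAT."""
    m = math.floor(y) + 0.5
    r = y - m  # signed offset from the rounding boundary, in [-0.5, 0.5)
    return m + 0.5 * math.tanh(alpha * r) / math.tanh(alpha / 2)

# Annealing alpha 1 -> 16 sharpens the curve toward hard rounding:
print(soft_round(1.7, 1.0))   # gentle, near-identity behavior
print(soft_round(1.7, 16.0))  # close to round(1.7) == 2
```

Since `tanh(α·r) / tanh(α/2)` evaluates to ±1 at r = ±0.5, the function stays continuous across integer boundaries for any alpha.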
## Architecture

- 11 layers, model_dim=512, 8 heads / 8 KV heads (MHA), MLP 3.5x relu²
- XSA on all 11 layers
- Partial RoPE 16/64, LN Scale (1/√(layer+1))
- SmearGate + OrthoInit
- BigramHash 8192, Shared VE128 (layers 9,10)
- Tight SWA (every 50) + EMA 0.997
- Muon lr=0.025, WD=0.04
- FA3 Hopper, ~98ms/step → ~6120 steps in 600s
- **33.6M params**, int5 GPTQ + zstd-22, 2% magnitude pruning
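As a back-of-envelope check on the 16MB budget (using the param count from submission.json and the ~0.46 bytes/param rate quoted above; an estimate, not the repo's exact accounting):

```python
params = 33_580_124           # model_params from submission.json
bytes_per_param = 0.46        # approximate int5 + zstd-22 rate
approx_bytes = params * bytes_per_param
print(f"~{approx_bytes / 1e6:.1f} MB")  # consistent with the ~15.4-15.8 MB seed artifacts
assert approx_bytes < 16 * 1024 * 1024  # fits under the 16 MB cap
```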
## Quantization Pipeline

1. **Early QAT** (threshold 0.5): QAT-aware training with int5 clipping (scale = row_clip / 15.0, clamp [-16, 15])
2. **GPTQ** (post-training): 256-sample Hessian calibration, per-row optimal scales (5-percentile search), column reordering by Hessian diagonal, block-128 Cholesky error compensation
3. **int5 quantization** (range [-15, 15], 31 levels), stored as int8
4. **zstd-22** compression
5. **2% magnitude pruning**
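Steps 1 and 3 can be illustrated with a minimal per-row int5 quantizer (a sketch with hypothetical function names, using a percentile clip and plain rounding; the actual pipeline adds GPTQ column reordering and Cholesky error compensation on top):

```python
import numpy as np

def quantize_int5_rows(w: np.ndarray, clip_pct: float = 0.9995):
    """Per-row symmetric int5 quantization: codes land in [-15, 15]
    (31 levels) and are stored as int8 before zstd compression."""
    clip = np.quantile(np.abs(w), clip_pct, axis=1, keepdims=True)
    scale = clip / 15.0
    q = np.clip(np.round(w / scale), -15, 15).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 256).astype(np.float32)
q, s = quantize_int5_rows(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by ~scale/2 for unclipped values
```

Because only 31 of the 256 possible int8 codes are ever used, the stored stream has low entropy, which is exactly what zstd-22 exploits in step 4.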
## Legal Score-First TTT

- Val data split into 131072-token chunks (474 chunks)
- For each chunk: **score first** (sliding window, stride=32, inference_mode), **then** adapt
- AdamW (lr=0.0001, wd=0.0), 3 epochs per chunk, cosine LR decay across chunks
- Last 2 blocks + norms + lm_head unfrozen (~5.8M / 33.6M params)
- Last chunk is never trained on
- Every token is scored BEFORE any gradient update uses it
- Manual grad all_reduce (no DDP wrapper)
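The score-first protocol above can be sketched as a generic loop (`score` and `adapt` are stand-in callables, not the repo's API; the key invariant is that each chunk is scored by weights that have never seen it):

```python
import math

def score_first_ttt(chunks, score, adapt, base_lr=1e-4):
    """Legal score-first TTT sketch: score each chunk BEFORE adapting on it,
    with cosine LR decay across chunks; the last chunk is never trained on."""
    n = len(chunks)
    bpbs = []
    for i, chunk in enumerate(chunks):
        bpbs.append(score(chunk))  # scored with pre-adaptation weights only
        if i < n - 1:              # skip the final adapt: last chunk stays unseen
            lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * i / max(n - 1, 1)))
            adapt(chunk, lr)       # gradient update happens only after scoring
    return sum(bpbs) / n
```

Because each chunk's score is computed before any update on it, no token ever contributes to a gradient step that influenced its own score.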
## Results

| Seed | TTT BPB | Artifact |
|------|---------|----------|
| 1337 | **1.1155** | 15,822,078 bytes |
| 42 | **1.1163** | 15,415,405 bytes |
| 7 | **1.1167** | 15,368,627 bytes |
| **Mean** | **1.1162** | |
| **Std** | **0.0006** | |
## Reproduction

```bash
# On 8xH100 SXM:
pip install --break-system-packages zstandard
# Build FA3 Hopper (see repo README for instructions)
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80

SEED=1337 SKIP_SLIDING=1 PRUNE_PCT=0.02 \
SOFT_ROUND_QAT=1 \
TTT_EPOCHS=3 TTT_LR=0.0001 TTT_OPTIMIZER=adamw \
TTT_FREEZE_BLOCKS=2 TTT_CHUNK_TOKENS=131072 \
TTT_TEMPERATURE=0.98 INT6_LAST_N=0 \
PPM_ALPHA=1.0 BYTE_WEIGHTED_TTT=0 USE_CACHE=0 \
ADAPTIVE_LR=0 USE_MIXER=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
`records/track_10min_16mb/2026-03-24_Int5GPTQ_33M_LegalTTT/submission.json` (22 additions)
```json
{
  "author": "Ethan Yang",
  "github_id": "EthanYangTW",
  "name": "Record: int5 GPTQ + 33.6M model + Soft-Round QAT + Legal Score-First TTT",
  "blurb": "int5 GPTQ quantization ([-15,15], 31 levels) with Hessian-aware error compensation enables 33.6M params in 16MB. Soft-Round QAT (differentiable tanh rounding, alpha 1→16) replaces STE for better training quality. MHA 8/8, BigramHash 8192, MLP 3.5x (1792), XSA all 11 layers, Early QAT 0.5, EMA 0.997, legal score-first AdamW TTT with cosine LR decay.",
  "date": "2026-03-24T00:00:00Z",
  "val_bpb": 1.1162,
  "val_bpb_std": 0.0006,
  "val_loss_seed1337": 1.88347869,
  "val_bpb_seed1337": 1.11550587,
  "val_loss_seed42": 1.88480123,
  "val_bpb_seed42": 1.11628915,
  "val_loss_seed7": 1.88543543,
  "val_bpb_seed7": 1.11666477,
  "bytes_seed1337": 15822078,
  "bytes_seed42": 15415405,
  "bytes_seed7": 15368627,
  "model_params": 33580124,
  "quantization": "int5 GPTQ ([-15,15], 31 levels) + Soft-Round QAT",
  "compression": "zstd-22",
  "ttt": "legal score-first AdamW, 3 epochs, cosine LR across chunks"
}
```
**Review comment:** The README describes the MLP as "relu²", but the implementation in `train_gpt.py` uses `leaky_relu(..., negative_slope=0.5).square()`. Updating the README to match the actual activation will make the architecture description accurate and easier to reproduce and compare against other runs.