# SP8192 PR #1874 + Optimized Hyperparameters — val_bpb 1.06844 (3-seed mean)

## Results

| Seed | Pre-quant BPB | Post-quant BPB | **Post-TTT BPB** | Artifact (bytes) | Train time | Eval time |
|------|---------------|----------------|------------------|-------------------|------------|-----------|
| 1337 | 1.06960 | 1.07925 | **1.06798** | 15,950,405 | 599.64s | 409.6s |
| 42 | 1.06984 | 1.07948 | **1.06824** | 15,952,215 | 599.65s | 421.0s |
| 2025 | 1.07060 | 1.08028 | **1.06909** | 15,948,755 | 599.56s | 381.2s |
| **Mean** | **1.07001** | **1.07967** | **1.06844** | **15,950,458** | **599.62s** | **403.9s** |
| **Std** | 0.00053 | 0.00053 | **0.00058** | 1,730 | 0.05s | 20.2s |
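The BPB numbers in the table follow directly from the `val_loss` values recorded in the JSON below: bits per byte is the mean cross-entropy in nats per token, divided by ln 2 and by the bytes-per-token ratio of the tokenizer on the validation set. A minimal sketch of that conversion, where the ~3.727 UTF-8 bytes per SP8192 token is *inferred* from the reported seed-1337 numbers rather than taken from the training code:

```python
import math

def nats_per_token_to_bpb(val_loss: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte (BPB)."""
    return val_loss / (math.log(2) * bytes_per_token)

# Ratio implied by the seed-1337 row (assumption: constant across seeds,
# which the other rows are consistent with).
BYTES_PER_TOKEN = 2.75871500 / (1.06798 * math.log(2))   # ~3.7266

print(round(nats_per_token_to_bpb(2.75937977, BYTES_PER_TOKEN), 5))  # seed 42 -> 1.06824
```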

## Configuration

- **Base code:** PR #1874 (AjAnubolu) verbatim — no code modifications
- **Environment variables:** `MIN_LR=0.10 QK_GAIN_INIT=5.25 GATE_ATTN_WIDTH=24 GPTQ_RESERVE_SECONDS=0.5 VAL_LOSS_EVERY=0`
- **Hardware:** 8xH100 80GB SXM (RunPod on-demand)
- **Data template:** `c5dbhtfrrt` (SP8192, 128 train + 1 val shards)

## Techniques (all from PR #1874, activated via env vars)

1. **LQER Asymmetric Rank-4** — SVD-based low-rank quantization error reduction on top-K=3 highest-error GPTQ residuals
2. **SmearGate + Attention Output Gate (width 24)** — per-layer smoothing + full-dim attention gating
3. **Polar Express Newton-Schulz** — 5 per-iteration minimax-tuned coefficient tuples for Muon optimizer
4. **MIN_LR=0.10** — warmdown LR floor at 10% of max (prevents LR collapse to zero)
5. **QK_GAIN_INIT=5.25** — per-head query-key attention scaling
6. **GATE_ATTN_WIDTH=24** — doubled attention gate capacity
7. **GPTQ_RESERVE_SECONDS=0.5** — maximizes training steps (default 4.0 wastes ~28 steps)
8. **VAL_LOSS_EVERY=0** — eliminates mid-training eval overhead (~14s saved = ~112 extra steps)
9. **Phased Score-First TTT** — 3-phase AdamW LoRA-TTT (rank 128), score-first ordering
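Technique 1 (LQER) corrects quantization error with a low-rank term: the residual between the full-precision weight and its quantized version is SVD-truncated to rank 4 and stored as two small factors, so at load time `W_q + A @ B` is closer to `W` than `W_q` alone. A minimal NumPy sketch, not PR #1874's implementation (the top-K=3 layer selection by GPTQ error, the asymmetric scaling, and the int packing are omitted; the uniform rounding below is a crude stand-in for GPTQ):

```python
import numpy as np

def lqer_correction(W: np.ndarray, W_q: np.ndarray, rank: int = 4):
    """Rank-k SVD factors of the quantization residual W - W_q (LQER-style)."""
    U, S, Vt = np.linalg.svd(W - W_q, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, rank), singular values folded into A
    B = Vt[:rank, :]             # (rank, in)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
W_q = np.round(W * 4) / 4        # crude uniform quantizer stand-in for GPTQ
A, B = lqer_correction(W, W_q, rank=4)
err_before = np.linalg.norm(W - W_q)
err_after = np.linalg.norm(W - (W_q + A @ B))
assert err_after < err_before    # the rank-4 term shrinks the residual
```

The rank-4 factors cost `(64 + 64) * 4` extra floats here versus `64 * 64` for the dense residual, which is the artifact-size trade-off the technique exploits.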
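Technique 3 (Polar Express Newton-Schulz) concerns the orthogonalization step inside Muon: each gradient matrix is pushed toward its nearest semi-orthogonal matrix by a quintic polynomial iteration. The sketch below uses the widely known fixed coefficient tuple rather than the five per-iteration minimax-tuned tuples from PR #1344, which I cannot reproduce here; the iteration structure is the same:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Quintic Newton-Schulz iteration driving G toward its polar factor,
    as used for Muon updates. Fixed (a, b, c) here; Polar Express instead
    uses a different minimax-tuned tuple on each of the 5 iterations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius norm bounds spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.diag([3.0, 2.0, 1.0, 0.5])        # toy matrix with known singular values
s = np.linalg.svd(newton_schulz(G), compute_uv=False)
assert np.all((s > 0.5) & (s < 1.3))     # all singular values pushed toward 1
```

The point of tuning the coefficients per iteration is faster flattening of the singular-value spectrum within a fixed 5-step budget, which matters when the iteration runs once per optimizer step.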
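Technique 4 (MIN_LR=0.10) changes only the tail of the schedule: during warmdown the learning rate decays toward 10% of the peak instead of zero. A hypothetical sketch of that shape, assuming a constant-then-linear-warmdown schedule (the exact schedule in `train_gpt.py` may differ):

```python
def warmdown_lr(step: int, total_steps: int, warmdown_steps: int,
                max_lr: float = 1.0, min_lr_frac: float = 0.10) -> float:
    """Linear warmdown that floors at min_lr_frac * max_lr instead of zero."""
    if step < total_steps - warmdown_steps:
        return max_lr
    frac = (total_steps - step) / warmdown_steps   # 1 -> 0 over the warmdown
    return max_lr * (min_lr_frac + (1 - min_lr_frac) * frac)

assert warmdown_lr(0, 1000, 200) == 1.0            # flat phase at peak LR
assert abs(warmdown_lr(1000, 1000, 200) - 0.10) < 1e-12  # floor, not zero
```

Keeping the final steps at a nonzero LR means the last ~hundred steps still make measurable progress rather than decaying into a no-op.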

## Rule Compliance

- Score-first phased TTT (no re-scoring)
- No pre-quant TTT on validation data
- No n-gram cache or PPM
- No CaseOps, no casefold — standard SP8192 UTF-8 byte counting
- Artifact < 16,000,000 bytes (max 15,952,215 B)
- Train time < 600s (max 599.65s), eval time < 600s (max 421.0s)

## How to reproduce

```bash
SEED=1337 MIN_LR=0.10 QK_GAIN_INIT=5.25 GATE_ATTN_WIDTH=24 \
GPTQ_RESERVE_SECONDS=0.5 VAL_LOSS_EVERY=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Attribution

Built entirely on PR #1874 (AjAnubolu), which itself builds on PR #1790 (miaoyuxun), PR #1344 (Polar Express), PR #1787 (nprime06), PR #1797 (dexhunter).
## Submission record (JSON)
{
"author": "bigbag",
"val_bpb": 1.06844,
"val_bpb_std": 0.00058,
"bytes_total_max": 15952215,
"seed_results": [
{
"seed": 1337,
"val_bpb": 1.06798,
"val_loss": 2.75871500,
"bytes_total": 15950405,
"train_time_ms": 599643,
"eval_time_ms": 409616
},
{
"seed": 42,
"val_bpb": 1.06824,
"val_loss": 2.75937977,
"bytes_total": 15952215,
"train_time_ms": 599651,
"eval_time_ms": 420960
},
{
"seed": 2025,
"val_bpb": 1.06909,
"val_loss": 2.76156888,
"bytes_total": 15948755,
"train_time_ms": 599564,
"eval_time_ms": 381181
}
],
"score_first_ttt": true,
"no_pre_quant_ttt": true,
"no_ngram_cache": true,
"env_vars": "MIN_LR=0.10 QK_GAIN_INIT=5.25 GATE_ATTN_WIDTH=24 GPTQ_RESERVE_SECONDS=0.5 VAL_LOSS_EVERY=0",
"techniques": [
"PR #1874 verbatim code (SP8192 non-CaseOps)",
"LQER asymmetric rank-4 quantization correction",
"SmearGate + Attention Output Gate (width 24)",
"Polar Express Newton-Schulz coefficients",
"MIN_LR=0.10 warmdown floor",
"QK_GAIN_INIT=5.25 attention scaling",
"GATE_ATTN_WIDTH=24 full-dim gating",
"GPTQ_RESERVE_SECONDS=0.5 (maximize training steps)",
"VAL_LOSS_EVERY=0 (no mid-train eval overhead)",
"Phased TTT (3-phase score-first AdamW, LoRA rank 128)",
"11L x 512D x 8H/4KV, parallel residuals, depth recurrence",
"GPTQ int6/int7 + Brotli-11"
]
}