@@ -0,0 +1,41 @@
# SP8192 proxy-stack (5 shards, 1×H100 screening) — Non-record submission

This is a small, reproducible **screening** submission showing that simply switching the Track-P proxy stack from **SP4096** to **SP8192** improves validation BPB under a fixed 10-minute wallclock budget on **1×H100**, trained on **5 FineWeb training shards**.

- **Track**: non-record (screening / grant experiments)
- **Base trainer**: `records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/train_gpt.py` (env-driven)
- **Change vs base**: `VOCAB_SIZE=8192` (and the matching cached dataset variant); the env-override pattern is sketched after this list
- **Budget**: `MAX_WALLCLOCK_SECONDS=600` (with GPTQ reserving ~10s)
- **Train shards**: 5
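
The base trainer reads its configuration from environment variables, so this submission needs no code changes beyond the thin launcher. A minimal sketch of the pattern, using the variable names from the run command below; the defaults shown are assumptions, not the trainer's actual values:

```python
# Sketch of the env-driven config pattern; variable names come from this
# README's run command, while the default values here are assumptions.
import os

SEED = int(os.environ.get("SEED", "0"))
RUN_ID = os.environ.get("RUN_ID", "run")
VOCAB_SIZE = int(os.environ.get("VOCAB_SIZE", "4096"))  # set to 8192 here
MAX_WALLCLOCK_SECONDS = int(os.environ.get("MAX_WALLCLOCK_SECONDS", "600"))
```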

## Results (3-seed)

Metric notes:
- **Pre-quantization post-EMA** isolates raw model quality before GPTQ export.
- **Sliding BPB** below uses `final_int6_sliding_window` from the trainer logs; a numeric sanity check follows the table.

| Seed | Steps @ cap | Pre-quant post-EMA `val_bpb` | `final_int6_sliding_window val_bpb` | Total submission size (int6+brotli) |
|------|-------------|------------------------------:|------------------------------------:|------------------------------------:|
| 0 | 1033 | 1.247118 | 1.247542 | 14,108,633 |
| 42 | 1031 | 1.245035 | 1.245657 | 14,101,451 |
| 1337 | 1035 | 1.245422 | 1.246063 | 14,124,187 |
| **Mean** | | **1.245858** | **1.246421** | **14,111,424** |
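
As a quick sanity check: `val_bpb` in the logs is `val_loss` (nats) divided by ln 2 times roughly 3.727 bytes per token, a ratio inferred from the logged loss/bpb pairs rather than a documented constant of the eval set, and the mean row recomputes directly from the per-seed values:

```python
import math

# Bytes-per-token implied by the logs (seed 0: 3.2214 nats -> 1.2471 bpb).
# This ratio is inferred from the logged pairs, not a documented constant.
print(3.22143551 / (math.log(2) * 1.24711808))  # ~3.7266 bytes/token

# Recompute the mean row of the results table from the per-seed values.
pre_quant = [1.247118, 1.245035, 1.245422]
sliding = [1.247542, 1.245657, 1.246063]
sizes = [14_108_633, 14_101_451, 14_124_187]
print(sum(pre_quant) / 3)     # 1.245858...
print(sum(sliding) / 3)       # 1.246420... -> 1.2464 in submission.json
print(round(sum(sizes) / 3))  # 14111424
```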

## How to run

You need the cached SP8192 dataset and tokenizer under `data/` (see the repo root `README.md`) and FlashAttention 3 installed, the same requirements as the base trainer.

Example:

```bash
cd records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100
SEED=1337 RUN_ID=screen_sp8192_1337 MAX_WALLCLOCK_SECONDS=600 VOCAB_SIZE=8192 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```
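
The dataset and tokenizer preparation itself is documented at the repo root. Purely as an illustration of what an 8192-vocab SentencePiece tokenizer involves; the paths and training options below are hypothetical, not the repo's actual prep script:

```python
# Hypothetical illustration only: the repo root README documents the real
# preparation steps. Paths and training options here are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/fineweb_train_sample.txt",  # hypothetical raw-text dump
    model_prefix="data/sp8192",             # writes sp8192.model / sp8192.vocab
    vocab_size=8192,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="data/sp8192.model")
print(sp.encode("hello world", out_type=int))
```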

## Files

- `train_gpt.py`: thin launcher that executes the base trainer with env vars.
- `train_seed*.log`: end-of-run excerpts for the three seeds used above.
- `submission.json`: metadata for this non-record screening submission.
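
For intuition about the int6+brotli size column, the sketch below measures the compressed size of a naively 6-bit-quantized tensor. This is illustrative only: the actual exporter uses per-layer GPTQ with a clip-search fallback and bit-packs the values, as the logs show.

```python
# Illustration only: size of a naively 6-bit-quantized tensor after brotli.
# The real pipeline quantizes per layer with GPTQ and packs values densely.
import brotli
import numpy as np

def int6_brotli_bytes(w: np.ndarray) -> int:
    scale = np.abs(w).max() / 31.0  # map onto a symmetric 6-bit grid
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    # One value per byte for simplicity; real int6 packing is denser.
    return len(brotli.compress(q.tobytes(), quality=11))

print(int6_brotli_bytes(np.random.randn(1024, 1024).astype(np.float32)))
```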

@@ -0,0 +1,7 @@
# torch and flash-attn-3 are installed separately in setup.sh
brotli
huggingface-hub
numpy
sentencepiece
tqdm

@@ -0,0 +1,11 @@
{
  "track": "non_record_16mb",
  "date": "2026-04-21",
  "name": "SP8192 proxy-stack (5 shards, 1×H100 screening)",
  "author": "Gautam Naik",
  "github_id": "gautamnaik",
  "val_bpb": 1.2464,
  "val_loss": 3.2196,
  "bytes_total": 14111424
}

@@ -0,0 +1,27 @@
import runpy
from pathlib import Path


def main() -> None:
    """
    Thin launcher for the Track-P proxy trainer.

    The actual trainer is kept in its original record folder so that this
    submission remains a minimal, reproducible screen of the SP8192 variant.
    """

    repo_root = Path(__file__).resolve().parents[3]
    base_trainer = (
        repo_root
        / "records"
        / "track_10min_16mb"
        / "2026-04-01_Vocab4096_MLPMult4_WD085"
        / "train_gpt.py"
    )

    runpy.run_path(str(base_trainer), run_name="__main__")


if __name__ == "__main__":
    main()

@@ -0,0 +1,36 @@
====================================================================================================
train_shards: 5
val_tokens: 40540160
model_params:37022811
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 9.0077 val_bpb: 3.4872
1/20000 train_loss: 9.0091 train_time: 0.0m tok/s: 920139
2/20000 train_loss: 12.3612 train_time: 0.0m tok/s: 887348
3/20000 train_loss: 11.3208 train_time: 0.0m tok/s: 886729
4/20000 train_loss: 9.7618 train_time: 0.0m tok/s: 890945
5/20000 train_loss: 8.5732 train_time: 0.0m tok/s: 881862
500/20000 train_loss: 3.3830 train_time: 4.8m tok/s: 915579
1000/20000 train_loss: 3.0805 train_time: 9.5m tok/s: 917274
1033/20000 val_loss: 3.1159 val_bpb: 1.2062
stopping_early: wallclock_cap train_time: 590437ms step: 1033/20000
peak memory allocated: 18138 MiB reserved: 18162 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:3.22143551 val_bpb:1.24711808 eval_time:13560ms
Serialized model: 137648707 bytes
Code size: 68206 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 6.7s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
Serialized model int6+brotli: 14040427 bytes
Total submission size int6+brotli: 14108633 bytes
final_int6_roundtrip val_loss:3.26372721 val_bpb:1.26349051 eval_time:28224ms
final_int6_sliding_window val_loss:3.22253024 val_bpb:1.24754188 eval_time:1557761ms

@@ -0,0 +1,36 @@
====================================================================================================
train_shards: 5
val_tokens: 40540160
model_params:37022811
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 9.0095 val_bpb: 3.4879
1/20000 train_loss: 9.0107 train_time: 0.0m tok/s: 922987
2/20000 train_loss: 12.3211 train_time: 0.0m tok/s: 915076
3/20000 train_loss: 11.3036 train_time: 0.0m tok/s: 908496
4/20000 train_loss: 9.7746 train_time: 0.0m tok/s: 907798
5/20000 train_loss: 8.5988 train_time: 0.0m tok/s: 905012
500/20000 train_loss: 3.3773 train_time: 4.8m tok/s: 917401
1000/20000 train_loss: 3.0757 train_time: 9.5m tok/s: 918889
1035/20000 val_loss: 3.1104 val_bpb: 1.2041
stopping_early: wallclock_cap train_time: 590545ms step: 1035/20000
peak memory allocated: 18138 MiB reserved: 18162 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:3.21705311 val_bpb:1.24542151 eval_time:13548ms
Serialized model: 137648707 bytes
Code size: 68206 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 6.7s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
Serialized model int6+brotli: 14055981 bytes
Total submission size int6+brotli: 14124187 bytes
final_int6_roundtrip val_loss:3.25959265 val_bpb:1.26188989 eval_time:17556ms
final_int6_sliding_window val_loss:3.21870914 val_bpb:1.24606261 eval_time:548318ms

@@ -0,0 +1,36 @@
====================================================================================================
train_shards: 5
val_tokens: 40540160
model_params:37022811
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 9.0081 val_bpb: 3.4873
1/20000 train_loss: 9.0096 train_time: 0.0m tok/s: 919196
2/20000 train_loss: 12.3809 train_time: 0.0m tok/s: 757742
3/20000 train_loss: 11.3150 train_time: 0.0m tok/s: 794518
4/20000 train_loss: 9.7588 train_time: 0.0m tok/s: 816594
5/20000 train_loss: 8.6266 train_time: 0.1m tok/s: 827810
500/20000 train_loss: 3.3751 train_time: 4.8m tok/s: 913731
1000/20000 train_loss: 3.0715 train_time: 9.5m tok/s: 915409
1031/20000 val_loss: 3.1077 val_bpb: 1.2031
stopping_early: wallclock_cap train_time: 590461ms step: 1031/20000
peak memory allocated: 18138 MiB reserved: 18212 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:3.21605510 val_bpb:1.24503515 eval_time:13576ms
Serialized model: 137648707 bytes
Code size: 68206 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 6.9s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
Serialized model int6+brotli: 14033245 bytes
Total submission size int6+brotli: 14101451 bytes
final_int6_roundtrip val_loss:3.25903846 val_bpb:1.26167535 eval_time:17535ms
final_int6_sliding_window val_loss:3.21766207 val_bpb:1.24565726 eval_time:547785ms