diff --git a/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/README.md b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/README.md new file mode 100644 index 0000000000..d2d210562d --- /dev/null +++ b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/README.md @@ -0,0 +1,41 @@ +# SP8192 proxy-stack (5 shards, 1×H100 screening) — Non-record submission + +This is a small, reproducible **screening** submission showing that simply switching the Track-P proxy stack from **SP4096 → SP8192** improves validation BPB under a fixed 10-minute wallclock budget on **1×H100**, trained on **5 FineWeb training shards**. + +- **Track**: non-record (screening / grant experiments) +- **Base trainer**: `records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/train_gpt.py` (env-driven) +- **Change vs base**: `VOCAB_SIZE=8192` (and matching cached dataset variant) +- **Budget**: `MAX_WALLCLOCK_SECONDS=600` (with GPTQ reserving ~10s) +- **Train shards**: 5 + +## Results (3-seed) + +Metric notes: +- **Pre-quantization post-EMA** isolates raw model quality before GPTQ export. +- **Sliding BPB** below uses `final_int6_sliding_window` from the trainer logs. + +| Seed | Steps @ cap | Pre-quant post-EMA `val_bpb` | `final_int6_sliding_window val_bpb` | Total submission size (int6+brotli) | +|------|-------------|------------------------------:|------------------------------------:|------------------------------------:| +| 0 | 1033 | 1.247118 | 1.247542 | 14,108,633 | +| 42 | 1031 | 1.245035 | 1.245657 | 14,101,451 | +| 1337 | 1035 | 1.245422 | 1.246063 | 14,124,187 | +| **Mean** | | **1.245858** | **1.246421** | **14,111,424** | + +## How to run + +You must have the cached dataset + tokenizer for SP8192 available under `data/` (see repo root `README.md`), and FlashAttention 3 installed (same requirement as the base trainer). 
+ +Example: + +```bash +cd records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100 +SEED=1337 RUN_ID=screen_sp8192_1337 MAX_WALLCLOCK_SECONDS=600 VOCAB_SIZE=8192 \ + torchrun --standalone --nproc_per_node=1 train_gpt.py +``` + +## Files + +- `train_gpt.py`: thin launcher that executes the base trainer with env vars. +- `train_seed*.log`: end-of-run excerpts for the three seeds used above. +- `submission.json`: metadata for this non-record screening submission. + diff --git a/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/requirements.txt b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/requirements.txt new file mode 100644 index 0000000000..850325d183 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/requirements.txt @@ -0,0 +1,7 @@ +# torch and flash-attn-3 are installed separately in setup.sh +brotli +huggingface-hub +numpy +sentencepiece +tqdm + diff --git a/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/submission.json b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/submission.json new file mode 100644 index 0000000000..f2bb428733 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/submission.json @@ -0,0 +1,11 @@ +{ + "track": "non_record_16mb", + "date": "2026-04-21", + "name": "SP8192 proxy-stack (5 shards, 1×H100 screening)", + "author": "Gautam Naik", + "github_id": "gautamnaik", + "val_bpb": 1.2464, + "val_loss": 3.2196, + "bytes_total": 14111424 +} + diff --git a/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_gpt.py b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_gpt.py new file mode 100644 index 0000000000..3ff35ef89c --- /dev/null +++ b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_gpt.py @@ -0,0 +1,27 @@ +import runpy +from pathlib import Path 
+ + +def main() -> None: + """ + Thin launcher for the Track-P proxy trainer. + + The actual trainer is kept in its original record folder so that this + submission remains a minimal, reproducible screen of the SP8192 variant. + """ + + repo_root = Path(__file__).resolve().parents[3] + base_trainer = ( + repo_root + / "records" + / "track_10min_16mb" + / "2026-04-01_Vocab4096_MLPMult4_WD085" + / "train_gpt.py" + ) + + runpy.run_path(str(base_trainer), run_name="__main__") + + +if __name__ == "__main__": + main() + diff --git a/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_seed0.log b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_seed0.log new file mode 100644 index 0000000000..1bcaf1825b --- /dev/null +++ b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_seed0.log @@ -0,0 +1,36 @@ +==================================================================================================== +train_shards: 5 +val_tokens: 40540160 +model_params:37022811 +gptq:reserving 10s, effective=590000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +0/20000 val_loss: 9.0077 val_bpb: 3.4872 +1/20000 train_loss: 9.0091 train_time: 0.0m tok/s: 920139 +2/20000 train_loss: 12.3612 train_time: 0.0m tok/s: 887348 +3/20000 train_loss: 11.3208 train_time: 0.0m tok/s: 886729 +4/20000 train_loss: 9.7618 train_time: 0.0m tok/s: 890945 +5/20000 train_loss: 8.5732 train_time: 0.0m tok/s: 881862 +500/20000 train_loss: 3.3830 train_time: 4.8m tok/s: 915579 +1000/20000 train_loss: 3.0805 train_time: 9.5m tok/s: 917274 +1033/20000 val_loss: 3.1159 val_bpb: 1.2062 +stopping_early: wallclock_cap train_time: 590437ms step: 1033/20000 +peak memory allocated: 18138 MiB reserved: 18162 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:3.22143551 val_bpb:1.24711808 eval_time:13560ms 
+Serialized model: 137648707 bytes +Code size: 68206 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 66 Hessians in 6.7s +GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search +Serialized model int6+brotli: 14040427 bytes +Total submission size int6+brotli: 14108633 bytes +final_int6_roundtrip val_loss:3.26372721 val_bpb:1.26349051 eval_time:28224ms +final_int6_sliding_window val_loss:3.22253024 val_bpb:1.24754188 eval_time:1557761ms + diff --git a/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_seed1337.log b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_seed1337.log new file mode 100644 index 0000000000..03229e1ebe --- /dev/null +++ b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_seed1337.log @@ -0,0 +1,36 @@ +==================================================================================================== +train_shards: 5 +val_tokens: 40540160 +model_params:37022811 +gptq:reserving 10s, effective=590000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +0/20000 val_loss: 9.0095 val_bpb: 3.4879 +1/20000 train_loss: 9.0107 train_time: 0.0m tok/s: 922987 +2/20000 train_loss: 12.3211 train_time: 0.0m tok/s: 915076 +3/20000 train_loss: 11.3036 train_time: 0.0m tok/s: 908496 +4/20000 train_loss: 9.7746 train_time: 0.0m tok/s: 907798 +5/20000 train_loss: 8.5988 train_time: 0.0m tok/s: 905012 +500/20000 train_loss: 3.3773 train_time: 4.8m tok/s: 917401 +1000/20000 train_loss: 3.0757 train_time: 9.5m tok/s: 918889 +1035/20000 val_loss: 3.1104 val_bpb: 1.2041 +stopping_early: wallclock_cap train_time: 590545ms step: 1035/20000 +peak memory allocated: 18138 MiB reserved: 18162 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:3.21705311 val_bpb:1.24542151 eval_time:13548ms +Serialized model: 137648707 bytes 
+Code size: 68206 bytes +GPTQ:collecting Hessians from calibration data... +GPTQ:collected 66 Hessians in 6.7s +GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search +Serialized model int6+brotli: 14055981 bytes +Total submission size int6+brotli: 14124187 bytes +final_int6_roundtrip val_loss:3.25959265 val_bpb:1.26188989 eval_time:17556ms +final_int6_sliding_window val_loss:3.21870914 val_bpb:1.24606261 eval_time:548318ms + diff --git a/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_seed42.log b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_seed42.log new file mode 100644 index 0000000000..647ddaae83 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100/train_seed42.log @@ -0,0 +1,36 @@ +==================================================================================================== +train_shards: 5 +val_tokens: 40540160 +model_params:37022811 +gptq:reserving 10s, effective=590000ms +warmup_step: 1/20 +warmup_step: 2/20 +warmup_step: 3/20 +warmup_step: 4/20 +warmup_step: 5/20 +warmup_step: 6/20 +warmup_step: 10/20 +warmup_step: 20/20 +0/20000 val_loss: 9.0081 val_bpb: 3.4873 +1/20000 train_loss: 9.0096 train_time: 0.0m tok/s: 919196 +2/20000 train_loss: 12.3809 train_time: 0.0m tok/s: 757742 +3/20000 train_loss: 11.3150 train_time: 0.0m tok/s: 794518 +4/20000 train_loss: 9.7588 train_time: 0.0m tok/s: 816594 +5/20000 train_loss: 8.6266 train_time: 0.1m tok/s: 827810 +500/20000 train_loss: 3.3751 train_time: 4.8m tok/s: 913731 +1000/20000 train_loss: 3.0715 train_time: 9.5m tok/s: 915409 +1031/20000 val_loss: 3.1077 val_bpb: 1.2031 +stopping_early: wallclock_cap train_time: 590461ms step: 1031/20000 +peak memory allocated: 18138 MiB reserved: 18212 MiB +ema:applying EMA weights +pre-quantization post-ema val_loss:3.21605510 val_bpb:1.24503515 eval_time:13576ms +Serialized model: 137648707 bytes +Code size: 68206 bytes +GPTQ:collecting 
Hessians from calibration data... +GPTQ:collected 66 Hessians in 6.9s +GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search +Serialized model int6+brotli: 14033245 bytes +Total submission size int6+brotli: 14101451 bytes +final_int6_roundtrip val_loss:3.25903846 val_bpb:1.26167535 eval_time:17535ms +final_int6_sliding_window val_loss:3.21766207 val_bpb:1.24565726 eval_time:547785ms +
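
As a sanity check, the 3-seed means reported in the README table and in `submission.json` can be recomputed from the per-seed log lines above. A minimal sketch (values hard-coded from the `pre-quantization post-ema`, `final_int6_sliding_window`, and `Total submission size` lines of `train_seed{0,42,1337}.log`):

```python
# Recompute the 3-seed means reported in the README table and submission.json.
# Per-seed values are copied verbatim from train_seed{0,42,1337}.log above.
prequant_bpb = [1.24711808, 1.24503515, 1.24542151]  # pre-quantization post-EMA val_bpb
sliding_bpb  = [1.24754188, 1.24565726, 1.24606261]  # final_int6_sliding_window val_bpb
sliding_loss = [3.22253024, 3.21766207, 3.21870914]  # final_int6_sliding_window val_loss
sizes        = [14_108_633, 14_101_451, 14_124_187]  # total submission size, int6+brotli

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(prequant_bpb), 6))  # 1.245858  (README mean, pre-quant column)
print(round(mean(sliding_bpb), 6))   # 1.246421  (README mean, sliding column)
print(round(mean(sizes)))            # 14111424  (README mean submission size)
print(round(mean(sliding_bpb), 4))   # 1.2464    (submission.json "val_bpb")
print(round(mean(sliding_loss), 4))  # 3.2196    (submission.json "val_loss")
```

All five recomputed values match the README's **Mean** row and the `val_bpb`/`val_loss`/`bytes_total` fields in `submission.json` (the mean size is rounded to the nearest byte).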