@@ -0,0 +1,41 @@
# SP8192 proxy-stack (5 shards, 1×H100 screening) — Non-record submission

This is a small, reproducible **screening** submission showing that simply switching the Track-P proxy stack from **SP4096** to **SP8192** improves validation BPB under a fixed 10-minute wallclock budget on **1×H100**, trained on **5 FineWeb training shards**.

- **Track**: non-record (screening / grant experiments)
- **Base trainer**: `records/track_10min_16mb/2026-04-01_Vocab4096_MLPMult4_WD085/train_gpt.py` (env-driven)
- **Change vs base**: `VOCAB_SIZE=8192` (and the matching cached dataset variant); the env-override pattern is sketched after this list
- **Budget**: `MAX_WALLCLOCK_SECONDS=600` (with GPTQ reserving ~10s)
- **Train shards**: 5
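
The base trainer reads its configuration from environment variables, so this submission needs no code changes beyond the thin launcher. A minimal sketch of the pattern, using the variable names from the run command below; the defaults shown are assumptions, not the trainer's actual values:

```python
# Sketch of the env-driven config pattern; variable names come from this
# README's run command, while the default values here are assumptions.
import os

SEED = int(os.environ.get("SEED", "0"))
RUN_ID = os.environ.get("RUN_ID", "run")
VOCAB_SIZE = int(os.environ.get("VOCAB_SIZE", "4096"))  # set to 8192 here
MAX_WALLCLOCK_SECONDS = int(os.environ.get("MAX_WALLCLOCK_SECONDS", "600"))
```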

## Results (3-seed)

Metric notes:
- **Pre-quantization post-EMA** isolates raw model quality before GPTQ export.
- **Sliding BPB** below uses `final_int6_sliding_window` from the trainer logs; a numeric sanity check follows the table.

| Seed | Steps @ cap | Pre-quant post-EMA `val_bpb` | `final_int6_sliding_window val_bpb` | Total submission size (int6+brotli) |
|------|-------------|------------------------------:|------------------------------------:|------------------------------------:|
| 0 | 1033 | 1.247118 | 1.247542 | 14,108,633 |
| 42 | 1031 | 1.245035 | 1.245657 | 14,101,451 |
| 1337 | 1035 | 1.245422 | 1.246063 | 14,124,187 |
| **Mean** | | **1.245858** | **1.246421** | **14,111,424** |
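
As a quick sanity check: `val_bpb` in the logs is `val_loss` (nats) divided by ln 2 times roughly 3.727 bytes per token, a ratio inferred from the logged loss/bpb pairs rather than a documented constant of the eval set, and the mean row recomputes directly from the per-seed values:

```python
import math

# Bytes-per-token implied by the logs (seed 0: 3.2214 nats -> 1.2471 bpb).
# This ratio is inferred from the logged pairs, not a documented constant.
print(3.22143551 / (math.log(2) * 1.24711808))  # ~3.7266 bytes/token

# Recompute the mean row of the results table from the per-seed values.
pre_quant = [1.247118, 1.245035, 1.245422]
sliding = [1.247542, 1.245657, 1.246063]
sizes = [14_108_633, 14_101_451, 14_124_187]
print(sum(pre_quant) / 3)     # 1.245858...
print(sum(sliding) / 3)       # 1.246420... -> 1.2464 in submission.json
print(round(sum(sizes) / 3))  # 14111424
```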

## How to run

You need the cached SP8192 dataset and tokenizer under `data/` (see the repo root `README.md`) and FlashAttention 3 installed, the same requirements as the base trainer.

Example:

```bash
cd records/track_non_record_16mb/2026-04-21_SP8192_ProxyStack_5Shards_1xH100
SEED=1337 RUN_ID=screen_sp8192_1337 MAX_WALLCLOCK_SECONDS=600 VOCAB_SIZE=8192 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```
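
The dataset and tokenizer preparation itself is documented at the repo root. Purely as an illustration of what an 8192-vocab SentencePiece tokenizer involves; the paths and training options below are hypothetical, not the repo's actual prep script:

```python
# Hypothetical illustration only: the repo root README documents the real
# preparation steps. Paths and training options here are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/fineweb_train_sample.txt",  # hypothetical raw-text dump
    model_prefix="data/sp8192",             # writes sp8192.model / sp8192.vocab
    vocab_size=8192,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="data/sp8192.model")
print(sp.encode("hello world", out_type=int))
```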

## Files

- `train_gpt.py`: thin launcher that executes the base trainer with env vars.
- `train_seed*.log`: end-of-run excerpts for the three seeds used above.
- `submission.json`: metadata for this non-record screening submission.
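
For intuition about the int6+brotli size column, the sketch below measures the compressed size of a naively 6-bit-quantized tensor. This is illustrative only: the actual exporter uses per-layer GPTQ with a clip-search fallback and bit-packs the values, as the logs show.

```python
# Illustration only: size of a naively 6-bit-quantized tensor after brotli.
# The real pipeline quantizes per layer with GPTQ and packs values densely.
import brotli
import numpy as np

def int6_brotli_bytes(w: np.ndarray) -> int:
    scale = np.abs(w).max() / 31.0  # map onto a symmetric 6-bit grid
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    # One value per byte for simplicity; real int6 packing is denser.
    return len(brotli.compress(q.tobytes(), quality=11))

print(int6_brotli_bytes(np.random.randn(1024, 1024).astype(np.float32)))
```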

@@ -0,0 +1,7 @@
# torch and flash-attn-3 are installed separately in setup.sh
brotli
huggingface-hub
numpy
sentencepiece
tqdm

@@ -0,0 +1,11 @@
{
  "track": "non_record_16mb",
  "date": "2026-04-21",
  "name": "SP8192 proxy-stack (5 shards, 1×H100 screening)",
  "author": "Gautam Naik",
  "github_id": "gautamnaik",
  "val_bpb": 1.2464,
  "val_loss": 3.2196,
  "bytes_total": 14111424
}

@@ -0,0 +1,27 @@
import runpy
from pathlib import Path


def main() -> None:
    """
    Thin launcher for the Track-P proxy trainer.

    The actual trainer is kept in its original record folder so that this
    submission remains a minimal, reproducible screen of the SP8192 variant.
    """

    repo_root = Path(__file__).resolve().parents[3]
    base_trainer = (
        repo_root
        / "records"
        / "track_10min_16mb"
        / "2026-04-01_Vocab4096_MLPMult4_WD085"
        / "train_gpt.py"
    )

    runpy.run_path(str(base_trainer), run_name="__main__")


if __name__ == "__main__":
    main()

@@ -0,0 +1,36 @@
====================================================================================================
train_shards: 5
val_tokens: 40540160
model_params:37022811
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 9.0077 val_bpb: 3.4872
1/20000 train_loss: 9.0091 train_time: 0.0m tok/s: 920139
2/20000 train_loss: 12.3612 train_time: 0.0m tok/s: 887348
3/20000 train_loss: 11.3208 train_time: 0.0m tok/s: 886729
4/20000 train_loss: 9.7618 train_time: 0.0m tok/s: 890945
5/20000 train_loss: 8.5732 train_time: 0.0m tok/s: 881862
500/20000 train_loss: 3.3830 train_time: 4.8m tok/s: 915579
1000/20000 train_loss: 3.0805 train_time: 9.5m tok/s: 917274
1033/20000 val_loss: 3.1159 val_bpb: 1.2062
stopping_early: wallclock_cap train_time: 590437ms step: 1033/20000
peak memory allocated: 18138 MiB reserved: 18162 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:3.22143551 val_bpb:1.24711808 eval_time:13560ms
Serialized model: 137648707 bytes
Code size: 68206 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 6.7s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
Serialized model int6+brotli: 14040427 bytes
Total submission size int6+brotli: 14108633 bytes
final_int6_roundtrip val_loss:3.26372721 val_bpb:1.26349051 eval_time:28224ms
final_int6_sliding_window val_loss:3.22253024 val_bpb:1.24754188 eval_time:1557761ms

@@ -0,0 +1,36 @@
====================================================================================================
train_shards: 5
val_tokens: 40540160
model_params:37022811
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 9.0095 val_bpb: 3.4879
1/20000 train_loss: 9.0107 train_time: 0.0m tok/s: 922987
2/20000 train_loss: 12.3211 train_time: 0.0m tok/s: 915076
3/20000 train_loss: 11.3036 train_time: 0.0m tok/s: 908496
4/20000 train_loss: 9.7746 train_time: 0.0m tok/s: 907798
5/20000 train_loss: 8.5988 train_time: 0.0m tok/s: 905012
500/20000 train_loss: 3.3773 train_time: 4.8m tok/s: 917401
1000/20000 train_loss: 3.0757 train_time: 9.5m tok/s: 918889
1035/20000 val_loss: 3.1104 val_bpb: 1.2041
stopping_early: wallclock_cap train_time: 590545ms step: 1035/20000
peak memory allocated: 18138 MiB reserved: 18162 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:3.21705311 val_bpb:1.24542151 eval_time:13548ms
Serialized model: 137648707 bytes
Code size: 68206 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 6.7s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
Serialized model int6+brotli: 14055981 bytes
Total submission size int6+brotli: 14124187 bytes
final_int6_roundtrip val_loss:3.25959265 val_bpb:1.26188989 eval_time:17556ms
final_int6_sliding_window val_loss:3.21870914 val_bpb:1.24606261 eval_time:548318ms

@@ -0,0 +1,36 @@
====================================================================================================
train_shards: 5
val_tokens: 40540160
model_params:37022811
gptq:reserving 10s, effective=590000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
0/20000 val_loss: 9.0081 val_bpb: 3.4873
1/20000 train_loss: 9.0096 train_time: 0.0m tok/s: 919196
2/20000 train_loss: 12.3809 train_time: 0.0m tok/s: 757742
3/20000 train_loss: 11.3150 train_time: 0.0m tok/s: 794518
4/20000 train_loss: 9.7588 train_time: 0.0m tok/s: 816594
5/20000 train_loss: 8.6266 train_time: 0.1m tok/s: 827810
500/20000 train_loss: 3.3751 train_time: 4.8m tok/s: 913731
1000/20000 train_loss: 3.0715 train_time: 9.5m tok/s: 915409
1031/20000 val_loss: 3.1077 val_bpb: 1.2031
stopping_early: wallclock_cap train_time: 590461ms step: 1031/20000
peak memory allocated: 18138 MiB reserved: 18212 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:3.21605510 val_bpb:1.24503515 eval_time:13576ms
Serialized model: 137648707 bytes
Code size: 68206 bytes
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 66 Hessians in 6.9s
GPTQ quantization: 66 layers with full GPTQ, 0 fallback to clip-search
Serialized model int6+brotli: 14033245 bytes
Total submission size int6+brotli: 14101451 bytes
final_int6_roundtrip val_loss:3.25903846 val_bpb:1.26167535 eval_time:17535ms
final_int6_sliding_window val_loss:3.21766207 val_bpb:1.24565726 eval_time:547785ms