
[Non-record] Long-Train Artifact Scaling: post-TTT BPB = 1.0399, artifact size constant across 10–60 min #1979

Open

Christopher-Lee-McClendon wants to merge 2 commits into openai:main from Christopher-Lee-McClendon:submission/non-record-longtrain-scaling

Conversation

@Christopher-Lee-McClendon
Contributor

[Non-record] Long-Train Artifact Scaling: Final post-TTT BPB = 1.0399, no compressibility gain

Summary

Non-record experiment studying whether longer training makes the PR #1950 recipe (INT6 GPTQ + per-group lrzip) produce smaller compressed artifacts that could free byte budget for a larger record-track model.

Result: Artifact size is essentially constant (±9 KB, ≈0.06%) across 10–60 minutes of training. BPB improves substantially (post-TTT: 1.06 → 1.04), but the compression pipeline has already reached its entropy floor by 10 minutes.

No ML changes from PR #1950 — identical architecture, hyperparameters, tokenizer, and scoring pipeline. Only the wallclock cap was extended.

Motivation

If a model trained for 60 minutes compresses to a significantly smaller artifact than one trained for 10 minutes, the freed bytes could fund a larger model variant (more layers, wider dim, or higher LQER rank) on the record track. This experiment tests that hypothesis.

Results

Artifact Size vs Training Duration

| Minutes | Steps | Artifact (bytes) | Δ vs 10 min | Notes |
|---|---|---|---|---|
| 10 | 6,348 | 15,953,292 | baseline | In-loop export |
| 20 | 7,193 | 15,952,677 | −615 | In-loop export |
| 30 | 7,899 | 15,956,638 | +3,346 | In-loop export |
| 45 | 12,135 | 15,955,847 | +2,555 | In-loop export |
| final (~60 min cap) | 16,001 | 15,944,203 | −9,089 | Post-stop export at 3598 s¹ |

¹ Training stopped at 3598.25s due to GPTQ_RESERVE_SECONDS=5.5. The in-loop 60-min trigger
never fired. The final row is the standard end-of-training serialize (identical path to PR #1950).
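For context, a minimal sketch of how a reserve-seconds stop like this can work. The names `WALLCLOCK_CAP_S` and `should_stop`, and the way the reserve is applied, are assumptions for illustration, not the exact train_gpt.py logic:

```python
import os
import time

# Illustrative only: stop the training loop a few seconds before the wallclock
# cap so the GPTQ hessian pass and serialize can finish inside the budget.
WALLCLOCK_CAP_S = 3600.0  # assumed 60-min non-record cap
GPTQ_RESERVE_SECONDS = float(os.environ.get("GPTQ_RESERVE_SECONDS", "5.5"))

def should_stop(train_start: float) -> bool:
    """True once the remaining wallclock must be reserved for GPTQ/export."""
    elapsed = time.monotonic() - train_start
    return elapsed >= WALLCLOCK_CAP_S - GPTQ_RESERVE_SECONDS
```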

Final Model Quality (post-stop at 3598s)

| Eval | val_bpb |
|---|---|
| Pre-quantization (EMA) | 1.03969 |
| Post-INT6-GPTQ | 1.04944 |
| Post-TTT (phased score-first) | 1.03988 |

Derived Metrics (final model only)
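The derived quantities follow directly from the table above. A minimal sketch of the arithmetic (variable names are illustrative; the values are the ones reported in this PR):

```python
# Final-model val_bpb values from the table above.
pre_quant_bpb = 1.03969   # pre-quantization (EMA)
post_gptq_bpb = 1.04944   # post-INT6-GPTQ
post_ttt_bpb  = 1.03988   # post-TTT (phased score-first)

# Quantization tax: quality lost to INT6 GPTQ quantization.
quant_tax = post_gptq_bpb - pre_quant_bpb   # ≈ 0.00975 BPB

# TTT gain: quality recovered by test-time training on the quantized model.
ttt_gain = post_gptq_bpb - post_ttt_bpb     # ≈ 0.00956 BPB

print(f"quantization tax: {quant_tax:.5f} BPB")
print(f"TTT gain:         {ttt_gain:.5f} BPB")
```

Both values are ≈0.01 BPB, matching the "quantization tax" and "TTT gain" figures quoted in the commit message below.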

Conclusion

The INT6 GPTQ + per-group lrzip compression reaches its entropy floor within the first 10 minutes of training. Longer training improves model quality (BPB) but does not reduce the compressed artifact size enough to justify a larger model variant. The −9 KB shrink is far below our pre-registered 300 KB threshold for recommending architectural changes.

Method

Training uses the exact PR #1950 recipe with one addition: a NON_RECORD_LONGTRAIN=1 mode (sketched after this list) which:

  1. Extends training to 60 min instead of 10 min
  2. At milestones (10/20/30/45/60 min), pauses training and runs full GPTQ+lrzip serialize
  3. Records artifact size and step count per milestone
  4. After reaching the wallclock cap, runs final serialize + phased TTT eval
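A minimal sketch of the milestone logic, assuming it lives as a callback inside the training loop. The factory and callable names (`make_milestone_exporter`, `export_fn`, `record_fn`) are hypothetical, not the actual train_gpt.py API:

```python
import time

# Illustrative sketch of the NON_RECORD_LONGTRAIN milestone export logic.
MILESTONES_MIN = (10, 20, 30, 45, 60)

def make_milestone_exporter(export_fn, record_fn):
    """Return a callback that pauses training at each milestone, runs a full
    GPTQ+lrzip serialize via export_fn, and logs (minute, step, bytes) via record_fn."""
    start = time.monotonic()
    done = set()

    def maybe_export(step: int) -> None:
        elapsed_min = (time.monotonic() - start) / 60.0
        for m in MILESTONES_MIN:
            if m not in done and elapsed_min >= m:
                artifact_bytes = export_fn()   # full GPTQ + per-group lrzip export
                record_fn(minute=m, step=step, artifact_bytes=artifact_bytes)
                done.add(m)

    return maybe_export
```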

All checkpoint exports use proper distributed synchronization (dist.broadcast + barriers) to prevent NCCL rank desync during the 130s serialize pause.
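A minimal sketch of that synchronization pattern in torch.distributed (illustrative only; the actual export path and flag handling in train_gpt.py may differ):

```python
import torch
import torch.distributed as dist

def synchronized_export(model, rank: int, export_fn) -> None:
    """Keep all ranks in lockstep while rank 0 runs the slow GPTQ+lrzip export."""
    dist.barrier()                     # all ranks enter the export region together
    if rank == 0:
        export_fn(model)               # ~130 s serialize on rank 0 only
    # Broadcast a completion flag so non-zero ranks block until rank 0 finishes
    # instead of racing ahead into the next training step and desyncing NCCL.
    done = torch.tensor([1], device=torch.cuda.current_device())
    dist.broadcast(done, src=0)
    dist.barrier()                     # all ranks leave the export region together
```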

Hardware

  • 8×H100 80GB SXM (RunPod COMMUNITY, $21.52/hr)
  • Seed 42, ~101 min total runtime (incl. data download + checkpoints + eval)

Compliance

  • ⚠️ NOT record-track compliant — training time 3598s >> 600s budget
  • ✅ Eval scoring unchanged from PR #1950 ([Record] Compliant PR #1934 Reproduction, GPTQ_RESERVE=5.5, val_bpb 1.06003 3-seed)
  • ✅ No PPM-D, n-gram cache, or eval-time scoring changes
  • ✅ Score-first TTT (no validation tokens accessed before scoring)
  • ✅ No external network calls during train/eval
  • ✅ Artifact fits within 16 MB cap (15,944,203 < 16,000,000 bytes)
  • ✅ No ML changes from base recipe

Related PRs

Files

records/track_non_record_16mb/2026-04-30_PR1950_LongTrainArtifactScaling/
├── README.md                  # Full documentation
├── submission.json            # Experiment metadata
├── train_gpt.py               # Modified PR #1950 script (NON_RECORD_LONGTRAIN mode)
├── train.log                  # Rank-0 training log (seed 42, 8×H100)
├── pgolf_stdout.txt           # Combined stdout (launcher + training + eval)
├── notes/
│   └── IMPLEMENTATION_NOTES.md
├── scripts/
│   ├── run_longtrain_scaling.sh
│   ├── analyze_scaling.py
│   └── make_larger_variant_plan.py
└── results/
    ├── checkpoint_10min.json
    ├── checkpoint_20min.json
    ├── checkpoint_30min.json
    ├── checkpoint_45min.json
    ├── checkpoint_60min.json   # Manually created from final export metrics
    ├── scaling_results.csv
    └── experiment_summary.json

Reproduces PR openai#1934's exact recipe (per-group lrzip compression,
EMBED_WD=0.06, tightened clip sigmas) with GPTQ_RESERVE_SECONDS=5.5
to ensure GPTQ hessians complete within the 600s training budget.

Results (3-seed mean: 1.06003, std: 0.000385):
- Seed 42: 1.05987 (4962 steps, artifact 15,971,933 B)
- Seed 314: 1.05975 (4952 steps, artifact 15,970,997 B)
- Seed 999: 1.06047 (4954 steps, artifact 15,974,305 B)

Compliance: train_loop + hessians = 598.2s max (< 600s)
Delta vs PR openai#1934: +0.00010 BPB (negligible, within noise)

Co-authored-by: Copilot <[email protected]>
…-TTT BPB 1.0399

Studies artifact size as a function of training duration for the PR openai#1950
(compliance-audited PR openai#1934 reproduction) recipe on 8xH100 SXM.

Key findings:
- Artifact size is constant (±9 KB / 0.06%) across 10-60 min training
- INT6 GPTQ + per-group lrzip is already at entropy floor by 10 min
- BPB improves substantially: 1.06 (10 min) → 1.04 (60 min) post-TTT
- Quantization tax (~0.01) and TTT gain (~0.01) stable across durations
- No justification for larger model under same 16 MB cap

Bug fix included:
- NCCL rank desync during checkpoint export (broadcast sync + barriers)

Non-record: training wallclock 3598s >> 600s budget.
No ML changes from PR openai#1950.

Co-authored-by: Copilot <[email protected]>
Christopher-Lee-McClendon added a commit to Christopher-Lee-McClendon/parameter-golf that referenced this pull request May 1, 2026
…etrics

- Fix metric stage labeling (training_val vs quantized vs post_ttt)
- Add TTT sweep results (3 successful, 4 failed variants)
- v0 control (PR openai#1979 params) best at 1.03471 BPB
- Fix --sweep-only-artifact mode for dedicated sweep pods
- Bundle TTT sweep script via --extra-file in launcher
- Add checkpoint_360min.json and sweep CSV/JSON results
- Soften claims per red-team review (optimal → best among tested)

Co-authored-by: Copilot <[email protected]>
Christopher-Lee-McClendon added a commit to Christopher-Lee-McClendon/parameter-golf that referenced this pull request May 2, 2026
- Replace '~30K final stop/export' with exact step 29,888 (wallclock stop)
- Replace '~1.0600 live' with exact 1.0600
- Fix memory diff: 4.2 GB (matching displayed 47.8-43.6) not 4.1 GiB
- Add missing 300-min row to training scaling summary table
- Clarify that PR openai#1979 60-min (8xH100) differs from standalone 4h 60-min checkpoint
- Add hardware annotations (8xH100 SXM / 4xH100 NVL) to source column

Co-authored-by: Copilot <[email protected]>
