[Non-record] Long-Train Artifact Scaling: post-TTT BPB = 1.0399, artifact size constant across 10–60 min #1979
Open
Christopher-Lee-McClendon wants to merge 2 commits into openai:main from
Conversation
Reproduces PR openai#1934's exact recipe (per-group lrzip compression, EMBED_WD=0.06, tightened clip sigmas) with GPTQ_RESERVE_SECONDS=5.5 to ensure GPTQ Hessians complete within the 600s training budget.

Results (3-seed mean: 1.06003, std: 0.000385):
- Seed 42: 1.05987 (4962 steps, artifact 15,971,933 B)
- Seed 314: 1.05975 (4952 steps, artifact 15,970,997 B)
- Seed 999: 1.06047 (4954 steps, artifact 15,974,305 B)

Compliance: train_loop + hessians = 598.2s max (< 600s)
Delta vs PR openai#1934: +0.00010 BPB (negligible, within noise)

Co-authored-by: Copilot <[email protected]>
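As a quick sanity check, the quoted mean and std follow from the three per-seed BPBs, assuming the std is a sample standard deviation (ddof=1); a minimal sketch:

```python
import numpy as np

# Per-seed final BPB values quoted above (seeds 42, 314, 999).
bpb = np.array([1.05987, 1.05975, 1.06047])

print(f"mean: {bpb.mean():.5f}")        # 1.06003
print(f"std:  {bpb.std(ddof=1):.6f}")   # ~0.000386, the quoted 0.000385 up to last-digit rounding
```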
…-TTT BPB 1.0399

Studies artifact size as a function of training duration for the PR openai#1950 (compliance-audited PR openai#1934 reproduction) recipe on 8xH100 SXM.

Key findings:
- Artifact size is constant (±9 KB / 0.06%) across 10–60 min training
- INT6 GPTQ + per-group lrzip is already at entropy floor by 10 min
- BPB improves substantially: 1.06 (10 min) → 1.04 (60 min) post-TTT
- Quantization tax (~0.01) and TTT gain (~0.01) stable across durations
- No justification for larger model under same 16 MB cap

Bug fix included:
- NCCL rank desync during checkpoint export (broadcast sync + barriers)

Non-record: training wallclock 3598s >> 600s budget. No ML changes from PR openai#1950.

Co-authored-by: Copilot <[email protected]>
Christopher-Lee-McClendon added a commit to Christopher-Lee-McClendon/parameter-golf that referenced this pull request on May 1, 2026
…etrics

- Fix metric stage labeling (training_val vs quantized vs post_ttt)
- Add TTT sweep results (3 successful, 4 failed variants)
- v0 control (PR openai#1979 params) best at 1.03471 BPB
- Fix --sweep-only-artifact mode for dedicated sweep pods
- Bundle TTT sweep script via --extra-file in launcher
- Add checkpoint_360min.json and sweep CSV/JSON results
- Soften claims per red-team review (optimal → best among tested)

Co-authored-by: Copilot <[email protected]>
Christopher-Lee-McClendon added a commit to Christopher-Lee-McClendon/parameter-golf that referenced this pull request on May 2, 2026
- Replace '~30K final stop/export' with exact step 29,888 (wallclock stop)
- Replace '~1.0600 live' with exact 1.0600
- Fix memory diff: 4.2 GB (matching displayed 47.8−43.6) not 4.1 GiB
- Add missing 300-min row to training scaling summary table
- Clarify that PR openai#1979 60-min (8xH100) differs from standalone 4h 60-min checkpoint
- Add hardware annotations (8xH100 SXM / 4xH100 NVL) to source column

Co-authored-by: Copilot <[email protected]>
[Non-record] Long-Train Artifact Scaling: Final post-TTT BPB = 1.0399, no compressibility gain
Summary
Non-record experiment studying whether longer training makes the PR #1950 recipe (INT6 GPTQ + per-group lrzip) produce smaller compressed artifacts that could free byte budget for a larger record-track model.
Result: Artifact size is essentially constant (±9 KB = 0.06%) across 10–60 minutes of training. BPB improves substantially (post-TTT: 1.06 → 1.04) but the compression pipeline has already reached its entropy floor by 10 minutes.
No ML changes from PR #1950 — identical architecture, hyperparameters, tokenizer, and scoring pipeline. Only the wallclock cap was extended.
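The "entropy floor" claim can be illustrated in isolation: once packed low-bit weights look like near-uniform bytes to a general-purpose compressor, further training cannot shrink the artifact. A minimal sketch, using Python's lzma as a stand-in for lrzip and a simplified per-group INT6 quantizer; the names, packing, and group size below are illustrative assumptions, not the PR's pipeline:

```python
import lzma
import numpy as np

def fake_int6_blob(n_params: int = 8192 * 128, group: int = 128, seed: int = 0) -> bytes:
    """Per-group symmetric 6-bit quantization of Gaussian weights.

    Simplified stand-in for INT6 GPTQ output, used only to probe
    compressibility; one int6 is stored per byte (no bit-packing).
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n_params).astype(np.float32).reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0  # symmetric 6-bit range [-31, 31]
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q.tobytes()

blob = fake_int6_blob()
packed = lzma.compress(blob, preset=9)  # lzma standing in for lrzip
print(f"raw {len(blob):,} B -> compressed {len(packed):,} B ({len(packed) / len(blob):.1%})")
```

If the quantized weights are already near maximum entropy at the 6-bit level, the compressed size barely moves as the underlying weights change, which is the behavior observed here across the 10–60 min checkpoints.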
Motivation
If a model trained for 60 minutes compresses to a significantly smaller artifact than one trained for 10 minutes, the freed bytes could fund a larger model variant (more layers, wider dim, or higher LQER rank) on the record track. This experiment tests that hypothesis.
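The byte arithmetic behind this hypothesis is small enough to state explicitly; a sketch using the ~15.97 MB artifact from the seed-42 run and the pre-registered 300 KB threshold from the Conclusion (reading the 16 MB cap as 16 MiB is an assumption here):

```python
CAP_BYTES = 16 * 2**20         # assuming the cap means 16 MiB = 16,777,216 B
ARTIFACT_BYTES = 15_971_933    # seed-42 artifact from the PR #1950 reproduction
SHRINK_THRESHOLD = 300 * 1024  # pre-registered bar for recommending a larger model

headroom = CAP_BYTES - ARTIFACT_BYTES
print(f"headroom under cap: {headroom:,} B")  # ~805 KB under the MiB reading

observed_shrink = 9 * 1024                    # ~9 KB spread observed across 10-60 min
print(f"fund larger model: {observed_shrink >= SHRINK_THRESHOLD}")  # False
```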
Results
Artifact Size vs Training Duration
¹ Training stopped at 3598.25s due to GPTQ_RESERVE_SECONDS=5.5; the in-loop 60-min trigger never fired. The final row is the standard end-of-training serialize (identical path to PR #1950).
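For context on the footnote: a reserve like GPTQ_RESERVE_SECONDS is typically honored by stopping the step loop once the remaining budget no longer covers one more step plus the reserved tail. A minimal sketch with hypothetical step_fn/finalize_gptq_fn callables; only the budget/reserve idea comes from the PR, the exact stopping rule is an assumption:

```python
import time

GPTQ_RESERVE_S = 5.5  # tail reserved for GPTQ Hessian completion

def train_with_reserve(step_fn, finalize_gptq_fn, budget_s: float = 600.0):
    """Stop stepping early enough that GPTQ work still fits in the budget.

    budget_s is 600.0 on the record track; the long-train runs here used
    ~3600s. step_fn and finalize_gptq_fn are hypothetical callables.
    """
    start = time.monotonic()
    worst_step = 0.0  # pessimistic estimate of one more step's cost
    while time.monotonic() - start + worst_step + GPTQ_RESERVE_S < budget_s:
        t0 = time.monotonic()
        step_fn()
        worst_step = max(worst_step, time.monotonic() - t0)
    finalize_gptq_fn()  # runs inside the reserved tail
```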
Final Model Quality (post-stop at 3598s)
Derived Metrics (final model only)
Conclusion
The INT6 GPTQ + per-group lrzip compression reaches its entropy floor within the first 10 minutes of training. Longer training improves model quality (BPB) but does not reduce the compressed artifact size enough to justify a larger model variant. The −9 KB shrink is far below our pre-registered 300 KB threshold for recommending architectural changes.
Method
Training uses the exact PR #1950 recipe with one addition: a NON_RECORD_LONGTRAIN=1 mode, which extends the wallclock cap (the only change relative to PR #1950). All checkpoint exports use proper distributed synchronization (dist.broadcast + barriers) to prevent NCCL rank desync during the 130s serialize pause, as sketched below.
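A minimal sketch of that synchronization pattern with torch.distributed; the export helper and the rank-0 serialize split are assumptions about this PR's code, not its actual implementation:

```python
import torch
import torch.distributed as dist

def synced_checkpoint_export(model: torch.nn.Module, path: str, device: torch.device):
    """Export on rank 0 while keeping all ranks in lockstep.

    Without the barrier/broadcast pair, non-zero ranks can race ahead
    during the long serialize pause and desync subsequent NCCL collectives.
    """
    dist.barrier()                            # all ranks reach the export point
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), path)  # the ~130s serialize pause
    done = torch.ones(1, device=device)
    dist.broadcast(done, src=0)               # rank 0 releases the other ranks
    dist.barrier()                            # no rank re-enters the loop early
```

Hardware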
Compliance
Related PRs
Files