
[Non-record] Long-Train Artifact Scaling: post-TTT BPB = 1.0399, artifact size constant across 10–60 min #1979

Open

Christopher-Lee-McClendon wants to merge 2 commits into openai:main from Christopher-Lee-McClendon:submission/non-record-longtrain-scaling

Conversation

@Christopher-Lee-McClendon
Contributor

[Non-record] Long-Train Artifact Scaling: Final post-TTT BPB = 1.0399, no compressibility gain

Summary

Non-record experiment studying whether longer training makes the PR #1950 recipe (INT6 GPTQ + per-group lrzip) produce smaller compressed artifacts that could free byte budget for a larger record-track model.

Result: Artifact size is essentially constant (±9 KB, ≈0.06%) across 10–60 minutes of training. BPB improves substantially (post-TTT: 1.06 → 1.04), but the compression pipeline has already reached its entropy floor by 10 minutes.

No ML changes from PR #1950 — identical architecture, hyperparameters, tokenizer, and scoring pipeline. Only the wallclock cap was extended.

Motivation

If a model trained for 60 minutes compresses to a significantly smaller artifact than one trained for 10 minutes, the freed bytes could fund a larger model variant (more layers, wider dim, or higher LQER rank) on the record track. This experiment tests that hypothesis.

Results

Artifact Size vs Training Duration

| Minutes | Steps | Artifact (bytes) | Δ vs 10 min | Notes |
|---|---|---|---|---|
| 10 | 6,348 | 15,953,292 | baseline | In-loop export |
| 20 | 7,193 | 15,952,677 | −615 | In-loop export |
| 30 | 7,899 | 15,956,638 | +3,346 | In-loop export |
| 45 | 12,135 | 15,955,847 | +2,555 | In-loop export |
| final (~60 min cap) | 16,001 | 15,944,203 | −9,089 | Post-stop export at 3598 s¹ |

¹ Training stopped at 3598.25s due to GPTQ_RESERVE_SECONDS=5.5. The in-loop 60-min trigger
never fired. The final row is the standard end-of-training serialize (identical path to PR #1950).
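For context, a minimal sketch of how a reserve-seconds stop like this can work. The names `WALLCLOCK_CAP_S` and `should_stop`, and the way the reserve is applied, are assumptions for illustration, not the exact train_gpt.py logic:

```python
import os
import time

# Illustrative only: stop the training loop a few seconds before the wallclock
# cap so the GPTQ hessian pass and serialize can finish inside the budget.
WALLCLOCK_CAP_S = 3600.0  # assumed 60-min non-record cap
GPTQ_RESERVE_SECONDS = float(os.environ.get("GPTQ_RESERVE_SECONDS", "5.5"))

def should_stop(train_start: float) -> bool:
    """True once the remaining wallclock must be reserved for GPTQ/export."""
    elapsed = time.monotonic() - train_start
    return elapsed >= WALLCLOCK_CAP_S - GPTQ_RESERVE_SECONDS
```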

Final Model Quality (post-stop at 3598s)

| Eval | val_bpb |
|---|---|
| Pre-quantization (EMA) | 1.03969 |
| Post-INT6-GPTQ | 1.04944 |
| Post-TTT (phased score-first) | 1.03988 |

Derived Metrics (final model only)
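The derived quantities follow directly from the table above. A minimal sketch of the arithmetic (variable names are illustrative; the values are the ones reported in this PR):

```python
# Final-model val_bpb values from the table above.
pre_quant_bpb = 1.03969   # pre-quantization (EMA)
post_gptq_bpb = 1.04944   # post-INT6-GPTQ
post_ttt_bpb  = 1.03988   # post-TTT (phased score-first)

# Quantization tax: quality lost to INT6 GPTQ quantization.
quant_tax = post_gptq_bpb - pre_quant_bpb   # ≈ 0.00975 BPB

# TTT gain: quality recovered by test-time training on the quantized model.
ttt_gain = post_gptq_bpb - post_ttt_bpb     # ≈ 0.00956 BPB

print(f"quantization tax: {quant_tax:.5f} BPB")
print(f"TTT gain:         {ttt_gain:.5f} BPB")
```

Both values are ≈0.01 BPB, matching the "quantization tax" and "TTT gain" figures quoted in the commit message below.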

Conclusion

The INT6 GPTQ + per-group lrzip compression reaches its entropy floor within the first 10 minutes of training. Longer training improves model quality (BPB) but does not reduce the compressed artifact size enough to justify a larger model variant. The −9 KB shrink is far below our pre-registered 300 KB threshold for recommending architectural changes.

Method

Training uses the exact PR #1950 recipe with one addition: a NON_RECORD_LONGTRAIN=1 mode (sketched after this list) which:

  1. Extends training to 60 min instead of 10 min
  2. At milestones (10/20/30/45/60 min), pauses training and runs full GPTQ+lrzip serialize
  3. Records artifact size and step count per milestone
  4. After reaching the wallclock cap, runs final serialize + phased TTT eval
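A minimal sketch of the milestone logic, assuming it lives as a callback inside the training loop. The factory and callable names (`make_milestone_exporter`, `export_fn`, `record_fn`) are hypothetical, not the actual train_gpt.py API:

```python
import time

# Illustrative sketch of the NON_RECORD_LONGTRAIN milestone export logic.
MILESTONES_MIN = (10, 20, 30, 45, 60)

def make_milestone_exporter(export_fn, record_fn):
    """Return a callback that pauses training at each milestone, runs a full
    GPTQ+lrzip serialize via export_fn, and logs (minute, step, bytes) via record_fn."""
    start = time.monotonic()
    done = set()

    def maybe_export(step: int) -> None:
        elapsed_min = (time.monotonic() - start) / 60.0
        for m in MILESTONES_MIN:
            if m not in done and elapsed_min >= m:
                artifact_bytes = export_fn()   # full GPTQ + per-group lrzip export
                record_fn(minute=m, step=step, artifact_bytes=artifact_bytes)
                done.add(m)

    return maybe_export
```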

All checkpoint exports use proper distributed synchronization (dist.broadcast + barriers) to prevent NCCL rank desync during the 130s serialize pause.
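A minimal sketch of that synchronization pattern in torch.distributed (illustrative only; the actual export path and flag handling in train_gpt.py may differ):

```python
import torch
import torch.distributed as dist

def synchronized_export(model, rank: int, export_fn) -> None:
    """Keep all ranks in lockstep while rank 0 runs the slow GPTQ+lrzip export."""
    dist.barrier()                     # all ranks enter the export region together
    if rank == 0:
        export_fn(model)               # ~130 s serialize on rank 0 only
    # Broadcast a completion flag so non-zero ranks block until rank 0 finishes
    # instead of racing ahead into the next training step and desyncing NCCL.
    done = torch.tensor([1], device=torch.cuda.current_device())
    dist.broadcast(done, src=0)
    dist.barrier()                     # all ranks leave the export region together
```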

Hardware

  • 8×H100 80GB SXM (RunPod COMMUNITY, $21.52/hr)
  • Seed 42, ~101 min total runtime (incl. data download + checkpoints + eval)

Compliance

  • ⚠️ NOT record-track compliant — training time 3598s >> 600s budget
  • ✅ Eval scoring unchanged from PR #1950 ([Record] Compliant PR #1934 Reproduction, GPTQ_RESERVE=5.5, val_bpb 1.06003 3-seed)
  • ✅ No PPM-D, n-gram cache, or eval-time scoring changes
  • ✅ Score-first TTT (no validation tokens accessed before scoring)
  • ✅ No external network calls during train/eval
  • ✅ Artifact fits within 16 MB cap (15,944,203 < 16,000,000 bytes)
  • ✅ No ML changes from base recipe

Related PRs

Files

records/track_non_record_16mb/2026-04-30_PR1950_LongTrainArtifactScaling/
├── README.md                  # Full documentation
├── submission.json            # Experiment metadata
├── train_gpt.py               # Modified PR #1950 script (NON_RECORD_LONGTRAIN mode)
├── train.log                  # Rank-0 training log (seed 42, 8×H100)
├── pgolf_stdout.txt           # Combined stdout (launcher + training + eval)
├── notes/
│   └── IMPLEMENTATION_NOTES.md
├── scripts/
│   ├── run_longtrain_scaling.sh
│   ├── analyze_scaling.py
│   └── make_larger_variant_plan.py
└── results/
    ├── checkpoint_10min.json
    ├── checkpoint_20min.json
    ├── checkpoint_30min.json
    ├── checkpoint_45min.json
    ├── checkpoint_60min.json   # Manually created from final export metrics
    ├── scaling_results.csv
    └── experiment_summary.json

Reproduces PR openai#1934's exact recipe (per-group lrzip compression,
EMBED_WD=0.06, tightened clip sigmas) with GPTQ_RESERVE_SECONDS=5.5
to ensure GPTQ hessians complete within the 600s training budget.

Results (3-seed mean: 1.06003, std: 0.000385):
- Seed 42: 1.05987 (4962 steps, artifact 15,971,933 B)
- Seed 314: 1.05975 (4952 steps, artifact 15,970,997 B)
- Seed 999: 1.06047 (4954 steps, artifact 15,974,305 B)

Compliance: train_loop + hessians = 598.2s max (< 600s)
Delta vs PR openai#1934: +0.00010 BPB (negligible, within noise)

Co-authored-by: Copilot <[email protected]>
…-TTT BPB 1.0399

Studies artifact size as a function of training duration for the PR openai#1950
(compliance-audited PR openai#1934 reproduction) recipe on 8xH100 SXM.

Key findings:
- Artifact size is constant (±9 KB / 0.06%) across 10-60 min training
- INT6 GPTQ + per-group lrzip is already at entropy floor by 10 min
- BPB improves substantially: 1.06 (10 min) → 1.04 (60 min) post-TTT
- Quantization tax (~0.01) and TTT gain (~0.01) stable across durations
- No justification for larger model under same 16 MB cap

Bug fix included:
- NCCL rank desync during checkpoint export (broadcast sync + barriers)

Non-record: training wallclock 3598s >> 600s budget.
No ML changes from PR openai#1950.

Co-authored-by: Copilot <[email protected]>
Christopher-Lee-McClendon added a commit to Christopher-Lee-McClendon/parameter-golf that referenced this pull request May 1, 2026
…etrics

- Fix metric stage labeling (training_val vs quantized vs post_ttt)
- Add TTT sweep results (3 successful, 4 failed variants)
- v0 control (PR openai#1979 params) best at 1.03471 BPB
- Fix --sweep-only-artifact mode for dedicated sweep pods
- Bundle TTT sweep script via --extra-file in launcher
- Add checkpoint_360min.json and sweep CSV/JSON results
- Soften claims per red-team review (optimal → best among tested)

Co-authored-by: Copilot <[email protected]>
Christopher-Lee-McClendon added a commit to Christopher-Lee-McClendon/parameter-golf that referenced this pull request May 2, 2026
- Replace '~30K final stop/export' with exact step 29,888 (wallclock stop)
- Replace '~1.0600 live' with exact 1.0600
- Fix memory diff: 4.2 GB (matching displayed 47.8-43.6) not 4.1 GiB
- Add missing 300-min row to training scaling summary table
- Clarify that PR openai#1979 60-min (8xH100) differs from standalone 4h 60-min checkpoint
- Add hardware annotations (8xH100 SXM / 4xH100 NVL) to source column

Co-authored-by: Copilot <[email protected]>
