
[Non-Record] 6h Long-Train Scaling + TTT Sweep: Post-TTT BPB 1.03387 #2008

Open

Christopher-Lee-McClendon wants to merge 14 commits into openai:main from
Christopher-Lee-McClendon:submission/non-record-4h-longtrain-ttt-sweep

Conversation

@Christopher-Lee-McClendon (Contributor) commented Apr 30, 2026

[Non-Record] 6h Long-Train Scaling + TTT Hyperparameter Sweep

Current best 360-minute post-TTT BPB: 1.03387 (v7_noqv_rank96, single seed, 4xH100 NVL)

Summary

Formal non-record submission studying BPB as a function of training duration (10 min -> 6h) and systematically sweeping TTT/LoRA hyperparameters on the final 6h quantized artifact.

At a glance

| Metric | Value | Notes |
|---|---|---|
| Best 360-min post-TTT BPB | 1.03387340 | v7_noqv_rank96 on the final 360-min artifact (single seed) |
| Matched 360-min pre-quant EMA BPB | 1.03340201 | eval-only follow-up from saved resume checkpoint |
| Matched 360-min quantized sliding BPB | 1.04273086 | same artifact, no TTT |
| 6h quantization tax | +0.00932885 BPB | quantized minus matched pre-quant EMA |
| Best TTT recovery at 6h | 0.00885746 BPB (~95%) | v7_noqv_rank96; recovery fraction = 0.00885746 / 0.00932885 = 94.95% |
| Final artifact size | 15,926,271 bytes | final_model.int6.360min.ptz |
| Run shape | two RunPod sessions for the artifact path; third later pod for matched pre-quant recovery | downloaded 300-min snapshot -> 4-GPU continuation -> later eval-only follow-up |

Key findings

  1. Post-TTT BPB improves from 1.06003 (10-min reference, PR #1934, 3-seed mean) to 1.03387 (6h single-seed, v7_noqv_rank96). This is a descriptive endpoint comparison across durations and seeds, not a controlled scaling estimate.
  2. A matched 360-min comparator gives pre-quant EMA 1.03340201 -> quantized 1.04273086 -> post-TTT 1.03387340 (v7), so GPTQ adds +0.00932885 BPB at 6h and the best TTT recovers 0.00885746 BPB of that tax (tax and recovery are defined just after this list).
  3. In this single-seed run, the best 6h post-TTT result remains only +0.00047139 BPB above the matched 6h pre-quant EMA.
  4. Additional matched 240-min and 300-min controls show the same pattern: EMA helps, GPTQ adds a modest tax, and TTT recovers most or all of that tax.
  5. Artifact size is effectively constant across this family of runs; quality improves more than bytes do.
  6. Removing Q and V LoRA targets (v7: K+MLP+O+lm_head only) beats both the original full-target control (v0) and the lighter single-phase variant (v12).
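
For reference, the two derived quantities used throughout this PR are:

$$\text{tax} = \mathrm{BPB}_{\text{quantized}} - \mathrm{BPB}_{\text{pre-quant EMA}}, \qquad \text{recovery fraction} = \frac{\mathrm{BPB}_{\text{quantized}} - \mathrm{BPB}_{\text{post-TTT}}}{\text{tax}}$$

At the 6h horizon this gives tax = 1.04273086 - 1.03340201 = 0.00932885, and for v7_noqv_rank96 a recovery fraction of 0.00885746 / 0.00932885 ≈ 0.949.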

Acknowledged PR lineage for this stack

These are the PRs most directly responsible for the training recipe, optimizer substrate, continuation semantics, and TTT control/sweep used here.

| PR | Why it matters here |
|---|---|
| PR #1934 | Original record-track recipe that this long-train study extends in non-record form |
| PR #1950 | Compliance-audited reproduction of PR #1934; the exact base training recipe used here |
| PR #1979 | 1-hour long-train precursor; provides the 60-min comparator and the original v0_control_pr1979 TTT settings |
| PR #461 | Original legal score-first TTT framework that all post-TTT comparisons here still follow |
| PR #1767 | TTT alpha / warm-start / weight-decay improvements carried into the control TTT recipe |
| PR #1855 | QK-gain and TTT-rank exploration that informed the long-train control and later sweep directions |
| PR #1344 | Polar Express per-iteration Newton-Schulz coefficients concept for Muon |
| PR #1787 | Parameter-golf integration of the Polar Express Muon coefficients used by this training stack |

Training scaling results

All durations use the same PR #1950 / PR #1934 recipe. To avoid mixing live-training metrics with matched eval-only comparators, the live checkpoint trajectory and the post-TTT horizon table are separated below.

| Duration | Source | Export / endpoint step | Live training val_bpb near export | Artifact |
|---|---|---|---|---|
| 60 min | PR #1979 (8xH100 SXM) | 16,001 | 1.0615 | 15,944 KB |
| 240 min | standalone 4h run (4xH100 NVL) | 29,888 (wallclock stop) | 1.0600 | 15,933 KB |
| 300 min | seed snapshot for continuation | 36,452 | 1.0871 | 15,937 KB |
| 360 min | resumed 6h chain (4xH100 NVL) | 49,765 | 1.0599* | 15,926 KB |

*Last logged live validation BPB near the 360-min export; the matched 360-min EMA / quantized / post-TTT comparator chain is reported later. The 60-min row is a separate 8xH100 run (PR #1979), not the same pod as the 240/300/360 chain.

Live training trajectory around saved/exported checkpoints

This table reports the last logged live training metrics near each saved/exported checkpoint, not matched EMA/quantized/post-TTT evals. The 60/120/180/240 rows come from the standalone 4h run (4xH100 NVL); the 300/360 rows come from the resume chain that produced the final 6h artifact. Note: the PR #1979 60-min artifact (step 16,001, 8xH100 SXM) in the summary table above is a different run from the 60-min checkpoint here (step 10,488, standalone 4h).

| Checkpoint minute | Source run | Saved/exported step | Last logged train_loss near checkpoint | Last logged live val_loss | Last logged live val_bpb |
|---|---|---|---|---|---|
| 60 | standalone 4h run | 10,488 | 2.4241 (step 10,000) | 2.5649 (step 8,000) | 1.1720 |
| 120 | standalone 4h run | 17,480 | 2.5575 (step 17,000) | 2.4924 (step 16,000) | 1.1389 |
| 180 | standalone 4h run | 23,418 | 2.4389 (step 23,000) | 2.4474 (step 20,000) | 1.1183 |
| 240 | standalone 4h run | 29,888 (wallclock stop) | 2.3156 (step 29,500) | 2.3199 (step 29,888) | 1.0600 |
| 300 | downloaded seed snapshot for continuation | 36,452 | 2.4071 (step 36,000) | 2.3792 (step 36,000) | 1.0871 |
| 360 | resumed 6h continuation | 49,765 | 2.2774 (step 48,000) | 2.3197 (step 48,000) | 1.0599 |
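
One useful sanity check: the val_loss and val_bpb columns are mutually consistent under a fixed nats-to-bits-per-byte conversion. The formula below is an assumption about how val_bpb is computed (it is not stated in this PR), but every (val_loss, val_bpb) pair in the table matches it:

```python
import math

# (val_loss in nats, val_bpb) pairs from the trajectory table above
rows = [(2.5649, 1.1720), (2.4924, 1.1389), (2.4474, 1.1183),
        (2.3199, 1.0600), (2.3792, 1.0871), (2.3197, 1.0599)]

for loss, bpb in rows:
    # assumed convention: bpb = loss_nats / (ln 2 * avg_bytes_per_token)
    implied_bytes_per_token = loss / (math.log(2) * bpb)
    print(f"loss={loss:.4f}  bpb={bpb:.4f}  implied bytes/token={implied_bytes_per_token:.3f}")

# every row implies ~3.157 bytes/token, so the two columns are internally consistent
```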

How the 6h artifact and later follow-ups were actually produced

The final 360-minute artifact itself was produced in two RunPod sessions. A third later pod was used only for matched pre-quant follow-up recovery.

| Phase | Pod | Persistent checkpoint/export state | What it was used for |
|---|---|---|---|
| Initial live training run | y3ulfm7pb5kqyt | Downloaded `results/8h_longtrain_final/resume_snapshot_step_36452/` containing `resume_manifest.json` + `resume_rank{0..3}_step36452.pt`; manifest reports step=36452, training_time_ms=18000630.06, world_size=4, exported_minutes=[60,120,180,240,300] | Authoritative 300-minute restart point pulled back to HPC before the original pod expired |
| Resumed 6h-horizon continuation | mu4c253h9yoiy3 | Wrote `results/resumed_6h_horizon_continuation_step36452/final_model.int6.360min.ptz` and `checkpoint_360min.json` (train_steps=49765, train_wallclock_seconds=21600.15, artifact_bytes=15926271); log also shows resume saves at 330 min (step=43125) and 360 min (step=49765) | Produced the 360-minute submission artifact and the original 6h post-TTT control result |
| Later pre-quant follow-up safety capture | h2fkfy6usuw72n | Downloaded `results/prequant_360min_from_step36452/resume_snapshot_step_43062/` with manifest + all 4 rank files; manifest reports step=43062, training_time_ms=19800085.99, world_size=4 | Fallback 330-minute restart snapshot captured while recovering the matched 360-minute pre-quant EMA comparator stored in `results/prequant_360min_from_step36452/prequant_eval_summary.live.json` |

What was done, exactly:

  1. The original 4-GPU live pod was allowed to run until a full 300-minute resume snapshot existed, then all four rank-local checkpoint files plus the manifest were downloaded under results/8h_longtrain_final/resume_snapshot_step_36452/.
  2. The continuation resumed from that downloaded snapshot on 4 GPUs only. The continuation log confirms RESUME: restored step=36452, training_time=18000.6s, exported_minutes=[60, 120, 180, 240, 300].
  3. The seed run was already a 6-hour training-wallclock run (training_wallclock=21600 in results/8h_longtrain_final/launcher_state.json). The resumed pod used a longer hard stop than 6h but explicitly kept SCHEDULE_HORIZON_SECONDS=21600, so LR warmdown and other schedule-dependent behavior still followed the original 6-hour horizon. This is a faithful continuation of the 6h schedule, not a fresh longer-horizon rerun (see the sketch after this list).
  4. The submission artifact for this PR is the 360-minute export from the resumed pod: results/resumed_6h_horizon_continuation_step36452/final_model.int6.360min.ptz.
  5. The later NCCL timeout in the continuation log happened after the 360-minute export and 360-minute resume save were written, so it does not invalidate the artifact used here.
  6. The 330-minute step differs slightly between the main continuation (43125) and the later pre-quant follow-up snapshot (43062) because those are different resumed pods launched from the same 300-minute seed snapshot for different purposes.
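
To make item 3 concrete, here is a minimal sketch of decoupling the stop horizon from the LR schedule horizon. Only the SCHEDULE_HORIZON_SECONDS name and the 21,600 s value are from this PR; the hard-stop variable, warmdown shape, and function name are hypothetical:

```python
import os

# SCHEDULE_HORIZON_SECONDS is from this PR; HARD_STOP_SECONDS and the linear
# warmdown below are hypothetical illustration.
SCHEDULE_HORIZON_S = float(os.environ.get("SCHEDULE_HORIZON_SECONDS", "21600"))  # 6h LR horizon
HARD_STOP_S = float(os.environ.get("HARD_STOP_SECONDS", "25200"))  # pod stop, may exceed horizon

def lr_scale(elapsed_training_s: float, warmdown_frac: float = 0.4) -> float:
    """LR multiplier driven by the *schedule* horizon, not the pod's hard stop."""
    progress = min(elapsed_training_s / SCHEDULE_HORIZON_S, 1.0)
    if progress < 1.0 - warmdown_frac:
        return 1.0  # constant-LR phase
    return max(0.0, (1.0 - progress) / warmdown_frac)  # linear warmdown to zero

# a resumed run restores elapsed_training_s (training_time_ms / 1000 from the manifest),
# so lr_scale picks up exactly where the original 6h schedule left off
```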

Post-TTT BPB over time

This table is the easiest way to see how the post-TTT endpoint moves with training duration. Only 240/300/360 have matched artifact/checkpoint controls in this session; 120 and 180 were not separately evaluated with TTT.

| Training horizon | Source / comparator | TTT config | post_ttt_bpb | Notes |
|---|---|---|---|---|
| 10 min | PR #1934 reference | record submission config | 1.06003 | 3-seed mean reference point |
| 60 min | PR #1979 | original long-train control | 1.03988 | 8xH100, 60-min precursor |
| 240 min | matched 240-min artifact | v0_control_pr1979 | 1.03539272 | nearly returns to matched 240-min pre-quant EMA (1.03545673) |
| 300 min | matched 300-min checkpoint | original control recipe | 1.04210727 | from resume-decomposition follow-up on the same saved checkpoint |
| 360 min | matched 360-min artifact | v0_control_pr1979 | 1.03471322 | original 6h control used in the first sweep |
| 360 min | matched 360-min artifact | v12_rank96_phase1_prefix1000 | 1.03421043 | single-phase / lower-global-compute variant |
| 360 min | matched 360-min artifact | v7_noqv_rank96 | 1.03387340 | best result: Q/V LoRA removed, K+MLP+O+lm_head only |

TTT/LoRA sweep on the 360-min quantized artifact

| Variant | LoRA rank/alpha | LR | Batch | post_ttt_bpb | Peak memory | Status |
|---|---|---|---|---|---|---|
| sliding_window_control | n/a | n/a | n/a | 1.04273086 | 5.3 GB | baseline (no TTT) |
| v0_control_pr1979 | 96 / 144 | 1e-4 | 64 | 1.03471322 | 47.8 GB | control |
| v12_rank96_phase1_prefix1000 | 96 / 144 | 1e-4 | 64 | 1.03421043 | 47.7 GB | better than control |
| v7_noqv_rank96 | 96 / 144 (K+MLP+O+lm_head only) | 1e-4 | 64 | 1.03387340 | 43.6 GB | best |
| v1_rank128_alpha192 | 128 / 192 | 1e-4 | 64 | 1.03877 | n/a | worse |
| v2_rank128_lr3e4 | 128 / 192 | 3e-4 | 64 | 1.09049 | n/a | regression |
| v3_local_batch_chunk | 128 / 192 | 3e-4 | 128 | n/a | n/a | failed (no clean traceback; likely memory pressure / unstable config) |
| v4_global2_largechunk | 128 / 192 | 3e-4 | 128 | n/a | n/a | failed (no clean traceback; likely memory pressure / unstable config) |
| v5_prefix3000 | 128 / 192 | 3e-4 | 128 | n/a | n/a | failed (no clean traceback; likely memory pressure / unstable config) |
| v6_prefix3000_phase4_optional | 128 / 192 | 3e-4 | 128 | n/a | n/a | failed (no clean traceback; likely memory pressure / unstable config) |

Interpretation:

  • The sliding-window control isolates the TTT contribution on the same 360-minute artifact.
  • v7 improves on the control while using 4.2 GB less peak memory than the full-target v0 recipe (43.6 vs 47.8 GB); a sketch of the target gating follows after this list.
  • v12 is interesting because it nearly matches the original 3-phase control while using much less global-TTT compute.
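
As referenced above, a minimal sketch of the target gating behind v7: the commit log for this PR adds TTT_Q_LORA / TTT_V_LORA env vars (default 1) and None-guards the Q/V adapter paths. The adapter module and shapes below are illustrative, not the repo's actual classes:

```python
import os
import torch.nn as nn

TTT_Q_LORA = os.environ.get("TTT_Q_LORA", "1") == "1"  # default-on for backward compat
TTT_V_LORA = os.environ.get("TTT_V_LORA", "1") == "1"

class LoRA(nn.Module):
    def __init__(self, dim: int, rank: int = 96, alpha: int = 144):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a zero delta
        self.scale = alpha / rank

    def forward(self, x):
        return self.up(self.down(x)) * self.scale

def build_attn_loras(dim: int) -> dict:
    # v7_noqv_rank96 corresponds to TTT_Q_LORA=0 TTT_V_LORA=0: K keeps its adapter
    # (MLP/O/lm_head adapters live elsewhere), Q and V become None, and the forward
    # path skips the delta when an adapter is None
    return {
        "q": LoRA(dim) if TTT_Q_LORA else None,
        "k": LoRA(dim),
        "v": LoRA(dim) if TTT_V_LORA else None,
    }
```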

Matched decomposition and comparator chain

| Stage | BPB | Delta |
|---|---|---|
| Matched 6h pre-quant EMA | 1.03340201 | baseline |
| Quantized 6h artifact (sliding eval) | 1.04273086 | +0.00932885 vs matched pre-quant EMA |
| Post-TTT (v0_control_pr1979) | 1.03471322 | -0.00801764 vs quantized, +0.00131121 vs matched pre-quant EMA |
| Post-TTT (v7_noqv_rank96) | 1.03387340 | -0.00885746 vs quantized, +0.00047139 vs matched pre-quant EMA |

Additional matched controls:

  • 240 min: pre-quant EMA 1.03545673 -> quantized 1.04485881 (+0.00940208 tax) -> post-TTT 1.03539272
  • 300 min: live 1.08215117 -> EMA 1.04945326 -> quantized 1.05603004 (+0.00657678 tax) -> post-TTT 1.04210727
  • 360 min: the original control (v0) reaches 1.03471322, while the later Q/V-ablation follow-up (v7) improves further to 1.03387340
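
The full chain reduces to a few subtractions; a quick check of the 6h numbers:

```python
prequant_ema = 1.03340201  # matched 360-min pre-quant EMA
quantized    = 1.04273086  # same artifact, sliding eval, no TTT
post_ttt_v7  = 1.03387340  # v7_noqv_rank96

tax       = quantized - prequant_ema  # GPTQ quantization tax
recovered = quantized - post_ttt_v7   # BPB recovered by TTT
print(f"tax={tax:.8f}  recovered={recovered:.8f}  fraction={recovered / tax:.2%}")
# -> tax=0.00932885  recovered=0.00885746  fraction=94.95%
```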

Scientific hypotheses tested

  1. H1: Longer training improves post-TTT BPB -> supported descriptively
  2. H2: Longer training meaningfully reduces compressed artifact size -> not supported
  3. H3: Higher LoRA rank improves TTT on this 6h artifact -> not supported
  4. H4: Higher LR improves TTT at rank 128 -> rejected
  5. H5: Larger local batch / chunk improves TTT -> untested because those variants failed
  6. H6: GPTQ degrades BPB on matched checkpoints -> supported at 240, 300, and 360 minutes
  7. H7: Q/V LoRA targets are necessary for best 6h TTT -> rejected by v7_noqv_rank96

Infrastructure additions used by this PR

  • Resumable rank-local checkpoints with manifest-driven restore
  • SCHEDULE_HORIZON_SECONDS to decouple stop horizon from LR / schedule horizon during continuation
  • sweep-only-artifact mode for standalone TTT evaluation on an existing quantized artifact
  • HTTP-based artifact upload/download around RunPod proxy instability
  • Per-variant isolated TTT sweep execution with JSON / CSV summaries
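
To illustrate the first bullet, a sketch of manifest-driven restore. The file names and manifest fields (step, training_time_ms, world_size) match what is quoted in the pod table above; everything else, including the function signature, is an assumption:

```python
import json
import os
import torch

def load_resume_checkpoint(resume_dir: str, rank: int, expected_world_size: int = 4):
    with open(os.path.join(resume_dir, "resume_manifest.json")) as f:
        manifest = json.load(f)
    # refuse to restore into a different rank layout
    assert manifest["world_size"] == expected_world_size
    ckpt = torch.load(
        os.path.join(resume_dir, f"resume_rank{rank}_step{manifest['step']}.pt"),
        map_location="cpu",
    )
    return manifest["step"], manifest["training_time_ms"] / 1000.0, ckpt

# step, training_time_s, ckpt = load_resume_checkpoint(
#     "results/8h_longtrain_final/resume_snapshot_step_36452", rank=0)
# -> step=36452, training_time_s ~= 18000.6, matching the RESUME log line quoted above
```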

Compliance

Hardware and cost

| Phase | Hardware | Notes |
|---|---|---|
| 1h precursor | 8xH100 SXM | PR #1979 baseline |
| 4h standalone run | 4xH100 NVL | 60/120/180/240 checkpoint study |
| 6h continuation | 4xH100 NVL | downloaded 300-min snapshot -> 360-min resumed artifact |
| TTT sweep + follow-ups | 4xH100 NVL | 240-min TTT-only, 300-min decomposition, 360-min pre-quant recovery, v7/v12 follow-up sweep |

Estimated total cost across the long-train stack and follow-ups is on the order of ~$160.

Reproduces PR openai#1934's exact recipe (per-group lrzip compression,
EMBED_WD=0.06, tightened clip sigmas) with GPTQ_RESERVE_SECONDS=5.5
to ensure GPTQ hessians complete within the 600s training budget.

Results (3-seed mean: 1.06003, std: 0.000385):
- Seed 42: 1.05987 (4962 steps, artifact 15,971,933 B)
- Seed 314: 1.05975 (4952 steps, artifact 15,970,997 B)
- Seed 999: 1.06047 (4954 steps, artifact 15,974,305 B)

Compliance: train_loop + hessians = 598.2s max (< 600s)
Delta vs PR openai#1934: +0.00010 BPB (negligible, within noise)

Co-authored-by: Copilot <[email protected]>
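
A sketch of what a reserve like GPTQ_RESERVE_SECONDS implies for the train loop: stop early enough that hessian collection and quantization still fit inside the 600 s budget (consistent with the 598.2 s total above). Only the env var name and the 5.5 s / 600 s values are from this commit; the helper is hypothetical:

```python
import time

TRAIN_BUDGET_S = 600.0
GPTQ_RESERVE_S = 5.5  # tail time reserved for GPTQ hessian passes + quantization

def train_loop_should_stop(start_time: float) -> bool:
    # hypothetical helper: end training with enough budget left for GPTQ
    return time.time() - start_time >= TRAIN_BUDGET_S - GPTQ_RESERVE_S
```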
…-TTT BPB 1.0399

Studies artifact size as a function of training duration for the PR openai#1950
(compliance-audited PR openai#1934 reproduction) recipe on 8xH100 SXM.

Key findings:
- Artifact size is constant (±9 KB / 0.06%) across 10-60 min training
- INT6 GPTQ + per-group lrzip is already at entropy floor by 10 min
- BPB improves substantially: 1.06 (10 min) → 1.04 (60 min) post-TTT
- Quantization tax (~0.01) and TTT gain (~0.01) stable across durations
- No justification for larger model under same 16 MB cap

Bug fix included:
- NCCL rank desync during checkpoint export (broadcast sync + barriers)

Non-record: training wallclock 3598s >> 600s budget.
No ML changes from PR openai#1950.

Co-authored-by: Copilot <[email protected]>
Create scripts/run_longtrain_ttt_sweep.py with 7 sweep variants for
evaluating different TTT/LoRA configurations on a fixed quantized artifact.

Features:
- 7 defined variants (v0 baseline through v6 exploratory)
- Dry-run mode, on-pod execution, pod command emission
- Per-variant isolation with separate output directories
- Configurable timeout, GPU count, variant filtering
- JSON manifest + CSV + summary aggregation
- Re-aggregation from existing per-variant results

Add tests/test_ttt_sweep.py with 26 tests covering variant definitions,
env construction, selection, manifest generation, CSV aggregation,
dry-run output, and pod command generation.

Co-authored-by: Copilot <[email protected]>
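
A hedged sketch of the per-variant isolation this commit describes: each variant is a named env-var overlay whose results go to its own output directory. The variant fields and TTT_LORA_* names are illustrative; TTT_EVAL_OUTPUT_JSON and TTT_Q_LORA/TTT_V_LORA are real env vars from other commits in this PR:

```python
from dataclasses import dataclass, field

@dataclass
class SweepVariant:
    name: str
    env: dict = field(default_factory=dict)

# illustrative variant records; only the names match the sweep tables above
VARIANTS = [
    SweepVariant("v0_control_pr1979", {"TTT_LORA_RANK": "96", "TTT_LORA_ALPHA": "144"}),
    SweepVariant("v7_noqv_rank96", {"TTT_Q_LORA": "0", "TTT_V_LORA": "0"}),
]

def build_variant_env(base_env: dict, variant: SweepVariant) -> dict:
    env = dict(base_env)
    env.update(variant.env)
    # per-variant isolation: each run writes its JSON summary to its own directory
    env["TTT_EVAL_OUTPUT_JSON"] = f"ttt_sweep/{variant.name}/summary.json"
    return env
```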
Phase 1: Add save_resume_checkpoint / load_resume_checkpoint functions
with atomic writes, hparam fingerprint compatibility checks, manifest
schema, Muon shard_mom persistence, and old checkpoint cleanup.

Phase 2: Add state_dict() / load_state_dict() to DocumentPackingLoader
for deterministic data-loader resume (shard index + cursor).

Integration: RESUME_ENABLED=1 RESUME_FROM=<dir> loads checkpoint;
RESUME_SAVE_MINUTES=5,10,20 triggers periodic saves during training.
No-op when RESUME_ENABLED is unset.

Co-authored-by: Copilot <[email protected]>
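
A minimal sketch of the Phase 2 loader-resume surface, assuming the (shard index, cursor) state named in this commit; the class internals are illustrative:

```python
class DocumentPackingLoader:
    def __init__(self, shards: list):
        self.shards = shards
        self.shard_idx = 0  # which shard is being read
        self.cursor = 0     # position within the current shard

    def state_dict(self) -> dict:
        return {"shard_idx": self.shard_idx, "cursor": self.cursor}

    def load_state_dict(self, state: dict) -> None:
        # deterministic resume: continue from the exact shard/cursor position
        self.shard_idx = state["shard_idx"]
        self.cursor = state["cursor"]
```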
…ine-readable outputs

Phase 3 (run_longtrain_scaling.py):
- Add --duration-hours with auto-defaults (wallclock, max-minutes, export, resume, iterations)
- Add --iterations, --enable-resume, --resume-save-minutes, --resume-from, --resume-keep-last
- Add --run-ttt-sweep-after-train, --ttt-sweep-variants, --ttt-max-minutes-per-variant
- build_seed_cmd() emits RESUME_*/ITERATIONS env vars when flags set
- TTT sweep script appended to pod command; sweep results copied for HTTP serving
- build_download_list() includes ttt_sweep/ files when sweep enabled
- Bundle includes scripts/run_longtrain_ttt_sweep.py via extra_files
- Dry-run output shows all new settings
- 4-hour default constants (DEFAULT_4H_*)

Phase 5 (train_gpt.py):
- Write JSON summary after TTT eval (TTT_EVAL_OUTPUT_JSON or artifact_dir default)
- LOAD_QUANTIZED_MODEL_PATH env override for eval-only / sweep runs

Tests:
- 23 new tests in test_launcher_longtrain.py covering all new args, command building, and defaults

Co-authored-by: Copilot <[email protected]>
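
A sketch of the flag-to-env mapping build_seed_cmd() performs per this commit; the env var names (RESUME_ENABLED, RESUME_FROM, RESUME_SAVE_MINUTES, ITERATIONS) are from this PR, while the args structure and helper name are assumed:

```python
def build_resume_env(args) -> dict:
    # hypothetical extract of build_seed_cmd(): emit RESUME_*/ITERATIONS only when flags set
    env = {}
    if args.enable_resume:
        env["RESUME_ENABLED"] = "1"
        env["RESUME_SAVE_MINUTES"] = ",".join(str(m) for m in args.resume_save_minutes)
        if args.resume_from:
            env["RESUME_FROM"] = args.resume_from
    if args.iterations:
        env["ITERATIONS"] = str(args.iterations)
    return env
```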
…st-TTT)

4-hour training on PR openai#1950 recipe (seed 42, 4xH100 NVL):
- Pre-quant post-EMA BPB: 1.0355
- Quantized (INT6 GPTQ) BPB: 1.0449
- Artifact: 15,932,638 bytes (67K headroom under 16 MB)
- Artifact shrinks 15 KB from 60 min to 240 min
- BPB improves monotonically: 1.172 -> 1.057 pre-quant over 4h
- Quantization tax stable at 0.0094

Infrastructure:
- Resumable rank-local checkpoints (RESUME_ENABLED=1)
- DocumentPackingLoader state save/restore
- TTT/LoRA eval sweep orchestrator (7 variants)
- Extended launcher with 4h mode, dynamic seed timeout
- 74 tests passing

TTT eval interrupted at phase 1/3 by shell timeout.
TTT sweep not run. Full post-TTT BPB estimated ~1.02-1.03.

Co-authored-by: Copilot <[email protected]>
- 1.0449 quantized does NOT beat 1h post-TTT 1.0399 (lower is better)
- Changed 'beats' to 'approaches' with explicit 0.005 gap noted
- Fixed 240-min table row: use quantized BPB (1.0449) with footnote
- Pre-quant post-EMA 1.0355 does surpass 1h post-TTT, noted clearly
- Fixed submission.json key_findings accordingly

Co-authored-by: Copilot <[email protected]>
@Christopher-Lee-McClendon changed the title from "[Non-Record] 4h Long-Train Scaling: Quantized BPB 1.0449 (Beats 1h Post-TTT)" to "[Non-Record] 4h Long-Train Scaling: Quantized BPB 1.0449" on Apr 30, 2026
…etrics

- Fix metric stage labeling (training_val vs quantized vs post_ttt)
- Add TTT sweep results (3 successful, 4 failed variants)
- v0 control (PR openai#1979 params) best at 1.03471 BPB
- Fix --sweep-only-artifact mode for dedicated sweep pods
- Bundle TTT sweep script via --extra-file in launcher
- Add checkpoint_360min.json and sweep CSV/JSON results
- Soften claims per red-team review (optimal → best among tested)

Co-authored-by: Copilot <[email protected]>
@Christopher-Lee-McClendon changed the title from "[Non-Record] 4h Long-Train Scaling: Quantized BPB 1.0449" to "[Non-Record] 6h Long-Train Scaling + TTT Sweep: Post-TTT BPB 1.03471" on May 1, 2026
Christopher-Lee-McClendon and others added 3 commits May 1, 2026 14:13
- Implement SLIDING_EVAL=1 mode in train_gpt.py for non-TTT quantized eval
- Add v_sliding_window_control variant to TTT sweep (TTT_ENABLED=0)
- Measured quantized_bpb_360min = 1.04273086 (INT6 GPTQ, no TTT)
- TTT gain properly decomposed: 1.04273 → 1.03471 = 0.00802 BPB
- GPTQ quantization itself improves BPB by 0.017 vs live model (1.0599 → 1.0427)
- Update submission.json, README, PR body with full 3-stage decomposition
- Add H6: GPTQ quantization acts as regularization (unexpected finding)
- Include sliding eval results in submission directory

Co-authored-by: Copilot <[email protected]>
Add matched 240/300/360 comparator evidence, capture the true 6h pre-quant EMA result, harden RunPod artifact retrieval, and refresh the non-record submission materials for PR openai#2008.

Co-authored-by: Copilot <[email protected]>
- Add TTT_Q_LORA / TTT_V_LORA env vars to train_gpt.py (default=1 for backward compat)
- Guard q_loras/v_loras creation and forward paths with None checks
- Add v7_noqv_rank96, v8_noqv_rank128, v12_rank96_phase1_prefix1000 to sweep launcher
- v7 (no Q/V, K+MLP+O+lm_head only): 1.03387 BPB, 43.6 GiB peak, 641s eval
- v12 (1-phase, 1000 prefix): 1.03421 BPB, 47.7 GiB peak, 663s eval
- Both beat v0 control (1.03471) — v7 is new best with less memory
- TTT now recovers ~95% of 6h quantization tax (was ~86% with v0)
- Updated PR body, README, submission.json with red-teamed claims
- Added reproducibility.md guide
- Added AGENTS.md with RunPod operational lessons

Co-authored-by: Copilot <[email protected]>
@Christopher-Lee-McClendon changed the title from "[Non-Record] 6h Long-Train Scaling + TTT Sweep: Post-TTT BPB 1.03471" to "[Non-Record] 6h Long-Train Scaling + TTT Sweep: Post-TTT BPB 1.03387" on May 1, 2026
Christopher-Lee-McClendon and others added 3 commits May 1, 2026 19:07
- explain the exact restart chain used for the 360min artifact
- document downloaded 300min snapshot step36452 and fallback 330min snapshot step43062
- note that the 360min export was produced before the later NCCL timeout
- sync reproducibility guide to the actual two-stage artifact path

Co-authored-by: Copilot <[email protected]>
- add explicit PR lineage / acknowledgement table for the longtrain + TTT stack
- add live checkpoint trajectory table and post-TTT-by-horizon table
- clarify the exact two-pod artifact path and third-pod prequant follow-up
- promote single-seed and non-record caveats earlier in the body
- improve sweep failure annotations and overall PR formatting

Co-authored-by: Copilot <[email protected]>
- Replace '~30K final stop/export' with exact step 29,888 (wallclock stop)
- Replace '~1.0600 live' with exact 1.0600
- Fix memory diff: 4.2 GB (matching displayed 47.8-43.6) not 4.1 GiB
- Add missing 300-min row to training scaling summary table
- Clarify that PR openai#1979 60-min (8xH100) differs from standalone 4h 60-min checkpoint
- Add hardware annotations (8xH100 SXM / 4xH100 NVL) to source column

Co-authored-by: Copilot <[email protected]>