(Nonrecord) Applied Async Prefetching Potentially Boosts Performance #785

SirSaltySalmon wants to merge 7 commits into openai:main from …
Community Review — (Nonrecord) Applied Async Prefetching Potentially Boosts Performance

BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA …): the TTT path at line 1207 implements the score-first-per-chunk pattern: each chunk is scored under … Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk …

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=1024, code=95277 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
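As a minimal sketch of the score-first-per-chunk TTT pattern the review refers to (illustrative only; the function names, loss call, and optimizer below are assumptions, not code from this PR):

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration of legal score-first-per-chunk TTT: every token in a
# chunk is scored before the adapter is updated on that same chunk, so no token
# is ever scored by a model that has already trained on it.
def score_first_per_chunk_ttt(model, adapter_opt, eval_chunks, record_bpb):
    for inputs, targets in eval_chunks:
        # 1) Score the chunk with the current adapter state (no gradients).
        with torch.no_grad():
            record_bpb(model(inputs), targets)
        # 2) Only afterwards, adapt on the chunk that was just scored.
        logits = model(inputs)
        loss = F.cross_entropy(logits.flatten(0, -2), targets.flatten())
        loss.backward()
        adapter_opt.step()
        adapter_opt.zero_grad(set_to_none=True)
```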
LeakyReLU^2 + Legal TTT + Parallel Muon + systems: prefetch & fusion-friendly MLP
Reference baseline: 2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/README.md

Outcome
This variant improves throughput slightly, but does not improve quality versus the original 3-seed 8xH100 runs.
- step_avg: 83.53ms -> 83.44ms (faster)
- val_bpb (final_int6_sliding_window_exact): 1.12184 -> 1.12334 (worse by +0.00151)
- val_bpb (legal_ttt_exact): 1.11938 -> 1.12096 (worse by +0.00158)

3-seed comparison (8xH100, 600s train budget)
1xH100 ablation (Modal sanity check, 600s train budget)
Interpretation
The data is consistent across all three seeds: the systems changes increase training throughput, but that throughput gain does not translate into better final validation quality in this setup.
So the result here is best described as a speed optimization with neutral-to-slightly-negative quality impact relative to the original record recipe. The quality gap is likely just noise affecting the training result, since the training math and process are exactly the same.
On 1xH100, the same systems changes looked clearly positive (more steps and better post-TTT bpb), while on 8xH100 they remain speed-positive but quality-negative. The practical interpretation is that prefetch/fusion behavior does not transfer linearly from single-GPU to multi-GPU quality outcomes and should be treated as a throughput optimization first. Likely, I/O is no longer the bottleneck at the larger scale; inter-GPU communication becomes the more relevant target instead.
I will continue iterating on this, as the increased training speed shows promise. This attempt tries to show that async prefetching and memory pinning can improve the throughput of most approaches, but more experimentation is needed to check compatibility with other methods. Next, I aim to improve the optimization's compatibility with multi-GPU parallelism.
What changed vs. base record
All differences are in data loading and the MLP forward; model architecture, banking, Parallel Muon, FlashAttention-3, `torch.compile` usage, TTT protocol, and env-driven hyperparameters are otherwise aligned with the base PR.

1. Pinned async prefetch (`PrefetchingDistributedTokenLoader`)

- Built on `queue` and `threading`.
- `TRAIN_PREFETCH` (default `1`)
- `TRAIN_PREFETCH_QUEUE` (default `2`)
- `TRAIN_COPY_STREAM` (default `1`) — when enabled with prefetch, H2D uses a dedicated `torch.cuda.Stream` and the default stream waits on it (`_cpu_batch_from_stream`, `_h2d_int64_batches`).
- Batches `(x, y)` are built on CPU, made `contiguous().pin_memory()`, and pushed into a bounded `queue.Queue`; `next_batch` dequeues and copies to device.
- Loaders come from a `make_train_loader()` factory; after an optimizer state rewind (e.g. the SWA branch), the existing prefetch thread is `shutdown()` before a fresh loader is created so the token stream does not advance in the background (a minimal sketch follows this list).
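For readers skimming the list above, here is a minimal sketch of the pinned async prefetch idea. It is a simplification, not the PR's `PrefetchingDistributedTokenLoader`: sharding, the int64 H2D helpers, and the `make_train_loader()` factory are omitted, and the class and argument names below are illustrative only.

```python
import queue
import threading
import torch

class PinnedPrefetchLoader:
    """Minimal sketch of a pinned async prefetch loader (illustrative only)."""

    def __init__(self, batch_iter, queue_depth: int = 2, device: str = "cuda"):
        self.device = device
        self.q: queue.Queue = queue.Queue(maxsize=queue_depth)  # bounded queue
        self.stop_event = threading.Event()
        self.copy_stream = torch.cuda.Stream()  # dedicated H2D copy stream
        self.thread = threading.Thread(target=self._producer, args=(batch_iter,), daemon=True)
        self.thread.start()

    def _producer(self, batch_iter):
        # Background thread: build (x, y) on CPU, make them contiguous and pinned, enqueue.
        for x, y in batch_iter:
            x = x.contiguous().pin_memory()
            y = y.contiguous().pin_memory()
            while not self.stop_event.is_set():
                try:
                    self.q.put((x, y), timeout=0.1)
                    break
                except queue.Full:
                    continue
            if self.stop_event.is_set():
                break

    def next_batch(self):
        # Dequeue a pinned CPU batch and copy it H2D on the dedicated stream;
        # the default stream then waits on the copy before the tensors are used.
        x_cpu, y_cpu = self.q.get()
        with torch.cuda.stream(self.copy_stream):
            x = x_cpu.to(self.device, non_blocking=True)
            y = y_cpu.to(self.device, non_blocking=True)
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        x.record_stream(torch.cuda.current_stream())
        y.record_stream(torch.cuda.current_stream())
        return x, y

    def shutdown(self):
        # Stop the producer so the token stream does not advance in the background
        # (mirrors the rewind behaviour described above).
        self.stop_event.set()
        self.thread.join(timeout=1.0)
```

Hypothetical usage: call `loader.next_batch()` in the training loop, and call `loader.shutdown()` before constructing a fresh loader after an optimizer-state rewind, matching the behaviour described in the list above.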
2. Fusion-friendly LeakyReLU² MLP

Base:
This submission:
Mathematically identical to LeakyReLU(0.5)² feeding the down projection; the change is purely layout / fusion hints for the compiled training graph, so Inductor can fuse or simplify more than before.
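As a hedged illustration of that equivalence (the actual base and submitted MLP code are not reproduced here; the function names below are made up for this sketch):

```python
import torch
import torch.nn.functional as F

def leaky_relu2_reference(x: torch.Tensor) -> torch.Tensor:
    # LeakyReLU with negative_slope=0.5, then squared, feeding the down projection.
    return F.leaky_relu(x, negative_slope=0.5).square()

def leaky_relu2_fusion_friendly(x: torch.Tensor) -> torch.Tensor:
    # Mathematically identical: x^2 for x >= 0 and (0.5*x)^2 = 0.25*x^2 otherwise,
    # written as plain elementwise ops that the compiler can fuse with neighbours.
    y = torch.where(x >= 0, x, 0.5 * x)
    return y * y

# quick equivalence check
x = torch.randn(4, 8)
assert torch.allclose(leaky_relu2_reference(x), leaky_relu2_fusion_friendly(x))
```

Both forms produce the same values; the point of the rewrite is only how the ops are expressed to the compiled graph.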
ENV
Same as the base run command, with optional prefetch toggles (defaults match the optimized script):
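A minimal sketch of how the toggles might be read from the environment (variable names and defaults are from the prefetch section above; the helper is hypothetical, and the actual parsing in the script may differ):

```python
import os

def _env_int(name: str, default: int) -> int:
    # Read an integer toggle from the environment, falling back to the default.
    return int(os.environ.get(name, default))

TRAIN_PREFETCH = _env_int("TRAIN_PREFETCH", 1)              # enable the background prefetch thread
TRAIN_PREFETCH_QUEUE = _env_int("TRAIN_PREFETCH_QUEUE", 2)  # bounded prefetch queue depth
TRAIN_COPY_STREAM = _env_int("TRAIN_COPY_STREAM", 1)        # dedicated CUDA stream for H2D copies
```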
Credits