
Record: PR #1105 + window attn + mixed seq_len — 1.1084 bpb (3-seed mean) #1219

Open
Gusanidas wants to merge 1 commit into openai:main from Gusanidas:apr_1

Conversation

@Gusanidas

Based on PR #1105 (abaybektursun) with these changes:

  • Causal n-gram fix (within_hint/word_hint prefix-only)
  • Window attention (size=512) on layers 2,4,6,8,10 via FA3
  • Mixed seq_len training: 5 GPUs at 2048x36 + 3 GPUs at 6144x10
  • Train-data GPTQ calibration (14s vs 220s AR self-gen)
  • Auto eval_seq_len detection from max train seq_len
  • Sliding window eval at seq_len=6144, stride=128
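The stride-gated sliding-window evaluation in the last bullet can be sketched as follows. This is a minimal illustration, not the PR's actual eval code; the function name and span layout are assumptions. Each window of up to `seq_len` tokens is scored only on its last `stride` tokens, except the first window, which is scored in full:

```python
def sliding_window_spans(n_tokens, seq_len=6144, stride=128):
    """Return (window_start, score_start, window_end) spans so that
    every token is scored exactly once under stride-gated scoring."""
    spans = []
    ws = 0
    while ws < n_tokens:
        wlen = min(seq_len, n_tokens - ws)
        # first window scores everything; later windows only the new tail
        s = 0 if ws == 0 else max(wlen - stride, 0)
        spans.append((ws, ws + s, ws + wlen))
        if ws + wlen >= n_tokens:
            break
        ws += stride
    return spans
```

When `n_tokens - seq_len` is a multiple of `stride`, the scored regions tile the sequence exactly, and summing per-span losses over the scored tokens gives the sliding-window bpb.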

3-seed results (sliding window bpb):
seed 1337: 1.1077
seed 42: 1.1083
seed 7: 1.1091
mean: 1.1084 (vs leader 1.1147)
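As a quick arithmetic check, the reported mean follows directly from the three seed results:

```python
seed_bpb = {1337: 1.1077, 42: 1.1083, 7: 1.1091}
mean_bpb = sum(seed_bpb.values()) / len(seed_bpb)
print(round(mean_bpb, 4))  # 1.1084
```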

There is still plenty of room for further optimization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 9, 2026
Thesis: the speed path is the most underutilized section of openai/parameter-golf.
The quality path has 170+ PRs; the speed path has maybe 30 and 2-3 genuine novelties.
Our 13x per-GPU gap vs comp records is almost entirely soft — most of it collapses
under free wins + comp ports.

Findings:

TIER 0 FREE WINS (before any kernel work) — ~3x speedup, 2-3 days total:
- Shot 0a: drop grad_accum_steps 8→1. The single biggest easy win hiding in
  plain sight. We're paying 8x kernel-launch overhead because grad_accum was
  inherited from an 8xGPU distributed config. 5 LOC, 30-50% speedup.
- Shot 0b: eval batched + streaming KV cache. Current sliding-window eval is
  625K sequential forwards at B=1 stride=64. 97% of each window's context is
  shared with the previous. Streaming KV (StreamingLLM arXiv 2309.17453) gives
  5-15x eval speedup, saves 3-5 min of the 600s budget.
- Shot 0c: SkyLadder progressive seq_len 256→2048 (NeurIPS 2025 arXiv
  2503.15450). 22% throughput + 1-3.7% quality. Already in Mac SETUP §35
  backlog, never shipped.
- Shot 0d: train-data GPTQ calibration (PR openai#1219, comp-organizer-approved).
  Replaces 220s AR self-gen with 14s. +2000 extra training steps.
- Free: TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 + torch 2.9.1 pin. +8.8% step time.
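On Shot 0a: for a mean-reduced loss, averaging gradients over 8 microbatches (what grad_accum_steps=8 computes) is mathematically identical to one gradient on the fused batch, so dropping to grad_accum_steps=1 changes only the kernel-launch count, not the optimization trajectory. A minimal NumPy sketch of that equivalence on a toy linear model (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))      # fused batch of 64 examples
y = rng.standard_normal(64)
w = rng.standard_normal(8)

def mse_grad(Xb, yb, w):
    # gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# grad_accum_steps=8: average of 8 microbatch gradients (8 examples each)
g_accum = np.mean(
    [mse_grad(X[i * 8:(i + 1) * 8], y[i * 8:(i + 1) * 8], w) for i in range(8)],
    axis=0,
)
# grad_accum_steps=1: one gradient on the fused batch
g_fused = mse_grad(X, y, w)
assert np.allclose(g_accum, g_fused)
```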

TIER 2 COMP-PORT WINS we missed in the original Phase 2 plan:
- Shot 9: FA3 varlen + window + mixed seq_len across GPUs (PR openai#1212 holds the
  fastest step in the leaderboard at 69.6 ms/step)
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399): 66 nn.Linear → 4
  contiguous 3D banks → Newton-Schulz becomes one bmm → optimizer time 19.7 ms
  → 1.3 ms (15x). World-novel, NOT in modded-nanogpt.
- Shot 11: CUTLASS EVT backward with the novel `post=0.5·act_grad·pre` identity
  (PRs openai#1105, openai#1420). Identity itself looks world-novel.
- Shots 13-14: eval path wins (Triton KV-cache backend, fused softcap+CE
  megakernel). Combined eval speedup ~5x on top of Shot 0b.
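The point of Shot 10 is that once the 66 nn.Linear weights live in contiguous 3D banks, one Newton-Schulz step covers every matrix in a bank as a single batched matmul. A hedged NumPy sketch using the classic cubic iteration X <- X(3I - X^T X)/2 (Muon itself uses a tuned quintic variant, and PR openai#399's bank layout may differ):

```python
import numpy as np

def newton_schulz_bank(bank, steps=20):
    """Orthogonalize every matrix in a (B, n, n) bank at once.

    np.matmul broadcasts over the leading bank dimension, the NumPy
    analog of torch.bmm, so the whole bank advances per iteration."""
    # normalize by Frobenius norm so singular values land in the
    # convergence region (0, sqrt(3)) of the cubic iteration
    X = bank / np.linalg.norm(bank, axis=(1, 2), keepdims=True)
    I = np.eye(bank.shape[-1])
    for _ in range(steps):
        X = X @ (3.0 * I - np.swapaxes(X, -1, -2) @ X) / 2.0
    return X

# build a small bank of well-conditioned 4x4 matrices for demonstration
rng = np.random.default_rng(0)
u, _, vt = np.linalg.svd(rng.standard_normal((3, 4, 4)))
bank = u @ (np.array([1.0, 0.7, 0.4, 0.2]) * np.eye(4)) @ vt
Q = newton_schulz_bank(bank)
```

After a handful of iterations each `Q[i]` is orthogonal to near machine precision, which is exactly the per-matrix work Muon needs, done bank-wide in one bmm per step.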

TIER 3 BIG DREAMS (world-first opportunities):
- Megadream 1: **Training megakernel** (fwd+bwd+optim in a single persistent
  SM kernel). HazyResearch / Mirage / MegaQwen have inference megakernels;
  nobody has built one for TRAINING. 1.3us × ~600 launches per step = 16% of
  our step budget is pure launch overhead. 5-7 days, 500-1500 LOC, ThunderKittens
  templates. Potential PhD-defensible mini-paper.
- Megadream 2: **Streaming KV sliding-window eval** (our Shot 0b, also novel)
- Megadream 3: **Fuzzy LR bandit per microbatch** — user's "dial-in" hint
  operationalized. Thompson sampling from {0.5x, 1x, 2x} * base_lr. 80 LOC.
- Megadream 4: **CPU n-gram precompute thread** — user's "CPU while GPU" hint
  operationalized. BG thread pre-computes n-gram hash tensors, 50 LOC.
- Megadream 5: **GPU-resident successive halving** — user's "GPU tests" hint
  operationalized. Run 4 replicas × 100 steps inside the 600s budget, pick
  winner, continue. Online hyperband. 200 LOC.
- Megadream 6: **AOTInductor precompile + binary ship** — kill the 5+ min
  compile cold-start permanently.
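Megadream 3's per-microbatch LR bandit really is only a few dozen lines. The sketch below is a hypothetical illustration, not code from this repo: Gaussian Thompson sampling over the multipliers {0.5x, 1x, 2x}, where the arm set, posterior, and reward signal (per-microbatch loss drop) are all assumptions:

```python
import random

class LRBandit:
    """Thompson sampling over LR multipliers (illustrative sketch)."""

    def __init__(self, arms=(0.5, 1.0, 2.0)):
        self.arms = arms
        self.pulls = [0] * len(arms)     # times each arm was chosen
        self.total = [0.0] * len(arms)   # summed reward per arm

    def pick(self):
        # one draw from each arm's Gaussian posterior
        # (zero-mean unit prior, variance shrinking as 1/(n+1))
        draws = [
            random.gauss(self.total[i] / (self.pulls[i] + 1),
                         1.0 / (self.pulls[i] + 1) ** 0.5)
            for i in range(len(self.arms))
        ]
        return max(range(len(self.arms)), key=draws.__getitem__)

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.total[arm] += reward

# usage per microbatch: lr = base_lr * bandit.arms[arm], then reward
# the arm with the observed loss drop, e.g. prev_loss - cur_loss
bandit = LRBandit()
arm = bandit.pick()
```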

Stacked expected impact:
- Phase 1 (now): 180 steps / 600s, val_bpb ~1.4-1.6
- +Tier 0 free wins: ~540 steps, val_bpb ~1.25-1.35
- +Tier 1 kernel work: ~2000 steps, val_bpb ~1.15-1.22
- +Tier 2 comp ports: ~4000 steps, val_bpb ~1.10-1.15
- +Tier 3 Megadream 1 (training megakernel): ~8000 steps, val_bpb ~1.08-1.12
- +Tier 3 all: ~10000 steps, val_bpb ~1.06-1.10 (**ahead of comp on 1xH100**)

10000 steps on 1xH100 = 4x more per-GPU training than the comp's 20000 on 8xH100.
That's where val_bpb drops BELOW comp records.

Key finding: eval path holds the biggest speed wins currently, not training.
Our sliding-window eval eats 10-15 min of the 600s budget. Tier 0b + Tier 2
Shots 13-14 save 5-8 min per eval pass. More than any training-side single
patch would buy at our current rate.
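The streaming-KV idea behind Shot 0b / Megadream 2 (StreamingLLM, arXiv 2309.17453) is to keep a few attention-sink tokens plus a recent window, so the 97%-overlapping eval windows reuse cache instead of recomputing. A minimal sketch of the eviction policy only (names and sizes are illustrative, not the repo's eval code):

```python
def evict_kv(positions, n_sink=4, window=512):
    """StreamingLLM-style cache policy: keep the first n_sink token
    positions (attention sinks) plus the most recent `window` positions."""
    if len(positions) <= n_sink + window:
        return list(positions)
    return list(positions[:n_sink]) + list(positions[-window:])
```

In a real eval loop the K/V tensors at the evicted positions are dropped, so each new stride of tokens costs one forward over `stride` tokens rather than a full seq_len-token window.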

Source reports: /tmp/phase2_comp_speed_audit.md (22 PRs surveyed),
/tmp/phase2_world_speed_research.md (12 research areas surveyed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: PR #1105 + window attn + mixed seq_len — 1.1084 bpb (3-seed mean)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

## Summary

PR #1219 ("WindowAttn_MixedSeq") is a pure neural submission. No illegal techniques detected.

## Analysis

### 1. N-gram family bug check (target XOR'd into hash key)

`BigramHashEmbedding.bigram_hash()` (lines 1310-1315) hashes only input tokens (context): `xor(36313 * t[..., 1:], 27191 * t[..., :-1])`. The target (`y`) is never incorporated into any hash key. `EngramLite.forward()` (lines 1348-1364) similarly hashes only `input_ids` and padded `prev_ids`/`pp_ids` — the future target token is never part of the hash. No n-gram family bug.

### 2. Pre-Quant TTT check (multi-epoch on val_tokens without score-first)

`val_tokens` appears only in inference-mode evaluation calls: `eval_val()` (line 457) and `eval_val_sliding()` (line 1687). Both functions run under `torch.inference_mode()` and `model.eval()`. There is no `backward()`, `optimizer.step()`, or gradient flow anywhere near val data. No test-time training of any kind is present. No Pre-Quant TTT.

### 3. Score-first TTT / is_last_chunk guard

No TTT mechanism exists in this PR at all — no adapter fine-tuning, no gradient-based adaptation at inference, no `is_last_chunk` logic. Not applicable.

### 4. Scored-region SLOT

No scored-region slot mechanism detected. The sliding-window eval at line 1687 correctly uses stride-gated scoring (`s = 0 if ws == 0 else max(wlen - stride, 0)`), but this is standard sliding-window evaluation, not a competition slot exploit.

### 5. Architecture summary

Pure transformer with: window attention (`WINDOW_SIZE` / `WINDOW_ATTN_LAYERS`), bigram/trigram hash embeddings (`BigramHashEmbedding`), optional `EngramLite` multi-head n-gram embeddings, sliding-window final eval, and post-training GPTQ-lite int6 quantization. All techniques are legal neural architecture choices. No access to future tokens, no val-data gradient...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka / The Agora. Compliance audit performed by an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

