
Record: PR #1105 + window attn + mixed seq_len — 1.1084 bpb (3-seed mean) #1219

Open
Gusanidas wants to merge 1 commit into openai:main from Gusanidas:apr_1

Conversation

@Gusanidas

Based on PR #1105 (abaybektursun) with these changes:

  • Causal n-gram fix (within_hint/word_hint prefix-only)
  • Window attention (size=512) on layers 2,4,6,8,10 via FA3
  • Mixed seq_len training: 5 GPUs at 2048x36 + 3 GPUs at 6144x10
  • Train-data GPTQ calibration (14s vs 220s AR self-gen)
  • Auto eval_seq_len detection from max train seq_len
  • Sliding window eval at seq_len=6144, stride=128
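The stride-gated sliding-window evaluation in the last bullet can be sketched as follows. This is a minimal illustration, not the PR's actual eval code; the function name and span layout are assumptions. Each window of up to `seq_len` tokens is scored only on its last `stride` tokens, except the first window, which is scored in full:

```python
def sliding_window_spans(n_tokens, seq_len=6144, stride=128):
    """Return (window_start, score_start, window_end) spans so that
    every token is scored exactly once under stride-gated scoring."""
    spans = []
    ws = 0
    while ws < n_tokens:
        wlen = min(seq_len, n_tokens - ws)
        # first window scores everything; later windows only the new tail
        s = 0 if ws == 0 else max(wlen - stride, 0)
        spans.append((ws, ws + s, ws + wlen))
        if ws + wlen >= n_tokens:
            break
        ws += stride
    return spans
```

When `n_tokens - seq_len` is a multiple of `stride`, the scored regions tile the sequence exactly, and summing per-span losses over the scored tokens gives the sliding-window bpb.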

3-seed results (sliding window bpb):
seed 1337: 1.1077
seed 42: 1.1083
seed 7: 1.1091
mean: 1.1084 (vs leader 1.1147)
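As a quick arithmetic check, the reported mean follows directly from the three seed results:

```python
seed_bpb = {1337: 1.1077, 42: 1.1083, 7: 1.1091}
mean_bpb = sum(seed_bpb.values()) / len(seed_bpb)
print(round(mean_bpb, 4))  # 1.1084
```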

There is still plenty of room for further optimization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 9, 2026
Thesis: the speed path is the most underutilized section of openai/parameter-golf.
The quality path has 170+ PRs; the speed path has maybe 30 and 2-3 genuine novelties.
Our 13x per-GPU gap vs comp records is almost entirely soft — most of it collapses
under free wins + comp ports.

Findings:

TIER 0 FREE WINS (before any kernel work) — ~3x speedup, 2-3 days total:
- Shot 0a: drop grad_accum_steps 8→1. The single biggest easy win hiding in
  plain sight. We're paying 8x kernel-launch overhead because grad_accum was
  inherited from an 8xGPU distributed config. 5 LOC, 30-50% speedup.
- Shot 0b: eval batched + streaming KV cache. Current sliding-window eval is
  625K sequential forwards at B=1 stride=64. 97% of each window's context is
  shared with the previous. Streaming KV (StreamingLLM arXiv 2309.17453) gives
  5-15x eval speedup, saves 3-5 min of the 600s budget.
- Shot 0c: SkyLadder progressive seq_len 256→2048 (NeurIPS 2025 arXiv
  2503.15450). 22% throughput + 1-3.7% quality. Already in Mac SETUP §35
  backlog, never shipped.
- Shot 0d: train-data GPTQ calibration (PR openai#1219, comp-organizer-approved).
  Replaces 220s AR self-gen with 14s. +2000 extra training steps.
- Free: TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 + torch 2.9.1 pin. +8.8% step time.
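On Shot 0a: for a mean-reduced loss, averaging gradients over 8 microbatches (what grad_accum_steps=8 computes) is mathematically identical to one gradient on the fused batch, so dropping to grad_accum_steps=1 changes only the kernel-launch count, not the optimization trajectory. A minimal NumPy sketch of that equivalence on a toy linear model (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))      # fused batch of 64 examples
y = rng.standard_normal(64)
w = rng.standard_normal(8)

def mse_grad(Xb, yb, w):
    # gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# grad_accum_steps=8: average of 8 microbatch gradients (8 examples each)
g_accum = np.mean(
    [mse_grad(X[i * 8:(i + 1) * 8], y[i * 8:(i + 1) * 8], w) for i in range(8)],
    axis=0,
)
# grad_accum_steps=1: one gradient on the fused batch
g_fused = mse_grad(X, y, w)
assert np.allclose(g_accum, g_fused)
```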

TIER 2 COMP-PORT WINS we missed in the original Phase 2 plan:
- Shot 9: FA3 varlen + window + mixed seq_len across GPUs (PR openai#1212 holds the
  fastest step in the leaderboard at 69.6 ms/step)
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399): 66 nn.Linear → 4
  contiguous 3D banks → Newton-Schulz becomes one bmm → optimizer time 19.7 ms
  → 1.3 ms (15x). World-novel, NOT in modded-nanogpt.
- Shot 11: CUTLASS EVT backward with the novel `post=0.5·act_grad·pre` identity
  (PRs openai#1105, openai#1420). Identity itself looks world-novel.
- Shots 13-14: eval path wins (Triton KV-cache backend, fused softcap+CE
  megakernel). Combined eval speedup ~5x on top of Shot 0b.
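The point of Shot 10 is that once the 66 nn.Linear weights live in contiguous 3D banks, one Newton-Schulz step covers every matrix in a bank as a single batched matmul. A hedged NumPy sketch using the classic cubic iteration X <- X(3I - X^T X)/2 (Muon itself uses a tuned quintic variant, and PR openai#399's bank layout may differ):

```python
import numpy as np

def newton_schulz_bank(bank, steps=20):
    """Orthogonalize every matrix in a (B, n, n) bank at once.

    np.matmul broadcasts over the leading bank dimension, the NumPy
    analog of torch.bmm, so the whole bank advances per iteration."""
    # normalize by Frobenius norm so singular values land in the
    # convergence region (0, sqrt(3)) of the cubic iteration
    X = bank / np.linalg.norm(bank, axis=(1, 2), keepdims=True)
    I = np.eye(bank.shape[-1])
    for _ in range(steps):
        X = X @ (3.0 * I - np.swapaxes(X, -1, -2) @ X) / 2.0
    return X

# build a small bank of well-conditioned 4x4 matrices for demonstration
rng = np.random.default_rng(0)
u, _, vt = np.linalg.svd(rng.standard_normal((3, 4, 4)))
bank = u @ (np.array([1.0, 0.7, 0.4, 0.2]) * np.eye(4)) @ vt
Q = newton_schulz_bank(bank)
```

After a handful of iterations each `Q[i]` is orthogonal to near machine precision, which is exactly the per-matrix work Muon needs, done bank-wide in one bmm per step.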

TIER 3 BIG DREAMS (world-first opportunities):
- Megadream 1: **Training megakernel** (fwd+bwd+optim in a single persistent
  SM kernel). HazyResearch / Mirage / MegaQwen have inference megakernels;
  nobody has built one for TRAINING. 1.3us × ~600 launches per step = 16% of
  our step budget is pure launch overhead. 5-7 days, 500-1500 LOC, ThunderKittens
  templates. Potential PhD-defensible mini-paper.
- Megadream 2: **Streaming KV sliding-window eval** (our Shot 0b, also novel)
- Megadream 3: **Fuzzy LR bandit per microbatch** — user's "dial-in" hint
  operationalized. Thompson sampling from {0.5x, 1x, 2x} * base_lr. 80 LOC.
- Megadream 4: **CPU n-gram precompute thread** — user's "CPU while GPU" hint
  operationalized. BG thread pre-computes n-gram hash tensors, 50 LOC.
- Megadream 5: **GPU-resident successive halving** — user's "GPU tests" hint
  operationalized. Run 4 replicas × 100 steps inside the 600s budget, pick
  winner, continue. Online hyperband. 200 LOC.
- Megadream 6: **AOTInductor precompile + binary ship** — kill the 5+ min
  compile cold-start permanently.
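Megadream 3's per-microbatch LR bandit really is only a few dozen lines. The sketch below is a hypothetical illustration, not code from this repo: Gaussian Thompson sampling over the multipliers {0.5x, 1x, 2x}, where the arm set, posterior, and reward signal (per-microbatch loss drop) are all assumptions:

```python
import random

class LRBandit:
    """Thompson sampling over LR multipliers (illustrative sketch)."""

    def __init__(self, arms=(0.5, 1.0, 2.0)):
        self.arms = arms
        self.pulls = [0] * len(arms)     # times each arm was chosen
        self.total = [0.0] * len(arms)   # summed reward per arm

    def pick(self):
        # one draw from each arm's Gaussian posterior
        # (zero-mean unit prior, variance shrinking as 1/(n+1))
        draws = [
            random.gauss(self.total[i] / (self.pulls[i] + 1),
                         1.0 / (self.pulls[i] + 1) ** 0.5)
            for i in range(len(self.arms))
        ]
        return max(range(len(self.arms)), key=draws.__getitem__)

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.total[arm] += reward

# usage per microbatch: lr = base_lr * bandit.arms[arm], then reward
# the arm with the observed loss drop, e.g. prev_loss - cur_loss
bandit = LRBandit()
arm = bandit.pick()
```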

Stacked expected impact:
- Phase 1 (now): 180 steps / 600s, val_bpb ~1.4-1.6
- +Tier 0 free wins: ~540 steps, val_bpb ~1.25-1.35
- +Tier 1 kernel work: ~2000 steps, val_bpb ~1.15-1.22
- +Tier 2 comp ports: ~4000 steps, val_bpb ~1.10-1.15
- +Tier 3 Megadream 1 (training megakernel): ~8000 steps, val_bpb ~1.08-1.12
- +Tier 3 all: ~10000 steps, val_bpb ~1.06-1.10 (**ahead of comp on 1xH100**)

10000 steps on 1xH100 = 4x more per-GPU training than the comp's 20000 on 8xH100.
That's where val_bpb drops BELOW comp records.

Key finding: eval path holds the biggest speed wins currently, not training.
Our sliding-window eval eats 10-15 min of the 600s budget. Tier 0b + Tier 2
Shots 13-14 save 5-8 min per eval pass. More than any training-side single
patch would buy at our current rate.
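The streaming-KV idea behind Shot 0b / Megadream 2 (StreamingLLM, arXiv 2309.17453) is to keep a few attention-sink tokens plus a recent window, so the 97%-overlapping eval windows reuse cache instead of recomputing. A minimal sketch of the eviction policy only (names and sizes are illustrative, not the repo's eval code):

```python
def evict_kv(positions, n_sink=4, window=512):
    """StreamingLLM-style cache policy: keep the first n_sink token
    positions (attention sinks) plus the most recent `window` positions."""
    if len(positions) <= n_sink + window:
        return list(positions)
    return list(positions[:n_sink]) + list(positions[-window:])
```

In a real eval loop the K/V tensors at the evicted positions are dropped, so each new stride of tokens costs one forward over `stride` tokens rather than a full seq_len-token window.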

Source reports: /tmp/phase2_comp_speed_audit.md (22 PRs surveyed),
/tmp/phase2_world_speed_research.md (12 research areas surveyed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: PR #1105 + window attn + mixed seq_len — 1.1084 bpb (3-seed mean)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

## Summary

PR #1219 ("WindowAttn_MixedSeq") is a pure neural submission. No illegal techniques detected.

## Analysis

### 1. N-gram family bug check (target XOR'd into hash key)

`BigramHashEmbedding.bigram_hash()` (lines 1310-1315) hashes only input tokens (context): `xor(36313 * t[..., 1:], 27191 * t[..., :-1])`. The target (`y`) is never incorporated into any hash key. `EngramLite.forward()` (lines 1348-1364) similarly hashes only `input_ids` and padded `prev_ids`/`pp_ids` — the future target token is never part of the hash. No n-gram family bug.

### 2. Pre-Quant TTT check (multi-epoch on val_tokens without score-first)

`val_tokens` appears only in inference-mode evaluation calls: `eval_val()` (line 457) and `eval_val_sliding()` (line 1687). Both functions run under `torch.inference_mode()` and `model.eval()`. There is no `backward()`, `optimizer.step()`, or gradient flow anywhere near val data. No test-time training of any kind is present. No Pre-Quant TTT.

### 3. Score-first TTT / is_last_chunk guard

No TTT mechanism exists in this PR at all — no adapter fine-tuning, no gradient-based adaptation at inference, no `is_last_chunk` logic. Not applicable.

### 4. Scored-region SLOT

No scored-region slot mechanism detected. The sliding-window eval at line 1687 correctly uses stride-gated scoring (`s = 0 if ws == 0 else max(wlen - stride, 0)`), but this is standard sliding-window evaluation, not a competition slot exploit.

### 5. Architecture summary

Pure transformer with: window attention (`WINDOW_SIZE` / `WINDOW_ATTN_LAYERS`), bigram/trigram hash embeddings (`BigramHashEmbedding`), optional `EngramLite` multi-head n-gram embeddings, sliding-window final eval, and post-training GPTQ-lite int6 quantization. All techniques are legal neural architecture choices. No access to future tokens, no val-data gradient...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka / The Agora. Compliance audit performed by an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

