
Non-record: systems-fusion investigation + H-Net M1 pilot #1615

Open
diaslmb wants to merge 1 commit into openai:main from diaslmb:hnet-m1-and-systems-investigation

Conversation


diaslmb commented Apr 14, 2026

What this PR is

Two contributions bundled as a single non-record submission, both on top of @bigbag's PR #1493 stack:

  1. Part 1 — systems-fusion investigation (negative result). I built Triton fwd+bwd kernels for XSA and a fused-QKV patch, then measured them end-to-end in a 200-step 1×H100 training pilot. Neither helps: torch.compile(max-autotune-no-cudagraphs) with FA3 is already at the kernel-fusion ceiling at D=512, and the torch.autograd.Function wrapper around the Triton XSA introduces a graph break whose cost exceeds the kernel's advantage (a toy sketch of this pattern appears under Credits below). Full methodology, four phases of drift-controlled benchmarking, and reproduction scripts are included.

  2. Part 2 — H-Net Milestone 1 pilot (signs of life). Addresses the unchecked "H-net tokenization" entry on the repo's Requests-for-PRs list. Hierarchical byte-level stack (byte encoder + fixed stride-4 chunker + main network + byte-encoder→byte-decoder skip + byte decoder); a minimal code sketch follows the results below. Four pilot runs on 1×H100:

run     steps   tokens   skip   val_bpb
pilot     300     10 M     no      4.49
long     1500     49 M     no      4.40
+skip    1500     49 M    yes      3.15
final    4500    147 M    yes      2.51

The skip connection alone dropped val_bpb by 1.25 (at matched data); 3× more data dropped another 0.64. Loss is still decreasing at step 4500, no plateau.
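
For concreteness, here is a minimal sketch of the M1 stack in PyTorch. Everything in it is illustrative: the dimensions, depth, and layer choices are assumptions (the real model is 33.9M params with the main network at 31.7M, and lives in hnet_m1/*), and causal masking is omitted for brevity. It shows where the fixed stride-4 chunker sits and what the byte-encoder→byte-decoder skip adds back:

```python
import torch
import torch.nn as nn

class HNetM1Sketch(nn.Module):
    """Illustrative M1 stack; sizes are assumptions, not the real config."""

    def __init__(self, d_byte=64, d_main=512, stride=4, vocab=256):
        super().__init__()
        self.stride = stride
        self.byte_encoder = nn.Embedding(vocab, d_byte)      # bytes -> vectors
        self.chunk = nn.Linear(d_byte * stride, d_main)      # fixed stride-4 chunker
        self.main = nn.TransformerEncoder(                   # main network (causal
            nn.TransformerEncoderLayer(d_main, nhead=8,      # masking omitted)
                                       batch_first=True),
            num_layers=2,
        )
        self.dechunk = nn.Linear(d_main, d_byte * stride)    # back to byte rate
        self.byte_decoder = nn.Linear(d_byte, vocab)         # per-byte logits

    def forward(self, byte_ids):                             # (B, T), T % stride == 0
        B, T = byte_ids.shape
        h_byte = self.byte_encoder(byte_ids)                 # (B, T, d_byte)
        chunks = h_byte.reshape(B, T // self.stride, -1)     # group 4 bytes per chunk
        h_main = self.main(self.chunk(chunks))               # (B, T/4, d_main)
        h_up = self.dechunk(h_main).reshape(B, T, -1)        # (B, T, d_byte)
        h = h_up + h_byte    # byte-encoder -> byte-decoder skip
        return self.byte_decoder(h)                          # (B, T, 256)

model = HNetM1Sketch()
logits = model(torch.randint(0, 256, (2, 64)))               # (2, 64, 256)
```

The role of the skip is visible in the second-to-last line of forward: without it, every byte's logits must be reconstructed purely from the 4×-downsampled main path, which is consistent with the 1.25 bpb gap observed at matched data.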

Read order

Start with README.md in the submission folder. It has the tl;dr, the full results tables, and the decision tree that led from Part 1 to Part 2.

Artifacts

  • xsa_triton.py, qkv_fuse.py, phase3_run.py, bench_scripts/* — Part 1.
  • hnet_m1/* — Part 2 (model factory, byte shard preprocessor, training script, phase4 orchestration).
  • bootstrap.sh, unpack.py — one-shot pod setup.
  • hnet_scope.md — detailed M2–M4 design for the grant-funded work.

Grant note

Applying for the $500 OpenAI dev grant for M2 (learned chunker) through M4 (ablations). Details and per-milestone abort criteria in hnet_scope.md. Total grant-funded GPU spend projected at ~$500.

Credits

Part 1 (systems, negative result): Triton XSA fwd+bwd kernel + fused QKV on bigbag PR openai#1493. A 200-step 1×H100 training pilot shows 0% speedup from QKV fusion and a −6% regression from Triton XSA (the autograd.Function graph break eats the kernel advantage). Inductor + max-autotune is already at the D=512 kernel-fusion ceiling.
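
To make that failure mode concrete, here is a toy, hypothetical stand-in for the pattern — not the actual xsa_triton.py code. A torch.autograd.Function whose forward launches an opaque custom kernel can (depending on PyTorch version) force Dynamo to split the compiled graph around the .apply call, so the ops on either side lose their fusion partners:

```python
import torch

class CustomKernelOp(torch.autograd.Function):
    """Stand-in for the Triton XSA wrapper; the real fwd/bwd launch Triton kernels."""

    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return x * w                       # imagine a fused Triton launch here

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        return grad_out * w, grad_out * x

def block(x, w):
    x = torch.nn.functional.gelu(x)        # compiled region 1
    x = CustomKernelOp.apply(x, w)         # opaque call -> possible graph break
    return x.mean()                        # compiled region 2

# Run with TORCH_LOGS="graph_breaks" to see where Dynamo splits the graph.
compiled = torch.compile(block, mode="max-autotune-no-cudagraphs")
loss = compiled(torch.randn(1024, requires_grad=True), torch.randn(1024))
loss.backward()
```

When the split regions are small, the fixed per-region launch and guard overhead can exceed the custom kernel's raw win, consistent with the −6% end-to-end number.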

Part 2 (H-Net M1 pilot, signs of life): Hierarchical byte-level stack with fixed stride-4 chunker and byte-encoder→byte-decoder skip. Four pilot runs on 1×H100: val_bpb 4.49 (300 steps, no skip) → 4.40 (1500 steps, no skip) → 3.15 (1500 steps, +skip) → 2.51 (4500 steps, +skip). Skip connection: −1.25 bpb; 3× more data: −0.64 bpb. 33.9M params, with the main network dominating at 31.7M. Proposes M2–M4 (learned chunker, 16MB recipe, ablations) as OpenAI dev-grant scope.

Addresses the unchecked 'H-net tokenization' entry on the Requests-for-PRs list.
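
For readers new to the metric: for a byte-level model (one token per byte), val_bpb is the mean validation cross-entropy converted from nats to bits. A minimal helper showing the conversion:

```python
import math

def bits_per_byte(mean_ce_nats: float) -> float:
    """Convert mean cross-entropy in nats/byte to bits/byte."""
    return mean_ce_nats / math.log(2)

# The final 2.51 bpb above corresponds to ~1.74 nats/byte of cross-entropy.
print(bits_per_byte(1.74))  # ~2.51
```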
