Non-record: systems-fusion investigation + H-Net M1 pilot #1615
Open
diaslmb wants to merge 1 commit into openai:main
Part 1 (systems, negative result): Triton XSA fwd+bwd kernel + fused QKV on bigbag PR openai#1493. 200-step 1xH100 training pilot shows 0% speedup from QKV fusion and -6% regression from Triton XSA (autograd.Function graph break eats the kernel advantage). Inductor + max-autotune already at the D=512 kernel-fusion ceiling.

Part 2 (H-Net M1 pilot, signs of life): Hierarchical byte-level stack with fixed stride-4 chunker and byte-encoder->byte-decoder skip. Four pilot runs on 1xH100: val_bpb 4.49 (300 steps, no skip) -> 4.40 (1500 steps, no skip) -> 3.15 (1500 steps, +skip) -> 2.51 (4500 steps, +skip). Skip connection -1.25 bpb; 3x more data -0.64 bpb. 33.9M params; main network dominates at 31.7M. Proposes M2-M4 (learned chunker, 16MB recipe, ablations) as OpenAI dev-grant scope. Addresses the unchecked 'H-net tokenization' entry on the Requests-for-PRs list.
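The two deltas quoted above follow directly from the four val_bpb figures. A quick check, as a sketch only: the dictionary layout and the names `runs`, `delta`, `skip_gain`, `data_gain`, and `step_ratio` are illustrative and not taken from the submission, and reading "3x more data" as a 3x step count assumes a fixed number of bytes per step.

```python
# val_bpb figures quoted in the PR, keyed by (training steps, skip enabled).
runs = {
    (300, False): 4.49,
    (1500, False): 4.40,
    (1500, True): 3.15,
    (4500, True): 2.51,
}


def delta(before, after):
    """Drop in val_bpb going from run `before` to run `after`, rounded to 2 decimals."""
    return round(runs[before] - runs[after], 2)


# Same step count, skip connection toggled on.
skip_gain = delta((1500, False), (1500, True))

# Skip connection on in both runs, step count tripled.
data_gain = delta((1500, True), (4500, True))

# Assumption: data seen scales linearly with step count.
step_ratio = 4500 // 1500

print(f"skip: -{skip_gain} bpb, {step_ratio}x steps: -{data_gain} bpb")
```

The comparison is only clean for the skip connection, since those two runs share a step count; the 0.64 figure compares two runs that both have the skip enabled.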
What this PR is
Two contributions bundled as a single non-record submission, both on top of @bigbag's PR #1493 stack:
Part 1: systems-fusion investigation (negative result). I built Triton fwd+bwd kernels for XSA and a fused-QKV patch, then measured them end-to-end in a 200-step 1×H100 training pilot. Neither helps: torch.compile(max-autotune-no-cudagraphs) with FA3 is already at the kernel-fusion ceiling at D=512. The torch.autograd.Function wrapper for the Triton XSA creates a graph break whose cost exceeds the kernel's advantage. Full methodology, four phases of drift-controlled benchmarking, and reproduction scripts are shipped.
Part 2: H-Net Milestone 1 pilot (signs of life). Addresses the unchecked "H-net tokenization" entry on the repo's Requests-for-PRs list. Hierarchical byte-level stack (byte encoder + fixed stride-4 chunker + main network + byte-encoder→byte-decoder skip + byte decoder). Four pilot runs on 1×H100:
300 steps, no skip: val_bpb 4.49
1500 steps, no skip: val_bpb 4.40
1500 steps, +skip: val_bpb 3.15
4500 steps, +skip: val_bpb 2.51
The skip connection alone dropped val_bpb by 1.25 (at matched data); 3× more data dropped another 0.64. Loss is still decreasing at step 4500, no plateau.
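To make the data flow of the hierarchical stack concrete, here is a minimal pure-Python sketch of a fixed stride-4 chunker with an encoder-to-decoder skip. It is not the code in hnet_m1/*: the function names, the scalar "features", the mean pooling, the repeat-upsampling, and the additive skip are all assumptions chosen for illustration, and the main network is an identity stand-in.

```python
from typing import List

STRIDE = 4  # fixed chunk stride, as described in the PR


def encode_bytes(data: bytes) -> List[float]:
    """Stand-in byte encoder (assumption): one scalar feature per byte, in [0, 1]."""
    return [b / 255.0 for b in data]


def chunk(features: List[float], stride: int = STRIDE) -> List[float]:
    """Fixed-stride chunker (assumed mean pooling): one value per run of `stride` positions."""
    chunks = []
    for start in range(0, len(features), stride):
        window = features[start:start + stride]
        chunks.append(sum(window) / len(window))
    return chunks


def main_network(chunks: List[float]) -> List[float]:
    """Stand-in for the main network operating on chunk-level inputs; identity here."""
    return list(chunks)


def dechunk(chunks: List[float], length: int, stride: int = STRIDE) -> List[float]:
    """Upsample to byte resolution by repeating each chunk value across its positions."""
    return [chunks[i // stride] for i in range(length)]


def forward(data: bytes, use_skip: bool = True) -> List[float]:
    """Byte-level input to the byte decoder, with or without the encoder skip."""
    enc = encode_bytes(data)
    coarse = dechunk(main_network(chunk(enc)), len(enc))
    if use_skip:
        # Skip path (assumed additive): per-byte encoder features rejoin the coarse signal.
        return [c + e for c, e in zip(coarse, enc)]
    return coarse


sample = bytes([0, 255, 0, 255, 255, 255])
print(forward(sample, use_skip=False))
print(forward(sample, use_skip=True))
```

In this toy version, every position inside a chunk receives the same value when the skip is off, so the decoder cannot tell the bytes of a chunk apart from the coarse signal alone. That is one plausible reading of why the skip matters, not a claim about what the pilot measured.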
Read order
Start with README.md in the submission folder. It has the tl;dr, the full results tables, and the decision tree that led from Part 1 to Part 2.
Artifacts
xsa_triton.py, qkv_fuse.py, phase3_run.py, bench_scripts/*: Part 1.
hnet_m1/*: Part 2 (model factory, byte shard preprocessor, training script, phase4 orchestration).
bootstrap.sh, unpack.py: one-shot pod setup.
hnet_scope.md: detailed M2–M4 design for the grant-funded work.
Grant note
Applying for the $500 OpenAI dev grant for M2 (learned chunker) through M4 (ablations). Details and per-milestone abort criteria are in hnet_scope.md. Total grant-funded GPU spend is projected at ~$500.
Credits