
Non-record: systems-fusion investigation + H-Net M1 pilot #1615

Open
diaslmb wants to merge 1 commit into openai:main from diaslmb:hnet-m1-and-systems-investigation

Conversation


diaslmb commented Apr 14, 2026

What this PR is

Two contributions bundled as a single non-record submission, both on top of @bigbag's PR #1493 stack:

  1. Part 1 — systems-fusion investigation (negative result). I built Triton fwd+bwd kernels for XSA and a fused-QKV patch, then measured them end-to-end in a 200-step 1×H100 training pilot. Neither helps: torch.compile(max-autotune-no-cudagraphs) with FA3 is already at the kernel-fusion ceiling at D=512, and the torch.autograd.Function wrapper around the Triton XSA introduces a graph break whose cost exceeds the kernel's advantage (a toy sketch of this pattern appears under Credits below). Full methodology, four phases of drift-controlled benchmarking, and reproduction scripts are included.

  2. Part 2 — H-Net Milestone 1 pilot (signs of life). Addresses the unchecked "H-net tokenization" entry on the repo's Requests-for-PRs list. Hierarchical byte-level stack (byte encoder + fixed stride-4 chunker + main network + byte-encoder→byte-decoder skip + byte decoder); a minimal code sketch follows the results below. Four pilot runs on 1×H100:

run     steps   tokens   skip   val_bpb
pilot     300     10 M     no      4.49
long     1500     49 M     no      4.40
+skip    1500     49 M    yes      3.15
final    4500    147 M    yes      2.51

The skip connection alone dropped val_bpb by 1.25 (at matched data); 3× more data dropped another 0.64. Loss is still decreasing at step 4500, no plateau.
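
For concreteness, here is a minimal sketch of the M1 stack in PyTorch. Everything in it is illustrative: the dimensions, depth, and layer choices are assumptions (the real model is 33.9M params with the main network at 31.7M, and lives in hnet_m1/*), and causal masking is omitted for brevity. It shows where the fixed stride-4 chunker sits and what the byte-encoder→byte-decoder skip adds back:

```python
import torch
import torch.nn as nn

class HNetM1Sketch(nn.Module):
    """Illustrative M1 stack; sizes are assumptions, not the real config."""

    def __init__(self, d_byte=64, d_main=512, stride=4, vocab=256):
        super().__init__()
        self.stride = stride
        self.byte_encoder = nn.Embedding(vocab, d_byte)      # bytes -> vectors
        self.chunk = nn.Linear(d_byte * stride, d_main)      # fixed stride-4 chunker
        self.main = nn.TransformerEncoder(                   # main network (causal
            nn.TransformerEncoderLayer(d_main, nhead=8,      # masking omitted)
                                       batch_first=True),
            num_layers=2,
        )
        self.dechunk = nn.Linear(d_main, d_byte * stride)    # back to byte rate
        self.byte_decoder = nn.Linear(d_byte, vocab)         # per-byte logits

    def forward(self, byte_ids):                             # (B, T), T % stride == 0
        B, T = byte_ids.shape
        h_byte = self.byte_encoder(byte_ids)                 # (B, T, d_byte)
        chunks = h_byte.reshape(B, T // self.stride, -1)     # group 4 bytes per chunk
        h_main = self.main(self.chunk(chunks))               # (B, T/4, d_main)
        h_up = self.dechunk(h_main).reshape(B, T, -1)        # (B, T, d_byte)
        h = h_up + h_byte    # byte-encoder -> byte-decoder skip
        return self.byte_decoder(h)                          # (B, T, 256)

model = HNetM1Sketch()
logits = model(torch.randint(0, 256, (2, 64)))               # (2, 64, 256)
```

The role of the skip is visible in the second-to-last line of forward: without it, every byte's logits must be reconstructed purely from the 4×-downsampled main path, which is consistent with the 1.25 bpb gap observed at matched data.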

Read order

Start with README.md in the submission folder. It has the tl;dr, the full results tables, and the decision tree that led from Part 1 to Part 2.

Artifacts

  • xsa_triton.py, qkv_fuse.py, phase3_run.py, bench_scripts/* — Part 1.
  • hnet_m1/* — Part 2 (model factory, byte shard preprocessor, training script, phase4 orchestration).
  • bootstrap.sh, unpack.py — one-shot pod setup.
  • hnet_scope.md — detailed M2–M4 design for the grant-funded work.

Grant note

Applying for the $500 OpenAI dev grant for M2 (learned chunker) through M4 (ablations). Details and per-milestone abort criteria in hnet_scope.md. Total grant-funded GPU spend projected at ~$500.

Credits

Part 1 (systems, negative result): Triton XSA fwd+bwd kernel + fused QKV on bigbag PR openai#1493. A 200-step 1×H100 training pilot shows 0% speedup from QKV fusion and a −6% regression from Triton XSA (the autograd.Function graph break eats the kernel advantage). Inductor + max-autotune is already at the D=512 kernel-fusion ceiling.
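
To make that failure mode concrete, here is a toy, hypothetical stand-in for the pattern — not the actual xsa_triton.py code. A torch.autograd.Function whose forward launches an opaque custom kernel can (depending on PyTorch version) force Dynamo to split the compiled graph around the .apply call, so the ops on either side lose their fusion partners:

```python
import torch

class CustomKernelOp(torch.autograd.Function):
    """Stand-in for the Triton XSA wrapper; the real fwd/bwd launch Triton kernels."""

    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return x * w                       # imagine a fused Triton launch here

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        return grad_out * w, grad_out * x

def block(x, w):
    x = torch.nn.functional.gelu(x)        # compiled region 1
    x = CustomKernelOp.apply(x, w)         # opaque call -> possible graph break
    return x.mean()                        # compiled region 2

# Run with TORCH_LOGS="graph_breaks" to see where Dynamo splits the graph.
compiled = torch.compile(block, mode="max-autotune-no-cudagraphs")
loss = compiled(torch.randn(1024, requires_grad=True), torch.randn(1024))
loss.backward()
```

When the split regions are small, the fixed per-region launch and guard overhead can exceed the custom kernel's raw win, consistent with the −6% end-to-end number.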

Part 2 (H-Net M1 pilot, signs of life): Hierarchical byte-level stack with fixed stride-4 chunker and byte-encoder→byte-decoder skip. Four pilot runs on 1×H100: val_bpb 4.49 (300 steps, no skip) → 4.40 (1500 steps, no skip) → 3.15 (1500 steps, +skip) → 2.51 (4500 steps, +skip). Skip connection: −1.25 bpb; 3× more data: −0.64 bpb. 33.9M params, with the main network dominating at 31.7M. Proposes M2–M4 (learned chunker, 16MB recipe, ablations) as OpenAI dev-grant scope.

Addresses the unchecked 'H-net tokenization' entry on the Requests-for-PRs list.
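
For readers new to the metric: for a byte-level model (one token per byte), val_bpb is the mean validation cross-entropy converted from nats to bits. A minimal helper showing the conversion:

```python
import math

def bits_per_byte(mean_ce_nats: float) -> float:
    """Convert mean cross-entropy in nats/byte to bits/byte."""
    return mean_ce_nats / math.log(2)

# The final 2.51 bpb above corresponds to ~1.74 nats/byte of cross-entropy.
print(bits_per_byte(1.74))  # ~2.51
```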
