Non-record: Fused Softcap+CE Megakernel (1.94x vs torch.compile) + N-gram Backoff#915

Open
anthony-maio wants to merge 2 commits into openai:main from anthony-maio:submission/ngram-megakernel

Conversation

@anthony-maio

Non-Record Submission: Makora-Generated Fused CUDA Kernel

Answers OpenAI's explicit request for megakernels.

Fused Softcap + Cross Entropy CUDA Kernel

Generated via Makora automated kernel generation. Fuses 30*tanh(logits/30) + cross_entropy into a single CUDA launch:

  • 1.94x faster than torch.compile
  • 7.51x faster than eager PyTorch
  • Warp-level online softmax reduction
  • bf16 input, float32 accumulation
  • Numerically correct to 5 decimal places (max diff: 0.00001431)

Compiled via torch.utils.cpp_extension.load_inline at startup — zero external dependencies.
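For orientation, here is a minimal PyTorch sketch of the unfused reference computation the kernel replaces; it mirrors the fallback path quoted in the review comments below (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def softcap_ce_reference(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (N, V) in bf16/fp32, targets: (N,) int64.
    # Materializes the full capped-logits tensor before the loss -- this is the
    # intermediate the fused kernel avoids by applying the softcap on the fly.
    capped = 30.0 * torch.tanh(logits.float() / 30.0)
    return F.cross_entropy(capped, targets, reduction="none")  # per-position NLL
```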

Architecture

Same 11L VRL+LeakyReLU² stack as PR #889 (0.9642 bpb) with the fused kernel integrated into the sliding window eval path. Training uses standard PyTorch (kernel is forward-only).

Validation

  • Kernel compiled and verified on H100
  • fused_softcap_ce:True confirmed in training logs
  • Correctness: max absolute diff vs reference = 0.00001431
  • 8xH100 full run pending (GPUs currently unavailable)

Why This Matters

Custom CUDA kernels at this model scale typically lose to torch.compile (we proved this ourselves with 8 Makora kernels on Day 1 — all added overhead). The softcap+CE fusion is the exception because it eliminates a large intermediate tensor (B*T × V float32 capped logits) that torch.compile cannot optimize away.
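As a rough back-of-envelope for that intermediate (shapes here are assumed for illustration, not taken from the run logs):

```python
# Assumed illustrative shapes: 131,072 scored positions per eval chunk, vocab 1024.
n_positions = 131_072
vocab = 1_024
bytes_per_fp32 = 4
intermediate_mib = n_positions * vocab * bytes_per_fp32 / 2**20
print(f"capped-logits intermediate: {intermediate_mib:.0f} MiB per chunk")  # 512 MiB
```

The fused kernel never writes this tensor; each warp applies the softcap element-wise while streaming a row through the online-softmax reduction.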

Credits

anthony-maio and others added 2 commits March 26, 2026 20:07
Makora-generated kernel fuses 30*tanh(x/30) + cross_entropy into one
CUDA launch. Warp-level reduction, online softmax, bf16 input.
Compiled via load_inline at startup. Falls back to standard PyTorch
if compilation fails.

Currently loaded but not yet wired into eval path — needs forward_logits
to expose pre-softcap logits. Included for documentation and future
integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- forward_logits_raw() returns pre-softcap logits
- Eval uses fused_softcap_ce(raw_logits, targets) when available
- Falls back to standard forward_logits + F.cross_entropy if not
- USE_FUSED_CE=0 to disable
- Logs fused_softcap_ce:True/False at startup

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
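As context for the review below, a sketch of how the dispatch described in these commit messages could look in the eval path (function and flag names follow the commit message; everything else is assumed, not quoted from the diff):

```python
import os
import torch.nn.functional as F

USE_FUSED_CE = os.environ.get("USE_FUSED_CE", "1") != "0"

def eval_nll(model, idx, targets):
    # Prefer the fused CUDA path on pre-softcap logits; otherwise fall back to
    # the standard softcap + cross_entropy in PyTorch, as the commit describes.
    if _HAS_FUSED_CE and USE_FUSED_CE:
        raw = model.forward_logits_raw(idx)        # pre-softcap logits
        return fused_softcap_ce(raw, targets)      # per-position NLL
    logits = model.forward_logits(idx)             # already softcapped
    return F.cross_entropy(logits.float(), targets, reduction="none")
```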
Copilot AI review requested due to automatic review settings March 27, 2026 01:06
Contributor

Copilot AI left a comment


Pull request overview

Adds a new 10min/16MB record submission that integrates an n‑gram backoff evaluator and a Makora-generated fused CUDA softcap+cross-entropy path to speed up sliding-window evaluation.

Changes:

  • Introduces an inline-compiled CUDA extension to fuse 30*tanh(logits/30) + cross-entropy into a single kernel call for sliding-window eval.
  • Adds multi-order (2–7) n-gram backoff evaluation with entropy-adaptive mixing.
  • Adds record metadata/artifacts (README, submission.json, training log) and adjusts .gitignore log handling.

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 6 comments.

Summary per file:

  • records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_gpt.py - New record training/eval script with fused softcap+CE CUDA extension + n-gram backoff eval.
  • records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_seed1337.log - Training/eval log artifact for reproducibility.
  • records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/submission.json - Submission metadata for the record entry.
  • records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/README.md - Results, method, and reproduction command for the record.
  • .gitignore - Changes log ignore behavior.


except Exception:
_HAS_FUSED_CE = False
def fused_softcap_ce(logits, targets):
capped = 30.0 * torch.tanh(logits.float() / 30.0)

Copilot AI Mar 27, 2026


The Python fallback fused_softcap_ce path also hard-codes the 30.0 softcap. This will silently diverge from the configured model softcap if LOGIT_SOFTCAP is changed, even when the CUDA extension fails to build. Consider using args.logit_softcap (or a module-level constant derived from Hyperparameters.logit_softcap) instead of a literal 30.0.

Suggested change
capped = 30.0 * torch.tanh(logits.float() / 30.0)
softcap = getattr(Hyperparameters, "logit_softcap", 30.0)
capped = softcap * torch.tanh(logits.float() / softcap)

Comment on lines +56 to +58
fused_sc_ce<<<(B+3)/4,128>>>((const __nv_bfloat16*)L.data_ptr<at::BFloat16>(),
T.data_ptr<int64_t>(),O.data_ptr<float>(),B,V);
return O;

Copilot AI Mar 27, 2026


The CUDA extension launches fused_sc_ce on the default stream and doesn’t check for launch errors. In PyTorch extensions this can cause incorrect stream semantics (race with surrounding ops on the current stream) and makes kernel failures hard to diagnose. Consider launching on PyTorch’s current CUDA stream (and/or passing an explicit stream in the <<<...>>> launch) and adding a kernel launch error check (e.g., C10_CUDA_KERNEL_LAUNCH_CHECK / AT_CUDA_CHECK).
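A sketch of the launch-side fix being suggested, assuming the fused_sc_ce kernel and wrapper signature from this diff (the two extra includes would go into the same CUDA source string):

```cpp
#include <ATen/cuda/CUDAContext.h>    // at::cuda::getCurrentCUDAStream
#include <c10/cuda/CUDAException.h>   // C10_CUDA_KERNEL_LAUNCH_CHECK

torch::Tensor fused_sc_ce_call(torch::Tensor L, torch::Tensor T) {
    int B = L.size(0), V = L.size(1);
    auto O = torch::empty({B}, L.options().dtype(torch::kFloat32));
    // Launch on the stream PyTorch is currently using, not the default stream.
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();
    fused_sc_ce<<<(B + 3) / 4, 128, 0, stream>>>(
        (const __nv_bfloat16*)L.data_ptr<at::BFloat16>(),
        T.data_ptr<int64_t>(), O.data_ptr<float>(), B, V);
    // Surface launch errors immediately instead of failing silently downstream.
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return O;
}
```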

Comment on lines +23 to +66
_HAS_FUSED_CE = False
try:
from torch.utils.cpp_extension import load_inline as _load_inline
_FUSED_CE_SRC = r"""
#include <torch/extension.h>
#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <math.h>
#define CAP 30.0f
#define INV_CAP 0.03333333333333333f
__device__ __forceinline__ float _sc(float x){return CAP*tanhf(x*INV_CAP);}
extern "C" __global__ void __launch_bounds__(128) fused_sc_ce(
const __nv_bfloat16* __restrict__ L,const int64_t* __restrict__ T,
float* __restrict__ O,int B,int V){
int tid=threadIdx.x,lane=tid&31,wid=tid>>5,row=blockIdx.x*4+wid;
if(row>=B)return;
const __nv_bfloat16* rp=L+(size_t)row*V;int tgt=(int)T[row];
float mx=-1e38f,se=0.f,tc=0.f;
for(int i=lane;i<V;i+=32){
float c=_sc(__bfloat162float(rp[i]));
if(i==tgt)tc=c;
if(c>mx){se=se*expf(mx-c)+1.f;mx=c;}else{se+=expf(c-mx);}
}
for(int o=16;o>0;o>>=1){
float om=__shfl_xor_sync(0xffffffff,mx,o),os=__shfl_xor_sync(0xffffffff,se,o),
ot=__shfl_xor_sync(0xffffffff,tc,o);
float nm=fmaxf(mx,om);se=se*expf(mx-nm)+os*expf(om-nm);mx=nm;tc+=ot;
}
if(lane==0)O[row]=logf(se)+mx-tc;
}
torch::Tensor fused_sc_ce_call(torch::Tensor L,torch::Tensor T){
int B=L.size(0),V=L.size(1);
auto O=torch::empty({B},L.options().dtype(torch::kFloat32));
fused_sc_ce<<<(B+3)/4,128>>>((const __nv_bfloat16*)L.data_ptr<at::BFloat16>(),
T.data_ptr<int64_t>(),O.data_ptr<float>(),B,V);
return O;
}
"""
_fused_ce_lib = _load_inline(name="fused_sc_ce",cpp_sources="torch::Tensor fused_sc_ce_call(torch::Tensor,torch::Tensor);",
cuda_sources=_FUSED_CE_SRC,functions=["fused_sc_ce_call"],verbose=False,extra_cuda_cflags=["-O3","--use_fast_math"])
def fused_softcap_ce(logits, targets):
if not logits.is_contiguous(): logits = logits.contiguous()
return _fused_ce_lib.fused_sc_ce_call(logits.bfloat16(), targets)
_HAS_FUSED_CE = True

Copilot AI Mar 27, 2026


load_inline is executed at import time, so under torchrun each rank will attempt to build/load the CUDA extension concurrently before dist is initialized. This can add large startup overhead and can be flaky due to build-directory contention. Consider deferring compilation until after rank/world_size are known and only compiling on rank 0 (with a barrier), or using a file lock/unique build directory per rank.
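One way the suggested deferral could look, sketched under the assumption that dist is initialized before the extension is needed (the load_inline arguments are copied from the diff; the rest is illustrative):

```python
import torch.distributed as dist
from torch.utils.cpp_extension import load_inline

_fused_ce_lib = None

def build_fused_ce(rank: int) -> None:
    """Compile on rank 0 first; the other ranks then load the cached build."""
    global _fused_ce_lib
    if rank != 0:
        dist.barrier()  # wait until rank 0 has finished writing the build dir
    _fused_ce_lib = load_inline(
        name="fused_sc_ce",
        cpp_sources="torch::Tensor fused_sc_ce_call(torch::Tensor,torch::Tensor);",
        cuda_sources=_FUSED_CE_SRC,
        functions=["fused_sc_ce_call"],
        extra_cuda_cflags=["-O3", "--use_fast_math"],
        verbose=False,
    )
    if rank == 0:
        dist.barrier()  # release the other ranks once the extension exists
```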

Comment on lines +67 to +68
except Exception:
_HAS_FUSED_CE = False

Copilot AI Mar 27, 2026


The broad except Exception around extension compilation silently disables the fused path without logging the failure reason. This makes it hard to tell whether a run is actually using the fused kernel (and why it isn’t). Consider logging the exception (ideally only on rank 0) or re-raising when USE_FUSED_CE is explicitly requested.

Suggested change
except Exception:
    _HAS_FUSED_CE = False
except Exception as exc:
    # Extension compilation failed; fall back to PyTorch implementation.
    # If fused CE is explicitly requested, surface the error instead of silently disabling it.
    if os.environ.get("USE_FUSED_CE") == "1":
        raise
    _HAS_FUSED_CE = False
    # Log the failure reason once (on rank 0 if distributed is initialized).
    log_on_this_rank = True
    try:
        if dist.is_available() and dist.is_initialized():
            log_on_this_rank = dist.get_rank() == 0
    except Exception:
        # If rank cannot be queried, default to logging.
        log_on_this_rank = True
    if log_on_this_rank:
        print(
            "WARNING: fused softcap cross-entropy extension could not be compiled; "
            "falling back to PyTorch implementation. Exception: "
            f"{exc}",
            file=sys.stderr,
        )

"training_time_seconds": 600,
"val_bpb": 0.9642,
"val_loss": 1.6279,
"bytes_total": 15953596,

Copilot AI Mar 27, 2026


bytes_total appears inconsistent with the script’s own size accounting (it logs total = bytes_model + bytes_code). Given bytes_code=67048, bytes_total should match the generated artifact size from the run logs; please recompute/update this field so it reflects the actual submission size used for the 16MB cap checks.

Suggested change
"bytes_total": 15953596,
"bytes_total": 16020644,

Comment on lines +31 to +33
#define CAP 30.0f
#define INV_CAP 0.03333333333333333f
__device__ __forceinline__ float _sc(float x){return CAP*tanhf(x*INV_CAP);}

Copilot AI Mar 27, 2026


The fused softcap+CE kernel hard-codes CAP=30.0 (and INV_CAP accordingly). If LOGIT_SOFTCAP is changed via env/args, the fused path will compute a different loss than the model’s configured softcap. Consider either generating the kernel for args.logit_softcap (or passing cap into the kernel) or disabling use_fused unless args.logit_softcap==30.0.
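A sketch of the "generate the kernel for the configured softcap" option, assuming the cap lives in args.logit_softcap and the CUDA source string is kept as a template (names here are illustrative, not from the diff):

```python
# Substitute the configured softcap into the CUDA source before load_inline,
# so the fused path always matches the model's softcap instead of a literal 30.0.
cap = float(args.logit_softcap)  # assumed to default to 30.0
_FUSED_CE_SRC = _FUSED_CE_TEMPLATE.replace(
    "#define CAP 30.0f", f"#define CAP {cap}f"
).replace(
    "#define INV_CAP 0.03333333333333333f", f"#define INV_CAP {1.0 / cap!r}f"
)
```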

@MatoTeziTanka

Community Review — Fused Softcap+CE Megakernel + N-gram Backoff

BPB: 0.9642 post-ngram (3-seed mean, std 0.0002) / 1.1225 pure-neural stride-64 | Seeds: 3 | Artifact: 15,981,848 B (seed 1337) | Track: non-record | Compliance: N-gram PASS, eval-budget FLAG

What this does: Takes the author's PR #175-style 11L / 512d / VRL / LeakyReLU(0.5)² / GQA / SP1024 stack and layers two independently motivated pieces on top: (1) a Makora-generated fused softcap + cross_entropy CUDA kernel (compiled at startup via torch.utils.cpp_extension.load_inline, with a pure-PyTorch fallback at L67-71); (2) a multi-order (2-7gram) backoff cache applied after sliding-window eval, with an entropy-adaptive mixing weight alpha = 0.05 + 0.55 * sigmoid(2*(H - 4)). Filed correctly as non-record.
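For readers skimming, a minimal sketch of the mixing rule as described (names are illustrative; the formulas are exactly the ones quoted above and in the bullets below):

```python
import math

def mixed_nll(model_p: float, ngram_p: float, entropy_h: float) -> float:
    # Entropy-adaptive weight: lean harder on the n-gram cache when the model
    # is uncertain, i.e. when the per-token entropy H is high.
    alpha = 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy_h - 4.0)))
    mixed_p = (1.0 - alpha) * model_p + alpha * ngram_p
    return -math.log(max(mixed_p, 1e-12))
```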

What I found in the code (records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_gpt.py, SHA 15a5cb8c):

  • Fused kernel (L23-71): bf16 input, fp32 accumulation, warp-level online softmax reduction via __shfl_xor_sync, one block per 4 rows. Computes logf(se)+mx-tc where tc is the capped logit at the target. Numerically consistent with the 30*tanh(x/30) + F.cross_entropy fallback. The try/except at L23-71 is a clean "compile or fall back" pattern — works on CPU (gauntlet below verified the fallback). Used only on the forward/eval path (L961 calls it inside eval_val, training still uses the standard loss). Per Issue Are HW optimization solutions also welcome? #1409, custom/fused kernels are explicitly permitted.
  • N-gram cache (NgramBackoffCache, L988-1031): 7 orders, 4M-bucket hash tables per order, XOR-with-primes hashing, min_count=2, raw count ratios (no smoothing).
  • Strict prefix on context (L1001-1005): _hash_ctx reads tokens[pos - ctx_w + k] for k in [0, ctx_w). With pos = ws + t + 1 (the absolute token index of the target, L1088) and ctx_w >= 1, the hashed indices are [pos-ctx_w, ..., pos-1] — strictly before the target. No oracle / hindsight lookup.
  • predict(tokens, pos, target) (L1019-1032): target is only used to look up full_count in the context-conditional hash table — it computes P(target | context), not argmax over targets. This is standard n-gram prediction, not oracle selection. Per Issue Illegal submissions megathread #677 (valerio-oai, 2026-03-27), eval caches must be backward-looking with no oracle selection — this code is.
  • Score-before-update at window granularity (L1087-1112): The scoring loop finishes for a whole window, and only then does cache.update(all_tokens, scored_up_to, new_end) (L1111) add the window's tokens to the tables; scored_up_to starts at the first assigned window's left edge. So during any token's scoring the cache only contains tokens from strictly-earlier windows. No token ever contributes to its own prediction. Legal under Issue Invalid submissions due to information leakage during TTT #402 / Illegal submissions megathread #677.
  • Mixing (L1094-1098): mixed_p = (1 - alpha)*model_p + alpha*ng_p, floored at 1e-12, -log taken for NLL. Linear in probability space, standard.

Why is the BPB 0.9642 and not ~1.08?

The README is transparent about this, but it's worth stating plainly for the leaderboard context: 0.9642 is not comparable to the sliding-window stride-64 numbers on the main leaderboard. From train_seed1337.log:

final_int6_sliding_window val_loss:1.8953 val_bpb:1.1225 stride:64 eval_time:102169ms
final_ngram        val_loss:1.6277 val_bpb:0.9640 ngram_eval_time:895349ms

The pure-neural stride-64 number is 1.1225, which is exactly in the range expected for an SP1024 11L VRL stack at 6,765 training steps. The 0.9642 is a post-processing number that layers a 2-7gram backoff cache over the neural eval. So the "gap to SOTA" isn't a gap at all — the neural model is at ~1.12 like the rest of the SP1024 pack; the cache buys ~0.16 BPB on top for free (in params/bytes), but at a big time cost.

Main flag — eval-budget compliance:

  • ngram_eval_time: 895,349 ms (seed 1337) ≈ 14.9 minutes on 8×H100 SXM. The stride-64 neural eval alone is already 102,169 ms (~1.7 min). Per the README and Issue Environment Clarifications (pytorch, CUDA, H100) #17, eval wallclock on 8×H100 SXM is 10 min.
  • The submission README (at records/track_10min_16mb/.../README.md, "Compliance" section) frames this as "~15 min on 8×H100 SXM (under 10 min per-GPU)". That "per-GPU" reading is not how I read the rules — the budget is total wallclock on the 8-GPU eval image, not per-device. I'd want a mod ruling on this before treating 0.9642 as a comparable number, even for non-record.
  • This is the single biggest caveat on the submission. Everything else checks out.

Gauntlet (CPU pre-flight on the PR head):

[PASS] Import, Hyperparameters (dim=512, layers=11, heads=8, vocab=1024)
[PASS] Model: 26,993,766 params
[PASS] Forward pass: loss=6.9368
[PASS] Artifact: 4,649,574 B (29.1% of 16MB) via int6+lzma on freshly-initialized weights
[INFO] Code size (as downloaded from GitHub raw): 70,446 B
       Submission.json reports bytes_code=67,048 B — 3.4KB delta is GitHub raw encoding vs author's local encode; author's own code-size measurement is what counts per README
[INFO] Est. 8×H100: 45.9 ms/step, 13,058 steps in 10 min

Gauntlet PASS; the kernel's except Exception: fallback path (L67-71) is what made CPU execution possible — clean engineering.

Seed coverage / artifact sizes (per README table, verified against train_seed1337.log):

  • seed 1337: 15,981,848 B, post-ngram 0.9640
  • seed 42: 15,904,632 B, post-ngram 0.9641
  • seed 2025: 15,974,308 B, post-ngram 0.9644
  • All under 16,000,000 B. Mean 0.9642, std 0.0002. Pre-ngram stride-64 is 1.1225 / 1.1224 / 1.1231 respectively — tight.

Questions / flags:

  1. Eval wallclock. 895s ngram stage + 102s stride-64 stage is ~16.6 min total on 8×H100 SXM. The author's "10 min per-GPU" reading is non-standard. I'd want @0hq / @valerio-oai to weigh in on whether non-record submissions are bound by the 10-min eval wallclock. If they are, the ngram stage as written needs to be either sped up (Numba/C++/CUDA the inner predict/update — right now it's pure Python over a NumPy buffer) or dropped.
  2. Prior-art credit. README credits PR Record: First Legal Sub-1.0 BPB — Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674, 3-seed) #727 (@Asukabot0) for the n-gram backoff approach, PR Record: 11L EMA + GPTQ-lite + warmdown3500 + [email protected] (val_bpb=1.1233) #414 for the neural base, PR Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309) #493/Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622) #518 for LeakyReLU², and PR Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271) #287 for XSA — good attribution discipline, no concerns.
  3. Makora disclosure. The kernel is explicitly labeled as Makora-generated with a link, which is the right level of disclosure. No issue.

Verdict: NEEDS CLARIFICATION — on eval-budget interpretation. The technique is compliant (n-gram is backward-looking, score-first, no oracle; kernel is a legal fused op per #1409) and the engineering is clean, but the non-record filing only makes sense if ~15-minute evals are acceptable on this track.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:

  • MERGE (non-record) on the strength of the megakernel contribution alone (this was explicitly README-requested; it's a clean 1.94× vs torch.compile result, correctness verified on H100, CPU fallback is drop-in). The fused kernel is worth merging independent of the n-gram result.
  • HOLD on treating 0.9642 as a comparable n-gram BPB until a mod clarifies the 10-min eval budget for non-record submissions with post-processing caches (Issue Environment Clarifications (pytorch, CUDA, H100) #17 / README). A one-line ruling would unblock this and ~30 other ngram-cache PRs in the queue.
  • If the budget is 10 min wallclock, I'd ask the author to either (a) profile/port the n-gram inner loops out of Python, or (b) re-file the fused kernel as its own standalone non-record and drop the ngram stage from this PR.

Reviewed by @MatoTeziTanka (The Agora). Gauntlet ran clean on CPU: all 10 checks PASS, artifact budget 29.1%, fused-CE fallback path exercised successfully. AI tooling: review drafted with Claude Code (Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 15a5cb8c2a5ef20c8498e22dcdec66db7c344330.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
