Non-record: Fused Softcap+CE Megakernel (1.94x vs torch.compile) + N-gram Backoff#915

Open
anthony-maio wants to merge 2 commits into openai:main from anthony-maio:submission/ngram-megakernel

Conversation

@anthony-maio

Non-Record Submission: Makora-Generated Fused CUDA Kernel

Answers OpenAI's explicit request for megakernels.

Fused Softcap + Cross Entropy CUDA Kernel

Generated via Makora automated kernel generation. Fuses 30*tanh(logits/30) + cross_entropy into a single CUDA launch:

  • 1.94x faster than torch.compile
  • 7.51x faster than eager PyTorch
  • Warp-level online softmax reduction
  • bf16 input, float32 accumulation
  • Numerically correct to 5 decimal places (max diff: 0.00001431)

Compiled via torch.utils.cpp_extension.load_inline at startup — zero external dependencies.
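For orientation, here is a minimal PyTorch sketch of the unfused reference computation the kernel replaces; it mirrors the fallback path quoted in the review comments below (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def softcap_ce_reference(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (N, V) in bf16/fp32, targets: (N,) int64.
    # Materializes the full capped-logits tensor before the loss -- this is the
    # intermediate the fused kernel avoids by applying the softcap on the fly.
    capped = 30.0 * torch.tanh(logits.float() / 30.0)
    return F.cross_entropy(capped, targets, reduction="none")  # per-position NLL
```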

Architecture

Same 11L VRL+LeakyReLU² stack as PR #889 (0.9642 bpb) with the fused kernel integrated into the sliding window eval path. Training uses standard PyTorch (kernel is forward-only).

Validation

  • Kernel compiled and verified on H100
  • fused_softcap_ce:True confirmed in training logs
  • Correctness: max absolute diff vs reference = 0.00001431
  • 8xH100 full run pending (GPUs currently unavailable)

Why This Matters

Custom CUDA kernels at this model scale typically lose to torch.compile (we proved this ourselves with 8 Makora kernels on Day 1 — all added overhead). The softcap+CE fusion is the exception because it eliminates a large intermediate tensor (B*T × V float32 capped logits) that torch.compile cannot optimize away.
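As a rough back-of-envelope for that intermediate (shapes here are assumed for illustration, not taken from the run logs):

```python
# Assumed illustrative shapes: 131,072 scored positions per eval chunk, vocab 1024.
n_positions = 131_072
vocab = 1_024
bytes_per_fp32 = 4
intermediate_mib = n_positions * vocab * bytes_per_fp32 / 2**20
print(f"capped-logits intermediate: {intermediate_mib:.0f} MiB per chunk")  # 512 MiB
```

The fused kernel never writes this tensor; each warp applies the softcap element-wise while streaming a row through the online-softmax reduction.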

Credits

anthony-maio and others added 2 commits March 26, 2026 20:07
Makora-generated kernel fuses 30*tanh(x/30) + cross_entropy into one
CUDA launch. Warp-level reduction, online softmax, bf16 input.
Compiled via load_inline at startup. Falls back to standard PyTorch
if compilation fails.

Currently loaded but not yet wired into eval path — needs forward_logits
to expose pre-softcap logits. Included for documentation and future
integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- forward_logits_raw() returns pre-softcap logits
- Eval uses fused_softcap_ce(raw_logits, targets) when available
- Falls back to standard forward_logits + F.cross_entropy if not
- USE_FUSED_CE=0 to disable
- Logs fused_softcap_ce:True/False at startup

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
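As context for the review below, a sketch of how the dispatch described in these commit messages could look in the eval path (function and flag names follow the commit message; everything else is assumed, not quoted from the diff):

```python
import os
import torch.nn.functional as F

USE_FUSED_CE = os.environ.get("USE_FUSED_CE", "1") != "0"

def eval_nll(model, idx, targets):
    # Prefer the fused CUDA path on pre-softcap logits; otherwise fall back to
    # the standard softcap + cross_entropy in PyTorch, as the commit describes.
    if _HAS_FUSED_CE and USE_FUSED_CE:
        raw = model.forward_logits_raw(idx)        # pre-softcap logits
        return fused_softcap_ce(raw, targets)      # per-position NLL
    logits = model.forward_logits(idx)             # already softcapped
    return F.cross_entropy(logits.float(), targets, reduction="none")
```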
Copilot AI review requested due to automatic review settings March 27, 2026 01:06
Contributor

Copilot AI left a comment


Pull request overview

Adds a new 10min/16MB record submission that integrates an n‑gram backoff evaluator and a Makora-generated fused CUDA softcap+cross-entropy path to speed up sliding-window evaluation.

Changes:

  • Introduces an inline-compiled CUDA extension to fuse 30*tanh(logits/30) + cross-entropy into a single kernel call for sliding-window eval.
  • Adds multi-order (2–7) n-gram backoff evaluation with entropy-adaptive mixing.
  • Adds record metadata/artifacts (README, submission.json, training log) and adjusts .gitignore log handling.

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 6 comments.

Summary per file:

  • records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_gpt.py - New record training/eval script with fused softcap+CE CUDA extension + n-gram backoff eval.
  • records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_seed1337.log - Training/eval log artifact for reproducibility.
  • records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/submission.json - Submission metadata for the record entry.
  • records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/README.md - Results, method, and reproduction command for the record.
  • .gitignore - Changes log ignore behavior.


except Exception:
_HAS_FUSED_CE = False
def fused_softcap_ce(logits, targets):
capped = 30.0 * torch.tanh(logits.float() / 30.0)

Copilot AI Mar 27, 2026


The Python fallback fused_softcap_ce path also hard-codes the 30.0 softcap. This will silently diverge from the configured model softcap if LOGIT_SOFTCAP is changed, even when the CUDA extension fails to build. Consider using args.logit_softcap (or a module-level constant derived from Hyperparameters.logit_softcap) instead of a literal 30.0.

Suggested change
capped = 30.0 * torch.tanh(logits.float() / 30.0)
softcap = getattr(Hyperparameters, "logit_softcap", 30.0)
capped = softcap * torch.tanh(logits.float() / softcap)

Comment on lines +56 to +58
fused_sc_ce<<<(B+3)/4,128>>>((const __nv_bfloat16*)L.data_ptr<at::BFloat16>(),
T.data_ptr<int64_t>(),O.data_ptr<float>(),B,V);
return O;

Copilot AI Mar 27, 2026


The CUDA extension launches fused_sc_ce on the default stream and doesn’t check for launch errors. In PyTorch extensions this can cause incorrect stream semantics (race with surrounding ops on the current stream) and makes kernel failures hard to diagnose. Consider launching on PyTorch’s current CUDA stream (and/or passing an explicit stream in the <<<...>>> launch) and adding a kernel launch error check (e.g., C10_CUDA_KERNEL_LAUNCH_CHECK / AT_CUDA_CHECK).
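A sketch of the launch-side fix being suggested, assuming the fused_sc_ce kernel and wrapper signature from this diff (the two extra includes would go into the same CUDA source string):

```cpp
#include <ATen/cuda/CUDAContext.h>    // at::cuda::getCurrentCUDAStream
#include <c10/cuda/CUDAException.h>   // C10_CUDA_KERNEL_LAUNCH_CHECK

torch::Tensor fused_sc_ce_call(torch::Tensor L, torch::Tensor T) {
    int B = L.size(0), V = L.size(1);
    auto O = torch::empty({B}, L.options().dtype(torch::kFloat32));
    // Launch on the stream PyTorch is currently using, not the default stream.
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();
    fused_sc_ce<<<(B + 3) / 4, 128, 0, stream>>>(
        (const __nv_bfloat16*)L.data_ptr<at::BFloat16>(),
        T.data_ptr<int64_t>(), O.data_ptr<float>(), B, V);
    // Surface launch errors immediately instead of failing silently downstream.
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return O;
}
```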

Comment on lines +23 to +66
_HAS_FUSED_CE = False
try:
from torch.utils.cpp_extension import load_inline as _load_inline
_FUSED_CE_SRC = r"""
#include <torch/extension.h>
#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <math.h>
#define CAP 30.0f
#define INV_CAP 0.03333333333333333f
__device__ __forceinline__ float _sc(float x){return CAP*tanhf(x*INV_CAP);}
extern "C" __global__ void __launch_bounds__(128) fused_sc_ce(
const __nv_bfloat16* __restrict__ L,const int64_t* __restrict__ T,
float* __restrict__ O,int B,int V){
int tid=threadIdx.x,lane=tid&31,wid=tid>>5,row=blockIdx.x*4+wid;
if(row>=B)return;
const __nv_bfloat16* rp=L+(size_t)row*V;int tgt=(int)T[row];
float mx=-1e38f,se=0.f,tc=0.f;
for(int i=lane;i<V;i+=32){
float c=_sc(__bfloat162float(rp[i]));
if(i==tgt)tc=c;
if(c>mx){se=se*expf(mx-c)+1.f;mx=c;}else{se+=expf(c-mx);}
}
for(int o=16;o>0;o>>=1){
float om=__shfl_xor_sync(0xffffffff,mx,o),os=__shfl_xor_sync(0xffffffff,se,o),
ot=__shfl_xor_sync(0xffffffff,tc,o);
float nm=fmaxf(mx,om);se=se*expf(mx-nm)+os*expf(om-nm);mx=nm;tc+=ot;
}
if(lane==0)O[row]=logf(se)+mx-tc;
}
torch::Tensor fused_sc_ce_call(torch::Tensor L,torch::Tensor T){
int B=L.size(0),V=L.size(1);
auto O=torch::empty({B},L.options().dtype(torch::kFloat32));
fused_sc_ce<<<(B+3)/4,128>>>((const __nv_bfloat16*)L.data_ptr<at::BFloat16>(),
T.data_ptr<int64_t>(),O.data_ptr<float>(),B,V);
return O;
}
"""
_fused_ce_lib = _load_inline(name="fused_sc_ce",cpp_sources="torch::Tensor fused_sc_ce_call(torch::Tensor,torch::Tensor);",
cuda_sources=_FUSED_CE_SRC,functions=["fused_sc_ce_call"],verbose=False,extra_cuda_cflags=["-O3","--use_fast_math"])
def fused_softcap_ce(logits, targets):
if not logits.is_contiguous(): logits = logits.contiguous()
return _fused_ce_lib.fused_sc_ce_call(logits.bfloat16(), targets)
_HAS_FUSED_CE = True

Copilot AI Mar 27, 2026


load_inline is executed at import time, so under torchrun each rank will attempt to build/load the CUDA extension concurrently before dist is initialized. This can add large startup overhead and can be flaky due to build-directory contention. Consider deferring compilation until after rank/world_size are known and only compiling on rank 0 (with a barrier), or using a file lock/unique build directory per rank.
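One way the suggested deferral could look, sketched under the assumption that dist is initialized before the extension is needed (the load_inline arguments are copied from the diff; the rest is illustrative):

```python
import torch.distributed as dist
from torch.utils.cpp_extension import load_inline

_fused_ce_lib = None

def build_fused_ce(rank: int) -> None:
    """Compile on rank 0 first; the other ranks then load the cached build."""
    global _fused_ce_lib
    if rank != 0:
        dist.barrier()  # wait until rank 0 has finished writing the build dir
    _fused_ce_lib = load_inline(
        name="fused_sc_ce",
        cpp_sources="torch::Tensor fused_sc_ce_call(torch::Tensor,torch::Tensor);",
        cuda_sources=_FUSED_CE_SRC,
        functions=["fused_sc_ce_call"],
        extra_cuda_cflags=["-O3", "--use_fast_math"],
        verbose=False,
    )
    if rank == 0:
        dist.barrier()  # release the other ranks once the extension exists
```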

Comment on lines +67 to +68
except Exception:
_HAS_FUSED_CE = False

Copilot AI Mar 27, 2026


The broad except Exception around extension compilation silently disables the fused path without logging the failure reason. This makes it hard to tell whether a run is actually using the fused kernel (and why it isn’t). Consider logging the exception (ideally only on rank 0) or re-raising when USE_FUSED_CE is explicitly requested.

Suggested change
except Exception:
    _HAS_FUSED_CE = False
except Exception as exc:
    # Extension compilation failed; fall back to PyTorch implementation.
    # If fused CE is explicitly requested, surface the error instead of silently disabling it.
    if os.environ.get("USE_FUSED_CE") == "1":
        raise
    _HAS_FUSED_CE = False
    # Log the failure reason once (on rank 0 if distributed is initialized).
    log_on_this_rank = True
    try:
        if dist.is_available() and dist.is_initialized():
            log_on_this_rank = dist.get_rank() == 0
    except Exception:
        # If rank cannot be queried, default to logging.
        log_on_this_rank = True
    if log_on_this_rank:
        print(
            "WARNING: fused softcap cross-entropy extension could not be compiled; "
            "falling back to PyTorch implementation. Exception: "
            f"{exc}",
            file=sys.stderr,
        )

"training_time_seconds": 600,
"val_bpb": 0.9642,
"val_loss": 1.6279,
"bytes_total": 15953596,

Copilot AI Mar 27, 2026


bytes_total appears inconsistent with the script’s own size accounting (it logs total = bytes_model + bytes_code). Given bytes_code=67048, bytes_total should match the generated artifact size from the run logs; please recompute/update this field so it reflects the actual submission size used for the 16MB cap checks.

Suggested change
"bytes_total": 15953596,
"bytes_total": 16020644,

Comment on lines +31 to +33
#define CAP 30.0f
#define INV_CAP 0.03333333333333333f
__device__ __forceinline__ float _sc(float x){return CAP*tanhf(x*INV_CAP);}

Copilot AI Mar 27, 2026


The fused softcap+CE kernel hard-codes CAP=30.0 (and INV_CAP accordingly). If LOGIT_SOFTCAP is changed via env/args, the fused path will compute a different loss than the model’s configured softcap. Consider either generating the kernel for args.logit_softcap (or passing cap into the kernel) or disabling use_fused unless args.logit_softcap==30.0.
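A sketch of the "generate the kernel for the configured softcap" option, assuming the cap lives in args.logit_softcap and the CUDA source string is kept as a template (names here are illustrative, not from the diff):

```python
# Substitute the configured softcap into the CUDA source before load_inline,
# so the fused path always matches the model's softcap instead of a literal 30.0.
cap = float(args.logit_softcap)  # assumed to default to 30.0
_FUSED_CE_SRC = _FUSED_CE_TEMPLATE.replace(
    "#define CAP 30.0f", f"#define CAP {cap}f"
).replace(
    "#define INV_CAP 0.03333333333333333f", f"#define INV_CAP {1.0 / cap!r}f"
)
```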

@MatoTeziTanka

Community Review — Fused Softcap+CE Megakernel + N-gram Backoff

BPB: 0.9642 post-ngram (3-seed mean, std 0.0002) / 1.1225 pure-neural stride-64 | Seeds: 3 | Artifact: 15,981,848 B (seed 1337) | Track: non-record | Compliance: N-gram PASS, eval-budget FLAG

What this does: Takes the author's PR #175-style 11L / 512d / VRL / LeakyReLU(0.5)² / GQA / SP1024 stack and layers two independently motivated pieces on top: (1) a Makora-generated fused softcap + cross_entropy CUDA kernel (compiled at startup via torch.utils.cpp_extension.load_inline, with a pure-PyTorch fallback at L67-71); (2) a multi-order (2-7gram) backoff cache applied after sliding-window eval, with an entropy-adaptive mixing weight alpha = 0.05 + 0.55 * sigmoid(2*(H - 4)). Filed correctly as non-record.
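For readers skimming, a minimal sketch of the mixing rule as described (names are illustrative; the formulas are exactly the ones quoted above and in the bullets below):

```python
import math

def mixed_nll(model_p: float, ngram_p: float, entropy_h: float) -> float:
    # Entropy-adaptive weight: lean harder on the n-gram cache when the model
    # is uncertain, i.e. when the per-token entropy H is high.
    alpha = 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy_h - 4.0)))
    mixed_p = (1.0 - alpha) * model_p + alpha * ngram_p
    return -math.log(max(mixed_p, 1e-12))
```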

What I found in the code (records/track_10min_16mb/2026-03-26_NgramBackoff_VRL_LeakyReLU2/train_gpt.py, SHA 15a5cb8c):

  • Fused kernel (L23-71): bf16 input, fp32 accumulation, warp-level online softmax reduction via __shfl_xor_sync, one block per 4 rows. Computes logf(se)+mx-tc where tc is the capped logit at the target. Numerically consistent with the 30*tanh(x/30) + F.cross_entropy fallback. The try/except at L23-71 is a clean "compile or fall back" pattern — works on CPU (gauntlet below verified the fallback). Used only on the forward/eval path (L961 calls it inside eval_val, training still uses the standard loss). Per Issue Are HW optimization solutions also welcome? #1409, custom/fused kernels are explicitly permitted.
  • N-gram cache (NgramBackoffCache, L988-1031): 7 orders, 4M-bucket hash tables per order, XOR-with-primes hashing, min_count=2, raw count ratios (no smoothing).
  • Strict prefix on context (L1001-1005): _hash_ctx reads tokens[pos - ctx_w + k] for k in [0, ctx_w). With pos = ws + t + 1 (the absolute token index of the target, L1088) and ctx_w >= 1, the hashed indices are [pos-ctx_w, ..., pos-1] — strictly before the target. No oracle / hindsight lookup.
  • predict(tokens, pos, target) (L1019-1032): target is only used to look up full_count in the context-conditional hash table — it computes P(target | context), not argmax over targets. This is standard n-gram prediction, not oracle selection. Per Issue Illegal submissions megathread #677 (valerio-oai, 2026-03-27), eval caches must be backward-looking with no oracle selection — this code is.
  • Score-before-update at window granularity (L1087-1112): The scoring loop finishes for a whole window, and only then does cache.update(all_tokens, scored_up_to, new_end) (L1111) add the window's tokens to the tables; scored_up_to starts at the first assigned window's left edge. So during any token's scoring the cache only contains tokens from strictly-earlier windows. No token ever contributes to its own prediction. Legal under Issue Invalid submissions due to information leakage during TTT #402 / Illegal submissions megathread #677.
  • Mixing (L1094-1098): mixed_p = (1 - alpha)*model_p + alpha*ng_p, floored at 1e-12, -log taken for NLL. Linear in probability space, standard.

Why is the BPB 0.9642 and not ~1.08?

The README is transparent about this, but it's worth stating plainly for the leaderboard context: 0.9642 is not comparable to the sliding-window stride-64 numbers on the main leaderboard. From train_seed1337.log:

final_int6_sliding_window val_loss:1.8953 val_bpb:1.1225 stride:64 eval_time:102169ms
final_ngram        val_loss:1.6277 val_bpb:0.9640 ngram_eval_time:895349ms

The pure-neural stride-64 number is 1.1225, which is exactly in the range expected for an SP1024 11L VRL stack at 6,765 training steps. The 0.9642 is a post-processing number that layers a 2-7gram backoff cache over the neural eval. So the "gap to SOTA" isn't a gap at all — the neural model is at ~1.12 like the rest of the SP1024 pack; the cache buys ~0.16 BPB on top for free (in params/bytes), but at a big time cost.

Main flag — eval-budget compliance:

  • ngram_eval_time: 895,349 ms (seed 1337) ≈ 14.9 minutes on 8×H100 SXM. The stride-64 neural eval alone is already 102,169 ms (~1.7 min). Per the README and Issue Environment Clarifications (pytorch, CUDA, H100) #17, eval wallclock on 8×H100 SXM is 10 min.
  • The submission README (at records/track_10min_16mb/.../README.md, "Compliance" section) frames this as "~15 min on 8×H100 SXM (under 10 min per-GPU)". That "per-GPU" reading is not how I read the rules — the budget is total wallclock on the 8-GPU eval image, not per-device. I'd want a mod ruling on this before treating 0.9642 as a comparable number, even for non-record.
  • This is the single biggest caveat on the submission. Everything else checks out.

Gauntlet (CPU pre-flight on the PR head):

[PASS] Import, Hyperparameters (dim=512, layers=11, heads=8, vocab=1024)
[PASS] Model: 26,993,766 params
[PASS] Forward pass: loss=6.9368
[PASS] Artifact: 4,649,574 B (29.1% of 16MB) via int6+lzma on freshly-initialized weights
[INFO] Code size (as downloaded from GitHub raw): 70,446 B
       Submission.json reports bytes_code=67,048 B — 3.4KB delta is GitHub raw encoding vs author's local encode; author's own code-size measurement is what counts per README
[INFO] Est. 8×H100: 45.9 ms/step, 13,058 steps in 10 min

Gauntlet PASS; the kernel's except Exception: fallback path (L67-71) is what made CPU execution possible — clean engineering.

Seed coverage / artifact sizes (per README table, verified against train_seed1337.log):

  • seed 1337: 15,981,848 B, post-ngram 0.9640
  • seed 42: 15,904,632 B, post-ngram 0.9641
  • seed 2025: 15,974,308 B, post-ngram 0.9644
  • All under 16,000,000 B. Mean 0.9642, std 0.0002. Pre-ngram stride-64 is 1.1225 / 1.1224 / 1.1231 respectively — tight.

Questions / flags:

  1. Eval wallclock. 895s ngram stage + 102s stride-64 stage is ~16.6 min total on 8×H100 SXM. The author's "10 min per-GPU" reading is non-standard. I'd want @0hq / @valerio-oai to weigh in on whether non-record submissions are bound by the 10-min eval wallclock. If they are, the ngram stage as written needs to be either sped up (Numba/C++/CUDA the inner predict/update — right now it's pure Python over a NumPy buffer) or dropped.
  2. Prior-art credit. README credits PR Record: First Legal Sub-1.0 BPB — Multi-order N-gram Backoff + Entropy-Adaptive Alpha (val_bpb=0.9674, 3-seed) #727 (@Asukabot0) for the n-gram backoff approach, PR Record: 11L EMA + GPTQ-lite + warmdown3500 + [email protected] (val_bpb=1.1233) #414 for the neural base, PR Record: 11L EMA + Int6 + XSA + LeakyReLU² + Partial RoPE (val_bpb: 1.1309) #493/Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622) #518 for LeakyReLU², and PR Record: 11L XSA + EMA + Int6 MLP3x + WD=0.04 (val_bpb: 1.1271) #287 for XSA — good attribution discipline, no concerns.
  3. Makora disclosure. The kernel is explicitly labeled as Makora-generated with a link, which is the right level of disclosure. No issue.

Verdict: NEEDS CLARIFICATION — on eval-budget interpretation. The technique is compliant (n-gram is backward-looking, score-first, no oracle; kernel is a legal fused op per #1409) and the engineering is clean, but the non-record filing only makes sense if ~15-minute evals are acceptable on this track.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:

  • MERGE (non-record) on the strength of the megakernel contribution alone (this was explicitly README-requested; it's a clean 1.94× vs torch.compile result, correctness verified on H100, CPU fallback is drop-in). The fused kernel is worth merging independent of the n-gram result.
  • HOLD on treating 0.9642 as a comparable n-gram BPB until a mod clarifies the 10-min eval budget for non-record submissions with post-processing caches (Issue Environment Clarifications (pytorch, CUDA, H100) #17 / README). A one-line ruling would unblock this and ~30 other ngram-cache PRs in the queue.
  • If the budget is 10 min wallclock, I'd ask the author to either (a) profile/port the n-gram inner loops out of Python, or (b) re-file the fused kernel as its own standalone non-record and drop the ngram stage from this PR.

Reviewed by @MatoTeziTanka (The Agora). Gauntlet ran clean on CPU: all 10 checks PASS, artifact budget 29.1%, fused-CE fallback path exercised successfully. AI tooling: review drafted with Claude Code (Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 15a5cb8c2a5ef20c8498e22dcdec66db7c344330.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
