Non-record: Neural Base Model, No TTT — Parcae + Gates + Layered Windows (val_bpb 1.07706) #1728
Open
mikeapedia wants to merge 1 commit into openai:main from
Sliding-window eval only (`TTT_ENABLED=0` by default) on standard SP8192. Beats the merged non-casefold SOTA, PR openai#1493 (1.0810), by 0.00394 BPB without any test-time adaptation. Single seed 1337; compute-constrained non-record submission. The VM went down before the run log could be pushed, so it is not attached; the metrics were observed during the session. Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop injection, Gemma-style global/local attention, Gram Newton-Schulz) + PR openai#1530 (@samacqua) varlen attention + fused MLP Triton kernel + AttnOutGate (PR openai#1667, @MarioPaerle) + SmearGate (@KellerJordan concept, @MarioPaerle reintroduction) + new layered local sliding windows (512 on early/loop layers, 1024 on post-loop layers, split at index 6). KV-tying on globals is dropped vs. PR openai#1674. TTT scaffolding (phased global-SGD + per-doc LoRA, from the PR openai#1693 lineage) remains in the file for experiments but is disabled by default for this submission.
Non-record submission — Neural Base Model, No TTT
val_bpb = 1.07706 (sliding-window eval, seed 1337, non-casefold SP8192) | 15,962,729 B (~15.96 MB) | 8×H100 80GB SXM | 596s training + sliding-window eval
Summary
Positioning
Most competitive submissions on the main-track leaderboard include test-time training, which conflates architectural gains with test-time compute gains. This submission isolates the architectural contribution:
- `TTT_ENABLED=0` in the shipped config.
- `TTT_ENABLED=1` on the same model trends toward 1.074–1.075 in my experiments, short of clearing the 0.005-nat bar vs. the casefold-track PR #1695 ("[Record] Stage 3 + SpinQuant V1 + MP-SGD-TTT", val_bpb 1.0759); further TTT tuning is future work pending compute.
- As a reference point: PR #1493 (merged SOTA) reports sliding-only 1.0829 and post-TTT 1.0810. Our sliding-only is 1.07706, below their TTT-enabled number and below their sliding-only by 0.00584 BPB.
Results (single seed)
Architecture
Inherited from PR #1674 (ours, non-record research submission):
- Parcae constrained loop injection: `x = A_bar * x + B_bar * x0` with learned per-dim `loop_log_A` / `loop_delta` / `loop_B`. `A_bar ∈ (0, 1)` by construction (softplus on delta, exp of negative exp on log_A) enforces bounded decay; `B_bar` re-injects the original residual stream. Three per-dim scalars total (see the sketch after this list).
- `global_attn_layers=[4, 9, 10]` get full causal attention + partial RoPE (`rope_dims=16` / `head_dim=64`); the remaining layers use sliding-window attention + full RoPE for positional precision within the window.
- `mlp_up_bank` (4:1 ratio) and `mlp_down_bank`. NS steps dropped 5 → 4 since the architecture no longer requires the extra refine step.
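A minimal sketch of the bounded-decay parametrization. Reading "softplus on delta, exp of negative exp on log_A" as `A_bar = exp(-softplus(loop_delta) * exp(loop_log_A))` is an assumption consistent with the description above; the class name and init values are illustrative, not the PR's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedLoopInjection(nn.Module):
    """Hypothetical sketch of the bounded-decay re-injection.
    With A_bar = exp(-softplus(loop_delta) * exp(loop_log_A)), the
    exponent is strictly negative for any real parameter values,
    so A_bar lies in (0, 1) by construction."""

    def __init__(self, dim: int):
        super().__init__()
        # Three per-dim scalars, as described above.
        self.loop_log_A = nn.Parameter(torch.zeros(dim))
        self.loop_delta = nn.Parameter(torch.zeros(dim))
        self.loop_B = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        # Bounded decay of the looped stream: A_bar in (0, 1).
        A_bar = torch.exp(-F.softplus(self.loop_delta) * torch.exp(self.loop_log_A))
        # B_bar re-injects the original residual stream x0.
        return A_bar * x + self.loop_B * x0
```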
Inherited from PR #1530 (@samacqua):
- `flash_attn_varlen_func` with `cu_seqlens` boundaries; training, eval, and global-SGD TTT (when enabled) never attend across unrelated documents packed in the same flat batch (sketched below).
- `linear_xielu_kernel` fuses the up-projection + xIELU activation + squaring into a single kernel (analogue of @samacqua's `linear_leaky_relu_square_kernel` with our xIELU activation).
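For intuition, a sketch of how packed-document boundaries become `cu_seqlens`; the `doc_cu_seqlens` helper is hypothetical, and the commented call shows the generic flash-attn varlen signature rather than this repo's exact wrapper.

```python
import torch
# from flash_attn import flash_attn_varlen_func  # flash-attn varlen API

def doc_cu_seqlens(doc_lens: list[int], device: str = "cpu") -> torch.Tensor:
    """Cumulative token offsets for documents packed into one flat batch.
    doc_lens=[5, 3, 4] -> tensor([0, 5, 8, 12]); attention is computed
    per [start, end) segment, so packed documents never attend across
    each other."""
    lens = torch.tensor([0] + list(doc_lens), dtype=torch.int32, device=device)
    return torch.cumsum(lens, dim=0, dtype=torch.int32)

# q, k, v have shape (total_tokens, n_heads, head_dim) in the varlen layout:
# cu = doc_cu_seqlens(doc_lens)
# out = flash_attn_varlen_func(
#     q, k, v,
#     cu_seqlens_q=cu, cu_seqlens_k=cu,
#     max_seqlen_q=max(doc_lens), max_seqlen_k=max(doc_lens),
#     causal=True,
# )
```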
Inherited from PR #1693 (@dexhunter + @MarioPaerle), gates used, TTT disabled for this submission:
- `sigmoid × 2` gate on attention output, zero-init for identity-at-init, composed with `fullgraph=True` compile (sketched below).
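A sketch of the gate's shape. Zero-initializing the projection makes `2 * sigmoid(0) = 1`, hence identity-at-init; conditioning the gate on the block input (rather than, say, a bare per-channel parameter) is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class AttnOutGate(nn.Module):
    """Sketch of a sigmoid*2 output gate on the attention branch.
    At init the zero weights give a gate of exactly 1 everywhere,
    so the module is a no-op until training opens/closes channels."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.proj.weight)  # identity-at-init

    def forward(self, attn_out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Per-channel gate in (0, 2), equal to 1 at init.
        return attn_out * (2.0 * torch.sigmoid(self.proj(x)))
```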
New in this PR:
- Layered local sliding windows. `LOCAL_WINDOW_SIZE=512` and `LOCAL_WINDOW_SIZE=1024` produced identical val_bpb when applied uniformly, suggesting per-layer window size is a near-free dial. This PR splits them: 512 tokens on locals `{0, 1, 2, 3, 5}` (early layers plus the recurrence-loop tail at layer 5, where attention FLOPs are 2×-amplified by `num_loops=2`), and 1024 tokens on locals `{6, 7, 8}` (post-loop integration layers, where wider context plausibly helps and isn't loop-amplified). Global layers `{4, 9, 10}` retain full attention. Zero compile cost: each block's `attn.window_size` is set once at init and baked in as a per-subgraph constant (see the sketch after this list).
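A sketch of the per-layer window policy with hypothetical constant names; the actual PR bakes the value into each block at init rather than branching at runtime.

```python
# Hypothetical constant names; the PR assigns each block's
# attn.window_size once at construction time, so torch.compile
# treats it as a per-subgraph constant.
GLOBAL_LAYERS = {4, 9, 10}       # full causal attention
EARLY_LOCALS = {0, 1, 2, 3, 5}   # 512-token window (incl. loop tail at 5)
LATE_LOCALS = {6, 7, 8}          # 1024-token window, post-loop layers

def window_for_layer(layer_idx: int) -> int:
    """Return the sliding-window size for a layer; -1 follows the
    flash-attn convention of "no window" (full attention)."""
    if layer_idx in GLOBAL_LAYERS:
        return -1
    return 512 if layer_idx in EARLY_LOCALS else 1024
```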
Dropped from PR #1674:
- KV-tying on the global attention layers (now `KV_TIE_GLOBAL=0`). The freed V-weights are spent on more expressive global-layer attention rather than on looser quantization clipping.
Quantization
- 7-bit embedding quantization (`EMBED_BITS=7`, clip `k=15.0`); a sketch follows below.

Total artifact at seed 1337: 15,962,729 B (compressed code + quantized state). Under the 16 MB cap by 37 KB.
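For intuition, a minimal sketch of one plausible reading of the quantizer: symmetric round-to-nearest with a clip at `k * std(w)`. The clipping interpretation and every name below are assumptions, not the PR's code.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 7, clip_k: float = 15.0):
    """Hypothetical symmetric round-to-nearest quantizer.
    Interpreting clip k=15.0 as a clip at k * std(w) is an assumption."""
    clip = clip_k * w.std()
    w_clipped = w.clamp(-clip, clip)
    qmax = 2 ** (bits - 1) - 1             # 63 levels per side at 7 bits
    scale = clip / qmax
    q = torch.round(w_clipped / scale).to(torch.int8)
    return q, scale

# Dequantize for eval: w_hat = q.float() * scale
```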
Compliance (Issue #1017 Track B)
Since this submission runs the sliding-window eval path with no test-time adaptation, only the causality and normalization conditions apply:
- Causality: `flash_attn_3_func(..., causal=True, window_size=attn.window_size)` on every attention call. SmearGate mixes with the previous token only (`F.pad(x[:, :-1], (0, 0, 1, 0))`).
- Normalization: `logit_softcap * tanh(logits / logit_softcap)` applied uniformly (standard stabilization, not a selective modulation).

Conditions 3 and 4 (score-before-update, single-pass) are TTT-specific and don't apply here. Both applicable conditions are sketched below.
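A sketch of the two applicable conditions. The additive SmearGate combination is an assumption (only the previous-token shift via `F.pad` is from the PR); the softcap matches the formula quoted above.

```python
import torch
import torch.nn.functional as F

def smear_previous_token(x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """Causal smear. F.pad(x[:, :-1], (0, 0, 1, 0)) shifts the sequence
    right by one position (zeros enter at t=0), so position t only ever
    mixes with position t-1 and no future information leaks.
    x: (batch, seq, dim); gate: broadcastable mixing weight."""
    prev = F.pad(x[:, :-1], (0, 0, 1, 0))
    return x + gate * prev  # additive mix is this sketch's assumption

def softcap_logits(logits: torch.Tensor, logit_softcap: float) -> torch.Tensor:
    """Uniform tanh soft-capping: bounds every logit to
    (-logit_softcap, logit_softcap), applied identically everywhere."""
    return logit_softcap * torch.tanh(logits / logit_softcap)
```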
Tokenizer: standard SP8192 (Kevin Clark's pre-tokenized dataset via PR #78, @mtybadger). No casefold, so legality is independent of Issue #1604.
Note on logs and seeds
No training/eval log attached. The VM used for these runs went down before the seed-1337 log could be pushed to GitHub, and I no longer have GPU access to reproduce. The metrics (val_bpb 1.07706, artifact 15,962,729 B, train 596s) are recorded from my own observation of the run output during the session. I invite judges to reproduce using the command below and expect the numbers to be within normal seed-variance.
Single-seed result (seed 1337). Compute-constrained, consistent with the "non-record research submission" convention used by PR #1674 (ours, earlier). Seed was picked before the run, not hindsight-selected.
Reproduction
Expected log markers:
- `Total submission size quantized+brotli: ~15,962,729 bytes`
- `diagnostic quantized_sliding_window val_loss:... val_bpb:1.07706...`

Running with `TTT_ENABLED=1` additionally invokes the phased TTT path (ported from PR #1693), but this is not the submission metric.

Lineage
Parallel track (with TTT):
Credits
Test plan
- `TTT_ENABLED=0` is the shipped default; the submission reproduces the 1.07706 number without TTT.
- `EVAL_EXTRA_LOOPS` (extra recurrence iterations at eval time): no improvement, regresses sliding bpb. Submission ships with the default `EVAL_EXTRA_LOOPS=0`.