
Non-record: Neural Base Model, No TTT — Parcae + Gates + Layered Windows (val_bpb 1.07706)#1728

Open
mikeapedia wants to merge 1 commit into openai:main from mikeapedia:submission/neural-base-model-no-ttt

Conversation

@mikeapedia

Non-record submission — Neural Base Model, No TTT

val_bpb = 1.07706 (sliding-window eval, seed 1337, non-casefold SP8192) | 15,962,729 B (~15.96 MB) | 8×H100 80GB SXM | 596s training + sliding-window eval

Summary

  • No test-time training. No LoRA adapters, no global-SGD. Pure architectural + quantization result.
  • Beats the current merged non-casefold SOTA (PR #1493 @bigbag, 1.0810) by 0.00394 BPB (0.00273 nats) — without any test-time adaptation.
  • Non-record submission. Doesn't attempt to clear the 0.005-nat bar vs merged SOTA (that's the full-TTT path, for which I'm out of compute credits). Positioning: establishes the base-model ceiling for standard SP8192 architectures and gives future TTT work a cleaner lift measurement.

Positioning

Most competitive submissions on the main-track leaderboard include test-time training, which conflates architectural gains with test-time compute gains. This submission isolates the architectural contribution:

As a reference point: PR #1493 (merged SOTA) reports sliding-only 1.0829 and post-TTT 1.0810. Our sliding-only is 1.07706, below their TTT-enabled number, and below their sliding-only by 0.00584 BPB.

Results (single seed)

Stage                                    val_bpb   Notes
Pre-quant EMA (bf16)                     1.0699    End-of-training, post-EMA
Post-quant (int6+int7+brotli), sliding   1.07706   Submission number
Quantization cost                        +0.00716  Typical for int6 GPTQ
  • Training: 596s (wallclock cap), step 4602/20000
  • Artifact: 15,962,729 B (under 16,000,000 B cap by 37,271 B)
  • Seed: 1337 (single-seed result — compute-constrained, see Note on logs and seeds below)

Architecture

Inherited from PR #1674 (ours, non-record research submission):

  • Parcae Constrained Loop Injection — SSM-style boundary condition at each loop re-entry: x = A_bar * x + B_bar * x0 with learned per-dim loop_log_A / loop_delta / loop_B. A_bar ∈ (0, 1) by construction (softplus on delta, exp of negative exp on log_A) enforces bounded decay; B_bar re-injects the original residual stream. Three per-dim scalars total (see the sketch after this list).
  • Gemma-style Global / Local Attention — global_attn_layers=[4, 9, 10] get full causal attention + partial RoPE (rope_dims=16 / head_dim=64); remaining layers use sliding-window attention + full RoPE for positional precision within the window.
  • Gram Newton-Schulz for high-aspect-ratio MLP banks (α > 2.5) — reduces Newton-Schulz cost on mlp_up_bank (4:1 ratio) and mlp_down_bank. NS steps dropped 5 → 4 since the architecture no longer requires the extra refine step.
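
For concreteness, a minimal PyTorch sketch of the Parcae boundary condition as described in the first bullet. The parameter names (loop_log_A, loop_delta, loop_B) follow the description; the class name, shapes, and init values are assumptions, not the submission's actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ParcaeLoopInjection(nn.Module):
    """SSM-style boundary condition applied at each loop re-entry (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        # three learned per-dim scalars, as named in the bullet above
        # (zero init here is an assumption)
        self.loop_log_A = nn.Parameter(torch.zeros(dim))
        self.loop_delta = nn.Parameter(torch.zeros(dim))
        self.loop_B = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        # A_bar = exp(-softplus(delta) * exp(log_A)) lies in (0, 1) by construction,
        # so the looped state decays but never explodes; x0 is the original
        # residual stream re-injected through B_bar at each loop re-entry.
        A_bar = torch.exp(-F.softplus(self.loop_delta) * torch.exp(self.loop_log_A))
        return A_bar * x + self.loop_B * x0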

Inherited from PR #1530 (@samacqua):

  • Variable-length attention — flash_attn_varlen_func with cu_seqlens boundaries; training, eval, and global-SGD TTT (when enabled) never attend across unrelated documents packed in the same flat batch (a minimal sketch follows this list).
  • Fused MLP Triton kernel — the custom linear_xielu_kernel fuses the up-projection + xIELU activation + squaring into a single kernel (analogue of @samacqua's linear_leaky_relu_square_kernel with our xIELU activation).
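
To illustrate the cu_seqlens mechanism, a minimal sketch using the flash-attn 2.x varlen interface (the submission itself runs flash-attn-3); the helper names and packing details are assumptions, not the actual train_gpt.py code.

import torch
from flash_attn import flash_attn_varlen_func

def cu_seqlens_from_doc_lens(doc_lens: torch.Tensor) -> torch.Tensor:
    # doc_lens: per-document token counts in the packed flat batch
    # returns cumulative boundaries [0, l0, l0+l1, ...] as int32 on the same device
    zero = torch.zeros(1, dtype=torch.int32, device=doc_lens.device)
    return torch.cat([zero, doc_lens.cumsum(0, dtype=torch.int32)])

def doc_isolated_attention(q, k, v, cu_seqlens, max_seqlen, window=(-1, -1)):
    # q, k, v: (total_tokens, n_heads, head_dim), packed across documents.
    # Attention never crosses the cu_seqlens boundaries, so unrelated documents
    # sharing a flat batch cannot attend to each other.
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        causal=True,
        window_size=window,  # (-1, -1) = full causal; (w - 1, 0) = sliding window
    )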

Inherited from PR #1693 (@dexhunter + @MarioPaerle) — gates used, TTT disabled for this submission:

  • Attention Output Gate (PR #1667 @MarioPaerle) — per-head, input-dependent sigmoid × 2 gate on the attention output, zero-init for identity at init, and composes with fullgraph=True compilation.
  • SmearGate (@KellerJordan concept via modded-nanogpt; @MarioPaerle reintroduction) — input-dependent per-channel residual mixer blending the current token with the previous token (strictly causal, backward-looking by one position), zero-init lambda. Both gates are sketched after this list.
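
A minimal sketch of both gates as described above. The class names, projection shapes, and the exact form of the SmearGate blend are assumptions; only the sigmoid × 2 / zero-init-for-identity behavior and the strictly causal one-token backward shift are taken from the bullets.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnOutGate(nn.Module):
    """Per-head, input-dependent sigmoid x 2 gate on attention output (sketch)."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.to_gate = nn.Linear(dim, n_heads, bias=False)
        nn.init.zeros_(self.to_gate.weight)   # sigmoid(0) * 2 = 1 => identity at init

    def forward(self, x, attn_out):
        # x: (B, T, D) block input; attn_out: (B, T, H, Dh) per-head attention output
        gate = 2.0 * torch.sigmoid(self.to_gate(x))   # (B, T, H)
        return attn_out * gate.unsqueeze(-1)

class SmearGate(nn.Module):
    """Input-dependent per-channel blend of each token with the previous one (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_lambda = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.to_lambda.weight)  # zero-init lambda => pass-through at init

    def forward(self, x):                                # x: (B, T, D)
        prev = F.pad(x[:, :-1], (0, 0, 1, 0))            # previous token, strictly causal
        lam = self.to_lambda(x)                          # per-channel mixing coefficient
        return x + lam * (prev - x)                      # lam = 0 -> x unchanged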

New in this PR:

  • Layered Local Sliding Windows — a prior uniform-window ablation on this architecture showed LOCAL_WINDOW_SIZE=512 and LOCAL_WINDOW_SIZE=1024 produced identical val_bpb, suggesting per-layer window size is a near-free dial. This PR splits: 512 tokens on locals {0, 1, 2, 3, 5} (early layers + the recurrence-loop tail at layer 5, where attention FLOPs are 2×-amplified by num_loops=2), 1024 tokens on locals {6, 7, 8} (post-loop integration layers where wider context plausibly helps and isn't loop-amplified). Global layers {4, 9, 10} retain full attention. Zero compile-cost — each block's attn.window_size is set once at init and baked as a per-subgraph constant.
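
A minimal sketch of this per-layer window assignment, using the flash-attn (left, right) window convention. The layer indices and sizes come from the bullet above; the constant names and the init-time hook are illustrative, not the submission's actual identifiers.

GLOBAL_ATTN_LAYERS = {4, 9, 10}
LOCAL_WINDOW = {0: 512, 1: 512, 2: 512, 3: 512, 5: 512,   # early + loop-tail layers
                6: 1024, 7: 1024, 8: 1024}                # post-loop integration layers

def window_size_for_layer(layer_idx: int) -> tuple[int, int]:
    # flash-attn convention: (left, right); (-1, -1) means full causal attention
    if layer_idx in GLOBAL_ATTN_LAYERS:
        return (-1, -1)
    w = LOCAL_WINDOW[layer_idx]
    return (w - 1, 0)   # strictly causal sliding window of w tokens

# Set once at module construction so torch.compile bakes each block's window in
# as a per-subgraph constant (zero extra compile cost):
# for i, block in enumerate(model.blocks):
#     block.attn.window_size = window_size_for_layer(i)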

Dropped from PR #1674:

  • KV-tying on the global attention layers (see the lineage and commit message below).

Quantization

  • int6 GPTQ on matrix weights (Q/K/V/O banks, MLP banks) with SDClip (std-based clipping, k=12.85 for matrix), 16 calibration batches, 4s reserved from training budget
  • int7 GPTQ on embedding (EMBED_BITS=7, clip k=15.0)
  • Brotli on the quantized state dict
  • LZMA on the code

Total artifact at seed 1337: 15,962,729 B (compressed code + quantized state). Under the 16 MB cap by 37 KB.
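
A minimal sketch of the SDClip-then-quantize step for a single weight tensor. Plain round-to-nearest stands in for GPTQ's error-compensating updates, and the per-tensor scale is an assumption; only the bit widths and clip constants (k=12.85 for matrix banks, k=15.0 / 7-bit for the embedding) come from the list above.

import torch

def sdclip_quantize(w: torch.Tensor, bits: int = 6, k: float = 12.85):
    # std-based clip range ("SDClip"), then symmetric uniform quantization
    clip = k * w.float().std()
    w_clipped = w.float().clamp(-clip, clip)
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6, 63 for int7
    scale = clip / qmax
    q = torch.round(w_clipped / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale                                  # dequantize as q.float() * scale

# matrix banks: sdclip_quantize(w, bits=6, k=12.85)
# embedding:    sdclip_quantize(emb, bits=7, k=15.0)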

Compliance (Issue #1017 Track B)

Since this submission runs the sliding-window eval path with no test-time adaptation, only the causality and normalization conditions apply:

  • Condition 1 (Causality): Sliding-window eval is strictly causal. flash_attn_3_func(..., causal=True, window_size=attn.window_size) on every attention call. SmearGate mixes with the previous token only (F.pad(x[:, :-1], (0, 0, 1, 0))).
  • Condition 2 (Normalized distribution): Standard softmax over full SP8192 vocabulary. Gates modulate hidden states, not logits. logit_softcap * tanh(logits / logit_softcap) applied uniformly (standard stabilization, not a selective modulation).

Conditions 3 and 4 (score-before-update, single-pass) are TTT-specific and don't apply here.
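
For reference, the uniform logit softcap from Condition 2 in one line; the cap value shown is illustrative, since the PR does not state it.

import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # smooth, uniform clamp of every logit into (-cap, cap); applied identically
    # to all positions and all vocabulary entries (no selective modulation)
    return cap * torch.tanh(logits / cap)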

Tokenizer: standard SP8192 (Kevin Clark's pre-tokenized dataset via PR #78 @mtybadger). No casefold — legality-independent of Issue #1604.

Note on logs and seeds

No training/eval log attached. The VM used for these runs went down before the seed-1337 log could be pushed to GitHub, and I no longer have GPU access to reproduce. The metrics (val_bpb 1.07706, artifact 15,962,729 B, train 596s) are recorded from my own observation of the run output during the session. I invite judges to reproduce using the command below and expect the numbers to be within normal seed-variance.

Single-seed result (seed 1337). Compute-constrained, consistent with the "non-record research submission" convention used by PR #1674 (ours, earlier). Seed was picked before the run, not hindsight-selected.

Reproduction

# setup
uv sync                      # torch 2.11 cu130 + flash-attn-3 from pyproject.toml
nvidia-smi | head -20        # confirm 8x H100 80GB SXM

# run (all defaults — TTT off, sliding on, non-casefold SP8192)
mkdir -p runs/base_model_seed1337   # log redirect target must exist before launch
ARTIFACT_DIR=runs/base_model_seed1337 SEED=1337 \
  uv run torchrun --standalone --nproc_per_node=8 --max-restarts=0 \
  train_gpt.py \
  > runs/base_model_seed1337/run.log 2>&1

Expected log markers:

  • Total submission size quantized+brotli: ~15,962,729 bytes
  • diagnostic quantized_sliding_window val_loss:... val_bpb:1.07706...
  • Total wallclock: ~700-800s (596s training + ~100-200s eval including quantization)

Running with TTT_ENABLED=1 additionally invokes the phased TTT path (ported from PR #1693), but this is not the submission metric.

Lineage

  • PR #1530 (@samacqua, varlen attention + fused MLP + doc-independent LoRA TTT) →
  • PR #1586 / PR #1648 (xIELU + QK-Gain) →
  • PR #1674 (ours: Parcae + Gemma-style attn + Gram NS + KV-tying, non-record) →
  • this PR (+ AttnOutGate + SmearGate + layered local windows, KV-tying dropped, no TTT)

Parallel track (with TTT):

Credits

Test plan

  • Single-seed training on 8×H100 SXM (seed 1337) — 596s train, under 600s cap
  • Artifact size 15,962,729 B — under 16,000,000 B cap
  • Sliding-window eval completes with val_bpb 1.07706
  • TTT_ENABLED=0 is the shipped default — submission reproduces the 1.07706 number without TTT
  • Standard SP8192 tokenizer, Track B conditions 1-2 satisfied (causality, normalized distribution)
  • Tested EVAL_EXTRA_LOOPS (extra recurrence iterations at eval time) — no improvement, regresses sliding bpb. Submission ships with default EVAL_EXTRA_LOOPS=0.
  • Reproducibility to be verified by judges on their infrastructure (training/eval log not attached — see note above)

Commit message

Sliding-window eval only (TTT_ENABLED=0 by default) on standard SP8192.
Beats merged non-casefold SOTA PR openai#1493 (1.0810) by 0.00394 BPB without
any test-time adaptation. Single seed 1337; compute-constrained
non-record submission — VM went down before the run log could be pushed
so it is not attached. Metrics were observed during the session.

Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop
injection, Gemma-style global/local attention, Gram Newton-Schulz) +
PR openai#1530 (@samacqua) varlen attention + fused MLP triton kernel +
AttnOutGate (PR openai#1667 @MarioPaerle) + SmearGate (@KellerJordan concept,
@MarioPaerle reintroduction) + new layered local sliding windows
(512 on early/loop layers, 1024 on post-loop layers, split at index 6).

KV-tying on globals dropped vs PR openai#1674. TTT scaffolding (phased
global-SGD + per-doc LoRA, from PR openai#1693 lineage) remains in the file
for experiments but is disabled by default for this submission.