# Non-record submission — Neural Base Model, No TTT

**val_bpb = 1.07706** (sliding-window eval, seed 1337, non-casefold SP8192) | **15,962,729 B (~15.96 MB)** | 8×H100 80GB SXM | 596s training + sliding-window eval

## Summary

- **No test-time training.** No LoRA adapters, no global-SGD. Pure architectural + quantization result.
- **Beats current merged non-casefold SOTA ([PR #1493](https://github.com/openai/parameter-golf/pull/1493) @bigbag, 1.0810)** by **0.00394 BPB (0.00273 nats)** — without any test-time adaptation.
- **Non-record submission.** Doesn't attempt to clear the 0.005-nat bar vs merged SOTA (that's the full-TTT path, for which I'm out of compute credits). Positioning: establishes the *base-model ceiling* for standard SP8192 architectures and gives future TTT work a cleaner lift measurement.

## Positioning

Most competitive submissions on the main-track leaderboard include test-time training, which conflates *architectural* gains with *test-time compute* gains. This submission isolates the architectural contribution:

- val_bpb = 1.07706 with `TTT_ENABLED=0` in the shipped config
- TTT scaffolding (phased global-SGD + per-doc LoRA, ported from PR #1693 @dexhunter + @MarioPaerle) remains in the file for experiments but is disabled by default
- Running with `TTT_ENABLED=1` on the same model trends toward 1.074–1.075 in my experiments, short of clearing the 0.005-nat bar vs PR #1695 (1.0759) — further TTT tuning is future work pending compute

As a reference point: PR #1493 (merged SOTA) reports sliding-only 1.0829 and post-TTT 1.0810. Our sliding-only is 1.07706, **below their TTT-enabled number**, and below their sliding-only by **0.00584 BPB**.

## Results (single seed)

| Stage | val_bpb | Notes |
|---|---|---|
| Pre-quant EMA (bf16) | **1.0699** | End-of-training, post-EMA |
| Post-quant (int6+int7+brotli) sliding | **1.07706** | **Submission number** |
| Quantization cost | +0.00716 | Typical for int6 GPTQ |

- Training: **596s**, stopped by the wallclock cap at step 4602/20000
- Artifact: **15,962,729 B** (under 16,000,000 B cap by 37,271 B)
- Seed: 1337 (single-seed result — compute-constrained, see Note on logs and seeds below)

## Architecture

Inherited from [PR #1674](https://github.com/openai/parameter-golf/pull/1674) (ours, non-record research submission):

- **Parcae Constrained Loop Injection** — SSM-style boundary condition at each loop re-entry: `x = A_bar * x + B_bar * x0` with learned per-dim `loop_log_A` / `loop_delta` / `loop_B`. `A_bar ∈ (0, 1)` by construction (softplus on delta, exp of negative exp on log_A) enforces bounded decay; `B_bar` re-injects the original residual stream. Three per-dim parameter vectors in total (see the sketch after this list).
- **Gemma-style Global / Local Attention** — `global_attn_layers=[4, 9, 10]` get full causal attention + partial RoPE (`rope_dims=16 / head_dim=64`); remaining layers use sliding-window attention + full RoPE for positional precision within the window.
- **Gram Newton-Schulz** for high-aspect-ratio MLP banks (α > 2.5) — reduces Newton-Schulz cost on `mlp_up_bank` (4:1 ratio) and `mlp_down_bank`. NS steps dropped 5 → 4 since the architecture no longer requires the extra refine step.
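
A minimal sketch of the loop injection, under one plausible reading of the discretization above (Mamba-style `A_bar = exp(-softplus(delta) * exp(log_A))`); module and tensor names are illustrative, not the actual `train_gpt.py` code:

```python
# Hypothetical sketch of the Parcae constrained loop injection; the exact
# discretization in the shipped file may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParcaeLoopInjection(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # three per-dim parameter vectors
        self.loop_log_A = nn.Parameter(torch.zeros(dim))
        self.loop_delta = nn.Parameter(torch.zeros(dim))
        self.loop_B = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        # softplus keeps the step size positive; exp(-positive) keeps A_bar
        # in (0, 1), enforcing bounded decay of the looped state
        delta_bar = F.softplus(self.loop_delta)
        A_bar = torch.exp(-delta_bar * torch.exp(self.loop_log_A))
        # boundary condition at loop re-entry: decay the looped state and
        # re-inject the original residual stream x0
        return A_bar * x + self.loop_B * x0
```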

Inherited from [PR #1530](https://github.com/openai/parameter-golf/pull/1530) (@samacqua):

- **Variable-length attention** — `flash_attn_varlen_func` with `cu_seqlens` boundaries; training, eval, and global-SGD TTT (when enabled) never attend across unrelated documents packed in the same flat batch (see the sketch after this list).
- **Fused MLP triton kernel** — custom `linear_xielu_kernel` fuses the up-projection + xIELU activation + squaring into a single kernel (analogue of @samacqua's `linear_leaky_relu_square_kernel` with our xIELU activation).
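
A minimal sketch of the document-boundary handling from the varlen bullet above; tensor names are illustrative, and the import path shown is the flash-attn 2 one (this PR uses the FA3 interface):

```python
# Hypothetical sketch: cu_seqlens built from packed-document lengths so that
# causal attention never crosses document boundaries.
import torch
from flash_attn import flash_attn_varlen_func

doc_lengths = torch.tensor([3, 5, 2], device="cuda")    # tokens per packed document
cu_seqlens = torch.cat([
    torch.zeros(1, dtype=torch.int32, device="cuda"),
    doc_lengths.cumsum(0, dtype=torch.int32),
])                                                      # [0, 3, 8, 10]

n_heads, head_dim = 8, 64
total_tokens = int(doc_lengths.sum())
q = torch.randn(total_tokens, n_heads, head_dim, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# strictly causal attention, restricted to each document's own span
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(doc_lengths.max()), max_seqlen_k=int(doc_lengths.max()),
    causal=True,
)
```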

Inherited from [PR #1693](https://github.com/openai/parameter-golf/pull/1693) (@dexhunter + @MarioPaerle) — **gates used, TTT disabled for this submission**:

- **Attention Output Gate** ([PR #1667](https://github.com/openai/parameter-golf/pull/1667) @MarioPaerle) — per-head input-dependent `sigmoid × 2` gate on attention output, zero-init for identity-at-init, composed with `fullgraph=True` compile.
- **SmearGate** (@kellerjordan concept via modded-nanogpt; @MarioPaerle reintroduction) — input-dependent per-channel residual mixer blending current token with previous token (strictly causal, backward-looking by one position), zero-init lambda.
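
A minimal sketch of the two gates as described (zero-init so both act as the identity at initialization); module names and projection shapes are illustrative, not the shipped `train_gpt.py` modules:

```python
# Hypothetical sketches of AttnOutGate and SmearGate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnOutGate(nn.Module):
    """Per-head, input-dependent sigmoid*2 gate on the attention output."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.proj = nn.Linear(dim, n_heads)
        nn.init.zeros_(self.proj.weight)           # zero-init -> sigmoid(0) * 2 = 1,
        nn.init.zeros_(self.proj.bias)             # i.e. identity at init

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); attn_out: (B, T, n_heads, head_dim)
        gate = 2.0 * torch.sigmoid(self.proj(x))   # (B, T, n_heads)
        return attn_out * gate.unsqueeze(-1)

class SmearGate(nn.Module):
    """Blend each token with the previous token; strictly backward-looking."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_lambda = nn.Linear(dim, dim)
        nn.init.zeros_(self.to_lambda.weight)      # zero-init lambda -> no mixing at init
        nn.init.zeros_(self.to_lambda.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prev = F.pad(x[:, :-1], (0, 0, 1, 0))      # previous token (shift right by one)
        lam = self.to_lambda(x)                    # input-dependent per-channel lambda
        return x + lam * (prev - x)
```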

**New in this PR:**

- **Layered Local Sliding Windows** — a prior uniform-window ablation on this architecture showed `LOCAL_WINDOW_SIZE=512` and `LOCAL_WINDOW_SIZE=1024` produced identical val_bpb, suggesting per-layer window size is a near-free dial. This PR splits: 512 tokens on locals `{0, 1, 2, 3, 5}` (early layers + the recurrence-loop tail at layer 5, where attention FLOPs are 2×-amplified by `num_loops=2`), 1024 tokens on locals `{6, 7, 8}` (post-loop integration layers where wider context plausibly helps and isn't loop-amplified). Global layers `{4, 9, 10}` retain full attention. Zero compile-cost — each block's `attn.window_size` is set once at init and baked as a per-subgraph constant.
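
A minimal configuration sketch of the split (constant and helper names are illustrative; in the shipped file each block's `attn.window_size` is simply set once at init):

```python
# Hypothetical sketch of the per-layer window assignment described above.
NUM_LAYERS = 11
GLOBAL_ATTN_LAYERS = {4, 9, 10}         # full causal attention + partial RoPE
SMALL_WINDOW_LAYERS = {0, 1, 2, 3, 5}   # 512 tokens (incl. loop-amplified layer 5)
WIDE_WINDOW_LAYERS = {6, 7, 8}          # 1024 tokens (post-loop integration layers)

def window_for_layer(i: int) -> int | None:
    """None means full attention (global layer); otherwise the sliding-window size."""
    if i in GLOBAL_ATTN_LAYERS:
        return None
    return 512 if i in SMALL_WINDOW_LAYERS else 1024

# set once at init, so each block's window is baked as a per-subgraph constant
layer_windows = {i: window_for_layer(i) for i in range(NUM_LAYERS)}
```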

Dropped from PR #1674:

- **KV-tying on global attention layers** — present in PR #1674 as a memory/param-budget optimization, disabled here (`KV_TIE_GLOBAL=0`). Freed V-weights are spent on more expressive global-layer attention rather than on looser quantization clipping.

## Quantization

- **int6 GPTQ** on matrix weights (Q/K/V/O banks, MLP banks) with SDClip (std-based clipping, k=12.85 for matrix), 16 calibration batches, 4s reserved from training budget
- **int7 GPTQ** on embedding (`EMBED_BITS=7`, clip k=15.0)
- **Brotli** on the quantized state dict
- **LZMA** on the code

Total artifact at seed 1337: **15,962,729 B** (compressed code + quantized state). Under the 16 MB cap by 37 KB.
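
As a rough illustration of the clipping and compression steps above (the GPTQ solve itself is replaced by plain round-to-nearest; function names and packing format are illustrative, not the shipped pipeline):

```python
# Hypothetical sketch: SDClip (std-based clipping) + symmetric quantization,
# then brotli over the serialized quantized state.
import io

import brotli
import torch

def sd_clip(w: torch.Tensor, k: float) -> torch.Tensor:
    # clip weights to +/- k standard deviations before quantizing
    bound = float(k * w.std())
    return w.clamp(-bound, bound)

def quantize_symmetric(w: torch.Tensor, bits: int):
    # stand-in for the GPTQ solve: symmetric round-to-nearest
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax).to(torch.int8)
    return q, scale

w = torch.randn(2048, 512)
q, scale = quantize_symmetric(sd_clip(w, k=12.85), bits=6)   # int6 matrix weights

buf = io.BytesIO()
torch.save({"w_q": q, "w_scale": scale}, buf)                # quantized state dict
compressed = brotli.compress(buf.getvalue())                 # brotli, as in the artifact
```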

## Compliance (Issue #1017 Track B)

Since this submission runs the sliding-window eval path with no test-time adaptation, only the causality and normalization conditions apply:

- **Condition 1 (Causality):** Sliding-window eval is strictly causal. `flash_attn_3_func(..., causal=True, window_size=attn.window_size)` on every attention call. SmearGate mixes with the *previous* token only (`F.pad(x[:, :-1], (0, 0, 1, 0))`).
- **Condition 2 (Normalized distribution):** Standard softmax over full SP8192 vocabulary. Gates modulate hidden states, not logits. `logit_softcap * tanh(logits / logit_softcap)` applied uniformly (standard stabilization, not a selective modulation).

Conditions 3 and 4 (score-before-update, single-pass) are TTT-specific and don't apply here.
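
As a concreteness check on Condition 2, a minimal sketch (the softcap value here is illustrative, not the shipped constant) showing that a uniform softcap followed by a standard softmax still yields a normalized distribution over the full vocabulary:

```python
# Illustrative only: uniform logit softcap, then softmax over the SP8192 vocab.
import torch

logit_softcap = 15.0                                            # illustrative value
logits = torch.randn(4, 8192)                                   # (batch, vocab)
capped = logit_softcap * torch.tanh(logits / logit_softcap)     # applied uniformly
probs = torch.softmax(capped, dim=-1)
assert torch.allclose(probs.sum(-1), torch.ones(4), atol=1e-5)  # sums to 1
```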

**Tokenizer:** standard SP8192 (Kevin Clark's pre-tokenized dataset via [PR #78](https://github.com/openai/parameter-golf/pull/78) @mtybadger). No casefold, so legality is independent of Issue #1604.

## Note on logs and seeds

**No training/eval log attached.** The VM used for these runs went down before the seed-1337 log could be pushed to GitHub, and I no longer have GPU access to reproduce. The metrics (val_bpb 1.07706, artifact 15,962,729 B, train 596s) are recorded from my own observation of the run output during the session. I invite judges to reproduce using the command below and expect the numbers to be within normal seed-variance.

**Single-seed result (seed 1337).** Compute-constrained, consistent with the "non-record research submission" convention used by [PR #1674](https://github.com/openai/parameter-golf/pull/1674) (ours, earlier). Seed was picked before the run, not hindsight-selected.

## Reproduction

```bash
# setup
uv sync # torch 2.11 cu130 + flash-attn-3 from pyproject.toml
nvidia-smi | head -20 # confirm 8x H100 80GB SXM

# run (all defaults — TTT off, sliding on, non-casefold SP8192)
mkdir -p runs/base_model_seed1337  # log/artifact directory must exist for the redirect below
ARTIFACT_DIR=runs/base_model_seed1337 SEED=1337 \
uv run torchrun --standalone --nproc_per_node=8 --max-restarts=0 \
train_gpt.py \
> runs/base_model_seed1337/run.log 2>&1
```

Expected log markers:
- `Total submission size quantized+brotli: ~15,962,729 bytes`
- `diagnostic quantized_sliding_window val_loss:... val_bpb:1.07706...`
- Total wallclock: ~700-800s (596s training + ~100-200s eval including quantization)

Running with `TTT_ENABLED=1` additionally invokes the phased TTT path (ported from PR #1693), but this is *not* the submission metric.

## Lineage

- [PR #1530](https://github.com/openai/parameter-golf/pull/1530) (@samacqua, varlen attention + fused MLP + doc-independent LoRA TTT) →
- [PR #1586](https://github.com/openai/parameter-golf/pull/1586) / [PR #1648](https://github.com/openai/parameter-golf/pull/1648) (xIELU + QK-Gain) →
- [PR #1674](https://github.com/openai/parameter-golf/pull/1674) (ours: Parcae + Gemma-style attn + Gram NS + KV-tying, non-record) →
- **this PR** (+ AttnOutGate + SmearGate + layered local windows, KV-tying dropped, no TTT)

Parallel track (with TTT):
- [PR #1693](https://github.com/openai/parameter-golf/pull/1693) (@dexhunter + @MarioPaerle, casefold + gates + phased TTT) →
- [PR #1695](https://github.com/openai/parameter-golf/pull/1695) (@X-Abhishek-X, non-casefold + gates + phased TTT + LoRA+ / layer-LR-alpha)

## Credits

- **@samacqua** — [PR #1530](https://github.com/openai/parameter-golf/pull/1530) base (varlen attention, fused MLP, document-independent TTT framework, compressed-artifact tooling)
- **@bigbag** — [PR #1493](https://github.com/openai/parameter-golf/pull/1493) merged non-casefold SOTA (comparison baseline)
- **@clarkkev** — [PR #1394](https://github.com/openai/parameter-golf/pull/1394) GPTQ SDClip pipeline and SP8192 tokenizer integration
- **@MarioPaerle** — [PR #1667](https://github.com/openai/parameter-golf/pull/1667) Attention Output Gate; SmearGate reintroduction to parameter-golf
- **@kellerjordan** — SmearGate concept (modded-nanogpt)
- **@dexhunter** — Multi-Phase Global SGD TTT framework (PR #1610 / #1626 / #1670 / #1693) — present in the file (disabled here) for future TTT work
- **@mtybadger** — [PR #78](https://github.com/openai/parameter-golf/pull/78) SP8192 tokenizer + pre-tokenized dataset
- **@mikeapedia** — [PR #1674](https://github.com/openai/parameter-golf/pull/1674) base carried into this submission (Parcae loop, Gemma-style attention, Gram Newton-Schulz, xIELU activation); new here: layered local sliding windows, non-KV-tied global attention, file defaults for non-TTT submission

## Test plan

- [x] Single-seed training on 8×H100 SXM (seed 1337) — 596s train, under 600s cap
- [x] Artifact size 15,962,729 B — under 16,000,000 B cap
- [x] Sliding-window eval completes with val_bpb 1.07706
- [x] `TTT_ENABLED=0` is the shipped default — submission reproduces the 1.07706 number without TTT
- [x] Standard SP8192 tokenizer, Track B conditions 1-2 satisfied (causality, normalized distribution)
- [x] Tested `EVAL_EXTRA_LOOPS` (extra recurrence iterations at eval time) — no improvement, regresses sliding bpb. Submission ships with default `EVAL_EXTRA_LOOPS=0`.
- [ ] Judges verify reproducibility on their infrastructure (training/eval log not attached — see note above)
{
"author": "Mikeapedia",
"github_id": "mikeapedia",
"name": "Neural Base Model, No TTT: Parcae + Gates + Layered Windows",
"blurb": "Non-record base-model submission. val_bpb 1.07706 with sliding-window eval only (no TTT, no LoRA, no global-SGD). Beats merged non-casefold SOTA (PR #1493 at 1.0810) by 0.00394 BPB on architectural strength alone. Establishes a base-model ceiling reference for future TTT work. Builds on our PR #1674 (Parcae loop, Gemma-style attention, Gram NS) with layered local sliding windows (512/1024 split at layer 6), AttnOutGate and SmearGate from the PR #1693 lineage, and PR #1530's varlen attention. Tokenizer: standard SP8192 (not casefold).",
"date": "2026-04-18T00:00:00Z",
"val_bpb": 1.07706,
"seeds": [1337],
"seed_results": {
"1337": {"val_bpb": 1.07706, "artifact_bytes": 15962729}
},
"bytes_total": 15962729,
"record_type": "non-record",
"tokenizer": "SP8192",
"ttt_enabled": false,
"comparison_baseline_pr": 1493,
"comparison_baseline_bpb": 1.0810
}