
Non-record: GolfParty — composable scaffolding for every Requests-for-PRs item #1978

Open
EthanYangTW wants to merge 4 commits into openai:main from EthanYangTW:submission/golfparty-allchecks

Conversation


@EthanYangTW EthanYangTW commented Apr 30, 2026

Summary

Single non-record submission that addresses all currently-unchecked items on OpenAI's Requests-for-PRs list (Universal Transformer, megakernels, SSM, E2E TTT, super long context, RLA, JEPA, text diffusion, H-net tokenization) as toggleable env vars on the PR #1953 base. Default config is byte-identical to PR #1953; toggles compose additively.

3-seed mean post-TTT val_bpb 1.07776 (std 0.00126) on 8×H100 SXM. All seeds within the 600s training cap.

Position: NOT a SOTA bid. This is a composability ablation + scaffolding for future record submissions in the directions OpenAI explicitly invited. Aligned with the README's "we strongly encourage participants to submit implementations for weird or out-of-the-box ideas, in-progress or unoptimized solutions, so long as they run successfully, or even interesting negative results."

What's in the box

| Request-for-PRs item | Env var | Status |
|---|---|---|
| Universal transformer | KS_UT_DEPTH | Real — extends PR #1344 Loop4-5 by K extra cycles |
| Megakernels | KS_MEGAKERNEL | Real (already shipping) — surfaces fused LeakyReLU² MLP + softcapped CE Triton kernels |
| Super long context | KS_LONG_CONTEXT + EVAL_SEQ_LEN=3072 | Real |
| Random linear adapters | TTT_RLA_ENABLED | Real — frozen orthonormal A, only B learnable |
| Text diffusion | KS_DIFFUSION_FRAC | Real — training-time embedding-noise auxiliary |
| E2E TTT | KS_E2E_TTT | Wired but disabled — OOM at EVAL_SEQ_LEN=3072 + UT depth |
| JEPA | KS_JEPA_WEIGHT | Wired but disabled — GPTQ Hessian KeyError on aux head |
| State-space models | KS_SSM_LAST_K | Stub — Python-loop scan compile-toxic |
| H-net tokenization | KS_HNET_CHUNK | Stub — dynamic-shape padding compile-toxic |

5 active in shipped 3-seed config; 2 wired-with-blocker; 2 stubs. All 9 documented in notes/.
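The env-var toggle pattern described above can be sketched as follows. This is a hypothetical helper (`KitchenSinkConfig` and `config_from_env` are illustrative names, not the actual train_gpt.py wiring): every toggle defaults to "off", so an empty environment reproduces the baseline, and toggles compose additively.

```python
import os
from dataclasses import dataclass

@dataclass
class KitchenSinkConfig:
    ut_depth: int = 0           # extra Universal-Transformer recurrence cycles
    long_context: bool = False  # paired with EVAL_SEQ_LEN=3072 at eval time
    diffusion_frac: float = 0.0 # fraction of embeddings replaced with noise
    rla_enabled: bool = False   # random linear adapters during TTT

def config_from_env(env=None):
    """Read toggles from env vars; all defaults reproduce the baseline."""
    env = os.environ if env is None else env
    return KitchenSinkConfig(
        ut_depth=int(env.get("KS_UT_DEPTH", "0")),
        long_context=env.get("KS_LONG_CONTEXT", "0") == "1",
        diffusion_frac=float(env.get("KS_DIFFUSION_FRAC", "0.0")),
        rla_enabled=env.get("TTT_RLA_ENABLED", "0") == "1",
    )
```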

Per-seed results

| Seed | Pre-quant | Quant | Post-TTT | Eval s | Artifact bytes |
|---|---|---|---|---|---|
| 42 | 1.07594 | 1.08396 | 1.07631 | 359.6 | 16,008,464 |
| 1234 | 1.07726 | 1.08531 | 1.07860 | 353.2 | 16,003,972 |
| 0 | 1.07717 | 1.08508 | 1.07838 | 359.7 | 16,000,415 |
| Mean | 1.07679 | 1.08478 | 1.07776 | 357.5 | 16,004,284 |
| Std | 0.00073 | 0.00073 | 0.00126 | 3.7 | 4,030 |

vs current rank-1 PR #1855 (1.06108): +0.01668 BPB (a regression — this is a non-record submission).

Artifact size note: all 3 seeds came in 415–8,464 bytes above the 16,000,000-byte cap. The overage is driven by ~6 KB of compressed kitchen-sink scaffolding plus ~5 KB of bf16 run-to-run variance. Trivially fixable (strip the toy classes / bump weight decay); kept as-shipped to preserve full scaffolding visibility for review.
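A quick arithmetic check of the quoted overages against the cap (values copied from the table above):

```python
CAP = 16_000_000
artifact_bytes = {42: 16_008_464, 1234: 16_003_972, 0: 16_000_415}

# Per-seed overage in bytes, and the worst case as a fraction of the cap.
overage = {seed: b - CAP for seed, b in artifact_bytes.items()}
max_pct = max(overage.values()) / CAP * 100  # ~0.05% of the cap
```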

Test plan

Lineage

PR #1953 → #1945 → #1923 → #1908 → #1855 → #1797 → #1787 → #1729 → #1667 → #1530 → #1394 → #1344. Toy implementations of SSM, JEPA, diffusion, H-net introduced in this submission.

…-PRs item

Single non-record submission that addresses all currently-unchecked items on
OpenAI's Requests-for-PRs list (Universal Transformer, megakernels, SSM, E2E
TTT, super long context, RLA, JEPA, text diffusion, H-net tokenization) as
toggleable env vars on the PR openai#1953 base. Default config is byte-identical to
PR openai#1953; toggles compose additively.

3-seed mean post-TTT val_bpb 1.07776 (std 0.00126), all seeds within 600s
training cap on 8xH100 SXM. Each technique honestly labeled real /
wired-but-disabled-with-reason / stub-for-future-work in notes/.

Real wired toggles in shipped run: KS_UT_DEPTH, KS_LONG_CONTEXT (via
EVAL_SEQ_LEN=3072), KS_DIFFUSION_FRAC, TTT_RLA_ENABLED, KS_MEGAKERNEL.
Wired-but-disabled (with documented blocker): KS_E2E_TTT (OOM),
KS_JEPA_WEIGHT (GPTQ Hessian KeyError on aux head). Stubs (compile-toxic):
KS_SSM_LAST_K, KS_HNET_CHUNK.
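To illustrate why the KS_SSM_LAST_K stub is described as compile-toxic, here is a minimal sketch (hypothetical function name, pure Python, not the submission's code) of a naive diagonal state-space scan: the recurrence forces a sequential Python loop whose length depends on the input, which is exactly the shape of code torch.compile handles poorly.

```python
def ssm_scan_last_k(x, a, b, k):
    """Naive diagonal SSM scan: h[t] = a * h[t-1] + b * x[t].

    The sequential data-dependent loop is the 'compile-toxic' part:
    the traced graph would grow with len(x). Returns the last k states.
    """
    h, states = 0.0, []
    for xt in x:
        h = a * h + b * xt
        states.append(h)
    return states[-k:]
```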

Includes: per-seed train+eval logs, per-feature notes/, 3-seed launcher,
CaseOps SP8192 tokenizer.

Position: not a SOTA bid. Composability ablation + scaffolding for future
record submissions in the directions OpenAI explicitly invited.
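The TTT_RLA_ENABLED toggle is described as "frozen orthonormal A, only B learnable". A numpy sketch of that idea (illustrative names; the submission's torch implementation is not reproduced here): A is a frozen random orthonormal projection, B is the only trainable matrix, and zero-initializing B makes the adapter a no-op at the start of test-time training.

```python
import numpy as np

def make_rla(d_model, rank, seed=0):
    """Build a Random Linear Adapter: frozen orthonormal A, learnable B."""
    rng = np.random.default_rng(seed)
    # QR gives columns that are orthonormal; A stays frozen.
    A, _ = np.linalg.qr(rng.standard_normal((d_model, rank)))
    B = np.zeros((rank, d_model))  # learnable; zero-init => identity at start
    return A, B

def rla_forward(x, base_out, A, B):
    """Adapter output added onto the frozen base layer's output."""
    return base_out + x @ A @ B
```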
@EthanYangTW EthanYangTW marked this pull request as ready for review April 30, 2026 11:51
Copilot AI review requested due to automatic review settings April 30, 2026 11:51
Contributor

Copilot AI left a comment


Pull request overview

Adds a new non-record submission (“GolfParty”) under records/track_10min_16mb/ that scaffolds multiple “Requests-for-PRs” techniques behind env-var toggles on top of the PR #1953 lineage, along with full 3-seed logs and writeups. The PR also includes an additional 2026-04-26_V2_PE_MinLR_AttnGate/ record directory, which appears unrelated and incomplete.

Changes:

  • Add 2026-04-30_GolfParty_AllChecks/ with a modified train_gpt.py, 3-seed logs, submission.json, reproduction script, tokenizer model, and per-feature notes.
  • Document 9 technique toggles in README.md and notes/*.md.
  • Add 2026-04-26_V2_PE_MinLR_AttnGate/ with a README and a wrapper train_gpt.py (but missing other required submission artifacts).

Reviewed changes

Copilot reviewed 14 out of 19 changed files in this pull request and generated 9 comments.

| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/train_gpt.py | Implements technique toggles (UT depth recurrence, RLA, diffusion noise, JEPA aux loss wiring, etc.) on the PR #1953-style codebase. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/README.md | Describes the submission, toggles, results, and reproduction guidance. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/submission.json | Submission metadata and per-seed metrics. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/run_kitchen_3seed.sh | 3-seed launcher script intended to reproduce the run. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/train_seed42.log | Seed 42 training/quant/TTT log. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/train_seed1234.log | Seed 1234 training/quant/TTT log. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/train_seed0.log | Seed 0 training/quant/TTT log. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model | Tokenizer model file referenced by the submission. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/universal.md | Explains UT-depth toggle intent and limitations. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/megakernel.md | Documents “megakernel” claim as surfacing existing fused kernels. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/long_context.md | Documents long-context evaluation toggle. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/e2e_ttt.md | Documents E2E TTT toggle and limitations. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/rla.md | Documents Random Linear Adapter (RLA) toggle behavior. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/ssm.md | Documents SSM stub and why it’s not compiled/wired. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/jepa.md | Documents JEPA aux-loss wiring and blockers. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/diffusion.md | Documents diffusion-inspired embedding noise feature. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/hnet.md | Documents H-net pooling stub. |
| records/track_10min_16mb/2026-04-26_V2_PE_MinLR_AttnGate/train_gpt.py | Adds a decompression wrapper script for another record folder. |
| records/track_10min_16mb/2026-04-26_V2_PE_MinLR_AttnGate/README.md | Describes a separate “record” submission that appears incomplete in this PR. |


Comment on lines +12 to +14
{"seed": 42, "post_ttt_val_bpb": 1.07631, "pre_quant_val_bpb": 1.07594, "quantized_val_bpb": 1.08396, "eval_seconds": 359.6, "artifact_bytes": 16008464, "stop_step": 4538},
{"seed": 1234, "post_ttt_val_bpb": 1.07860, "pre_quant_val_bpb": 1.07726, "quantized_val_bpb": 1.08531, "eval_seconds": 353.2, "artifact_bytes": 16003972, "stop_step": 4534},
{"seed": 0, "post_ttt_val_bpb": 1.07838, "pre_quant_val_bpb": 1.07717, "quantized_val_bpb": 1.08508, "eval_seconds": 359.7, "artifact_bytes": 16000415, "stop_step": 4533}
Comment thread on records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/submission.json (outdated)
Comment on lines +7 to +11
> **Position: not a SOTA bid.** This submission addresses every currently-
> unchecked item on OpenAI's "Requests for PRs" list as a *single composable
> recipe*, with each technique behind an env-var toggle. Default config is
> byte-identical to the parent **PR #1953** stack; toggles compose
> additively.
Comment on lines +1 to +14
# Record: SP8192 + PE + MIN_LR + SmearGate + AttnOutGate + 4ep TTT — val_bpb 1.0770 (3-seed mean)

**val_bpb = 1.0770** (3-seed mean, std 0.0004) | **~15.98 MB** | 8xH100 SXM

## 3-Seed Results

| Seed | Steps | Sliding BPB | **TTT BPB** | Artifact (bytes) |
|------|-------|-------------|-------------|-------------------|
| 1337 | 4631 | 1.0785 | **1.0772** | 15,982,989 |
| 42 | 4637 | 1.0777 | **1.0765** | 15,984,317 |
| 2024 | 4633 | 1.0784 | **1.0772** | 15,985,404 |
| **Mean** | **4634** | **1.0782** | **1.0770** | **15,984,237** |
| **Std** | | 0.0004 | **0.0004** | |

@@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -euo pipefail
cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-30_ParamGolfKitchen_AllChecks
Comment on lines +51 to +60
**Note on artifact size:** all three seeds came in slightly above the
16,000,000-byte cap (max 16,008,464, min 16,000,415). The overage is
~0.05% of the cap and is driven by (a) the kitchen-sink scaffolding
adding ~6 KB compressed code over the parent PR #1953 baseline, and
(b) bf16 non-determinism shifting model compressibility by ±5 KB
run-to-run. A trivial fix (strip the ToySSMBlock / ToyJEPAHead class
defs before serialization, or bump weight decay slightly) brings the
artifact comfortably under cap. *Not* applied in the as-shipped run
because we wanted to preserve the full kitchen-sink scaffolding visible
to anyone reading the train_gpt.py for review.
Comment on lines +1405 to +1409
# KS_DIFFUSION_FRAC: training-time embedding-noise auxiliary. Replace
# `frac` of token embeddings with Gaussian noise. Toy 1-step denoising
# signal — only fires when self.training and ks_diffusion_frac > 0.
if self.training and getattr(self, "ks_diffusion_frac", 0.0) > 0.0:
    x, _diff_mask = ks_diffusion_perturb(x, self.ks_diffusion_frac)
# …(intervening diff lines elided; below is from the body of ks_diffusion_perturb)…
B, T, D = emb.shape
mask = (torch.rand(B, T, 1, device=emb.device, generator=generator) < frac).to(emb.dtype)
noise = torch.randn_like(emb) * emb.std()
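For reference, here is a self-contained numpy analogue of the perturbation shown in that hunk (the real code is torch; the function name and boolean-mask return here are illustrative): replace a `frac` fraction of token embeddings with Gaussian noise scaled to the embedding standard deviation, returning the perturbed embeddings and the mask of replaced positions.

```python
import numpy as np

def diffusion_perturb(emb, frac, rng):
    """Replace ~`frac` of (B, T) token embeddings with scaled Gaussian noise."""
    B, T, D = emb.shape
    mask = rng.random((B, T, 1)) < frac            # which positions to corrupt
    noise = rng.standard_normal(emb.shape) * emb.std()
    out = np.where(mask, noise, emb)               # corrupt masked positions only
    return out, mask
```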
Comment on lines +1881 to +1892
def ks_hnet_pool(h, chunk):
    """H-net hierarchical chunk pooling: mean-pool every `chunk` tokens
    so a coarse-grained downstream attention pass can run cheaply over
    summaries. Returns coarse summaries, where coarse[b, t // chunk] is
    the summary for the chunk containing token t.
    """
    B, T, D = h.shape
    pad = (chunk - T % chunk) % chunk  # zero-pad T up to a multiple of chunk
    if pad:
        h = F.pad(h, (0, 0, 0, pad))
    h2 = h.reshape(B, (T + pad) // chunk, chunk, D).mean(dim=2)
    return h2  # (B, T_coarse, D)
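A numpy analogue of the pooling above makes the shape behavior concrete (illustrative function; the torch version is the one in the diff). Note that the pad size depends on the runtime sequence length, which is the dynamic shape the stub's notes flag as compile-toxic.

```python
import numpy as np

def hnet_pool(h, chunk):
    """Zero-pad T to a multiple of `chunk`, then mean-pool each chunk."""
    B, T, D = h.shape
    pad = (chunk - T % chunk) % chunk  # data-dependent pad => dynamic shape
    if pad:
        h = np.concatenate([h, np.zeros((B, pad, D))], axis=1)
    return h.reshape(B, (T + pad) // chunk, chunk, D).mean(axis=2)
```

With a constant input, an exact-multiple sequence pools to the same constant, while a padded final chunk is diluted by the zero padding.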
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>