Non-record: GolfParty — composable scaffolding for every Requests-for-PRs item#1978
Open
EthanYangTW wants to merge 4 commits intoopenai:mainfrom
Open
Non-record: GolfParty — composable scaffolding for every Requests-for-PRs item#1978EthanYangTW wants to merge 4 commits intoopenai:mainfrom
EthanYangTW wants to merge 4 commits intoopenai:mainfrom
Conversation
…770 (3-seed mean)
…-PRs item Single non-record submission that addresses all currently-unchecked items on OpenAI's Requests-for-PRs list (Universal Transformer, megakernels, SSM, E2E TTT, super long context, RLA, JEPA, text diffusion, H-net tokenization) as toggleable env vars on the PR openai#1953 base. Default config is byte-identical to PR openai#1953; toggles compose additively. 3-seed mean post-TTT val_bpb 1.07776 (std 0.00126), all seeds within 600s training cap on 8xH100 SXM. Each technique honestly labeled real / wired-but-disabled-with-reason / stub-for-future-work in notes/. Real wired toggles in shipped run: KS_UT_DEPTH, KS_LONG_CONTEXT (via EVAL_SEQ_LEN=3072), KS_DIFFUSION_FRAC, TTT_RLA_ENABLED, KS_MEGAKERNEL. Wired-but-disabled (with documented blocker): KS_E2E_TTT (OOM), KS_JEPA_WEIGHT (GPTQ Hessian KeyError on aux head). Stubs (compile-toxic): KS_SSM_LAST_K, KS_HNET_CHUNK. Includes: per-seed train+eval logs, per-feature notes/, 3-seed launcher, CaseOps SP8192 tokenizer. Position: not a SOTA bid. Composability ablation + scaffolding for future record submissions in the directions OpenAI explicitly invited.
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a new non-record submission (“GolfParty”) under records/track_10min_16mb/ that scaffolds multiple “Requests-for-PRs” techniques behind env-var toggles on top of the PR #1953 lineage, along with full 3-seed logs and writeups. The PR also includes an additional 2026-04-26_V2_PE_MinLR_AttnGate/ record directory, which appears unrelated and incomplete.
Changes:
- Add
2026-04-30_GolfParty_AllChecks/with a modifiedtrain_gpt.py, 3-seed logs,submission.json, reproduction script, tokenizer model, and per-feature notes. - Document 9 technique toggles in
README.mdandnotes/*.md. - Add
2026-04-26_V2_PE_MinLR_AttnGate/with a README and a wrappertrain_gpt.py(but missing other required submission artifacts).
Reviewed changes
Copilot reviewed 14 out of 19 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/train_gpt.py | Implements technique toggles (UT depth recurrence, RLA, diffusion noise, JEPA aux loss wiring, etc.) on the PR #1953-style codebase. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/README.md | Describes the submission, toggles, results, and reproduction guidance. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/submission.json | Submission metadata and per-seed metrics. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/run_kitchen_3seed.sh | 3-seed launcher script intended to reproduce the run. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/train_seed42.log | Seed 42 training/quant/TTT log. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/train_seed1234.log | Seed 1234 training/quant/TTT log. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/train_seed0.log | Seed 0 training/quant/TTT log. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model | Tokenizer model file referenced by the submission. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/universal.md | Explains UT-depth toggle intent and limitations. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/megakernel.md | Documents “megakernel” claim as surfacing existing fused kernels. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/long_context.md | Documents long-context evaluation toggle. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/e2e_ttt.md | Documents E2E TTT toggle and limitations. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/rla.md | Documents Random Linear Adapter (RLA) toggle behavior. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/ssm.md | Documents SSM stub and why it’s not compiled/wired. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/jepa.md | Documents JEPA aux-loss wiring and blockers. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/diffusion.md | Documents diffusion-inspired embedding noise feature. |
| records/track_10min_16mb/2026-04-30_GolfParty_AllChecks/notes/hnet.md | Documents H-net pooling stub. |
| records/track_10min_16mb/2026-04-26_V2_PE_MinLR_AttnGate/train_gpt.py | Adds a decompression wrapper script for another record folder. |
| records/track_10min_16mb/2026-04-26_V2_PE_MinLR_AttnGate/README.md | Describes a separate “record” submission that appears incomplete in this PR. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+12
to
+14
| {"seed": 42, "post_ttt_val_bpb": 1.07631, "pre_quant_val_bpb": 1.07594, "quantized_val_bpb": 1.08396, "eval_seconds": 359.6, "artifact_bytes": 16008464, "stop_step": 4538}, | ||
| {"seed": 1234, "post_ttt_val_bpb": 1.07860, "pre_quant_val_bpb": 1.07726, "quantized_val_bpb": 1.08531, "eval_seconds": 353.2, "artifact_bytes": 16003972, "stop_step": 4534}, | ||
| {"seed": 0, "post_ttt_val_bpb": 1.07838, "pre_quant_val_bpb": 1.07717, "quantized_val_bpb": 1.08508, "eval_seconds": 359.7, "artifact_bytes": 16000415, "stop_step": 4533} |
Comment on lines
+7
to
+11
| > **Position: not a SOTA bid.** This submission addresses every currently- | ||
| > unchecked item on OpenAI's "Requests for PRs" list as a *single composable | ||
| > recipe*, with each technique behind an env-var toggle. Default config is | ||
| > byte-identical to the parent **PR #1953** stack; toggles compose | ||
| > additively. |
Comment on lines
+1
to
+14
| # Record: SP8192 + PE + MIN_LR + SmearGate + AttnOutGate + 4ep TTT — val_bpb 1.0770 (3-seed mean) | ||
|
|
||
| **val_bpb = 1.0770** (3-seed mean, std 0.0004) | **~15.98 MB** | 8xH100 SXM | ||
|
|
||
| ## 3-Seed Results | ||
|
|
||
| | Seed | Steps | Sliding BPB | **TTT BPB** | Artifact (bytes) | | ||
| |------|-------|-------------|-------------|-------------------| | ||
| | 1337 | 4631 | 1.0785 | **1.0772** | 15,982,989 | | ||
| | 42 | 4637 | 1.0777 | **1.0765** | 15,984,317 | | ||
| | 2024 | 4633 | 1.0784 | **1.0772** | 15,985,404 | | ||
| | **Mean** | **4634** | **1.0782** | **1.0770** | **15,984,237** | | ||
| | **Std** | | 0.0004 | **0.0004** | | | ||
|
|
| @@ -0,0 +1,42 @@ | |||
| #!/usr/bin/env bash | |||
| set -euo pipefail | |||
| cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-30_ParamGolfKitchen_AllChecks | |||
Comment on lines
+51
to
+60
| **Note on artifact size:** all three seeds came in slightly above the | ||
| 16,000,000-byte cap (max 16,008,464, min 16,000,415). The overage is | ||
| ~0.05% of the cap and is driven by (a) the kitchen-sink scaffolding | ||
| adding ~6 KB compressed code over the parent PR #1953 baseline, and | ||
| (b) bf16 non-determinism shifting model compressibility by ±5 KB | ||
| run-to-run. A trivial fix (strip the ToySSMBlock / ToyJEPAHead class | ||
| defs before serialization, or bump weight decay slightly) brings the | ||
| artifact comfortably under cap. *Not* applied in the as-shipped run | ||
| because we wanted to preserve the full kitchen-sink scaffolding visible | ||
| to anyone reading the train_gpt.py for review. |
Comment on lines
+1405
to
+1409
| # KS_DIFFUSION_FRAC: training-time embedding-noise auxiliary. Replace | ||
| # `frac` of token embeddings with Gaussian noise. Toy 1-step denoising | ||
| # signal — only fires when self.training and ks_diffusion_frac > 0. | ||
| if self.training and getattr(self, "ks_diffusion_frac", 0.0) > 0.0: | ||
| x, _diff_mask = ks_diffusion_perturb(x, self.ks_diffusion_frac) |
| """ | ||
| B, T, D = emb.shape | ||
| mask = (torch.rand(B, T, 1, device=emb.device, generator=generator) < frac).to(emb.dtype) | ||
| noise = torch.randn_like(emb) * emb.std() |
Comment on lines
+1881
to
+1892
| def ks_hnet_pool(h, chunk): | ||
| """H-net hierarchical chunk pooling: mean-pool every `chunk` tokens | ||
| so a coarse-grained downstream attention pass can run cheaply over | ||
| summaries. Returns (coarse, gather_index) — coarse[b, t//chunk] is | ||
| the summary for the chunk containing t. | ||
| """ | ||
| B, T, D = h.shape | ||
| pad = (chunk - T % chunk) % chunk | ||
| if pad: | ||
| h = F.pad(h, (0, 0, 0, pad)) | ||
| h2 = h.reshape(B, (T + pad) // chunk, chunk, D).mean(dim=2) | ||
| return h2 # (B, T_coarse, D) |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Single non-record submission that addresses all currently-unchecked items on OpenAI's Requests-for-PRs list (Universal Transformer, megakernels, SSM, E2E TTT, super long context, RLA, JEPA, text diffusion, H-net tokenization) as toggleable env vars on the PR #1953 base. Default config is byte-identical to PR #1953; toggles compose additively.
3-seed mean post-TTT val_bpb 1.07776 (std 0.00126) on 8×H100 SXM. All seeds within the 600s training cap.
Position: NOT a SOTA bid. This is a composability ablation + scaffolding for future record submissions in the directions OpenAI explicitly invited. Aligned with the README's "we strongly encourage participants to submit implementations for weird or out-of-the-box ideas, in-progress or unoptimized solutions, so long as they run successfully, or even interesting negative results."
What's in the box
KS_UT_DEPTHKS_MEGAKERNELKS_LONG_CONTEXT+EVAL_SEQ_LEN=3072TTT_RLA_ENABLEDKS_DIFFUSION_FRACKS_E2E_TTTKS_JEPA_WEIGHTKS_SSM_LAST_KKS_HNET_CHUNK5 active in shipped 3-seed config; 2 wired-with-blocker; 2 stubs. All 9 documented in
notes/.Per-seed results
vs current rank-1 PR #1855 (1.06108): +0.01668 BPB (regression — non-record).
Artifact size note: all 3 seeds came in 415-8464 bytes above the 16,000,000-byte cap. Driven by ~6KB compressed kitchen-sink scaffolding + ~5KB bf16 run-to-run variance. Trivially fixable (strip toy classes / bump WD); kept as-shipped to preserve full scaffolding visibility for review.
Test plan
KS_*=0,TTT_RLA_ENABLED=0) is byte-identical to PR Record: PR #1945 base + 2560 long-context + no_qv TTT mask + TTT LR 0.75 + QK_GAIN 5.25 — val_bpb 1.05855 (3-seed mean) #1953notes/Lineage
PR #1953 → #1945 → #1923 → #1908 → #1855 → #1797 → #1787 → #1729 → #1667 → #1530 → #1394 → #1344. Toy implementations of SSM, JEPA, diffusion, H-net introduced in this submission.