Record: PR #1886 base + per-block MLP output gate (Linear, weight-learnable) — val_bpb 1.06872 (3-seed mean)#1941
Closed
MarioPaerle wants to merge 1 commit into openai:main
Conversation
Adds a per-block MLP output gate (Linear(12,1) with bias=+5 init) plus a CONTROL_TENSOR_NAME_PATTERNS fix (adds 'mlp_gate' so the (1,12) weight gets routed to scalar AdamW instead of being silently frozen at its zero init).

Diff vs PR openai#1886 train_gpt.py: ~22 lines, isolated to Block.__init__, the forward pass (4 sites: Block.forward, _parallel_block, _block_with_lora, _parallel_block_with_lora), the _pos_bias_init branch of _init_weights, and the CONTROL_TENSOR_NAME_PATTERNS string.

3-seed mean val_bpb 1.06872454, beating the published PR openai#1886 result (1.06957227) by 0.00085 BPB, with every individual seed ahead (seeds 42 / 1337 / 314 → 1.06794764 / 1.06931760 / 1.06890838). All compliance budgets cleared: train_time 599.4-599.8 s (<600 s), eval_time 398-510 s (<600 s), artifact 15.98 MB (<16 MB decimal).
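The pattern fix matters because control tensors are routed to a separate scalar-AdamW group by substring match on parameter names; anything unmatched would be left out of the optimizer and stay frozen at its zero init. A minimal sketch of that routing, with function and class names assumed for illustration (the actual `train_gpt.py` wiring may differ):

```python
import torch

# Hypothetical reconstruction of the pattern-based optimizer routing.
# The PR's fix is adding 'mlp_gate' here so the new gate weight trains.
CONTROL_TENSOR_NAME_PATTERNS = ("pos_bias", "mlp_gate")

def split_param_groups(model: torch.nn.Module):
    """Route parameters whose names match a control pattern to the
    scalar-AdamW group; everything else goes to the regular group."""
    control, regular = [], []
    for name, param in model.named_parameters():
        if any(pat in name for pat in CONTROL_TENSOR_NAME_PATTERNS):
            control.append(param)
        else:
            regular.append(param)
    return control, regular

class TinyModel(torch.nn.Module):
    """Toy module with one control tensor and one regular projection."""
    def __init__(self):
        super().__init__()
        self.mlp_gate = torch.nn.Linear(12, 1)   # matched -> scalar AdamW
        self.proj = torch.nn.Linear(12, 12)      # unmatched -> regular group
```

Without 'mlp_gate' in the patterns, the zero-initialized (1,12) gate weight would match nothing, receive no optimizer updates, and the gate would silently stay at its bias-only init.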
Record: PR #1886 base + per-block MLP output gate (Linear, weight-learnable) — val_bpb 1.06872 (3-seed mean)
val_bpb: 1.06872454 (3-seed mean, std ~0.00070), -0.01228 from SOTA
Results (8×H100 80GB SXM, full pipeline with phased TTT, 10-min train / 10-min eval)
All 3 seeds clear the 600s train, 600s eval, and 16 MB decimal artifact budgets.
Head-to-head vs PR #1886 (matched seeds)
Every individual seed beats its matched PR #1886 counterpart.
Novel contribution: per-block MLP output gate, input-dependent, weight-learnable
This idea follows directly from the attention gate our team added in PR #1667. Even though the gain here is minimal, it closes out the gate research story on Parameter Golf. The attention gate has a much larger effect; the MLP gate adds only 143 extra parameters and gates token-wise rather than head-wise, so even though its effect is small, it tells us a lot about what such gates could do without this competition's specific constraints.
Initialization: weight=0, bias=+5 → sigmoid(+5) ≈ 0.993, i.e. ≈ identity at the start (do-no-harm bias init).

Total new parameters: 11 layers × (12 + 1) = 143 (negligible vs the 35.99M model parameters).
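As a hedged sketch of how such a gate might look (class and dimension names are illustrative, not the real `train_gpt.py` code, which wires the gate through four forward variants):

```python
import torch
import torch.nn as nn

class GatedMLPBlock(nn.Module):
    """Illustrative transformer block whose MLP output is scaled per token
    by sigmoid(Linear(d, 1)). The gate input dim of 12 matches the PR's
    Linear(12,1); the surrounding MLP shapes are assumptions."""
    def __init__(self, d_model: int = 12):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Per-block MLP output gate: 12 weights + 1 bias = 13 params/block.
        self.mlp_gate = nn.Linear(d_model, 1)
        # Do-no-harm init: weight=0, bias=+5 -> sigmoid(5) ~ 0.993,
        # so the gated block starts out nearly identical to the ungated one.
        nn.init.zeros_(self.mlp_gate.weight)
        nn.init.constant_(self.mlp_gate.bias, 5.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Token-wise (not head-wise) gate in (0, 1), shape (..., 1).
        g = torch.sigmoid(self.mlp_gate(x))
        return x + g * self.mlp(x)
```

With zero gate weight, every token starts with the same gate value sigmoid(5) ≈ 0.993; the gate only becomes input-dependent once the optimizer updates the weight, which is exactly why the CONTROL_TENSOR_NAME_PATTERNS fix is required.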
Rule compliance
- stopping_early: wallclock_cap.
- inference_mode() before any LoRA update.
- Training reads only the fineweb_train_*.bin shards.
- Artifact: train_gpt.py + tokenizer.

Reproduction
The MLP_GATE_OUT addition is hard-coded in train_gpt.py (no env-var flag; the gate is always present, since it is this record's defining change). All other env vars match the PR #1886 stack defaults; the explicit list below is conservative.

Environment setup
```shell
pip install brotli sentencepiece python-minifier
# FlashAttention-3
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
```

Dataset
Training (3 seeds)
Hardware
Trained on RunPod 8×H100 80GB SXM. PyTorch 2.9.1+cu128, FA3, Triton 3.5.1. Identical SP8192 SentencePiece tokenizer and FineWeb document selection as upstream kevclark/parameter-golf (the canonical PG parameter-golf validation split). No tokenizer mods.

Lineage
Credits
This work was also made possible by the support of Paradigma and the use of Flywheel, their research infrastructure.