`records/track_10min_16mb/2026-04-29_PR1886Base_MLPGateOut/README.md`
# Record: PR #1886 base + per-block MLP output gate (Linear, weight-learnable) — val_bpb 1.06872 (3-seed mean)

**val_bpb: 1.06872454** (3-seed mean, std ~0.00070), -0.01228 from SOTA

## Results (8×H100 80GB SXM, full pipeline with phased TTT, 10-min train / 10-min eval)

| Seed | Steps | Pre-EMA last val | Post-EMA pre-quant | Quantized (no TTT) | **Post-TTT** | Train | Eval | Artifact |
| -------- | -------: | ---------------: | -----------------: | -----------------: | -------------: | ---------: | ---------: | --------: |
| 42 | 4825 | 1.0795 | 1.06916 | 1.07938 | **1.06794764** | 599.4s | 398.4s | ~15.9 MB |
| 1337 | 4827 | 1.0806 | 1.07070 | 1.08066 | **1.06931760** | 599.8s | 510.0s | ~15.9 MB |
| 314 | 4825 | 1.0803 | 1.07014 | 1.08014 | **1.06890838** | 599.6s | 446.7s | ~15.9 MB |
| **Mean** | **4826** | **1.0801** | **1.07000** | **1.08006** | **1.06872454** | **599.6s** | **451.7s** | |


All 3 seeds clear the 600s train, 600s eval, and 16 MB decimal artifact budgets.

### Head-to-head vs PR #1886 (matched seeds)

| Seed | This PR | PR #1886 | Δ (mBPB) |
| -------- | ----------: | ----------: | --------: |
| 42 | 1.06795 | 1.06920 | **−1.25** |
| 1337 | 1.06932 | 1.07010 | **−0.78** |
| 314 | 1.06891 | 1.06942 | **−0.51** |
| **Mean** | **1.06872** | **1.06957** | **−0.85** |

Every individual seed beats its matched PR #1886 counterpart.

## Novel contribution: per-block MLP output gate, input-dependent, weight-learnable

The idea comes directly from the attention output gate our team added in PR #1667; even though the gain here is small, it closes out the gate line of research on Parameter Golf. The attention gate has a much larger effect, while the MLP gate adds only 143 extra parameters and gates token-wise rather than head-wise, so even its small effect says a lot about what could be done without this competition's specific constraints.

We also present this PR as an update to PR #1667, which first introduced the attention gate (since adopted by many other PRs) but had a problem with the Smear Gate (cross-document leakage), identified in the discussion between @cocohearts and @msisovic in PR #1797.


```python
# In Block.__init__:
self.mlp_gate_out = CastedLinear(12, 1, bias=True) # 13 params per block
self.mlp_gate_out._pos_bias_init = 5.0 # init: w=0, b=+5

# In Block.forward, MLP branch (gate applied to mlp_out before the residual add):
mlp_out = self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor, up_w, down_w)
gate = torch.sigmoid(self.mlp_gate_out(x_out[..., :12].contiguous())) # (B, T, 1)
mlp_out = mlp_out * gate
x_out = x_out + self.mlp_scale[None, None, :] * mlp_out
```

**Initialization**: weight=0, bias=+5 → `sigmoid(+5) ≈ 0.993`, ≈ identity at start (do-no-harm bias init).

**Total new parameters**: 11 layers × (12 + 1) = **143 parameters** (negligible vs 35.99M model parameters).

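As an illustration of the do-no-harm initialization, here is a minimal stand-alone sketch. It uses a plain `nn.Linear` in place of the repo's `CastedLinear`, and the module and argument names (`MLPOutputGate`, `gate_in`) are assumptions for this example only, not the actual `train_gpt.py` code:

```python
import torch
import torch.nn as nn

class MLPOutputGate(nn.Module):
    """Minimal stand-in for the per-block MLP output gate.

    A narrow Linear(12 -> 1) reads the first 12 channels of the residual
    stream and produces one sigmoid gate per token that scales the MLP
    output before the residual add.
    """
    def __init__(self, gate_in: int = 12):
        super().__init__()
        self.proj = nn.Linear(gate_in, 1, bias=True)  # 12 weights + 1 bias = 13 params
        # Do-no-harm init: weight = 0, bias = +5, so sigmoid(5) ~ 0.993 and the
        # gate starts as (almost) the identity; it has to learn to close.
        nn.init.zeros_(self.proj.weight)
        nn.init.constant_(self.proj.bias, 5.0)

    def forward(self, x: torch.Tensor, mlp_out: torch.Tensor) -> torch.Tensor:
        # One gate value per token, computed from the first 12 residual channels.
        gate = torch.sigmoid(self.proj(x[..., : self.proj.in_features]))  # (B, T, 1)
        return mlp_out * gate

gate = MLPOutputGate()
print(float(torch.sigmoid(gate.proj.bias)))  # ~0.9933: near-identity gate at init
```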
## Rule compliance

- **Artifact ≤ 16,000,000 bytes DECIMAL** (README FAQ + Issue #1017 §II.1): ✅ all seeds ≤ 16 MB.
- **train_time ≤ 600s**: ✅ all seeds 599.4–599.8s (`stopping_early: wallclock_cap`).
- **total_eval_time ≤ 600s**: ✅ all seeds 398.4–510.0s.
- **Issue #1017 Condition 3 (score-before-update)**: phased TTT unchanged; every chunk is scored under `inference_mode()` before any LoRA update (see the sketch after this list).
- **No val data during training**: training uses only `fineweb_train_*.bin` shards.
- **No external network during eval**: self-contained `train_gpt.py` + tokenizer.
- **Reproducibility**: all hyperparameters set via env vars in the run command below.

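To make the score-before-update condition concrete, the loop below sketches the ordering: each chunk is scored under `torch.inference_mode()` first, and only afterwards used for a test-time LoRA step. All names here (`chunks`, `lora_params`, `compute_bpb`, `compute_loss`) are illustrative placeholders, not the actual `train_gpt.py` API.

```python
import torch

def phased_ttt_eval(model, chunks, lora_params, lr=1e-3, weight_decay=2.0):
    """Illustrative score-before-update loop (Issue #1017 Condition 3).

    Every chunk contributes to the reported val_bpb *before* the model sees it
    for any test-time-training update; each LoRA step can only help later chunks.
    """
    opt = torch.optim.AdamW(lora_params, lr=lr, weight_decay=weight_decay)
    total_bpb, n_chunks = 0.0, 0

    for chunk in chunks:
        # 1) Score first: no gradients, no parameter changes.
        with torch.inference_mode():
            total_bpb += float(compute_bpb(model, chunk))  # hypothetical helper
            n_chunks += 1

        # 2) Only afterwards: one LoRA update on the chunk that was just scored.
        loss = compute_loss(model, chunk)                  # hypothetical helper
        loss.backward()
        opt.step()
        opt.zero_grad(set_to_none=True)

    return total_bpb / n_chunks
```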
## Reproduction

The MLP_GATE_OUT addition is hard-coded in `train_gpt.py` (no env-var flag — the gate is always present, since this is the record's defining change). All other env vars match the PR #1886 stack defaults; the explicit list below is conservative.

### Environment setup

```bash
pip install brotli sentencepiece python-minifier

# FlashAttention-3
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
```

### Dataset

```bash
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192
```

### Training (3 seeds)

```bash
export DATA_DIR=/path/to/parameter-golf/data

for SEED in 42 1337 314; do
NCCL_NET=Socket \
GATED_ATTN_ENABLED=1 \
PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \
MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=12.0 \
EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
MATRIX_LR=0.026 MIN_LR=0.10 \
FUSED_CE_ENABLED=1 \
TTT_WARM_START_A=1 TTT_WEIGHT_DECAY=2.0 \
TTT_LORA_ALPHA=144 TTT_LORA_RANK=128 \
GPTQ_RESERVE_SECONDS=0.5 GPTQ_CALIBRATION_BATCHES=16 \
SEED=$SEED \
torchrun --standalone --nproc_per_node=8 train_gpt.py \
> train_seed${SEED}.log 2>&1
done
```

## Hardware

Trained on **RunPod 8×H100 80GB SXM**. PyTorch 2.9.1+cu128, FA3, Triton 3.5.1. Same SP8192 SentencePiece tokenizer and FineWeb document selection as upstream `kevclark/parameter-golf` (the canonical PG `parameter-golf` validation split). No tokenizer modifications.

## Lineage

- @nprime06 — PR #1787 (FusedCE / PolarNS / MIN_LR / SparseAttnGate base)
- @renqianluo — PR #1767 (warm-start LoRA), PR #1768 (GatedAttn), PR #1886 (WD=2.0 stability)
- @dexhunter — PR #1626 / PR #1736 (Multi-phase SGD, GPTQ trim, GatedAttn baseline)
- @samacqua — PR #1530 (VarLen + Fused MLP + doc-independent TTT)
- @bigbag — PR #1493 (3-layer recurrence + parallel residuals base)
- @MarioPaerle — PR #1667 (per-head attention output gate pattern, prior art for the "narrow gate Linear(12→1) + bias=+5" idiom used here on the MLP output)
- This submission — adds the per-block MLP output gate to the modern stack, with the routing fix that makes the gate weight-learnable (sketched below)

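The routing fix mentioned above works, per the record metadata, by adding `mlp_gate` to `CONTROL_TENSOR_NAME_PATTERNS` so the (1, 12) gate weight lands in the scalar AdamW group rather than staying frozen at its zero init. The sketch below shows the general name-pattern routing idea; the pattern list and group structure here are assumptions for illustration, not the actual `train_gpt.py` code.

```python
# Rough sketch of name-pattern optimizer routing (illustrative, not train_gpt.py):
# parameters whose names match a "control tensor" pattern go to the scalar AdamW
# group. Without the 'mlp_gate' pattern, the zero-initialized (1, 12) gate weight
# would fall into a group that never updates it, leaving the gate frozen.
CONTROL_TENSOR_NAME_PATTERNS = ("ln_scale", "mlp_scale", "mlp_gate")  # assumed patterns

def split_param_groups(model):
    control, matrix = [], []
    for name, param in model.named_parameters():
        if any(pat in name for pat in CONTROL_TENSOR_NAME_PATTERNS):
            control.append(param)   # small scalars/gates -> scalar AdamW group
        else:
            matrix.append(param)    # large matrices -> the matrix optimizer
    return control, matrix
```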
## Credits
This work was also made possible by the support of [Paradigma](https://paradigma.inc/) and the use of [Flywheel](https://flywheel.paradigma.inc/), their research infrastructure.

- @MarioPaerle, @GabrieleCirillo, @CerovazS
- @renqianluo — PR #1886 base stack
- All upstream contributors as listed in lineage
{
"author": "MarioPaerle",
"github_id": "MarioPaerle",
"name": "PR #1886 base + per-block MLP output gate (Linear, weight-learnable)",
"date": "2026-04-29",
"track": "10min_16mb",
"val_bpb": 1.06872454,
"val_bpb_std": 0.00070,
"seeds": [42, 1337, 314],
"seed_results": {
"42": {"val_bpb": 1.06794764, "artifact_bytes": 15982732},
"1337": {"val_bpb": 1.06931760, "artifact_bytes": 15978886},
"314": {"val_bpb": 1.06890838, "artifact_bytes": 15980124}
},
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"technique_summary": "PR #1886 stack + per-block MLP output gate: Linear(12,1, bias=True) with weight=0/bias=+5 init applied multiplicatively to mlp(...) before residual add (143 new params total, 11 layers x 13). Routing fix: 'mlp_gate' added to CONTROL_TENSOR_NAME_PATTERNS so the (1,12) weight reaches the scalar AdamW group instead of being frozen at zero-init.",
"compliance": {
"train_under_600s": true,
"artifact_under_16mb": true,
"eval_under_600s": true,
"no_slot": true,
"no_pre_quant_ttt": true,
"no_etlb": true,
"no_ngram_cache": true,
"score_first_ttt": true,
"three_seeds": true
},
"attribution": {
"mlp_gate_out_linear_with_routing_fix": "@MarioPaerle (this submission)",
"narrow_gate_linear12_bias5_idiom": "@MarioPaerle (PR #1667)",
"gated_attention": "@MarioPaerle (PR #1667) / @ClassicLarry (Modded Nano GPT)",
"fused_softcap_ce_kernel": "@nprime06 (PR #1787)",
"ttt_weight_decay_2": "@renqianluo (PR #1886)",
"warm_start_lora_alpha_144_rank_128": "@renqianluo (PR #1767)",
"gated_attn_per_head_int8": "@renqianluo (PR #1768), @dexhunter (PR #1736)",
"multi_phase_sgd_trimmed_gptq_matrix_lr_026": "@dexhunter (PR #1626)",
"varlen_attention_fused_mlp_doc_ttt": "@samacqua (PR #1530)",
"phased_ttt": "@romeerp (PR #1610), @dexhunter",
"triple_recurrence_parallel_residuals": "@bigbag (PR #1493), @EthanYangTW (PR #1523)",
"legal_ttt_framework": "@abaybektursun (PR #549)"
}
}