Non-record: Initial small random mini-MoE #1692
Open
joshEng1 wants to merge 5 commits into openai:main
Summary
This PR is an initial follow-up to the small random mini-MoE idea described in PR #1228.
PR #1228 showed that selected MLP up projections can be replaced with seeded frozen QR random feature maps while keeping the rest of the MLP trainable. More importantly for this PR, it also noted that a small “mini-MoE” over multiple random up projections appeared promising, but was removed from the final H100 submission because the throughput cost did not survive an iso-wallclock comparison.
This PR takes that observation as the starting point.
The goal here is not to present a finished result or a fully validated submission. The goal is to build a clean initial implementation of that proposed follow-up idea inside the root trainer and make it easy to test on H100.
What is implemented
This submission starts from the current root `train_gpt.py` and adds narrow random MLP up ablations in a separate non-record folder. Selected MLP up projections can now use either the plain seeded frozen QR random map from PR #1228 or the new routed random mini-MoE path.
The main new addition is the small random mini-MoE style path.
For selected MLP layers, the expert bases are concatenated and applied in a single `F.linear` call. This keeps the expensive random up projection in one pass while still adding token-dependent routing over multiple random expert subspaces.
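The seeded frozen QR random feature map at the heart of this path can be sketched roughly as follows (a minimal illustration, not the PR's code; `make_qr_basis` and the example shapes are assumptions):

```python
import torch

def make_qr_basis(h: int, d: int, seed: int) -> torch.Tensor:
    """Hypothetical sketch: a seeded frozen random basis with orthonormal rows,
    built by QR-decomposing a seeded Gaussian matrix. Because the basis is a
    pure function of (h, d, seed), it never needs to be stored in a checkpoint."""
    g = torch.Generator().manual_seed(seed)
    # QR of a (d, h) Gaussian gives Q with orthonormal columns when d >= h;
    # transpose so the result is (h, d) with orthonormal rows, matching R_e.
    a = torch.randn(d, h, generator=g)
    q, _ = torch.linalg.qr(a)       # q: (d, h), orthonormal columns
    return q.T.contiguous()         # (h, d), rows form an orthonormal set

basis = make_qr_basis(h=64, d=256, seed=0)
# Same seed regenerates the identical basis; basis @ basis.T is close to I_h.
```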
Computation
The small random mini-MoE path is implemented so that the expensive projection still happens in one `F.linear` call.
For one selected MLP layer, let the input be `x ∈ R^d`. Split the hidden width into `E` expert subspaces with widths `h_1, ..., h_E`, and let each expert have its own seeded frozen QR random basis `R_e ∈ R^{h_e × d}`. Concatenate those expert bases row-wise into one frozen matrix `R = [R_1; R_2; ...; R_E] ∈ R^{h × d}`, where `h = Σ_e h_e`.
Then the full random feature expansion is computed in one pass:
`u = F.linear(x, R) = [u_1 | u_2 | ... | u_E]`
where each expert chunk is `u_e = F.linear(x, R_e)`.
A learned per-feature gain is applied across the concatenated hidden state:
`û = u ⊙ a`
where `a ∈ R^h` is the learned gain vector.
The token router then produces expert weights
`g(x) = softmax(W_router x) ∈ R^E`
and each expert chunk is scaled by its token-dependent router weight:
`base(x) = [g_1(x) · û_1 | g_2(x) · û_2 | ... | g_E(x) · û_E]`
So the key approximation is that expert specialization is introduced after a single shared random projection over concatenated expert subspaces, rather than by running several separate dense up-projections.
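The computation above can be sketched end to end as follows (an illustrative sketch, not the PR's actual module: all names are assumptions, plain Gaussian matrices stand in for the seeded QR bases, and the low-rank `U(Vx)` correction used by some configs is included at the end):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, E = 32, 2                      # model width, number of expert subspaces
widths = [24, 24]                 # h_1, ..., h_E
h = sum(widths)                   # h = sum_e h_e

# Frozen concatenated expert bases R = [R_1; ...; R_E] in R^{h x d}
# (a plain Gaussian stand-in for the seeded QR bases)
R = torch.randn(h, d)
a = torch.ones(h, requires_grad=True)             # learned per-feature gain a in R^h
W_router = torch.randn(E, d, requires_grad=True)  # token router

x = torch.randn(5, d)             # a batch of 5 tokens

u = F.linear(x, R)                # one pass: u = [u_1 | ... | u_E], shape (5, h)
u_hat = u * a                     # û = u ⊙ a
g = F.softmax(F.linear(x, W_router), dim=-1)      # g(x) in R^E, shape (5, E)

# base(x) = [g_1(x)·û_1 | ... | g_E(x)·û_E]
chunks = torch.split(u_hat, widths, dim=-1)
base = torch.cat([g[:, e:e + 1] * chunks[e] for e in range(E)], dim=-1)

# Optional low-rank learned correction: output(x) = base(x) + U(Vx)
r = 8
V = torch.randn(r, d, requires_grad=True)         # V in R^{r x d}
U = torch.randn(h, r, requires_grad=True)         # U in R^{h x r}
output = base + F.linear(F.linear(x, V), U)
```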
For the configs with a learned correction path, the final hidden state is
`output(x) = base(x) + U(Vx)`
where `V ∈ R^{r × d}` and `U ∈ R^{h × r}` define the low-rank learned correction.
This gives two useful properties:
- The whole random expansion still runs in a single `F.linear` call, which is the main reason this version is cheaper than a naive multi-projection mini-MoE.
- The frozen expert bases stay regenerable from their seeds, so the artifact-saving property of the random-weight trick is preserved.
Continuity
This PR is intended as a direct continuation of the specific idea proposed in PR #1228, not as a general MoE exploration.
The implementation is designed to preserve the artifact-saving property of seeded frozen random weights while making the mini-MoE path cheaper than a naive multi-projection version.
This also fits naturally within the broader random-linear-map work in PR #1301, which showed that selective randomization in the MLP is a serious direction and that small learned corrective paths on top of frozen random structure are worth testing.
Why this is interesting
The random MLP construction from PR #1228 has one obvious limitation: a single frozen random basis may be too rigid.
A small random MoE is interesting because it offers a very targeted way to relax that rigidity.
Instead of forcing every token through one frozen basis, the model gets several random feature subspaces and learns token-dependent mixing over them. If this works, it would mean the random MLP idea can be made more expressive without giving up the seed-regenerated weight trick that makes it attractive under the 16 MB constraint.
This is exactly the unresolved question left open by PR #1228, and this submission is designed to answer that question more directly.
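The seed-regenerated weight trick that makes this attractive under the 16 MB constraint can be sketched by registering the frozen basis as a non-persistent buffer, so only learned parameters reach the `state_dict`. A hedged sketch, assuming a hypothetical `FrozenRandomUp` module (with a plain Gaussian stand-in for the QR basis):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenRandomUp(nn.Module):
    """Hypothetical sketch: a frozen random up projection regenerated from a
    seed at construction time instead of being stored, keeping checkpoints small."""
    def __init__(self, d: int, h: int, seed: int):
        super().__init__()
        self.d, self.h, self.seed = d, h, seed
        # persistent=False excludes the frozen basis from state_dict
        self.register_buffer("R", self._regenerate(), persistent=False)
        self.gain = nn.Parameter(torch.ones(h))  # learned gain IS checkpointed

    def _regenerate(self) -> torch.Tensor:
        g = torch.Generator().manual_seed(self.seed)
        # a seeded QR basis would be built here; Gaussian stand-in for brevity
        return torch.randn(self.h, self.d, generator=g)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.R) * self.gain

m = FrozenRandomUp(d=32, h=64, seed=123)
print(list(m.state_dict().keys()))   # -> ['gain']; R is rebuilt from its seed
```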
The design is also intentionally cost-aware. Prior MoE work in the repo, especially PR #660, makes it clear that routing quality alone is not enough if throughput collapses. That is why this implementation keeps the heavy random up projection in one pass and applies expert routing over subspaces afterward rather than duplicating a full expert MLP path.
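The single-pass claim is easy to sanity-check: concatenating the expert bases row-wise and running one `F.linear` produces exactly the same values as running each expert projection separately (illustrative shapes; not the PR's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, widths = 32, [16, 24, 8]              # three unequal expert subspaces
bases = [torch.randn(w, d) for w in widths]
R = torch.cat(bases, dim=0)              # fused frozen matrix, shape (48, 32)
x = torch.randn(7, d)

fused = F.linear(x, R)                   # one kernel launch for all experts
separate = torch.cat([F.linear(x, B) for B in bases], dim=-1)  # E launches

assert torch.allclose(fused, separate, atol=1e-5)
```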
Included configs
This PR currently includes the following configs:
- `baseline_12l`
- `random_up_12l_5layers_rank16`
- `random_up_moe_12l_5layers_e2`
- `random_up_moe_12l_5layers_e2_rank8`

The two MoE configs are the new part of the experiment.
In `random_up_moe_12l_5layers_e2`, layers 0, 1, 2, 3, 4 use frozen random MLP up projections with 2 routed random expert subspaces. This isolates the small random mini-MoE path itself.
`random_up_moe_12l_5layers_e2_rank8` adds a rank-8 learned correction on top of the same routed setup. This is meant to test whether the routed random mini-MoE benefits from a small learned corrective path.
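For illustration, the difference between the two MoE configs could be captured by fragments along these lines (field names are hypothetical; only the config names come from this PR):

```python
# Hypothetical config fragments; only the names match the PR's config list.
random_up_moe_12l_5layers_e2 = dict(
    n_layer=12,
    random_up_layers=[0, 1, 2, 3, 4],  # layers with frozen random up projections
    n_experts=2,                       # routed random expert subspaces
    correction_rank=0,                 # no learned low-rank correction
)

random_up_moe_12l_5layers_e2_rank8 = dict(
    n_layer=12,
    random_up_layers=[0, 1, 2, 3, 4],
    n_experts=2,
    correction_rank=8,                 # adds the U(Vx) correction path
)
```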
Why this may be useful
Even before H100 testing, this PR provides a clean scaffold for evaluating one of the most natural follow-ups to PR #1228.
It is built from the root trainer, so it preserves the standard repo flow, which should make it easier to test and extend than a larger custom research branch.
Current status
This is an initial implementation PR before full H100 evaluation.
So far, local checks have been run, including `state_dict` behavior of the new path.
The next step is H100 testing of the included configs under matched wallclock settings.
Reproduction