Non-record: Initial small random mini-MoE #1692
Open
joshEng1 wants to merge 5 commits into openai:main
Summary
This PR is an initial follow-up to the small random mini-MoE idea described in PR #1228.
PR #1228 showed that selected MLP up projections can be replaced with seeded frozen QR random feature maps while keeping the rest of the MLP trainable. More importantly for this PR, it also noted that a small “mini-MoE” over multiple random up projections appeared promising, but was removed from the final H100 submission because the throughput cost did not survive an iso-wallclock comparison.
This PR takes that observation as the starting point.
The goal here is not to present a finished result or a fully validated submission. The goal is to build a clean initial implementation of that proposed follow-up idea inside the root trainer and make it easy to test on H100.
What is implemented
This submission starts from the current root `train_gpt.py` and adds narrow random MLP up ablations in a separate non-record folder. Selected MLP up projections can now use either the plain seeded frozen QR random map from PR #1228 or the new routed random mini-MoE path.
The main new addition is the small random mini-MoE style path.
For selected MLP layers, the expert bases are concatenated and applied in a single `F.linear` call. This keeps the expensive random up projection in one pass while still adding token-dependent routing over multiple random expert subspaces.
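The seeded frozen QR random feature map at the heart of this path can be sketched roughly as follows (a minimal illustration, not the PR's code; `make_qr_basis` and the example shapes are assumptions):

```python
import torch

def make_qr_basis(h: int, d: int, seed: int) -> torch.Tensor:
    """Hypothetical sketch: a seeded frozen random basis with orthonormal rows,
    built by QR-decomposing a seeded Gaussian matrix. Because the basis is a
    pure function of (h, d, seed), it never needs to be stored in a checkpoint."""
    g = torch.Generator().manual_seed(seed)
    # QR of a (d, h) Gaussian gives Q with orthonormal columns when d >= h;
    # transpose so the result is (h, d) with orthonormal rows, matching R_e.
    a = torch.randn(d, h, generator=g)
    q, _ = torch.linalg.qr(a)       # q: (d, h), orthonormal columns
    return q.T.contiguous()         # (h, d), rows form an orthonormal set

basis = make_qr_basis(h=64, d=256, seed=0)
# Same seed regenerates the identical basis; basis @ basis.T is close to I_h.
```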
Computation
The small random mini-MoE path is implemented so that the expensive projection still happens in one `F.linear` call.
For one selected MLP layer, let the input be `x ∈ R^d`. Split the hidden width into `E` expert subspaces with widths `h_1, ..., h_E`, and let each expert have its own seeded frozen QR random basis `R_e ∈ R^{h_e × d}`. Concatenate those expert bases row-wise into one frozen matrix `R = [R_1; R_2; ...; R_E] ∈ R^{h × d}`, where `h = Σ_e h_e`.
Then the full random feature expansion is computed in one pass:
`u = F.linear(x, R) = [u_1 | u_2 | ... | u_E]`
where each expert chunk is `u_e = F.linear(x, R_e)`.
A learned per-feature gain is applied across the concatenated hidden state:
`û = u ⊙ a`
where `a ∈ R^h` is the learned gain vector.
The token router then produces expert weights
`g(x) = softmax(W_router x) ∈ R^E`
and each expert chunk is scaled by its token-dependent router weight:
`base(x) = [g_1(x) · û_1 | g_2(x) · û_2 | ... | g_E(x) · û_E]`
So the key approximation is that expert specialization is introduced after a single shared random projection over concatenated expert subspaces, rather than by running several separate dense up-projections.
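The computation above can be sketched end to end as follows (an illustrative sketch, not the PR's actual module: all names are assumptions, plain Gaussian matrices stand in for the seeded QR bases, and the low-rank `U(Vx)` correction used by some configs is included at the end):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, E = 32, 2                      # model width, number of expert subspaces
widths = [24, 24]                 # h_1, ..., h_E
h = sum(widths)                   # h = sum_e h_e

# Frozen concatenated expert bases R = [R_1; ...; R_E] in R^{h x d}
# (a plain Gaussian stand-in for the seeded QR bases)
R = torch.randn(h, d)
a = torch.ones(h, requires_grad=True)             # learned per-feature gain a in R^h
W_router = torch.randn(E, d, requires_grad=True)  # token router

x = torch.randn(5, d)             # a batch of 5 tokens

u = F.linear(x, R)                # one pass: u = [u_1 | ... | u_E], shape (5, h)
u_hat = u * a                     # û = u ⊙ a
g = F.softmax(F.linear(x, W_router), dim=-1)      # g(x) in R^E, shape (5, E)

# base(x) = [g_1(x)·û_1 | ... | g_E(x)·û_E]
chunks = torch.split(u_hat, widths, dim=-1)
base = torch.cat([g[:, e:e + 1] * chunks[e] for e in range(E)], dim=-1)

# Optional low-rank learned correction: output(x) = base(x) + U(Vx)
r = 8
V = torch.randn(r, d, requires_grad=True)         # V in R^{r x d}
U = torch.randn(h, r, requires_grad=True)         # U in R^{h x r}
output = base + F.linear(F.linear(x, V), U)
```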
For the configs with a learned correction path, the final hidden state is
`output(x) = base(x) + U(Vx)`
where `V ∈ R^{r × d}` and `U ∈ R^{h × r}` define the low-rank learned correction.
This gives two useful properties:
- The whole random expansion still runs in a single `F.linear` call, which is the main reason this version is cheaper than a naive multi-projection mini-MoE.
- The frozen expert bases stay regenerable from their seeds, so the artifact-saving property of the random-weight trick is preserved.
Continuity
This PR is intended as a direct continuation of the specific idea proposed in PR #1228, not as a general MoE exploration.
The implementation is designed to preserve the artifact-saving property of seeded frozen random weights while making the mini-MoE path cheaper than a naive multi-projection version.
This also fits naturally within the broader random-linear-map work in PR #1301, which showed that selective randomization in the MLP is a serious direction and that small learned corrective paths on top of frozen random structure are worth testing.
Why this is interesting
The random MLP construction from PR #1228 has one obvious limitation: a single frozen random basis may be too rigid.
A small random MoE is interesting because it offers a very targeted way to relax that rigidity.
Instead of forcing every token through one frozen basis, the model gets several random feature subspaces and learns token-dependent mixing over them. If this works, it would mean the random MLP idea can be made more expressive without giving up the seed-regenerated weight trick that makes it attractive under the 16 MB constraint.
This is exactly the unresolved question left open by PR #1228, and this submission is designed to answer that question more directly.
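The seed-regenerated weight trick that makes this attractive under the 16 MB constraint can be sketched by registering the frozen basis as a non-persistent buffer, so only learned parameters reach the `state_dict`. A hedged sketch, assuming a hypothetical `FrozenRandomUp` module (with a plain Gaussian stand-in for the QR basis):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenRandomUp(nn.Module):
    """Hypothetical sketch: a frozen random up projection regenerated from a
    seed at construction time instead of being stored, keeping checkpoints small."""
    def __init__(self, d: int, h: int, seed: int):
        super().__init__()
        self.d, self.h, self.seed = d, h, seed
        # persistent=False excludes the frozen basis from state_dict
        self.register_buffer("R", self._regenerate(), persistent=False)
        self.gain = nn.Parameter(torch.ones(h))  # learned gain IS checkpointed

    def _regenerate(self) -> torch.Tensor:
        g = torch.Generator().manual_seed(self.seed)
        # a seeded QR basis would be built here; Gaussian stand-in for brevity
        return torch.randn(self.h, self.d, generator=g)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.R) * self.gain

m = FrozenRandomUp(d=32, h=64, seed=123)
print(list(m.state_dict().keys()))   # -> ['gain']; R is rebuilt from its seed
```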
The design is also intentionally cost-aware. Prior MoE work in the repo, especially PR #660, makes it clear that routing quality alone is not enough if throughput collapses. That is why this implementation keeps the heavy random up projection in one pass and applies expert routing over subspaces afterward rather than duplicating a full expert MLP path.
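The single-pass claim is easy to sanity-check: concatenating the expert bases row-wise and running one `F.linear` produces exactly the same values as running each expert projection separately (illustrative shapes; not the PR's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, widths = 32, [16, 24, 8]              # three unequal expert subspaces
bases = [torch.randn(w, d) for w in widths]
R = torch.cat(bases, dim=0)              # fused frozen matrix, shape (48, 32)
x = torch.randn(7, d)

fused = F.linear(x, R)                   # one kernel launch for all experts
separate = torch.cat([F.linear(x, B) for B in bases], dim=-1)  # E launches

assert torch.allclose(fused, separate, atol=1e-5)
```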
Included configs
This PR currently includes the following configs:
- `baseline_12l`
- `random_up_12l_5layers_rank16`
- `random_up_moe_12l_5layers_e2`
- `random_up_moe_12l_5layers_e2_rank8`

The two MoE configs are the new part of the experiment.
In `random_up_moe_12l_5layers_e2`, layers 0, 1, 2, 3, 4 use frozen random MLP up projections with 2 routed random expert subspaces. This isolates the small random mini-MoE path itself.
`random_up_moe_12l_5layers_e2_rank8` adds a rank-8 learned correction on top of the same routed setup. This is meant to test whether the routed random mini-MoE benefits from a small learned corrective path.
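For illustration, the difference between the two MoE configs could be captured by fragments along these lines (field names are hypothetical; only the config names come from this PR):

```python
# Hypothetical config fragments; only the names match the PR's config list.
random_up_moe_12l_5layers_e2 = dict(
    n_layer=12,
    random_up_layers=[0, 1, 2, 3, 4],  # layers with frozen random up projections
    n_experts=2,                       # routed random expert subspaces
    correction_rank=0,                 # no learned low-rank correction
)

random_up_moe_12l_5layers_e2_rank8 = dict(
    n_layer=12,
    random_up_layers=[0, 1, 2, 3, 4],
    n_experts=2,
    correction_rank=8,                 # adds the U(Vx) correction path
)
```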
Why this may be useful
Even before H100 testing, this PR provides a clean scaffold for evaluating one of the most natural follow-ups to PR #1228.
It is built from the root trainer, so it preserves the standard repo flow, which should make it easier to test and extend than a larger custom research branch.
Current status
This is an initial implementation PR before full H100 evaluation.
So far, local checks have been run, including `state_dict` behavior of the new path.
The next step is H100 testing of the included configs under matched wallclock settings.
Reproduction