
Non-record: Initial small random mini-MoE #1692

Open
joshEng1 wants to merge 5 commits into openai:main from joshEng1:main

Conversation

@joshEng1

Summary

This PR is an initial follow-up to the small random mini-MoE idea described in PR #1228.

PR #1228 showed that selected MLP up projections can be replaced with seeded frozen QR random feature maps while keeping the rest of the MLP trainable. More importantly for this PR, it also noted that a small “mini-MoE” over multiple random up projections appeared promising, but was removed from the final H100 submission because the throughput cost did not survive an iso-wallclock comparison.

This PR takes that observation as the starting point.

The goal here is not to present a finished result or a fully validated submission. The goal is to build a clean initial implementation of that proposed follow-up idea inside the root trainer and make it easy to test on H100.

What is implemented

This submission starts from the current root train_gpt.py and adds a narrow set of random MLP up ablations in a separate non-record folder.

Selected MLP up projections can now use:

  1. seeded frozen QR random feature maps
  2. learned per-feature gain
  3. optional low-rank correction
  4. optional small routed multi-basis expert mixing

The main new addition is the small random mini-MoE style path.

For selected MLP layers:

  1. the hidden width is partitioned into multiple random expert subspaces
  2. each expert subspace gets its own seeded frozen QR random basis
  3. those expert bases are concatenated into one frozen random matrix
  4. the full hidden expansion is produced in one F.linear call
  5. a small token-dependent router rescales the resulting expert chunks

This keeps the expensive random up projection in one pass while still adding token-dependent routing over multiple random expert subspaces.
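For illustration, a seeded frozen QR basis of this kind can be rebuilt from its seed at construction time rather than stored. The helper below is only a sketch under that assumption; the function name and the exact QR construction are illustrative, not the code in this PR:

```python
import torch

def make_qr_basis(h_e: int, d: int, seed: int) -> torch.Tensor:
    """Regenerate one frozen random expert basis R_e of shape (h_e, d) from its seed."""
    gen = torch.Generator().manual_seed(seed)
    g = torch.randn(h_e, d, generator=gen)
    if h_e >= d:
        # Tall case: QR of the (h_e, d) Gaussian gives orthonormal columns.
        q, _ = torch.linalg.qr(g)
        return q
    # Wide case: orthonormalize the rows instead.
    q, _ = torch.linalg.qr(g.T)
    return q.T
```

Because such a basis depends only on its shape and seed, it can be rebuilt identically at load time, which is what keeps these weights out of the saved artifact.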

Computation

The small random mini-MoE path is implemented so that the expensive projection still happens in one F.linear call.

For one selected MLP layer, let the input be x ∈ R^d. Split the hidden width into E expert subspaces with widths h_1, ..., h_E, and let each expert have its own seeded frozen QR random basis

R_e ∈ R^{h_e × d}

Concatenate those expert bases row-wise into one frozen matrix

R = [R_1; R_2; ...; R_E] ∈ R^{h × d}, where h = Σ_e h_e

Then the full random feature expansion is computed in one pass:

u = F.linear(x, R) = [u_1 | u_2 | ... | u_E]

where each expert chunk is

u_e = F.linear(x, R_e)

A learned per-feature gain is applied across the concatenated hidden state:

û = u ⊙ a

where a ∈ R^h is the learned gain vector.

The token router then produces expert weights

g(x) = softmax(W_router x) ∈ R^E

and each expert chunk is scaled by its token-dependent router weight:

base(x) = [g_1(x) · û_1 | g_2(x) · û_2 | ... | g_E(x) · û_E]

So the key approximation is that expert specialization is introduced after a single shared random projection over concatenated expert subspaces, rather than by running several separate dense up-projections.

For the configs with a learned correction path, the final hidden state is

output(x) = base(x) + U(Vx)

where V ∈ R^{r × d} and U ∈ R^{h × r} define the low-rank correction.

This gives two useful properties:

  1. the frozen random expert bases are regenerated from seeds and do not consume artifact bytes
  2. the heavy random up-projection remains a single F.linear call, which is the main reason this version is cheaper than a naive multi-projection mini-MoE
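To make the construction above concrete, here is a minimal PyTorch sketch of one selected layer's routed path, assuming equal-width expert chunks and reusing the make_qr_basis helper sketched earlier. The module and attribute names are illustrative, not the names used in this PR's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomUpMoE(nn.Module):
    """Routed random up projection: one F.linear over the concatenated frozen
    expert bases, then token-dependent rescaling of each expert chunk."""

    def __init__(self, d: int, h: int, n_experts: int, seeds, rank: int = 0):
        super().__init__()
        assert h % n_experts == 0 and len(seeds) == n_experts
        self.n_experts, self.h_e = n_experts, h // n_experts
        # R = [R_1; ...; R_E], regenerated from seeds and kept as a
        # non-persistent buffer so it never enters state_dict.
        R = torch.cat([make_qr_basis(self.h_e, d, s) for s in seeds], dim=0)
        self.register_buffer("R", R, persistent=False)
        self.gain = nn.Parameter(torch.ones(h))              # learned per-feature gain a
        self.router = nn.Linear(d, n_experts, bias=False)    # W_router
        # Optional low-rank correction U(Vx).
        self.V = nn.Linear(d, rank, bias=False) if rank > 0 else None
        self.U = nn.Linear(rank, h, bias=False) if rank > 0 else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (..., d)
        u = F.linear(x, self.R) * self.gain                   # single pass over all experts
        g = F.softmax(self.router(x), dim=-1)                  # g(x): (..., E)
        u = u.view(*x.shape[:-1], self.n_experts, self.h_e)   # split into expert chunks
        out = (u * g.unsqueeze(-1)).flatten(-2)                # base(x)
        if self.V is not None:
            out = out + self.U(self.V(x))                      # + U(Vx)
        return out
```

This sketch assumes equal expert widths for simplicity; the text above only requires widths h_1, ..., h_E, which would need a slightly more general chunking of the hidden state.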

Continuity

This PR is intended as a direct continuation of the specific idea proposed in PR #1228, not as a general MoE exploration.

  1. PR #1228 established the random MLP up baseline
  2. PR #1228 explicitly suggested a small random mini-MoE as a potentially useful extension
  3. this PR implements a cheaper version of that idea so it can be tested more directly

The implementation is designed to preserve the artifact-saving property of seeded frozen random weights while making the mini-MoE path cheaper than a naive multi-projection version.

This also fits naturally within the broader random-linear-map work in PR #1301, which showed that selective randomization in the MLP is a serious direction and that small learned corrective paths on top of frozen random structure are worth testing.

Why this is interesting

The random MLP construction from PR #1228 has one obvious limitation: a single frozen random basis may be too rigid.

A small random MoE is interesting because it offers a very targeted way to relax that rigidity.

Instead of forcing every token through one frozen basis, the model gets several random feature subspaces and learns token-dependent mixing over them. If this works, it would mean the random MLP idea can be made more expressive without giving up the seed-regenerated weight trick that makes it attractive under the 16 MB constraint.

This is exactly the question left open by PR #1228, and this submission is designed to answer it more directly.

The design is also intentionally cost-aware. Prior MoE work in the repo, especially PR #660, makes it clear that routing quality alone is not enough if throughput collapses. That is why this implementation keeps the heavy random up projection in one pass and applies expert routing over subspaces afterward rather than duplicating a full expert MLP path.

Included configs

This PR currently includes the following configs:

  1. baseline_12l
  2. random_up_12l_5layers_rank16
  3. random_up_moe_12l_5layers_e2
  4. random_up_moe_12l_5layers_e2_rank8

The two MoE configs are the new part of the experiment.

random_up_moe_12l_5layers_e2

  1. layers 0,1,2,3,4 use frozen random MLP up projections
  2. each selected layer is split into 2 routed random expert subspaces
  3. no low-rank correction is used

This isolates the small random mini-MoE path itself.

random_up_moe_12l_5layers_e2_rank8

  1. same 2-expert routed random basis construction
  2. adds a small rank-8 learned correction on top

This is meant to test whether the routed random mini-MoE benefits from a small learned corrective path.
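For orientation, the delta between the two MoE configs amounts to whether the low-rank correction is enabled. A hypothetical sketch of that delta follows; the real config format in the non-record folder may look different:

```python
# Hypothetical shape of the two MoE configs; the actual config format may differ.
random_up_moe_12l_5layers_e2 = dict(
    random_up_layers=[0, 1, 2, 3, 4],   # layers with frozen random up projections
    n_experts=2,                        # routed random expert subspaces per layer
    correction_rank=0,                  # no low-rank correction
)
random_up_moe_12l_5layers_e2_rank8 = dict(
    random_up_layers=[0, 1, 2, 3, 4],
    n_experts=2,
    correction_rank=8,                  # adds the small rank-8 learned correction
)
```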

Why this may be useful

Even before H100 testing, this PR provides a clean scaffold for evaluating one of the most natural follow-ups to PR #1228.

It is built from the root trainer, so it preserves the standard repo flow for:

  1. data loading
  2. tokenizer-aware BPB accounting
  3. optimizer grouping
  4. quantization export
  5. roundtrip eval
  6. sliding-window eval

That should make it easier to test and extend than a larger custom research branch.

Current status

This is an initial implementation PR before full H100 evaluation.

So far, the following have been verified locally:

  1. deterministic regeneration of frozen random weights from seed
  2. frozen random weights excluded from state_dict
  3. save and load roundtrip preserves outputs
  4. the routed expert path builds and forwards correctly
  5. router parameters stay out of the Muon matrix group
  6. final exact eval and sliding eval remain wired into the script
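As a rough illustration of items 1 to 3, checks of this kind can be written against the sketches above (again using the hypothetical make_qr_basis and RandomUpMoE names, not the PR's actual helpers):

```python
import torch

# Same seed -> bit-identical frozen basis, so it never needs to be stored.
r1 = make_qr_basis(256, 128, seed=1234)
r2 = make_qr_basis(256, 128, seed=1234)
assert torch.equal(r1, r2)

# Frozen bases live in a non-persistent buffer, so state_dict stays small.
moe = RandomUpMoE(d=128, h=512, n_experts=2, seeds=[1, 2], rank=8)
assert "R" not in moe.state_dict()

# Save/load roundtrip: learned params are restored, frozen parts rebuilt from seeds.
x = torch.randn(4, 128)
fresh = RandomUpMoE(d=128, h=512, n_experts=2, seeds=[1, 2], rank=8)
fresh.load_state_dict(moe.state_dict())
torch.testing.assert_close(fresh(x), moe(x))
```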

The next step is H100 testing of the included configs under matched wallclock settings.

Reproduction

bash run.sh baseline_12l
bash run.sh random_up_12l_5layers_rank16
bash run.sh random_up_moe_12l_5layers_e2
bash run.sh random_up_moe_12l_5layers_e2_rank8

