Description
Oli mentioned this in a convo earlier: "Couldn't we just combine the up_proj and gate_proj weight matrices in a target model's SwiGLU, decompose them as a single matrix, and then split it up again later?" This is similar to the choice we already have for attention: decompose QKV as a single matrix (or as pairs: QK/QV/KV), decompose Q, K and V separately, or even decompose each head separately.
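As a rough illustration of the fuse-then-split idea, here is a minimal NumPy sketch. It is not the SPD code or the target model's actual implementation. The names `w_gate`, `w_up`, `w_down`, the toy sizes, and the `(out_features, in_features)` weight layout are all assumptions made for the example. It only shows that stacking the two projections into one matrix leaves the SwiGLU output unchanged and that the stacked matrix can be split back exactly.

```python
import numpy as np


def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))


def swiglu_separate(x, w_gate, w_up, w_down):
    # Standard SwiGLU MLP with separate gate and up projections.
    return (silu(x @ w_gate.T) * (x @ w_up.T)) @ w_down.T


def fuse_gate_up(w_gate, w_up):
    # Stack along the output dimension: shape (2 * d_ff, d_model).
    return np.concatenate([w_gate, w_up], axis=0)


def split_gate_up(w_fused):
    # Inverse of fuse_gate_up: first half is gate, second half is up.
    d_ff = w_fused.shape[0] // 2
    return w_fused[:d_ff], w_fused[d_ff:]


def swiglu_fused(x, w_fused, w_down):
    # One matmul for both projections, then split the activations.
    hidden = x @ w_fused.T
    gate, up = np.split(hidden, 2, axis=-1)
    return (silu(gate) * up) @ w_down.T


# Hypothetical toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
w_gate = rng.normal(size=(d_ff, d_model))
w_up = rng.normal(size=(d_ff, d_model))
w_down = rng.normal(size=(d_model, d_ff))
x = rng.normal(size=(4, d_model))

w_fused = fuse_gate_up(w_gate, w_up)
out_separate = swiglu_separate(x, w_gate, w_up, w_down)
out_fused = swiglu_fused(x, w_fused, w_down)
gate_back, up_back = split_gate_up(w_fused)
```

Since the two parameterisations compute the same function, a decomposition of `w_fused` can be split row-wise into gate and up parts afterwards and compared against decompositions of `w_gate` and `w_up` done separately.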
I never really appreciated the optionality we have here. I think we can use it to test the method: SPD + clustering should arrive at the same results regardless of how these matrices are split up. If it doesn't, finding out where and why it doesn't would be helpful in debugging the methods.
This is probably more of a future investigation once we have some confidence that our decomp method (+ clustering) is reasonable.
The easiest way to manage this would probably be to just train new target models with the different architectures using simple_stories_train (https://github.com/goodfire-ai/simple_stories_train/tree/dev).