Investigate different options for splitting up target model matrices

Oli mentioned this in a convo earlier: "Couldn't we just combine the up_proj and gate_proj weight matrices in a target model SwiGLU and decompose it as a single matrix, and then later split it up?" This is similar to the choice we already have for attention: decompose QKV as a single matrix (or pairwise as QK/QV/KV), decompose Q, K, and V separately, or even decompose each head separately.
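To make the up_proj/gate_proj suggestion concrete, here is a minimal NumPy sketch (not the repo's code) of stacking the two weight matrices into one, checking that the fused forward pass matches the separate one, and splitting the fused matrix back afterwards. The dimensions, variable names, and the `(out_features, in_features)` weight layout (the convention used by `torch.nn.Linear`) are assumptions for illustration only.

```python
import numpy as np


def silu(z):
    # SiLU / swish activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
d_model, d_ff = 8, 16  # hypothetical sizes, chosen only for the example

# Weights stored as (out_features, in_features); assumed layout.
W_gate = rng.standard_normal((d_ff, d_model))
W_up = rng.standard_normal((d_ff, d_model))
x = rng.standard_normal((4, d_model))  # a small batch of residual-stream vectors

# Option A: treat gate_proj and up_proj as two separate matrices.
hidden_separate = silu(x @ W_gate.T) * (x @ W_up.T)

# Option B: stack them into one (2 * d_ff, d_model) matrix, apply it once,
# then split the pre-activations back into gate and up halves.
W_gate_up = np.concatenate([W_gate, W_up], axis=0)
gate_pre, up_pre = np.split(x @ W_gate_up.T, 2, axis=-1)
hidden_fused = silu(gate_pre) * up_pre

# After decomposing the fused matrix, it (or any component with the same
# shape) can be split back into gate and up blocks along the output axis.
W_gate_rec, W_up_rec = np.split(W_gate_up, 2, axis=0)
```

The two options compute the same function, so any difference in what the decomposition finds would come from how the matrices are grouped rather than from the target model itself. The same stacking and splitting applies to Q, K, and V, or to per-head slices of them.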

I never really appreciated the optionality we have here. I think we can use it to test the method: SPD + clustering should arrive at the same results regardless of how these matrices are split up. If it doesn't, finding out where and why would help us debug the methods.

This is probably more of a future investigation once we have some confidence that our decomp method (+ clustering) is reasonable.

The easiest way to manage this would probably be to just train new target models with the different architectures using https://github.com/goodfire-ai/simple_stories_train/tree/dev .

Investigate different options for splitting up target model matrices #243