bfloat16 everywhere is ~3x faster, but the runs seem to fail:
- run. bf16. ss_llama_simple_mlp-1L (same config as here)
- run. bf16. ss_llama_simple_mlp-1.25M (same config as here)
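For reference, casting everything to bfloat16 is a one-line change; a minimal sketch (the toy MLP here is a stand-in, not the actual ss_llama_simple_mlp config):

```python
import torch
import torch.nn as nn

# Toy stand-in for the target MLP (hypothetical shapes).
model = nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 8))

# "bf16 everywhere": cast all parameters and buffers to bfloat16.
model = model.to(torch.bfloat16)

# Inputs must match the parameter dtype.
x = torch.randn(4, 8, dtype=torch.bfloat16)
out = model(x)
```

Because optimizer state and gradients also end up in bf16 with this approach, precision loss in accumulation is a plausible cause of the failed runs above.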
Mixed precision seems to be about 50% faster. Runs are pending to check that it works well:
We should at least use mixed precision if it works well.
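A minimal sketch of the mixed-precision variant using `torch.autocast` (shown on CPU for illustration; on GPU you'd pass `device_type="cuda"`). Unlike the bf16-everywhere approach, parameters and optimizer state stay in fp32, and with bf16 autocast no `GradScaler` is needed (that's an fp16 concern):

```python
import torch
import torch.nn as nn

# Toy stand-in for the target MLP (hypothetical shapes).
model = nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 8))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(4, 8)
# Parameters stay fp32; matmuls inside the autocast region run in bf16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
    loss = out.pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```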
We should also:
- torch.compile() the target model (compiling other things didn't seem to help or wasn't possible).
- Use regular torch ops rather than einops inside the slow parts (maybe just LinearComponents.forward()).
- Turn off use_delta_components: it's 13% faster without them.
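The einops-to-torch-ops item can be sketched as follows. The shapes and names (`V`, `U`, a rank factorisation) are assumptions for illustration, not LinearComponents' actual signature; the point is that the same contraction expressed with plain matmuls tends to fuse and compile more readily:

```python
import torch

# Hypothetical LinearComponents-style shapes:
# x: (batch, d_in), V: (C, d_in, rank), U: (C, rank, d_out).
batch, C, d_in, rank, d_out = 4, 3, 8, 2, 5
x = torch.randn(batch, d_in)
V = torch.randn(C, d_in, rank)
U = torch.randn(C, rank, d_out)

# einops-style reference contraction: "b i, c i r, c r o -> b c o".
ref = torch.einsum("bi,cir,cro->bco", x, V, U)

# Same computation with plain batched matmuls:
h = x @ V                      # (b, i) @ (C, i, r) -> (C, b, r) via broadcasting
out = (h @ U).transpose(0, 1)  # (C, b, r) @ (C, r, o) -> (C, b, o) -> (b, C, o)
```

The rewritten forward can then be wrapped with `torch.compile()` along with the rest of the target model, per the first item above.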