The OSP paper claims that a few modifications to standard transformer training enable 4-bit quantization with less performance degradation. The proposed changes are:
- Switch to Muon everywhere except embeddings
- Use RMSNorm instead of LayerNorm, with only a scale parameter
- Single-Scale RMSNorm: don't learn a per-channel gain/scale vector, just learn a single scalar shared across all channels
- All embedding projections (i.e. timestep embedding, action embedding, patch embedding, text embedding) get an extra EMBPROJ layer
- EMBPROJ is a normal linear layer with no bias and orthogonal init
- Monitor kurtosis of attn/mlp outputs
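
A minimal sketch of three of the items above (single-scale RMSNorm, a bias-free orthogonally initialized EMBPROJ, and a kurtosis monitor). It uses plain Python lists so it runs without dependencies; in a real model these would be tensor ops. All function names, the `eps` value, and the Gram-Schmidt init are assumptions for illustration, not the paper's code. The kurtosis here is the non-excess form (about 3 for Gaussian data); which variant the paper reports is not stated in these notes.

```python
import math
import random


def single_scale_rmsnorm(x, gain=1.0, eps=1e-6):
    """RMS-normalize one feature vector, then apply a single scalar gain.

    Standard RMSNorm multiplies by a per-channel gain vector; here `gain`
    is one scalar shared by all channels (the layer's only parameter).
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gain * v / rms for v in x]


def orthogonal_rows(n_rows, n_cols, seed=0):
    """Return `n_rows` orthonormal vectors of length `n_cols`.

    Assumption: modified Gram-Schmidt on Gaussian samples, used here as a
    dependency-free stand-in for a library orthogonal initializer.
    """
    if n_rows > n_cols:
        raise ValueError("need n_rows <= n_cols for orthonormal rows")
    rng = random.Random(seed)
    rows = []
    while len(rows) < n_rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        for r in rows:
            # Remove the component of v along each already-accepted row.
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # resample in the (unlikely) degenerate case
            rows.append([a / norm for a in v])
    return rows


def emb_proj(x, weight):
    """Bias-free linear projection y = W x, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def kurtosis(values):
    """Non-excess kurtosis E[(x - mean)^4] / Var(x)^2 of a flat list."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        return float("nan")  # undefined for constant input
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / (var * var)


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 0.25]
    w = orthogonal_rows(4, 4)
    h = single_scale_rmsnorm(emb_proj(x, w), gain=1.0)
    print("normalized projection:", h)
    print("kurtosis of activations:", kurtosis(h))
```

A large kurtosis value on attention or MLP outputs indicates heavy tails, i.e. a few activations much larger than the rest, which is the kind of outlier that makes low-bit quantization lossy. Logging it per layer during training gives an early signal of whether the changes above are working.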