reproduce DiffCR results

Hi @ranery, this work is very promising. I am currently applying DiffCR to Sana (https://github.com/NVlabs/Sana), a linear-attention-based model. So far I have only tested the routing with a fixed compression ratio for all layers, and after training the results are noisy. Can you give me some suggestions?
Here is my code (the same as your pseudocode):

<img width="1144" alt="Image" src="https://github.com/user-attachments/assets/81dda41c-5526-4093-9e3d-810f49f3e2df" />
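
For readers who cannot view the screenshot, fixed-ratio routing of the kind described above can be sketched roughly as follows. This is a minimal pure-Python illustration in the mixture-of-depths style; it is not the DiffCR code and not the code in the image. The names `route_tokens`, `scores`, and `block`, the sigmoid gate, and the residual update are all assumptions made for the example.

```python
import math


def route_tokens(tokens, scores, compression_ratio, block):
    """Fixed-ratio top-k routing (hypothetical sketch, not the DiffCR implementation).

    tokens: list of per-token feature lists
    scores: one router score per token (assumed to come from a learned router)
    compression_ratio: fraction of tokens that bypass the block, 0 <= r < 1
    block: callable mapping the kept tokens to one residual update per kept token
    """
    n = len(tokens)
    n_skip = round(n * compression_ratio)
    n_keep = max(1, n - n_skip)  # always process at least one token

    # Indices of the n_keep highest-scoring tokens, restored to original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])

    # Only the kept tokens go through the (expensive) block.
    updates = block([tokens[i] for i in keep])

    # Bypassed tokens are copied through unchanged.
    out = [list(t) for t in tokens]
    for i, update in zip(keep, updates):
        # Assumed gating: scale the residual by sigmoid(score) so the
        # router score influences the output of the kept tokens.
        gate = 1.0 / (1.0 + math.exp(-scores[i]))
        out[i] = [x + gate * u for x, u in zip(tokens[i], update)]
    return out, keep


if __name__ == "__main__":
    toks = [[1.0], [2.0], [3.0], [4.0]]
    router_scores = [0.1, 2.0, -1.0, 0.5]
    # Stand-in for a transformer block: a constant residual update.
    dummy_block = lambda xs: [[10.0 for _ in x] for x in xs]
    routed, kept = route_tokens(toks, router_scores, 0.5, dummy_block)
    print(kept, routed)
```

Checking that bypassed tokens come out unchanged, and that exactly the expected number of tokens is processed at each layer, is a cheap sanity test for a routing implementation like this.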
