According to your arXiv paper (https://arxiv.org/pdf/2310.12109.pdf), there is no activation in the sequence-mixing step in equation (2). However, the appendix code for MonarchMixerLayer includes a ReLU in the sequence-mixing layer.
Also, equation (3) seems to be an MLP operation (I might be wrong). I don't fully understand why the paper says "The resulting architecture is entirely attention- and MLP-free."
This is a typo -- the sequence mixer originally had an "optional" activation function, which we set to the identity. We updated the equation but not the pseudocode -- we'll fix it the next time we update the arXiv!
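To make the "optional activation set to identity" point concrete, here is a minimal sketch of the sequence-mixing step described above. The function name, shapes, and dense mixing matrices are my own illustration (the paper uses Monarch matrices here), not code from the repo:

```python
import numpy as np

def sequence_mix(X, M1, M2, activation=lambda z: z):
    """Sketch of a sequence-mixing step with an optional activation.

    X:  (seq_len, d) input activations
    M1, M2: (seq_len, seq_len) mixing matrices (stand-ins for the
            Monarch matrices used in the paper)

    The activation slot exists, as in the appendix pseudocode, but
    defaults to the identity -- matching equation (2) in the paper.
    """
    return M2 @ activation(M1 @ X)
```

With the default identity activation this reduces to two matrix multiplies along the sequence dimension, which is what equation (2) expresses; passing `np.maximum(z, 0)` as `activation` recovers the ReLU variant shown in the appendix pseudocode.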
> equation (3) seems to be an MLP operation
Ah, this is helpful feedback! The distinction we intend is that an MLP is quadratic in $d$, while the M2 version is sub-quadratic.
Specifically -- an MLP's linear layers take quadratic compute ($O(d^2)$ for dimension $d$). In M2, we replace these linear layers with Monarch matrices, which can be applied in sub-quadratic time ($O(d^{3/2})$ with two block-diagonal factors). So it has a similar structure to an MLP, but is sub-quadratic in $d$.
We'll clarify the language in the next arXiv update, thank you!