There is no causal mask in the attention layer. Is that because the model is designed for classification rather than generation?
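To make the question concrete, here is a minimal sketch of what a causal mask would change. It is not taken from the model being discussed: the function `attention_weights` and the toy `scores` matrix are hypothetical, and a single attention head with precomputed scores is assumed.

```python
import math

def attention_weights(scores, causal=False):
    """Row-wise softmax over a square matrix of attention scores (list of lists).

    Hypothetical helper for illustration only, not the model's actual code.
    """
    weights = []
    for i, row in enumerate(scores):
        # With a causal mask, query position i may only see key positions j <= i.
        visible = row[: i + 1] if causal else row
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions get weight 0.
        padding = [0.0] * (len(row) - len(visible))
        weights.append([e / total for e in exps] + padding)
    return weights

# Toy scores for a sequence of three tokens.
scores = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
bidirectional = attention_weights(scores)        # every token sees every token
masked = attention_weights(scores, causal=True)  # token i sees only tokens 0..i
```

Without the mask, every row spreads weight over the whole sequence, which is what I would expect if each token is meant to use context from both directions. With the mask, the upper triangle is zero, which is the usual setup for left-to-right generation.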