Possible bug in `DisentangledSelfAttention`?

Thanks for the great work and open-sourced codebase for your model!

I've been working on a project that involves fine-tuning your model, and ran into one issue. I was fine-tuning using DDP and getting errors about some parameters being unused, and I found that all the modules that require gradients but aren't receiving any are the `ss_q_proj` layers in each encoder layer's attention block (i.e. `prosst.encoder.layer[0-11].attention.self.ss_q_proj.weight` and  `prosst.encoder.layer[0-11].attention.self.ss_q_proj.bias`).

Looking into the code [here](https://huggingface.co/AI4Protein/ProSST-2048/blob/main/modeling_prosst.py#L500), I noticed that in this block:
```
if "ss2aa" in self.pos_att_type:
  assert ss_hidden_states is not None
  ss_query_layer = self.ss_q_proj(ss_hidden_states)
  ss_query_layer = self.transpose_for_scores(ss_query_layer)
  ss_query_layer /= torch.sqrt(
    torch.tensor(ss_query_layer.size(-1), dtype=torch.float)
    * scale_factor
  )
  ss2aa_att = torch.matmul(
    key_layer, query_layer.transpose(-1, -2).to(dtype=key_layer.dtype)
  )
  score += ss2aa_att
```
it looks like `ss_query_layer` is created but not used. Should the line ` key_layer, query_layer.transpose(-1, -2).to(dtype=key_layer.dtype)` instead be ` key_layer, ss_query_layer.transpose(-1, -2).to(dtype=key_layer.dtype)`? It seems like that would be more in line with what it looks like the code was intended to do.

Apologies if I'm misunderstanding the codebase or method! 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible bug in `DisentangledSelfAttention`? #15

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Possible bug in DisentangledSelfAttention? #15

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Possible bug in `DisentangledSelfAttention`? #15