Uniprot #470

Merged
merged 47 commits into from
Nov 18, 2023
Commits
d079c4f
example processing dataset
AdrianM0 Oct 30, 2023
cf676cc
processing notebook
AdrianM0 Nov 1, 2023
700cd1e
feat: minor dataframe construction adjustments
AdrianM0 Nov 1, 2023
58639c5
first version templates
AdrianM0 Nov 1, 2023
56fecb5
remove lint
Nov 1, 2023
eae4fd0
feat: working templates + working data extraction
AdrianM0 Nov 2, 2023
3b824d6
fix: data link
AdrianM0 Nov 2, 2023
ed6b532
lint
AdrianM0 Nov 2, 2023
75c0273
fix lines
AdrianM0 Nov 2, 2023
388e822
pre-commit
AdrianM0 Nov 2, 2023
e511385
separate files and read from hugging face
AdrianM0 Nov 3, 2023
d79774a
remove full dataset files
AdrianM0 Nov 3, 2023
1e915c1
Update data/tabular/uniprot_organisms/meta.yaml
AdrianM0 Nov 8, 2023
526f866
add sentences clean-up regex + new text template
AdrianM0 Nov 8, 2023
4461e4f
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 9, 2023
e3c1c25
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 9, 2023
14af762
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 9, 2023
ea25d27
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 9, 2023
3584439
fix space
AdrianM0 Nov 9, 2023
b086b61
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 14, 2023
b333c2c
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 14, 2023
d097e11
Update data/tabular/uniprot_reactions/meta.yaml
AdrianM0 Nov 14, 2023
0e79a7b
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 14, 2023
c2a191d
hugging face downloader
AdrianM0 Nov 15, 2023
6ed8ca6
Merge branch 'uniprot' of https://github.com/AdrianM0/chemnlp into un…
AdrianM0 Nov 15, 2023
6e9f2bd
Update data/tabular/uniprot_organisms/meta.yaml
kjappelbaum Nov 17, 2023
d515a0a
Update data/tabular/uniprot_binding_sites/meta.yaml
kjappelbaum Nov 17, 2023
c87127e
fix: exclude_from_standard_tabular_text_templates
MicPie Nov 17, 2023
aead83f
fix: templates uniprot_binding_sites
MicPie Nov 17, 2023
b4361c9
fix: templates uniprot_organisms
MicPie Nov 17, 2023
30eb7fb
fix: templates uniprot_binding_sites 2
MicPie Nov 17, 2023
e03704a
fix: templates uniprot_reactions
MicPie Nov 17, 2023
fe04f4e
fix: templates uniprot_sentences
MicPie Nov 17, 2023
c8d56f7
feat: inverse design template uniprot_reactions
MicPie Nov 17, 2023
3d593c8
Merge branch 'main' into uniprot
MicPie Nov 17, 2023
b33878f
Apply suggestions from code review
MicPie Nov 17, 2023
073f6ab
feat: add uniprot benchmarking tasks
MicPie Nov 17, 2023
c37f116
additional clean-up
AdrianM0 Nov 17, 2023
04b36d2
clean By . similarity strings
AdrianM0 Nov 17, 2023
a1fd93a
fix column id for identifier
AdrianM0 Nov 17, 2023
52dd9af
fix column id for identifier
AdrianM0 Nov 17, 2023
8f3eff9
file cleanup
AdrianM0 Nov 17, 2023
40a66dc
replace templates and transform with new tasks/data
AdrianM0 Nov 18, 2023
0c3c5d8
type change to int in column
AdrianM0 Nov 18, 2023
f309c28
final fixing, working templates
AdrianM0 Nov 18, 2023
d5c4ca6
updates templates and types
Nov 18, 2023
c6efe2e
remove lint
Nov 18, 2023
58 changes: 58 additions & 0 deletions data/tabular/uniprot_binding_sites/meta.yaml
@@ -0,0 +1,58 @@
---
name: uniprot_binding_sites
description: |-
    Binding sites of a molecule in protein sequences.
targets:
    - id: start_binding_site
      description: index for the start of the binding site of a protein
      type: text
      names:
          - noun: start binding site
    - id: end_binding_site
      description: index for the end of the binding site of a protein
      type: text
      names:
          - noun: end binding site
    - id: SMILES
      description: SMILES
      type: SMILES
      names:
          - noun: SMILES
identifiers:
    - id: sequence
      type: AS_SEQUENCE
      description: other
license: MIT
links:
    - url: https://www.uniprot.org/
      description: data source
num_points: 780449
bibtex:
    - |-
      @article{10.1093/nar/gkac1052,
      author = {The UniProt Consortium},
      title = {UniProt - the Universal Protein Knowledgebase in 2023},
      journal = {Nucleic Acids Research},
      volume = {51},
      number = {D1},
      pages = {D523-D531},
      year = {2022},
      month = {11},
      issn = {0305-1048},
      doi = {10.1093/nar/gkac1052},
      url = {https://doi.org/10.1093/nar/gkac1052}}
templates:
    - |-
      Question: What are the binding sites of the {#molecule|chemical|compound!} with {SMILES__description} {SMILES#} in this {#AA|amino acid!} sequence {sequence#}?
      Answer: The binding site for the {#molecule|chemical|compound!} with the SMILES {SMILES#} in the given {#AA|amino acid!} sequence is: {start_binding_site#}-{end_binding_site#}.
    - |-
      Question: What molecule can bind in the binding site {start_binding_site#}-{end_binding_site#} in the amino acid sequence below?
      {#AA|amino acid!} sequence: {sequence#}.
      Answer: {SMILES#}
    - |-
      Task: Design a binding site in the {#AA|amino acid!} sequence {sequence#}, in which the {#molecule|chemical|compound!} with {SMILES__description} {SMILES#} can bind.
      Answer: {start_binding_site#}-{end_binding_site#}
    - |-
      Task: Design a {#molecule|chemical|compound!} that binds to a given site in the {#AA|amino acid!} sequence {sequence#}.
      Description: The binding site is {start_binding_site#}-{end_binding_site#}.
      Answer: {SMILES#}
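The templates above combine two placeholder forms: `{#a|b|c!}` lists interchangeable wordings, and `{column#}` is filled from a data row. The sketch below illustrates how such a template could be expanded. It is not the chemnlp sampler: the `render` helper and the toy row are hypothetical, and placeholders without a trailing `#` (such as `{SMILES__description}`) are deliberately left out of scope.

```python
import random
import re

# Hypothetical helper for illustration only; not the chemnlp text sampler.
CHOICE = re.compile(r"\{#([^{}]*?)!\}")  # {#a|b|c!} -> pick one option
FIELD = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)#\}")  # {column#} -> row value


def render(template: str, row: dict, rng: random.Random) -> str:
    """Fill one template with the values of one data row."""
    # Replace every {#a|b|c!} group with one randomly chosen option.
    text = CHOICE.sub(lambda m: rng.choice(m.group(1).split("|")), template)
    # Replace every {column#} placeholder with the matching row value.
    return FIELD.sub(lambda m: str(row[m.group(1)]), text)


template = (
    "Task: Design a {#molecule|chemical|compound!} that binds to a given site "
    "in the {#AA|amino acid!} sequence {sequence#}.\n"
    "Description: The binding site is {start_binding_site#}-{end_binding_site#}.\n"
    "Answer: {SMILES#}"
)
# Toy row with made-up values, not taken from the dataset.
row = {
    "sequence": "MKTAYIAK",
    "start_binding_site": 3,
    "end_binding_site": 5,
    "SMILES": "CCO",
}
print(render(template, row, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the choice of wording reproducible, which helps when comparing generated prompts across runs.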
25 changes: 25 additions & 0 deletions data/tabular/uniprot_binding_sites/transform.py
@@ -0,0 +1,25 @@
import pandas as pd
from huggingface_hub import hf_hub_download

DATA = "uniprot_binding_sites"


def load_dataset() -> pd.DataFrame:
    # Download the pre-cleaned CSV from the chemnlp/uniprot dataset repo.
    path = hf_hub_download(
        repo_id="chemnlp/uniprot",
        filename=f"{DATA}/data_clean.csv",
        repo_type="dataset",
    )
    uniprot = pd.read_csv(path)
    # Store the end index of the binding site as an integer.
    uniprot.end_binding_site = uniprot.end_binding_site.astype(int)
    uniprot.drop_duplicates(inplace=True)
    print(f"Successfully loaded {DATA}! {len(uniprot)} rows")
    # Write the de-duplicated table to the current working directory.
    uniprot.to_csv("data_clean.csv", index=False)
    return uniprot


if __name__ == "__main__":
    load_dataset()
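The integer cast matters because the templates print the site as `start-end`. A minimal sketch of how the two indices could be mapped back to residues follows. It assumes 1-based, inclusive positions, which is the convention UniProt uses for feature coordinates; whether `data_clean.csv` keeps that convention is not confirmed by this diff. The helper name and the toy rows are hypothetical.

```python
import pandas as pd


def binding_site_residues(sequence: str, start: int, end: int) -> str:
    """Return the residues of a span, assuming 1-based inclusive indices."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"span {start}-{end} does not fit a sequence of length {len(sequence)}")
    # Shift the start by one to convert to Python's 0-based, end-exclusive slicing.
    return sequence[start - 1 : end]


# Toy rows with made-up values; the end index is a float, as can happen after a CSV read.
toy = pd.DataFrame(
    {
        "sequence": ["MKTAYIAKQR", "ACDEFGHIK"],
        "start_binding_site": [3, 1],
        "end_binding_site": [5.0, 1.0],
    }
)
toy["end_binding_site"] = toy["end_binding_site"].astype(int)
toy["site"] = [
    binding_site_residues(seq, start, end)
    for seq, start, end in zip(toy["sequence"], toy["start_binding_site"], toy["end_binding_site"])
]
print(toy[["start_binding_site", "end_binding_site", "site"]])
```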
47 changes: 47 additions & 0 deletions data/tabular/uniprot_organisms/meta.yaml
@@ -0,0 +1,47 @@
---
name: uniprot_organisms
description: |-
    Organisms in which an amino acid sequence can be found.
targets:
    - id: organisms
      description: organisms in which a protein can be found
      type: text
      names:
          - noun: organisms
identifiers:
    - id: other
      type: AS_SEQUENCE
      description: other
license: MIT
links:
    - url: https://www.uniprot.org/
      description: data source
num_points: 559428
bibtex:
    - |-
      @article{10.1093/nar/gkac1052,
      author = {The UniProt Consortium},
      title = {UniProt - the Universal Protein Knowledgebase in 2023},
      journal = {Nucleic Acids Research},
      volume = {51},
      number = {D1},
      pages = {D523-D531},
      year = {2022},
      month = {11},
      issn = {0305-1048},
      doi = {10.1093/nar/gkac1052},
      url = {https://doi.org/10.1093/nar/gkac1052}}
templates:
    - |-
      The protein with the {#amino acid sequence|AA sequence!} {other#} can be found in {#the organism |!}{organisms#}.
    - |-
      Task: {#Predict|Identify!} the organism in which this {#protein|amino acid sequence|AA sequence|polypeptide!} can be found.
      {#Amino acid sequence|Sequence|AA sequence!}: {other#}
      Result: {organisms#}
    - |-
      User: In what organism can you find the following {#protein|amino acid sequence|AA sequence|polypeptide!}: {other#}?
      Assistant: The given {#polypeptide|protein|amino acid sequence|AA sequence!} can be found in {organisms#}.
    - |-
      Task: {#Predict|Identify!} the organism in which this {#protein|amino acid sequence|AA sequence|polypeptide!} can be found.
      {#Amino acid sequence|Sequence|AA sequence!}: {other#}
      Result:<EOI> {organisms#}
25 changes: 25 additions & 0 deletions data/tabular/uniprot_organisms/transform.py
@@ -0,0 +1,25 @@
import pandas as pd
from huggingface_hub import hf_hub_download

DATA = "uniprot_organisms"


def load_dataset() -> pd.DataFrame:
    # Download the pre-cleaned CSV from the chemnlp/uniprot dataset repo.
    path = hf_hub_download(
        repo_id="chemnlp/uniprot",
        filename=f"{DATA}/data_clean.csv",
        repo_type="dataset",
    )
    uniprot = pd.read_csv(path)
    # meta.yaml declares the sequence identifier under the id "other".
    uniprot.rename(columns={"sequence": "other"}, inplace=True)
    uniprot.drop_duplicates(inplace=True)
    print(f"Successfully loaded {DATA}! {len(uniprot)} rows")
    # Write the de-duplicated table to the current working directory.
    uniprot.to_csv("data_clean.csv", index=False)
    return uniprot


if __name__ == "__main__":
    load_dataset()
56 changes: 56 additions & 0 deletions data/tabular/uniprot_reactions/meta.yaml
@@ -0,0 +1,56 @@
---
name: uniprot_reactions
description: |-
    Protein sequences and the reactions these can catalyze.
targets:
    - id: reactions
      description: biochemical reactions catalyzed by a protein
      type: text
      names:
          - noun: chemical reactions
          - noun: biochemical reactions
identifiers:
    - id: other
      type: AS_SEQUENCE
      description: other
license: MIT
links:
    - url: https://www.uniprot.org/
      description: data source
num_points: 253713
bibtex:
    - |-
      @article{10.1093/nar/gkac1052,
      author = {The UniProt Consortium},
      title = {UniProt - the Universal Protein Knowledgebase in 2023},
      journal = {Nucleic Acids Research},
      volume = {51},
      number = {D1},
      pages = {D523-D531},
      year = {2022},
      month = {11},
      issn = {0305-1048},
      doi = {10.1093/nar/gkac1052},
      url = {https://doi.org/10.1093/nar/gkac1052}}
templates:
    - |-
      The {#protein|amino acid sequence|AA sequence|polypeptide!} {#with the sequence |!}{other#} catalyzes the {#following |!}{#chemical |biochemical |!}reaction: {reactions#}
    - |-
      Task: {#Predict|Identify!} a {#biochemical |chemical |!}reaction that can be catalyzed by {#this|the following!} {#protein|amino acid sequence|AA sequence|polypeptide!}.
      {#Amino acid sequence|Sequence|AA sequence!}: {other#}
      Result: {reactions#}
    - |-
      Task: {#Generate|Create|Come up with!} a {#protein|amino acid sequence|AA sequence|polypeptide!} that can catalyze {#a|this!} specific {#biochemical |chemical |!}reaction.
      Reaction: {reactions#}
      {#Output|Result!}: {other#}
    - |-
      User: Can you {#tell me|come up with!} a {#biochemical |chemical |!}reaction that can be catalyzed by the following {#protein|amino acid sequence|AA sequence|polypeptide!}:\n{other#}
      Assistant: {#Yes, the|Sure, the|Yes, sure, the|The!} {#chemical |biochemical |!}reaction that can be catalyzed by the given {#protein|amino acid sequence|AA sequence|polypeptide!} is:\n{reactions#}
    - |-
      Task: {#Predict|Identify!} a {#biochemical |chemical |!}reaction that can be catalyzed by {#this|the following!} {#protein|amino acid sequence|AA sequence|polypeptide!}.
      {#Amino acid sequence|Sequence|AA sequence!}: {other#}
      Result:<EOI> {reactions#}
    - |-
      Task: {#Generate|Create|Come up with|Design!} a {#protein|amino acid sequence|AA sequence|polypeptide!} that can catalyze {#a|this!} specific {#biochemical |chemical |!}reaction.
      Reaction: {reactions#}
      {#Output|Result!}:<EOI> {other#}
25 changes: 25 additions & 0 deletions data/tabular/uniprot_reactions/transform.py
@@ -0,0 +1,25 @@
import pandas as pd
from huggingface_hub import hf_hub_download

DATA = "uniprot_reactions"


def load_dataset() -> pd.DataFrame:
    # Download the pre-cleaned CSV from the chemnlp/uniprot dataset repo.
    path = hf_hub_download(
        repo_id="chemnlp/uniprot",
        filename=f"{DATA}/data_clean.csv",
        repo_type="dataset",
    )
    uniprot = pd.read_csv(path)
    # meta.yaml declares the sequence identifier under the id "other".
    uniprot.rename(columns={"sequence": "other"}, inplace=True)
    uniprot.drop_duplicates(inplace=True)
    print(f"Successfully loaded {DATA}! {len(uniprot)} rows")
    # Write the de-duplicated table to the current working directory.
    uniprot.to_csv("data_clean.csv", index=False)
    return uniprot


if __name__ == "__main__":
    load_dataset()
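The organisms and reactions scripts apply the same two steps after the download: rename `sequence` to `other`, then drop duplicate rows. As a possible refactoring, and only a sketch, those steps could live in one pure function that is easy to check on toy data without touching the network. The name `postprocess` is hypothetical and does not appear in the PR.

```python
from typing import Optional

import pandas as pd


def postprocess(df: pd.DataFrame, rename: Optional[dict] = None) -> pd.DataFrame:
    """Rename columns if requested, then drop exact duplicate rows."""
    # rename() and drop_duplicates() return new frames, so the input is left untouched.
    out = df.rename(columns=rename or {})
    return out.drop_duplicates().reset_index(drop=True)


# Toy frame with made-up values; the first two rows are exact duplicates.
toy = pd.DataFrame(
    {
        "sequence": ["MKT", "MKT", "ACD"],
        "reactions": ["A = B", "A = B", "C = D"],
    }
)
clean = postprocess(toy, rename={"sequence": "other"})
print(clean)
```

Returning a new frame instead of using `inplace=True` keeps the function free of side effects, so each `load_dataset` would only add the download and the final `to_csv` call.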
48 changes: 48 additions & 0 deletions data/tabular/uniprot_sentences/meta.yaml
@@ -0,0 +1,48 @@
---
name: uniprot_sentences
description: |-
    Descriptions of the function of a protein.
targets:
    - id: sentences
      description: sentences describing the function of a protein
      type: text
      names:
          - noun: function
identifiers:
    - id: sequence
      type: AS_SEQUENCE
      description: other
license: MIT
links:
    - url: https://www.uniprot.org/
      description: data source
num_points: 396241
bibtex:
    - |-
      @article{10.1093/nar/gkac1052,
      author = {The UniProt Consortium},
      title = {UniProt - the Universal Protein Knowledgebase in 2023},
      journal = {Nucleic Acids Research},
      volume = {51},
      number = {D1},
      pages = {D523-D531},
      year = {2022},
      month = {11},
      issn = {0305-1048},
      doi = {10.1093/nar/gkac1052},
      url = {https://doi.org/10.1093/nar/gkac1052}}
templates:
    - |-
      User: {#Please describe|Describe|Please briefly describe|Briefly describe!} the {#biological |biochemical |!}function of {#the|this!} {#protein|amino acid sequence|AA sequence|polypeptide!}: {sequence#}
      Assistant: {sentences#}.
    - |-
      User: What {#protein|amino acid sequence|AA sequence|polypeptide!} fits the {#biological |biochemical |!}description {#in the next sentence(s) |below |!}best?\n{sentences#}
      Assistant: A {#protein|amino acid sequence|AA sequence|polypeptide!} that fits the {#description|sentences!} is:\n{sequence#}
    - |-
      Task: {#Generate|Create|Come up with!} a {#protein|amino acid sequence|AA sequence|polypeptide!} based on the description.
      Description: {sentences#}
      {#Output|Result!}: {sequence#}
    - |-
      Task: {#Generate|Create|Come up with!} a {#protein|amino acid sequence|AA sequence|polypeptide!} based on the description.
      Description: {sentences#}
      {#Output|Result!}:<EOI> {sequence#}
36 changes: 36 additions & 0 deletions data/tabular/uniprot_sentences/transform.py
@@ -0,0 +1,36 @@
import pandas as pd
import regex as re
from huggingface_hub import hf_hub_download

DATA = "uniprot_sentences"


def clean_up_sentences(text: str) -> str:
    """Remove "(By similarity)" markers and tidy the spacing before periods."""
    # Drop "(By similarity)" / "(By. similarity)" together with the surrounding whitespace.
    updated_text = re.sub(r"\s*\((?:By\.? similarity)\)\s*", "", text)
    # Remove stray spaces in front of periods.
    updated_text = updated_text.replace(" . ", ". ")
    updated_text = updated_text.replace(" .", ".")
    return updated_text


def load_dataset() -> pd.DataFrame:
    # Download the pre-cleaned CSV from the chemnlp/uniprot dataset repo.
    path = hf_hub_download(
        repo_id="chemnlp/uniprot",
        filename=f"{DATA}/data_clean.csv",
        repo_type="dataset",
    )
    uniprot = pd.read_csv(path)
    uniprot.sentences = uniprot.sentences.apply(clean_up_sentences)
    uniprot.drop_duplicates(inplace=True)
    print(f"Successfully loaded {DATA}! {len(uniprot)} rows")
    # Write the cleaned, de-duplicated table to the current working directory.
    uniprot.to_csv("data_clean.csv", index=False)
    return uniprot


if __name__ == "__main__":
    load_dataset()
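To make the effect of `clean_up_sentences` concrete, the sketch below reruns the same three steps on a few made-up strings. It uses the standard-library `re` module instead of the third-party `regex` package imported in the diff, on the assumption that this pattern behaves identically in both. The last example shows a caveat: because the whitespace on both sides of the marker is removed, a marker in the middle of a sentence fuses the neighbouring words.

```python
import re  # stdlib stand-in for the `regex` package used in the diff (assumed equivalent here)


def clean_up_sentences(text: str) -> str:
    """Same three steps as in transform.py, copied for a standalone demo."""
    updated_text = re.sub(r"\s*\((?:By\.? similarity)\)\s*", "", text)
    updated_text = updated_text.replace(" . ", ". ")
    updated_text = updated_text.replace(" .", ".")
    return updated_text


# Made-up inputs, not taken from the dataset.
examples = [
    "Binds DNA (By similarity). Acts as a repressor (By similarity).",
    "Binds DNA (By. similarity) .",
    "Catalyzes hydrolysis . Requires zinc .",
    "May act (By similarity) as a kinase.",  # marker in mid-sentence
]
for raw in examples:
    print(repr(raw), "->", repr(clean_up_sentences(raw)))
```

If markers in mid-sentence occur in the real data, replacing the match with a single space and then collapsing repeated spaces would avoid the fused words; whether that case occurs is not something this diff shows.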
4 changes: 4 additions & 0 deletions data/text_sampling/text_sampling.py
@@ -166,6 +166,10 @@
    "mol_repr_transl_canonical_iupac_name",
    "mol_repr_transl_inchi_iupac_name",
    # "h2_storage_materials",  # only IUPAC identifier, more than one target, LOW PRIO: has only 30 samples
    "uniprot_binding_sites",
    "uniprot_organisms",
    "uniprot_reactions",
    "uniprot_sentences",
]

