Uniprot #470

Merged
merged 47 commits into from
Nov 18, 2023
Changes from 14 commits
d079c4f
example processing dataset
AdrianM0 Oct 30, 2023
cf676cc
processing notebook
AdrianM0 Nov 1, 2023
700cd1e
feat: minor dataframe construction adjustments
AdrianM0 Nov 1, 2023
58639c5
first version templates
AdrianM0 Nov 1, 2023
56fecb5
remove lint
Nov 1, 2023
eae4fd0
feat: working templates + working data extraction
AdrianM0 Nov 2, 2023
3b824d6
fix: data link
AdrianM0 Nov 2, 2023
ed6b532
lint
AdrianM0 Nov 2, 2023
75c0273
fix lines
AdrianM0 Nov 2, 2023
388e822
pre-commit
AdrianM0 Nov 2, 2023
e511385
separate files and read from hugging face
AdrianM0 Nov 3, 2023
d79774a
remove full dataset files
AdrianM0 Nov 3, 2023
1e915c1
Update data/tabular/uniprot_organisms/meta.yaml
AdrianM0 Nov 8, 2023
526f866
add sentences clean-up regex + new text template
AdrianM0 Nov 8, 2023
4461e4f
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 9, 2023
e3c1c25
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 9, 2023
14af762
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 9, 2023
ea25d27
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 9, 2023
3584439
fix space
AdrianM0 Nov 9, 2023
b086b61
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 14, 2023
b333c2c
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 14, 2023
d097e11
Update data/tabular/uniprot_reactions/meta.yaml
AdrianM0 Nov 14, 2023
0e79a7b
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 Nov 14, 2023
c2a191d
hugging face downloader
AdrianM0 Nov 15, 2023
6ed8ca6
Merge branch 'uniprot' of https://github.com/AdrianM0/chemnlp into un…
AdrianM0 Nov 15, 2023
6e9f2bd
Update data/tabular/uniprot_organisms/meta.yaml
kjappelbaum Nov 17, 2023
d515a0a
Update data/tabular/uniprot_binding_sites/meta.yaml
kjappelbaum Nov 17, 2023
c87127e
fix: exclude_from_standard_tabular_text_templates
MicPie Nov 17, 2023
aead83f
fix: templates uniprot_binding_sites
MicPie Nov 17, 2023
b4361c9
fix: templates uniprot_organisms
MicPie Nov 17, 2023
30eb7fb
fix: templates uniprot_binding_sites 2
MicPie Nov 17, 2023
e03704a
fix: templates uniprot_reactions
MicPie Nov 17, 2023
fe04f4e
fix: templates uniprot_sentences
MicPie Nov 17, 2023
c8d56f7
feat: inverse design template uniprot_reactions
MicPie Nov 17, 2023
3d593c8
Merge branch 'main' into uniprot
MicPie Nov 17, 2023
b33878f
Apply suggestions from code review
MicPie Nov 17, 2023
073f6ab
feat: add uniprot benchmarking tasks
MicPie Nov 17, 2023
c37f116
additional clean-up
AdrianM0 Nov 17, 2023
04b36d2
clean By . similarity strings
AdrianM0 Nov 17, 2023
a1fd93a
fix column id for identifier
AdrianM0 Nov 17, 2023
52dd9af
fix column id for identifier
AdrianM0 Nov 17, 2023
8f3eff9
file cleanup
AdrianM0 Nov 17, 2023
40a66dc
replace templates and transform with new tasks/data
AdrianM0 Nov 18, 2023
0c3c5d8
type change to int in column
AdrianM0 Nov 18, 2023
f309c28
final fixing, working templates
AdrianM0 Nov 18, 2023
d5c4ca6
updates templates and types
Nov 18, 2023
c6efe2e
remove lint
Nov 18, 2023
37 changes: 37 additions & 0 deletions data/tabular/uniprot_binding_sites/meta.yaml
@@ -0,0 +1,37 @@
---
name: uniprot_binding_sites
description: |-
    Binding sites of a protein.
targets:
    - id: binding_sites
      description: binding sites of a protein
      type: text
      names:
          - noun: binding sites
identifiers:
    - id: other
      type: other
      description: other
license: MIT
links:
    - url: https://www.uniprot.org/
      description: data source
num_points: 216329
bibtex:
    - |-
      @article{10.1093/nar/gkac1052,
      author = {The UniProt Consortium},
      title = {UniProt - the Universal Protein Knowledgebase in 2023},
      journal = {Nucleic Acids Research},
      volume = {51},
      number = {D1},
      pages = {D523-D531},
      year = {2022},
      month = {11},
      issn = {0305-1048},
      doi = {10.1093/nar/gkac1052},
      url = {https://doi.org/10.1093/nar/gkac1052}}
templates:
    - |-
      User: What are the binding site indices in this {#protein|amino-acid sequence|AA sequence|polypeptide!} {other#}?
      Assistant: The binding site indices are: {binding_sites#}.
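The template above uses two placeholder forms. As a rough illustration, assuming `{#a|b!}` means "pick one option at random" and `{column#}` means "insert the value of that column for the current row" (a reading inferred from how the templates are written; `fill_template` is a hypothetical helper, not part of this PR), the expansion could be sketched like this:

```python
import random
import re
from typing import Optional

# Assumed placeholder forms (inferred from the templates, not taken from the PR):
#   {#opt1|opt2!}  -> one option chosen at random
#   {column#}      -> value of `column` for the current row
CHOICE = re.compile(r"\{#([^{}]*)!\}")
FIELD = re.compile(r"\{(\w+)#\}")


def fill_template(template: str, row: dict, rng: Optional[random.Random] = None) -> str:
    """Expand random-choice groups first, then insert the row values."""
    rng = rng or random.Random()
    text = CHOICE.sub(lambda m: rng.choice(m.group(1).split("|")), template)
    # Row values are inserted last, so they are never re-parsed as placeholders.
    return FIELD.sub(lambda m: str(row[m.group(1)]), text)


template = "User: What are the binding site indices in this {#protein|polypeptide!} {other#}?"
print(fill_template(template, {"other": "MKT"}))
```

Each call may choose a different wording, which is presumably why the templates list several synonyms for the same noun.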
17 changes: 17 additions & 0 deletions data/tabular/uniprot_binding_sites/transform.py
@@ -0,0 +1,17 @@
import pandas as pd

FILENAME = "uniprot_binding_sites"


def load_dataset() -> pd.DataFrame:
    uniprot = pd.read_csv(
        f"https://huggingface.co/datasets/chemNLP/uniprot/resolve/main/{FILENAME}/data_clean.csv"  # noqa: E501
    )
    uniprot.rename(columns={"sequence": "other"}, inplace=True)
    uniprot.to_csv("data_clean.csv", index=False)
    print(f"Successfully loaded {FILENAME}!")
    return uniprot


if __name__ == "__main__":
    load_dataset()

Review comments on the `pd.read_csv` URL line:

Collaborator: Does this work with private repos? How does pandas know about the authentication?

Collaborator: At least I struggle, but maybe I am doing something wrong.
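On the review question about private repositories: a plain `pd.read_csv` on a URL sends no Hugging Face credentials by itself. One possible workaround, sketched here under assumptions, is to fetch the file with `hf_hub_download` from the `huggingface_hub` package and read the local copy. The helper names `remote_path` and `load_subset` and the use of an `HF_TOKEN` environment variable are illustrative choices, not what this PR implements.

```python
import os
from typing import Optional

REPO_ID = "chemNLP/uniprot"  # dataset repo named in the URL used by transform.py


def remote_path(subset: str) -> str:
    """Path of the cleaned CSV inside the dataset repo."""
    return f"{subset}/data_clean.csv"


def load_subset(subset: str, token: Optional[str] = None):
    """Download one subset with authentication and read it into a DataFrame."""
    # Imported lazily so remote_path() stays usable without these packages.
    import pandas as pd
    from huggingface_hub import hf_hub_download

    local_file = hf_hub_download(
        repo_id=REPO_ID,
        filename=remote_path(subset),
        repo_type="dataset",
        # Assumption: the access token is stored in the HF_TOKEN variable.
        token=token or os.environ.get("HF_TOKEN"),
    )
    return pd.read_csv(local_file)
```

A call such as `load_subset("uniprot_binding_sites")` would then replace the direct URL read, with the download cached locally by `huggingface_hub`.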
37 changes: 37 additions & 0 deletions data/tabular/uniprot_organisms/meta.yaml
@@ -0,0 +1,37 @@
---
name: uniprot_organisms
description: |-
    Organisms in which an amino-acid sequence can be found.
targets:
    - id: organisms
      description: organisms in which a protein can be found
      type: text
      names:
          - noun: organisms
identifiers:
    - id: other
      type: other
      description: other
license: MIT
links:
    - url: https://www.uniprot.org/
      description: data source
num_points: 560033
bibtex:
    - |-
      @article{10.1093/nar/gkac1052,
      author = {The UniProt Consortium},
      title = {UniProt - the Universal Protein Knowledgebase in 2023},
      journal = {Nucleic Acids Research},
      volume = {51},
      number = {D1},
      pages = {D523-D531},
      year = {2022},
      month = {11},
      issn = {0305-1048},
      doi = {10.1093/nar/gkac1052},
      url = {https://doi.org/10.1093/nar/gkac1052}}
templates:
    - |-
      User: In what organism can you find the following {#protein|amino-acid sequence|AA sequence|polypeptide!} {other#}?
      Assistant: The given {#polypeptide|protein|amino-acid sequence|AA sequence!} can be found in {organisms#}.
17 changes: 17 additions & 0 deletions data/tabular/uniprot_organisms/transform.py
@@ -0,0 +1,17 @@
import pandas as pd

FILENAME = "uniprot_organisms"


def load_dataset() -> pd.DataFrame:
    uniprot = pd.read_csv(
        f"https://huggingface.co/datasets/chemNLP/uniprot/resolve/main/{FILENAME}/data_clean.csv"  # noqa: E501
    )
    uniprot.rename(columns={"sequence": "other"}, inplace=True)
    uniprot.to_csv("data_clean.csv", index=False)
    print(f"Successfully loaded {FILENAME}!")
    return uniprot


if __name__ == "__main__":
    load_dataset()
38 changes: 38 additions & 0 deletions data/tabular/uniprot_reactions/meta.yaml
@@ -0,0 +1,38 @@
---
name: uniprot_reactions
description: |-
    Protein sequences and the reactions these can catalyze.
targets:
    - id: reactions
      description: biochemical reactions catalyzed by a protein
      type: text
      names:
          - noun: chemical reactions
          - noun: biochemical reactions
identifiers:
    - id: other
      type: other
      description: other
license: MIT
links:
    - url: https://www.uniprot.org/
      description: data source
num_points: 253713
bibtex:
    - |-
      @article{10.1093/nar/gkac1052,
      author = {The UniProt Consortium},
      title = {UniProt - the Universal Protein Knowledgebase in 2023},
      journal = {Nucleic Acids Research},
      volume = {51},
      number = {D1},
      pages = {D523-D531},
      year = {2022},
      month = {11},
      issn = {0305-1048},
      doi = {10.1093/nar/gkac1052},
      url = {https://doi.org/10.1093/nar/gkac1052}}
templates:
    - |-
      User: What {#biochemical|chemical|bio-chemical!} reactions can be catalyzed by the following {#protein|amino-acid sequence|AA sequence|polypeptide!}: {other#}?
      Assistant: The reactions that can be catalyzed by the given sequence are: {reactions#}.
17 changes: 17 additions & 0 deletions data/tabular/uniprot_reactions/transform.py
@@ -0,0 +1,17 @@
import pandas as pd

FILENAME = "uniprot_reactions"


def load_dataset() -> pd.DataFrame:
    uniprot = pd.read_csv(
        f"https://huggingface.co/datasets/chemNLP/uniprot/resolve/main/{FILENAME}/data_clean.csv"  # noqa: E501
    )
    uniprot.rename(columns={"sequence": "other"}, inplace=True)
    uniprot.to_csv("data_clean.csv", index=False)
    print(f"Successfully loaded {FILENAME}!")
    return uniprot


if __name__ == "__main__":
    load_dataset()
44 changes: 44 additions & 0 deletions data/tabular/uniprot_sentences/meta.yaml
@@ -0,0 +1,44 @@
---
name: uniprot_sentences
description: |-
    Descriptions of the function of a protein.
targets:
    - id: sentences
      description: sentences describing the function of a protein
      type: text
      names:
          - noun: function
identifiers:
    - id: other
      type: other
      description: other
license: MIT
links:
    - url: https://www.uniprot.org/
      description: data source
num_points: 464396
bibtex:
    - |-
      @article{10.1093/nar/gkac1052,
      author = {The UniProt Consortium},
      title = {UniProt - the Universal Protein Knowledgebase in 2023},
      journal = {Nucleic Acids Research},
      volume = {51},
      number = {D1},
      pages = {D523-D531},
      year = {2022},
      month = {11},
      issn = {0305-1048},
      doi = {10.1093/nar/gkac1052},
      url = {https://doi.org/10.1093/nar/gkac1052}}
templates:
    - |-
      User: Describe the {#function|biological function!} of the {#protein|amino-acid sequence|AA sequence|polypeptide!} {other#}.
      Assistant: {sentences#}.
    - |-
      User: What {#protein|amino-acid sequence|AA sequence|polypeptide!} best fits the {#function|biological function!} described in the next sentence(s)? {sentences#}
      Assistant: The {#protein|amino-acid sequence|AA sequence|polypeptide!} that best fits the described function is {other#}.
    - |-
      Task: {#Generate|Create|Come up with!} a {#AA sequence|protein!} based on the description.
      Descriptions: {sentences#}
      {#Output|Result!}: {other#}
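Because these templates are edited by hand, a brace group can easily end up missing its `#` or `!` marker. A small check along the following lines could flag such cases before the data is rendered. It assumes the two placeholder forms `{#a|b!}` and `{column#}` are the only valid ones here, and `check_template` is a hypothetical helper rather than anything shipped in this PR.

```python
import re

BRACE_GROUP = re.compile(r"\{[^{}]*\}")  # any {...} group
CHOICE = re.compile(r"\{#[^{}]+!\}")     # assumed form: {#opt1|opt2!}
FIELD = re.compile(r"\{(\w+)#\}")        # assumed form: {column#}


def check_template(template: str, known_ids: set) -> list:
    """Return a list of human-readable problems found in one template."""
    problems = []
    for group in BRACE_GROUP.findall(template):
        field = FIELD.fullmatch(group)
        if field:
            # A column placeholder must name a declared target or identifier.
            if field.group(1) not in known_ids:
                problems.append(f"unknown column: {group}")
        elif not CHOICE.fullmatch(group):
            problems.append(f"malformed placeholder: {group}")
    return problems


ids = {"sentences", "other"}
print(check_template("Task: {#Generate|Create!} a {AA sequence|protein}. {sentences#}", ids))
# prints ['malformed placeholder: {AA sequence|protein}']
```

The set of known ids would come from the `targets` and `identifiers` entries of the same `meta.yaml`.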
26 changes: 26 additions & 0 deletions data/tabular/uniprot_sentences/transform.py
@@ -0,0 +1,26 @@
import pandas as pd
import regex as re

FILENAME = "uniprot_sentences"


def remove_text_from_column(sentence: str) -> str:
    # Replace "(By similarity)" with empty string and remove extra spaces
    updated_text = re.sub(r"\s*\(By similarity\)", "", sentence)
    return updated_text


def load_dataset() -> pd.DataFrame:
    uniprot = pd.read_csv(
        f"https://huggingface.co/datasets/chemNLP/uniprot/resolve/main/{FILENAME}/data_clean.csv"  # noqa: E501
    )

    uniprot.rename(columns={"sequence": "other"}, inplace=True)
    uniprot["sentences"] = uniprot["sentences"].apply(remove_text_from_column)
    uniprot.to_csv("data_clean.csv", index=False)
    print(f"Successfully loaded {FILENAME}!")
    return uniprot


if __name__ == "__main__":
    load_dataset()
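To make the effect of the clean-up pattern concrete, here is a minimal standalone sketch that applies the same regular expression to a few made-up sentences. The example strings are invented for illustration and are not taken from the dataset. The standard-library `re` module is used here, since this particular pattern needs no features of the third-party `regex` package.

```python
import re

BY_SIMILARITY = re.compile(r"\s*\(By similarity\)")


def strip_by_similarity(sentence: str) -> str:
    """Drop every '(By similarity)' marker together with the whitespace before it."""
    return BY_SIMILARITY.sub("", sentence)


# Invented example sentences, not real UniProt annotations.
examples = [
    "Binds ATP (By similarity).",
    "Catalyzes step one (By similarity). Required for folding (By similarity).",
    "Has no evidence marker.",
]
for text in examples:
    print(strip_by_similarity(text))
# Binds ATP.
# Catalyzes step one. Required for folding.
# Has no evidence marker.
```

Because the pattern also consumes the leading whitespace, the sentence-ending period moves directly behind the last word instead of leaving a stray space.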