-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uniprot #470
Merged
Merged
Uniprot #470
Changes from 5 commits
Commits
Show all changes
47 commits
Select commit
Hold shift + click to select a range
d079c4f
example processing dataset
AdrianM0 cf676cc
processing notebook
AdrianM0 700cd1e
feat: minor dataframe construction adjustments
AdrianM0 58639c5
first version templates
AdrianM0 56fecb5
remove lint
eae4fd0
feat: working templates + working data extraction
AdrianM0 3b824d6
fix: data link
AdrianM0 ed6b532
lint
AdrianM0 75c0273
fix lines
AdrianM0 388e822
pre-commit
AdrianM0 e511385
separate files and read from hugging face
AdrianM0 d79774a
remove full dataset files
AdrianM0 1e915c1
Update data/tabular/uniprot_organisms/meta.yaml
AdrianM0 526f866
add sentences clean-up regex + new text template
AdrianM0 4461e4f
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 e3c1c25
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 14af762
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 ea25d27
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 3584439
fix space
AdrianM0 b086b61
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 b333c2c
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 d097e11
Update data/tabular/uniprot_reactions/meta.yaml
AdrianM0 0e79a7b
Update data/tabular/uniprot_sentences/meta.yaml
AdrianM0 c2a191d
hugging face downloader
AdrianM0 6ed8ca6
Merge branch 'uniprot' of https://github.com/AdrianM0/chemnlp into un…
AdrianM0 6e9f2bd
Update data/tabular/uniprot_organisms/meta.yaml
kjappelbaum d515a0a
Update data/tabular/uniprot_binding_sites/meta.yaml
kjappelbaum c87127e
fix: exclude_from_standard_tabular_text_templates
MicPie aead83f
fix: templates uniprot_binding_sites
MicPie b4361c9
fix: templates uniprot_organisms
MicPie 30eb7fb
fix: templates uniprot_binding_sites 2
MicPie e03704a
fix: templates uniprot_reactions
MicPie fe04f4e
fix: templates uniprot_sentences
MicPie c8d56f7
feat: inverse design template uniprot_reactions
MicPie 3d593c8
Merge branch 'main' into uniprot
MicPie b33878f
Apply suggestions from code review
MicPie 073f6ab
feat: add uniprot benchmarking tasks
MicPie c37f116
additional clean-up
AdrianM0 04b36d2
clean By . similarity strings
AdrianM0 a1fd93a
fix column id for identifier
AdrianM0 52dd9af
fix column id for identifier
AdrianM0 8f3eff9
file cleanup
AdrianM0 40a66dc
replace templates and transform with new tasks/data
AdrianM0 0c3c5d8
type change to int in column
AdrianM0 f309c28
final fixing, working templates
AdrianM0 d5c4ca6
updates templates and types
c6efe2e
remove lint
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
--- | ||
name: uniprot | ||
description: |- | ||
Protein sequences, the reaction these can catalyze and the descriptions of the chemical reaction. | ||
targets: | ||
- id: sentences | ||
description: sentences describing the catalytic activity of a protein | ||
names: | ||
- noun: catalytic activity | ||
- id: reactions | ||
description: biochemical reactions catalyzed by a protein | ||
names: | ||
- noun: chemical reactions | ||
- noun: biochemical reactions | ||
identifiers: | ||
- id: sequence | ||
type: sequence | ||
description: sequence | ||
license: MIT | ||
links: | ||
- url: https://www.uniprot.org/ | ||
description: data source | ||
num_points: 542630 | ||
bibtex: | ||
- |- | ||
@article{10.1093/nar/gkac1052, | ||
author = {The UniProt Consortium}, | ||
title = {UniProt - the Universal Protein Knowledgebase in 2023}, | ||
journal = {Nucleic Acids Research}, | ||
volume = {51}, | ||
number = {D1}, | ||
pages = {D523-D531}, | ||
year = {2022}, | ||
month = {11}, | ||
issn = {0305-1048}, | ||
doi = {10.1093/nar/gkac1052}, | ||
url = {https://doi.org/10.1093/nar/gkac1052}} | ||
templates: | ||
- |- | ||
User: Describe the {sentences__names__noun} of the {#protein|amino-acid sequence|AA sequence|polypeptide!} {sequence#}? | ||
Assistant: {sentences#}. | ||
- |- | ||
User: What biochemical reactions can be catalyzed by the following {#protein|amino-acid sequence|AA sequence|polypeptide!}? | ||
Assistant: The biochemical reactions that can be catalyzed by the given sequence are: {reactions#} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,296 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 23, | ||
"id": "702d78bd-ba4b-4b47-8644-c62df4b21c0a", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import json\n", | ||
"from typing import Dict\n", | ||
"from urllib.request import urlopen \n", | ||
"import regex as re\n", | ||
"import pandas as pd" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "9a2657ec-b30c-4a2d-ba64-3a326c5447cc", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Read json from url example\n", | ||
"\n", | ||
"def read_json(url :str):\n", | ||
"\n", | ||
" read_json_from_url = urlopen(url)\n", | ||
" data_json = json.loads(read_json_from_url.read()) \n", | ||
" return data_json\n", | ||
"\n", | ||
"\n", | ||
"def read_json_to_list_of_proteins(path : str):\n", | ||
" \"\"\"Reads json file and returns a dictionary\n", | ||
" Example usage:\n", | ||
" proteins = read_json_to_list_of_proteins(\"uniprotkb_AND_reviewed_true_2023_10_30.json\")\n", | ||
" \"\"\"\n", | ||
" uniprot_json = json.load(open(path)) \n", | ||
" proteins = uniprot_json[\"results\"]\n", | ||
" return proteins\n", | ||
"\n", | ||
"\n", | ||
"def remove_pubchem_med_from_text(string : str):\n", | ||
" pattern = r'\\(PubMed:\\d+(?:,\\s*PubMed:\\d+)*\\)'\n", | ||
" string = re.sub(pattern, '', string)\n", | ||
" return string" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "07e8b508-26af-4dff-b026-3c520bb23d8c", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKEQENARIQSKLSDLQKKKIDIDNKLLKEKQNLIKEEILERKKLEVLTKKQQKDEIEHQKKLKREIDAIKASTQYITDVSISSYNNTIPETEPEYDLFISHASEDKEDFVRPLAETLQQLGVNVWYDEFTLKVGDSLRQKIDSGLRNSKYGTVVLSTDFIKKDWTNYELDGLVAREMNGHKMILPIWHKITKNDVLDYSPNLADKVALNTSVNSIEEIAHQLADVILNR'" | ||
] | ||
}, | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"example = read_json(\"https://rest.uniprot.org/uniprotkb/A0A009IHW8.json\")\n", | ||
"example[\"sequence\"][\"value\"]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ca8435e6-38e1-48ec-96ce-900e4d7c02eb", | ||
"metadata": {}, | ||
"source": [ | ||
"### Example" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 65, | ||
"id": "f859045f-031a-4fb3-8fcc-9b3b2340c729", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"CONSIDERED_COMMENTS = [\n", | ||
" \"CATALYTIC ACTIVITY\",\n", | ||
" \"FUNCTION\"\n", | ||
" \"DOMAIN\"\n", | ||
"]\n", | ||
"\n", | ||
"\n", | ||
"def extract_catalyzed_reactions(json_dict : Dict):\n", | ||
" # Define lists\n", | ||
" sentences : List = []\n", | ||
" reactions : List = []\n", | ||
" \n", | ||
" comments = json_dict[\"comments\"]\n", | ||
" for i in range(len(comments)):\n", | ||
" if comments[i][\"commentType\"] == \"FUNCTION\":\n", | ||
" sentence_description = json_dict[\"comments\"][i][\"texts\"][i][\"value\"]\n", | ||
" sentence_description = remove_pubchem_med_from_text(sentence_description)\n", | ||
" sentences.append(sentence_description)\n", | ||
" if comments[i][\"commentType\"] == \"CATALYTIC ACTIVITY\": \n", | ||
" catalyzed_reaction = json_dict[\"comments\"][i][\"reaction\"][\"name\"]\n", | ||
" ### Replace = with -> for latex\n", | ||
" catalyzed_reaction = catalyzed_reaction.replace(\"=\", \"->\")\n", | ||
" reactions.append(catalyzed_reaction)\n", | ||
" return list(set(sentences)), list(set(reactions))\n", | ||
"\n", | ||
"\n", | ||
"def extract_binding_sites(json_dict : Dict):\n", | ||
" binding_sites = [i[\"location\"][\"start\"][\"value\"] for i in example[\"features\"] if i[\"type\"] == \"Binding site\"] \n", | ||
" return binding_sites\n", | ||
"\n", | ||
"\n", | ||
"def extract_all(json_dict):\n", | ||
" # try:\n", | ||
" sequence = json_dict[\"sequence\"][\"value\"]\n", | ||
" sentences, reactions = extract_catalyzed_reactions(json_dict)\n", | ||
" binding_sites = extract_binding_sites(json_dict)\n", | ||
" return {\"sequence\" : sequence,\n", | ||
" \"reactions\" : '|'.join([str(i) for i in reactions]), \n", | ||
" \"sentences\" : sentences,\n", | ||
" \"binding sites\" : ','.join([str(i) for i in binding_sites])\n", | ||
" }\n", | ||
" # except KeyError:\n", | ||
" # print(\"err\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 13, | ||
"id": "6d7ed83d-bd6f-4b7c-9f89-71a3ed9a54c6", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"{'sequence': {'MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKEQENARIQSKLSDLQKKKIDIDNKLLKEKQNLIKEEILERKKLEVLTKKQQKDEIEHQKKLKREIDAIKASTQYITDVSISSYNNTIPETEPEYDLFISHASEDKEDFVRPLAETLQQLGVNVWYDEFTLKVGDSLRQKIDSGLRNSKYGTVVLSTDFIKKDWTNYELDGLVAREMNGHKMILPIWHKITKNDVLDYSPNLADKVALNTSVNSIEEIAHQLADVILNR': {'reactions': \"NAD(+) -> 2'cADPR + H(+) + nicotinamide|H2O + NADP(+) -> ADP-D-ribose 2'-phosphate + H(+) + nicotinamide|H2O + NAD(+) -> ADP-D-ribose + H(+) + nicotinamide\",\n", | ||
" 'binding sites': '143,172,202,245'}}}" | ||
] | ||
}, | ||
"execution_count": 13, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"extract_all(example)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 15, | ||
"id": "845867c8-6457-4c3f-b359-e247f1d3c757", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"proteins = read_json_to_list_of_proteins(\"uniprotkb_AND_reviewed_true_2023_10_30.json\")\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 66, | ||
"id": "3bb06419-8f01-4205-aa48-afebdfd68824", | ||
"metadata": { | ||
"scrolled": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"data = []\n", | ||
"\n", | ||
"for prot_id, prot_seq_json in enumerate(proteins):\n", | ||
" try:\n", | ||
" data.append({prot_id : extract_all(prot_seq_json)})\n", | ||
" except Exception:\n", | ||
" pass" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "1586f0d7-70ea-4bfe-b0e4-6800cbe3e435", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"df = pd.DataFrame(data)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 28, | ||
"id": "d295a250-d5e9-41fe-a3c4-84c47df9e5ff", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from collections import ChainMap" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 67, | ||
"id": "f6a7b9f3-bb41-41ba-b283-547907d23fa6", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"{}" | ||
] | ||
}, | ||
"execution_count": 67, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"# data = dict(ChainMap(*data))\n", | ||
"a = {}\n", | ||
"a\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 68, | ||
"id": "bcbc63f4-e586-4c1a-a5bf-4964161276d1", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"for dictionary in data:\n", | ||
" key = int(list(dictionary.keys())[0])\n", | ||
" value = list(dictionary.values())[0]\n", | ||
"\n", | ||
" a[key] = value" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 69, | ||
"id": "b432cca1-5582-4f5e-9c70-6af37f425559", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"reactions = [a[i][\"reactions\"] for i in list(a.keys())]\n", | ||
"sequences = [a[i][\"sequence\"] for i in list(a.keys())]\n", | ||
"sentences = [a[i][\"sentences\"] for i in list(a.keys())]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 72, | ||
"id": "2ccd268b-0902-45fe-89b1-6a83d2b3bcab", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"df = pd.DataFrame()\n", | ||
"\n", | ||
"df[\"sequence\"] = sequences\n", | ||
"df[\"sentences\"] = sentences\n", | ||
"df[\"reactions\"] = reactions" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 75, | ||
"id": "3666eb32-e26c-4a96-a37d-73e064d725ed", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"df.to_csv(\"reactions_sentences.csv\")\n" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.18" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
import pandas as pd | ||
|
||
|
||
# https://huggingface.co/datasets/chemNLP/uniprot/resolve/main/reactions_sentences.csv | ||
def load_dataset() -> pd.DataFrame: | ||
uniprot = pd.read_csv("reactions_sentences.csv") | ||
uniprot.to_csv("data_clean.csv", index=False) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this would still require update to something we can download |
||
return uniprot | ||
|
||
|
||
if __name__ == "__main__": | ||
print(len(load_dataset())) |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also get smth about the binding sites and the function description?