Wonderful tool! Thanks for the hard work here. In working with the notebooks, I noticed the following issues that I hope I can help address.
The problem arises when using the molecule_query.ipynb to attempt to locate certain molecules as a demonstration.
For example, when attempting to locate Molnupiravir which is in US11331331 (claims 1-9) and is also part of PatCID using the molecule_query.ipynb with SMILES: CC(C)C(=O)OC[C@@H]1[C@H]([C@H]([C@@H](O1)N2C=CC(=NC2=O)NO)O)O
The notebook fails in the last cell at:
    # Get page index of query molecule
    for figure in patent_entry["figures"]:
        if not("smiles" in figure):
            continue
        if (figure["smiles"]["value"] == query_smiles):
            page_index = figure["page"]
    # Convert pdf to page image
    pages = convert_from_path(os.getcwd() + f"/../data/pdfs/{query_patent}.pdf", dpi=200, first_page=page_index, last_page=page_index)
    pages = [None]*(page_index-1) + pages + [None]*(len(patent_entry["page-dimensions"])-page_index)
with: NameError: name 'page_index' is not defined, because there is no matching entry for the query SMILES in either of the two .jsonl data files.
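To illustrate why the NameError appears, here is a minimal sketch of the same lookup written so that a missing match is reported explicitly instead of leaving page_index unassigned. The helper name find_page_index and the sample patent_entry are hypothetical and only mimic the fields used in the cell above; this is not the notebook's actual code.

```python
def find_page_index(patent_entry, query_smiles):
    """Return the page of the first figure whose SMILES equals query_smiles, or None."""
    for figure in patent_entry.get("figures", []):
        smiles = figure.get("smiles")
        if smiles is None:
            # Figures without an extracted molecule carry no "smiles" key.
            continue
        if smiles.get("value") == query_smiles:
            return figure["page"]
    return None


# Hypothetical entry: only the keys read by the lookup are filled in.
patent_entry = {
    "figures": [
        {"page": 3},
        {"page": 12, "smiles": {"value": "CC(C)C(=O)OCC1OC(N2C=CC(NO)=NC2=O)C(O)C1O"}},
    ]
}

stereo_query = "CC(C)C(=O)OC[C@@H]1[C@H]([C@H]([C@@H](O1)N2C=CC(=NC2=O)NO)O)O"
page_index = find_page_index(patent_entry, stereo_query)
if page_index is None:
    # An exact string comparison fails because the stored SMILES has no stereo markers.
    print("No figure matches the query SMILES; nothing to render.")
```

With a check like this, the PDF conversion step would only run when a page was actually found.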
If I manually grep, it fortuitously matches:
There appear to be important differences between the SMILES entries in: data/patcid/patcid_patent_to_molecules_{office}.jsonl
and the entries in: data/patcid/patcid_molecule_to_patents.jsonl, including, at the very least, the use of case in designating bond order in aromatic rings.
In addition, stereochemistry could be lost when grep is used with a SMILES string containing [@]'s without accounting for these characters.
Note, if I change the "query_smiles" in the last cell of the notebook to molnupiravir with the stereochemistry removed: query_smiles_mod = "CC(C)C(=O)OCC1OC(N2C=CC(NO)=NC2=O)C(O)C1O" it executes and finds the molecule in the document:
Which as you can see has the stereochemistry that is absent from the modified SMILES query. I believe MolGrapher can extract this.
My thought is to instead use a data query structure incorporating InChI or InChIKey. As you can see below, InChIKey would find the molecule by truncating the stereochemistry terms (2nd block):
patcid_molecule_to_patents.jsonl:3247652:
"HTNPEHXGEKVIHG-UHFFFAOYSA-N"
vs.
"HTNPEHXGEKVIHG-QCNRFFRDSA-N"
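The comparison above could be sketched as follows: in a standard InChIKey, the first 14-character block hashes the connectivity, while stereochemistry contributes to the second block, so comparing only the first block ignores stereo. The function names below are hypothetical, and the sketch assumes the keys have already been computed elsewhere (for example with a cheminformatics toolkit); it is an illustration of the idea, not part of PatCID.

```python
def inchikey_skeleton(inchikey):
    """Return the first block of an InChIKey (the connectivity hash)."""
    return inchikey.split("-")[0]


def same_connectivity(key_a, key_b):
    """True when two InChIKeys share a first block, i.e. differ at most in the later blocks."""
    return inchikey_skeleton(key_a) == inchikey_skeleton(key_b)


# The two keys quoted above.
key_one = "HTNPEHXGEKVIHG-UHFFFAOYSA-N"
key_two = "HTNPEHXGEKVIHG-QCNRFFRDSA-N"

print(same_connectivity(key_one, key_two))
```

Indexing the dataset by the first block would let a stereo-annotated query still land on the stereo-free entry, at the cost of also matching any other stereoisomers.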
As you pointed out, there is no stereochemistry stored in the current version of PatCID.
I modified the notebook molecule_query.ipynb to clarify this.
Also, you can now use query_smiles = "CC(C)C(=O)OC[C@@H]1[C@H]([C@H]([C@@H](O1)N2C=CC(=NC2=O)NO)O)O"
and its stereochemistry will be automatically removed before searching in the dataset.
You are correct that MolGrapher could be configured to recognize stereochemistry, but this was not done when creating PatCID.