Molecule Query Notebook issue and stereochemistry #4

loopy321 · 2024-11-11T22:00:19Z

Wonderful tool! Thanks for the hard work here. In working with the notebooks, I noticed the following issues that I hope I can help address.

The problem arises when using molecule_query.ipynb to locate certain molecules as a demonstration.

For example, I tried to locate Molnupiravir, which is in US11331331 (claims 1-9) and is also part of PatCID, using molecule_query.ipynb with the SMILES:
CC(C)C(=O)OC[C@@H]1[C@H]([C@H]([C@@H](O1)N2C=CC(=NC2=O)NO)O)O
The notebook fails in the last cell at:

# Get page index of query molecule 
for figure in patent_entry["figures"]:
    if not("smiles" in figure):
        continue 
    if (figure["smiles"]["value"] == query_smiles):
        page_index = figure["page"]
        
# Convert pdf to page image
pages = convert_from_path(os.getcwd() + f"/../data/pdfs/{query_patent}.pdf", dpi=200, first_page=page_index, last_page=page_index)
pages = [None]*(page_index-1) + pages + [None]*(len(patent_entry["page-dimensions"])-page_index)

with:
NameError: name 'page_index' is not defined
This happens because there is no matching entry in either of the two .jsonl data files, so page_index is never assigned.
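To make the failure mode concrete, here is a minimal sketch of a guarded lookup that reports a missing match instead of raising a NameError. The helper name `find_page_index` and the sample `patent_entry` are hypothetical; only the `figures`, `smiles`, `value`, and `page` keys come from the cell quoted above. Unlike the original loop, this version returns the first match rather than the last.

```python
from typing import Any, Dict, Optional


def find_page_index(patent_entry: Dict[str, Any], query_smiles: str) -> Optional[int]:
    """Return the page of the first figure whose SMILES equals query_smiles, else None."""
    for figure in patent_entry.get("figures", []):
        if "smiles" not in figure:
            continue  # figure has no SMILES annotation
        if figure["smiles"]["value"] == query_smiles:
            return figure["page"]
    return None


# Hypothetical entry that mimics the structure used in the notebook cell.
patent_entry = {
    "figures": [
        {"page": 3},  # no "smiles" key, skipped
        {"page": 7, "smiles": {"value": "CC(C)C(=O)OCC1OC(N2C=CC(NO)=NC2=O)C(O)C1O"}},
    ]
}

stereo_query = "CC(C)C(=O)OC[C@@H]1[C@H]([C@H]([C@@H](O1)N2C=CC(=NC2=O)NO)O)O"
page_index = find_page_index(patent_entry, stereo_query)
if page_index is None:
    print("Query SMILES not found in this patent entry; skipping PDF conversion.")
```

With a check like this, the PDF conversion step would only run when a page was actually found.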

If I grep manually, it fortuitously matches:

There appear to be important differences between the SMILES entries in data/patcid/patcid_patent_to_molecules_{office}.jsonl
and those in data/patcid/patcid_molecule_to_patents.jsonl. At the very least, they differ in the use of letter case to designate bond order in aromatic rings.

In addition, stereochemistry could be lost when grepping for a SMILES string that contains [ and @ characters without accounting for them.

Note that if I change query_smiles in the last cell of the notebook to molnupiravir with the stereochemistry removed (query_smiles_mod = "CC(C)C(=O)OCC1OC(N2C=CC(NO)=NC2=O)C(O)C1O"), it executes and finds the molecule in the document:

As you can see, the drawing has the stereochemistry that is absent from the modified SMILES query. I believe MolGrapher can extract this.

My thought is to instead use a query structure based on InChI or InChIKey. As you can see below, an InChIKey lookup would find the molecule if the second block, which carries the stereochemistry terms, is truncated:

patcid_molecule_to_patents.jsonl:3247652: 
"HTNPEHXGEKVIHG-UHFFFAOYSA-N"
vs.
"HTNPEHXGEKVIHG-QCNRFFRDSA-N"

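The comparison of the two keys can be sketched with plain string handling. This assumes only the standard InChIKey layout, in which the first 14-character block hashes the connectivity and the second block hashes the remaining layers, including stereochemistry. The function names are hypothetical, and the two keys are the ones quoted in this issue.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey."""
    return inchikey.split("-")[0]


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True when two InChIKeys share the same connectivity block."""
    return connectivity_block(key_a) == connectivity_block(key_b)


# Keys quoted in this issue: the first has no stereo layer, the second does.
flat_key = "HTNPEHXGEKVIHG-UHFFFAOYSA-N"
stereo_key = "HTNPEHXGEKVIHG-QCNRFFRDSA-N"

print(flat_key == stereo_key)               # full keys differ
print(same_skeleton(flat_key, stereo_key))  # first blocks agree
```

Indexing the dataset by the first block would let a stereo-annotated query still reach the stereo-free entry, at the cost of also matching other stereoisomers.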

lucas-morin · 2024-11-26T16:28:53Z

Hi @loopy321,

Thank you for your interest in PatCID!

As you pointed out, there is no stereo-chemistry stored in the current version of PatCID.
I modified the notebook molecule_query.ipynb to clarify this.

Also, you can now use query_smiles = "CC(C)C(=O)OC[C@@H]1[C@H]([C@H]([C@@H](O1)N2C=CC(=NC2=O)NO)O)O"
and its stereo-chemistry will be automatically removed before searching the dataset.

You are correct that MolGrapher could be configured to recognize stereo-chemistry, but it was not done when creating PatCID.
