Molecule Query Notebook issue and stereochemistry #4

Open
loopy321 opened this issue Nov 11, 2024 · 1 comment


loopy321 commented Nov 11, 2024

Wonderful tool! Thanks for the hard work here. In working with the notebooks, I noticed the following issues that I hope I can help address.

The problem arises when using the molecule_query.ipynb to attempt to locate certain molecules as a demonstration.

For example, I tried to locate Molnupiravir, which is in US11331331 (claims 1-9) and is also part of PatCID, using molecule_query.ipynb with the SMILES:
CC(C)C(=O)OC[C@@H]1[C@H]([C@H]([C@@H](O1)N2C=CC(=NC2=O)NO)O)O
The notebook fails in the last cell at:

# Get page index of query molecule 
for figure in patent_entry["figures"]:
    if not("smiles" in figure):
        continue 
    if (figure["smiles"]["value"] == query_smiles):
        page_index = figure["page"]
        
# Convert pdf to page image
pages = convert_from_path(os.getcwd() + f"/../data/pdfs/{query_patent}.pdf", dpi=200, first_page=page_index, last_page=page_index)
pages = [None]*(page_index-1) + pages + [None]*(len(patent_entry["page-dimensions"])-page_index)

with:
NameError: name 'page_index' is not defined
This happens because there is no matching entry in either of the two .jsonl data files, so page_index is never assigned inside the loop.
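
A minimal sketch of how the lookup could report a missing match instead of failing later with a NameError. The helper name find_page_index and the toy entry are hypothetical and not part of the notebook; unlike the original loop, which keeps the last match, this version returns the first one.

```python
def find_page_index(patent_entry, query_smiles):
    """Return the page of the first figure whose SMILES equals query_smiles, else None."""
    for figure in patent_entry["figures"]:
        if "smiles" not in figure:
            continue
        if figure["smiles"]["value"] == query_smiles:
            return figure["page"]
    return None


# Toy entry with the same shape as the notebook's patent_entry
# (values invented purely for illustration).
toy_entry = {
    "figures": [
        {"page": 3},
        {"page": 7, "smiles": {"value": "CCO"}},
    ]
}

found_page = find_page_index(toy_entry, "CCO")
missing_page = find_page_index(toy_entry, "C[C@H](N)O")

if missing_page is None:
    print("Query SMILES not found in this patent entry; skipping PDF conversion.")
```

Checking the return value for None before calling convert_from_path would make the "no match" case explicit to the user.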

If I grep for it manually, it fortuitously matches:
[image]

There appear to be important differences between the SMILES entries in data/patcid/patcid_patent_to_molecules_{office}.jsonl
and the entries in data/patcid/patcid_molecule_to_patents.jsonl. At the very least, they differ in the use of case to designate bond order in aromatic rings.

In addition, stereochemistry could be lost when grep is used with a SMILES string containing [ ] and @ characters without accounting for these characters.
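
To illustrate the grep concern, here is a small sketch using Python's re module as a stand-in for grep (an assumption: grep's default syntax treats parentheses differently, but bracket expressions behave the same way). An unescaped bracket atom such as [C@@H] is read as a character class, "one of C, @, H", rather than as literal text.

```python
import re

stereo_atom = "[C@@H]"  # one chiral atom token from the query SMILES

# Unescaped, the brackets form a character class matching a single C, @ or H.
unescaped_matches_plain_text = re.search(stereo_atom, "CH4") is not None
unescaped_matches_itself = re.fullmatch(stereo_atom, stereo_atom) is not None

# Escaped, every character is matched literally.
escaped = re.escape(stereo_atom)
escaped_matches_plain_text = re.search(escaped, "CH4") is not None
escaped_matches_itself = re.fullmatch(escaped, stereo_atom) is not None

print(unescaped_matches_plain_text, unescaped_matches_itself)
print(escaped_matches_plain_text, escaped_matches_itself)
```

On the command line, grep -F (fixed-string matching) avoids the problem by treating the whole SMILES as literal text.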

Note that if I change query_smiles in the last cell of the notebook to molnupiravir with the stereochemistry removed, query_smiles_mod = "CC(C)C(=O)OCC1OC(N2C=CC(NO)=NC2=O)C(O)C1O", the cell executes and finds the molecule in the document:
[image]

As you can see, the structure in the document has the stereochemistry that is absent from the modified SMILES query. I believe MolGrapher can extract this.

My thought is to instead use a query data structure based on InChI or InChIKey. As you can see below, an InChIKey lookup would still find the molecule if the stereochemistry terms (2nd block) are truncated:

patcid_molecule_to_patents.jsonl:3247652: 
"HTNPEHXGEKVIHG-UHFFFAOYSA-N"
vs.
"HTNPEHXGEKVIHG-QCNRFFRDSA-N"

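The InChIKey idea can be sketched as a comparison on the first block only, using the two keys quoted above. The helper name connectivity_block is hypothetical; the sketch assumes the standard InChIKey layout, in which the first hyphen-separated block hashes the connectivity and the second block carries stereochemistry among other layers.

```python
def connectivity_block(inchikey):
    """Return the first hyphen-separated block of an InChIKey."""
    return inchikey.split("-")[0]


key_without_stereo = "HTNPEHXGEKVIHG-UHFFFAOYSA-N"
key_with_stereo = "HTNPEHXGEKVIHG-QCNRFFRDSA-N"

same_full_key = key_without_stereo == key_with_stereo
same_connectivity = (
    connectivity_block(key_without_stereo) == connectivity_block(key_with_stereo)
)

print(same_full_key, same_connectivity)
```

Indexing the dataset by this first block would let a stereo-specific query retrieve the stereo-free entry, at the cost of also returning other stereoisomers.
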
lucas-morin (Collaborator) commented

Hi @loopy321,

Thank you for your interest in PatCID!

As you pointed out, no stereochemistry is stored in the current version of PatCID.
I modified the notebook molecule_query.ipynb to clarify this.

Also, you can now use query_smiles = "CC(C)C(=O)OC[C@@H]1[C@H]([C@H]([C@@H](O1)N2C=CC(=NC2=O)NO)O)O"
and its stereochemistry will be automatically removed before searching in the dataset.
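
For readers curious what removing stereochemistry from a SMILES string involves, here is a rough text-level sketch. It is an illustration only and not the notebook's actual implementation; the function name strip_stereo_marks is hypothetical, and it handles only the @ / @@ chirality marks and the / and \ bond markers.

```python
import re


def strip_stereo_marks(smiles):
    """Remove @/@@ chirality marks and directional bond markers from a SMILES string."""
    # Chirality marks only occur inside bracket atoms, so edit those tokens only.
    without_chirality = re.sub(
        r"\[[^\]]*\]", lambda m: m.group(0).replace("@", ""), smiles
    )
    # "/" and "\" encode double-bond geometry; dropping them leaves plain single bonds.
    return without_chirality.replace("/", "").replace("\\", "")


query_smiles = "CC(C)C(=O)OC[C@@H]1[C@H]([C@H]([C@@H](O1)N2C=CC(=NC2=O)NO)O)O"
flat_smiles = strip_stereo_marks(query_smiles)
print(flat_smiles)
```

The result describes the same connectivity without stereo information, but it is not canonicalized, so it would not necessarily be string-identical to the entry stored in the dataset. A cheminformatics toolkit would still be needed to put both sides in the same canonical form before comparing.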

You are correct that MolGrapher could be configured to recognize stereochemistry, but this was not done when creating PatCID.
