-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Molecule.to/from_qcschema does not round trip #720
Comments
This would be a good one to talk about in-person, but my thoughts briefly are:
Could you clarify whether you think the "nail in the coffin" is a good or bad thing? I (naively) think this is a good idea, but may not be remembering all the context around this.
We could bring this up to MolSSI, but I could see them having good cause to reject adding a widespread requirement for this -- Many of the atomic configurations that QM people study will defy connectivity tables and RDKit's molecule model. |
I am always up for discussing this area of the project :)
The common way we did this was (taken from the phenyl set)
That sounds good; I might issue a PR quite soon to get this settled. I think the best way forward in the short-term is always including it, since it is a required piece for
The Another aspect is that there is currently some initial discussion to make molecules mutable in the database (MolSSI/QCFractal#602), based on the argument that, since there is no validation done on anything in the
As for including CMILES into the standard data we need: MolSSI has already done their part in supplying the proper model to include CMILES data (unless we also require e.g. protomer fields). The effort I am referring to is that the |
This issue can be closed when the following is implemented in a PR:
Note: Implementation could be as simply as trying to load
|
I think this has been fixed? In [1]: from openff.toolkit.topology import Molecule
Warning: Unable to load toolkit 'OpenEye Toolkit'. The Open Force Field Toolkit does not require the OpenEye Toolkits, and can use RDKit/AmberTools instead. However, if you have a valid license for the OpenEye Toolkits, consider installing them for faster performance and additional file format support: https://docs.eyesopen.com/toolkits/python/quickstart-python/linuxosx.html OpenEye offers free Toolkit licenses for academics: https://www.eyesopen.com/academic-licensing
In [2]: mol = Molecule.from_smiles("CC")
In [3]: mol.generate_conformers()
In [4]: qcmol = mol.to_qcschema()
In [5]: Molecule.from_qcschema(qcmol)
Out[5]: Molecule with name '' and SMILES '[H][C]([H])([H])[C]([H])([H])[H]'
In [6]: Molecule.from_qcschema(qcmol) == mol
Out[6]: True
In [7]: qcmol
Out[7]: Molecule(name='C2H6', formula='C2H6', hash='ea1e725')
In [8]: qcmol.extras
Out[8]: {'canonical_isomeric_explicit_hydrogen_mapped_smiles': '[C:1]([C:2]([H:6])([H:7])[H:8])([H:3])([H:4])[H:5]'} |
IMO this is the only reasonable way to handle this if the QC stack is going to be supported as an input method, much like the toolkit trusts the geometry, symbols, and connectivity of an RDKit molecule.
I'm not sure the concern here; afaik QCElemental validates this on construction, and the elements must map to the geometry for any valid use of the molecule with qcengine, etc. >>> mol = qcel.models.Molecule(**{"symbols": ["He", "He"], "geometry": [0, 0, -3]})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/models/molecule.py", line 343, in __init__
from_schema(kwargs, nonphysical=nonphysical), dtype=kwargs["schema_version"], copy=False, np_out=True
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_schema.py", line 48, in from_schema
dcontig = contiguize_from_fragment_pattern(
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_schema.py", line 159, in contiguize_from_fragment_pattern
raise ValidationError("""dropped atoms! nat = {} != {}""".format(nat, ncgeom.shape[0]))
qcelemental.exceptions.ValidationError: dropped atoms! nat = 2 != 1 >>> mol = qcel.models.Molecule(**{"symbols": ["Hello", "World"], "geometry": [0, 0, -3, 0, 0, 3]})
Traceback (most recent call last):
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 77, in resolve_eliso
self._eliso2mass[atom.capitalize()] # type: ignore
KeyError: 'Hello'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 80, in resolve_eliso
E = self._z2el[int(atom)]
ValueError: invalid literal for int() with base 10: 'Hello'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 83, in resolve_eliso
E = self._element2el[atom.capitalize()] # type: ignore
KeyError: 'Hello'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/models/molecule.py", line 343, in __init__
from_schema(kwargs, nonphysical=nonphysical), dtype=kwargs["schema_version"], copy=False, np_out=True
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_schema.py", line 60, in from_schema
molrec = from_arrays(
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_arrays.py", line 356, in from_arrays
processed = validate_and_fill_nuclei(
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_arrays.py", line 680, in validate_and_fill_nuclei
*[
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/from_arrays.py", line 681, in <listcomp>
reconcile_nucleus(
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/nucleus.py", line 294, in reconcile_nucleus
offer_element_symbol(E)
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/molparse/nucleus.py", line 168, in offer_element_symbol
_Z = periodictable.to_Z(e, strict=True)
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 186, in to_Z
identifier = self._resolve_atom_to_key(atom, strict=strict)
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 94, in _resolve_atom_to_key
eliso = resolve_eliso(atom)
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/qcelemental/periodic_table.py", line 85, in resolve_eliso
raise NotAnElementError(atom)
qcelemental.exceptions.NotAnElementError: Hello Right now from the name of the function a new user would be very confused that they can't do this: >>> mol = qcel.models.Molecule(**{"symbols": ["He", "He"], "geometry": [0, 0, -3, 0, 0, 3]})
>>> Molecule.from_qcschema(mol)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/openff/toolkit/utils/utils.py", line 86, in wrapper
return function(*args, **kwargs)
File "/data/homezvol0/lilyw7/miniconda3/envs/psiresp-3.8/lib/python3.8/site-packages/openff/toolkit/topology/molecule.py", line 5443, in from_qcschema
raise KeyError(
KeyError: "The record must contain the hydrogen mapped smiles to be safely made from the archive. It is not present in either 'attributes' or 'extras' on the provided `qca_record`" And yet from RDKit: >>> rdmol = Chem.RWMol()
>>> rdmol.AddAtom(Chem.Atom("He"))
>>> rdmol.AddAtom(Chem.Atom("He"))
>>> Chem.SanitizeMol(rdmol)
>>> Molecule.from_rdkit(rdmol)
Molecule with name '' and SMILES '[He].[He]' |
Describe the bug
The toolkit molecule
to_qcschema
exports to a schema that represents a QCArchive molecule, however the correspondingfrom_qcschema
actually wants an Entry object, which has the CMILES identifiers. This prevents a round trip to/from qcschema.I think we will need to make some decisions here since this really defines how we interact with QCArchive.
I see two options:
Trust the QCArchive molecule's geometry, symbols, and connectivity, and build the molecule from that. This will not work if there is not a linear mapping of symbols to geometry.
Put the CMILES information into the
extras
field in the QCArchive molecule. This is what has recently been done to e.g. running MM jobs in QCEngine.The first is probably more "appropriate" but this is the not the lowest hanging fruit. The lowest hanging fruit is 2, but then it would be the last nail in the coffin of making an
extra
attribute absolutely essential in our "data standards".Another option that is slightly half-way is using the
Indentifiers
object (https://github.com/MolSSI/QCElemental/blob/master/qcelemental/models/molecule.py#L62) in the molecules, which is designed to specifically hold such things. Right now, the default in the new data submissions is just to hold the hash and the Hill formula (which makes sense for a quantum molecule), so we can make an effort to get our CMILES included there as standard.The text was updated successfully, but these errors were encountered: