Molecule.to/from_qcschema does not round trip #720

Open
@trevorgokey

Describe the bug
The toolkit's Molecule.to_qcschema exports a schema that represents a QCArchive molecule. The corresponding from_qcschema, however, expects an Entry object, which carries the CMILES identifiers. This prevents a round trip to/from qcschema.

In [1]: from openforcefield.topology.molecule import Molecule
In [2]: mol = Molecule.from_smiles("CC")
In [3]: mol.generate_conformers()
In [4]: qcmol = mol.to_qcschema()
In [5]: Molecule.from_qcschema(qcmol)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/projects/openforcefield/openforcefield/topology/molecule.py in from_qcschema(cls, qca_record, client, toolkit_registry, allow_undefined_stereo)
   4458         try:
-> 4459             mapped_smiles = qca_record["attributes"][
   4460                 "canonical_isomeric_explicit_hydrogen_mapped_smiles"

KeyError: 'attributes'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-d24f74fea5e4> in <module>
----> 1 Molecule.from_qcschema(qcmol)

~/projects/openforcefield/openforcefield/topology/molecule.py in from_qcschema(cls, qca_record, client, toolkit_registry, allow_undefined_stereo)
   4462         except KeyError:
   4463             raise KeyError(
-> 4464                 "The record must contain the hydrogen mapped smiles to be safely made from the archive."
   4465             )
   4466 

KeyError: 'The record must contain the hydrogen mapped smiles to be safely made from the archive.'

I think we will need to make some decisions here since this really defines how we interact with QCArchive.

I see two options:

  1. Trust the QCArchive molecule's geometry, symbols, and connectivity, and build the molecule from that. This will not work if there is not a linear mapping of symbols to geometry.

  2. Put the CMILES information into the extras field in the QCArchive molecule. This is what has recently been done for, e.g., running MM jobs in QCEngine.

The first is probably more "appropriate", but it is not the lowest hanging fruit. The lowest hanging fruit is 2, but then it would be the last nail in the coffin of making an extra attribute absolutely essential in our "data standards".

Another option, somewhere in between, is using the Identifiers object (https://github.com/MolSSI/QCElemental/blob/master/qcelemental/models/molecule.py#L62) in the molecules, which is designed specifically to hold such things. Right now, the default in the new data submissions is to hold just the hash and the Hill formula (which makes sense for a quantum molecule), so we can make an effort to get our CMILES included there as standard.
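To make the options concrete, here is a minimal sketch using plain dicts as stand-ins for the QCArchive objects. The helper `find_mapped_smiles` is hypothetical and not part of the toolkit. The key name comes from the traceback above, but storing it under that same key in `extras` or `identifiers` is an assumption for illustration.

```python
from typing import Any, Mapping, Optional

# Key name taken from the traceback above.
CMILES_KEY = "canonical_isomeric_explicit_hydrogen_mapped_smiles"


def find_mapped_smiles(record: Mapping[str, Any]) -> Optional[str]:
    """Hypothetical helper: return the mapped SMILES from the first place it is found.

    Checked in order: Entry-style "attributes", then molecule "extras",
    then molecule "identifiers". Returns None if none of them has it.
    """
    for field in ("attributes", "extras", "identifiers"):
        container = record.get(field) or {}
        if CMILES_KEY in container:
            return container[CMILES_KEY]
    return None


# Dict stand-in for a bare exported molecule (ethane); values are illustrative.
bare_molecule = {
    "symbols": ["C", "C", "H", "H", "H", "H", "H", "H"],
    "connectivity": [
        (0, 1, 1.0),
        (0, 2, 1.0), (0, 3, 1.0), (0, 4, 1.0),
        (1, 5, 1.0), (1, 6, 1.0), (1, 7, 1.0),
    ],
}

mapped_smiles = "[C:1]([C:2]([H:6])([H:7])[H:8])([H:3])([H:4])[H:5]"

# Option 2 from the list above: carry the CMILES in the molecule's extras.
molecule_with_extras = {**bare_molecule, "extras": {CMILES_KEY: mapped_smiles}}

print(find_mapped_smiles(bare_molecule))
print(find_mapped_smiles(molecule_with_extras))
```

With a lookup along these lines, the bare molecule yields nothing (the situation in the traceback), while a molecule carrying the CMILES in `extras` or `identifiers` could be round-tripped without an Entry object.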
