-
Notifications
You must be signed in to change notification settings - Fork 245
Use rdkit for SSSR and RCs (bug fix + Python upgrade)
#2796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
99bf00c to
8d6fe64
Compare
|
Cantera 2.6 isn't available for Python 3.12, so this PR will also need to upgrade the Cantera version to 3 as mostly completed in #2751 |
|
A note for the path forward on this PR - the |
|
I ended up moving all the functions in |
|
Addressed some of the failures, but still a few nontrivial challenges: TODO list:
The tests are just failing mainly for the aromatic compounds, indicating there's some issue with the implementation of the code. Since everything runs, the code "flow" seems clean, just need to iron out the implementation. |
|
We are now very close to passing all of the unit tests. There are just a couple major bugs to resolve, namely:
|
|
One of the last failing tests is @rwest, since you're the last remaining dev who was involved in some capacity with this part of the code, do you know why we have that assert statement? I would think it's better to just plot a "dot" than to outright crash out. While we're at it, the |
I don't. I can see that if there is no straight chain backbone that a function to
I haven't investigated the full stack trace. But some safeguard, or exception handling, sounds appropriate.
We can try it. I expect RDKit at the time wasn't being helpful. Hopefully the commit messages offer clues? (this is why we should write helpful commit messages that explain why things are being done). |
|
Is this related? |
regarding the draw unit test@rwest yes, thanks, #2744 looks relevant. The test that's failing is for this charged species: maybe failing due to the missing X-O connectivity, and so no backbone? Edit: I tried modifying the connectivity s.t. there is an apparent backbone. But, I still run into this problem even after changing the adjacency list to So it is more fundamentally a problem with the species , than a unit-test specific one. Note about sanitizationFor posterity: some molecules that fail the first kekulization step include: As implemented in this PR, the |
|
Regarding the The test case is drawn as a cyclic molecule, even though it's not truly "cyclic". So it should take the However, it thinks that So, how to proceed? a few options...
|
|
I think using RDKit for drawing as much as possible is fine, if it does a good job. Probably we ran into issues in the past that may not persist with more recent versions of RDKit, so it's fine to revisit past decisions. We should continue to put reasons for things in commit messages, to make life easier for future developers (our future selves included). |
|
OK. Maybe we can incorporate the fix into #2838 and then rebase onto |
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species can have issues due to ring perception conflicts. The previous implementation also by default did not use the rdkit backend for charged species, but this was decided many years ago (~10 years!) In the meantime, RDKit conformer generation has improved and likely this we can just use RDKit by default, which would avoid the pesky edge-case issues for ions/zwitterions. In case the old behavior is desired, use_rdkit can be set to False.
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species can have issues due to ring perception conflicts. The previous implementation also by default did not use the rdkit backend for charged species, but this was decided many years ago (~10 years!) In the meantime, RDKit conformer generation has improved and likely this we can just use RDKit by default, which would avoid the pesky edge-case issues for ions/zwitterions. In case the old behavior is desired, use_rdkit can be set to False.
7598c98 to
019362c
Compare
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species can have issues due to ring perception conflicts. The previous implementation also by default did not use the rdkit backend for charged species, but this was decided many years ago (~10 years!) In the meantime, RDKit conformer generation has improved and likely this we can just use RDKit by default, which would avoid the pesky edge-case issues for ions/zwitterions. In case the old behavior is desired, use_rdkit can be set to False.
6e6490a to
e92bec0
Compare
In #2744 and #2796 it was found that charge-separated bidentate species can have issues due to ring perception conflicts. The previous implementation also by default did not use the rdkit backend for charged species, but this was decided many years ago (~10 years!) In the meantime, RDKit conformer generation has improved and likely this we can just use RDKit by default, which would avoid the pesky edge-case issues for ions/zwitterions. In case the old behavior is desired, use_rdkit can be set to False.
|
With the merger of e92bec0 this PR's CI should now pass - let's see! |
|
@jonwzheng the overnight test failures look spurious - re-running them now. |
Fragments will sometimes call `get_smallest_set_of_smallest_rings` (e.g. for drawing), which will then call the _fragment_ version of `to_rdkit_mol` (rather than Molecule, since Fragment inherits from Molecule), which returns a _tuple_ rather than a _mol_. This causes a crash. I considerd just replacing this with `converter.to_rdkit_mol` without the checks, but then you'd lose out on any fragment-related benefits from to_rdkit_mol (for example, you need to replace the fragments with H atoms). This commit also adds a check so that the user is at least aware that the default behavior is to change the kwarg to forcibly return mapping=True for fragments.
Some compounds with resonance form resonance hybrids, which create non-integer bond orders that then call ring perception (via get_symmetry_number). Because non-integer bond orders are not recognized, we handle them as `unspecified`. Alternatively, the kekulization rules for RMG may sometimes differ from those of RDKit, which also logged a warning. For ring perception, these 'warnings' do not impact performance, and for nearly all users should not raise any concerns. So this demotes the logging level from `warning` to `debug`
Because atom.element is an Element object and not a string like "C", both the `if atom.element = "X"` statements returned False and we were testing nothing.
It fails. (The current algorithm only looks at rings smaller than or equal to 7 atoms.)
The "get_relevant_cycles" function does more work than needed, and currently only returns rings of 7 or fewer atoms. This replacement algorithm should be faster. The FastFindRings algorithm in RDKit is a fast depth-first search to just find if atoms are in rings (of a certain size). The blog post at https://rdkit.blogspot.com/2016/09/avoiding-unnecessary-work-and.html suggests this should be faster than a full "sanitize". The to_rdkit_mol might have scope for further optimization (eg. we don't care about bond orders, etc. etc.)
We were raising a string, not an exception.
…_cycle We already have this method. The one introduced in c49650c (then rewritten twice) doesn't seem necessarry. This seems to be twice as fast, for a certain test set of molecules. old: %%timeit for mol in mols: mol.identify_ring_membership() # via rdkit FastFindRings for a in mol.atoms: pass 4.41 ms ± 368 μs per loop (mean ± std. dev. of 7 runs, 100 loops each) new: %%timeit for mol in mols: for a in mol.atoms: a.props["inRing"] = mol.is_vertex_in_cycle(a) 2.17 ms ± 203 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
…thods. get_symmetrized_smallest_set_of_smallest_rings is now the only function that works and is appropriately named. get_smallest_set_of_smallest_rings and get_relevant_cycles now just raise NotImplementedError, because they are not actually implemented. This will break things until calls to those methods are replaced everywhere.
get_relevant_cycles and get_smallest_set_of_smallest_rings now just issue a deprecation warning, and return the result of get_symmetrized_smallest_set_of_smallest_rings.
This reverts commit 2d84bbc.
Hopefully nowhere NEEDED the "relevant cycles" which I think might sometimes be larger set than the symmetrized SSSR. Although it is still not, I think, the same as all possible cycles.
Now that this is the only ring-finding algorithm we have, we are forced to use it in bits of code that were previously using either the SSSR or the Relevant Cycles sets of rings.
Since get_smallest_set_of_smallest_rings no longer works, and get_symmetrized_smallest_set_of_smallest_rings is in most cases a more helpful set of rings anyway, we are using the new get_symmetrized_smallest_set_of_smallest_rings method. For some polycyclic species (like cubane) this set is larger. But probably better in most cases.
Because relevant_cycles no longer functions, we can no longer test that it returns atoms in a ring-like order. If we restore the relevant_cycles we should restore this test. I have left test_cycle_list_order_sssr in place, which tests the new get_symmetrized_smallest_set_of_smallest_rings
If you had remove_h = True and sanitize="partial", it would try passing that along to RDKit, which would not expect it. Now we remove H's (without sanitizing) before possibly doing the sanitizing.
So you don't need to create an instance of Fragment() just in order to call the method (seeing as that seems to be how it is mostly used.) Probably it should be a module level function, but this is a minimal change.
Now that we use RDKit to detect rings, we create a lot of rdkit molecules in which we don't care about the bond orders, sometimes we have non-integer or weird bond orders, and we also don't want to waste time sanitizing them since all we care about is connectivity of the molecular graph. So if you know you're calling to_rdkit_mol in this setting, call it with ignore_bond_orders=True and sanitize=False.
This should save complication and time. Also, don't sanitize. All we're going to ask it for is ring information. For that all bonds (that we give it) count. If we want it to ignore H-bonds, or vdW bonds, for some scenarios, then we should figure that out. But often we use this ring detection code for things like drawing, which does want to use those bonds.
This reverts commit 55f8fc4. We should, during ring detection, now be not attempting to convert bond orders into strings, so this warning should not be triggered. We can restore the debug level if this becomes too noisy.
We do a call to detect_cutting_label (which is a complicated function to find them buried inside SMILES strings) passing it the label of every atom. (Though maybe these are mostly ''?). This commit checks whether a simpler approach would work: just checking if the atom.label is either 'R' or 'L'. If that's the case, we can then make the change.
As far as I can tell, cutting labels can only be "L" or "R". I had a previous commit that checked this is equivalent, and by running the unit tests and regression tests could not find a counterexample. This check is much simpler, and about 200 times faster, than the previous call to detect_cutting_label. Also simplified the dictionary structure.
…rnings. Previously, if you call to_rdkit_mol on a Fragment, it would always return the mapping, but if you called it on a Molecule, it would by default not. Now if you call it with return_mapping=False (or not specified) then it doesn't return the mapping, and if you call it with return_mapping=True, then it does. Internally, it calls converter.to_rdkit_mol with return_mapping=True, but doesn't pass it back unless asked. Now the behavior of thing.to_rdkit_mol() is the same whether thing is a Molecule or a Fragment (which is a subclass of Molecule). Should reduce number of warnings, eg. when drawing fragments. Also cleaned up the code a bit, and added some docstrings,
…_rdkit_mol The latest change of to_rdkit_mol means that both Molecule and Fragment return the same signature, and the calling code doesn't need to cope with unexpected tuples.
The comment suggests it was going through Geometry in order to save the mapping, but this can be done with a direct call to to_rdkit_mol using return_mapping=True.
08bc181 to
37fd69e
Compare
|
Here's the short summary of test failures following the latest rebase: |
|
The rate constant imprecision ones might be fixed by 116301d |

Currently we use
RingDecomposerLibfor finding the Smallest Set of Smallest Rings and getting the Relevant Cycles. This package does not support Python 3.10+ and is thus blocking further upgrades to RMG.@KnathanM in particular is looking to get RMG to Python 3.11 so as to add support for ChemProp v2.
I believe we can just use RDKit to do these operations instead. The original paper mentions that the functionality was being moved upstream to RDKit. With the help of AI I've taken just a first pass at reimplementing, with the special note that:
get_deterministic_sssris not really deterministic #2562This PR will be a draft for now, as it is predicated on Python 3.9 already being available (which it nearly is in #2741)
Motivation or Problem
A clear and concise description of what what you're trying to fix or improve. Please reference any issues that this addresses.
Description of Changes
A clear and concise description of what what you've changed or added.
Testing
A clear and concise description of testing that you've done or plan to do.
Reviewer Tips
Suggestions for verifying that this PR works or other notes for the reviewer.