Skip to content

Conversation

@JacksonBurns
Copy link
Contributor

Currently we use RingDecomposerLib for finding the Smallest Set of Smallest Rings and getting the Relevant Cycles. This package does not support Python 3.10+ and is thus blocking further upgrades to RMG.

@KnathanM in particular is looking to get RMG to Python 3.11 so as to add support for ChemProp v2.

I believe we can just use RDKit to do these operations instead. The original paper mentions that the functionality was being moved upstream to RDKit. With the help of AI I've taken just a first pass at reimplementing, with the special note that:

This PR will be a draft for now, as it is predicated on Python 3.9 already being available (which it nearly is in #2741)

Motivation or Problem

A clear and concise description of what what you're trying to fix or improve. Please reference any issues that this addresses.

Description of Changes

A clear and concise description of what what you've changed or added.

Testing

A clear and concise description of testing that you've done or plan to do.

Reviewer Tips

Suggestions for verifying that this PR works or other notes for the reviewer.

@JacksonBurns JacksonBurns marked this pull request as draft May 25, 2025 21:43
@JacksonBurns
Copy link
Contributor Author

Cantera 2.6 isn't available for Python 3.12, so this PR will also need to upgrade the Cantera version to 3 as mostly completed in #2751

@JacksonBurns
Copy link
Contributor Author

A note for the path forward on this PR - the get_relevant_cycles and get_smallest_set_of_smallest_rings functions will need to be moved to Molecule. Currently they are in Graph, and Graph has no way to send itself to RDKit in order to be subject to RDKit GetSymmSSSR but Molecule can be converted to RDKit via converted.to_rdkit_mol (also exposed as Molecule.to_rdkit_mol, and then RDKit can be used. This may have implications for inheritance elsewhere in the codebase, but should be OK.

Base automatically changed from feat/py39_rebase to main May 29, 2025 19:55
@jonwzheng jonwzheng self-assigned this Jun 30, 2025
@jonwzheng
Copy link
Contributor

I ended up moving all the functions in Graph that call get_relevant_cycles or get_smallest_set_of_smallest_rings to Molecule as well. Still have a couple of unit tests to update, but if this is the direction we want to take, hopefully this should be adequate

@jonwzheng
Copy link
Contributor

jonwzheng commented Jul 16, 2025

Addressed some of the failures, but still a few nontrivial challenges:

TODO list:

  • address e / electron representation if integrating to the new RDKit workflow
  • fix missing vdW bond order
  • make sure ring algorithms are implemented correctly
  • certain bond types like R need some sort of RDKit definition; or just never parse them into an RDKit mol object in the first places
  • how to represent certain surface bonds?
  • bond orders without definitions? issue from rdkit side or something else?
  • fill in remaining placeholder unit tests

The tests are just failing mainly for the aromatic compounds, indicating there's some issue with the implementation of the code. Since everything runs, the code "flow" seems clean, just need to iron out the implementation.

@jonwzheng
Copy link
Contributor

jonwzheng commented Jul 18, 2025

We are now very close to passing all of the unit tests. There are just a couple major bugs to resolve, namely:

  • how to handle the R/L cutting fragment labels in the RDKit mol object? need to replace with placeholder atom? also with surfaces?
  • some bond orders are weird now, so need to investigate where something is going wrong. Due to sanitization maybe?
  • I think we can undo the changes to AtomTypeTest
  • some surface glitches
  • can't kekulize some molecules: if this occurs, maybe back-up is to use increasingly strict flags? e.g. turn kekulization off, and finally turn all sanitize off?

@jonwzheng
Copy link
Contributor

One of the last failing tests is test_draw_hydrogen_bond_adsorbate, which actually fails due to an assertion error in the codebase:

https://github.com/ReactionMechanismGenerator/RMG-Py/blame/d0d95b3a6bb9cd5d873c8d5b552ec112d6ba9283/rmgpy/molecule/draw.py#L541

@rwest, since you're the last remaining dev who was involved in some capacity with this part of the code, do you know why we have that assert statement? I would think it's better to just plot a "dot" than to outright crash out.
I'm also not sure why this only popped up in this PR - maybe better to address any underlying problem in this code - but thought I'd ask, since this assert might cause problems in the future. There's no safeguards against passing a molecule with no backbone, unless there's something about how this is called that I'm not getting.

While we're at it, the _generate_coordinates also didn't use RDKit to draw if the charge != 0, but maybe that can be relaxed with new versions?

@rwest
Copy link
Member

rwest commented Jul 24, 2025

One of the last failing tests is test_draw_hydrogen_bond_adsorbate, which actually fails due to an assertion error in the codebase:

https://github.com/ReactionMechanismGenerator/RMG-Py/blame/d0d95b3a6bb9cd5d873c8d5b552ec112d6ba9283/rmgpy/molecule/draw.py#L541

@rwest, since you're the last remaining dev who was involved in some capacity with this part of the code, do you know why we have that assert statement?

I don't. I can see that if there is no straight chain backbone that a function to _find_straight_chain_backbone shouldn't work. It should either fail, and whatever called it deals with the exception (maybe a ValueError is better than AssertError, but either way it's an error to be handled). Or it could return None, but then the calling code would also have to be modified to handle that without crashing. Or the calling code shouldn't try calling it - but then it'd have to know there is no straight chain backbone (and perhaps the easiest way to find that out is to ask for it and fail.) So I think either way the calling code needs updating.

There's no safeguards against passing a molecule with no backbone, unless there's something about how this is called that I'm not getting.

I haven't investigated the full stack trace. But some safeguard, or exception handling, sounds appropriate.

While we're at it, the _generate_coordinates also didn't use RDKit to draw if the charge != 0, but maybe that can be relaxed with new versions?

We can try it. I expect RDKit at the time wasn't being helpful. Hopefully the commit messages offer clues? (this is why we should write helpful commit messages that explain why things are being done).

@rwest
Copy link
Member

rwest commented Jul 24, 2025

Is this related?

#2744

@jonwzheng
Copy link
Contributor

jonwzheng commented Jul 24, 2025

regarding the draw unit test

@rwest yes, thanks, #2744 looks relevant. The test that's failing is for this charged species:
image

maybe failing due to the missing X-O connectivity, and so no backbone?
I see the discussion in the thread. I'll move any further discussion about this to that issue.
Maybe a separate PR can fix that issue, merged in, and then this can be rebased onto main.

Edit: I tried modifying the connectivity s.t. there is an apparent backbone. But, I still run into this problem even after changing the adjacency list to

1  O u0 p3 c-1 {2,S} {10,H}
2  N u0 p1 c0 {1,S} {3,S} {4,S}
3  O u0 p2 c0 {2,S} {11,S}
4  O u0 p2 c0 {2,S} {7,S}
5  N u0 p1 c0 {6,S} {8,S} {9,S} {7,H}
6  O u0 p2 c0 {5,S} {10,S}
7  H u0 p0 c0 {4,S} {5,H}
8  H u0 p0 c0 {5,S}
9  H u0 p0 c0 {5,S}
10 H u0 p0 c0 {6,S} {1,H}
11 X u0 p0 c0 {3,S}

So it is more fundamentally a problem with the species , than a unit-test specific one.

Note about sanitization

For posterity: some molecules that fail the first kekulization step include:

Can't kekulize mol.  Unkekulized atoms
[H]C1=C([H])C([H])([H])C([H])([H])[c]([H])c1[c]([H])[H]
[H]C([H])=C([H])C1([H])[c]([H])c([H])[c]([H])C1([H])[H]
[H]C([H])=C1c([c]([H])[H])[c]([H])C([H])([H])C1([H])[H]
[H]C1=C([H])C([H])([H])C2([H])[c]([H])c([H])[c]([H])C2([H])C([H])=C1[H]
[H]C1~[C]([H])C2([H])[c](c([H])[c]~1[H])~C([H])~[C]([H])C2([H])C([H])([H])[H]
[H][c]1c([H])[c]([H])C([H])(C([H])([H])[H])C1([H])[H]
[H][c]([H])c1[c]([H])C([H])([H])c2c([H])c([H])c([H])c([H])c-12
[H][C]1~C([H])~[C]([H])C2([H])[C](C([H])([H])[c]([H])c2[c]([H])[H])~C~1[H]
[H][c]([H])c1[c]([H])C([H])([H])c2c([H])c([H])c([H])c([H])c-12
[H][c]1c([H])[c]([H])C([H])(C([H])([H])[H])C1([H])[H]
[H]C1~[C]([H])C2([H])[c](c([H])[c]~1[H])~C([H])~[C]([H])C2([H])[H]
[H][c]([H])c1[c]([H])C1([H])[H]
non-ring atom marked aromatic:
[H][C]~c[c]([H])C([H])([H])C1=C([H])C([H])=C([H])C([H])=C1[H]
[H]c1=c([H])-c([H])=C(C([H])([H])[c]([H])c([H])[c]([H])[H])c([H])=c-1[H]

As implemented in this PR, the RDKit sanitization step will fallback to not use kekulization.

@jonwzheng
Copy link
Contributor

jonwzheng commented Jul 24, 2025

Regarding the draw issue, I think I've figured out what's happening...

The test case is drawn as a cyclic molecule, even though it's not truly "cyclic". So it should take the find_cyclic_backbone branch rather than the find_straight_chain_backbone branch.

However, it thinks that len(self.cycles) = 0 because of the changes to ring perception, specifically molecule.get_smallest_set_of_smallest_rings: the H bonds are treated as RDKit hydrogen bonds which don't "count" toward the ring. So in RDKit world it's treated like a non-cyclic species, whereas RMG treats it as cyclic.

So, how to proceed? a few options...

  1. Prefer RDKit drawing: if ring perception is through RDKit, then drawing with RDKit is going to generally be safer. Probably the easiest option, as all we need to do is get rid of the check for if there's charged species. (I think it should work alright personally!)
  2. Change ring perception: instead of treating the H bonds as hydrogen bonds, treat them as an explicit bond?
  3. Create a separate function from SSSR that is used purely to see if there's "ring-like" structure regardless of the nature of the bonds, purely for drawing/graph structure type analysis.

@rwest
Copy link
Member

rwest commented Jul 24, 2025

I think using RDKit for drawing as much as possible is fine, if it does a good job. Probably we ran into issues in the past that may not persist with more recent versions of RDKit, so it's fine to revisit past decisions. We should continue to put reasons for things in commit messages, to make life easier for future developers (our future selves included).

@jonwzheng
Copy link
Contributor

OK. Maybe we can incorporate the fix into #2838 and then rebase onto main once it's merged in.

jonwzheng added a commit to kirkbadger18/RMG-Py that referenced this pull request Jul 25, 2025
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species
can have issues due to ring perception conflicts. The previous implementation
also by default did not use the rdkit backend for charged species, but
this was decided many years ago (~10 years!) In the meantime, RDKit
conformer generation has improved and likely this we can just use RDKit
by default, which would avoid the pesky edge-case issues for
ions/zwitterions.

In case the old behavior is desired, use_rdkit can be set to False.
jonwzheng added a commit to kirkbadger18/RMG-Py that referenced this pull request Jul 25, 2025
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species
can have issues due to ring perception conflicts. The previous implementation
also by default did not use the rdkit backend for charged species, but
this was decided many years ago (~10 years!) In the meantime, RDKit
conformer generation has improved and likely this we can just use RDKit
by default, which would avoid the pesky edge-case issues for
ions/zwitterions.

In case the old behavior is desired, use_rdkit can be set to False.
jonwzheng added a commit to kirkbadger18/RMG-Py that referenced this pull request Jul 25, 2025
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species
can have issues due to ring perception conflicts. The previous implementation
also by default did not use the rdkit backend for charged species, but
this was decided many years ago (~10 years!) In the meantime, RDKit
conformer generation has improved and likely this we can just use RDKit
by default, which would avoid the pesky edge-case issues for
ions/zwitterions.

In case the old behavior is desired, use_rdkit can be set to False.
JacksonBurns pushed a commit that referenced this pull request Aug 4, 2025
In #2744 and #2796 it was found that charge-separated bidentate species
can have issues due to ring perception conflicts. The previous implementation
also by default did not use the rdkit backend for charged species, but
this was decided many years ago (~10 years!) In the meantime, RDKit
conformer generation has improved and likely this we can just use RDKit
by default, which would avoid the pesky edge-case issues for
ions/zwitterions.

In case the old behavior is desired, use_rdkit can be set to False.
@JacksonBurns
Copy link
Contributor Author

With the merger of e92bec0 this PR's CI should now pass - let's see!

@JacksonBurns
Copy link
Contributor Author

@jonwzheng the overnight test failures look spurious - re-running them now.

jonwzheng and others added 27 commits October 21, 2025 16:22
Fragments will sometimes call `get_smallest_set_of_smallest_rings` (e.g.
for drawing), which will then call the _fragment_ version of
`to_rdkit_mol` (rather than Molecule, since Fragment inherits from
Molecule), which returns a _tuple_ rather than a _mol_. This causes
a crash.

I considerd just replacing this with `converter.to_rdkit_mol` without
the checks, but then you'd lose out on any fragment-related benefits
from to_rdkit_mol (for example, you need to replace the fragments with H
atoms).

This commit also adds a check so that the user is at least aware that
the default behavior is to change the kwarg to forcibly return
mapping=True for fragments.
Some compounds with resonance form resonance hybrids, which create
non-integer bond orders that then call ring perception (via
get_symmetry_number). Because non-integer bond orders are not
recognized, we handle them as `unspecified`. Alternatively, the
kekulization rules for RMG may sometimes differ from those of RDKit,
which also logged a warning.

For ring perception, these 'warnings' do not impact performance, and for
nearly all users should not raise any concerns. So this demotes the
logging level from `warning` to `debug`
Because atom.element is an Element object and not a string like "C",
both the `if atom.element = "X"` statements returned False
and we were testing nothing.
It fails.
(The current algorithm only looks at rings smaller than or equal to 7 atoms.)
The "get_relevant_cycles" function does more work than needed,
and currently only returns rings of 7 or fewer atoms.
This replacement algorithm should be faster.
The FastFindRings algorithm in RDKit is a fast depth-first search
to just find if atoms are in rings (of a certain size).

The blog post at 
https://rdkit.blogspot.com/2016/09/avoiding-unnecessary-work-and.html
suggests this should be faster than a full "sanitize".
The to_rdkit_mol might have scope for further optimization
(eg. we don't care about bond orders, etc. etc.)
We were raising a string, not an exception.
…_cycle

We already have this method.
The one introduced in c49650c 
(then rewritten twice) doesn't seem necessarry.

This seems to be twice as fast, for a certain test set of molecules.

old:
%%timeit
for mol in mols:
    mol.identify_ring_membership() # via rdkit FastFindRings
    for a in mol.atoms:
        pass
4.41 ms ± 368 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

new:
%%timeit
for mol in mols:
    for a in mol.atoms:
        a.props["inRing"] = mol.is_vertex_in_cycle(a)
2.17 ms ± 203 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
…thods.

get_symmetrized_smallest_set_of_smallest_rings is now the only function that works
and is appropriately named.

get_smallest_set_of_smallest_rings and get_relevant_cycles now just raise
NotImplementedError, because they are not actually implemented.

This will break things until calls to those methods are replaced everywhere.
get_relevant_cycles and get_smallest_set_of_smallest_rings
now just issue a deprecation warning, and return the result of
get_symmetrized_smallest_set_of_smallest_rings.
Hopefully nowhere NEEDED the "relevant cycles" which I think might 
sometimes be larger set than the symmetrized SSSR.
Although it is still not, I think, the same as all possible cycles.
Now that this is the only ring-finding algorithm we have, we are forced
to use it in bits of code that were previously using either the SSSR
or the Relevant Cycles sets of rings.
Since get_smallest_set_of_smallest_rings no longer works, and
get_symmetrized_smallest_set_of_smallest_rings is in most cases
a more helpful set of rings anyway, we are using the new
get_symmetrized_smallest_set_of_smallest_rings method.
For some polycyclic species (like cubane) this set is larger.
But probably better in most cases.
Because relevant_cycles no longer functions, we can no longer
test that it returns atoms in a ring-like order.
If we restore the relevant_cycles we should restore this test.
I have left test_cycle_list_order_sssr in place, which tests 
the new get_symmetrized_smallest_set_of_smallest_rings
If you had remove_h = True and sanitize="partial",
it would try passing that along to RDKit, which would not expect it.
Now we remove H's (without sanitizing) before possibly
doing the sanitizing.
So you don't need to create an instance of Fragment() just 
in order to call the method (seeing as that seems to be how
it is mostly used.)

Probably it should be a module level function, but this
is a minimal change.
Now that we use RDKit to detect rings, we create a lot of rdkit molecules
in which we don't care about the bond orders, sometimes we have non-integer
or weird bond orders, and we also don't want to waste time sanitizing them
since all we care about is connectivity of the molecular graph.
So if you know you're calling to_rdkit_mol in this setting, call it with
ignore_bond_orders=True and sanitize=False.
This should save complication and time. Also, don't sanitize.
All we're going to ask it for is ring information. 
For that all bonds (that we give it) count.
If we want it to ignore H-bonds, or vdW bonds, for some scenarios,
then we should figure that out. But often we use this ring detection 
code for things like drawing, which does want to use those bonds.
This reverts commit 55f8fc4.

We should, during ring detection, now be not attempting to convert bond
orders into strings, so this warning should not be triggered.
We can restore the debug level if this becomes too noisy.
We do a call to detect_cutting_label (which is a complicated function
to find them buried inside SMILES strings) passing it the label of
every atom. (Though maybe these are mostly ''?).

This commit checks whether a simpler approach would work: just
checking if the atom.label is either 'R' or 'L'.

If that's the case, we can then make the change.
As far as I can tell, cutting labels can only be "L" or "R".
I had a previous commit that checked this is equivalent, and 
by running the unit tests and regression tests could not find
a counterexample.
This check is much simpler, and about 200 times faster, than
the previous call to detect_cutting_label.
Also simplified the dictionary structure.
…rnings.

Previously, if you call to_rdkit_mol on a Fragment, it would always return
the mapping, but if you called it on a Molecule, it would by default not.

Now if you call it with return_mapping=False (or not specified) then it 
doesn't return the mapping, and if you call it with return_mapping=True,
then it does.

Internally, it calls converter.to_rdkit_mol with return_mapping=True,
but doesn't pass it back unless asked. Now the behavior of thing.to_rdkit_mol()
is the same whether thing is a Molecule or a Fragment (which is a subclass
of Molecule). 
Should reduce number of warnings, eg. when drawing fragments.

Also cleaned up the code a bit, and added some docstrings,
…_rdkit_mol

The latest change of to_rdkit_mol means that both Molecule and Fragment
return the same signature, and the calling code doesn't need to cope
with unexpected tuples.
The comment suggests it was going through Geometry in order to save
the mapping, but this can be done with a direct call to to_rdkit_mol
using return_mapping=True.
@JacksonBurns
Copy link
Contributor Author

Here's the short summary of test failures following the latest rebase:

=========================== short test summary info ============================
FAILED test/rmgpy/data/thermoTest.py::TestMolecularManipulationInvolvedInThermoEstimation::test_is_bicyclic1 - AssertionError: assert False
 +  where False = is_bicyclic([<Atom 'C'>, <Atom 'C'>, <Atom 'C'>, <Atom 'C'>, <Atom 'C'>, <Atom 'C'>, <Atom 'C'>, <Atom 'C'>])
FAILED test/rmgpy/data/thermoTest.py::TestMolecularManipulationInvolvedInThermoEstimation::test_deterministic_bicyclic_decomposition - AssertionError: assert False
 +  where False = <function is_bicyclic at 0x7fb63963b040>([<Atom 'C'>, <Atom 'C'>, <Atom 'C'>, <Atom 'C'>, <Atom 'C'>, <Atom 'C'>, <Atom 'C'>, <Atom 'C'>])
 +    where <function is_bicyclic at 0x7fb63963b040> = <module 'rmgpy.data.thermo' from '/home/runner/work/RMG-Py/RMG-Py/rmgpy/data/thermo.py'>.is_bicyclic
 +      where <module 'rmgpy.data.thermo' from '/home/runner/work/RMG-Py/RMG-Py/rmgpy/data/thermo.py'> = <module 'rmgpy.data' from '/home/runner/work/RMG-Py/RMG-Py/rmgpy/data/__init__.py'>.thermo
 +        where <module 'rmgpy.data' from '/home/runner/work/RMG-Py/RMG-Py/rmgpy/data/__init__.py'> = rmgpy.data
FAILED test/rmgpy/reactionTest.py::TestReactionToCantera::test_arrhenius - AssertionError: assert {'rate-constant': {'A': 660000.0, 'b': 1.62, 'Ea': 45354560.0}} == {'rate-constant': {'A': 660000.0000000001, 'b': 1.62, 'Ea': 45354560.0}}
  
  Differing items:
  {'rate-constant': {'A': 660000.0, 'Ea': 45354560.0, 'b': 1.62}} != {'rate-constant': {'A': 660000.0000000001, 'Ea': 45354560.0, 'b': 1.62}}
  
  Full diff:
    {
        'rate-constant': {
  -         'A': 660000.0000000001,
  ?                      ---------
  +         'A': 660000.0,
            'Ea': 45354560.0,
            'b': 1.62,
        },
    }
FAILED test/rmgpy/reactionTest.py::TestReactionToCantera::test_multi_arrhenius - assert 0.001 == 0
 +  where 0.001 = round(0.0009765625, 3)
 +    where 0.0009765625 = abs((5000000000000.0 - 5000000000000.001))
 +      where 5000000000000.0 = <ArrheniusRate at 7fb42083a9b0>.pre_exponential_factor
 +        where <ArrheniusRate at 7fb42083a9b0> = HO2(5) + OH(4) <=> H2O(27) + O2(6)    <Reaction(Arrhenius)>.rate
 +      and   5000000000000.001 = <ArrheniusRate at 7fb42083a3b0>.pre_exponential_factor
 +        where <ArrheniusRate at 7fb42083a3b0> = HO2(5) + OH(4) <=> H2O(27) + O2(6)    <Reaction(Arrhenius)>.rate
FAILED test/rmgpy/reactionTest.py::TestChargeTransferReaction::test_get_rate_coeff - assert 0.0078125 < 1e-06
 +  where 0.0078125 = abs((43870506959778.99 - 43870506959779.0))
== 5 failed, 1970 passed, 38 skipped, 3939699 warnings in 2107.75s (0:35:07) ===

@rwest
Copy link
Member

rwest commented Oct 23, 2025

The rate constant imprecision ones might be fixed by 116301d
The is_bicyclic one is the problem I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

get_deterministic_sssr is not really deterministic

4 participants