Skip to content

Add role-aware RDKit molecule export#59

Open
joelaforet wants to merge 6 commits into
issue-51/shared-saamr-topology-indexfrom
issue-51/rdkit-role-aware-export
Open

Add role-aware RDKit molecule export#59
joelaforet wants to merge 6 commits into
issue-51/shared-saamr-topology-indexfrom
issue-51/rdkit-role-aware-export

Conversation

@joelaforet
Copy link
Copy Markdown
Contributor

@joelaforet joelaforet commented May 5, 2026

Summary

  • Adds strategy-based role-aware RDKit export.
  • Adds primitive_to_rdkit_mols(...), yielding one RDKit Mol per SEGMENT.
  • Preserves legacy primitive_to_rdkit(...).
  • Adds focused RDKit exporter tests.

Issue Coverage

Addresses #31 for the new role-aware SAAMR export path.

Issue #31 exposed a mismatch between RDKit's aromaticity model and MuPT's legacy flattened export path. When a heteroaromatic repeat unit such as thiophene is read from SMILES, MuPT first builds/sanitizes an RDKit molecule in primitive_from_smiles(). During RDKit import, MuPT stores each RDKit bond type directly on a Connector in connector_between_rdatoms(). For thiophene sulfur, that means two AROMATIC connectors.

The legacy primitive_to_rdkit(...) exporter then flattens the hierarchy and calls child_prim.check_valence() before constructing the output RDKit molecule (legacy flatten/check path). check_valence() compares allowed element valence against Primitive.valence (implementation), and Primitive.valence is computed by summing numeric connector bond orders. Those numeric orders come from RDKit's BondTypeAsDouble() table (BOND_ORDER), so two aromatic sulfur bonds are summed as 1.5 + 1.5 = 3. MuPT therefore rejected a chemically valid thiophene sulfur before RDKit could receive and validate the exported graph.

This PR does not remove check_valence() or claim that valence checks are unimportant. check_valence() remains useful as a simple MuPT-side consistency check for obvious invalid atomic primitives. The problem is that this local scalar check is not expressive enough to be the final chemistry validator for RDKit export, especially for aromatic systems where RDKit's sanitization and aromaticity model determine valid valence. For the role-aware RDKit exporter, MuPT should validate the role/topology structure, preserve the original connector bond types, and then let RDKit validate the constructed chemistry.

This PR avoids the failure mode for SAAMR export by bypassing the legacy flatten-and-check path. The new exporter validates and indexes the SAAMR role hierarchy with build_saamr_role_topology_index() through AllAtomRDKitExportStrategy, then constructs an RDKit Mol directly from the collected atom and connector references. Bonds are added back to RDKit with their original connector bond types in _add_rdkit_bonds(), and the resulting RDKit molecule is validated with RDKit's property cache (_mol_from_rdkit_data()).

Regression tests now cover thiophene, pyrrole, and pyromellitimide repeat units, plus an explicit RDKit sanitization check that exported thiophene retains aromatic bonds and a valid sulfur valence (tests).

Stack Position

This is PR 2 of 3 in the role-aware RDKit/SDF export stack.

Blocked by #58.

Resolves #31

Stack:

  1. Refactor SAAMR topology traversal for MDAnalysis/RDKit exporters #58: shared SAAMR topology traversal.
  2. This PR: role-aware RDKit Mol export.
  3. Add export-only MuPT SDF writer #60: export-only mupt.muptio.sdf writer.

After #58 Merges

After #58 merges into main:

  • Retarget this PR from base issue-51/shared-saamr-topology-index to base main.
  • Do not rebase unless GitHub reports conflicts.
  • If conflicts appear, merge updated main into this branch, resolve, and push.

After This Merges

Verification

  • python -m pytest mupt/tests/interfaces/test_shared_topology.py mupt/tests/interfaces/rdkit/test_exporters.py -q

return mol


PDB_CHAIN_IDS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
Copy link
Copy Markdown
Contributor Author

@joelaforet joelaforet May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this block is added so that exported systems have valid PDB Chain identifiers. Problems arise when one creates a topology from a file that has unspecified chain information. By having this "wrap around" logic, we ensure each residue has a unique chain/residue index. One chain per "segment" only allows for 26 unique segments in the topology. This messes with downstream analysis, since there are potentially multiple matches for something like "Chain A Residue 1". The wraparound gives us 26 x 9999 ~= 260,000 unique residues in the system, where a residue is the same concept as a repeat_unit. In the future, we may consider switching to PDBx format exclusively, but this works fine for now.

linker_refs: list[tuple[int, Primitive, ConnectorReference]] = field(default_factory=list)


class RDKitExportStrategy(ABC):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, added for completeness in case we want to export non-AA systems to RDKit. I don't know if that is needed, and this may be overkill. Happy to refactor to remove the ABC if needed.

Copy link
Copy Markdown
Contributor Author

@joelaforet joelaforet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments for reviewers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant