
Extending cuik-molmaker to reactions (cuik-reactmaker) #4

Open
akshatzalte wants to merge 3 commits into NVIDIA-Digital-Bio:main from akshatzalte:feat/cgr-reaction-featurization
@akshatzalte

Add batch_reaction_featurizer for CGR reaction featurization

What this PR adds

A new batch_reaction_featurizer function, the reaction analogue of batch_mol_featurizer, built on the Condensed Graph of Reaction (CGR) representation (the same one used by Chemprop's CondensedGraphOfReactionFeaturizer). The API is consistent with the existing package:

import cuik_molmaker

# Feature selections; [...] stands for the list of feature names to use
atom_onehot = cuik_molmaker.atom_onehot_feature_names_to_array([...])
atom_float  = cuik_molmaker.atom_float_feature_names_to_array(['aromaticity', 'mass'])
bond_feats  = cuik_molmaker.bond_feature_names_to_array([...])
# The mode helper returns an array; take the single entry for REAC_DIFF
mode_int    = cuik_molmaker.reaction_mode_names_to_array(['REAC_DIFF'])[0]

# reac_smiles_list / prod_smiles_list: reactant and product SMILES, one entry per reaction
V, E, edge_index, rev_edge_index, batch = cuik_molmaker.batch_reaction_featurizer(
    reac_smiles_list, prod_smiles_list,
    atom_onehot, atom_float, bond_feats,
    keep_h=True, add_h=False, offset_carbon=False, mode=mode_int,
)
# Returns 5 NumPy arrays, following the same convention as batch_mol_featurizer

Supported: all 4 atom featurizer modes (V1, V2, ORGANIC, RIGR) and all 6 reaction modes (REAC_DIFF, REAC_PROD, PROD_DIFF, and their _BALANCE variants). keep_h/add_h semantics match Chemprop's make_mol exactly.
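
As a rough illustration of how the three base reaction modes differ, the sketch below combines the reactant-side and product-side feature vectors of a single matched atom. It assumes the Chemprop convention that the difference is product minus reactant, and it ignores details such as how unmatched atoms are encoded, which is where the _BALANCE variants differ. combine_features is a hypothetical helper written for this example, not part of cuik-molmaker.

```python
from typing import List


def combine_features(reac: List[float], prod: List[float], mode: str) -> List[float]:
    """Combine one matched atom's reactant and product features (illustrative only)."""
    if len(reac) != len(prod):
        raise ValueError("reactant and product feature vectors must have equal length")

    # Assumption: for a matched atom the _BALANCE suffix makes no difference,
    # so this sketch simply strips it.
    suffix = "_BALANCE"
    base = mode[: -len(suffix)] if mode.endswith(suffix) else mode

    # Assumption: difference is product minus reactant (Chemprop convention).
    diff = [p - r for r, p in zip(reac, prod)]

    if base == "REAC_DIFF":
        return reac + diff
    if base == "REAC_PROD":
        return reac + prod
    if base == "PROD_DIFF":
        return prod + diff
    raise ValueError(f"unknown reaction mode: {mode}")
```

In every mode the combined vector is twice the per-side length, which is why the per-side and total feature dimensions quoted for bonds further down differ by a factor of two.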


Implementation notes

The bulk of the new code is in src/reaction_features.cpp (701 lines), with smaller additions to features.h, one_hot.cpp, and cuik_molmaker_cpp.cpp. The change is purely additive: no existing code was modified.

Two design choices worth noting:

  • parse_rxn_side_mol does not clear atom-map numbers. The existing parse_mol strips them for reordering purposes; reaction featurization needs them for reactant↔product correspondence. Fully additive — molecule featurization is unchanged.
  • Bond lookup is O(bonds) via hash map (unordered_map<uint64_t, size_t> keyed by (min_idx << 32) | max_idx), replacing the Python CGR featurizer's O(n²) atom-pair scan.
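
The key-packing idea behind the bond lookup can be sketched in Python as follows. This is only an illustration of the (min_idx << 32) | max_idx scheme described above, not the C++ code from the PR; bond_key, build_bond_index, and find_bond are hypothetical names, and the sketch assumes atom indices fit in 32 bits.

```python
from typing import Dict, Iterable, Optional, Tuple


def bond_key(i: int, j: int) -> int:
    """Pack an unordered atom pair into one integer: (min_idx << 32) | max_idx."""
    lo, hi = (i, j) if i < j else (j, i)
    return (lo << 32) | hi


def build_bond_index(bonds: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Map each packed atom pair to the position of its bond; one pass over the bonds."""
    return {bond_key(i, j): pos for pos, (i, j) in enumerate(bonds)}


def find_bond(index: Dict[int, int], i: int, j: int) -> Optional[int]:
    """Return the bond position for atoms i and j, or None if they are not bonded."""
    return index.get(bond_key(i, j))
```

Because the smaller index always goes into the high bits, (i, j) and (j, i) produce the same key, so each bond is stored once and can be found from either direction.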

Node ordering matches Chemprop exactly: reactant atoms 0..n_reac−1, then product-only atoms n_reac..n_cgr−1.
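
The ordering can be illustrated with atom-map numbers alone. The sketch below is a simplified model, not the PR's implementation: cgr_node_order is a hypothetical helper, atoms are represented only by their map numbers, and a product atom is assumed to be product-only when it is unmapped (map number 0) or its map number does not occur among the reactants.

```python
from typing import List, Tuple


def cgr_node_order(reac_maps: List[int], prod_maps: List[int]) -> List[Tuple[str, int]]:
    """Return CGR nodes as (side, local atom index): reactant atoms, then product-only atoms."""
    # Map numbers present on the reactant side (0 means "unmapped").
    reac_mapped = {m for m in reac_maps if m > 0}

    # Reactant atoms occupy CGR indices 0 .. n_reac-1 in their original order.
    order = [("reac", i) for i in range(len(reac_maps))]

    # Product-only atoms follow at n_reac .. n_cgr-1.
    order += [
        ("prod", j)
        for j, m in enumerate(prod_maps)
        if m == 0 or m not in reac_mapped
    ]
    return order
```

For reactant map numbers [1, 2, 3] and product map numbers [2, 1, 4], this gives three reactant nodes followed by one product-only node (the atom mapped as 4), so n_reac is 3 and n_cgr is 4.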


Correctness

Verified against Chemprop's Python CondensedGraphOfReactionFeaturizer on:

  • E2 dataset (1264 reactions, atom-mapped explicit H)
  • SN2 dataset (2362 reactions, includes [H-:2] lone-hydride nucleophiles)
  • 10 hand-crafted unbalanced reactions (num_only vs BALANCE divergence)

A test fixture CSV (tests/data/sample_rxns_100.csv) is included; it holds 110 reactions: 100 balanced ones plus the 10 hand-crafted unbalanced ones. Golden .xz reference files can be committed once you've had a look at the design.

Speedup benchmarks are presented here

Implements C++ CGR (Condensed Graph of Reaction) featurization to
accelerate chemprop reaction property prediction (~9x over Python path).

New C++ code:
- src/reaction_features.cpp: batch_reaction_featurizer supporting all 6
  RxnModes (REAC_DIFF, REAC_PROD, PROD_DIFF and BALANCE variants) and all
  4 atom featurizer modes (V1, V2, ORGANIC, RIGR). Uses O(bonds) hash-map
  bond enumeration instead of O(n^2) atom-pair scan.
- src/features.h: ReactionMode enum, reaction_mode_names_to_array, and
  parse_reaction declarations; get_atomic_num_onehot_index helper for
  num_only representation of unmatched atoms
- src/one_hot.cpp/h: get_atomic_num_onehot_index for num_only encoding
- src/cuik_molmaker_cpp.cpp: exports reaction_mode_names_to_array and
  batch_reaction_featurizer to Python
- CMakeLists.txt: add reaction_features.cpp to cuik_molmaker_core sources

Test fixtures:
- tests/data/sample_rxns_100.csv: 100 balanced reactions (50 E2 + 50 SN2)
  plus 10 hand-crafted unbalanced reactions covering num_only and BALANCE
  mode edge cases. Verified against chemprop CondensedGraphOfReactionFeaturizer
  across all 6 modes with max_diff=0.
- test_reaction_features.py: parametrized over all 4 atom featurizer versions
  (V1, V2, ORGANIC, RIGR) × all 6 reaction modes (REAC_DIFF, REAC_PROD,
  PROD_DIFF and BALANCE variants) = 24 test cases.
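
The 24-case matrix is the Cartesian product of the two option lists. The snippet below is a minimal sketch of how such a matrix could be enumerated; the variable names are illustrative, and the _BALANCE mode names are formed by appending the suffix to the three base modes as described above. The actual test file may be organized differently.

```python
from itertools import product

ATOM_FEATURIZERS = ["V1", "V2", "ORGANIC", "RIGR"]

BASE_MODES = ["REAC_DIFF", "REAC_PROD", "PROD_DIFF"]
# Each base mode has a matching _BALANCE variant, giving 6 modes in total.
REACTION_MODES = BASE_MODES + [mode + "_BALANCE" for mode in BASE_MODES]

# Every (atom featurizer, reaction mode) pair is one test case: 4 x 6 = 24.
TEST_CASES = list(product(ATOM_FEATURIZERS, REACTION_MODES))
```

In a pytest suite, a list like TEST_CASES would typically be passed to pytest.mark.parametrize so that each pair is reported as its own test.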

RIGR uses reduced bond features (["is-null", "in-ring"], bond_fdim=2 per
side = 4 total) unlike V1/V2/ORGANIC which use 5 bond features (bond_fdim=14
per side = 28 total).

Golden .xz files generated from C++ batch_reaction_featurizer output after
verifying agreement with chemprop CondensedGraphOfReactionFeaturizer
(max_diff=0 on E2/SN2 data across all 6 modes).
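
A check of the max_diff=0 kind can be sketched as below, reading max_diff as the largest element-wise absolute difference between two feature matrices. That reading is an assumption, and max_abs_diff is a hypothetical helper operating on plain nested lists so the example needs no third-party packages.

```python
from typing import List, Sequence


def max_abs_diff(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Largest element-wise absolute difference between two equally shaped matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    # default=0.0 covers the case of empty matrices.
    return max(
        (abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)),
        default=0.0,
    )


# Two small made-up matrices standing in for the C++ and Python featurizer outputs.
cpp_out: List[List[float]] = [[1.0, 0.0], [0.0, 1.0]]
py_out: List[List[float]] = [[1.0, 0.0], [0.0, 1.0]]
assert max_abs_diff(cpp_out, py_out) == 0.0
```

With NumPy arrays the same quantity is usually written as np.max(np.abs(a - b)), preceded by a shape check.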