Extending cuik-molmaker to reactions (cuik-reactmaker) #4
Open
akshatzalte wants to merge 3 commits into
Implements C++ CGR (Condensed Graph of Reaction) featurization to accelerate chemprop reaction property prediction (~9x over Python path).

New C++ code:
- src/reaction_features.cpp: batch_reaction_featurizer supporting all 6 RxnModes (REAC_DIFF, REAC_PROD, PROD_DIFF and BALANCE variants) and all 4 atom featurizer modes (V1, V2, ORGANIC, RIGR). Uses O(bonds) hash-map bond enumeration instead of O(n^2) atom-pair scan.
- src/features.h: ReactionMode enum, reaction_mode_names_to_array, and parse_reaction declarations; get_atomic_num_onehot_index helper for num_only representation of unmatched atoms
- src/one_hot.cpp/h: get_atomic_num_onehot_index for num_only encoding
- src/cuik_molmaker_cpp.cpp: exports reaction_mode_names_to_array and batch_reaction_featurizer to Python
- CMakeLists.txt: add reaction_features.cpp to cuik_molmaker_core sources

Test fixtures:
- tests/data/sample_rxns_100.csv: 100 balanced reactions (50 E2 + 50 SN2) plus 10 hand-crafted unbalanced reactions covering num_only and BALANCE mode edge cases. Verified against chemprop CondensedGraphOfReactionFeaturizer across all 6 modes with max_diff=0.
test_reaction_features.py: parametrized over all 4 atom featurizer versions (V1, V2, ORGANIC, RIGR) × all 6 reaction modes (REAC_DIFF, REAC_PROD, PROD_DIFF and BALANCE variants) = 24 test cases. RIGR uses reduced bond features (["is-null", "in-ring"], bond_fdim=2 per side = 4 total), unlike V1/V2/ORGANIC, which use 5 bond features (bond_fdim=14 per side = 28 total). Golden .xz files were generated from C++ batch_reaction_featurizer output after verifying agreement with chemprop CondensedGraphOfReactionFeaturizer (max_diff=0 on E2/SN2 data across all 6 modes).
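The test matrix and bond-feature widths described in this commit can be sketched in a few lines of Python. This is only an illustration of the arithmetic (4 × 6 = 24 cases, per-side width doubled for the CGR); the names `ATOM_FEATURIZERS`, `REACTION_MODES`, `PER_SIDE_BOND_FDIM`, and `cgr_bond_fdim` are hypothetical and are not taken from the test file itself.

```python
from itertools import product

# Featurizer and mode names as listed in the commit message.
ATOM_FEATURIZERS = ["V1", "V2", "ORGANIC", "RIGR"]
BASE_MODES = ["REAC_DIFF", "REAC_PROD", "PROD_DIFF"]
REACTION_MODES = BASE_MODES + [f"{mode}_BALANCE" for mode in BASE_MODES]

# Per-side bond feature widths quoted in the commit message.
PER_SIDE_BOND_FDIM = {"V1": 14, "V2": 14, "ORGANIC": 14, "RIGR": 2}


def cgr_bond_fdim(featurizer: str) -> int:
    """Total CGR bond width: one block per side of the reaction."""
    return 2 * PER_SIDE_BOND_FDIM[featurizer]


# Full cross product that a parametrized test would iterate over.
TEST_CASES = list(product(ATOM_FEATURIZERS, REACTION_MODES))

if __name__ == "__main__":
    print(len(TEST_CASES), "test cases")
    for name in ATOM_FEATURIZERS:
        print(name, "bond width:", cgr_bond_fdim(name))
```

In a pytest suite, a list like `TEST_CASES` would typically be passed to `pytest.mark.parametrize`; whether the actual test file does it this way is an assumption.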
Add `batch_reaction_featurizer` for CGR reaction featurization

What this PR adds

A new `batch_reaction_featurizer` function, the reaction analogue of `batch_mol_featurizer`, using the Condensed Graph of Reaction (CGR) representation (same as Chemprop's `CondensedGraphOfReactionFeaturizer`). The API is consistent with the existing package.

Supported: all 4 atom featurizer modes (V1, V2, ORGANIC, RIGR) and all 6 reaction modes (`REAC_DIFF`, `REAC_PROD`, `PROD_DIFF`, and their `_BALANCE` variants). `keep_h`/`add_h` semantics match Chemprop's `make_mol` exactly.

Implementation notes
All new code is in `src/reaction_features.cpp` (701 lines), with additions to `features.h`, `one_hot.cpp`, and `cuik_molmaker_cpp.cpp`. No existing code was modified.

Two design choices worth noting:

- `parse_rxn_side_mol` does not clear atom-map numbers. The existing `parse_mol` strips them for reordering purposes; reaction featurization needs them for reactant↔product correspondence. The change is fully additive: molecule featurization is unchanged.
- Bond enumeration is O(bonds): a hash map (`unordered_map<uint64_t, size_t>` keyed by `(min_idx << 32) | max_idx`) replaces the Python CGR featurizer's O(n²) atom-pair scan.

Node ordering matches Chemprop exactly: reactant atoms 0..n_reac−1, then product-only atoms n_reac..n_cgr−1.
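The node ordering and the packed-key bond map can be illustrated with a small Python sketch. It is not the C++ implementation: it assumes every atom carries a unique, non-zero atom-map number, represents each bond as a plain pair of map numbers, and uses hypothetical helper names (`pack_pair`, `cgr_node_index`, `enumerate_cgr_bonds`). A Python `dict` stands in for `unordered_map<uint64_t, size_t>`.

```python
from typing import Dict, List, Tuple

Bond = Tuple[int, int]  # pair of atom-map numbers (illustrative stand-in for a real bond)


def pack_pair(i: int, j: int) -> int:
    """Pack an unordered index pair into one key: (min_idx << 32) | max_idx."""
    lo, hi = (i, j) if i <= j else (j, i)
    if lo < 0 or hi >= 2**32:
        raise ValueError("indices must fit in 32 bits")
    return (lo << 32) | hi


def cgr_node_index(reac_maps: List[int], prod_maps: List[int]) -> Dict[int, int]:
    """Atom-map number -> CGR node index: reactant atoms first, then product-only atoms."""
    index = {m: k for k, m in enumerate(reac_maps)}
    for m in prod_maps:
        if m not in index:
            index[m] = len(index)
    return index


def enumerate_cgr_bonds(
    index: Dict[int, int], reac_bonds: List[Bond], prod_bonds: List[Bond]
) -> List[Tuple[int, int, bool, bool]]:
    """Union of reactant and product bonds in one pass over each bond list.

    Returns (u, v, in_reactant, in_product) rows in first-seen order.
    """
    seen: Dict[int, int] = {}  # packed key -> row position
    rows: List[list] = []
    for side, bonds in enumerate((reac_bonds, prod_bonds)):
        for a, b in bonds:
            u, v = sorted((index[a], index[b]))
            key = pack_pair(u, v)
            if key not in seen:
                seen[key] = len(rows)
                rows.append([u, v, False, False])
            rows[seen[key]][2 + side] = True
    return [tuple(row) for row in rows]


if __name__ == "__main__":
    # Toy SN2-like case: bond 1-2 is broken, bond 1-3 is formed.
    idx = cgr_node_index([1, 2, 3], [1, 3, 2])
    for row in enumerate_cgr_bonds(idx, [(1, 2)], [(1, 3)]):
        print(row)
```

Each bond is touched once and looked up in constant expected time, so the work scales with the number of bonds rather than with the number of atom pairs.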
Correctness
Verified against Chemprop's Python `CondensedGraphOfReactionFeaturizer` on:

- 100 balanced E2/SN2 reactions (including `[H-:2]` lone-hydride nucleophiles)
- 10 hand-crafted unbalanced reactions (`num_only` vs BALANCE divergence)

A test fixture CSV (`tests/data/sample_rxns_100.csv`, 110 reactions) is included. Golden `.xz` reference files can be committed once you've had a look at the design.

Speedup benchmarks are presented here.