Fix organometallic conformer gen and custom CCD support#132
Conversation
…-conformer-custom-ccd
…-conformer-custom-ccd
|
Nice work Lukas, I can see a touch of CC here and there :P |
jnwei
left a comment
There was a problem hiding this comment.
Overall really nice work @ljarosch ! The CCD issue / organometallic conformers has been a nasty issue to debug and iron out. I think your overall strategy of preferring the biotite CCD and then converting custom CCDs into biotite format makes sense, thank you for providing the explanation.
One main question regarding how to use the custom CCD: If a user has a structure for one molecule (for example, a custom ligand) but wishes to use standard structures from the CCD for everything else (e.g. for standard amino acids), will the custom CCD be "appended" to the standard CCD? Or will the user need to provide their own CCD with the standard aas included?
It could be helpful to provide an explicit example of using a custom CCD in our huggingface directory, so that we give users a model of how to use this route in addition to documentation. Perhaps this example could be the example you used in inference testing.
Otherwise, I have just a few small fixes in the caching definitions and tests.
| self._biotite_ccd_path = update_biotite_ccd_from_file( | ||
| dataset_config.ccd_file_path | ||
| ) |
There was a problem hiding this comment.
Will this run into an issue if the user provided CCD only contains customized structures, but not standard structures like the residues?
There was a problem hiding this comment.
If this file contains content copied from the Biotite library, can we add a reference about which file(s) were copied from biotite, and what our changes are on top of their functions?
There was a problem hiding this comment.
Ditto – let's vendor as little code as humanly possible, do we need to vendor all of it? Which parts are we changing?
There was a problem hiding this comment.
Yeah I can be more specific about it. Re code vendoring, I think we could remove some of these functions. All functions below concatenate_ccd are taken from biotite directly, and concatenate_ccd itself was only slightly modified to work with a local path instead of pulling from the web.
Re code vendoring I was thinking that adding these util functions makes it self-contained, as they are all "private" functions in biotite, so maybe not API-stable. But happy to change this back to importing them, it does introduce some awkwardness and I'm unsure about the best practice.
|
|
||
| assert atom_names == CUSTOM_CCD_EXPECTED_ATOM_NAMES | ||
| assert reference_atom_names == CUSTOM_CCD_EXPECTED_ATOM_NAMES | ||
| assert atom_names == reference_atom_names |
There was a problem hiding this comment.
nit: This 3rd assertion seems redundant with the two assertions before it
| "structure_with_ref_mol_from_ccd_code_aligns_atom_order", | ||
| ], | ||
| ) | ||
| def test_ligand_ccd_paths_respect_custom_ccd_and_atom_order(input_type): |
There was a problem hiding this comment.
nit: Since the verification is just one line anyway using assert_atom_names_align_with_reference_mol, does it make sense to write this as two different tests, rather than one test with parameterization?
| _assert_atom_names_align_with_reference_mol(structure_with_ref_mols) | ||
|
|
||
|
|
||
| def test_conformer_fallback_to_ideal_with_partial_nan(monkeypatch): |
There was a problem hiding this comment.
nit: monkeypatch seems unused here in favor of unittest.patch, can we remove it?
There was a problem hiding this comment.
we should either be using mock.patch or monkeypatch, not both, for consistency
There was a problem hiding this comment.
right sorry, that's just a leftover I forgot to clean up
|
|
||
| reference_data_path = Path(__file__).parent / "test_data" / "structure_from_query" | ||
|
|
||
| # Custom CCD with alanine renamed to "GLU", with atoms in deliberately non-standard row |
There was a problem hiding this comment.
This is a good test case. I like that the full cif is written in the test file for easy viewing within this test.
To expand my question regarding the usage of custom CCDs: Perhaps we could include one more test case that includes both a custom structure (provided by the this custom CCD cif file) and a standard residue? So for example, Valine + custom GLU?
| _mol_from_biotite_ccd_cached = lru_cache(maxsize=500)(mol_from_biotite_ccd) | ||
| """Cached version to speed up repeated access to same CCD entries.""" | ||
|
|
||
|
|
||
| def mol_from_biotite_ccd_cached(ccd_code: str) -> AnnotatedMol: | ||
| """Builds mol from Biotite's internal CCD using pdbeccdutils. | ||
|
|
||
| Internally uses a cached version of `mol_from_biotite_ccd` to speed up repeated | ||
| access to the same CCD entries (cache memory is limited to 500 entries). | ||
|
|
||
| Args: | ||
| ccd_code: | ||
| The CCD code of the component to build. | ||
|
|
||
| Returns: | ||
| An AnnotatedMol with atom names and (potentially) ideal/model conformers. | ||
| Despite caching, the returned Mol will be a new object that can be safely | ||
| modified without affecting future calls with the same CCD code. | ||
| """ | ||
| mol = _mol_from_biotite_ccd_cached(ccd_code) | ||
|
|
||
| # Copy to avoid issues with mutable objects | ||
| mol_cp = Chem.Mol(mol) | ||
|
|
||
| return mol_cp |
There was a problem hiding this comment.
While the use of the functional lru_cache around another function in line is technically correct, it looks a bit strange with the hanging docstring. It is much more common to see lru_cache as a decorator around an internal function.
Could we rewrite these two functions slightly to follow this pattern instead?
@lru_cache(maxsize=500)
def _mol_from_biotite_ccd_cached(ccd_code: str) -> AnnotatedMol:
# internal, do not call directly — returns shared cached object
...
def mol_from_biotite_ccd(ccd_code: str) -> AnnotatedMol:
"""...Returns a safe copy; cached internally."""
return Chem.Mol(_mol_from_biotite_ccd_cached(ccd_code))
| return mol_cp | ||
|
|
||
|
|
||
| _get_residue_cached = lru_cache(maxsize=500)(struc.info.residue) |
There was a problem hiding this comment.
Same thing with _get_residue_cached and the lru_cache
| assert masks[0] is False | ||
| assert all(masks[1:]) | ||
|
|
||
| # Featurization should succeed without NaN |
There was a problem hiding this comment.
nice that this test also tests featurization runs
jandom
left a comment
There was a problem hiding this comment.
Overall, i'm really happy this is here but some work still remains. A lot of code was added here and we need to make sure this really works.
| ], | ||
| } | ||
| }, | ||
| "ccd_file_path": "/path/to/CCD/file.cif" |
There was a problem hiding this comment.
Wait so this will no longer be supported?
| structure_array_directory: null | ||
| cache_directory: <tmp-dir>/of3_template_data/template_cache | ||
| log_directory: null | ||
| ccd_file_path: null No newline at end of file |
There was a problem hiding this comment.
Ah, still there but moved up
| ccd_path = getattr(dataset, "_biotite_ccd_path", None) | ||
| if ccd_path is not None: | ||
| update_biotite_ccd(ccd_path) |
There was a problem hiding this comment.
I see Claude doing this pattern a lot – and I don't love it – a better pattern would be to declare _biotite_ccd_path as a nullable attribute that's defaulting to None.
There was a problem hiding this comment.
Ditto – let's vendor as little code as humanly possible, do we need to vendor all of it? Which parts are we changing?
| bondlist_arr = atom_array.bonds.as_array() | ||
| if not np.any(bondlist_arr[:, 2] == BondType.COORDINATION): | ||
| return atom_array |
| return result | ||
|
|
||
|
|
||
| def _bcif_to_cif_category(bcif_category) -> CIFCategory: |
There was a problem hiding this comment.
Is this vendored code again?
There was a problem hiding this comment.
No this is our own code
| assert conf_id == 0 | ||
| except ConformerGenerationError: | ||
| if ideal_conf is None: | ||
| raise |
| # Shouldn't be necessary for ideal coordinates but better to be safe | ||
| mol = add_conformer_atom_mask(mol) | ||
| replace_nan_coords_with_zeros(mol) |
There was a problem hiding this comment.
What's this guarding against? After conformer generation, we should either have an ideal conformer/a valid generated conformer. The shouldn't have nans, surely?
There was a problem hiding this comment.
If our own conformer generation fails we fall back to the CCD-derived conformer, which can sometimes have missing coordinates. So far I've only ever seen this for "model" coordinates, not "ideal" coordinates, but technically the PDB spec indicates that it may be possible for coordinates to be "missing or incomplete" also for ideal conformers, so I thought better safe than sorry if PDB doesn't guarantee this.
I could pull this into the ideal conformer clause to be more precise, as it shouldn't apply to RDKit-generated conformers.
There was a problem hiding this comment.
I could pull this into the ideal conformer clause to be more precise, as it shouldn't apply to RDKit-generated conformers.
I'd be in favor of that, because i like sane logic – your explainer about would also make for a terrific comment to anybody studying this code in the future
| _assert_atom_names_align_with_reference_mol(structure_with_ref_mols) | ||
|
|
||
|
|
||
| def test_conformer_fallback_to_ideal_with_partial_nan(monkeypatch): |
There was a problem hiding this comment.
we should either be using mock.patch or monkeypatch, not both, for consistency
| values = values.astype(float) | ||
| array = np.zeros(string_array.shape, dtype=values.dtype) | ||
| array[mask_bool] = values | ||
| return array |
|
@ljarosch wants to revisit this PR in a week or two, dropping to draft as it's still in progress |
Summary
Refactors inference-time conformer logic to be
pdbeccdutils-based (fixes organometallic conformer generation issues), and adds support for custom CCD file.Changes
1) Custom CCD support
Custom CCDs are now read in using the new function
update_biotite_ccd_from_fileand can be supplied in both.cifand.bcifformat. This function will update Biotite's own internal CCD to the user-supplied one, resulting in automatic compatibility with any Biotite-based utilities (such as residue information, bond writing, etc.). In parallel we introduce careful piping to make Biotite's internal CCD work withpdbeccdutils, see note.Note
.cifinput requires a few minutes to be converted to.bcifformat under the hood. We have a script for preconverting inpreprocess_ccd_biotite.pythat we point to in the documentation, but could consider making it a native openfold3 command.Warning
Biotite's internal CCD represents a module-level state, which is currently only compatible with the
forkmultiprocessing method. We likely want to future-proof this as newer Python versions switch the default toforkserver. This PR introduces a_worker_init_with_ccdfor this purpose which will reapply the biotite CCD change to each worker, but we have upstreamDataModuleissues blocking actual compatibility with forkserver/spawn that seemed out of scope for this PR.Note about CCD behavior going forward
The idea going forward is that the canonical CCD is always handled through Biotite's internally stored one: If a CIFFile-like object is needed as in the template code, we use
BiotiteCCDWrapper. If interfacing withpdbeccdutilsis required, we can usemol_from_biotite_ccd, which will handle the piping from the Biotite CCD bCIF format into the raw CIF format thatgemmiandpdbeccdutilsexpect.While Biotite's internal CCD format and these conversions are a little awkward, it seemed worse to implement lots of parallel code flows depending on which input format is supplied, so this consolidation was the best solution.
2) Basing conformer processing on
pdbeccdutilspdbeccdutilscontains useful input sanitization that will rescue the conformer generation of organometallics by setting proper coordination bond types. We also rely on this library for training-time conformer processing. This switches inference conformer processing from pure Biotite logic back topdbeccdutilslogic, usingprocessed_reference_molecule_from_ccd_codeandmol_from_biotite_ccd. Aspdbeccdutilsadds some slight overhead due to running more functionalities than we technically require, this PR adds caching to these functions to speed them up.Note
We might decide on getting rid of
pdbeccdutilsaltogether in the future, but its molecule sanitization proved a bit tricky to track down and maintain ourselves, so for now it seemed better to rely on a PDBe-developed library.For additional robustness, this PR also adds support for fallback to CCD-derived ideal coordinates in case conformer generation fails.
3) Proactive dative bond conversion in CIF output
pdbeccdutilsintroducesCOORDINATIONbond types that Biotite cannot serialize in the_chem_comp_bondtable. Previously,_create_cif_filecaught the resultingKeyErrorand retried after converting bonds. This replaces that try/except with a proactive conversion of intra-residue dative bonds toSINGLEbefore writing, avoiding throwing a warning and making the behavior explicit.Related Issues
Closes #109
Closes #110
Testing
Added tests to
test_structure_from_query.py:ccd_codeinput pathsTested full inference workflow with:
Standard inference and HEM conformer
query.json
{ "queries": { "protein_with_hem": { "chains": [ { "molecule_type": "protein", "chain_ids": "A", "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH" }, { "molecule_type": "ligand", "chain_ids": "B", "ccd_codes": "HEM" } ] }, "protein_with_atp": { "chains": [ { "molecule_type": "protein", "chain_ids": "A", "sequence": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT" }, { "molecule_type": "ligand", "chain_ids": "B", "ccd_codes": "ATP" } ] } } }runner.yml
Custom CCD
CCD file was modified to contain a ligand named TEST1.
query.json
{ "queries": { "protein_with_test1": { "chains": [ { "molecule_type": "protein", "chain_ids": "A", "sequence": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT" }, { "molecule_type": "ligand", "chain_ids": "B", "ccd_codes": "TEST1" } ] } } }cif-based runner.yml
bcif-based runner.yml