Skip to content

Fix organometallic conformer gen and custom CCD support#132

Draft
ljarosch wants to merge 15 commits into
mainfrom
ljarosch/2026-02-12/fix/issue-109-110-conformer-custom-ccd
Draft

Fix organometallic conformer gen and custom CCD support#132
ljarosch wants to merge 15 commits into
mainfrom
ljarosch/2026-02-12/fix/issue-109-110-conformer-custom-ccd

Conversation

@ljarosch
Copy link
Copy Markdown
Collaborator

@ljarosch ljarosch commented Mar 10, 2026

Summary

Refactors inference-time conformer logic to be pdbeccdutils-based (fixes organometallic conformer generation issues), and adds support for custom CCD file.

Changes

1) Custom CCD support

Custom CCDs are now read in using the new function update_biotite_ccd_from_file and can be supplied in both .cif and .bcif format. This function will update Biotite's own internal CCD to the user-supplied one, resulting in automatic compatibility with any Biotite-based utilities (such as residue information, bond writing, etc.). In parallel we introduce careful piping to make Biotite's internal CCD work with pdbeccdutils, see note.

Note

.cif input requires a few minutes to be converted to .bcif format under the hood. We have a script for preconverting in preprocess_ccd_biotite.py that we point to in the documentation, but could consider making it a native openfold3 command.

Warning

Biotite's internal CCD represents a module-level state, which is currently only compatible with the fork multiprocessing method. We likely want to future-proof this as newer Python versions switch the default to forkserver. This PR introduces a _worker_init_with_ccd for this purpose which will reapply the biotite CCD change to each worker, but we have upstream DataModule issues blocking actual compatibility with forkserver/spawn that seemed out of scope for this PR.

Note about CCD behavior going forward

The idea going forward is that the canonical CCD is always handled through Biotite's internally stored one: If a CIFFile-like object is needed as in the template code, we use BiotiteCCDWrapper. If interfacing with pdbeccdutils is required, we can use mol_from_biotite_ccd, which will handle the piping from the Biotite CCD bCIF format into the raw CIF format that gemmi and pdbeccdutils expect.

While Biotite's internal CCD format and these conversions are a little awkward, it seemed worse to implement lots of parallel code flows depending on which input format is supplied, so this consolidation was the best solution.

2) Basing conformer processing on pdbeccdutils

pdbeccdutils contains useful input sanitization that will rescue the conformer generation of organometallics by setting proper coordination bond types. We also rely on this library for training-time conformer processing. This switches inference conformer processing from pure Biotite logic back to pdbeccdutils logic, using processed_reference_molecule_from_ccd_code and mol_from_biotite_ccd. As pdbeccdutils adds some slight overhead due to running more functionalities than we technically require, this PR adds caching to these functions to speed them up.

Note

We might decide on getting rid of pdbeccdutils altogether in the future, but its molecule sanitization proved a bit tricky to track down and maintain ourselves, so for now it seemed better to rely on a PDBe-developed library.

For additional robustness, this PR also adds support for fallback to CCD-derived ideal coordinates in case conformer generation fails.

3) Proactive dative bond conversion in CIF output

pdbeccdutils introduces COORDINATION bond types that Biotite cannot serialize in the _chem_comp_bond table. Previously, _create_cif_file caught the resulting KeyError and retried after converting bonds. This replaces that try/except with a proactive conversion of intra-residue dative bonds to SINGLE before writing, avoiding throwing a warning and making the behavior explicit.

Related Issues

Closes #109
Closes #110

Testing

Added tests to test_structure_from_query.py:

  • HEM ligand conformer generation (regression test for organometallics)
  • Custom CCD with modified atom order, verified through both the query and direct ccd_code input paths
  • Conformer fallback to CCD Ideal coordinates with partial NaN handling

Tested full inference workflow with:

Standard inference and HEM conformer

query.json

{
	"queries": {
		"protein_with_hem": {
			"chains": [
				{
					"molecule_type": "protein",
					"chain_ids": "A",
					"sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
				},
				{
					"molecule_type": "ligand",
					"chain_ids": "B",
					"ccd_codes": "HEM"
				}
			]
		},
		"protein_with_atp": {
			"chains": [
				{
					"molecule_type": "protein",
					"chain_ids": "A",
					"sequence": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT"
				},
				{
					"molecule_type": "ligand",
					"chain_ids": "B",
					"ccd_codes": "ATP"
				}
			]
		}
	}
}

runner.yml

data_module_args:
  num_workers: 2

model_update:
  presets:
    - predict
    - low_mem
    - pae_enabled

Custom CCD

CCD file was modified to contain a ligand named TEST1.

query.json

{
	"queries": {
		"protein_with_test1": {
			"chains": [
				{
					"molecule_type": "protein",
					"chain_ids": "A",
					"sequence": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT"
				},
				{
					"molecule_type": "ligand",
					"chain_ids": "B",
					"ccd_codes": "TEST1"
				}
			]
		}
	}
}

cif-based runner.yml

# Test runner for custom CCD via .cif path
# Tests that the .cif -> .bcif auto-conversion works end-to-end

data_module_args:
  num_workers: 0

dataset_config_kwargs:
  ccd_file_path: ./custom_components.cif

model_update:
  presets:
    - predict
    - low_mem
    - pae_enabled

bcif-based runner.yml

# Test runner for custom CCD via .bcif path (pre-converted, fast)
# Tests the direct .bcif code path

data_module_args:
  num_workers: 0

dataset_config_kwargs:
  ccd_file_path: ./custom_components.bcif

model_update:
  presets:
    - predict
    - low_mem
    - pae_enabled

@ljarosch ljarosch requested review from jandom and jnwei March 10, 2026 00:39
@ljarosch ljarosch changed the title Ljarosch/2026 02 12/fix/issue 109 110 conformer custom ccd Fix organometallic conformer gen and custom CCD support Mar 10, 2026
@jandom
Copy link
Copy Markdown
Collaborator

jandom commented Mar 10, 2026

Nice work Lukas, I can see a touch of CC here and there :P

Copy link
Copy Markdown
Contributor

@jnwei jnwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall really nice work @ljarosch ! The CCD issue / organometallic conformers has been a nasty issue to debug and iron out. I think your overall strategy of preferring the biotite CCD and then converting custom CCDs into biotite format makes sense, thank you for providing the explanation.

One main question regarding how to use the custom CCD: If a user has a structure for one molecule (for example, a custom ligand) but wishes to use standard structures from the CCD for everything else (e.g. for standard amino acids), will the custom CCD be "appended" to the standard CCD? Or will the user need to provide their own CCD with the standard aas included?

It could be helpful to provide an explicit example of using a custom CCD in our huggingface directory, so that we give users a model of how to use this route in addition to documentation. Perhaps this example could be the example you used in inference testing.

Otherwise, I have just a few small fixes in the caching definitions and tests.

Comment on lines +120 to +122
self._biotite_ccd_path = update_biotite_ccd_from_file(
dataset_config.ccd_file_path
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this run into an issue if the user provided CCD only contains customized structures, but not standard structures like the residues?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this file contains content copied from the Biotite library, can we add a reference about which file(s) were copied from biotite, and what our changes are on top of their functions?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto – let's vendor as little code as humanly possible, do we need to vendor all of it? Which parts are we changing?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I can be more specific about it. Re code vendoring, I think we could remove some of these functions. All functions below concatenate_ccd are taken from biotite directly, and concatenate_ccd itself was only slightly modified to work with a local path instead of pulling from the web.

Re code vendoring I was thinking that adding these util functions makes it self-contained, as they are all "private" functions in biotite, so maybe not API-stable. But happy to change this back to importing them, it does introduce some awkwardness and I'm unsure about the best practice.


assert atom_names == CUSTOM_CCD_EXPECTED_ATOM_NAMES
assert reference_atom_names == CUSTOM_CCD_EXPECTED_ATOM_NAMES
assert atom_names == reference_atom_names
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This 3rd assertion seems redundant with the two assertions before it

"structure_with_ref_mol_from_ccd_code_aligns_atom_order",
],
)
def test_ligand_ccd_paths_respect_custom_ccd_and_atom_order(input_type):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Since the verification is just one line anyway using assert_atom_names_align_with_reference_mol, does it make sense to write this as two different tests, rather than one test with parameterization?

_assert_atom_names_align_with_reference_mol(structure_with_ref_mols)


def test_conformer_fallback_to_ideal_with_partial_nan(monkeypatch):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: monkeypatch seems unused here in favor of unittest.patch, can we remove it?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should either be using mock.patch or monkeypatch, not both, for consistency

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right sorry, that's just a leftover I forgot to clean up


reference_data_path = Path(__file__).parent / "test_data" / "structure_from_query"

# Custom CCD with alanine renamed to "GLU", with atoms in deliberately non-standard row
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good test case. I like that the full cif is written in the test file for easy viewing within this test.

To expand my question regarding the usage of custom CCDs: Perhaps we could include one more test case that includes both a custom structure (provided by the this custom CCD cif file) and a standard residue? So for example, Valine + custom GLU?

Comment on lines +796 to +820
_mol_from_biotite_ccd_cached = lru_cache(maxsize=500)(mol_from_biotite_ccd)
"""Cached version to speed up repeated access to same CCD entries."""


def mol_from_biotite_ccd_cached(ccd_code: str) -> AnnotatedMol:
"""Builds mol from Biotite's internal CCD using pdbeccdutils.

Internally uses a cached version of `mol_from_biotite_ccd` to speed up repeated
access to the same CCD entries (cache memory is limited to 500 entries).

Args:
ccd_code:
The CCD code of the component to build.

Returns:
An AnnotatedMol with atom names and (potentially) ideal/model conformers.
Despite caching, the returned Mol will be a new object that can be safely
modified without affecting future calls with the same CCD code.
"""
mol = _mol_from_biotite_ccd_cached(ccd_code)

# Copy to avoid issues with mutable objects
mol_cp = Chem.Mol(mol)

return mol_cp
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the use of the functional lru_cache around another function in line is technically correct, it looks a bit strange with the hanging docstring. It is much more common to see lru_cache as a decorator around an internal function.

Could we rewrite these two functions slightly to follow this pattern instead?

@lru_cache(maxsize=500)
def _mol_from_biotite_ccd_cached(ccd_code: str) -> AnnotatedMol:
    # internal, do not call directly — returns shared cached object
    ...

def mol_from_biotite_ccd(ccd_code: str) -> AnnotatedMol:
    """...Returns a safe copy; cached internally."""
    return Chem.Mol(_mol_from_biotite_ccd_cached(ccd_code))

return mol_cp


_get_residue_cached = lru_cache(maxsize=500)(struc.info.residue)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing with _get_residue_cached and the lru_cache

assert masks[0] is False
assert all(masks[1:])

# Featurization should succeed without NaN
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice that this test also tests featurization runs

Copy link
Copy Markdown
Collaborator

@jandom jandom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, i'm really happy this is here but some work still remains. A lot of code was added here and we need to make sure this really works.

],
}
},
"ccd_file_path": "/path/to/CCD/file.cif"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait so this will no longer be supported?

structure_array_directory: null
cache_directory: <tmp-dir>/of3_template_data/template_cache
log_directory: null
ccd_file_path: null No newline at end of file
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, still there but moved up

Comment on lines +562 to +564
ccd_path = getattr(dataset, "_biotite_ccd_path", None)
if ccd_path is not None:
update_biotite_ccd(ccd_path)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see Claude doing this pattern a lot – and I don't love it – a better pattern would be to declare _biotite_ccd_path as a nullable attribute that's defaulting to None.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto – let's vendor as little code as humanly possible, do we need to vendor all of it? Which parts are we changing?

Comment on lines +1095 to +1097
bondlist_arr = atom_array.bonds.as_array()
if not np.any(bondlist_arr[:, 2] == BondType.COORDINATION):
return atom_array
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good

return result


def _bcif_to_cif_category(bcif_category) -> CIFCategory:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this vendored code again?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is our own code

assert conf_id == 0
except ConformerGenerationError:
if ideal_conf is None:
raise
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good

Comment on lines +331 to +333
# Shouldn't be necessary for ideal coordinates but better to be safe
mol = add_conformer_atom_mask(mol)
replace_nan_coords_with_zeros(mol)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this guarding against? After conformer generation, we should either have an ideal conformer/a valid generated conformer. The shouldn't have nans, surely?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If our own conformer generation fails we fall back to the CCD-derived conformer, which can sometimes have missing coordinates. So far I've only ever seen this for "model" coordinates, not "ideal" coordinates, but technically the PDB spec indicates that it may be possible for coordinates to be "missing or incomplete" also for ideal conformers, so I thought better safe than sorry if PDB doesn't guarantee this.

I could pull this into the ideal conformer clause to be more precise, as it shouldn't apply to RDKit-generated conformers.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could pull this into the ideal conformer clause to be more precise, as it shouldn't apply to RDKit-generated conformers.

I'd be in favor of that, because i like sane logic – your explainer about would also make for a terrific comment to anybody studying this code in the future

_assert_atom_names_align_with_reference_mol(structure_with_ref_mols)


def test_conformer_fallback_to_ideal_with_partial_nan(monkeypatch):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should either be using mock.patch or monkeypatch, not both, for consistency

values = values.astype(float)
array = np.zeros(string_array.shape, dtype=values.dtype)
array[mask_bool] = values
return array
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bye-bye code always good

@jandom
Copy link
Copy Markdown
Collaborator

jandom commented Apr 28, 2026

@ljarosch wants to revisit this PR in a week or two, dropping to draft as it's still in progress

@jandom jandom marked this pull request as draft April 28, 2026 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Conformer generation struggles with organometallics like HEM [BUG] Custom CCD is not properly read

3 participants