Fix organometallic conformer gen and custom CCD support by ljarosch · Pull Request #132 · aqlaboratory/openfold-3

ljarosch · 2026-03-10T00:25:04Z

Summary

Refactors inference-time conformer logic to be pdbeccdutils-based (fixing organometallic conformer generation issues) and adds support for a custom CCD file.

Changes

1) Custom CCD support

Custom CCDs are now read in using the new function update_biotite_ccd_from_file and can be supplied in either .cif or .bcif format. This function replaces Biotite's own internal CCD with the user-supplied one, so any Biotite-based utility (residue information, bond writing, etc.) picks up the custom CCD automatically. In parallel, we introduce careful piping to make Biotite's internal CCD work with pdbeccdutils; see the note on CCD behavior going forward.
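As a rough illustration of the two accepted input formats, the sketch below dispatches on the file suffix. `classify_ccd_input` and its return labels are hypothetical names chosen for this example; they are not part of openfold3 or Biotite, and the real update_biotite_ccd_from_file may be structured differently.

```python
from pathlib import Path


def classify_ccd_input(ccd_file_path: str) -> str:
    """Hypothetical helper: decide how a user-supplied CCD file would be handled."""
    suffix = Path(ccd_file_path).suffix.lower()
    if suffix == ".bcif":
        # Already binary CIF: can be used directly (fast path).
        return "use_directly"
    if suffix == ".cif":
        # Text CIF: needs a one-time conversion to .bcif first (slow path).
        return "convert_to_bcif"
    raise ValueError(f"Unsupported CCD file format: {suffix!r}")
```

Rejecting unknown suffixes early keeps a mistyped path from surfacing later as a confusing parse error.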

Note

.cif input requires a few minutes to be converted to .bcif format under the hood. We have a script for preconverting in preprocess_ccd_biotite.py that we point to in the documentation, but could consider making it a native openfold3 command.

Warning

Biotite's internal CCD represents a module-level state, which is currently only compatible with the fork multiprocessing method. We likely want to future-proof this as newer Python versions switch the default to forkserver. This PR introduces a _worker_init_with_ccd for this purpose which will reapply the biotite CCD change to each worker, but we have upstream DataModule issues blocking actual compatibility with forkserver/spawn that seemed out of scope for this PR.
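The worker-initializer idea can be sketched with the standard library alone. Everything here is an assumption for illustration: `_ACTIVE_CCD_PATH` stands in for Biotite's module-level CCD state, and `apply_ccd`, `init_worker`, and `make_pool` are hypothetical names, not the PR's `_worker_init_with_ccd`.

```python
import multiprocessing as mp

# Stand-in for module-level state such as Biotite's internal CCD (assumption).
_ACTIVE_CCD_PATH = None


def apply_ccd(ccd_path):
    """Hypothetical setter that mutates module-level state."""
    global _ACTIVE_CCD_PATH
    _ACTIVE_CCD_PATH = ccd_path


def init_worker(ccd_path):
    """Runs once per worker and re-applies state that spawn/forkserver do not inherit."""
    if ccd_path is not None:
        apply_ccd(ccd_path)


def active_ccd_in_worker(_):
    """Task function: report the state visible inside the current process."""
    return _ACTIVE_CCD_PATH


def make_pool(ccd_path, num_workers=2, start_method="spawn"):
    """Create a pool whose workers all call init_worker before taking tasks."""
    ctx = mp.get_context(start_method)
    return ctx.Pool(num_workers, initializer=init_worker, initargs=(ccd_path,))
```

Under fork the workers inherit the parent's module state, so the initializer is redundant but harmless; under spawn or forkserver it is what makes the state visible in each worker.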

Note about CCD behavior going forward

The idea going forward is that the canonical CCD is always handled through Biotite's internally stored one: If a CIFFile-like object is needed as in the template code, we use BiotiteCCDWrapper. If interfacing with pdbeccdutils is required, we can use mol_from_biotite_ccd, which will handle the piping from the Biotite CCD bCIF format into the raw CIF format that gemmi and pdbeccdutils expect.

While Biotite's internal CCD format and these conversions are a little awkward, it seemed worse to implement lots of parallel code flows depending on which input format is supplied, so this consolidation was the best solution.

2) Basing conformer processing on `pdbeccdutils`

pdbeccdutils contains useful input sanitization that will rescue the conformer generation of organometallics by setting proper coordination bond types. We also rely on this library for training-time conformer processing. This switches inference conformer processing from pure Biotite logic back to pdbeccdutils logic, using processed_reference_molecule_from_ccd_code and mol_from_biotite_ccd. As pdbeccdutils adds some slight overhead due to running more functionalities than we technically require, this PR adds caching to these functions to speed them up.

Note

We might decide on getting rid of pdbeccdutils altogether in the future, but its molecule sanitization proved a bit tricky to track down and maintain ourselves, so for now it seemed better to rely on a PDBe-developed library.

For additional robustness, this PR also adds support for fallback to CCD-derived ideal coordinates in case conformer generation fails.
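A minimal sketch of that fallback, assuming a zero-argument `generate` callable in place of the real RDKit-based generation. `coords_with_fallback` and `failing_generator` are hypothetical names, and the exception class is a local stand-in for the one used in the codebase.

```python
class ConformerGenerationError(Exception):
    """Local stand-in for the error raised when conformer generation fails."""


def coords_with_fallback(generate, ideal_coords):
    """Return (coords, source), falling back to ideal coordinates on failure.

    `generate` is any zero-argument callable producing coordinates
    (hypothetical; the real code calls into RDKit).
    """
    try:
        return generate(), "generated"
    except ConformerGenerationError:
        if ideal_coords is None:
            # Nothing to fall back to: surface the original error.
            raise
        return ideal_coords, "ideal"


def failing_generator():
    """Simulates a conformer generation failure."""
    raise ConformerGenerationError("embedding failed")
```

Returning the source label alongside the coordinates lets the caller apply extra cleanup only to the fallback path.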

3) Proactive dative bond conversion in CIF output

pdbeccdutils introduces COORDINATION bond types that Biotite cannot serialize in the _chem_comp_bond table. Previously, _create_cif_file caught the resulting KeyError and retried after converting bonds. This replaces that try/except with a proactive conversion of intra-residue dative bonds to SINGLE before writing, avoiding throwing a warning and making the behavior explicit.
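The proactive conversion can be illustrated without Biotite. In this sketch, `BondKind` is a stand-in enum with arbitrary values (not Biotite's BondType), bonds are plain tuples, and `convert_intra_residue_dative` is a hypothetical name; the actual implementation operates on a Biotite bond list.

```python
from enum import IntEnum


class BondKind(IntEnum):
    """Minimal stand-in for a bond-type enum; the values are arbitrary."""
    SINGLE = 1
    DOUBLE = 2
    COORDINATION = 7


def convert_intra_residue_dative(bonds, res_ids):
    """Return a bond list with intra-residue COORDINATION bonds set to SINGLE.

    bonds:   list of (atom_i, atom_j, BondKind)
    res_ids: residue ID for each atom index
    """
    if not any(kind == BondKind.COORDINATION for _, _, kind in bonds):
        return bonds  # nothing to convert, return the input unchanged
    converted = []
    for i, j, kind in bonds:
        # Only bonds within one residue are rewritten; inter-residue ones stay.
        if kind == BondKind.COORDINATION and res_ids[i] == res_ids[j]:
            kind = BondKind.SINGLE
        converted.append((i, j, kind))
    return converted
```

Checking up front whether any coordination bond exists keeps the common case, a structure without metals, free of extra work.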

Related Issues

Closes #109
Closes #110

Testing

Added tests to test_structure_from_query.py:

HEM ligand conformer generation (regression test for organometallics)
Custom CCD with modified atom order, verified through both the query and direct ccd_code input paths
Conformer fallback to CCD Ideal coordinates with partial NaN handling

Tested full inference workflow with:

Standard inference and HEM conformer

query.json

{
	"queries": {
		"protein_with_hem": {
			"chains": [
				{
					"molecule_type": "protein",
					"chain_ids": "A",
					"sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
				},
				{
					"molecule_type": "ligand",
					"chain_ids": "B",
					"ccd_codes": "HEM"
				}
			]
		},
		"protein_with_atp": {
			"chains": [
				{
					"molecule_type": "protein",
					"chain_ids": "A",
					"sequence": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT"
				},
				{
					"molecule_type": "ligand",
					"chain_ids": "B",
					"ccd_codes": "ATP"
				}
			]
		}
	}
}

runner.yml

data_module_args:
  num_workers: 2

model_update:
  presets:
    - predict
    - low_mem
    - pae_enabled

Custom CCD

CCD file was modified to contain a ligand named TEST1.

query.json

{
	"queries": {
		"protein_with_test1": {
			"chains": [
				{
					"molecule_type": "protein",
					"chain_ids": "A",
					"sequence": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT"
				},
				{
					"molecule_type": "ligand",
					"chain_ids": "B",
					"ccd_codes": "TEST1"
				}
			]
		}
	}
}

cif-based runner.yml

# Test runner for custom CCD via .cif path
# Tests that the .cif -> .bcif auto-conversion works end-to-end

data_module_args:
  num_workers: 0

dataset_config_kwargs:
  ccd_file_path: ./custom_components.cif

model_update:
  presets:
    - predict
    - low_mem
    - pae_enabled

bcif-based runner.yml

# Test runner for custom CCD via .bcif path (pre-converted, fast)
# Tests the direct .bcif code path

data_module_args:
  num_workers: 0

dataset_config_kwargs:
  ccd_file_path: ./custom_components.bcif

model_update:
  presets:
    - predict
    - low_mem
    - pae_enabled

jandom · 2026-03-10T08:28:07Z

Nice work Lukas, I can see a touch of CC here and there :P

jnwei

Overall really nice work @ljarosch ! The CCD issue / organometallic conformers has been a nasty issue to debug and iron out. I think your overall strategy of preferring the biotite CCD and then converting custom CCDs into biotite format makes sense, thank you for providing the explanation.

One main question regarding how to use the custom CCD: If a user has a structure for one molecule (for example, a custom ligand) but wishes to use standard structures from the CCD for everything else (e.g. for standard amino acids), will the custom CCD be "appended" to the standard CCD? Or will the user need to provide their own CCD with the standard amino acids included?

It could be helpful to provide an explicit example of using a custom CCD in our huggingface directory, so that we give users a model of how to use this route in addition to documentation. Perhaps this example could be the example you used in inference testing.

Otherwise, I have just a few small fixes in the caching definitions and tests.

jnwei · 2026-03-10T08:07:38Z

+        self._biotite_ccd_path = update_biotite_ccd_from_file(
+            dataset_config.ccd_file_path
+        )


Will this run into an issue if the user provided CCD only contains customized structures, but not standard structures like the residues?

jnwei · 2026-03-10T08:12:35Z

If this file contains content copied from the Biotite library, can we add a reference about which file(s) were copied from biotite, and what our changes are on top of their functions?

Ditto – let's vendor as little code as humanly possible, do we need to vendor all of it? Which parts are we changing?

Yeah I can be more specific about it. Re code vendoring, I think we could remove some of these functions. All functions below concatenate_ccd are taken from biotite directly, and concatenate_ccd itself was only slightly modified to work with a local path instead of pulling from the web.

Re code vendoring I was thinking that adding these util functions makes it self-contained, as they are all "private" functions in biotite, so maybe not API-stable. But happy to change this back to importing them, it does introduce some awkwardness and I'm unsure about the best practice.

jnwei · 2026-03-10T08:16:08Z

+
+    assert atom_names == CUSTOM_CCD_EXPECTED_ATOM_NAMES
+    assert reference_atom_names == CUSTOM_CCD_EXPECTED_ATOM_NAMES
+    assert atom_names == reference_atom_names


nit: This 3rd assertion seems redundant with the two assertions before it

jnwei · 2026-03-10T08:25:05Z

+        "structure_with_ref_mol_from_ccd_code_aligns_atom_order",
+    ],
+)
+def test_ligand_ccd_paths_respect_custom_ccd_and_atom_order(input_type):


nit: Since the verification is just one line anyway using assert_atom_names_align_with_reference_mol, does it make sense to write this as two different tests, rather than one test with parameterization?

jnwei · 2026-03-10T08:25:44Z

+    _assert_atom_names_align_with_reference_mol(structure_with_ref_mols)
+
+
+def test_conformer_fallback_to_ideal_with_partial_nan(monkeypatch):


nit: monkeypatch seems unused here in favor of unittest.mock.patch, can we remove it?

we should either be using mock.patch or monkeypatch, not both, for consistency

right sorry, that's just a leftover I forgot to clean up
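For reference, forcing a failure path with unittest.mock alone, without the pytest monkeypatch fixture, might look like the sketch below. `Embedder`, `embed_or_none`, and `run_patched_check` are hypothetical stand-ins and do not mirror the actual test.

```python
from unittest import mock


class Embedder:
    """Hypothetical stand-in for the object whose method the test replaces."""

    def embed(self, ccd_code):
        return f"conformer for {ccd_code}"


def embed_or_none(embedder, ccd_code):
    """Code under test: swallow the failure and signal it with None."""
    try:
        return embedder.embed(ccd_code)
    except RuntimeError:
        return None


def run_patched_check():
    embedder = Embedder()
    # The patch is active only inside the context manager and is undone on exit.
    with mock.patch.object(Embedder, "embed", side_effect=RuntimeError("forced")):
        failed = embed_or_none(embedder, "HEM")
    restored = embed_or_none(embedder, "HEM")
    return failed, restored
```

Because the context manager restores the original method, the test function needs no extra fixture argument.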

jnwei · 2026-03-10T08:36:42Z


 reference_data_path = Path(__file__).parent / "test_data" / "structure_from_query"

+# Custom CCD with alanine renamed to "GLU", with atoms in deliberately non-standard row


This is a good test case. I like that the full cif is written in the test file for easy viewing within this test.

To expand on my question regarding the usage of custom CCDs: Perhaps we could include one more test case that includes both a custom structure (provided by this custom CCD cif file) and a standard residue? So for example, Valine + custom GLU?

jnwei · 2026-03-10T08:52:45Z

+_mol_from_biotite_ccd_cached = lru_cache(maxsize=500)(mol_from_biotite_ccd)
+"""Cached version to speed up repeated access to same CCD entries."""
+
+
+def mol_from_biotite_ccd_cached(ccd_code: str) -> AnnotatedMol:
+    """Builds mol from Biotite's internal CCD using pdbeccdutils.
+
+    Internally uses a cached version of `mol_from_biotite_ccd` to speed up repeated
+    access to the same CCD entries (cache memory is limited to 500 entries).
+
+    Args:
+        ccd_code:
+            The CCD code of the component to build.
+
+    Returns:
+        An AnnotatedMol with atom names and (potentially) ideal/model conformers.
+        Despite caching, the returned Mol will be a new object that can be safely
+        modified without affecting future calls with the same CCD code.
+    """
+    mol = _mol_from_biotite_ccd_cached(ccd_code)
+
+    # Copy to avoid issues with mutable objects
+    mol_cp = Chem.Mol(mol)
+
+    return mol_cp


While the use of the functional lru_cache around another function in line is technically correct, it looks a bit strange with the hanging docstring. It is much more common to see lru_cache as a decorator around an internal function.

Could we rewrite these two functions slightly to follow this pattern instead?

@lru_cache(maxsize=500)
def _mol_from_biotite_ccd_cached(ccd_code: str) -> AnnotatedMol:
    # internal, do not call directly: returns shared cached object
    ...

def mol_from_biotite_ccd(ccd_code: str) -> AnnotatedMol:
    """...Returns a safe copy; cached internally."""
    return Chem.Mol(_mol_from_biotite_ccd_cached(ccd_code))

jnwei · 2026-03-10T08:53:10Z

+    return mol_cp
+
+
+_get_residue_cached = lru_cache(maxsize=500)(struc.info.residue)


Same thing with _get_residue_cached and the lru_cache

jnwei · 2026-03-10T09:07:04Z

+    assert masks[0] is False
+    assert all(masks[1:])
+
+    # Featurization should succeed without NaN


nice that this test also tests featurization runs

jandom

Overall, I'm really happy this is here, but some work still remains. A lot of code was added here and we need to make sure this really works.

jandom · 2026-03-10T11:04:19Z

            ],
        }
-    },
-    "ccd_file_path": "/path/to/CCD/file.cif"


Wait so this will no longer be supported?

jandom · 2026-03-10T11:05:04Z

  structure_array_directory: null
  cache_directory: <tmp-dir>/of3_template_data/template_cache
  log_directory: null
-  ccd_file_path: null


Ah, still there but moved up

jandom · 2026-03-10T11:06:29Z

+        ccd_path = getattr(dataset, "_biotite_ccd_path", None)
+        if ccd_path is not None:
+            update_biotite_ccd(ccd_path)


I see Claude doing this pattern a lot – and I don't love it – a better pattern would be to declare _biotite_ccd_path as a nullable attribute that's defaulting to None.
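The alternative described here could look like the following sketch. `DatasetSketch` and `ccd_path_for_worker` are hypothetical names; the point is only that the attribute is always defined, so no getattr default is needed.

```python
from pathlib import Path
from typing import Optional


class DatasetSketch:
    """Hypothetical dataset; only the attribute declaration matters here."""

    def __init__(self, ccd_file_path: Optional[str] = None):
        # Always defined, so callers never need getattr(..., None).
        self._biotite_ccd_path: Optional[Path] = (
            Path(ccd_file_path) if ccd_file_path is not None else None
        )


def ccd_path_for_worker(dataset: DatasetSketch) -> Optional[Path]:
    """Plain attribute access; None means no custom CCD was configured."""
    return dataset._biotite_ccd_path
```

With the attribute declared up front, a misspelled name raises AttributeError instead of silently evaluating to None.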

jandom · 2026-03-10T11:07:17Z

Ditto – let's vendor as little code as humanly possible, do we need to vendor all of it? Which parts are we changing?

jandom · 2026-03-10T11:08:18Z

+    bondlist_arr = atom_array.bonds.as_array()
+    if not np.any(bondlist_arr[:, 2] == BondType.COORDINATION):
+        return atom_array


This is good

jandom · 2026-03-10T11:08:35Z

            return result
+
+
+def _bcif_to_cif_category(bcif_category) -> CIFCategory:


Is this vendored code again?

No this is our own code

jandom · 2026-03-10T11:09:32Z

+        assert conf_id == 0
+    except ConformerGenerationError:
+        if ideal_conf is None:
+            raise


This is good

jandom · 2026-03-10T11:10:31Z

+    # Shouldn't be necessary for ideal coordinates but better to be safe
+    mol = add_conformer_atom_mask(mol)
+    replace_nan_coords_with_zeros(mol)


What's this guarding against? After conformer generation, we should have either an ideal conformer or a valid generated conformer. These shouldn't have NaNs, surely?

If our own conformer generation fails we fall back to the CCD-derived conformer, which can sometimes have missing coordinates. So far I've only ever seen this for "model" coordinates, not "ideal" coordinates, but technically the PDB spec indicates that it may be possible for coordinates to be "missing or incomplete" also for ideal conformers, so I thought better safe than sorry if PDB doesn't guarantee this.

I could pull this into the ideal conformer clause to be more precise, as it shouldn't apply to RDKit-generated conformers.

I'd be in favor of that, because I like sane logic. Your explanation above would also make a terrific comment for anybody studying this code in the future.
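The masking step under discussion can be sketched in pure Python. `mask_and_zero_nan` is a hypothetical stand-in for the combination of add_conformer_atom_mask and replace_nan_coords_with_zeros, which operate on RDKit molecules in the actual code.

```python
import math


def mask_and_zero_nan(coords):
    """Return (clean_coords, mask); mask[i] is False where atom i had a NaN.

    coords: list of (x, y, z) tuples, e.g. CCD ideal coordinates that may
    be partially missing.
    """
    mask = [not any(math.isnan(c) for c in xyz) for xyz in coords]
    clean = [xyz if ok else (0.0, 0.0, 0.0) for xyz, ok in zip(coords, mask)]
    return clean, mask
```

Keeping the mask next to the zero-filled coordinates lets downstream featurization tell a real atom at the origin from a missing one.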

jandom · 2026-03-10T11:11:17Z

+    _assert_atom_names_align_with_reference_mol(structure_with_ref_mols)
+
+
+def test_conformer_fallback_to_ideal_with_partial_nan(monkeypatch):


we should either be using mock.patch or monkeypatch, not both, for consistency

jandom · 2026-03-10T11:11:32Z

-            values = values.astype(float)
-    array = np.zeros(string_array.shape, dtype=values.dtype)
-    array[mask_bool] = values
-    return array


Bye-bye code always good

jandom · 2026-04-28T14:31:02Z

@ljarosch wants to revisit this PR in a week or two, dropping to draft as it's still in progress

ljarosch added 15 commits March 3, 2026 12:20

Add biotite CCD overwrite & new query-builder

9453bc2

Add documentation

7219081

Add unit tests

1271e69

Merge branch 'public-main' into ljarosch/2026-02-12/fix/issue-109-110…

09bcf7e

…-conformer-custom-ccd

Simplify tmpdir management and improve clarity

393fe86

Apply formatting

a5bb17d

Improve coordination bond handling

b5500f2

Merge branch 'public-main' into ljarosch/2026-02-12/fix/issue-109-110…

3b21379

…-conformer-custom-ccd

Simplify conformer cleanup

300bfcd

Reuse category variable

ee44679

Refactor redundant code

1fd3d53

Improve logging

f3e25f3

Add conformer fallback

caa81b9

Move nested function to class-level function

76c0e9a

Change nested to top-level imports

d28bd2a

ljarosch requested review from jandom and jnwei March 10, 2026 00:39

ljarosch changed the title from "Ljarosch/2026 02 12/fix/issue 109 110 conformer custom ccd" to "Fix organometallic conformer gen and custom CCD support" Mar 10, 2026

jnwei requested changes Mar 10, 2026

jandom requested changes Mar 10, 2026

jandom marked this pull request as draft April 28, 2026 14:31
