63 changes: 61 additions & 2 deletions docs/source/Inference.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,12 +24,12 @@ Supported:
- Template-based prediction
- using ColabFold template alignments
- using pre-computed template alignments
- using direct CIF template files (no alignments required)
- Non-canonical residues

Coming soon:

- Covalently modified residues and other cross-chain covalent bonds
- User-specified template structures (as opposed to top 4)

### 1.2 DNA

Expand Down Expand Up @@ -301,6 +301,61 @@ model_update:

---

(inference-cif-direct-templates)=
#### 🧬 CIF Direct Template Mode

OpenFold3 supports providing template structures directly as CIF files without requiring pre-computed template alignments. In this mode, the system automatically:
1. Parses each provided CIF file
2. Extracts all chains and their sequences
3. Aligns each chain to your query sequence
4. Selects the best-matching chain by its score (sequence identity × coverage)

This is particularly useful for stateless inference environments or when you have specific template structures but no alignment files.
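
The score used in step 4 can be illustrated with a minimal Python sketch. It assumes that identity is the fraction of aligned (non-gap) columns whose residues match and that coverage is the fraction of query residues aligned to a template residue. These definitions are assumptions made for illustration and may differ from what OpenFold3 computes; `score_alignment` is a hypothetical helper, not part of the OpenFold3 API.

```python
def score_alignment(query_aln: str, template_aln: str) -> float:
    """Return sequence identity x coverage for two gapped rows of equal length.

    Assumed definitions (not taken from the OpenFold3 source):
      identity = matching columns / columns where both rows have a residue
      coverage = columns where both rows have a residue / query residues
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned rows must have the same length")

    query_len = sum(1 for q in query_aln if q != "-")
    aligned = 0
    matches = 0
    for q, t in zip(query_aln, template_aln):
        if q == "-" or t == "-":
            continue  # a gap in either row is not an aligned column
        aligned += 1
        if q == t:
            matches += 1

    if query_len == 0 or aligned == 0:
        return 0.0
    return (matches / aligned) * (aligned / query_len)


# Hypothetical alignment: 8 of 10 query residues aligned, 7 of them identical.
example_score = score_alignment("MKLLVVDDAG", "MK--VVDEAG")
```

Under these definitions the product reduces to matching columns divided by query length, so a short but perfect local match scores lower than a full-length match of the same identity.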

**Usage:**

In your query JSON, specify `template_cif_paths` instead of `template_alignment_file_path`:

```json
{
  "queries": {
    "my_query": {
      "chains": [
        {
          "molecule_type": "protein",
          "chain_ids": ["A"],
          "sequence": "MKLLVVDDAGQKFT...",
          "template_cif_paths": [
            "path/to/template1.cif",
            "path/to/template2.cif",
            "path/to/template3.cif"
          ],
          "template_cif_chain_ids": ["A", null, "B"]
        }
      ]
    }
  }
}
```

Optionally, use `template_cif_chain_ids` to specify which chain to use from each CIF file. Use `null` to let the system automatically select the best-matching chain.
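
The pairing between the two lists can be sketched in Python as follows. This is an illustrative helper under the rules stated in this document (equal lengths when `template_cif_chain_ids` is given, `null` meaning automatic selection); `pair_templates_with_chains` is a hypothetical name and not an OpenFold3 function.

```python
from typing import Optional


def pair_templates_with_chains(
    cif_paths: list[str],
    chain_ids: Optional[list[Optional[str]]] = None,
) -> list[tuple[str, Optional[str]]]:
    """Pair each CIF path with its requested chain ID (None = auto-select)."""
    if chain_ids is None:
        # Field omitted: every file falls back to automatic chain selection.
        return [(path, None) for path in cif_paths]
    if len(chain_ids) != len(cif_paths):
        raise ValueError(
            f"template_cif_chain_ids has {len(chain_ids)} entries, "
            f"but template_cif_paths has {len(cif_paths)}"
        )
    return list(zip(cif_paths, chain_ids))


pairs = pair_templates_with_chains(
    ["path/to/template1.cif", "path/to/template2.cif", "path/to/template3.cif"],
    ["A", None, "B"],  # JSON null is None in Python
)
```

Here the second file has no chain ID, so its chain would be chosen automatically.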

**Configuration:**

You can adjust the minimum score threshold for chain selection in your `runner.yml`:

```yaml
template_preprocessor_settings:
  cif_direct_min_score: 0.1  # Default: 0.1 (seq_identity × coverage)
```

**Notes:**
- For multi-chain CIF files, only the best matching chain per file is used as a template
- The `template_cif_paths` field cannot be used together with `template_alignment_file_path`
- This mode is currently supported for protein chains only
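
A quick pre-flight check of these constraints could look like the sketch below. It only mirrors the rules listed in the notes; `check_cif_direct_chain` is a hypothetical helper, and the validation that OpenFold3 itself performs may be stricter or report errors differently.

```python
def check_cif_direct_chain(chain: dict) -> None:
    """Raise ValueError if a chain entry breaks the CIF direct mode rules."""
    if not chain.get("template_cif_paths"):
        return  # CIF direct mode is not requested for this chain.
    if chain.get("template_alignment_file_path") is not None:
        raise ValueError(
            "template_cif_paths cannot be combined with "
            "template_alignment_file_path"
        )
    if chain.get("molecule_type") != "protein":
        raise ValueError("CIF direct mode is supported for protein chains only")


# Passes silently: a protein chain that only sets template_cif_paths.
check_cif_direct_chain(
    {"molecule_type": "protein", "template_cif_paths": ["path/to/template1.cif"]}
)
```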

---

### 3.4 Customized ColabFold MSA Server Settings Using `runner.yml`

All settings for the ColabFold server and outputs can be set under [`msa_computation_settings`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/tools/colabfold_msa_server.py#L904)
Expand Down Expand Up @@ -478,9 +533,13 @@ This file representing the full input query in a validated internal format defin

- `template_alignment_file_path`: Path to the preprocessed template cache entry `.npz` file used for template featurization. By default, template cache entries are automatically created in a short preprocessing step using the raw template alignment files provided under this same field and the template structures identified in the alignment.

- `template_cif_paths`: List of paths to CIF template files when using {ref}`CIF direct template mode <inference-cif-direct-templates>`. This field is mutually exclusive with `template_alignment_file_path`.

- `template_cif_chain_ids`: List of chain IDs to use from each corresponding CIF file in `template_cif_paths`. Use `null` for entries where automatic chain selection is desired. Must have the same length as `template_cif_paths` if provided.

- `template_entry_chain_ids`: List of template chains, identified by their entry (typically PDB) IDs and chain IDs, used for featurization. By default, up to the first 4 of these chains are used.

Note: Refer to the {doc}`Template How-To Documentation <template_how_to>` for how to specify these fields if you want to use precomputed template alignments instead of Colabfold alignments for template inputs.
Note: Refer to the {doc}`Template How-To Documentation <template_how_to>` for how to specify these fields if you want to use precomputed template alignments instead of ColabFold alignments for template inputs, or see {ref}`CIF Direct Template Mode <inference-cif-direct-templates>` to use template structures directly without alignments.

Note: If MSA and template files are persisted between runs, the same `inference_query_set.json` file can be used to resubmit the query without needing to rerun the template and MSA pipelines. To do so:

15 changes: 15 additions & 0 deletions docs/source/input_format.md
Expand Up @@ -64,6 +64,8 @@ All chains must define a unique ```chain_ids``` field and appropriate sequence o
"paired_msa_file_paths": "/absolute/path/to/paired_msas",
"template_alignment_file_path": "/absolute/path/to/template_msa",
"template_entry_chain_ids": ["entry1_A", "entry2_B", "entry3_A"],
"template_cif_paths": ["/path/to/template1.cif", "/path/to/template2.cif"],
"template_cif_chain_ids": ["A", null],
}
```

Expand Down Expand Up @@ -119,6 +121,19 @@ All chains must define a unique ```chain_ids``` field and appropriate sequence o
- Use this field only when running inference with **precomputed alignments**. See the {doc}`Running with Templates Documentation <template_how_to>` for details.
- If using the ColabFold MSA server, this field is automatically populated and will **override any user-provided path**.

- `template_cif_paths` *(list[str], optional, default = null)*
- List of paths to CIF files to use as templates for this chain.
- Enables **CIF-direct template mode**, which parses templates directly from CIF files without requiring pre-computed alignments.
- Alignments are computed on-the-fly using Kalign.
- This is useful when you have known template structures but no pre-computed MSA/template alignments.
- Example: `["/path/to/template1.cif", "/path/to/template2.cif"]`

- `template_cif_chain_ids` *(list[str | null], optional, default = null)*
- List of chain IDs to use from each corresponding CIF file in `template_cif_paths`.
- Must have the same length as `template_cif_paths` if provided.
- Use `null` for a specific entry to let the parser automatically select the best-matching chain.
- Example: `["A", null, "B"]` - uses chain A from the first CIF, auto-selects from the second, and uses chain B from the third.
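
When the query file is generated from a script, the two fields can be written with Python's standard `json` module, as in the sketch below. The query name, sequence, and paths are placeholders chosen for illustration; the point shown is that Python `None` serializes to JSON `null`, which requests automatic chain selection.

```python
import json

# Placeholder values for illustration only; substitute your own sequence and paths.
query_set = {
    "queries": {
        "my_query": {
            "chains": [
                {
                    "molecule_type": "protein",
                    "chain_ids": ["A"],
                    "sequence": "ACDEFGHIKLMNPQRSTVWY",
                    "template_cif_paths": [
                        "/path/to/template1.cif",
                        "/path/to/template2.cif",
                    ],
                    # None becomes null in the JSON output.
                    "template_cif_chain_ids": ["A", None],
                }
            ]
        }
    }
}

query_json = json.dumps(query_set, indent=2)
```

The resulting string can then be written to the query JSON file passed to inference.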

### 3.2. RNA Chains

```
104 changes: 94 additions & 10 deletions docs/source/template_how_to.md
@@ -1,6 +1,13 @@
# Running OpenFold3 Inference with Templates

This document contains instructions on how to use template information for OF3 predictions. Here, we assume that you already generated all of your template alignments or intend to fetch them from Colabfold on-the-fly. If you do not have any precomputed template alignments and do not want to use Colabfold, refer to our {doc}`MSA Generation Guide <precomputed_msa_generation_how_to>` before consulting this document. If you need further clarifications on how some of the template components of our inference pipeline work, refer to {doc}`this explanatory document <template_explanation>`.
This document contains instructions on how to use template information for OF3 predictions. OpenFold3 supports two template modes:

1. **Alignment-based templates** (traditional): Requires template alignments and template structures
2. **CIF direct templates** (simplified): Requires only template CIF files, no alignments needed

For alignment-based templates, we assume you have already generated all of your template alignments or intend to fetch them from ColabFold on the fly. If you do not have any precomputed template alignments and do not want to use ColabFold, refer to our {doc}`MSA Generation Guide <precomputed_msa_generation_how_to>` before consulting this document.

If you need further clarifications on how some of the template components of our inference pipeline work, refer to {doc}`this explanatory document <template_explanation>`.

The template pipeline currently supports monomeric templates and has been tested for protein chains only.

Expand All @@ -12,10 +19,18 @@ The main steps detailed in this guide are:
(1-template-files)=
## 1. Template Files

Template featurization requires query-to-template **alignments** and template **structures**.
OpenFold3 supports two modes for providing template information:

### Alignment-Based Mode (Traditional)
Requires query-to-template **alignments** and template **structures**. Sections 1.1 and 1.2 below describe the required file formats.

### CIF Direct Mode (Simplified)
Requires only template **CIF files**. The system automatically aligns template chains to your query sequence and selects the best matching chain. See {ref}`Section 2.3 <23-cif-direct-templates>` for usage details.

---

(11-template-aligment-file-format)=
### 1.1. Template Aligment File Format
### 1.1. Template Alignment File Format (Alignment-Based Mode)

Template alignments can be provided in either `sto`, `a3m` or `m8` format. Template alignments from the Colabfold server are in `m8` format.

Expand Down Expand Up @@ -73,16 +88,18 @@ query_A template_C 71.4 14 4 0 5 18 75 88 2e-03 22.3

Note that since `m8` files do not provide actual alignments, we only use them to identify which structure files to get templates from, retrieve sequences from these structure files and always realign them to the query sequence using Kalign. More on this in the [template processing explanatory document](template_explanation.md).

### 1.2. Template Structure File Format
### 1.2. Template Structure File Format (Alignment-Based Mode)

For alignment-based templates, template structures can currently be provided only in `cif` format. An upcoming release will add support for parsing templates from `pdb` files.

Template structures currently can only be provided in `cif` format. An upcoming release will add support for parsing templates from `pdb` files.
**Note:** For {ref}`CIF direct mode <23-cif-direct-templates>`, template CIF files are specified directly in the query JSON without separate structure directories.

(2-specifying-template-information-in-the-inference-query-file)=
## 2. Specifying Template Information in the Inference Query File

### 2.1. Specifying Alignments
### 2.1. Specifying Alignments (Alignment-Based Mode)

The data pipeline needs to know which template alignment to use for which chain. This information is provided by specifying the {ref}`paths to the alignments <31-protein-chains>` for each chain's `template_alignment_file_path` field in the inference query json file.
For alignment-based templates, the data pipeline needs to know which template alignment to use for which chain. This information is provided by specifying the {ref}`paths to the alignments <31-protein-chains>` for each chain's `template_alignment_file_path` field in the inference query json file.

Note that when fetching alignments from the Colabfold server, `template_alignment_file_path` fields are automatically populated.

Expand Down Expand Up @@ -118,9 +135,9 @@ Note that when fetching alignments from the Colabfold server, `template_alignmen
</code></pre>
</details>

### 2.2. Using Specific Templates
### 2.2. Using Specific Templates (Alignment-Based Mode)

By default, the template pipeline automatically populates the `template_entry_chain_ids` field with [n templates](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/pipelines/preprocessing/template.py#L1535) from the alignment, which is then further subset to the [top k templates](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/projects/of3_all_atom/config/dataset_config_components.py#L116) during featurization for inference.
By default, for alignment-based templates, the template pipeline automatically populates the `template_entry_chain_ids` field with [n templates](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/pipelines/preprocessing/template.py#L1535) from the alignment, which is then further subset to the [top k templates](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/projects/of3_all_atom/config/dataset_config_components.py#L116) during featurization for inference.

In an **upcoming release**, we will add support for specifying *specific templates* for the data pipeline to use for featurization. This will be possible through the `template_entry_chain_ids` field:

Expand Down Expand Up @@ -156,10 +173,77 @@ entry3_A MK----DDARGQGKFT
//
```

(23-cif-direct-templates)=
### 2.3. CIF Direct Templates (No Alignments Required)

OpenFold3 supports providing template structures directly as CIF files without requiring pre-computed template alignments. This is particularly useful for:
- Stateless inference environments (e.g., NVIDIA Inference Microservices)
- Quick predictions when you have specific template structures
- Simplified workflows without external alignment tools

#### How It Works

In CIF direct mode, the system automatically:
1. Parses each provided CIF file to extract all chains and their sequences
2. Aligns each chain sequence to your query sequence (alignments are computed on the fly with Kalign)
3. Scores each chain by `sequence_identity × coverage`
4. Selects the best matching chain as the template (if score ≥ minimum threshold)

For multi-chain CIF files, only the best matching chain per file is used.
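
The selection rule in steps 3 and 4 can be sketched as follows, assuming the per-chain scores have already been computed. `select_best_chain` is a hypothetical helper written for illustration; it is not the function OpenFold3 uses, and tie-breaking between equally scored chains is not specified here.

```python
from typing import Optional


def select_best_chain(
    chain_scores: dict[str, float],
    min_score: float = 0.1,
) -> Optional[str]:
    """Return the highest-scoring chain ID, or None if none reaches min_score."""
    if not chain_scores:
        return None
    best_chain = max(chain_scores, key=chain_scores.get)
    if chain_scores[best_chain] < min_score:
        return None  # no chain in this file qualifies as a template
    return best_chain


# Hypothetical scores for a three-chain CIF file.
best = select_best_chain({"A": 0.05, "B": 0.62, "C": 0.31})
```

With these made-up scores, chain `B` is selected; a file whose chains all score below the threshold contributes no template.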

#### Usage Example

Specify `template_cif_paths` instead of `template_alignment_file_path` in your query JSON:

```json
{
  "queries": {
    "my_protein": {
      "chains": [
        {
          "molecule_type": "protein",
          "chain_ids": ["A", "B"],
          "sequence": "XRMKQLEDKVEELLSKNYHLENEVARLKKLVGER",
          "template_cif_paths": [
            "templates/1dgc.cif",
            "templates/1ysa.cif",
            "templates/1zta.cif"
          ]
        }
      ]
    }
  }
}
```

**Example query files:**
- [Homomer with direct CIF templates](https://github.com/aqlaboratory/openfold-3/blob/main/examples/example_inference_inputs/query_homomer_with_direct_cif_templates.json)
- [Multimer with direct CIF templates](https://github.com/aqlaboratory/openfold-3/blob/main/examples/example_inference_inputs/query_multimer_with_direct_cif_templates.json)

#### Configuration

Adjust the minimum score threshold for chain selection in your `runner.yml`:

```yaml
template_preprocessor_settings:
  cif_direct_min_score: 0.1  # Default: 0.1 (seq_identity × coverage)
```

Only chains whose score (sequence identity × coverage) is at or above this threshold are considered valid templates.

#### Important Notes

- The `template_cif_paths` field is **mutually exclusive** with `template_alignment_file_path` - you must use one or the other, not both
- Template structures must be in CIF format
- Currently supported for protein chains only
- For multi-chain CIF files, the system automatically selects the best matching chain per file

(3-optimizations-for-high-throughput-workflows)=
## 3. Optimizations for High-Throughput Workflows

For high-throughput use cases, where a large number of structures are to be predicted, template processing can take a significant amount of time even with the built-in {doc}`deduplication utility <template_explanation>` we have for template alignment and structure processing. To avoid having to spend GPU compute on data transformations, we provide separate template preprocessing scripts to generate the necessary inputs from which template featurization can run efficiently in a subsequent job without being a bottleneck to the model forward pass.
**Note:** The optimizations described in this section apply to **alignment-based templates**. If you're using {ref}`CIF direct templates <23-cif-direct-templates>`, the workflow is already simplified and these preprocessing steps are not necessary.

For high-throughput use cases with alignment-based templates, where a large number of structures are to be predicted, template processing can take a significant amount of time even with the built-in {doc}`deduplication utility <template_explanation>` for template alignment and structure processing. To avoid spending GPU compute on data transformations, we provide separate template preprocessing scripts that generate the required inputs ahead of time, so that template featurization in a subsequent job does not bottleneck the model forward pass.

### 3.1. Template Alignment Preprocessing

11 changes: 11 additions & 0 deletions openfold3/core/data/framework/data_module.py
Expand Up @@ -41,6 +41,7 @@

import dataclasses
import enum
import logging
import random
import warnings
from typing import Any
Expand Down Expand Up @@ -74,6 +75,7 @@
from openfold3.core.utils.tensor_utils import dict_multimap

_NUMPY_AVAILABLE = RequirementCache("numpy")
logger = logging.getLogger(__name__)


class DatasetMode(enum.Enum):
Expand Down Expand Up @@ -516,8 +518,15 @@ def __init__(
        self.inference_config = _configs.configs[0]

    def prepare_data(self) -> None:
        logger.info("=" * 60)
        logger.info(
            "Prepare data: use_msa_server=%s, use_templates=%s",
            self.use_msa_server,
            self.use_templates,
        )
        logger.info("=" * 60)

        # Colabfold msa preparation
        if self.use_msa_server:
            logger.info("Running ColabFold MSA server...")
            self.inference_config.query_set = preprocess_colabfold_msas(
                inference_query_set=self.inference_config.query_set,
                compute_settings=self.msa_computation_settings,
Expand All @@ -529,11 +538,13 @@ def prepare_data(self) -> None:
            )

        if self.use_templates:
            logger.info("Running template preprocessing...")
            template_preprocessor = TemplatePreprocessor(
                input_set=self.inference_config.query_set,
                config=self.inference_config.template_preprocessor_settings,
            )
            template_preprocessor()
            logger.info("Template preprocessing complete!")

    def setup(self, stage=None):
        """Broadcast updated query set to all ranks if multiple GPUs are used."""