ESMFold2: support DNA/RNA/ligand chains in StructureRequest #21

@korbonits

Motivation

On 2026-05-27, Chan Zuckerberg Biohub released ESMFold2, and Modal shipped its reference example the same day (modal-labs/modal-examples#1572, commit f3e0f9dd7b, blog). Sheaf shipped v0.11.0 with the matching esm pin (81b3646c) that day as well, but our StructureRequest is protein-chains-only by design (src/sheaf/api/structure.py:28-30), so we cannot reproduce Modal's flagship demo: folding the M.HhaI DNA methyltransferase with modified DNA and the SAH cofactor.

The headline of Modal's post (and the LinkedIn announcement) is biomolecular complexes: protein-DNA, antibody-antigen, protein-ligand. Antibody-antigen already works (it is multi-chain protein, and we already have iptm); DNA and ligand chains do not. Closing this gap is the obvious follow-up to the v0.11 release and keeps Sheaf in lockstep with what upstream and Modal are pitching.

Current state

# src/sheaf/api/structure.py
class ChainInput(BaseModel):
    chain_id: str
    sequence: str   # amino-acid only

class StructureRequest(BaseRequest):
    chains: list[ChainInput] = Field(min_length=1)
    ...

Upstream API (esm @ 81b3646c, verbatim from esm/utils/structure/input_builder.py)

@dataclass
class ProteinInput:
    id: str | list[str]
    sequence: str
    modifications: list[Modification] | None = None
    msa: MSAInput = None

@dataclass
class DNAInput:
    id: str | list[str]
    sequence: str
    modifications: list[Modification] | None = None

@dataclass
class RNAInput:
    id: str | list[str]
    sequence: str
    modifications: list[Modification] | None = None

@dataclass
class LigandInput:
    id: str | list[str]
    smiles: str | None = None
    ccd: list[str] | None = None

@dataclass
class Modification:
    position: int
    ccd: str
    smiles: str | None = None

@dataclass
class StructurePredictionInput:
    sequences: Sequence[ProteinInput | RNAInput | DNAInput | LigandInput]
    pocket: PocketConditioning | None = None
    distogram_conditioning: list[DistogramConditioning] | None = None
    covalent_bonds: list[CovalentBond] | None = None

Notable: ligands accept either SMILES or a list of CCD codes; modifications accept either CCD or SMILES; IDs may be a list (used upstream for symmetry / multi-copy chains).

Proposed API — discriminated union

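Pydantic v2 discriminated union, keyed by type. The intent is that existing protein-only JSON requests stay valid because type defaults to "protein". One caveat to verify: a plain Field(discriminator="type") reads the tag from the input payload, so a chain that omits type is rejected with a missing-tag error before the field default applies. Keeping old payloads valid likely needs a callable discriminator (or a pre-validation step) that falls back to "protein".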

class Modification(BaseModel):
    position: int
    ccd: str
    smiles: str | None = None

class ProteinChain(BaseModel):
    type: Literal["protein"] = "protein"
    chain_id: str
    sequence: str
    modifications: list[Modification] = []
    # NB: MSA stays on StructureRequest for now (see Out of Scope)

class DNAChain(BaseModel):
    type: Literal["dna"] = "dna"
    chain_id: str
    sequence: str
    modifications: list[Modification] = []

class RNAChain(BaseModel):
    type: Literal["rna"] = "rna"
    chain_id: str
    sequence: str
    modifications: list[Modification] = []

class LigandChain(BaseModel):
    type: Literal["ligand"] = "ligand"
    chain_id: str
    smiles: str | None = None
    ccd: list[str] | None = None
    # validator: exactly one of smiles / ccd must be set

ChainInput = Annotated[
    ProteinChain | DNAChain | RNAChain | LigandChain,
    Field(discriminator="type"),
]

class StructureRequest(BaseRequest):
    chains: list[ChainInput] = Field(min_length=1)
    msa: list[list[str]] | None = None
    ...

Validation rules

  • chain_id must be unique across all chains in a request (cross-type).
  • LigandChain: exactly one of smiles / ccd must be set (@model_validator).
  • Modification.position must be in-range for its parent chain's sequence (1-indexed per upstream convention).
  • StructureRequest.msa, when set, must have one entry per protein chain, in chain order; non-protein chains do not consume MSA entries. (Validate that its length equals the number of protein chains.)
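
The four rules above can be sketched as one plain function, independent of Pydantic; in the real models they would presumably live in @model_validator hooks on LigandChain and StructureRequest. check_request is a hypothetical name, chains are plain dicts here, and the MSA is assumed to hold one list of aligned sequences per protein chain:

```python
from __future__ import annotations


def check_request(chains: list[dict], msa: list[list[str]] | None = None) -> None:
    """Raise ValueError on the first rule that a request violates."""
    seen: set[str] = set()
    for chain in chains:
        cid = chain["chain_id"]
        # Rule 1: chain_id is unique across all chain types.
        if cid in seen:
            raise ValueError(f"duplicate chain_id {cid!r}")
        seen.add(cid)

        kind = chain.get("type", "protein")
        if kind == "ligand":
            # Rule 2: exactly one of smiles / ccd (both or neither is an error).
            has_smiles = chain.get("smiles") is not None
            has_ccd = chain.get("ccd") is not None
            if has_smiles == has_ccd:
                raise ValueError(f"ligand {cid!r}: set exactly one of smiles / ccd")
            continue

        # Rule 3: modification positions are 1-indexed and within the sequence.
        length = len(chain["sequence"])
        for mod in chain.get("modifications", []):
            if not 1 <= mod["position"] <= length:
                raise ValueError(
                    f"chain {cid!r}: position {mod['position']} outside 1..{length}"
                )

    # Rule 4: MSA entries are parallel to the protein chains only.
    if msa is not None:
        n_protein = sum(1 for c in chains if c.get("type", "protein") == "protein")
        if len(msa) != n_protein:
            raise ValueError(f"msa has {len(msa)} entries for {n_protein} protein chains")
```

Raising on the first violation keeps the sketch short; a Pydantic validator could instead collect all failures into a single ValidationError.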

Backend changes

ESMFold2Backend._run switches from the protein-only list-comprehension at src/sheaf/backends/esmfold2.py:122-125 to a per-type dispatch:

  • ProteinChain → ProteinInput(id=…, sequence=…, modifications=…)
  • DNAChain → DNAInput(id=…, sequence=…, modifications=…)
  • RNAChain → RNAInput(id=…, sequence=…, modifications=…)
  • LigandChain → LigandInput(id=…, smiles=…, ccd=…)

Store the four upstream input classes, plus Modification, on self at load() time (same pattern as self._ProteinInput today) for test injectability.
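
The mapping above could be sketched as a table-driven dispatch. Everything here is a stand-in so the snippet runs without esm installed: the *Input dataclasses copy only the fields quoted earlier in this issue, Chain flattens the four request models into one class for brevity, and build_inputs, UPSTREAM and modification_cls are hypothetical names. Passing the class table in as an argument mirrors the idea of storing the upstream classes on self so tests can inject fakes. Turning an empty modifications list into None is an assumption based on the upstream default.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Modification:  # stand-in for the upstream class
    position: int
    ccd: str
    smiles: str | None = None


@dataclass
class ProteinInput:  # stand-in
    id: str
    sequence: str
    modifications: list[Modification] | None = None


@dataclass
class DNAInput:  # stand-in
    id: str
    sequence: str
    modifications: list[Modification] | None = None


@dataclass
class RNAInput:  # stand-in
    id: str
    sequence: str
    modifications: list[Modification] | None = None


@dataclass
class LigandInput:  # stand-in
    id: str
    smiles: str | None = None
    ccd: list[str] | None = None


@dataclass
class Chain:  # flattened stand-in for ProteinChain / DNAChain / RNAChain / LigandChain
    type: str
    chain_id: str
    sequence: str | None = None
    smiles: str | None = None
    ccd: list[str] | None = None
    modifications: list[dict] = field(default_factory=list)


UPSTREAM = {"protein": ProteinInput, "dna": DNAInput, "rna": RNAInput, "ligand": LigandInput}


def build_inputs(chains, classes=UPSTREAM, modification_cls=Modification):
    """Convert request chains to upstream inputs; unknown types raise KeyError."""
    inputs = []
    for chain in chains:
        cls = classes[chain.type]
        if chain.type == "ligand":
            inputs.append(cls(id=chain.chain_id, smiles=chain.smiles, ccd=chain.ccd))
        else:
            mods = [modification_cls(**m) for m in chain.modifications]
            inputs.append(
                cls(id=chain.chain_id, sequence=chain.sequence, modifications=mods or None)
            )
    return inputs
```

A mocked test can then call build_inputs with a table of fake classes and assert on what each fake received, without loading the model.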

Out of scope (file follow-ups)

  • CovalentBond — needs cross-chain residue/atom indexing; useful for covalent inhibitors but a separate design conversation.
  • PocketConditioning / DistogramConditioning — conditioning-input features, distinct from chain composition.
  • Per-chain msa on ProteinChain — upstream puts MSA on ProteinInput but our current design uses a top-level parallel list. Moving it onto the chain is a breaking API change worth its own ADR; keep the parallel-list shape for now.
  • Forge API variants (esmfold2-fast-2026-05) — still deferred per ADR-0001.

Acceptance criteria

  • ChainInput is a discriminated union (Protein/DNA/RNA/Ligand); Modification model added.
  • StructureRequest validation: unique chain_id across types; ligand smiles-XOR-ccd; MSA length parallel to protein chains only.
  • ESMFold2Backend._run dispatches per chain type; new self._DNAInput / self._RNAInput / self._LigandInput / self._Modification stored at load().
  • Tests (mocked) for each chain type and for cross-type complexes; reproduce Modal's M.HhaI demo as one test case.
  • H100 smoke test (gated on SHEAF_SMOKE_TEST=1) that folds the M.HhaI + DNA + SAH complex end-to-end and asserts a non-empty mmCIF.
  • examples/quickstart_protein_modal.py extended with a multimer-with-ligand example.
  • ADR-0001 amended with a short addendum documenting the chain-type union (link this issue / PR).
  • CLAUDE.md design-decisions section updated.
  • modal_server.py's AnyRequest union is unaffected (StructureRequest already lives there), but the Modal quickstart should be updated alongside the Ray Serve one.
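
One possible shape for the gated smoke test, using only the standard library (unittest classes are also collected by pytest). looks_like_mmcif and fold_mhhai_complex are hypothetical names; the latter is a placeholder that must be wired to the real backend. The check for a data_ block header and an _atom_site. field goes slightly beyond the "non-empty" criterion and is an assumption about what a useful minimum looks like:

```python
import os
import unittest

# The smoke test only runs when explicitly requested, e.g. on an H100 host.
SMOKE_ENABLED = os.environ.get("SHEAF_SMOKE_TEST") == "1"


def looks_like_mmcif(text: str) -> bool:
    """Cheap sanity check: non-empty, has a data block and atom_site fields."""
    body = text.strip()
    return bool(body) and "data_" in body and "_atom_site." in body


def fold_mhhai_complex() -> str:
    """Hypothetical hook: fold the M.HhaI + DNA + SAH complex, return mmCIF text."""
    raise NotImplementedError("wire this to the ESMFold2 backend")


@unittest.skipUnless(SMOKE_ENABLED, "set SHEAF_SMOKE_TEST=1 to run the GPU smoke test")
class ESMFold2ComplexSmokeTest(unittest.TestCase):
    def test_complex_returns_mmcif(self):
        self.assertTrue(looks_like_mmcif(fold_mhhai_complex()))
```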
