Motivation
On 2026-05-27 Chan Zuckerberg Biohub released ESMFold2 and Modal shipped their reference example the same day (modal-labs/modal-examples#1572, commit f3e0f9dd7b, blog). Sheaf shipped v0.11.0 with the matching esm pin (81b3646c) on the same day — but our StructureRequest is protein-chains-only by design (src/sheaf/api/structure.py:28-30), so we cannot reproduce Modal's flagship demo: folding M.HhaI DNA methyltransferase + modified DNA + SAH cofactor.
The headline of Modal's post (and the LinkedIn announcement) is biomolecular complexes: protein-DNA, antibody-antigen, protein-ligand. Antibody-antigen already works (multi-chain protein, we have iptm); DNA and ligand do not. Closing this gap is the obvious follow-up to the v0.11 release and keeps Sheaf in lockstep with what upstream and Modal are pitching.
Current state
    # src/sheaf/api/structure.py
    class ChainInput(BaseModel):
        chain_id: str
        sequence: str  # amino-acid only

    class StructureRequest(BaseRequest):
        chains: list[ChainInput] = Field(min_length=1)
        ...
Upstream API (esm @ 81b3646c, verbatim from esm/utils/structure/input_builder.py)
    @dataclass
    class ProteinInput:
        id: str | list[str]
        sequence: str
        modifications: list[Modification] | None = None
        msa: MSAInput = None

    @dataclass
    class DNAInput:
        id: str | list[str]
        sequence: str
        modifications: list[Modification] | None = None

    @dataclass
    class RNAInput:
        id: str | list[str]
        sequence: str
        modifications: list[Modification] | None = None

    @dataclass
    class LigandInput:
        id: str | list[str]
        smiles: str | None = None
        ccd: list[str] | None = None

    @dataclass
    class Modification:
        position: int
        ccd: str
        smiles: str | None = None

    @dataclass
    class StructurePredictionInput:
        sequences: Sequence[ProteinInput | RNAInput | DNAInput | LigandInput]
        pocket: PocketConditioning | None = None
        distogram_conditioning: list[DistogramConditioning] | None = None
        covalent_bonds: list[CovalentBond] | None = None
Notable: ligands accept either SMILES or a list of CCD codes; modifications accept either CCD or SMILES; IDs may be a list (used upstream for symmetry / multi-copy chains).
Proposed API — discriminated union
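Pydantic v2 discriminated union, keyed by type. Existing protein-only JSON requests must stay valid, so a chain that omits type should be treated as "protein". Caveat: the default on ProteinChain.type is probably not enough on its own. As far as we can tell, a plain Field(discriminator="type") reads the tag from the raw input before any field default applies, so a payload without type would be rejected with a missing-tag error. A callable Discriminator that falls back to "protein" is one way to keep legacy requests working; this needs confirming against the Pydantic version we pin.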
    class Modification(BaseModel):
        position: int
        ccd: str
        smiles: str | None = None

    class ProteinChain(BaseModel):
        type: Literal["protein"] = "protein"
        chain_id: str
        sequence: str
        modifications: list[Modification] = []
        # NB: MSA stays on StructureRequest for now (see Out of Scope)

    class DNAChain(BaseModel):
        type: Literal["dna"] = "dna"
        chain_id: str
        sequence: str
        modifications: list[Modification] = []

    class RNAChain(BaseModel):
        type: Literal["rna"] = "rna"
        chain_id: str
        sequence: str
        modifications: list[Modification] = []

    class LigandChain(BaseModel):
        type: Literal["ligand"] = "ligand"
        chain_id: str
        smiles: str | None = None
        ccd: list[str] | None = None
        # validator: exactly one of smiles / ccd must be set

    ChainInput = Annotated[
        ProteinChain | DNAChain | RNAChain | LigandChain,
        Field(discriminator="type"),
    ]

    class StructureRequest(BaseRequest):
        chains: list[ChainInput] = Field(min_length=1)
        msa: list[list[str]] | None = None
        ...
Validation rules
- chain_id must be unique across all chains in a request (cross-type).
- LigandChain: exactly one of smiles / ccd must be set (@model_validator).
- Modification.position must be in-range for its parent chain's sequence (1-indexed per upstream convention).
- StructureRequest.msa, when set, must be parallel to the protein chains only; non-protein chains do not consume MSA rows. (Validate length match.)
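The four rules can be sketched as one plain-Python check. This is only an illustration that works on dicts and uses the standard library; in the real models the same logic would presumably live in @model_validator hooks. check_request and its message strings are hypothetical, and the sequences are placeholders.

```python
from __future__ import annotations

from collections import Counter


def check_request(chains: list[dict], msa: list[list[str]] | None = None) -> list[str]:
    """Return rule violations as messages; an empty list means the request passes."""
    errors: list[str] = []

    # Rule 1: chain_id is unique across every chain type.
    counts = Counter(chain["chain_id"] for chain in chains)
    errors += [f"duplicate chain_id {cid!r}" for cid, n in counts.items() if n > 1]

    for chain in chains:
        cid = chain["chain_id"]
        if chain.get("type", "protein") == "ligand":
            # Rule 2: exactly one of smiles / ccd (neither set or both set is an error).
            if (chain.get("smiles") is None) == (chain.get("ccd") is None):
                errors.append(f"ligand {cid!r}: set exactly one of smiles / ccd")
            continue
        # Rule 3: modification positions are 1-indexed and within the sequence.
        length = len(chain["sequence"])
        for mod in chain.get("modifications", []):
            if not 1 <= mod["position"] <= length:
                errors.append(f"chain {cid!r}: position {mod['position']} outside 1..{length}")

    # Rule 4: one MSA entry per protein chain; other chain types consume none.
    if msa is not None:
        n_protein = sum(chain.get("type", "protein") == "protein" for chain in chains)
        if len(msa) != n_protein:
            errors.append(f"msa has {len(msa)} rows for {n_protein} protein chains")
    return errors


# A deliberately broken request that trips all four rules.
chains = [
    {"type": "protein", "chain_id": "A", "sequence": "MKTAYIAK"},
    {"type": "dna", "chain_id": "B", "sequence": "GATC",
     "modifications": [{"position": 5, "ccd": "C36"}]},
    {"type": "ligand", "chain_id": "B", "smiles": None, "ccd": None},
]
problems = check_request(chains, msa=[["MKTAYIAK"], ["MKTAYIAK"]])
for problem in problems:
    print(problem)
```

Collecting every violation instead of raising on the first one is a design choice of the sketch; a Pydantic validator would more likely raise a single ValueError per rule.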
Backend changes
ESMFold2Backend._run switches from the protein-only list-comprehension at src/sheaf/backends/esmfold2.py:122-125 to a per-type dispatch:
- ProteinChain → ProteinInput(id=…, sequence=…, modifications=…)
- DNAChain → DNAInput(id=…, sequence=…, modifications=…)
- RNAChain → RNAInput(id=…, sequence=…, modifications=…)
- LigandChain → LigandInput(id=…, smiles=…, ccd=…)
Store the four upstream classes on self at load() time (same pattern as self._ProteinInput today) for test injectability.
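A minimal sketch of that dispatch, assuming chains arrive as dicts and using local stand-ins for the upstream dataclasses so it runs without esm installed. BackendSketch and build_inputs are hypothetical names, not the actual ESMFold2Backend code, and mapping an empty modifications list to None is an assumption about what upstream expects.

```python
from __future__ import annotations

from dataclasses import dataclass


# Stand-ins shaped like the upstream input classes (not imported from esm).
@dataclass
class ProteinInput:
    id: str
    sequence: str
    modifications: list | None = None


@dataclass
class DNAInput:
    id: str
    sequence: str
    modifications: list | None = None


@dataclass
class RNAInput:
    id: str
    sequence: str
    modifications: list | None = None


@dataclass
class LigandInput:
    id: str
    smiles: str | None = None
    ccd: list[str] | None = None


class BackendSketch:
    def load(self) -> None:
        # Stored on self so a test can swap any class for a fake.
        self._ProteinInput = ProteinInput
        self._DNAInput = DNAInput
        self._RNAInput = RNAInput
        self._LigandInput = LigandInput

    def build_inputs(self, chains: list[dict]) -> list:
        polymers = {
            "protein": self._ProteinInput,
            "dna": self._DNAInput,
            "rna": self._RNAInput,
        }
        inputs = []
        for chain in chains:
            kind = chain.get("type", "protein")
            if kind == "ligand":
                inputs.append(self._LigandInput(
                    id=chain["chain_id"],
                    smiles=chain.get("smiles"),
                    ccd=chain.get("ccd"),
                ))
            elif kind in polymers:
                inputs.append(polymers[kind](
                    id=chain["chain_id"],
                    sequence=chain["sequence"],
                    # Assumption: an empty list is passed upstream as None.
                    modifications=chain.get("modifications") or None,
                ))
            else:
                raise ValueError(f"unsupported chain type: {kind!r}")
        return inputs


backend = BackendSketch()
backend.load()
inputs = backend.build_inputs([
    {"type": "protein", "chain_id": "A", "sequence": "MKT"},
    {"type": "dna", "chain_id": "B", "sequence": "GATC"},
    {"type": "ligand", "chain_id": "L", "ccd": ["SAH"]},
])
print([type(item).__name__ for item in inputs])
```

The sketch passes modifications through untouched; the real backend would also need to convert each Sheaf Modification into the upstream Modification class.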
Out of scope (file follow-ups)
- CovalentBond: needs cross-chain residue/atom indexing; useful for covalent inhibitors but a separate design conversation.
- PocketConditioning / DistogramConditioning: conditioning-input features, distinct from chain composition.
- Per-chain msa on ProteinChain: upstream puts MSA on ProteinInput but our current design uses a top-level parallel list. Moving it onto the chain is a breaking API change worth its own ADR; keep the parallel-list shape for now.
- Forge API variants (esmfold2-fast-2026-05): still deferred per ADR-0001.
Acceptance criteria
- ChainInput is a discriminated union (Protein/DNA/RNA/Ligand); Modification model added.
- StructureRequest validation: unique chain_id across types; ligand smiles-XOR-ccd; MSA length parallel to protein chains only.
- ESMFold2Backend._run dispatches per chain type; new self._DNAInput / self._RNAInput / self._LigandInput / self._Modification stored at load().
- SHEAF_SMOKE_TEST=1) that folds the M.HhaI + DNA + SAH complex end-to-end and asserts a non-empty mmCIF.
- examples/quickstart_protein_modal.py extended with a multimer-with-ligand example.
- CLAUDE.md design-decisions section updated.
- modal_server.py's AnyRequest union is unaffected (StructureRequest already lives there), but the modal quickstart should be updated alongside the Ray Serve one.
References
- docs/adr/0001-esmc-esmfold2-integration.md
- "SAH" = S-adenosyl-L-homocysteine, "C36" = 5-methyl-2′-deoxycytidine)