protein: add ESMC + ESMFold2 (Biohub 2026-05-27 release) #20
Merged
Integrates the two model artifacts from Biohub's "world model of protein biology" release (MIT-licensed, github.com/Biohub/esm) as first-class typed Sheaf model types alongside the existing ESM-3 adapter.

- ESMC (api/protein_language.py, backend `esmc`): per-token logits + optional per-token embeddings via transformers.AutoModelForMaskedLM, default model Biohub/ESMC-6B (only weight-downloadable variant; 300M and 600M variants are Forge API-only and raise NotImplementedError).
- ESMFold2 (api/structure.py, backend `esmfold2`): protein structure prediction exposing num_loops / num_sampling_steps / num_samples / seed as first-class request fields, returning PDB or mmCIF + pLDDT + pTM/ipTM + optional PAE.
- New ModelType.PROTEIN_LANGUAGE and ModelType.STRUCTURE; STRUCTURE is the first non-tensor output category (structure file as text).
- [protein] install extra ships the supporting transformers/torch stack; the `esm` package itself is `pip install esm@git+...c94ed8d` per the upstream README (no PyPI release yet). Conflicts with [molecular] documented but not enforced at the resolver level (mirrors how [multimodal]/imagebind is handled).

License verification: github.com/evolutionaryscale/esm 301→Biohub/esm; LICENSE.md is a standard MIT, Copyright 2026 Chan Zuckerberg Biohub, Inc.

Full verification trail + design rationale in docs/adr/0001-esmc-esmfold2-integration.md.

682 + 27 = 709 tests passing.
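The Forge-only rejection mentioned above can be pictured as a small load-time guard. The following is a minimal sketch, not the code in this PR: the function name `check_downloadable` and the constant `FORGE_ONLY_MODEL_IDS` are hypothetical, and the IDs are the ones the PR description lists as Forge API-only.

```python
# Hypothetical sketch of a load-time guard; names are illustrative,
# not Sheaf's actual API.
ADR_PATH = "docs/adr/0001-esmc-esmfold2-integration.md"

# Model IDs the PR description lists as Forge API-only (no downloadable weights).
FORGE_ONLY_MODEL_IDS = frozenset(
    {
        "esmc-300m-2024-12",
        "esmc-600m-2024-12",
        "esmfold2-fast-2026-05",
    }
)


def check_downloadable(model_id: str) -> str:
    """Return model_id unchanged, or raise if it is only served through Forge."""
    if model_id.lower() in FORGE_ONLY_MODEL_IDS:
        raise NotImplementedError(
            f"{model_id!r} is Forge API-only and has no downloadable weights; "
            f"see {ADR_PATH}"
        )
    return model_id
```

Failing at load time rather than at first inference keeps the error close to the configuration mistake that caused it.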
- ESMCBackend: real transformers.MaskedLMOutput has no last_hidden_state; force output_hidden_states=True when embeddings are requested and read from hidden_states[-1]. Would have AttributeError'd on first GPU run.
- modal_server.py: ProteinLanguageRequest + StructureRequest were missing from the parallel AnyRequest union (separate from sheaf.api.union); also wire esmc + esmfold2 into the backend registry imports.
- StructureResponse.plddt: ESMFold2 returns pLDDT on [0, 1] not the conventional AlphaFold/ESMFold-v1 [0, 100]. Doc-only fix; faithful pass-through stays consistent with sheaf's "don't transform in backends" convention.
- examples/: quickstart_protein_language.py, quickstart_structure.py, quickstart_protein_modal.py. The Modal quickstart drives ESMFold2Backend on H100 via the upstream esm git pin (revision 81b3646c); ~70s cold-start weight fetch to a persistent volume, sub-second inference per fold.

Verified end-to-end: 53-residue fold on H100 → pTM=0.2465, 43,088-char mmCIF, no crashes. ADR amended with the empirical pLDDT-scale finding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
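The embeddings fix in this commit can be illustrated without loading a model. The sketch below is an assumption-laden stand-in, not the actual `ESMCBackend`: `FakeMaskedLMOutput`, `fake_model`, and `run_inference` are hypothetical names, and plain lists replace tensors. It mirrors only the one property that matters here: a masked-LM output carries `logits` and an optional `hidden_states` tuple, and has no `last_hidden_state` attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeMaskedLMOutput:
    """Stand-in for a masked-LM output: logits plus optional hidden_states."""

    logits: list
    hidden_states: Optional[tuple] = None  # populated only when requested


def fake_model(token_ids, output_hidden_states=False):
    """Hypothetical model call; returns one hidden-state entry per layer."""
    logits = [[0.0, 1.0] for _ in token_ids]
    hidden = None
    if output_hidden_states:
        # Three pretend layers; the last tuple entry is the final layer.
        hidden = tuple(
            [[float(layer)] * 4 for _ in token_ids] for layer in range(3)
        )
    return FakeMaskedLMOutput(logits=logits, hidden_states=hidden)


def run_inference(model, token_ids, return_embeddings):
    """Force hidden states on whenever embeddings are requested."""
    out = model(token_ids, output_hidden_states=return_embeddings)
    # Read the final layer from hidden_states[-1]; there is no last_hidden_state.
    embeddings = out.hidden_states[-1] if return_embeddings else None
    return out.logits, embeddings
```

Tying `output_hidden_states` to `return_embeddings` in one place means a caller cannot ask for embeddings while leaving hidden states switched off.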
…gn notes

- esm git pin: c94ed8d → 81b3646c9429ea8458918415ad6a46178cb59833 across README, ADR, pyproject.toml. The new SHA is what Modal's official modal-examples/06_gpu_and_ml/protein-folding/esmfold2.py uses and what we verified end-to-end via examples/quickstart_protein_modal.py on H100. ADR keeps a one-liner explaining the historical c94ed8d origin.
- README v0.11 roadmap: tick the GPU smoke checkbox with the H100 result.
- CLAUDE.md: add a v0.11 "Protein models" section with 10 design-decision bullets: MOLECULAR vs PROTEIN_LANGUAGE separation, STRUCTURE as the first non-tensor output category, the MaskedLMOutput.last_hidden_state bug we fixed, the [0,1] pLDDT scale finding, the modal_server.py parallel-union gotcha, the Forge-only NotImplementedError, etc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
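The parallel-union gotcha noted above (a request type added to `sheaf.api.union` but missed in `modal_server.py`) is the kind of drift a small parity check could catch. The sketch below is one possible approach, not something this PR adds: the classes are empty stand-ins, `LegacyRequestA` / `LegacyRequestB` are placeholders for the pre-existing request types, and `missing_members` is a hypothetical helper. It assumes both unions are plain `typing.Union` aliases; a union wrapped in `Annotated` would need unwrapping first.

```python
from typing import Union, get_args


# Empty stand-ins for request models; the real classes live in Sheaf.
class LegacyRequestA: ...
class LegacyRequestB: ...
class ProteinLanguageRequest: ...
class StructureRequest: ...


# Stand-in for the union in sheaf.api.union after the original commit.
ApiAnyRequest = Union[
    LegacyRequestA, LegacyRequestB, ProteinLanguageRequest, StructureRequest
]
# Stand-ins for the separate modal_server union before and after the fix.
ModalAnyRequestBefore = Union[LegacyRequestA, LegacyRequestB]
ModalAnyRequestAfter = Union[
    LegacyRequestA, LegacyRequestB, ProteinLanguageRequest, StructureRequest
]


def missing_members(reference, candidate):
    """Names of types present in `reference` but absent from `candidate`."""
    diff = set(get_args(reference)) - set(get_args(candidate))
    return sorted(t.__name__ for t in diff)
```

A unit test asserting that `missing_members` returns an empty list would flag the omission without requiring the two modules to share a definition, which preserves the reason the unions are separate in the first place.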
- pyproject.toml + src/sheaf/__init__.py: 0.10.0 → 0.11.0
- CLAUDE.md "Current state" header updated for the protein release
- README v0.11 roadmap heading: strip "(in progress, draft PR)" qualifier

PyPI publish + Docker image build will fire on `git tag v0.11.0 && git push --tags` after this merges to main (publish.yml is tag-triggered, not branch-triggered).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Integrates the two model artifacts from Biohub's world model of protein biology release (2026-05-27, MIT) as first-class typed Sheaf model types, alongside the existing ESM-3 adapter.
- ESMC: `api/protein_language.py`, backend `esmc`. Per-token logits + optional per-token embeddings via `transformers.AutoModelForMaskedLM`. Default model `Biohub/ESMC-6B` (the only weight-downloadable variant; 300M / 600M are Forge API-only and raise `NotImplementedError` with a pointer to the ADR).
- ESMFold2: `api/structure.py`, backend `esmfold2`. Protein structure prediction with `num_loops` / `num_sampling_steps` / `num_samples` / `seed` as first-class request fields. Returns PDB or mmCIF + pLDDT + pTM / ipTM + optional PAE.
- `STRUCTURE` model category: the first non-tensor output category in Sheaf (structure file as text). `PROTEIN_LANGUAGE` is added alongside `MOLECULAR` because per-token logits / embeddings have a response shape that is incompatible with the per-sequence pooled-embedding contract.
- `[protein]` extra in `pyproject.toml` ships `transformers` + `torch`; the `esm` package itself is installed via `pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d` per the upstream README (no PyPI release yet).
- `[molecular]` and `[protein]` share the `esm` import name from different upstream packages: mutually exclusive, documented in the README.

Full design rationale, upstream verification trail (HTTP probes, raw README, raw LICENSE.md), and alternatives considered live in `docs/adr/0001-esmc-esmfold2-integration.md`.

Release checklist
- `github.com/evolutionaryscale/esm` HTTP 301 → `github.com/Biohub/esm`.
- `LICENSE.md@main` is a standard MIT license, Copyright 2026 Chan Zuckerberg Biohub, Inc. (fetched verbatim via `raw.githubusercontent.com`).
- `Biohub/ESMC-6B` and `biohub/ESMFold2` (case-inconsistency mirrors the upstream README). Forge-only variants (`esmc-300m-2024-12`, `esmc-600m-2024-12`, `esmfold2-fast-2026-05`) are explicitly rejected at `load()` with a pointer to the ADR.
- `ruff check`, `ruff format --check`, `ty check src/` all clean (no new diagnostics; pre-existing ty diagnostic count unchanged in `sheaf/worker/runner.py`).
- `examples/quickstart_protein_modal.py`) against the standard target: `NotImplementedError` for those model IDs today.

Follow-up commit (b63557a)
The Modal H100 smoke surfaced three real bugs in the original commit, all fixed in the follow-up:
- `transformers.MaskedLMOutput` has no `last_hidden_state` field. Requesting `return_embeddings=True` without `output_hidden_states=True` would have AttributeError'd on the first real inference. Fixed by forcing `output_hidden_states=True` whenever embeddings are needed and reading uniformly from `hidden_states[-1]`. Tests updated to mirror the real `MaskedLMOutput` shape.
- `modal_server.py` has its own `AnyRequest` union (deliberately separate from `sheaf.api.union` to avoid pulling Ray into Modal containers). `ProteinLanguageRequest` and `StructureRequest` were added to `sheaf.api.union` but missed in `modal_server`; protein requests would have 422'd on the Modal path. Fixed, plus added `esmc` + `esmfold2` to the backend registry imports in `_build_asgi_app`.
- `StructureResponse.plddt` scale: ADR and docstring claimed `[0, 100]`, but ESMFold2 actually returns fractional `[0, 1]` (verified empirically on H100). Doc-only fix; faithful pass-through is preserved (consistent with sheaf's "don't transform in backends" convention). ADR amended with the empirical finding.

Also added three quickstart examples: `quickstart_protein_language.py`, `quickstart_structure.py`, `quickstart_protein_modal.py`.

Test plan
- `uv run --extra dev pytest tests/ -q`: 682 passed, 65 skipped
- `uv run --extra dev ruff check src/ tests/ examples/`: clean
- `uv run --extra dev ruff format --check src/ tests/ examples/`: clean
- `uv run --extra dev ty check src/`: no new diagnostics (all remaining are pre-existing in `sheaf/worker/runner.py`)
- `modal run examples/quickstart_protein_modal.py` on H100 against `biohub/ESMFold2` with `esm @ git+https://github.com/Biohub/esm.git@81b3646c…`. End-to-end success; results above.
- `[protein]` extra install, verified twice: (1) fresh local Python 3.12 venv, `pip install sheaf-serve[protein]` from local source resolves cleanly (torch 2.12.0 + transformers 5.9.0); (2) inside the Modal image, where the upstream `esm@git+...` pin also installs without issue.

Out of scope
- `[molecular]` and `[protein]` extras are mutually exclusive at the `esm` import-name level; documented.

Verification trail (one click for the reviewer)
Ready to merge once reviewed.
Generated by Claude Code