protein: add ESMC + ESMFold2 (Biohub 2026-05-27 release)#20

Merged
korbonits merged 4 commits into main from claude/esmc-esmfold2-integration-J9WQu
May 28, 2026

Owner

@korbonits korbonits commented May 27, 2026

Summary

Integrates the two model artifacts from Biohub's world model of protein biology release (2026-05-27, MIT) as first-class typed Sheaf model types, alongside the existing ESM-3 adapter.

  • ESMC: api/protein_language.py, backend esmc. Per-token logits + optional per-token embeddings via transformers.AutoModelForMaskedLM. Default model Biohub/ESMC-6B (the only weight-downloadable variant; 300M / 600M are Forge API-only and raise NotImplementedError with a pointer to the ADR).
  • ESMFold2: api/structure.py, backend esmfold2. Protein structure prediction with num_loops / num_sampling_steps / num_samples / seed as first-class request fields. Returns PDB or mmCIF + pLDDT + pTM / ipTM + optional PAE.
  • New STRUCTURE model category — the first non-tensor output category in Sheaf (structure file as text). PROTEIN_LANGUAGE is added alongside MOLECULAR because per-token logits / embeddings have an incompatible response shape with the per-sequence pooled-embedding contract.
  • [protein] extra in pyproject.toml ships transformers + torch; the esm package itself is installed via pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d per the upstream README (no PyPI release yet). [molecular] and [protein] share the esm import name from different upstream packages — mutually exclusive, documented in the README.

Full design rationale, upstream verification trail (HTTP probes, raw README, raw LICENSE.md), and alternatives considered live in docs/adr/0001-esmc-esmfold2-integration.md.
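
As a rough illustration of what "first-class request fields" and a text-valued STRUCTURE response might look like, here is a minimal sketch using stdlib dataclasses. The field names num_loops, num_sampling_steps, num_samples, and seed come from the summary above; the class names, the `output_format` field, the use of `None` as "backend default", and the validation rules are all assumptions made for illustration and are not taken from the Sheaf source.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class StructureRequestSketch:
    """Hypothetical stand-in for a structure-prediction request."""

    sequence: str
    # None means "let the backend pick its default" (an assumption of this sketch).
    num_loops: Optional[int] = None
    num_sampling_steps: Optional[int] = None
    num_samples: Optional[int] = None
    seed: Optional[int] = None
    output_format: Literal["pdb", "mmcif"] = "mmcif"  # hypothetical field name

    def __post_init__(self) -> None:
        if not self.sequence:
            raise ValueError("sequence must be non-empty")
        for name in ("num_loops", "num_sampling_steps", "num_samples"):
            value = getattr(self, name)
            if value is not None and value < 1:
                raise ValueError(f"{name} must be >= 1 when set")


@dataclass(frozen=True)
class StructureResponseSketch:
    """Hypothetical stand-in: the structure travels as text, not as a tensor."""

    structure: str                           # PDB or mmCIF file contents
    plddt: list[float]                       # per-residue confidence
    ptm: float
    iptm: Optional[float] = None
    pae: Optional[list[list[float]]] = None  # only when requested


# The smoke-test target sequence quoted in the release checklist.
request = StructureRequestSketch(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK",
    num_samples=1,
    seed=0,
)
print(len(request.sequence), request.output_format)
```

Keeping the sampling knobs as typed fields, rather than burying them in a free-form options dict, lets bad values be rejected when the request is built instead of deep inside the backend.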

Release checklist

  • License verified: github.com/evolutionaryscale/esm HTTP 301 → github.com/Biohub/esm. LICENSE.md@main is a standard MIT license, Copyright 2026 Chan Zuckerberg Biohub, Inc. (fetched verbatim via raw.githubusercontent.com).
  • Weights downloadable, not Forge-gated — for the default models served by each backend: Biohub/ESMC-6B and biohub/ESMFold2 (case-inconsistency mirrors the upstream README). Forge-only variants (esmc-300m-2024-12, esmc-600m-2024-12, esmfold2-fast-2026-05) are explicitly rejected at load() with a pointer to the ADR.
  • Tests passing — full suite green (682 passed, 65 skipped); ruff check, ruff format --check, ty check src/ all clean (no new diagnostics; pre-existing ty diagnostic count unchanged in sheaf/worker/runner.py).
  • Docs updated — README "Protein models" section, supported-model-types table, install instructions, v0.11 roadmap entry; ADR-0001 with verification trail.
  • End-to-end smoke tests on GPU — verified via Modal H100 (examples/quickstart_protein_modal.py) against the standard target:
    MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK
    
    Result: 53 residues, pTM=0.2465, mmCIF 43,088 chars. Cold start ~70s for the ~12 GB weight fetch (cached in a persistent Modal volume); sub-second per fold afterwards.
  • Forge / Biohub-Platform HTTP-client variants — explicitly out of scope for this PR; backends raise NotImplementedError for those model IDs today.
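
The "rejected at load() with a pointer to the ADR" behaviour from the checklist can be sketched roughly as follows. The three Forge-only model IDs and the ADR path are quoted from this PR; the function name `ensure_downloadable` and the error wording are hypothetical, and the real backends may structure the check differently.

```python
# Forge-only model IDs named in the release checklist above.
FORGE_ONLY_MODEL_IDS = frozenset(
    {
        "esmc-300m-2024-12",
        "esmc-600m-2024-12",
        "esmfold2-fast-2026-05",
    }
)
ADR_PATH = "docs/adr/0001-esmc-esmfold2-integration.md"


def ensure_downloadable(model_id: str) -> str:
    """Return model_id unchanged, or raise if it is only served via Forge.

    Hypothetical helper: a sketch of the load()-time guard, not Sheaf's code.
    """
    if model_id in FORGE_ONLY_MODEL_IDS:
        raise NotImplementedError(
            f"{model_id!r} is Forge API-only and has no downloadable weights; "
            f"see {ADR_PATH}"
        )
    return model_id


print(ensure_downloadable("Biohub/ESMC-6B"))
```

Failing at load time keeps a misconfigured deployment from starting at all, instead of erroring on the first request.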

Follow-up commit (b63557a)

The Modal H100 smoke surfaced three real bugs in the original commit, all fixed in the follow-up:

  • ESMCBackend: transformers.MaskedLMOutput has no last_hidden_state field. Requesting return_embeddings=True without output_hidden_states=True would have raised AttributeError on the first real inference. Fixed by forcing output_hidden_states=True whenever embeddings are needed and reading uniformly from hidden_states[-1]. Tests updated to mirror the real MaskedLMOutput shape.
  • modal_server.py — has its own AnyRequest union (deliberately separate from sheaf.api.union to avoid pulling Ray into Modal containers). ProteinLanguageRequest and StructureRequest were added to sheaf.api.union but missed in modal_server; protein requests would have 422'd on the Modal path. Fixed, plus added esmc + esmfold2 to the backend registry imports in _build_asgi_app.
  • StructureResponse.plddt scale — ADR and docstring claimed [0, 100], but ESMFold2 actually returns fractional [0, 1] (verified empirically on H100). Doc-only fix; faithful pass-through is preserved (consistent with sheaf's "don't transform in backends" convention). ADR amended with the empirical finding.
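
The first fix in the list above reads embeddings from hidden_states[-1] rather than a non-existent last_hidden_state attribute. A minimal sketch of that pattern is below. To stay self-contained it uses a `SimpleNamespace` as a stand-in for a masked-LM output object (with `logits` and `hidden_states`, and no `last_hidden_state`); the helper name `last_layer_embeddings` is hypothetical and this is not the actual ESMCBackend code.

```python
from types import SimpleNamespace
from typing import Any


def last_layer_embeddings(output: Any) -> Any:
    """Return the final entry of output.hidden_states.

    hidden_states is only populated when the model is called with
    output_hidden_states=True, so fail loudly if it is missing.
    """
    hidden_states = getattr(output, "hidden_states", None)
    if hidden_states is None:
        raise ValueError(
            "hidden_states is missing; call the model with output_hidden_states=True"
        )
    return hidden_states[-1]


# Stand-in shaped like a masked-LM output: one entry per layer, last one wins.
fake_output = SimpleNamespace(
    logits=[[0.1, 0.9]],
    hidden_states=(["embedding layer"], ["layer 1"], ["layer 2"]),
)
print(last_layer_embeddings(fake_output))
```

A test double without a `last_hidden_state` attribute, like the one above, is what lets a unit test catch this class of bug without a GPU.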

Also added three quickstart examples: quickstart_protein_language.py, quickstart_structure.py, quickstart_protein_modal.py.

Test plan

  • uv run --extra dev pytest tests/ -q — 682 passed, 65 skipped
  • uv run --extra dev ruff check src/ tests/ examples/ — clean
  • uv run --extra dev ruff format --check src/ tests/ examples/ — clean
  • uv run --extra dev ty check src/ — no new diagnostics (all remaining are pre-existing in sheaf/worker/runner.py)
  • GPU smoke: modal run examples/quickstart_protein_modal.py on H100 against biohub/ESMFold2 with esm @ git+https://github.com/Biohub/esm.git@81b3646c…. End-to-end success; results above.
  • [protein] extra install — verified twice: (1) fresh local Python 3.12 venv, pip install sheaf-serve[protein] from local source resolves cleanly (torch 2.12.0 + transformers 5.9.0); (2) inside the Modal image, where the upstream esm@git+... pin also installs without issue.

Out of scope

  • ESM Atlas ingestion — that's a dataset, not a model. Separate effort.
  • Replacing or deprecating ESM-3 — kept alongside ESMC; users may want either. The [molecular] and [protein] extras are mutually exclusive at the esm import-name level; documented.
  • Show HN / launch — after this lands.

Verification trail (one click for the reviewer)

# Confirm the redirect we're depending on
curl -sSI https://github.com/evolutionaryscale/esm | grep -iE '^(HTTP|location:)'
# → HTTP/2 301, location: https://github.com/Biohub/esm

# Confirm the LICENSE is standard MIT
curl -sS https://raw.githubusercontent.com/Biohub/esm/main/LICENSE.md
# → "License (MIT)\nCopyright 2026 Chan Zuckerberg Biohub, Inc. ..."

# Reproduce the GPU smoke (Modal account required, ~70s cold start, ~$0.05 on H100)
modal run examples/quickstart_protein_modal.py

Ready to merge once reviewed.


Generated by Claude Code

Integrates the two model artifacts from Biohub's "world model of protein
biology" release (MIT-licensed, github.com/Biohub/esm) as first-class
typed Sheaf model types alongside the existing ESM-3 adapter.

- ESMC (api/protein_language.py, backend `esmc`): per-token logits +
  optional per-token embeddings via transformers.AutoModelForMaskedLM,
  default model Biohub/ESMC-6B (only weight-downloadable variant; 300M
  and 600M variants are Forge API-only and raise NotImplementedError).
- ESMFold2 (api/structure.py, backend `esmfold2`): protein structure
  prediction exposing num_loops / num_sampling_steps / num_samples /
  seed as first-class request fields, returning PDB or mmCIF + pLDDT +
  pTM/ipTM + optional PAE.
- New ModelType.PROTEIN_LANGUAGE and ModelType.STRUCTURE; STRUCTURE is
  the first non-tensor output category (structure file as text).
- [protein] install extra ships the supporting transformers/torch stack;
  the `esm` package itself is `pip install esm@git+...c94ed8d` per the
  upstream README (no PyPI release yet).  Conflicts with [molecular]
  documented but not enforced at the resolver level (mirrors how
  [multimodal]/imagebind is handled).

License verification: github.com/evolutionaryscale/esm 301→Biohub/esm;
LICENSE.md is a standard MIT, Copyright 2026 Chan Zuckerberg Biohub, Inc.

Full verification trail + design rationale in
docs/adr/0001-esmc-esmfold2-integration.md.

682 + 27 = 709 tests passing.

netlify Bot commented May 27, 2026

Deploy Preview for sheaf ready!

Name Link
🔨 Latest commit 8031a03
🔍 Latest deploy log https://app.netlify.com/projects/sheaf/deploys/6a17cf6580eaa2000887ee9f
😎 Deploy Preview https://deploy-preview-20--sheaf.netlify.app

korbonits and others added 3 commits May 27, 2026 22:03
- ESMCBackend: real transformers.MaskedLMOutput has no last_hidden_state;
  force output_hidden_states=True when embeddings are requested and read
  from hidden_states[-1]. Would have AttributeError'd on first GPU run.
- modal_server.py: ProteinLanguageRequest + StructureRequest were missing
  from the parallel AnyRequest union (separate from sheaf.api.union); also
  wire esmc + esmfold2 into the backend registry imports.
- StructureResponse.plddt: ESMFold2 returns pLDDT on [0, 1] not the
  conventional AlphaFold/ESMFold-v1 [0, 100]. Doc-only fix; faithful
  pass-through stays consistent with sheaf's "don't transform in backends"
  convention.
- examples/: quickstart_protein_language.py, quickstart_structure.py,
  quickstart_protein_modal.py. The Modal quickstart drives ESMFold2Backend
  on H100 via the upstream esm git pin (revision 81b3646c); ~70s cold-start
  weight fetch to a persistent volume, sub-second inference per fold.

Verified end-to-end: 53-residue fold on H100 → pTM=0.2465, 43,088-char
mmCIF, no crashes. ADR amended with the empirical pLDDT-scale finding.
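
Since the backend passes pLDDT through unchanged on the fractional [0, 1] scale, callers whose tooling expects the conventional [0, 100] scale would need to convert on their side. A minimal caller-side sketch is below; the helper name `plddt_to_percent` is hypothetical and nothing like it is claimed to exist in Sheaf.

```python
from typing import Sequence


def plddt_to_percent(plddt: Sequence[float]) -> list[float]:
    """Rescale fractional pLDDT values in [0, 1] to the [0, 100] convention.

    Raises if any value is outside [0, 1], which usually means the input
    was already on the [0, 100] scale.
    """
    for value in plddt:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"pLDDT {value!r} is outside [0, 1]; already rescaled?")
    return [100.0 * value for value in plddt]


print(plddt_to_percent([0.0, 0.5, 1.0]))
```

The range check is there so that rescaling twice fails instead of silently producing values in the thousands.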

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gn notes

- esm git pin: c94ed8d → 81b3646c9429ea8458918415ad6a46178cb59833 across
  README, ADR, pyproject.toml. The new SHA is what Modal's official
  modal-examples/06_gpu_and_ml/protein-folding/esmfold2.py uses and what
  we verified end-to-end via examples/quickstart_protein_modal.py on H100.
  ADR keeps a one-liner explaining the historical c94ed8d origin.
- README v0.11 roadmap: tick the GPU smoke checkbox with the H100 result.
- CLAUDE.md: add a v0.11 "Protein models" section with 10 design-decision
  bullets — MOLECULAR vs PROTEIN_LANGUAGE separation, STRUCTURE as the
  first non-tensor output category, the MaskedLMOutput.last_hidden_state
  bug we fixed, the [0,1] pLDDT scale finding, the modal_server.py
  parallel-union gotcha, the Forge-only NotImplementedError, etc.
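
One way to guard against the parallel-union gotcha recurring is a unit test that compares the members of the two unions. The sketch below assumes both are plain `typing.Union` aliases; if the real unions are wrapped (for example in `Annotated` with a discriminator), the member extraction would need adjusting. `ProteinLanguageRequest` and `StructureRequest` are named in this PR, while `MolecularRequest`, the alias names, and `missing_members` are hypothetical placeholders.

```python
from typing import Any, Union, get_args


# Placeholder request classes; the real ones live in sheaf.api.
class MolecularRequest: ...
class ProteinLanguageRequest: ...
class StructureRequest: ...


# Stand-in for the union in sheaf.api.union.
CoreAnyRequest = Union[MolecularRequest, ProteinLanguageRequest, StructureRequest]
# Stand-in for a separately maintained union that has drifted.
DriftedAnyRequest = Union[MolecularRequest, ProteinLanguageRequest]


def missing_members(reference: Any, candidate: Any) -> set[str]:
    """Names of types present in `reference` but absent from `candidate`.

    Compares by class name so the check still works if the second module
    imports the same classes through a different path.
    """
    reference_names = {t.__name__ for t in get_args(reference)}
    candidate_names = {t.__name__ for t in get_args(candidate)}
    return reference_names - candidate_names


print(missing_members(CoreAnyRequest, DriftedAnyRequest))
```

In a test suite this would become an assertion that the result is an empty set, so adding a request type to one union without the other fails CI rather than returning a 422 in production.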

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- pyproject.toml + src/sheaf/__init__.py: 0.10.0 → 0.11.0
- CLAUDE.md "Current state" header updated for the protein release
- README v0.11 roadmap heading: strip "(in progress, draft PR)" qualifier

PyPI publish + Docker image build will fire on `git tag v0.11.0 && git push --tags`
after this merges to main (publish.yml is tag-triggered, not branch-triggered).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@korbonits korbonits marked this pull request as ready for review May 28, 2026 05:35
@korbonits korbonits merged commit 9e9206d into main May 28, 2026
8 checks passed
@korbonits korbonits deleted the claude/esmc-esmfold2-integration-J9WQu branch May 28, 2026 05:35