Validate checkpoint files after download / before load_from_checkpoint to surface corrupt or empty weights #664

@bio-punk

Description

After preprocessing succeeds, boltz predict aborts during checkpoint loading when the weight file boltz2_conf.ckpt (or boltz2_aff.ckpt) exists but is 0 bytes, for example after an interrupted download or a proxy/network failure. The failure is hard to diagnose: the process usually exits with code 1 and a generic Aborted!, often with little or no Python traceback.

Environment

  • OS: Ubuntu 22.04.5 LTS
  • Python: 3.10
  • boltz version: 2.2.1
  • PyTorch / CUDA: 2.10.0 + cuda12.8
  • How weights were obtained: first-run auto-download

Steps to reproduce

  1. Point --cache to a directory where boltz2_conf.ckpt exists but is empty (e.g. truncate -s 0 boltz2_conf.ckpt for a minimal repro, or reproduce via a bad network/download).
  2. Run: boltz predict <input.yaml> --cache <cache_dir> --accelerator gpu --devices 1 (or CPU).
  3. Observe that the process exits abnormally after log lines such as “Running structure prediction…”.
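
For anyone without the truncate utility, the empty-file precondition from step 1 can be set up with a few lines of Python. This is only a sketch: the helper name make_empty_checkpoint and the temporary cache directory are illustrative and are not part of boltz.

```python
import tempfile
from pathlib import Path


def make_empty_checkpoint(cache_dir: Path, name: str = "boltz2_conf.ckpt") -> Path:
    """Create (or truncate) a 0-byte checkpoint file, like `truncate -s 0`."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    ckpt = cache_dir / name
    ckpt.write_bytes(b"")  # overwrites any existing content, leaving 0 bytes
    return ckpt


if __name__ == "__main__":
    # Hypothetical throwaway cache directory used only for the repro.
    cache = Path(tempfile.mkdtemp(prefix="boltz_cache_"))
    ckpt = make_empty_checkpoint(cache)
    print(f"{ckpt} exists={ckpt.exists()} size={ckpt.stat().st_size}")
    print(f"Next: boltz predict <input.yaml> --cache {cache}")
```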

Expected behavior

If the checkpoint path exists but the file is empty, unreadable, or implausibly small, fail early, before loading, with a clear, actionable error (e.g. “checkpoint file is empty or corrupt; delete and re-download”) instead of a low-level abort.

Actual behavior

The process terminates with exit code 1 and messages like Aborted!, often without a helpful Python exception or any pointer to which checkpoint is bad. In our case, debug logging added immediately before load_from_checkpoint showed exists=True with size 0, and the crash happened inside checkpoint loading.
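
The kind of debug logging described above might look roughly like the following. It is a minimal sketch rather than the exact code we used: the logger name and both helper names are hypothetical, and the only assumption about boltz is that the checkpoint path is known just before load_from_checkpoint is called.

```python
import logging
from pathlib import Path

# Hypothetical logger name, chosen only for this illustration.
logger = logging.getLogger("checkpoint_debug")


def describe_checkpoint(path) -> str:
    """Summarise whether the checkpoint exists and how large it is."""
    p = Path(path)
    exists = p.exists()
    size = p.stat().st_size if exists else None
    return f"checkpoint={p} exists={exists} size={size}"


def log_checkpoint_state(path) -> None:
    """Emit the summary at WARNING level so it shows up without extra config."""
    logger.warning(describe_checkpoint(path))
```

Calling log_checkpoint_state(checkpoint_path) immediately before the load is enough to tell an empty file apart from a missing one.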

Proposed solutions (feature request)

Optional checks (non-exhaustive):

  1. After download in download_boltz1 / download_boltz2 (and for the affinity weights): verify stat().st_size > min_bytes (and optionally a magic/header sanity check) before treating the file as ready.
  2. Before load_from_checkpoint: if the path exists and st_size == 0 (or is below a threshold), raise a RuntimeError or click.ClickException that includes the resolved absolute path and a remediation hint (“delete the file and retry”, check proxy and disk quota).
  3. Optionally, document in README/Troubleshooting that interrupted downloads can leave a 0-byte .ckpt.
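
As a starting point for discussion, checks 1 and 2 could be sketched as below. This is not boltz code: validate_checkpoint, finalize_download and MIN_CKPT_BYTES are hypothetical names, the threshold value is a placeholder for maintainers to decide, and the sketch assumes the download is first written to a temporary path and only moved into place once it passes validation. It raises RuntimeError to stay dependency-free; click.ClickException could be substituted at the CLI layer.

```python
import os
from pathlib import Path

# Hypothetical lower bound in bytes; the real value is for maintainers to choose.
MIN_CKPT_BYTES = 1024


def validate_checkpoint(path, min_bytes: int = MIN_CKPT_BYTES) -> Path:
    """Fail early with an actionable message if the checkpoint looks unusable."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise RuntimeError(f"Checkpoint not found: {resolved}")
    size = resolved.stat().st_size
    if size < min_bytes:
        raise RuntimeError(
            f"Checkpoint file is empty or corrupt: {resolved} "
            f"({size} bytes, expected at least {min_bytes}). "
            "Delete the file and retry; also check proxy settings and disk quota."
        )
    return resolved


def finalize_download(tmp_path, final_path, min_bytes: int = MIN_CKPT_BYTES) -> Path:
    """Validate a freshly downloaded temp file, then move it into place."""
    validate_checkpoint(tmp_path, min_bytes)
    # os.replace is atomic when both paths are on the same filesystem, so a
    # partially written file never appears under the final name.
    os.replace(tmp_path, final_path)
    return Path(final_path)
```

With this shape, the pre-load check is a single validate_checkpoint(checkpoint_path) call before load_from_checkpoint, and the download functions would call finalize_download instead of writing directly to the final .ckpt path.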

Additional context

We understand that some failures originate in native/PyTorch code paths; pre-flight validation would still improve maintainability and the user experience considerably. We are happy to open a PR if the maintainers agree on thresholds and error types.

Checklist

  • [ O ] Searched existing issues for duplicates
  • [ O ] Minimal repro described
  • [ O ] Version info attached
