After preprocessing succeeds, boltz predict aborts during checkpoint loading when the weight file boltz2_conf.ckpt (or boltz2_aff.ckpt) exists but is 0 bytes, for example after an interrupted download or a proxy/network failure. The failure is hard to diagnose: the process usually exits with a generic Aborted! and exit code 1, with little or no Python traceback.
Environment
- OS: Ubuntu 22.04.5 LTS
- Python: 3.10
- boltz version: 2.2.1
- PyTorch / CUDA: 2.10.0 + CUDA 12.8
- How weights were obtained: first-run auto-download
Steps to reproduce
- Point --cache to a directory where boltz2_conf.ckpt exists but is empty (e.g. truncate -s 0 boltz2_conf.ckpt for a minimal repro, or reproduce via a bad network/download).
- Run: boltz predict <input.yaml> --cache <cache_dir> --accelerator gpu --devices 1 (or CPU).
- Observe that the process exits abnormally after lines such as “Running structure prediction…”.
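For environments without the truncate utility, the first step can also be done from Python. This is a minimal sketch: the helper name make_empty_checkpoint and the temporary cache directory are illustrative and not part of boltz; only the file name boltz2_conf.ckpt comes from this report.

```python
from pathlib import Path
import tempfile


def make_empty_checkpoint(cache_dir, name="boltz2_conf.ckpt"):
    """Create (or truncate) a 0-byte checkpoint file, like `truncate -s 0`.

    Hypothetical helper for reproducing this issue; not a boltz API.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    ckpt = cache / name
    ckpt.write_bytes(b"")  # creates the file, or truncates an existing one
    return ckpt


if __name__ == "__main__":
    # Throwaway cache directory; pass its path to `boltz predict --cache`.
    cache_dir = tempfile.mkdtemp(prefix="boltz_cache_")
    ckpt = make_empty_checkpoint(cache_dir)
    print(f"cache_dir={cache_dir}")
    print(f"exists={ckpt.exists()} size={ckpt.stat().st_size}")
```

The printed exists/size pair is the same information our debug logging reported immediately before load_from_checkpoint.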
Expected behavior
If the checkpoint path exists but the file is empty, unreadable, or implausibly small, fail early, before loading, with a clear, actionable error (e.g. “checkpoint file is empty or corrupt; delete and re-download”) instead of a low-level Aborted!.
Actual behavior
The process terminates with exit code 1 and messages like Aborted!, often without a helpful Python exception or any pointer to the bad checkpoint. In our case, debug logging added immediately before load_from_checkpoint showed exists=True but size 0, and the crash happened inside loading.
Proposed solutions (feature request)
Optional checks (non-exhaustive):
- After download in download_boltz1 / download_boltz2 (and for the affinity weights): verify stat().st_size > min_bytes (and optionally run a magic/header sanity check) before treating the file as ready.
- Before load_from_checkpoint: if the path exists and st_size == 0 (or < threshold), raise a RuntimeError / click.ClickException that includes the resolved absolute path and remediation (“delete file and retry”, check proxy/disk quota).
- Optionally document in README/Troubleshooting that interrupted downloads can leave a 0-byte .ckpt.
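To make the proposal concrete, here is a rough sketch of what such a pre-flight check could look like. Everything named here is an assumption for discussion: validate_checkpoint, MIN_CHECKPOINT_BYTES and the 1024-byte threshold are hypothetical, and the optional header check assumes the checkpoint was written by torch.save in its default zip-based format, which may not hold for every weight file.

```python
from pathlib import Path
import zipfile

# Hypothetical lower bound; a real threshold would need maintainer input.
MIN_CHECKPOINT_BYTES = 1024


def validate_checkpoint(path, min_bytes=MIN_CHECKPOINT_BYTES, check_zip=False):
    """Raise RuntimeError with remediation text if the checkpoint looks unusable.

    Returns the resolved absolute path when the file passes all checks.
    """
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise RuntimeError(f"Checkpoint not found: {resolved}")

    size = resolved.stat().st_size
    if size < min_bytes:
        raise RuntimeError(
            f"Checkpoint file is empty or truncated ({size} bytes): {resolved}. "
            "Delete the file and retry the download; also check proxy "
            "settings and disk quota."
        )

    # Optional header sanity check. Assumption: the file is a zip archive,
    # as produced by torch.save in its default (non-legacy) format.
    if check_zip and not zipfile.is_zipfile(resolved):
        raise RuntimeError(
            f"Checkpoint file does not look like a zip archive: {resolved}. "
            "Delete the file and retry the download."
        )
    return resolved
```

The same function could be called in both places listed above: once after a download completes, and once right before load_from_checkpoint. In the CLI path, the RuntimeError could be swapped for click.ClickException so the message is printed without a traceback.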
Additional context
We understand that some failures originate in native/PyTorch code paths; pre-flight validation still improves maintainability and UX a lot. Happy to open a PR if maintainers agree on thresholds and error types.
Checklist
- [x] Searched existing issues for duplicates
- [x] Minimal repro described
- [x] Version info attached