Create CLIs for full OpenFold pipeline #67

Open
nipun-khanna wants to merge 8 commits into AI2Science:main from bengal-tech365:main

@nipun-khanna nipun-khanna commented Apr 23, 2026

Summary

  • Adds new CLI scripts covering the full OpenFold pipeline (alignment, inference, visualization, and data/feature generation)
  • Adds detailed documentation for the CLI (with plans to extend these docs as new CLIs are added)

Details

  • Created a new directory under project root called cli/ for organizing CLI scripts

    • For reviewers: If you think the cli/ directory is unnecessary or the scripts should be moved elsewhere please comment
  • Introduces three new CLI scripts that let users run every stage of the OpenFold pipeline: precomputing alignments, running inference, and generating visualizations

  • The CLIs are HPC-compatible, and the inference CLI supports a dry run to verify commands before consuming GPU hours

  • Added vizfold_cli_precompute_align.py as a more standardized CLI entry point

    • Added validate_args(), which checks all inputs before any computation starts and reports every issue at once.
    • Added a run manifest, written to the output directory, that records all arguments used.
    • Added functionality to capture intermediate outputs, so that state is saved and it is clear where the pipeline failed.
  • Introduces a CLI to run the full data pipeline (MSA + template search + feature generation)

    • Supports both monomer and multimer pipelines via --multimer flag
    • Integration with:
      • JackHMMer for MSA generation
      • HHSearch (monomer) / HMMSearch (multimer) for template search
      • AlphaFold/OpenFold DataPipeline and DataPipelineMultimer
    • Outputs a serialized feature_dict.pickle for downstream inference
    • Strong input validation for required databases and binaries
    • Clear logging for each pipeline stage
  • Added a new CLI tool to cluster protein sequences using mmseqs2 with PDB‑style parameters

    • Produces standardized cluster files for downstream VizFold workflows
    • Integration with:
      • mmseqs2 easy-cluster pipeline
      • PDB‑style identity thresholds and coverage settings
    • Outputs a reformatted text file where each line lists all {PDB_ID}_{CHAIN_ID} entries in a cluster
    • Includes strong input validation for FASTA paths, mmseqs2 binary, and output directories
    • Clear logging for each clustering stage
  • Added a new CLI tool to generate FASTA files from alignment directories or alignment‑DB index files

    • Supports both directory‑based alignments and compressed alignment‑DB formats
    • Integration with:
      • MGnify, UniRef90, and BFD/Uniclust alignment file formats
      • Multi‑threaded extraction for large alignment sets
    • Outputs a consolidated FASTA containing one entry per chain
    • Includes strong validation for alignment sources and output paths
    • Clear logging for each extraction stage
  • Added extensive documentation under cli/USAGE.md
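
The validate-everything-first and run-manifest behaviour described above could be sketched roughly as follows. This is a minimal illustration and not the code from this PR: only the name validate_args() comes from the description, while the argument names (fasta_path, output_dir, jackhmmer_binary_path), the run_manifest.json filename, and the manifest layout are assumptions made for the example.

```python
import argparse
import json
import os
import shutil
import sys
from datetime import datetime, timezone


def validate_args(args):
    """Return a list of every problem found, instead of stopping at the first."""
    errors = []
    if not os.path.isfile(args.fasta_path):
        errors.append(f"FASTA file not found: {args.fasta_path}")
    # Hypothetical binary argument; shutil.which accepts a path or a name on PATH.
    if shutil.which(args.jackhmmer_binary_path) is None:
        errors.append(f"Binary not found or not executable: {args.jackhmmer_binary_path}")
    if os.path.exists(args.output_dir) and not os.path.isdir(args.output_dir):
        errors.append(f"Output path exists but is not a directory: {args.output_dir}")
    return errors


def write_run_manifest(args, output_dir, filename="run_manifest.json"):
    """Record the arguments used for this run in the output directory."""
    os.makedirs(output_dir, exist_ok=True)
    manifest = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "command": sys.argv,
        "arguments": vars(args),
    }
    path = os.path.join(output_dir, filename)
    with open(path, "w") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
    return path


def main(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of a validated CLI entry point")
    parser.add_argument("fasta_path")
    parser.add_argument("output_dir")
    parser.add_argument("--jackhmmer_binary_path", default="jackhmmer")
    args = parser.parse_args(argv)

    errors = validate_args(args)
    if errors:
        # Report all problems together so the user can fix them in one pass.
        for message in errors:
            print(f"error: {message}", file=sys.stderr)
        return 1
    write_run_manifest(args, args.output_dir)
    return 0
```

Calling main(["sequences.fasta", "output_dir/"]) would either print every validation error and return 1, or write the manifest and return 0. Collecting errors in a list is what lets the user fix a bad database path and a missing binary in one pass, before any queue time is spent.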

Example Usage
Full example usage can be found in the docs.

Some examples:

python vizfold_cli_feature_dict.py \
    sequences.fasta templates_dir/ output_dir/ \
    --uniref90_database_path /data/uniref90.fas

python vizfold_cli_fasta_to_clusterfile.py \
    sequences.fasta clusters.txt /path/to/mmseqs \
    --seq-id 0.4

python vizfold_cli_align_fasta.py output.fasta \
    --alignment-dir alignments/

Purpose

MukilSundaravadivel commented Apr 26, 2026

Added a new CLI tool that runs the data pipeline, converting protein sequences to AlphaFold feature dictionaries (MSA + template search + feature generation)

  • Supports both monomer and multimer pipelines via the --multimer flag
  • Integration with:
    • JackHMMer for MSA generation
    • HHSearch (monomer) / HMMSearch (multimer) for template search
    • AlphaFold/OpenFold DataPipeline and DataPipelineMultimer
  • Outputs a serialized feature_dict.pickle for downstream inference
  • Strong input validation for required databases and binaries
  • Clear logging for each pipeline stage

Usage example

python vizfold_cli_feature_dict.py \
    sequences.fasta templates_dir/ output_dir/ \
    --uniref90_database_path /data/uniref90.fas
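
As a rough illustration of how a downstream step might pick up the serialized features, the sketch below loads the pickle and summarizes its contents. It assumes the file is written as output_dir/feature_dict.pickle and holds a dictionary keyed by feature name. The helper name summarize_feature_dict is hypothetical and not part of this PR.

```python
import os
import pickle


def summarize_feature_dict(output_dir, filename="feature_dict.pickle"):
    """Load the pickled features and return {feature_name: shape or type name}."""
    path = os.path.join(output_dir, filename)
    # Only unpickle files you produced yourself: pickle can execute arbitrary code.
    with open(path, "rb") as handle:
        features = pickle.load(handle)
    summary = {}
    for name, value in features.items():
        # Array-like values (e.g. NumPy arrays) expose .shape; others report their type.
        shape = getattr(value, "shape", None)
        summary[name] = tuple(shape) if shape is not None else type(value).__name__
    return summary
```

Printing the returned summary is a quick sanity check that feature generation produced the expected keys before inference is launched.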

Addresses Issue #38

@nipun-khanna nipun-khanna changed the title Create CLI for alignment, inference, and visualization Create CLI for alignment, inference, visualization, and feature gen Apr 27, 2026
@nipun-khanna nipun-khanna changed the title Create CLI for alignment, inference, visualization, and feature gen Create CLIs for full OpenFold pipeline Apr 27, 2026
@Bailsnob

Added a new CLI tool to cluster protein sequences using mmseqs2 with PDB‑style parameters

  • Produces standardized cluster files for downstream VizFold workflows
  • Integration with:
    • mmseqs2 easy-cluster pipeline
    • PDB‑style identity thresholds and coverage settings
  • Outputs a reformatted text file where each line lists all {PDB_ID}_{CHAIN_ID} entries in a cluster
  • Includes strong input validation for FASTA paths, mmseqs2 binary, and output directories
  • Clear logging for each clustering stage

Usage example

python vizfold_cli_fasta_to_clusterfile.py \
    sequences.fasta clusters.txt /path/to/mmseqs \
    --seq-id 0.4
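
The reformatting step can be pictured with the sketch below. mmseqs2 easy-cluster writes a two-column, tab-separated file of (representative, member) pairs, and the sketch groups those pairs into one line per cluster. It assumes the sequence IDs already have the {PDB_ID}_{CHAIN_ID} form and that entries on a line are space-separated. The function name is hypothetical, and the script in this PR may differ in detail.

```python
from collections import defaultdict


def cluster_tsv_to_lines(tsv_lines):
    """Group (representative, member) rows into one text line per cluster."""
    clusters = defaultdict(list)  # dicts keep insertion order, so cluster order is stable
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return [" ".join(members) for members in clusters.values()]


rows = ["1ABC_A\t1ABC_A\n", "1ABC_A\t2XYZ_B\n", "3DEF_A\t3DEF_A\n"]
print(cluster_tsv_to_lines(rows))  # ['1ABC_A 2XYZ_B', '3DEF_A']
```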

Added a new CLI tool to generate FASTA files from alignment directories or alignment‑DB index files

  • Supports both directory‑based alignments and compressed alignment‑DB formats
  • Integration with:
    • MGnify, UniRef90, and BFD/Uniclust alignment file formats
    • Multi‑threaded extraction for large alignment sets
  • Outputs a consolidated FASTA containing one entry per chain
  • Includes strong validation for alignment sources and output paths
  • Clear logging for each extraction stage

Usage example

python vizfold_cli_align_fasta.py output.fasta \
    --alignment-dir alignments/
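
For the directory-based case, the extraction could look something like the sketch below. It makes several assumptions that are not confirmed by this PR: one subdirectory per chain under the alignment directory, at least one .a3m file in each, and the first record of that file being the query sequence. All function names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def first_a3m_sequence(a3m_path):
    """Return the first sequence in an A3M file (assumed to be the query)."""
    sequence_parts = []
    seen_header = False
    with open(a3m_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seen_header:
                    break  # reached the second record; the query is complete
                seen_header = True
            elif seen_header and line:
                sequence_parts.append(line)
    return "".join(sequence_parts).replace("-", "")


def chain_entry(alignment_dir, chain_id):
    """Build one FASTA entry for a chain directory, or None if it has no A3M file."""
    chain_dir = os.path.join(alignment_dir, chain_id)
    for name in sorted(os.listdir(chain_dir)):
        if name.endswith(".a3m"):
            sequence = first_a3m_sequence(os.path.join(chain_dir, name))
            return f">{chain_id}\n{sequence}\n"
    return None


def alignment_dir_to_fasta(alignment_dir, output_fasta, max_workers=4):
    """Write one FASTA entry per chain directory; return the number written."""
    chain_ids = sorted(
        d for d in os.listdir(alignment_dir)
        if os.path.isdir(os.path.join(alignment_dir, d))
    )
    # Threads help here because the work is dominated by file I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        entries = list(pool.map(lambda c: chain_entry(alignment_dir, c), chain_ids))
    written = 0
    with open(output_fasta, "w") as out:
        for entry in entries:
            if entry is not None:
                out.write(entry)
                written += 1
    return written
```

Because pool.map preserves input order, the consolidated FASTA comes out sorted by chain ID regardless of which thread finishes first.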

Addresses Issue #38

Added details about additional utility CLIs for FASTA clustering and alignment extraction, including usage examples and output descriptions.
@PranavNarala1

I think this PR does a good job of making the pipeline more usable from the command line instead of relying on individual scripts being run manually. One thing I liked is that it covers multiple stages of the workflow, including alignment, inference, visualization, and feature generation, so it feels more like a complete interface rather than just one extra utility. I also thought the stronger argument validation and run-manifest idea were good additions, since those make debugging and reproducibility a lot easier for longer HPC workflows. One thing I would still suggest checking is whether the growing number of CLI scripts could create overlap or duplicated logic over time, especially since the PR already mentions removing duplicate argparsers. It might help to think about whether some shared validation or common argument handling should be centralized early so the CLI layer stays easier to maintain as more commands get added.
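
One way to centralize common argument handling, sketched here only as an illustration, is argparse's parents mechanism: shared options are declared once on a parent parser and inherited by each command. The program name vizfold, the subcommand names, and the --output-dir and --dry-run options are assumptions for the example, not the layout used in this PR.

```python
import argparse


def common_parser():
    """Arguments shared by every subcommand (names here are illustrative)."""
    parent = argparse.ArgumentParser(add_help=False)
    parent.add_argument("--output-dir", required=True)
    parent.add_argument("--dry-run", action="store_true",
                        help="Print the planned work without running it")
    return parent


def build_parser():
    parser = argparse.ArgumentParser(prog="vizfold")
    subcommands = parser.add_subparsers(dest="command", required=True)
    shared = common_parser()
    # Each subcommand inherits the shared options and adds its own.
    align = subcommands.add_parser("align", parents=[shared])
    align.add_argument("fasta_path")
    infer = subcommands.add_parser("infer", parents=[shared])
    infer.add_argument("features_path")
    return parser


args = build_parser().parse_args(["align", "seqs.fasta", "--output-dir", "out", "--dry-run"])
print(args.command, args.output_dir, args.dry_run)  # align out True
```

A shared validation helper could be attached in the same place, so new commands pick up both the options and the checks without copying them.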

4 participants