
Hydra Configs #70

Open
vratins wants to merge 11 commits into main from dev_paths_clarity

Conversation

Contributor

@vratins vratins commented Mar 24, 2026

PR towards:

  • Replace argparse with Hydra; defaults stored in configs
  • Replace all hard-coded paths with relative paths or no default
  • Replace _x with _feat for all nodes (trivial change hence in this PR)

Summary by CodeRabbit

  • Chores

    • Migrated training and inference CLIs to Hydra; scripts now use structured configs and require explicit data directory inputs.
    • Updated runtime deps to include Hydra/OmegaConf and adjusted visualization package.
  • Documentation

    • README overhauled for Hydra usage; CLI flag docs replaced with configuration structure and key mappings.
    • Added comprehensive YAML configs for data, flow, models, logging, training, and inference.
  • Bug Fixes

    • Standardized cached feature naming and updated cache handling.
  • Tests

    • Tests updated to use structured configs (OmegaConf) instead of argparse.

Copilot AI review requested due to automatic review settings March 24, 2026 21:57

coderabbitai Bot commented Mar 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0a2d7a24-2124-4518-88b8-702c38a61a96

📥 Commits

Reviewing files that changed from the base of the PR and between 45bd4ec and d615f1a.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • pyproject.toml
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml

📝 Walkthrough

Walkthrough

This PR migrates CLI argument parsing to Hydra, adds new Hydra configs (train, inference, logging, data, flow, model variants), updates scripts to accept DictConfig (train.py, inference.py), renames cached tensor keys to *_feat, makes base_pdb_dir required, updates docs/tests, and adds hydra/omegaconf deps.

Changes

Cohort / File(s) Summary
Hydra config files
configs/train.yaml, configs/inference.yaml, configs/logging/default.yaml, configs/data/default.yaml, configs/flow/default.yaml, configs/model/esm.yaml, configs/model/gvp.yaml, configs/model/slae.yaml
Add new Hydra YAML configs defining training, inference, logging, data quality/water filters, flow options, and model variants; declare required top-level keys (train_list, val_list, run_dir, pdb_list, output_dir) and nested structured options.
Train & inference scripts
scripts/train.py, scripts/inference.py
Replace argparse with @hydra.main DictConfig entrypoints; change function signatures to accept cfg: DictConfig; load resolved config YAMLs via OmegaConf; rewire scheduler/optimizer/dataloader/W&B/metrics/checkpoint plumbing to use cfg.*.
Dataset & feature naming
src/dataset.py, src/flow.py, tests/test_dataset.py
Rename cached/runtime feature keys from protein_x/water_x/mate_x to protein_feat/water_feat/mate_feat; update cache read/write and graph .x assignments; enforce base_pdb_dir required at init (raises if missing); update tests accordingly.
Embedding CLI changes
scripts/generate_esm_embeddings.py, scripts/generate_slae_embeddings.py
Remove hardcoded --base_pdb_dir defaults and make --base_pdb_dir a required CLI argument.
Tests & test helpers
tests/test_train_config.py, other tests...
Migrate tests from argparse.Namespace to OmegaConf.create(...)/DictConfig, update call sites, and remove argparse-specific tests.
Documentation
README.md
Update README to document Hydra-based usage (override syntax), config structure and keys, checkpoint/save cadence (logging.save_every), cache naming (*_feat), and W&B config keys.
Project dependencies
pyproject.toml
Add hydra-core and omegaconf to project dependencies; replace pymol-open-source with pymol-open-source-whl>=3.1.0.4.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Hydra as Hydra (config)
  participant Train as scripts/train.py
  participant Dataset
  participant Model
  participant WAndB as W&B

  User->>Hydra: supply overrides (e.g., model=esm, training.batch_size=..., run_dir=...)
  Hydra->>Train: resolved cfg passed to main(cfg)
  Train->>Dataset: init dataloaders using cfg.data and cfg.training
  Train->>Model: build model using cfg.model and cfg.flow
  Train->>WAndB: init logging using cfg.logging.wandb
  Train->>Model: run training loop per cfg.training
  Model-->>Train: emit metrics & checkpoints
  Train->>WAndB: log metrics & artifacts per cfg.logging.save_every
  Train->>Hydra: write resolved cfg to run_dir/config.yaml

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • marcuscollins

Poem

🐰
Hydra whispers, YAML leaves unfurl,
Configs hop in—each a tidy pearl,
Features renamed, directories made true,
Training tunes up, checkpoints hum anew,
A carrot-coded cheer for the dev crew! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive. The title 'Hydra Configs' is vague and generic, using non-descriptive terminology that does not clearly convey the scope or significance of the changes. Resolution: consider a more descriptive title such as 'Replace argparse with Hydra configuration system' or 'Migrate to Hydra configs and remove hardcoded paths' to better reflect the actual changes.
✅ Passed checks (2 passed)
Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage: ✅ Passed. Docstring coverage is 80.65%, which is sufficient; the required threshold is 80.00%.



Copilot AI left a comment


Pull request overview

This PR migrates the training/inference entrypoints from argparse to Hydra configs (stored under configs/), removes hard-coded default paths by making key paths user-supplied, and renames cached node feature keys from *_x to *_feat across the dataset/cache layer.

Changes:

  • Replaced argparse-driven train.py / inference.py CLIs with Hydra (configs/train.yaml, configs/inference.yaml, and grouped defaults).
  • Removed hard-coded default base_pdb_dir in dataset + embedding scripts; made it required.
  • Renamed cached feature keys (protein_x/water_x to protein_feat/water_feat) and updated tests/docs accordingly.

Reviewed changes

Copilot reviewed 18 out of 26 changed files in this pull request and generated 7 comments.

Show a summary per file
  • tests/test_train_config.py: Updates tests to pass Hydra-style cfg into resolve_encoder_config; removes argparse-specific tests.
  • tests/test_dataset.py: Updates cache-key expectations to protein_feat / water_feat.
  • src/flow.py: Renames local water feature variable (still assigned to PyG .x).
  • src/dataset.py: Removes hard-coded base_pdb_dir default; enforces base_pdb_dir presence; renames cached keys to *_feat.
  • scripts/train.py: Converts training pipeline to Hydra; writes config.yaml into the run directory.
  • scripts/inference.py: Converts inference pipeline to Hydra; loads config.yaml from training run.
  • scripts/generate_slae_embeddings.py: Makes --base_pdb_dir required (removes hard-coded default).
  • scripts/generate_esm_embeddings.py: Makes --base_pdb_dir required (removes hard-coded default).
  • configs/train.yaml: New Hydra training entrypoint config with grouped defaults.
  • configs/model/slae.yaml: New Hydra model group config for SLAE.
  • configs/model/gvp.yaml: New Hydra model group config for GVP.
  • configs/model/esm.yaml: New Hydra model group config for ESM.
  • configs/logging/default.yaml: New Hydra logging group config (save/checkpoint/W&B settings).
  • configs/inference.yaml: New Hydra inference entrypoint config.
  • configs/flow/default.yaml: New Hydra flow group config.
  • configs/data/default.yaml: New Hydra data group config (paths + quality/water filter settings).
  • README.md: Updates docs for Hydra usage and *_feat cache naming.
  • figures/water_quality/01_ediam_waters.png: Adds figure artifact.
  • figures/water_quality/02_ediam_pdbs.png: Adds figure artifact.
  • figures/water_quality/03_rsccs_waters.png: Adds figure artifact.
  • figures/water_quality/04_rsccs_pdbs.png: Adds figure artifact.
  • figures/water_quality/05_bfactor_waters.png: Adds figure artifact.
  • figures/water_quality/06_bfactor_pdbs.png: Adds figure artifact.
  • figures/water_quality/07_ediam_bfactor_correlation.png: Adds figure artifact.
Comments suppressed due to low confidence (3)

tests/test_train_config.py:151

  • These inference-related tests still construct a flat training_config / config dict, but the training script now writes a nested Hydra config.yaml (data.*, model.*, etc.). Once inference is updated to consume the nested config, these tests should be updated (or duplicated) to cover the Hydra-shaped config so they validate the new behavior rather than the deprecated flat format.
    encoder = build_encoder(encoder_config, device)
    model = FlowWaterGVP(encoder=encoder)

    assert model.encoder.output_dims == (128, 0)


def test_inference_build_model_from_config_uses_embedding_dim(device):
    config = {
        "encoder_type": "slae",
        "hidden_s": 128,
        "hidden_v": 32,
        "flow_layers": 2,
        "node_scalar_in": 16,
        "embedding_dim": 128,
        "k_pw": 8,
        "k_ww": 8,
    }

    model = build_model_from_config(config, device)

    assert model.encoder.output_dims == (128, 0)


def test_inference_extracts_filter_config_from_training_config():
    """Verify inference correctly extracts filter params from training config."""
    from scripts.inference import _extract_dataset_filter_config

    training_config = {
        "min_water_residue_ratio": 0.7,
        "max_protein_dist": 4.5,
        "filter_by_edia": False,
    }

scripts/inference.py:84

  • The inference script still assumes the training config is a flat dict (e.g., config.get('max_com_dist')), but scripts/train.py now writes a Hydra-composed config.yaml with nested keys (e.g., data.quality.max_com_dist, data.water_filter.*). This will silently fall back to defaults and run inference with different filtering than training. Update _extract_dataset_filter_config to read from the nested structure (and optionally keep backward-compat for old flat configs).


def _extract_dataset_filter_config(config: dict) -> dict:
    """Extract dataset filter params from training config with fallback to defaults."""
    return {
        "max_com_dist": config.get("max_com_dist", 25.0),
        "max_clash_fraction": config.get("max_clash_fraction", 0.05),
        "clash_dist": config.get("clash_dist", 2.0),
        "interface_dist_threshold": config.get("interface_dist_threshold", 4.0),
        "min_water_residue_ratio": config.get("min_water_residue_ratio", 0.6),
        "edia_dir": config.get("edia_dir"),
        "max_protein_dist": config.get("max_protein_dist", 5.0),
        "min_edia": config.get("min_edia", 0.4),
        "max_bfactor_zscore": config.get("max_bfactor_zscore", 1.5),
        "filter_by_distance": config.get("filter_by_distance", True),
        "filter_by_edia": config.get("filter_by_edia", True),

scripts/inference.py:295

  • load_config() now loads a Hydra-composed config.yaml, but build_model_from_config() (and later FlowMatcher(p_self_cond=config.get(...))) still expect flat, top-level keys like encoder_type, hidden_s, flow_layers, p_self_cond, etc. With the new nested config layout (model.*, flow.*, ...), inference will build the wrong model and checkpoint loading is likely to fail (e.g., cached encoders when embedding_dim was inferred during training). Update inference to read the nested structure or have training write a resolved/flattened config that matches what inference expects.
    logger.info(f"Using device: {device}")

    # load config and build model
    logger.info(f"Loading model from: {run_dir}")
    config = load_config(run_dir)
    model = build_model_from_config(config, device)

    # load checkpoint
    checkpoint_path = run_dir / "checkpoints" / cfg.inference.checkpoint
    epoch = load_checkpoint(model, checkpoint_path, device)
    logger.info(f"Loaded checkpoint: {checkpoint_path} (epoch {epoch})")

    # Create FlowMatcher
    flow_matcher = FlowMatcher(
        model=model,


Comment thread scripts/inference.py
Comment on lines 312 to 320
filter_config = _extract_dataset_filter_config(config)

dataset = ProteinWaterDataset(
pdb_list_file=args.pdb_list,
processed_dir=args.processed_dir,
base_pdb_dir=args.base_pdb_dir,
pdb_list_file=cfg.pdb_list,
processed_dir=cfg.processed_dir,
base_pdb_dir=cfg.base_pdb_dir,
encoder_type=encoder_type,
include_mates=include_mates,
geometry_cache_name=geometry_cache_name,

Copilot AI Mar 24, 2026


ProteinWaterDataset requires processed_dir (str) and now also requires base_pdb_dir (raises if None), but inference passes cfg.processed_dir / cfg.base_pdb_dir directly. Since configs/inference.yaml sets both to null, running inference without overrides will crash. Consider making these fields required in the Hydra config (???) and/or falling back to the training run’s config.yaml values when they’re not provided in the inference config.

Comment thread scripts/train.py
Comment on lines 661 to 665

logger.info("=" * 60)
logger.info(f"Run name: {args.run_name}")
logger.info(f"Run name: {run_name}")
logger.info(f"Run directory: {run_dir}")
logger.info(f"Log file: {log_file}")

Copilot AI Mar 24, 2026


Training writes config.yaml before computing and logging node_scalar_in and resolved_encoder_config. For cached encoders (SLAE/ESM), embedding_dim is often inferred at runtime; if it isn’t persisted into the saved config, scripts/inference.py can’t reliably reconstruct the model architecture for checkpoint loading. Consider writing a resolved config after resolve_encoder_config() (e.g., update cfg.model.embedding_dim / add node_scalar_in / resolved_encoder_config, or write a separate resolved_config.yaml) so inference can build an exact match for the checkpoint.

Comment thread configs/data/default.yaml
Comment on lines +3 to +4
processed_dir: null
base_pdb_dir: null

Copilot AI Mar 24, 2026


configs/data/default.yaml leaves processed_dir and base_pdb_dir as null, but the training pipeline passes these directly into get_dataloader() / ProteinWaterDataset, where processed_dir is required and base_pdb_dir=None now raises. Mark these as required (???) or add explicit validation early in scripts/train.py so users get a clear Hydra/ValueError message instead of a downstream type/path error.

Suggested change
processed_dir: null
base_pdb_dir: null
processed_dir: ???
base_pdb_dir: ???

Collaborator


@vratins I would favor validation in train.py over requiring them here. One thing I don't like about Hydra is that it easily becomes opaque: you can't easily figure out what a required argument needs to be. I don't know of a good way to document options when using Hydra configs. One approach is to validate all the parameters up front in your scripts and provide useful error messages, although frankly that's pretty tedious.
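A sketch of the up-front validation suggested here (the set of required keys is an assumption based on the walkthrough; adjust to the real config):

```python
# Hypothetical required keys for the data group; taken from the PR walkthrough.
REQUIRED_DATA_KEYS = ("processed_dir", "base_pdb_dir", "train_list", "val_list")

def validate_data_config(data_cfg: dict) -> None:
    """Fail fast with one readable message listing every missing path,
    instead of a downstream Path(None) or type error."""
    missing = [k for k in REQUIRED_DATA_KEYS if data_cfg.get(k) is None]
    if missing:
        hints = " ".join(f"data.{k}=<path>" for k in missing)
        raise ValueError(
            f"Missing required data config values: {', '.join(missing)}. "
            f"Set them on the command line, e.g. {hints}"
        )
```

Collecting all missing keys into one error keeps the tedium on the script side while giving users the full fix in a single run.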

Comment thread README.md

| Config Key | Default | Description |
|------------|---------|-------------|
| `run_dir` | required | Path to training run directory (contains config.json) |

Copilot AI Mar 24, 2026


README still says the training run directory “contains config.json”, but training/inference were switched to config.yaml. Update this row (and any nearby references) to avoid sending users looking for a non-existent file.

Suggested change
| `run_dir` | required | Path to training run directory (contains config.json) |
| `run_dir` | required | Path to training run directory (contains config.yaml) |

Comment thread scripts/inference.py
Comment on lines +300 to +307
logger.info(f"Loading PDBs from: {cfg.pdb_list}")

# Determine include_mates from args or config
include_mates = args.include_mates or config.get("include_mates", False)
# Determine include_mates from cfg or training config
include_mates = cfg.inference.include_mates or config.get("include_mates", False)
encoder_type = config.get("encoder_type", "gvp")

# Use --geometry_cache if provided, otherwise use config's geometry_cache_name
geometry_cache_name = args.geometry_cache or config.get(
# Use geometry_cache if provided, otherwise use config's geometry_cache_name
geometry_cache_name = cfg.inference.geometry_cache or config.get(

Copilot AI Mar 24, 2026


include_mates, encoder_type, and geometry_cache_name are read from config.get(...), but the training config saved by Hydra will store these under nested keys (data.include_mates, model.encoder_type, data.geometry_cache_name). As written, inference will default to include_mates=False, encoder_type='gvp', and geometry_cache_name='geometry' even if training used different values. Update these lookups to use the nested structure (and optionally keep backward compatibility for older flat configs).
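Separately, the `or` fallback in the diff above (cfg.inference.include_mates or config.get(...)) treats an explicit False as unset. A None-sentinel resolution avoids that (a sketch with hypothetical names):

```python
def resolve_override(inference_value, training_config: dict, key: str, default=False):
    """Treat None as 'not set': only then fall back to the training config.
    An explicit False in the inference config therefore still wins."""
    if inference_value is not None:
        return inference_value
    return training_config.get(key, default)
```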

Comment thread configs/inference.yaml
Comment on lines +12 to +14
# Data directories
processed_dir: null
base_pdb_dir: null

Copilot AI Mar 24, 2026


processed_dir and base_pdb_dir default to null, but inference always instantiates ProteinWaterDataset and will fail if these aren’t provided. Consider making them required (???) or documenting/implementing fallback to the training run’s config (e.g., data.processed_dir / data.base_pdb_dir) when omitted.

Suggested change
# Data directories
processed_dir: null
base_pdb_dir: null
# Data directories (must be set via CLI or override)
processed_dir: ???
base_pdb_dir: ???

Comment thread scripts/inference.py
--num_steps 100 \
--save_gifs
python scripts/inference.py run_dir=... pdb_list=... output_dir=...
python scripts/inference.py run_dir=... pdb_list=... output_dir=... integration.method=euler

Copilot AI Mar 24, 2026


The inference usage string shows overrides like integration.method=euler, but in configs/inference.yaml the integration config is nested under inference.integration.*. Using integration.method won’t override anything. Update the usage examples to inference.integration.method=... (and similarly for other nested keys) so CLI invocation matches the actual config structure.

Suggested change
python scripts/inference.py run_dir=... pdb_list=... output_dir=... integration.method=euler
python scripts/inference.py run_dir=... pdb_list=... output_dir=... inference.integration.method=euler

Collaborator


yeah, definitely need to fix this.


coderabbitai Bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/train.py (1)

193-199: ⚠️ Potential issue | 🟡 Minor

Update error message to reflect Hydra config paths.

The error message still references argparse-style --processed_dir syntax, which is inconsistent with the Hydra migration.

📝 Suggested fix
     if field not in sample_data["protein"]:
         raise ValueError(
             f"Selected encoder '{encoder_type}' requires protein.{field}, "
             f"but it is missing from dataset samples. "
-            f"Expected cached embeddings in data['protein'].embedding from "
-            f"--processed_dir/{encoder_type}/<cache_key>.pt."
+            f"Expected cached embeddings in data['protein'].embedding from "
+            f"data.processed_dir/{encoder_type}/<cache_key>.pt."
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/train.py` around lines 193 - 199, The ValueError raised in the block
checking sample_data["protein"] uses argparse-style text (“--processed_dir/...”)
but the project now uses Hydra; update the message in the if block that
references encoder_type and field so it points to the Hydra config key (e.g.,
data.processed_dir) and the expected cached path for embeddings under the
configured processed_dir for the given encoder_type and cache_key, instead of
using the CLI flag syntax. Ensure the new message still includes encoder_type,
the missing field name, and guidance that cached embeddings should be found in
the configured processed_dir path for that encoder (referencing encoder_type and
<cache_key> placeholders).
🧹 Nitpick comments (1)
scripts/train.py (1)

625-636: Consider warning on unknown scheduler type.

If cfg.training.scheduler.type is neither "cosine" nor "step", main_scheduler silently remains None. This could lead to confusion if a user mistypes the scheduler type.

💡 Suggested improvement
     if cfg.training.scheduler.type == "cosine":
         main_scheduler = CosineAnnealingLR(
             optimizer,
             T_max=cfg.training.epochs,
             eta_min=cfg.training.optimizer.lr * cfg.training.scheduler.eta_min_factor,
         )
     elif cfg.training.scheduler.type == "step":
         main_scheduler = StepLR(
             optimizer,
             step_size=cfg.training.scheduler.step_size,
             gamma=cfg.training.scheduler.step_gamma,
         )
+    elif cfg.training.scheduler.type not in (None, "none"):
+        logger.warning(f"Unknown scheduler type '{cfg.training.scheduler.type}', no main scheduler will be used.")

     return warmup_scheduler, main_scheduler
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/train.py` around lines 625 - 636, The scheduler selection block can
leave main_scheduler as None if cfg.training.scheduler.type is neither "cosine"
nor "step"; update the branch that sets main_scheduler (the code using
CosineAnnealingLR and StepLR) to handle unknown values by either raising a clear
ValueError or emitting a warning via the project's logger and falling back to a
sensible default; reference cfg.training.scheduler.type, main_scheduler,
CosineAnnealingLR and StepLR so the change is made in that exact selection block
and ensure the message includes the invalid type value for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@configs/inference.yaml`:
- Around line 13-14: The inference config currently leaves processed_dir and
base_pdb_dir null which causes Path(None) and the runtime guard in
src/dataset.py when main() constructs ProteinWaterDataset; update the code so
these dirs are either required or inherited: either set sensible defaults in
configs/inference.yaml (e.g., point to the training run's output) or change
main() to fetch the training run's processed_dir/base_pdb_dir and pass them into
ProteinWaterDataset when present, and add a clear validation in
ProteinWaterDataset.__init__ (or the factory that creates it) to raise a
descriptive error if processed_dir or base_pdb_dir are still None to prevent
Path(None) from being constructed.

In `@README.md`:
- Around line 334-339: Update the README table entry for the `run_dir` key to
reflect that inference now reads `config.yaml` (not `config.json`): change the
description or default value from "contains config.json" to "contains
config.yaml" and ensure any mention of file name matches `scripts/inference.py`
which loads `run_dir/config.yaml`.

In `@scripts/inference.py`:
- Around line 278-280: The code computes a resolved torch.device in the local
variable device but continues to pass the raw config value
(cfg.inference.hardware.device) into run_inference_batch (and the other
batch/inference call around lines ~382-389); update all calls to
run_inference_batch (and any similar inference entry points) to pass the
resolved device variable instead of the raw config string so the model and
inference both use the same actual device (reference symbols: device,
cfg.inference.hardware.device, run_inference_batch).
- Around line 302-303: The current assignment include_mates =
cfg.inference.include_mates or config.get("include_mates", False) treats False
as unset; change it to explicitly check for None so a False in cfg.inference
overrides training: if cfg.inference.include_mates is not None then use that
value, otherwise fall back to config.get("include_mates", False). Update the
logic around include_mates (references: include_mates,
cfg.inference.include_mates, config.get("include_mates", False)) to implement
this tri-state behavior.

In `@scripts/train.py`:
- Around line 472-485: The block uses variable "step" (from the training loop
over train_loader) but if train_loader is empty "step" will be undefined causing
a NameError; initialize step = -1 (or another safe sentinel) before the loop
that iterates over train_loader (the loop that sets step) so this post-loop
logic using step, optimizer, optimizer_step_count,
cfg.training.optimizer.grad_accum_steps, and warmup_scheduler can run safely
even when train_loader has zero batches; ensure the sentinel value preserves the
intended conditional behavior ((step + 1) %
cfg.training.optimizer.grad_accum_steps) and the warmup_scheduler check that
references optimizer_step_count and cfg.training.scheduler.warmup_steps.

In `@src/dataset.py`:
- Around line 1165-1173: When loading the cached geometry dict (the code that
assigns to protein_feat, protein_res_idx, pp_edge_index, pp_edge_unit_vectors,
pp_edge_rbf, num_asu_protein, num_protein_residues, water_pos, water_feat —
e.g., the loader invoked from _preprocess_all()), add a compatibility shim that
checks for the new keys and, if they are missing, falls back to legacy key names
from older caches; for example, if "protein_feat" is absent but
"protein_features" (or whatever the previous key was) exists, set protein_feat =
cached["protein_features"]; do the same fallback for "water_feat" (e.g.,
"water_features") and any other renamed keys so old geometry*.pt files can still
be read without requiring users to wipe caches. Ensure this mapping happens
before any use of those variables and log a debug/info message when a legacy key
is used.

---

Outside diff comments:
In `@scripts/train.py`:
- Around line 193-199: The ValueError raised in the block checking
sample_data["protein"] uses argparse-style text (“--processed_dir/...”) but the
project now uses Hydra; update the message in the if block that references
encoder_type and field so it points to the Hydra config key (e.g.,
data.processed_dir) and the expected cached path for embeddings under the
configured processed_dir for the given encoder_type and cache_key, instead of
using the CLI flag syntax. Ensure the new message still includes encoder_type,
the missing field name, and guidance that cached embeddings should be found in
the configured processed_dir path for that encoder (referencing encoder_type and
<cache_key> placeholders).

---

Nitpick comments:
In `@scripts/train.py`:
- Around line 625-636: The scheduler selection block can leave main_scheduler as
None if cfg.training.scheduler.type is neither "cosine" nor "step"; update the
branch that sets main_scheduler (the code using CosineAnnealingLR and StepLR) to
handle unknown values by either raising a clear ValueError or emitting a warning
via the project's logger and falling back to a sensible default; reference
cfg.training.scheduler.type, main_scheduler, CosineAnnealingLR and StepLR so the
change is made in that exact selection block and ensure the message includes the
invalid type value for clarity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7a50d22d-5567-4691-93df-d595a2e0e8b4

📥 Commits

Reviewing files that changed from the base of the PR and between 99b3945 and bdd319d.

⛔ Files ignored due to path filters (7)
  • figures/water_quality/01_ediam_waters.png is excluded by !**/*.png
  • figures/water_quality/02_ediam_pdbs.png is excluded by !**/*.png
  • figures/water_quality/03_rsccs_waters.png is excluded by !**/*.png
  • figures/water_quality/04_rsccs_pdbs.png is excluded by !**/*.png
  • figures/water_quality/05_bfactor_waters.png is excluded by !**/*.png
  • figures/water_quality/06_bfactor_pdbs.png is excluded by !**/*.png
  • figures/water_quality/07_ediam_bfactor_correlation.png is excluded by !**/*.png
📒 Files selected for processing (17)
  • README.md
  • configs/data/default.yaml
  • configs/flow/default.yaml
  • configs/inference.yaml
  • configs/logging/default.yaml
  • configs/model/esm.yaml
  • configs/model/gvp.yaml
  • configs/model/slae.yaml
  • configs/train.yaml
  • scripts/generate_esm_embeddings.py
  • scripts/generate_slae_embeddings.py
  • scripts/inference.py
  • scripts/train.py
  • src/dataset.py
  • src/flow.py
  • tests/test_dataset.py
  • tests/test_train_config.py

Comment thread configs/inference.yaml
Comment on lines +13 to +14
processed_dir: null
base_pdb_dir: null


⚠️ Potential issue | 🟠 Major

Make the dataset directories required or inherit them from the training run.

main() passes both values straight into ProteinWaterDataset; processed_dir=null hits Path(None) and base_pdb_dir=null triggers the new runtime guard in src/dataset.py. The documented inference command does not override either field, so this config cannot start as-is.

Safer default if you want callers to provide them explicitly
-processed_dir: null
-base_pdb_dir: null
+processed_dir: ???
+base_pdb_dir: ???
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@configs/inference.yaml` around lines 13 - 14, The inference config currently
leaves processed_dir and base_pdb_dir null which causes Path(None) and the
runtime guard in src/dataset.py when main() constructs ProteinWaterDataset;
update the code so these dirs are either required or inherited: either set
sensible defaults in configs/inference.yaml (e.g., point to the training run's
output) or change main() to fetch the training run's processed_dir/base_pdb_dir
and pass them into ProteinWaterDataset when present, and add a clear validation
in ProteinWaterDataset.__init__ (or the factory that creates it) to raise a
descriptive error if processed_dir or base_pdb_dir are still None to prevent
Path(None) from being constructed.
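
The fail-fast idea from this prompt can be sketched as follows. This is a minimal illustration, not repo code; `require_dir` and the example paths are hypothetical:

```python
# Hypothetical up-front validation: fail fast with a descriptive error instead
# of letting Path(None) blow up deep inside the dataset.
from pathlib import Path
from typing import Optional

def require_dir(name: str, value: Optional[str]) -> Path:
    if value is None:
        raise ValueError(
            f"{name} is required for inference; pass it on the command line, "
            f"e.g. `python scripts/inference.py {name}=/path/to/dir`"
        )
    return Path(value)

print(require_dir("processed_dir", "cache/processed"))
try:
    require_dir("base_pdb_dir", None)
    raised = False
except ValueError:
    raised = True
print(raised)
```

Whether the values come from the inference config or are inherited from the training run, this check runs before any dataset construction, so the error names the missing key instead of surfacing as a `Path(None)` traceback.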

Comment thread README.md
Comment on lines +334 to +339
### Key Inference Options

| Config Key | Default | Description |
|------------|---------|-------------|
| `run_dir` | required | Path to training run directory (contains config.json) |
| `pdb_list` | required | Text file with PDB entries (one per line) |

⚠️ Potential issue | 🟡 Minor

run_dir now points to config.yaml, not config.json.

scripts/inference.py now loads run_dir/config.yaml, so this table entry will send users to the wrong file when they inspect a saved run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 334 - 339, Update the README table entry for the
`run_dir` key to reflect that inference now reads `config.yaml` (not
`config.json`): change the description or default value from "contains
config.json" to "contains config.yaml" and ensure any mention of file name
matches `scripts/inference.py` which loads `run_dir/config.yaml`.

Comment thread scripts/inference.py
Comment on lines +56 to +66
     config_path = run_dir / "config.yaml"
     if not config_path.exists():
         raise FileNotFoundError(f"Config file not found: {config_path}")
 
-    with open(config_path, "r") as f:
-        config = json.load(f)
+    from typing import cast
+
+    from omegaconf import OmegaConf
+
+    config = cast(
+        dict, OmegaConf.to_container(OmegaConf.load(config_path), resolve=True)
+    )

@coderabbitai coderabbitai Bot Mar 24, 2026


⚠️ Potential issue | 🔴 Critical

Normalize the loaded Hydra config before returning it.

config.yaml is now the composed Hydra tree, so fields like encoder_type, hidden_s, flow_layers, include_mates, and the dataset filters live under nested sections (model.*, data.*, etc.). The rest of this module still reads flat keys, so returning the raw container here will silently fall back to defaults or build the wrong architecture for non-default runs.

Collaborator


I think this is going to mean that you need to go through the rest of the code and make sure wherever these options are used, you are looking at the correct part of the configuration. I haven't looked at the unit tests yet, but these should thoroughly look at the config and make sure it is loaded and used correctly.


🧩 Analysis chain

Let me inspect the full scope of flat config accesses and what the trained run's config.yaml structure looks like.


🏁 Script executed:

#!/bin/bash
echo "=== All config.get() / config[] calls in inference.py ==="
rg -n "config\.get|config\[" scripts/inference.py

echo ""
echo "=== How train.py writes config.yaml (OmegaConf save) ==="
rg -n "config\.yaml|OmegaConf\.save|OmegaConf\.to_yaml|resolved_encoder_config" scripts/train.py -A3

echo ""
echo "=== resolved_encoder_config key usage in train.py ==="
rg -n "resolved_encoder_config" scripts/train.py -C3

echo ""
echo "=== test_train_config.py content ==="
cat tests/test_train_config.py

echo ""
echo "=== configs/train.yaml ==="
cat configs/train.yaml

echo ""
echo "=== configs/data/default.yaml ==="
cat configs/data/default.yaml

echo ""
echo "=== configs/model/gvp.yaml ==="
cat configs/model/gvp.yaml

echo ""
echo "=== configs/flow/default.yaml ==="
cat configs/flow/default.yaml

Length of output: 7018


@marcuscollins, here is the full audit of every place the loaded config is consumed and how the tests fail to catch the breakage.


Broken flat-key reads vs. the nested config.yaml structure

The Hydra-composed config written by train.py has this shape:

model:          ← encoder_type, hidden_s, hidden_v, flow_layers, …
data:
  quality:      ← max_com_dist, max_clash_fraction, …
  water_filter: ← filter_by_distance, max_protein_dist, min_edia, …
  include_mates, geometry_cache_name, …
flow:           ← p_self_cond, use_self_cond, …
logging:        ← …
training:       ← …
# flat extras added by train.py before saving:
node_scalar_in: …
resolved_encoder_config: {…}

1. build_model_from_config() — every model param silently wrong

| What the code reads | Where it actually lives | Consequence of miss |
|---|---|---|
| `config.get("encoder_type", "gvp")` | `config["model"]["encoder_type"]` | Always builds a GVP model regardless of what was trained |
| `config.get("hidden_s") or 256` | `config["model"]["hidden_s"]` | Silently uses 256; wrong architecture for any non-default run |
| `config.get("hidden_v") or 64` | `config["model"]["hidden_v"]` | Same |
| `config.get("freeze_encoder", …)` | `config["model"]["freeze_encoder"]` | Encoder always treated as unfrozen |
| `config.get("encoder_ckpt")` | `config["model"]["encoder_ckpt"]` | Encoder checkpoint never passed in |
| `config.get("embedding_dim")` | `config["model"]["embedding_dim"]` | Embedding dim always None → breaks SLAE/ESM |
| `config.get("flow_layers") or 3` | `config["model"]["flow_layers"]` | Always 3 layers regardless of training config |
| `config.get("drop_rate", 0.1)` | `config["model"]["drop_rate"]` | Same |
| `config.get("n_message_gvps", 2)` | `config["model"]["n_message_gvps"]` | Same |
| `config.get("n_update_gvps", 2)` | `config["model"]["n_update_gvps"]` | Same |
| `config.get("k_pw") or 16` | `config["model"]["k_pw"]` | Same |
| `config.get("k_ww") or 16` | `config["model"]["k_ww"]` | Same |

The one saving grace is the resolved_encoder_config branch — if train.py stores that as a top-level flat key, build_model_from_config will take that path and skip most of the broken reads. However the remaining hidden_s, hidden_v, flow_layers, etc. that build FlowWaterGVP are still read flat after that branch.

2. _extract_dataset_filter_config() — all filter thresholds fall back to hardcoded defaults

# inference.py (current)
config.get("max_com_dist", 25.0)          # should be config["data"]["quality"]["max_com_dist"]
config.get("max_clash_fraction", 0.05)    # config["data"]["quality"]["max_clash_fraction"]
config.get("clash_dist", 2.0)             # config["data"]["quality"]["clash_dist"]
config.get("interface_dist_threshold", 4.0)  # config["data"]["quality"]["interface_dist_threshold"]
config.get("min_water_residue_ratio", 0.6)   # config["data"]["quality"]["min_water_residue_ratio"]
config.get("max_protein_dist", 5.0)       # config["data"]["water_filter"]["max_protein_dist"]
config.get("min_edia", 0.4)               # config["data"]["water_filter"]["min_edia"]
config.get("max_bfactor_zscore", 1.5)     # config["data"]["water_filter"]["max_bfactor_zscore"]
config.get("filter_by_distance", True)    # config["data"]["water_filter"]["filter_by_distance"]
config.get("filter_by_edia", True)        # config["data"]["water_filter"]["filter_by_edia"]
config.get("filter_by_bfactor", True)     # config["data"]["water_filter"]["filter_by_bfactor"]

All eleven reads will miss and return hardcoded defaults — so inference will always run with the hardcoded thresholds regardless of what was used during training, breaking the train/inference consistency guarantee.

3. main() — three more broken reads

config.get("p_self_cond", 0.5)         # → config["flow"]["p_self_cond"]
config.get("include_mates", False)     # → config["data"]["include_mates"]
config.get("geometry_cache_name", …)   # → config["data"]["geometry_cache_name"]
config.get("encoder_type", "gvp")      # → config["model"]["encoder_type"]  (used for dataset)

Test gaps

Both existing inference-related tests provide flat dicts directly — they will continue to pass even though the bug exists:

# test_train_config.py — passes flat dict, not the Hydra-written nested structure
def test_inference_build_model_from_config_uses_embedding_dim(device):
    config = {
        "encoder_type": "slae",   # ← flat key, bypasses the real problem
        "hidden_s": 128,
        ...
    }
    model = build_model_from_config(config, device)

def test_inference_extracts_filter_config_from_training_config():
    training_config = {
        "min_water_residue_ratio": 0.7,   # ← flat key, bypasses the real problem
        ...
    }
    extracted = _extract_dataset_filter_config(training_config)

What's missing:

  1. A round-trip test that writes a real Hydra-structured config (as train.py would produce) to a temp file, calls load_config(), then asserts every downstream accessor returns the correct value.
  2. A test for _extract_dataset_filter_config with a nested input ({"data": {"quality": {...}, "water_filter": {...}}}).
  3. A test that build_model_from_config with a nested config produces a model with the correct architecture dimensions.

Recommended fix

The cleanest approach is to normalize the nested dict in load_config() (or in a helper) so that all existing callers keep working without changes:

def _flatten_training_config(config: dict) -> dict:
    """
    Flatten a Hydra-structured training config into the legacy flat layout
    expected by build_model_from_config / _extract_dataset_filter_config.
    """
    flat = {}
    # model params
    flat.update(config.get("model", {}))
    # flow params
    flat.update(config.get("flow", {}))
    # data top-level (include_mates, geometry_cache_name, …)
    data = config.get("data", {})
    flat["include_mates"] = data.get("include_mates", False)
    flat["geometry_cache_name"] = data.get("geometry_cache_name", "geometry")
    # quality thresholds
    flat.update(data.get("quality", {}))
    # water filter
    flat.update(data.get("water_filter", {}))
    # preserve flat extras written by train.py (resolved_encoder_config, node_scalar_in, …)
    for k, v in config.items():
        if k not in {"model", "flow", "data", "logging", "training"}:
            flat[k] = v
    return flat


def load_config(run_dir: Path) -> dict:
    ...
    config = cast(dict, OmegaConf.to_container(OmegaConf.load(config_path), resolve=True))
    return _flatten_training_config(config)   # ← add this line

This also makes the existing tests continue to pass without modification (they already pass flat dicts that look like the output of _flatten_training_config), and the new round-trip test can exercise the full path.
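
As a quick sanity check of the key mapping, a trimmed, dependency-free stand-in for the flattening helper behaves as expected on a dict shaped like the Hydra-written config (a real round-trip test would additionally save and reload through OmegaConf on a temp file):

```python
# Trimmed stand-in for _flatten_training_config, exercised on a dict shaped
# like the saved nested config. Names mirror the sketch above; this is an
# illustration, not the repo's implementation.
def flatten(config: dict) -> dict:
    flat: dict = {}
    flat.update(config.get("model", {}))
    flat.update(config.get("flow", {}))
    data = config.get("data", {})
    flat.update(data.get("quality", {}))
    flat.update(data.get("water_filter", {}))
    flat["include_mates"] = data.get("include_mates", False)
    flat["geometry_cache_name"] = data.get("geometry_cache_name", "geometry")
    for k, v in config.items():  # preserve flat extras (node_scalar_in, ...)
        if k not in {"model", "flow", "data", "logging", "training"}:
            flat[k] = v
    return flat

nested = {
    "model": {"encoder_type": "slae", "hidden_s": 128, "flow_layers": 2},
    "flow": {"p_self_cond": 0.25},
    "data": {
        "include_mates": True,
        "quality": {"max_com_dist": 30.0},
        "water_filter": {"min_edia": 0.5},
    },
    "node_scalar_in": 16,
}
flat = flatten(nested)
print(flat["encoder_type"], flat["min_edia"], flat["include_mates"])
```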

Comment thread scripts/inference.py
Comment on lines +278 to +280
device = torch.device(
cfg.inference.hardware.device if torch.cuda.is_available() else "cpu"
)

⚠️ Potential issue | 🟠 Major

Pass the resolved device into batch inference.

The code falls back to CPU at Lines 278-280, but run_inference_batch() still receives the raw config value. On a CPU-only host that means model setup succeeds on CPU and then integration still tries to run on "cuda".

Suggested fix
         batch_results = run_inference_batch(
             flow_matcher,
             batch_graphs,
             method=method,
             num_steps=num_steps,
             use_sc=use_sc,
-            device=cfg.inference.hardware.device,
+            device=str(device),
             water_ratio=water_ratio,
         )

Also applies to: 382-389

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/inference.py` around lines 278 - 280, The code computes a resolved
torch.device in the local variable device but continues to pass the raw config
value (cfg.inference.hardware.device) into run_inference_batch (and the other
batch/inference call around lines ~382-389); update all calls to
run_inference_batch (and any similar inference entry points) to pass the
resolved device variable instead of the raw config string so the model and
inference both use the same actual device (reference symbols: device,
cfg.inference.hardware.device, run_inference_batch).
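
The resolve-once pattern can be sketched without torch; here `cuda_available` stands in for `torch.cuda.is_available()`, and the resolved string is what would be threaded into every downstream call:

```python
# Resolve the requested device once, then pass the resolved value everywhere,
# instead of letting each consumer re-read the raw config string.
def resolve_device(requested: str, cuda_available: bool) -> str:
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"  # fall back in exactly one place
    return requested

device = resolve_device("cuda", cuda_available=False)
print(device)  # the same value should reach run_inference_batch
```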

Comment thread scripts/inference.py
Comment on lines +302 to +303
# Determine include_mates from cfg or training config
include_mates = cfg.inference.include_mates or config.get("include_mates", False)

⚠️ Potential issue | 🟠 Major

Avoid `or` for the `include_mates` override.

An explicit `inference.include_mates=false` is treated as "unset" here, so a run trained with mates cannot be forced off at inference. This needs a tri-state config (`null` = inherit) or an explicit `is None` check.

Suggested fix
-    include_mates = cfg.inference.include_mates or config.get("include_mates", False)
+    include_mates = (
+        config.get("include_mates", False)
+        if cfg.inference.include_mates is None
+        else cfg.inference.include_mates
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/inference.py` around lines 302 - 303, The current assignment
include_mates = cfg.inference.include_mates or config.get("include_mates",
False) treats False as unset; change it to explicitly check for None so a False
in cfg.inference overrides training: if cfg.inference.include_mates is not None
then use that value, otherwise fall back to config.get("include_mates", False).
Update the logic around include_mates (references: include_mates,
cfg.inference.include_mates, config.get("include_mates", False)) to implement
this tri-state behavior.
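
The tri-state rule in miniature (illustrative names, not repo code): `None` means "inherit from the training config", while an explicit `True`/`False` from the inference config always wins.

```python
# Tri-state override: only None falls through to the training-time value.
from typing import Optional

def resolve_include_mates(override: Optional[bool], training_value: bool) -> bool:
    return training_value if override is None else override

print(resolve_include_mates(None, True))   # inherit
print(resolve_include_mates(False, True))  # explicit off beats training value
```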

Comment thread scripts/train.py
Comment on lines +472 to 485
     if (step + 1) % cfg.training.optimizer.grad_accum_steps != 0:
         if cfg.training.optimizer.grad_clip > 0:
             torch.nn.utils.clip_grad_norm_(
                 [p for p in flow_matcher.model.parameters() if p.requires_grad],
-                max_norm=args.grad_clip,
+                max_norm=cfg.training.optimizer.grad_clip,
             )
         optimizer.step()
         optimizer.zero_grad(set_to_none=True)
         optimizer_step_count += 1
-    if warmup_scheduler is not None and optimizer_step_count <= args.warmup_steps:
+    if (
+        warmup_scheduler is not None
+        and optimizer_step_count <= cfg.training.scheduler.warmup_steps
+    ):
         warmup_scheduler.step()

⚠️ Potential issue | 🟡 Minor

Potential NameError if train_loader is empty.

If train_loader has no batches, the for loop at line 402 never executes, leaving step undefined. This would cause a NameError when accessing step on line 472.

🛡️ Suggested defensive fix
+    # Handle remaining gradients at end of epoch
+    if processed_batches == 0:
+        # No batches processed, skip gradient handling
+        final_global_step = (epoch - 1) * len(train_loader)
+        logger.warning(
+            f"Epoch {epoch}: skipped all {skipped_batches} train batches (no waters)."
+        )
+        return (
+            {"train/epoch_loss": float("inf"), "train/epoch_rmsd": float("inf")},
+            final_global_step,
+            optimizer_step_count,
+        )
+
     # Handle remaining gradients at end of epoch
-    if (step + 1) % cfg.training.optimizer.grad_accum_steps != 0:
+    if processed_batches > 0 and (step + 1) % cfg.training.optimizer.grad_accum_steps != 0:

Alternatively, initialize step = -1 before the loop to avoid the NameError while preserving the current logic flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/train.py` around lines 472 - 485, The block uses variable "step"
(from the training loop over train_loader) but if train_loader is empty "step"
will be undefined causing a NameError; initialize step = -1 (or another safe
sentinel) before the loop that iterates over train_loader (the loop that sets
step) so this post-loop logic using step, optimizer, optimizer_step_count,
cfg.training.optimizer.grad_accum_steps, and warmup_scheduler can run safely
even when train_loader has zero batches; ensure the sentinel value preserves the
intended conditional behavior ((step + 1) %
cfg.training.optimizer.grad_accum_steps) and the warmup_scheduler check that
references optimizer_step_count and cfg.training.scheduler.warmup_steps.
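
The sentinel alternative can be sketched in isolation (illustrative helper, not repo code): initializing `step` before the loop keeps the post-loop flush check well defined even when the loader yields zero batches.

```python
# Sentinel fix: step = -1 before the loop means (step + 1) % n == 0 for an
# empty loader, so nothing is flushed and no NameError is raised.
def leftover_accum_window(num_batches: int, grad_accum_steps: int) -> bool:
    step = -1  # sentinel
    for step in range(num_batches):
        pass  # per-batch forward/backward would go here
    return (step + 1) % grad_accum_steps != 0

print(leftover_accum_window(0, 4))  # empty loader: nothing left to flush
print(leftover_accum_window(5, 4))  # one partial accumulation window remains
```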

Comment thread src/dataset.py

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/dataset.py (1)

665-727: ⚠️ Potential issue | 🟠 Major

Don’t require base_pdb_dir for cache-only datasets.

Line 726 makes base_pdb_dir mandatory even when preprocess=False, but __getitem__() only reads the cached .pt files. That regresses cache-only inference/eval setups that don't have the raw PDB tree mounted. Please gate this validation on preprocessing (or defer pdb_path construction until _preprocess_one()).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/dataset.py` around lines 665 - 727, The constructor currently raises
ValueError("base_pdb_dir is required") unconditionally which breaks cache-only
runs; change the validation to only require base_pdb_dir when preprocessing is
enabled (i.e., if preprocess is True and base_pdb_dir is None, then raise), or
alternatively defer any use/creation of pdb paths until _preprocess_one so
__init__ no longer mandates base_pdb_dir; update references in __init__
(geometry_dir/cache_dir setup) and ensure __getitem__ continues to only read .pt
caches without needing base_pdb_dir.
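
The gating idea looks like this in isolation. `resolve_base_pdb_dir` is a hypothetical helper, not code from `src/dataset.py`:

```python
# Only require base_pdb_dir when preprocessing will actually read raw PDBs;
# cache-only datasets (preprocess=False) may leave it as None.
from pathlib import Path
from typing import Optional

def resolve_base_pdb_dir(
    preprocess: bool, base_pdb_dir: Optional[str]
) -> Optional[Path]:
    if preprocess and base_pdb_dir is None:
        raise ValueError("base_pdb_dir is required when preprocess=True")
    return Path(base_pdb_dir) if base_pdb_dir is not None else None

print(resolve_base_pdb_dir(False, None))        # cache-only: allowed
print(resolve_base_pdb_dir(True, "/data/pdb"))  # preprocessing: path resolved
try:
    resolve_base_pdb_dir(True, None)
    raised = False
except ValueError:
    raised = True
print(raised)
```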
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/dataset.py`:
- Around line 665-727: The constructor currently raises ValueError("base_pdb_dir
is required") unconditionally which breaks cache-only runs; change the
validation to only require base_pdb_dir when preprocessing is enabled (i.e., if
preprocess is True and base_pdb_dir is None, then raise), or alternatively defer
any use/creation of pdb paths until _preprocess_one so __init__ no longer
mandates base_pdb_dir; update references in __init__ (geometry_dir/cache_dir
setup) and ensure __getitem__ continues to only read .pt caches without needing
base_pdb_dir.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 35590ad4-3b65-491c-901c-546573b5b419

📥 Commits

Reviewing files that changed from the base of the PR and between 75d01d1 and 45bd4ec.

📒 Files selected for processing (4)
  • README.md
  • scripts/inference.py
  • src/dataset.py
  • tests/test_dataset.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/test_dataset.py
  • README.md
  • scripts/inference.py


Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 26 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (1)

tests/test_train_config.py:155

  • These tests still exercise inference helpers with a legacy flat training config (e.g., encoder_type, min_water_residue_ratio at the top level). Since scripts/train.py now writes a structured Hydra config, consider updating/adding tests that validate load_config() + build_model_from_config() + _extract_dataset_filter_config() against the nested model.* / data.* schema.
def test_inference_build_model_from_config_uses_embedding_dim(device):
    config = {
        "encoder_type": "slae",
        "hidden_s": 128,
        "hidden_v": 32,
        "flow_layers": 2,
        "node_scalar_in": 16,
        "embedding_dim": 128,
        "k_pw": 8,
        "k_ww": 8,
    }

    model = build_model_from_config(config, device)

    assert model.encoder.output_dims == (128, 0)


def test_inference_extracts_filter_config_from_training_config():
    """Verify inference correctly extracts filter params from training config."""
    from scripts.inference import _extract_dataset_filter_config

    training_config = {
        "min_water_residue_ratio": 0.7,
        "max_protein_dist": 4.5,
        "filter_by_edia": False,
    }

    extracted = _extract_dataset_filter_config(training_config)

    assert extracted["min_water_residue_ratio"] == 0.7


Comment thread scripts/train.py
Comment on lines +668 to +672
# Save resolved Hydra config
config_file = run_dir / "config.yaml"
with open(config_file, "w") as f:
f.write(OmegaConf.to_yaml(cfg))
logger.info(f"Configuration saved to: {config_file}")

Copilot AI Mar 30, 2026


config.yaml is written before derived fields are computed (e.g., node_scalar_in and resolved_encoder_config). Inference relies on these to rebuild the encoder shapes when embedding_dim is left null, so the run directory config may be insufficient to reproduce/load the model. Consider saving an enriched config after encoder_config is resolved (or persist a separate resolved config file) and updating inference to read it.

Suggested change

-    # Save resolved Hydra config
-    config_file = run_dir / "config.yaml"
+    # Save the initial Hydra config; this may not include all runtime-derived fields.
+    config_file = run_dir / "raw_config.yaml"
     with open(config_file, "w") as f:
         f.write(OmegaConf.to_yaml(cfg))
-    logger.info(f"Configuration saved to: {config_file}")
+    logger.info(f"Initial configuration saved to: {config_file}")

Comment thread scripts/train.py
Comment on lines 678 to +681
     train_loader = get_dataloader(
-        pdb_list_file=args.train_list,
-        processed_dir=args.processed_dir,
-        batch_size=args.batch_size,
+        pdb_list_file=cfg.train_list,
+        processed_dir=cfg.data.processed_dir,
+        batch_size=cfg.training.batch_size,

Copilot AI Mar 30, 2026


Hydra changes the working directory by default, so passing relative paths (e.g., splits/train.txt, data.processed_dir) directly into get_dataloader can break when run from a Hydra output dir. Either disable chdir (hydra.job.chdir=false) or convert user paths to absolute (e.g., hydra.utils.to_absolute_path) before opening files/directories.
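
The behavior of `hydra.utils.to_absolute_path` can be illustrated with a stdlib stand-in (the real function resolves against the original working directory recorded by Hydra; `to_absolute_path_like` and the paths here are illustrative):

```python
# Stand-in for hydra.utils.to_absolute_path: relative user paths are resolved
# against the original cwd, not Hydra's per-run output directory, so overrides
# like data.processed_dir=cache/ keep working after Hydra chdirs.
import posixpath

def to_absolute_path_like(path: str, original_cwd: str) -> str:
    if posixpath.isabs(path):
        return path  # absolute paths pass through unchanged
    return posixpath.normpath(posixpath.join(original_cwd, path))

print(to_absolute_path_like("splits/train.txt", "/home/user/project"))
print(to_absolute_path_like("/data/pdb", "/home/user/project"))
```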

Comment thread scripts/inference.py
Comment on lines 292 to 296
# Create FlowMatcher
flow_matcher = FlowMatcher(
model=model,
p_self_cond=config.get("p_self_cond", 0.5),
)

Copilot AI Mar 30, 2026


p_self_cond is now configured under flow.p_self_cond in the saved Hydra config, but inference still reads config.get("p_self_cond"). This will ignore the training setting and can change sampling behavior; read from the structured config instead.

Comment thread scripts/inference.py
Comment on lines +301 to 308
+    # Determine include_mates from cfg or training config
+    include_mates = cfg.inference.include_mates or config.get("include_mates", False)
     encoder_type = config.get("encoder_type", "gvp")
 
-    # Use --geometry_cache if provided, otherwise use config's geometry_cache_name
-    geometry_cache_name = args.geometry_cache or config.get(
+    # Use geometry_cache if provided, otherwise use config's geometry_cache_name
+    geometry_cache_name = cfg.inference.geometry_cache or config.get(
         "geometry_cache_name", "geometry"
     )

Copilot AI Mar 30, 2026


include_mates, encoder_type, and geometry_cache_name are no longer top-level keys in the saved Hydra config (data.include_mates, model.encoder_type, data.geometry_cache_name). Using config.get(...) here will ignore the training-time settings and can load the wrong dataset/model. Please read from the nested keys (and optionally support legacy configs).

Comment thread scripts/inference.py
Comment on lines 313 to 317
     dataset = ProteinWaterDataset(
-        pdb_list_file=args.pdb_list,
-        processed_dir=args.processed_dir,
-        base_pdb_dir=args.base_pdb_dir,
+        pdb_list_file=cfg.pdb_list,
+        processed_dir=cfg.processed_dir,
+        base_pdb_dir=cfg.base_pdb_dir,
         encoder_type=encoder_type,

Copilot AI Mar 30, 2026


processed_dir and base_pdb_dir are passed from cfg but default to null in configs/inference.yaml, which will cause a runtime failure when constructing ProteinWaterDataset. Either make them required in the Hydra config (use ???) or default them from the loaded training config (config['data']['processed_dir'], config['data']['base_pdb_dir']).

Comment thread README.md
Comment on lines 203 to 207
```bash
 uv run python -m scripts.train \
-    --train_list splits/train_list_0.95.txt \
-    --val_list splits/valid_list_0.05.txt \
-    --encoder_type gvp \
-    --batch_size 4
+    train_list=splits/train_list_0.95.txt \
+    val_list=splits/valid_list_0.05.txt
```

Copilot AI Mar 30, 2026


Training examples omit required data paths. With data.processed_dir and data.base_pdb_dir defaulting to null (and base_pdb_dir now required by the dataset), these commands will fail unless the user overrides those keys. Please update the example invocations to include data.processed_dir=... and data.base_pdb_dir=... (or explain where they come from).

Comment thread README.md
| `training.optimizer.grad_accum_steps` | `1` | Gradient accumulation steps |
| `training.epochs` | `200` | Number of training epochs |
| `training.optimizer.lr` | `1e-3` | Learning rate |
| `training.scheduler.type` | `cosine` | LR scheduler: `cosine`, `step`, `none` |

Copilot AI Mar 30, 2026


The training options table documents data.processed_dir but not data.base_pdb_dir, which is now required for dataset creation. Add it (and note whether it’s required) to prevent users from running into a runtime ValueError without guidance.

Suggested change

 | `training.scheduler.type` | `cosine` | LR scheduler: `cosine`, `step`, `none` |
+| `data.base_pdb_dir` | required | Base directory containing PDB/mmCIF files for dataset creation |

Collaborator


Definitely fix.

Comment thread scripts/inference.py
Comment on lines 282 to 286
@@ -423,35 +285,35 @@
model = build_model_from_config(config, device)


Copilot AI Mar 30, 2026


load_config() now reads the structured Hydra config.yaml, but build_model_from_config() still expects the legacy flat schema (e.g., encoder_type, hidden_s at the top level). This call will therefore rebuild the wrong architecture (often defaulting to GVP) for runs trained with cfg.model.*. Update the inference-side model builder (or flatten/map the loaded config) so it matches the new training config format.

Comment thread scripts/inference.py
Comment on lines 309 to 313

# Extract dataset filter config from training config for consistency
filter_config = _extract_dataset_filter_config(config)

dataset = ProteinWaterDataset(

Copilot AI Mar 30, 2026


filter_config = _extract_dataset_filter_config(config) uses the training run config, but the helper currently only reads flat keys. Since config.yaml now nests these under data.quality / data.water_filter, inference will silently fall back to defaults and may preprocess/validate different data than training. Update the extraction logic to consume the structured config (or support both formats).

Comment thread scripts/train.py
Comment on lines +66 to 76
 def _extract_water_filter_config(cfg: DictConfig) -> dict:
     """Extract per-water filtering parameters (toggleable)."""
     return {
-        "max_protein_dist": args.max_protein_dist,
-        "min_edia": args.min_edia,
-        "max_bfactor_zscore": args.max_bfactor_zscore,
-        "filter_by_distance": args.filter_by_distance,
-        "filter_by_edia": args.filter_by_edia,
-        "filter_by_bfactor": args.filter_by_bfactor,
+        "edia_dir": cfg.data.edia_dir,
+        "max_protein_dist": cfg.data.water_filter.max_protein_dist,
+        "min_edia": cfg.data.water_filter.min_edia,
+        "max_bfactor_zscore": cfg.data.water_filter.max_bfactor_zscore,
+        "filter_by_distance": cfg.data.water_filter.filter_by_distance,
+        "filter_by_edia": cfg.data.water_filter.filter_by_edia,
+        "filter_by_bfactor": cfg.data.water_filter.filter_by_bfactor,
     }

Copilot AI Mar 30, 2026


_extract_water_filter_config includes edia_dir, but ProteinWaterDataset.__init__ (src/dataset.py) does not accept an edia_dir kwarg. Because water_filter_kwargs is merged into dataset_kwargs and passed through get_dataloader(**dataset_kwargs), training will raise TypeError: __init__() got an unexpected keyword argument 'edia_dir'. Either remove edia_dir from dataset_kwargs or add explicit edia_dir support in the dataset constructor/logic.
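
If explicit `edia_dir` support is not added to the constructor, one defensive option is to filter the kwargs by the constructor's signature before splatting. `DatasetStub` below is a stand-in for `ProteinWaterDataset`; only the filtering idea is the point:

```python
# Drop kwargs the target constructor does not accept, so config-derived dicts
# can be splatted without a TypeError for keys like edia_dir.
import inspect

class DatasetStub:
    def __init__(self, processed_dir, max_protein_dist=5.0, min_edia=0.4):
        self.processed_dir = processed_dir
        self.max_protein_dist = max_protein_dist
        self.min_edia = min_edia

def accepted_kwargs(cls, kwargs: dict) -> dict:
    params = set(inspect.signature(cls.__init__).parameters) - {"self"}
    return {k: v for k, v in kwargs.items() if k in params}

raw = {"processed_dir": "cache/", "edia_dir": "edia/", "min_edia": 0.5}
ds = DatasetStub(**accepted_kwargs(DatasetStub, raw))  # no TypeError for edia_dir
print(ds.processed_dir, ds.min_edia)
```

The cleaner fix is still to stop putting `edia_dir` into `dataset_kwargs` (or to teach the constructor about it); this sketch only prevents the crash while silently ignoring unknown keys.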

Collaborator


Have you run a test training run to make sure this works?

@marcuscollins (Collaborator) left a comment


Mainly I want to make sure this code actually runs. From the Copilot comments it seems like it won't, but the fact that you've been training models suggests it will run fine. There's a third, worse possibility: that it will run, but with opaque defaults. So more config tests should be included, to make sure the model is created the way you expect.

Comment thread configs/data/default.yaml
Comment on lines +3 to +4
processed_dir: null
base_pdb_dir: null
Collaborator


@vratins I would favor validation in train.py over requiring them here. One thing I don't like about Hydra is that it easily becomes opaque, so that you don't know, and can't easily figure out, what the required arguments need to be. I don't know of a good way to document options when using Hydra configs. One way is to validate all the parameters up front in your scripts and provide useful error messages, although frankly that's pretty tedious.
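A minimal sketch of that up-front validation, using only stdlib dicts (the field paths are assumptions taken from `configs/data/default.yaml`; in the real script the config would be a loaded `DictConfig`):

```python
# Sketch: fail fast with a readable message when a required config field
# was left null. Field paths are assumptions from configs/data/default.yaml.
REQUIRED = ["data.processed_dir", "data.base_pdb_dir"]

def select(cfg: dict, dotted: str):
    """Walk a dotted path; return None if any segment is missing."""
    node = cfg
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def validate_required(cfg: dict) -> None:
    missing = [k for k in REQUIRED if select(cfg, k) is None]
    if missing:
        raise ValueError(
            "Missing required config values: " + ", ".join(missing)
            + ". Set them on the CLI, e.g. data.base_pdb_dir=/path/to/pdbs"
        )

cfg = {"data": {"processed_dir": "/tmp/proc", "base_pdb_dir": None}}
try:
    validate_required(cfg)
except ValueError as e:
    print(e)  # message names data.base_pdb_dir
```

Calling this once at the top of `main()` keeps the error message actionable without scattering checks through the script.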

Comment thread configs/inference.yaml

# Inference settings
inference:
  include_mates: false
Collaborator


Shouldn't this (and similar options) depend on how the model was trained? Perhaps you need to inherit from the training config?
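One way to implement that inheritance is to start from the training run's saved config and layer inference-only overrides on top. A plain deep-merge stands in here for `OmegaConf.merge` (which does the same for `DictConfig`s); the keys are assumptions:

```python
# Hedged sketch: inference config inherits from the training run's config,
# so model-dependent flags like include_mates follow the checkpoint.
# A plain deep-merge stands in for OmegaConf.merge; keys are assumptions.
def deep_merge(base: dict, override: dict) -> dict:
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)  # recurse into nested sections
        else:
            out[k] = v                      # override wins at the leaf
    return out

train_cfg = {"inference": {"include_mates": True}, "model": {"encoder_type": "egnn"}}
overrides = {"inference": {"num_steps": 100}}
cfg = deep_merge(train_cfg, overrides)
print(cfg["inference"])  # include_mates kept from training, num_steps added
```

Later configs win, so inference-time settings can still be changed deliberately while model-dependent options default to what training used.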

Comment thread scripts/inference.py
-    --num_steps 100 \
-    --save_gifs
+python scripts/inference.py run_dir=... pdb_list=... output_dir=...
+python scripts/inference.py run_dir=... pdb_list=... output_dir=... integration.method=euler
Collaborator


yeah, definitely need to fix this.

Comment thread scripts/inference.py
Comment on lines +56 to +66
 config_path = run_dir / "config.yaml"
 if not config_path.exists():
     raise FileNotFoundError(f"Config file not found: {config_path}")

-with open(config_path, "r") as f:
-    config = json.load(f)
+from typing import cast
+
+from omegaconf import OmegaConf
+
+config = cast(
+    dict, OmegaConf.to_container(OmegaConf.load(config_path), resolve=True)
+)
Collaborator


I think this is going to mean that you need to go through the rest of the code and make sure wherever these options are used, you are looking at the correct part of the configuration. I haven't looked at the unit tests yet, but these should thoroughly look at the config and make sure it is loaded and used correctly.

Comment thread scripts/inference.py
 (output_dir / "plots").mkdir(exist_ok=True)
-if args.save_gifs:
+if cfg.inference.visualization.save_gifs:
     (output_dir / "gifs").mkdir(exist_ok=True)
Collaborator


Why gifs?

Comment thread scripts/train.py

-log_encoder_sample_stats(sample_data, args.encoder_type)
+log_encoder_sample_stats(sample_data, cfg.model.encoder_type)
 encoder_config = resolve_encoder_config(
Collaborator


Is this what is needed for inference? Should you log it and/or save it in the config file you save above, and to wandb?

Comment thread scripts/train.py
)

-model = build_model(args, device, encoder_config=encoder_config)
+model = build_model(cfg, device, encoder_config=encoder_config)
Collaborator


Isn't the encoder_config derived just from cfg and maybe a couple of other parameters? Could you just create the encoder config inside build_model?
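A sketch of that refactor, with plain dicts standing in for `DictConfig` and the model (the `build_model` shape and field names here are assumptions, not the project's actual API):

```python
# Hypothetical sketch: derive encoder_config inside build_model so callers
# pass only cfg. Field names are assumptions; a dict stands in for the model.
def build_model(cfg: dict, device: str = "cpu") -> dict:
    model_cfg = cfg["model"]
    encoder_config = {
        "encoder_type": model_cfg["encoder_type"],
        "hidden_v": model_cfg.get("hidden_v", 64),
        "freeze_encoder": model_cfg.get("freeze_encoder", False),
    }
    return {"device": device, "encoder_config": encoder_config}

cfg = {"model": {"encoder_type": "egnn", "hidden_v": 64}}
model = build_model(cfg)
print(model["encoder_config"]["encoder_type"])  # egnn
```

Keeping the derivation in one place also makes it easier to log or save the resolved encoder config alongside the run.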

"hidden_v": 64,
"freeze_encoder": False,
"encoder_ckpt": None,
"embedding_dim": None,
Collaborator


Do you not need the rest of the parameters in here?


import pytest
import torch
from omegaconf import OmegaConf
Collaborator


I strongly suggest doing a lot more testing here, in particular testing the pattern of defaults and making sure things change the way you expect. As the AI reviewers above noted, it's easy in a config system that has a hierarchy of defaults to specify and/or read things incorrectly. I would make sure the values you expect come through the way you expect.

Comment thread README.md
| `training.optimizer.grad_accum_steps` | `1` | Gradient accumulation steps |
| `training.epochs` | `200` | Number of training epochs |
| `training.optimizer.lr` | `1e-3` | Learning rate |
| `training.scheduler.type` | `cosine` | LR scheduler: `cosine`, `step`, `none` |
Collaborator


Definitely fix.



Successfully merging this pull request may close these issues.

- Rename `<atom_type>_x` to `<atom_type>_feat` for clarity
- Configuration files for training runs
- Remove hard coded paths everywhere in the codebase

3 participants