
Hydra Configs #70

Open
vratins wants to merge 11 commits into main from dev_paths_clarity

Conversation

Contributor

@vratins vratins commented Mar 24, 2026

PR towards:

  • Replace argparse with Hydra; defaults stored in configs
  • Replace all hard-coded paths with relative paths or no default
  • Replace _x with _feat for all nodes (trivial change hence in this PR)

Summary by CodeRabbit

  • Chores

    • Migrated training and inference CLIs to Hydra; scripts now use structured configs and require explicit data directory inputs.
    • Updated runtime deps to include Hydra/OmegaConf and adjusted visualization package.
  • Documentation

    • README overhauled for Hydra usage; CLI flag docs replaced with configuration structure and key mappings.
    • Added comprehensive YAML configs for data, flow, models, logging, training, and inference.
  • Bug Fixes

    • Standardized cached feature naming and updated cache handling.
  • Tests

    • Tests updated to use structured configs (OmegaConf) instead of argparse.

Copilot AI review requested due to automatic review settings March 24, 2026 21:57

coderabbitai Bot commented Mar 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0a2d7a24-2124-4518-88b8-702c38a61a96

📥 Commits

Reviewing files that changed from the base of the PR and between 45bd4ec and d615f1a.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • pyproject.toml
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml

📝 Walkthrough

Walkthrough

This PR migrates CLI argument parsing to Hydra, adds new Hydra configs (train, inference, logging, data, flow, model variants), updates scripts to accept DictConfig (train.py, inference.py), renames cached tensor keys to *_feat, makes base_pdb_dir required, updates docs/tests, and adds hydra/omegaconf deps.

Changes

Cohort / File(s) Summary
Hydra config files
configs/train.yaml, configs/inference.yaml, configs/logging/default.yaml, configs/data/default.yaml, configs/flow/default.yaml, configs/model/esm.yaml, configs/model/gvp.yaml, configs/model/slae.yaml
Add new Hydra YAML configs defining training, inference, logging, data quality/water filters, flow options, and model variants; declare required top-level keys (train_list, val_list, run_dir, pdb_list, output_dir) and nested structured options.
Train & inference scripts
scripts/train.py, scripts/inference.py
Replace argparse with @hydra.main DictConfig entrypoints; change function signatures to accept cfg: DictConfig; load resolved config YAMLs via OmegaConf; rewire scheduler/optimizer/dataloader/W&B/metrics/checkpoint plumbing to use cfg.*.
Dataset & feature naming
src/dataset.py, src/flow.py, tests/test_dataset.py
Rename cached/runtime feature keys from protein_x/water_x/mate_x to protein_feat/water_feat/mate_feat; update cache read/write and graph .x assignments; enforce base_pdb_dir required at init (raises if missing); update tests accordingly.
Embedding CLI changes
scripts/generate_esm_embeddings.py, scripts/generate_slae_embeddings.py
Remove hardcoded --base_pdb_dir defaults and make --base_pdb_dir a required CLI argument.
Tests & test helpers
tests/test_train_config.py, other tests...
Migrate tests from argparse.Namespace to OmegaConf.create(...)/DictConfig, update call sites, and remove argparse-specific tests.
Documentation
README.md
Update README to document Hydra-based usage (override syntax), config structure and keys, checkpoint/save cadence (logging.save_every), cache naming (*_feat), and W&B config keys.
Project dependencies
pyproject.toml
Add hydra-core and omegaconf to project dependencies; replace pymol-open-source with pymol-open-source-whl>=3.1.0.4.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Hydra as Hydra (config)
  participant Train as scripts/train.py
  participant Dataset
  participant Model
  participant WAndB as W&B

  User->>Hydra: supply overrides (e.g., model=esm, training.batch_size=..., run_dir=...)
  Hydra->>Train: resolved cfg passed to main(cfg)
  Train->>Dataset: init dataloaders using cfg.data and cfg.training
  Train->>Model: build model using cfg.model and cfg.flow
  Train->>WAndB: init logging using cfg.logging.wandb
  Train->>Model: run training loop per cfg.training
  Model-->>Train: emit metrics & checkpoints
  Train->>WAndB: log metrics & artifacts per cfg.logging.save_every
  Train->>Hydra: write resolved cfg to run_dir/config.yaml

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • marcuscollins

Poem

🐰
Hydra whispers, YAML leaves unfurl,
Configs hop in—each a tidy pearl,
Features renamed, directories made true,
Training tunes up, checkpoints hum anew,
A carrot-coded cheer for the dev crew! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive. The title 'Hydra Configs' is vague and generic, using non-descriptive terminology that does not clearly convey the scope or significance of the changes. Resolution: consider a more descriptive title such as 'Replace argparse with Hydra configuration system' or 'Migrate to Hydra configs and remove hardcoded paths' to better reflect the actual changes.
✅ Passed checks (2 passed)
Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage: ✅ Passed. Docstring coverage is 80.65%, which is sufficient; the required threshold is 80.00%.



Copilot AI left a comment


Pull request overview

This PR migrates the training/inference entrypoints from argparse to Hydra configs (stored under configs/), removes hard-coded default paths by making key paths user-supplied, and renames cached node feature keys from *_x to *_feat across the dataset/cache layer.

Changes:

  • Replaced argparse-driven train.py / inference.py CLIs with Hydra (configs/train.yaml, configs/inference.yaml, and grouped defaults).
  • Removed hard-coded default base_pdb_dir in dataset + embedding scripts; made it required.
  • Renamed cached feature keys (protein_x/water_x to protein_feat/water_feat) and updated tests/docs accordingly.

Reviewed changes

Copilot reviewed 18 out of 26 changed files in this pull request and generated 7 comments.

Show a summary per file
  • tests/test_train_config.py: Updates tests to pass Hydra-style cfg into resolve_encoder_config; removes argparse-specific tests.
  • tests/test_dataset.py: Updates cache-key expectations to protein_feat / water_feat.
  • src/flow.py: Renames local water feature variable (still assigned to PyG .x).
  • src/dataset.py: Removes hard-coded base_pdb_dir default; enforces base_pdb_dir presence; renames cached keys to *_feat.
  • scripts/train.py: Converts training pipeline to Hydra; writes config.yaml into the run directory.
  • scripts/inference.py: Converts inference pipeline to Hydra; loads config.yaml from training run.
  • scripts/generate_slae_embeddings.py: Makes --base_pdb_dir required (removes hard-coded default).
  • scripts/generate_esm_embeddings.py: Makes --base_pdb_dir required (removes hard-coded default).
  • configs/train.yaml: New Hydra training entrypoint config with grouped defaults.
  • configs/model/slae.yaml: New Hydra model group config for SLAE.
  • configs/model/gvp.yaml: New Hydra model group config for GVP.
  • configs/model/esm.yaml: New Hydra model group config for ESM.
  • configs/logging/default.yaml: New Hydra logging group config (save/checkpoint/W&B settings).
  • configs/inference.yaml: New Hydra inference entrypoint config.
  • configs/flow/default.yaml: New Hydra flow group config.
  • configs/data/default.yaml: New Hydra data group config (paths + quality/water filter settings).
  • README.md: Updates docs for Hydra usage and *_feat cache naming.
  • figures/water_quality/01_ediam_waters.png: Adds figure artifact.
  • figures/water_quality/02_ediam_pdbs.png: Adds figure artifact.
  • figures/water_quality/03_rsccs_waters.png: Adds figure artifact.
  • figures/water_quality/04_rsccs_pdbs.png: Adds figure artifact.
  • figures/water_quality/05_bfactor_waters.png: Adds figure artifact.
  • figures/water_quality/06_bfactor_pdbs.png: Adds figure artifact.
  • figures/water_quality/07_ediam_bfactor_correlation.png: Adds figure artifact.
Comments suppressed due to low confidence (3)

tests/test_train_config.py:151

  • These inference-related tests still construct a flat training_config / config dict, but the training script now writes a nested Hydra config.yaml (data.*, model.*, etc.). Once inference is updated to consume the nested config, these tests should be updated (or duplicated) to cover the Hydra-shaped config so they validate the new behavior rather than the deprecated flat format.
    encoder = build_encoder(encoder_config, device)
    model = FlowWaterGVP(encoder=encoder)

    assert model.encoder.output_dims == (128, 0)


def test_inference_build_model_from_config_uses_embedding_dim(device):
    config = {
        "encoder_type": "slae",
        "hidden_s": 128,
        "hidden_v": 32,
        "flow_layers": 2,
        "node_scalar_in": 16,
        "embedding_dim": 128,
        "k_pw": 8,
        "k_ww": 8,
    }

    model = build_model_from_config(config, device)

    assert model.encoder.output_dims == (128, 0)


def test_inference_extracts_filter_config_from_training_config():
    """Verify inference correctly extracts filter params from training config."""
    from scripts.inference import _extract_dataset_filter_config

    training_config = {
        "min_water_residue_ratio": 0.7,
        "max_protein_dist": 4.5,
        "filter_by_edia": False,
    }

scripts/inference.py:84

  • The inference script still assumes the training config is a flat dict (e.g., config.get('max_com_dist')), but scripts/train.py now writes a Hydra-composed config.yaml with nested keys (e.g., data.quality.max_com_dist, data.water_filter.*). This will silently fall back to defaults and run inference with different filtering than training. Update _extract_dataset_filter_config to read from the nested structure (and optionally keep backward-compat for old flat configs).


def _extract_dataset_filter_config(config: dict) -> dict:
    """Extract dataset filter params from training config with fallback to defaults."""
    return {
        "max_com_dist": config.get("max_com_dist", 25.0),
        "max_clash_fraction": config.get("max_clash_fraction", 0.05),
        "clash_dist": config.get("clash_dist", 2.0),
        "interface_dist_threshold": config.get("interface_dist_threshold", 4.0),
        "min_water_residue_ratio": config.get("min_water_residue_ratio", 0.6),
        "edia_dir": config.get("edia_dir"),
        "max_protein_dist": config.get("max_protein_dist", 5.0),
        "min_edia": config.get("min_edia", 0.4),
        "max_bfactor_zscore": config.get("max_bfactor_zscore", 1.5),
        "filter_by_distance": config.get("filter_by_distance", True),
        "filter_by_edia": config.get("filter_by_edia", True),

scripts/inference.py:295

  • load_config() now loads a Hydra-composed config.yaml, but build_model_from_config() (and later FlowMatcher(p_self_cond=config.get(...))) still expect flat, top-level keys like encoder_type, hidden_s, flow_layers, p_self_cond, etc. With the new nested config layout (model.*, flow.*, ...), inference will build the wrong model and checkpoint loading is likely to fail (e.g., cached encoders when embedding_dim was inferred during training). Update inference to read the nested structure or have training write a resolved/flattened config that matches what inference expects.
    logger.info(f"Using device: {device}")

    # load config and build model
    logger.info(f"Loading model from: {run_dir}")
    config = load_config(run_dir)
    model = build_model_from_config(config, device)

    # load checkpoint
    checkpoint_path = run_dir / "checkpoints" / cfg.inference.checkpoint
    epoch = load_checkpoint(model, checkpoint_path, device)
    logger.info(f"Loaded checkpoint: {checkpoint_path} (epoch {epoch})")

    # Create FlowMatcher
    flow_matcher = FlowMatcher(
        model=model,


Comment thread scripts/inference.py
Comment on lines 312 to 320
filter_config = _extract_dataset_filter_config(config)

dataset = ProteinWaterDataset(
pdb_list_file=args.pdb_list,
processed_dir=args.processed_dir,
base_pdb_dir=args.base_pdb_dir,
pdb_list_file=cfg.pdb_list,
processed_dir=cfg.processed_dir,
base_pdb_dir=cfg.base_pdb_dir,
encoder_type=encoder_type,
include_mates=include_mates,
geometry_cache_name=geometry_cache_name,

Copilot AI Mar 24, 2026


ProteinWaterDataset requires processed_dir (str) and now also requires base_pdb_dir (raises if None), but inference passes cfg.processed_dir / cfg.base_pdb_dir directly. Since configs/inference.yaml sets both to null, running inference without overrides will crash. Consider making these fields required in the Hydra config (???) and/or falling back to the training run’s config.yaml values when they’re not provided in the inference config.

Comment thread scripts/train.py
Comment on lines 661 to 665

logger.info("=" * 60)
logger.info(f"Run name: {args.run_name}")
logger.info(f"Run name: {run_name}")
logger.info(f"Run directory: {run_dir}")
logger.info(f"Log file: {log_file}")

Copilot AI Mar 24, 2026


Training writes config.yaml before computing and logging node_scalar_in and resolved_encoder_config. For cached encoders (SLAE/ESM), embedding_dim is often inferred at runtime; if it isn’t persisted into the saved config, scripts/inference.py can’t reliably reconstruct the model architecture for checkpoint loading. Consider writing a resolved config after resolve_encoder_config() (e.g., update cfg.model.embedding_dim / add node_scalar_in / resolved_encoder_config, or write a separate resolved_config.yaml) so inference can build an exact match for the checkpoint.

Comment thread configs/data/default.yaml
Comment on lines +3 to +4
processed_dir: null
base_pdb_dir: null

Copilot AI Mar 24, 2026


configs/data/default.yaml leaves processed_dir and base_pdb_dir as null, but the training pipeline passes these directly into get_dataloader() / ProteinWaterDataset, where processed_dir is required and base_pdb_dir=None now raises. Mark these as required (???) or add explicit validation early in scripts/train.py so users get a clear Hydra/ValueError message instead of a downstream type/path error.

Suggested change
processed_dir: null
base_pdb_dir: null
processed_dir: ???
base_pdb_dir: ???

Collaborator


@vratins I would favor validation in train.py over requiring them here. One thing I don't like about Hydra is that it easily becomes opaque: you can't easily figure out what a required argument needs to be. I don't know of a good way to document options when using Hydra configs. One approach is to validate all the parameters up front in your scripts and provide useful error messages, although frankly that's pretty tedious.
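A sketch of the up-front validation suggested here (the set of required keys is an assumption based on the walkthrough; adjust to the real config):

```python
# Hypothetical required keys for the data group; taken from the PR walkthrough.
REQUIRED_DATA_KEYS = ("processed_dir", "base_pdb_dir", "train_list", "val_list")

def validate_data_config(data_cfg: dict) -> None:
    """Fail fast with one readable message listing every missing path,
    instead of a downstream Path(None) or type error."""
    missing = [k for k in REQUIRED_DATA_KEYS if data_cfg.get(k) is None]
    if missing:
        hints = " ".join(f"data.{k}=<path>" for k in missing)
        raise ValueError(
            f"Missing required data config values: {', '.join(missing)}. "
            f"Set them on the command line, e.g. {hints}"
        )
```

Collecting all missing keys into one error keeps the tedium on the script side while giving users the full fix in a single run.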

Comment thread README.md

| Config Key | Default | Description |
|------------|---------|-------------|
| `run_dir` | required | Path to training run directory (contains config.json) |

Copilot AI Mar 24, 2026


README still says the training run directory “contains config.json”, but training/inference were switched to config.yaml. Update this row (and any nearby references) to avoid sending users looking for a non-existent file.

Suggested change
| `run_dir` | required | Path to training run directory (contains config.json) |
| `run_dir` | required | Path to training run directory (contains config.yaml) |

Comment thread scripts/inference.py
Comment on lines +300 to +307
logger.info(f"Loading PDBs from: {cfg.pdb_list}")

# Determine include_mates from args or config
include_mates = args.include_mates or config.get("include_mates", False)
# Determine include_mates from cfg or training config
include_mates = cfg.inference.include_mates or config.get("include_mates", False)
encoder_type = config.get("encoder_type", "gvp")

# Use --geometry_cache if provided, otherwise use config's geometry_cache_name
geometry_cache_name = args.geometry_cache or config.get(
# Use geometry_cache if provided, otherwise use config's geometry_cache_name
geometry_cache_name = cfg.inference.geometry_cache or config.get(

Copilot AI Mar 24, 2026


include_mates, encoder_type, and geometry_cache_name are read from config.get(...), but the training config saved by Hydra will store these under nested keys (data.include_mates, model.encoder_type, data.geometry_cache_name). As written, inference will default to include_mates=False, encoder_type='gvp', and geometry_cache_name='geometry' even if training used different values. Update these lookups to use the nested structure (and optionally keep backward compatibility for older flat configs).
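Separately, the `or` fallback in the diff above (cfg.inference.include_mates or config.get(...)) treats an explicit False as unset. A None-sentinel resolution avoids that (a sketch with hypothetical names):

```python
def resolve_override(inference_value, training_config: dict, key: str, default=False):
    """Treat None as 'not set': only then fall back to the training config.
    An explicit False in the inference config therefore still wins."""
    if inference_value is not None:
        return inference_value
    return training_config.get(key, default)
```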

Comment thread configs/inference.yaml
Comment on lines +12 to +14
# Data directories
processed_dir: null
base_pdb_dir: null

Copilot AI Mar 24, 2026


processed_dir and base_pdb_dir default to null, but inference always instantiates ProteinWaterDataset and will fail if these aren’t provided. Consider making them required (???) or documenting/implementing fallback to the training run’s config (e.g., data.processed_dir / data.base_pdb_dir) when omitted.

Suggested change
# Data directories
processed_dir: null
base_pdb_dir: null
# Data directories (must be set via CLI or override)
processed_dir: ???
base_pdb_dir: ???

Comment thread scripts/inference.py
--num_steps 100 \
--save_gifs
python scripts/inference.py run_dir=... pdb_list=... output_dir=...
python scripts/inference.py run_dir=... pdb_list=... output_dir=... integration.method=euler

Copilot AI Mar 24, 2026


The inference usage string shows overrides like integration.method=euler, but in configs/inference.yaml the integration config is nested under inference.integration.*. Using integration.method won’t override anything. Update the usage examples to inference.integration.method=... (and similarly for other nested keys) so CLI invocation matches the actual config structure.

Suggested change
python scripts/inference.py run_dir=... pdb_list=... output_dir=... integration.method=euler
python scripts/inference.py run_dir=... pdb_list=... output_dir=... inference.integration.method=euler

Collaborator


yeah, definitely need to fix this.


coderabbitai Bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/train.py (1)

193-199: ⚠️ Potential issue | 🟡 Minor

Update error message to reflect Hydra config paths.

The error message still references argparse-style --processed_dir syntax, which is inconsistent with the Hydra migration.

📝 Suggested fix
     if field not in sample_data["protein"]:
         raise ValueError(
             f"Selected encoder '{encoder_type}' requires protein.{field}, "
             f"but it is missing from dataset samples. "
-            f"Expected cached embeddings in data['protein'].embedding from "
-            f"--processed_dir/{encoder_type}/<cache_key>.pt."
+            f"Expected cached embeddings in data['protein'].embedding from "
+            f"data.processed_dir/{encoder_type}/<cache_key>.pt."
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/train.py` around lines 193 - 199, The ValueError raised in the block
checking sample_data["protein"] uses argparse-style text (“--processed_dir/...”)
but the project now uses Hydra; update the message in the if block that
references encoder_type and field so it points to the Hydra config key (e.g.,
data.processed_dir) and the expected cached path for embeddings under the
configured processed_dir for the given encoder_type and cache_key, instead of
using the CLI flag syntax. Ensure the new message still includes encoder_type,
the missing field name, and guidance that cached embeddings should be found in
the configured processed_dir path for that encoder (referencing encoder_type and
<cache_key> placeholders).
🧹 Nitpick comments (1)
scripts/train.py (1)

625-636: Consider warning on unknown scheduler type.

If cfg.training.scheduler.type is neither "cosine" nor "step", main_scheduler silently remains None. This could lead to confusion if a user mistypes the scheduler type.

💡 Suggested improvement
     if cfg.training.scheduler.type == "cosine":
         main_scheduler = CosineAnnealingLR(
             optimizer,
             T_max=cfg.training.epochs,
             eta_min=cfg.training.optimizer.lr * cfg.training.scheduler.eta_min_factor,
         )
     elif cfg.training.scheduler.type == "step":
         main_scheduler = StepLR(
             optimizer,
             step_size=cfg.training.scheduler.step_size,
             gamma=cfg.training.scheduler.step_gamma,
         )
+    elif cfg.training.scheduler.type not in (None, "none"):
+        logger.warning(f"Unknown scheduler type '{cfg.training.scheduler.type}', no main scheduler will be used.")

     return warmup_scheduler, main_scheduler
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/train.py` around lines 625 - 636, The scheduler selection block can
leave main_scheduler as None if cfg.training.scheduler.type is neither "cosine"
nor "step"; update the branch that sets main_scheduler (the code using
CosineAnnealingLR and StepLR) to handle unknown values by either raising a clear
ValueError or emitting a warning via the project's logger and falling back to a
sensible default; reference cfg.training.scheduler.type, main_scheduler,
CosineAnnealingLR and StepLR so the change is made in that exact selection block
and ensure the message includes the invalid type value for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@configs/inference.yaml`:
- Around line 13-14: The inference config currently leaves processed_dir and
base_pdb_dir null which causes Path(None) and the runtime guard in
src/dataset.py when main() constructs ProteinWaterDataset; update the code so
these dirs are either required or inherited: either set sensible defaults in
configs/inference.yaml (e.g., point to the training run's output) or change
main() to fetch the training run's processed_dir/base_pdb_dir and pass them into
ProteinWaterDataset when present, and add a clear validation in
ProteinWaterDataset.__init__ (or the factory that creates it) to raise a
descriptive error if processed_dir or base_pdb_dir are still None to prevent
Path(None) from being constructed.

In `@README.md`:
- Around line 334-339: Update the README table entry for the `run_dir` key to
reflect that inference now reads `config.yaml` (not `config.json`): change the
description or default value from "contains config.json" to "contains
config.yaml" and ensure any mention of file name matches `scripts/inference.py`
which loads `run_dir/config.yaml`.

In `@scripts/inference.py`:
- Around line 278-280: The code computes a resolved torch.device in the local
variable device but continues to pass the raw config value
(cfg.inference.hardware.device) into run_inference_batch (and the other
batch/inference call around lines ~382-389); update all calls to
run_inference_batch (and any similar inference entry points) to pass the
resolved device variable instead of the raw config string so the model and
inference both use the same actual device (reference symbols: device,
cfg.inference.hardware.device, run_inference_batch).
- Around line 302-303: The current assignment include_mates =
cfg.inference.include_mates or config.get("include_mates", False) treats False
as unset; change it to explicitly check for None so a False in cfg.inference
overrides training: if cfg.inference.include_mates is not None then use that
value, otherwise fall back to config.get("include_mates", False). Update the
logic around include_mates (references: include_mates,
cfg.inference.include_mates, config.get("include_mates", False)) to implement
this tri-state behavior.

In `@scripts/train.py`:
- Around line 472-485: The block uses variable "step" (from the training loop
over train_loader) but if train_loader is empty "step" will be undefined causing
a NameError; initialize step = -1 (or another safe sentinel) before the loop
that iterates over train_loader (the loop that sets step) so this post-loop
logic using step, optimizer, optimizer_step_count,
cfg.training.optimizer.grad_accum_steps, and warmup_scheduler can run safely
even when train_loader has zero batches; ensure the sentinel value preserves the
intended conditional behavior ((step + 1) %
cfg.training.optimizer.grad_accum_steps) and the warmup_scheduler check that
references optimizer_step_count and cfg.training.scheduler.warmup_steps.

In `@src/dataset.py`:
- Around line 1165-1173: When loading the cached geometry dict (the code that
assigns to protein_feat, protein_res_idx, pp_edge_index, pp_edge_unit_vectors,
pp_edge_rbf, num_asu_protein, num_protein_residues, water_pos, water_feat —
e.g., the loader invoked from _preprocess_all()), add a compatibility shim that
checks for the new keys and, if they are missing, falls back to legacy key names
from older caches; for example, if "protein_feat" is absent but
"protein_features" (or whatever the previous key was) exists, set protein_feat =
cached["protein_features"]; do the same fallback for "water_feat" (e.g.,
"water_features") and any other renamed keys so old geometry*.pt files can still
be read without requiring users to wipe caches. Ensure this mapping happens
before any use of those variables and log a debug/info message when a legacy key
is used.

---

Outside diff comments:
In `@scripts/train.py`:
- Around line 193-199: The ValueError raised in the block checking
sample_data["protein"] uses argparse-style text (“--processed_dir/...”) but the
project now uses Hydra; update the message in the if block that references
encoder_type and field so it points to the Hydra config key (e.g.,
data.processed_dir) and the expected cached path for embeddings under the
configured processed_dir for the given encoder_type and cache_key, instead of
using the CLI flag syntax. Ensure the new message still includes encoder_type,
the missing field name, and guidance that cached embeddings should be found in
the configured processed_dir path for that encoder (referencing encoder_type and
<cache_key> placeholders).

---

Nitpick comments:
In `@scripts/train.py`:
- Around line 625-636: The scheduler selection block can leave main_scheduler as
None if cfg.training.scheduler.type is neither "cosine" nor "step"; update the
branch that sets main_scheduler (the code using CosineAnnealingLR and StepLR) to
handle unknown values by either raising a clear ValueError or emitting a warning
via the project's logger and falling back to a sensible default; reference
cfg.training.scheduler.type, main_scheduler, CosineAnnealingLR and StepLR so the
change is made in that exact selection block and ensure the message includes the
invalid type value for clarity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7a50d22d-5567-4691-93df-d595a2e0e8b4

📥 Commits

Reviewing files that changed from the base of the PR and between 99b3945 and bdd319d.

⛔ Files ignored due to path filters (7)
  • figures/water_quality/01_ediam_waters.png is excluded by !**/*.png
  • figures/water_quality/02_ediam_pdbs.png is excluded by !**/*.png
  • figures/water_quality/03_rsccs_waters.png is excluded by !**/*.png
  • figures/water_quality/04_rsccs_pdbs.png is excluded by !**/*.png
  • figures/water_quality/05_bfactor_waters.png is excluded by !**/*.png
  • figures/water_quality/06_bfactor_pdbs.png is excluded by !**/*.png
  • figures/water_quality/07_ediam_bfactor_correlation.png is excluded by !**/*.png
📒 Files selected for processing (17)
  • README.md
  • configs/data/default.yaml
  • configs/flow/default.yaml
  • configs/inference.yaml
  • configs/logging/default.yaml
  • configs/model/esm.yaml
  • configs/model/gvp.yaml
  • configs/model/slae.yaml
  • configs/train.yaml
  • scripts/generate_esm_embeddings.py
  • scripts/generate_slae_embeddings.py
  • scripts/inference.py
  • scripts/train.py
  • src/dataset.py
  • src/flow.py
  • tests/test_dataset.py
  • tests/test_train_config.py

Comment thread configs/inference.yaml
Comment on lines +13 to +14
processed_dir: null
base_pdb_dir: null


⚠️ Potential issue | 🟠 Major

Make the dataset directories required or inherit them from the training run.

main() passes both values straight into ProteinWaterDataset; processed_dir=null hits Path(None) and base_pdb_dir=null triggers the new runtime guard in src/dataset.py. The documented inference command does not override either field, so this config cannot start as-is.

Safer default if you want callers to provide them explicitly
-processed_dir: null
-base_pdb_dir: null
+processed_dir: ???
+base_pdb_dir: ???
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@configs/inference.yaml` around lines 13 - 14, The inference config currently
leaves processed_dir and base_pdb_dir null which causes Path(None) and the
runtime guard in src/dataset.py when main() constructs ProteinWaterDataset;
update the code so these dirs are either required or inherited: either set
sensible defaults in configs/inference.yaml (e.g., point to the training run's
output) or change main() to fetch the training run's processed_dir/base_pdb_dir
and pass them into ProteinWaterDataset when present, and add a clear validation
in ProteinWaterDataset.__init__ (or the factory that creates it) to raise a
descriptive error if processed_dir or base_pdb_dir are still None to prevent
Path(None) from being constructed.
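
The fail-fast idea from this prompt can be sketched as follows. This is a minimal illustration, not repo code; `require_dir` and the example paths are hypothetical:

```python
# Hypothetical up-front validation: fail fast with a descriptive error instead
# of letting Path(None) blow up deep inside the dataset.
from pathlib import Path
from typing import Optional

def require_dir(name: str, value: Optional[str]) -> Path:
    if value is None:
        raise ValueError(
            f"{name} is required for inference; pass it on the command line, "
            f"e.g. `python scripts/inference.py {name}=/path/to/dir`"
        )
    return Path(value)

print(require_dir("processed_dir", "cache/processed"))
try:
    require_dir("base_pdb_dir", None)
    raised = False
except ValueError:
    raised = True
print(raised)
```

Whether the values come from the inference config or are inherited from the training run, this check runs before any dataset construction, so the error names the missing key instead of surfacing as a `Path(None)` traceback.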

Comment thread README.md
Comment on lines +334 to +339
### Key Inference Options

| Config Key | Default | Description |
|------------|---------|-------------|
| `run_dir` | required | Path to training run directory (contains config.json) |
| `pdb_list` | required | Text file with PDB entries (one per line) |

⚠️ Potential issue | 🟡 Minor

run_dir now points to config.yaml, not config.json.

scripts/inference.py now loads run_dir/config.yaml, so this table entry will send users to the wrong file when they inspect a saved run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 334 - 339, Update the README table entry for the
`run_dir` key to reflect that inference now reads `config.yaml` (not
`config.json`): change the description or default value from "contains
config.json" to "contains config.yaml" and ensure any mention of file name
matches `scripts/inference.py` which loads `run_dir/config.yaml`.

Comment thread scripts/inference.py
Comment on lines +56 to +66
     config_path = run_dir / "config.yaml"
     if not config_path.exists():
         raise FileNotFoundError(f"Config file not found: {config_path}")
 
-    with open(config_path, "r") as f:
-        config = json.load(f)
+    from typing import cast
+
+    from omegaconf import OmegaConf
+
+    config = cast(
+        dict, OmegaConf.to_container(OmegaConf.load(config_path), resolve=True)
+    )

@coderabbitai coderabbitai Bot Mar 24, 2026


⚠️ Potential issue | 🔴 Critical

Normalize the loaded Hydra config before returning it.

config.yaml is now the composed Hydra tree, so fields like encoder_type, hidden_s, flow_layers, include_mates, and the dataset filters live under nested sections (model.*, data.*, etc.). The rest of this module still reads flat keys, so returning the raw container here will silently fall back to defaults or build the wrong architecture for non-default runs.

Collaborator


I think this is going to mean that you need to go through the rest of the code and make sure wherever these options are used, you are looking at the correct part of the configuration. I haven't looked at the unit tests yet, but these should thoroughly look at the config and make sure it is loaded and used correctly.


🧩 Analysis chain

Let me inspect the full scope of flat config accesses and what the trained run's config.yaml structure looks like.


🏁 Script executed:

#!/bin/bash
echo "=== All config.get() / config[] calls in inference.py ==="
rg -n "config\.get|config\[" scripts/inference.py

echo ""
echo "=== How train.py writes config.yaml (OmegaConf save) ==="
rg -n "config\.yaml|OmegaConf\.save|OmegaConf\.to_yaml|resolved_encoder_config" scripts/train.py -A3

echo ""
echo "=== resolved_encoder_config key usage in train.py ==="
rg -n "resolved_encoder_config" scripts/train.py -C3

echo ""
echo "=== test_train_config.py content ==="
cat tests/test_train_config.py

echo ""
echo "=== configs/train.yaml ==="
cat configs/train.yaml

echo ""
echo "=== configs/data/default.yaml ==="
cat configs/data/default.yaml

echo ""
echo "=== configs/model/gvp.yaml ==="
cat configs/model/gvp.yaml

echo ""
echo "=== configs/flow/default.yaml ==="
cat configs/flow/default.yaml

Length of output: 7018


@marcuscollins, here is the full audit of every place the loaded config is consumed and how the tests fail to catch the breakage.


Broken flat-key reads vs. the nested config.yaml structure

The Hydra-composed config written by train.py has this shape:

model:          ← encoder_type, hidden_s, hidden_v, flow_layers, …
data:
  quality:      ← max_com_dist, max_clash_fraction, …
  water_filter: ← filter_by_distance, max_protein_dist, min_edia, …
  include_mates, geometry_cache_name, …
flow:           ← p_self_cond, use_self_cond, …
logging:        ← …
training:       ← …
# flat extras added by train.py before saving:
node_scalar_in: …
resolved_encoder_config: {…}

1. build_model_from_config() — every model param silently wrong

| What the code reads | Where it actually lives | Consequence of miss |
|---|---|---|
| `config.get("encoder_type", "gvp")` | `config["model"]["encoder_type"]` | Always builds a GVP model regardless of what was trained |
| `config.get("hidden_s") or 256` | `config["model"]["hidden_s"]` | Silently uses 256; wrong architecture for any non-default run |
| `config.get("hidden_v") or 64` | `config["model"]["hidden_v"]` | Same |
| `config.get("freeze_encoder", …)` | `config["model"]["freeze_encoder"]` | Encoder always treated as unfrozen |
| `config.get("encoder_ckpt")` | `config["model"]["encoder_ckpt"]` | Encoder checkpoint never passed in |
| `config.get("embedding_dim")` | `config["model"]["embedding_dim"]` | Embedding dim always None → breaks SLAE/ESM |
| `config.get("flow_layers") or 3` | `config["model"]["flow_layers"]` | Always 3 layers regardless of training config |
| `config.get("drop_rate", 0.1)` | `config["model"]["drop_rate"]` | Same |
| `config.get("n_message_gvps", 2)` | `config["model"]["n_message_gvps"]` | Same |
| `config.get("n_update_gvps", 2)` | `config["model"]["n_update_gvps"]` | Same |
| `config.get("k_pw") or 16` | `config["model"]["k_pw"]` | Same |
| `config.get("k_ww") or 16` | `config["model"]["k_ww"]` | Same |

The one saving grace is the resolved_encoder_config branch — if train.py stores that as a top-level flat key, build_model_from_config will take that path and skip most of the broken reads. However the remaining hidden_s, hidden_v, flow_layers, etc. that build FlowWaterGVP are still read flat after that branch.

2. _extract_dataset_filter_config() — all filter thresholds fall back to hardcoded defaults

# inference.py (current)
config.get("max_com_dist", 25.0)          # should be config["data"]["quality"]["max_com_dist"]
config.get("max_clash_fraction", 0.05)    # config["data"]["quality"]["max_clash_fraction"]
config.get("clash_dist", 2.0)             # config["data"]["quality"]["clash_dist"]
config.get("interface_dist_threshold", 4.0)  # config["data"]["quality"]["interface_dist_threshold"]
config.get("min_water_residue_ratio", 0.6)   # config["data"]["quality"]["min_water_residue_ratio"]
config.get("max_protein_dist", 5.0)       # config["data"]["water_filter"]["max_protein_dist"]
config.get("min_edia", 0.4)               # config["data"]["water_filter"]["min_edia"]
config.get("max_bfactor_zscore", 1.5)     # config["data"]["water_filter"]["max_bfactor_zscore"]
config.get("filter_by_distance", True)    # config["data"]["water_filter"]["filter_by_distance"]
config.get("filter_by_edia", True)        # config["data"]["water_filter"]["filter_by_edia"]
config.get("filter_by_bfactor", True)     # config["data"]["water_filter"]["filter_by_bfactor"]

All eleven reads will miss and return hardcoded defaults — so inference will always run with the hardcoded thresholds regardless of what was used during training, breaking the train/inference consistency guarantee.

3. main() — three more broken reads

config.get("p_self_cond", 0.5)         # → config["flow"]["p_self_cond"]
config.get("include_mates", False)     # → config["data"]["include_mates"]
config.get("geometry_cache_name", …)   # → config["data"]["geometry_cache_name"]
config.get("encoder_type", "gvp")      # → config["model"]["encoder_type"]  (used for dataset)

Test gaps

Both existing inference-related tests provide flat dicts directly — they will continue to pass even though the bug exists:

# test_train_config.py — passes flat dict, not the Hydra-written nested structure
def test_inference_build_model_from_config_uses_embedding_dim(device):
    config = {
        "encoder_type": "slae",   # ← flat key, bypasses the real problem
        "hidden_s": 128,
        ...
    }
    model = build_model_from_config(config, device)

def test_inference_extracts_filter_config_from_training_config():
    training_config = {
        "min_water_residue_ratio": 0.7,   # ← flat key, bypasses the real problem
        ...
    }
    extracted = _extract_dataset_filter_config(training_config)

What's missing:

  1. A round-trip test that writes a real Hydra-structured config (as train.py would produce) to a temp file, calls load_config(), then asserts every downstream accessor returns the correct value.
  2. A test for _extract_dataset_filter_config with a nested input ({"data": {"quality": {...}, "water_filter": {...}}}).
  3. A test that build_model_from_config with a nested config produces a model with the correct architecture dimensions.

Recommended fix

The cleanest approach is to normalize the nested dict in load_config() (or in a helper) so that all existing callers keep working without changes:

def _flatten_training_config(config: dict) -> dict:
    """
    Flatten a Hydra-structured training config into the legacy flat layout
    expected by build_model_from_config / _extract_dataset_filter_config.
    """
    flat = {}
    # model params
    flat.update(config.get("model", {}))
    # flow params
    flat.update(config.get("flow", {}))
    # data top-level (include_mates, geometry_cache_name, …)
    data = config.get("data", {})
    flat["include_mates"] = data.get("include_mates", False)
    flat["geometry_cache_name"] = data.get("geometry_cache_name", "geometry")
    # quality thresholds
    flat.update(data.get("quality", {}))
    # water filter
    flat.update(data.get("water_filter", {}))
    # preserve flat extras written by train.py (resolved_encoder_config, node_scalar_in, …)
    for k, v in config.items():
        if k not in {"model", "flow", "data", "logging", "training"}:
            flat[k] = v
    return flat


def load_config(run_dir: Path) -> dict:
    ...
    config = cast(dict, OmegaConf.to_container(OmegaConf.load(config_path), resolve=True))
    return _flatten_training_config(config)   # ← add this line

This also makes the existing tests continue to pass without modification (they already pass flat dicts that look like the output of _flatten_training_config), and the new round-trip test can exercise the full path.
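
As a quick sanity check of the key mapping, a trimmed, dependency-free stand-in for the flattening helper behaves as expected on a dict shaped like the Hydra-written config (a real round-trip test would additionally save and reload through OmegaConf on a temp file):

```python
# Trimmed stand-in for _flatten_training_config, exercised on a dict shaped
# like the saved nested config. Names mirror the sketch above; this is an
# illustration, not the repo's implementation.
def flatten(config: dict) -> dict:
    flat: dict = {}
    flat.update(config.get("model", {}))
    flat.update(config.get("flow", {}))
    data = config.get("data", {})
    flat.update(data.get("quality", {}))
    flat.update(data.get("water_filter", {}))
    flat["include_mates"] = data.get("include_mates", False)
    flat["geometry_cache_name"] = data.get("geometry_cache_name", "geometry")
    for k, v in config.items():  # preserve flat extras (node_scalar_in, ...)
        if k not in {"model", "flow", "data", "logging", "training"}:
            flat[k] = v
    return flat

nested = {
    "model": {"encoder_type": "slae", "hidden_s": 128, "flow_layers": 2},
    "flow": {"p_self_cond": 0.25},
    "data": {
        "include_mates": True,
        "quality": {"max_com_dist": 30.0},
        "water_filter": {"min_edia": 0.5},
    },
    "node_scalar_in": 16,
}
flat = flatten(nested)
print(flat["encoder_type"], flat["min_edia"], flat["include_mates"])
```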

Comment thread scripts/inference.py
Comment on lines +278 to +280
device = torch.device(
cfg.inference.hardware.device if torch.cuda.is_available() else "cpu"
)

⚠️ Potential issue | 🟠 Major

Pass the resolved device into batch inference.

The code falls back to CPU at Lines 278-280, but run_inference_batch() still receives the raw config value. On a CPU-only host that means model setup succeeds on CPU and then integration still tries to run on "cuda".

Suggested fix
         batch_results = run_inference_batch(
             flow_matcher,
             batch_graphs,
             method=method,
             num_steps=num_steps,
             use_sc=use_sc,
-            device=cfg.inference.hardware.device,
+            device=str(device),
             water_ratio=water_ratio,
         )

Also applies to: 382-389

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/inference.py` around lines 278 - 280, The code computes a resolved
torch.device in the local variable device but continues to pass the raw config
value (cfg.inference.hardware.device) into run_inference_batch (and the other
batch/inference call around lines ~382-389); update all calls to
run_inference_batch (and any similar inference entry points) to pass the
resolved device variable instead of the raw config string so the model and
inference both use the same actual device (reference symbols: device,
cfg.inference.hardware.device, run_inference_batch).
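
The resolve-once pattern can be sketched without torch; here `cuda_available` stands in for `torch.cuda.is_available()`, and the resolved string is what would be threaded into every downstream call:

```python
# Resolve the requested device once, then pass the resolved value everywhere,
# instead of letting each consumer re-read the raw config string.
def resolve_device(requested: str, cuda_available: bool) -> str:
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"  # fall back in exactly one place
    return requested

device = resolve_device("cuda", cuda_available=False)
print(device)  # the same value should reach run_inference_batch
```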

Comment thread scripts/inference.py
Comment on lines +302 to +303
# Determine include_mates from cfg or training config
include_mates = cfg.inference.include_mates or config.get("include_mates", False)

⚠️ Potential issue | 🟠 Major

Avoid `or` for the `include_mates` override.

An explicit `inference.include_mates=false` is treated as "unset" here, so a run trained with mates cannot be forced off at inference. This needs a tri-state config (`null` = inherit) or an explicit `is None` check.

Suggested fix
-    include_mates = cfg.inference.include_mates or config.get("include_mates", False)
+    include_mates = (
+        config.get("include_mates", False)
+        if cfg.inference.include_mates is None
+        else cfg.inference.include_mates
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/inference.py` around lines 302 - 303, The current assignment
include_mates = cfg.inference.include_mates or config.get("include_mates",
False) treats False as unset; change it to explicitly check for None so a False
in cfg.inference overrides training: if cfg.inference.include_mates is not None
then use that value, otherwise fall back to config.get("include_mates", False).
Update the logic around include_mates (references: include_mates,
cfg.inference.include_mates, config.get("include_mates", False)) to implement
this tri-state behavior.
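
The tri-state rule in miniature (illustrative names, not repo code): `None` means "inherit from the training config", while an explicit `True`/`False` from the inference config always wins.

```python
# Tri-state override: only None falls through to the training-time value.
from typing import Optional

def resolve_include_mates(override: Optional[bool], training_value: bool) -> bool:
    return training_value if override is None else override

print(resolve_include_mates(None, True))   # inherit
print(resolve_include_mates(False, True))  # explicit off beats training value
```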

Comment thread scripts/train.py
Comment on lines +472 to 485
     if (step + 1) % cfg.training.optimizer.grad_accum_steps != 0:
         if cfg.training.optimizer.grad_clip > 0:
             torch.nn.utils.clip_grad_norm_(
                 [p for p in flow_matcher.model.parameters() if p.requires_grad],
-                max_norm=args.grad_clip,
+                max_norm=cfg.training.optimizer.grad_clip,
             )
         optimizer.step()
         optimizer.zero_grad(set_to_none=True)
         optimizer_step_count += 1
-    if warmup_scheduler is not None and optimizer_step_count <= args.warmup_steps:
+    if (
+        warmup_scheduler is not None
+        and optimizer_step_count <= cfg.training.scheduler.warmup_steps
+    ):
         warmup_scheduler.step()

⚠️ Potential issue | 🟡 Minor

Potential NameError if train_loader is empty.

If train_loader has no batches, the for loop at line 402 never executes, leaving step undefined. This would cause a NameError when accessing step on line 472.

🛡️ Suggested defensive fix
+    # Handle remaining gradients at end of epoch
+    if processed_batches == 0:
+        # No batches processed, skip gradient handling
+        final_global_step = (epoch - 1) * len(train_loader)
+        logger.warning(
+            f"Epoch {epoch}: skipped all {skipped_batches} train batches (no waters)."
+        )
+        return (
+            {"train/epoch_loss": float("inf"), "train/epoch_rmsd": float("inf")},
+            final_global_step,
+            optimizer_step_count,
+        )
+
     # Handle remaining gradients at end of epoch
-    if (step + 1) % cfg.training.optimizer.grad_accum_steps != 0:
+    if processed_batches > 0 and (step + 1) % cfg.training.optimizer.grad_accum_steps != 0:

Alternatively, initialize step = -1 before the loop to avoid the NameError while preserving the current logic flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/train.py` around lines 472 - 485, The block uses variable "step"
(from the training loop over train_loader) but if train_loader is empty "step"
will be undefined causing a NameError; initialize step = -1 (or another safe
sentinel) before the loop that iterates over train_loader (the loop that sets
step) so this post-loop logic using step, optimizer, optimizer_step_count,
cfg.training.optimizer.grad_accum_steps, and warmup_scheduler can run safely
even when train_loader has zero batches; ensure the sentinel value preserves the
intended conditional behavior ((step + 1) %
cfg.training.optimizer.grad_accum_steps) and the warmup_scheduler check that
references optimizer_step_count and cfg.training.scheduler.warmup_steps.
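
The sentinel alternative can be sketched in isolation (illustrative helper, not repo code): initializing `step` before the loop keeps the post-loop flush check well defined even when the loader yields zero batches.

```python
# Sentinel fix: step = -1 before the loop means (step + 1) % n == 0 for an
# empty loader, so nothing is flushed and no NameError is raised.
def leftover_accum_window(num_batches: int, grad_accum_steps: int) -> bool:
    step = -1  # sentinel
    for step in range(num_batches):
        pass  # per-batch forward/backward would go here
    return (step + 1) % grad_accum_steps != 0

print(leftover_accum_window(0, 4))  # empty loader: nothing left to flush
print(leftover_accum_window(5, 4))  # one partial accumulation window remains
```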

Comment thread src/dataset.py

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/dataset.py (1)

665-727: ⚠️ Potential issue | 🟠 Major

Don’t require base_pdb_dir for cache-only datasets.

Line 726 makes base_pdb_dir mandatory even when preprocess=False, but __getitem__() only reads the cached .pt files. That regresses cache-only inference/eval setups that don't have the raw PDB tree mounted. Please gate this validation on preprocessing (or defer pdb_path construction until _preprocess_one()).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/dataset.py` around lines 665 - 727, The constructor currently raises
ValueError("base_pdb_dir is required") unconditionally which breaks cache-only
runs; change the validation to only require base_pdb_dir when preprocessing is
enabled (i.e., if preprocess is True and base_pdb_dir is None, then raise), or
alternatively defer any use/creation of pdb paths until _preprocess_one so
__init__ no longer mandates base_pdb_dir; update references in __init__
(geometry_dir/cache_dir setup) and ensure __getitem__ continues to only read .pt
caches without needing base_pdb_dir.
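
The gating idea looks like this in isolation. `resolve_base_pdb_dir` is a hypothetical helper, not code from `src/dataset.py`:

```python
# Only require base_pdb_dir when preprocessing will actually read raw PDBs;
# cache-only datasets (preprocess=False) may leave it as None.
from pathlib import Path
from typing import Optional

def resolve_base_pdb_dir(
    preprocess: bool, base_pdb_dir: Optional[str]
) -> Optional[Path]:
    if preprocess and base_pdb_dir is None:
        raise ValueError("base_pdb_dir is required when preprocess=True")
    return Path(base_pdb_dir) if base_pdb_dir is not None else None

print(resolve_base_pdb_dir(False, None))        # cache-only: allowed
print(resolve_base_pdb_dir(True, "/data/pdb"))  # preprocessing: path resolved
try:
    resolve_base_pdb_dir(True, None)
    raised = False
except ValueError:
    raised = True
print(raised)
```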
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/dataset.py`:
- Around line 665-727: The constructor currently raises ValueError("base_pdb_dir
is required") unconditionally which breaks cache-only runs; change the
validation to only require base_pdb_dir when preprocessing is enabled (i.e., if
preprocess is True and base_pdb_dir is None, then raise), or alternatively defer
any use/creation of pdb paths until _preprocess_one so __init__ no longer
mandates base_pdb_dir; update references in __init__ (geometry_dir/cache_dir
setup) and ensure __getitem__ continues to only read .pt caches without needing
base_pdb_dir.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 35590ad4-3b65-491c-901c-546573b5b419

📥 Commits

Reviewing files that changed from the base of the PR and between 75d01d1 and 45bd4ec.

📒 Files selected for processing (4)
  • README.md
  • scripts/inference.py
  • src/dataset.py
  • tests/test_dataset.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/test_dataset.py
  • README.md
  • scripts/inference.py


Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 26 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (1)

tests/test_train_config.py:155

  • These tests still exercise inference helpers with a legacy flat training config (e.g., encoder_type, min_water_residue_ratio at the top level). Since scripts/train.py now writes a structured Hydra config, consider updating/adding tests that validate load_config() + build_model_from_config() + _extract_dataset_filter_config() against the nested model.* / data.* schema.
def test_inference_build_model_from_config_uses_embedding_dim(device):
    config = {
        "encoder_type": "slae",
        "hidden_s": 128,
        "hidden_v": 32,
        "flow_layers": 2,
        "node_scalar_in": 16,
        "embedding_dim": 128,
        "k_pw": 8,
        "k_ww": 8,
    }

    model = build_model_from_config(config, device)

    assert model.encoder.output_dims == (128, 0)


def test_inference_extracts_filter_config_from_training_config():
    """Verify inference correctly extracts filter params from training config."""
    from scripts.inference import _extract_dataset_filter_config

    training_config = {
        "min_water_residue_ratio": 0.7,
        "max_protein_dist": 4.5,
        "filter_by_edia": False,
    }

    extracted = _extract_dataset_filter_config(training_config)

    assert extracted["min_water_residue_ratio"] == 0.7


Comment thread scripts/train.py
Comment on lines +668 to +672
# Save resolved Hydra config
config_file = run_dir / "config.yaml"
with open(config_file, "w") as f:
f.write(OmegaConf.to_yaml(cfg))
logger.info(f"Configuration saved to: {config_file}")

Copilot AI Mar 30, 2026


config.yaml is written before derived fields are computed (e.g., node_scalar_in and resolved_encoder_config). Inference relies on these to rebuild the encoder shapes when embedding_dim is left null, so the run directory config may be insufficient to reproduce/load the model. Consider saving an enriched config after encoder_config is resolved (or persist a separate resolved config file) and updating inference to read it.

Suggested change

-    # Save resolved Hydra config
-    config_file = run_dir / "config.yaml"
+    # Save the initial Hydra config; this may not include all runtime-derived fields.
+    config_file = run_dir / "raw_config.yaml"
     with open(config_file, "w") as f:
         f.write(OmegaConf.to_yaml(cfg))
-    logger.info(f"Configuration saved to: {config_file}")
+    logger.info(f"Initial configuration saved to: {config_file}")

Comment thread scripts/train.py
Comment on lines 678 to +681
     train_loader = get_dataloader(
-        pdb_list_file=args.train_list,
-        processed_dir=args.processed_dir,
-        batch_size=args.batch_size,
+        pdb_list_file=cfg.train_list,
+        processed_dir=cfg.data.processed_dir,
+        batch_size=cfg.training.batch_size,

Copilot AI Mar 30, 2026


Hydra changes the working directory by default, so passing relative paths (e.g., splits/train.txt, data.processed_dir) directly into get_dataloader can break when run from a Hydra output dir. Either disable chdir (hydra.job.chdir=false) or convert user paths to absolute (e.g., hydra.utils.to_absolute_path) before opening files/directories.
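
The behavior of `hydra.utils.to_absolute_path` can be illustrated with a stdlib stand-in (the real function resolves against the original working directory recorded by Hydra; `to_absolute_path_like` and the paths here are illustrative):

```python
# Stand-in for hydra.utils.to_absolute_path: relative user paths are resolved
# against the original cwd, not Hydra's per-run output directory, so overrides
# like data.processed_dir=cache/ keep working after Hydra chdirs.
import posixpath

def to_absolute_path_like(path: str, original_cwd: str) -> str:
    if posixpath.isabs(path):
        return path  # absolute paths pass through unchanged
    return posixpath.normpath(posixpath.join(original_cwd, path))

print(to_absolute_path_like("splits/train.txt", "/home/user/project"))
print(to_absolute_path_like("/data/pdb", "/home/user/project"))
```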

Comment thread scripts/inference.py
Comment on lines 292 to 296
# Create FlowMatcher
flow_matcher = FlowMatcher(
model=model,
p_self_cond=config.get("p_self_cond", 0.5),
)

Copilot AI Mar 30, 2026


p_self_cond is now configured under flow.p_self_cond in the saved Hydra config, but inference still reads config.get("p_self_cond"). This will ignore the training setting and can change sampling behavior; read from the structured config instead.

Comment thread scripts/inference.py
Comment on lines +301 to 308
+    # Determine include_mates from cfg or training config
+    include_mates = cfg.inference.include_mates or config.get("include_mates", False)
     encoder_type = config.get("encoder_type", "gvp")
 
-    # Use --geometry_cache if provided, otherwise use config's geometry_cache_name
-    geometry_cache_name = args.geometry_cache or config.get(
+    # Use geometry_cache if provided, otherwise use config's geometry_cache_name
+    geometry_cache_name = cfg.inference.geometry_cache or config.get(
         "geometry_cache_name", "geometry"
     )

Copilot AI Mar 30, 2026


include_mates, encoder_type, and geometry_cache_name are no longer top-level keys in the saved Hydra config (data.include_mates, model.encoder_type, data.geometry_cache_name). Using config.get(...) here will ignore the training-time settings and can load the wrong dataset/model. Please read from the nested keys (and optionally support legacy configs).

Comment thread scripts/inference.py
Comment on lines 313 to 317
     dataset = ProteinWaterDataset(
-        pdb_list_file=args.pdb_list,
-        processed_dir=args.processed_dir,
-        base_pdb_dir=args.base_pdb_dir,
+        pdb_list_file=cfg.pdb_list,
+        processed_dir=cfg.processed_dir,
+        base_pdb_dir=cfg.base_pdb_dir,
         encoder_type=encoder_type,

Copilot AI Mar 30, 2026


processed_dir and base_pdb_dir are passed from cfg but default to null in configs/inference.yaml, which will cause a runtime failure when constructing ProteinWaterDataset. Either make them required in the Hydra config (use ???) or default them from the loaded training config (config['data']['processed_dir'], config['data']['base_pdb_dir']).

Comment thread README.md
Comment on lines 203 to 207
```bash
 uv run python -m scripts.train \
-    --train_list splits/train_list_0.95.txt \
-    --val_list splits/valid_list_0.05.txt \
-    --encoder_type gvp \
-    --batch_size 4
+    train_list=splits/train_list_0.95.txt \
+    val_list=splits/valid_list_0.05.txt
```

Copilot AI Mar 30, 2026


Training examples omit required data paths. With data.processed_dir and data.base_pdb_dir defaulting to null (and base_pdb_dir now required by the dataset), these commands will fail unless the user overrides those keys. Please update the example invocations to include data.processed_dir=... and data.base_pdb_dir=... (or explain where they come from).

Comment thread README.md
| `training.optimizer.grad_accum_steps` | `1` | Gradient accumulation steps |
| `training.epochs` | `200` | Number of training epochs |
| `training.optimizer.lr` | `1e-3` | Learning rate |
| `training.scheduler.type` | `cosine` | LR scheduler: `cosine`, `step`, `none` |

Copilot AI Mar 30, 2026


The training options table documents data.processed_dir but not data.base_pdb_dir, which is now required for dataset creation. Add it (and note whether it’s required) to prevent users from running into a runtime ValueError without guidance.

Suggested change

 | `training.scheduler.type` | `cosine` | LR scheduler: `cosine`, `step`, `none` |
+| `data.base_pdb_dir` | required | Base directory containing PDB/mmCIF files for dataset creation |

Collaborator


Definitely fix.

Comment thread scripts/inference.py
Comment on lines 282 to 286
@@ -423,35 +285,35 @@
model = build_model_from_config(config, device)


Copilot AI Mar 30, 2026


load_config() now reads the structured Hydra config.yaml, but build_model_from_config() still expects the legacy flat schema (e.g., encoder_type, hidden_s at the top level). This call will therefore rebuild the wrong architecture (often defaulting to GVP) for runs trained with cfg.model.*. Update the inference-side model builder (or flatten/map the loaded config) so it matches the new training config format.

Comment thread scripts/inference.py
Comment on lines 309 to 313

# Extract dataset filter config from training config for consistency
filter_config = _extract_dataset_filter_config(config)

dataset = ProteinWaterDataset(

Copilot AI Mar 30, 2026


filter_config = _extract_dataset_filter_config(config) uses the training run config, but the helper currently only reads flat keys. Since config.yaml now nests these under data.quality / data.water_filter, inference will silently fall back to defaults and may preprocess/validate different data than training. Update the extraction logic to consume the structured config (or support both formats).

Comment thread scripts/train.py
Comment on lines +66 to 76
 def _extract_water_filter_config(cfg: DictConfig) -> dict:
     """Extract per-water filtering parameters (toggleable)."""
     return {
-        "max_protein_dist": args.max_protein_dist,
-        "min_edia": args.min_edia,
-        "max_bfactor_zscore": args.max_bfactor_zscore,
-        "filter_by_distance": args.filter_by_distance,
-        "filter_by_edia": args.filter_by_edia,
-        "filter_by_bfactor": args.filter_by_bfactor,
+        "edia_dir": cfg.data.edia_dir,
+        "max_protein_dist": cfg.data.water_filter.max_protein_dist,
+        "min_edia": cfg.data.water_filter.min_edia,
+        "max_bfactor_zscore": cfg.data.water_filter.max_bfactor_zscore,
+        "filter_by_distance": cfg.data.water_filter.filter_by_distance,
+        "filter_by_edia": cfg.data.water_filter.filter_by_edia,
+        "filter_by_bfactor": cfg.data.water_filter.filter_by_bfactor,
     }

Copilot AI Mar 30, 2026


_extract_water_filter_config includes edia_dir, but ProteinWaterDataset.__init__ (src/dataset.py) does not accept an edia_dir kwarg. Because water_filter_kwargs is merged into dataset_kwargs and passed through get_dataloader(**dataset_kwargs), training will raise TypeError: __init__() got an unexpected keyword argument 'edia_dir'. Either remove edia_dir from dataset_kwargs or add explicit edia_dir support in the dataset constructor/logic.
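
If explicit `edia_dir` support is not added to the constructor, one defensive option is to filter the kwargs by the constructor's signature before splatting. `DatasetStub` below is a stand-in for `ProteinWaterDataset`; only the filtering idea is the point:

```python
# Drop kwargs the target constructor does not accept, so config-derived dicts
# can be splatted without a TypeError for keys like edia_dir.
import inspect

class DatasetStub:
    def __init__(self, processed_dir, max_protein_dist=5.0, min_edia=0.4):
        self.processed_dir = processed_dir
        self.max_protein_dist = max_protein_dist
        self.min_edia = min_edia

def accepted_kwargs(cls, kwargs: dict) -> dict:
    params = set(inspect.signature(cls.__init__).parameters) - {"self"}
    return {k: v for k, v in kwargs.items() if k in params}

raw = {"processed_dir": "cache/", "edia_dir": "edia/", "min_edia": 0.5}
ds = DatasetStub(**accepted_kwargs(DatasetStub, raw))  # no TypeError for edia_dir
print(ds.processed_dir, ds.min_edia)
```

The cleaner fix is still to stop putting `edia_dir` into `dataset_kwargs` (or to teach the constructor about it); this sketch only prevents the crash while silently ignoring unknown keys.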

Collaborator


Have you run a test training run to make sure this works?

@marcuscollins (Collaborator) left a comment


Mainly I want to make sure this code actually runs. From the Copilot comments it seems like it won't, but the fact that you've been training models suggests it will run fine. There's a third, worse possibility: that it will run, but with opaque defaults. So more config tests should be included, to make sure the model is created the way you expect.

Comment thread configs/data/default.yaml
Comment on lines +3 to +4
processed_dir: null
base_pdb_dir: null
Collaborator


@vratins I would favor validation in train.py over requiring them here. One thing I don't like about Hydra is that it easily becomes opaque, so that you don't know, and can't easily figure out, what the required arguments need to be. I don't know of a good way to document options when using Hydra configs. One way is to validate all the parameters up front in your scripts and provide useful error messages, although frankly that's pretty tedious.
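A minimal sketch of that up-front validation, using only stdlib dicts (the field paths are assumptions taken from `configs/data/default.yaml`; in the real script the config would be a loaded `DictConfig`):

```python
# Sketch: fail fast with a readable message when a required config field
# was left null. Field paths are assumptions from configs/data/default.yaml.
REQUIRED = ["data.processed_dir", "data.base_pdb_dir"]

def select(cfg: dict, dotted: str):
    """Walk a dotted path; return None if any segment is missing."""
    node = cfg
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def validate_required(cfg: dict) -> None:
    missing = [k for k in REQUIRED if select(cfg, k) is None]
    if missing:
        raise ValueError(
            "Missing required config values: " + ", ".join(missing)
            + ". Set them on the CLI, e.g. data.base_pdb_dir=/path/to/pdbs"
        )

cfg = {"data": {"processed_dir": "/tmp/proc", "base_pdb_dir": None}}
try:
    validate_required(cfg)
except ValueError as e:
    print(e)  # message names data.base_pdb_dir
```

Calling this once at the top of `main()` keeps the error message actionable without scattering checks through the script.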

Comment thread configs/inference.yaml

# Inference settings
inference:
  include_mates: false
Collaborator


Shouldn't this (and similar options) depend on how the model was trained? Perhaps you need to inherit from the training config?
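One way to implement that inheritance is to start from the training run's saved config and layer inference-only overrides on top. A plain deep-merge stands in here for `OmegaConf.merge` (which does the same for `DictConfig`s); the keys are assumptions:

```python
# Hedged sketch: inference config inherits from the training run's config,
# so model-dependent flags like include_mates follow the checkpoint.
# A plain deep-merge stands in for OmegaConf.merge; keys are assumptions.
def deep_merge(base: dict, override: dict) -> dict:
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)  # recurse into nested sections
        else:
            out[k] = v                      # override wins at the leaf
    return out

train_cfg = {"inference": {"include_mates": True}, "model": {"encoder_type": "egnn"}}
overrides = {"inference": {"num_steps": 100}}
cfg = deep_merge(train_cfg, overrides)
print(cfg["inference"])  # include_mates kept from training, num_steps added
```

Later configs win, so inference-time settings can still be changed deliberately while model-dependent options default to what training used.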

Comment thread scripts/inference.py
-    --num_steps 100 \
-    --save_gifs
+python scripts/inference.py run_dir=... pdb_list=... output_dir=...
+python scripts/inference.py run_dir=... pdb_list=... output_dir=... integration.method=euler
Collaborator


yeah, definitely need to fix this.

Comment thread scripts/inference.py
Comment on lines +56 to +66
 config_path = run_dir / "config.yaml"
 if not config_path.exists():
     raise FileNotFoundError(f"Config file not found: {config_path}")

-with open(config_path, "r") as f:
-    config = json.load(f)
+from typing import cast
+
+from omegaconf import OmegaConf
+
+config = cast(
+    dict, OmegaConf.to_container(OmegaConf.load(config_path), resolve=True)
+)
Collaborator


I think this is going to mean that you need to go through the rest of the code and make sure wherever these options are used, you are looking at the correct part of the configuration. I haven't looked at the unit tests yet, but these should thoroughly look at the config and make sure it is loaded and used correctly.

Comment thread scripts/inference.py
 (output_dir / "plots").mkdir(exist_ok=True)
-if args.save_gifs:
+if cfg.inference.visualization.save_gifs:
     (output_dir / "gifs").mkdir(exist_ok=True)
Collaborator


Why gifs?

Comment thread scripts/train.py

-log_encoder_sample_stats(sample_data, args.encoder_type)
+log_encoder_sample_stats(sample_data, cfg.model.encoder_type)
 encoder_config = resolve_encoder_config(
Collaborator


Is this what is needed for inference? Should you log it and/or save it in the config file you save above, and to wandb?

Comment thread scripts/train.py
)

-model = build_model(args, device, encoder_config=encoder_config)
+model = build_model(cfg, device, encoder_config=encoder_config)
Collaborator


Isn't the encoder_config derived just from cfg and maybe a couple of other parameters? Could you just create the encoder config inside build_model?
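A sketch of that refactor, with plain dicts standing in for `DictConfig` and the model (the `build_model` shape and field names here are assumptions, not the project's actual API):

```python
# Hypothetical sketch: derive encoder_config inside build_model so callers
# pass only cfg. Field names are assumptions; a dict stands in for the model.
def build_model(cfg: dict, device: str = "cpu") -> dict:
    model_cfg = cfg["model"]
    encoder_config = {
        "encoder_type": model_cfg["encoder_type"],
        "hidden_v": model_cfg.get("hidden_v", 64),
        "freeze_encoder": model_cfg.get("freeze_encoder", False),
    }
    return {"device": device, "encoder_config": encoder_config}

cfg = {"model": {"encoder_type": "egnn", "hidden_v": 64}}
model = build_model(cfg)
print(model["encoder_config"]["encoder_type"])  # egnn
```

Keeping the derivation in one place also makes it easier to log or save the resolved encoder config alongside the run.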

"hidden_v": 64,
"freeze_encoder": False,
"encoder_ckpt": None,
"embedding_dim": None,
Collaborator


Do you not need the rest of the parameters in here?


import pytest
import torch
from omegaconf import OmegaConf
Collaborator


I strongly suggest doing a lot more testing here, in particular testing the pattern of defaults and making sure things change the way you expect. As the AI reviewers above noted, it's easy in a config system that has a hierarchy of defaults to specify and/or read things incorrectly. I would make sure the values you expect come through the way you expect.

Comment thread README.md
| `training.optimizer.grad_accum_steps` | `1` | Gradient accumulation steps |
| `training.epochs` | `200` | Number of training epochs |
| `training.optimizer.lr` | `1e-3` | Learning rate |
| `training.scheduler.type` | `cosine` | LR scheduler: `cosine`, `step`, `none` |
Collaborator


Definitely fix.



Successfully merging this pull request may close these issues.

- Rename `<atom_type>_x` to `<atom_type>_feat` for clarity
- Configuration files for training runs
- Remove hard coded paths everywhere in the codebase

3 participants