Conversation
Pull request overview
This PR addresses issue #50 by aligning scripts/train.py and scripts/inference.py command-line flags and logging configuration so the two entrypoints use consistent naming and offer the same logging controls.
Changes:
- Renamed `--rk4_steps` to `--num_steps` in `scripts/train.py` and updated the corresponding usage sites.
- Updated `scripts/inference.py` CLI flags to match training (`--geometry_cache_name`, `--use_self_cond`) and added `--log_level`/`--log_file`, removing the early top-level logging setup in favor of configuring logging in `main()`.
- Added several `# ty: ignore[...]` suppressions around model/dataset construction and metric handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/train.py | Renames the integration-steps flag and updates eval sampling usage; adds ty ignore suppressions around model construction and training loop typing. |
| scripts/inference.py | Renames CLI flags for consistency, adds logging CLI options and moves logging setup into main(); adds ty ignore suppressions and updates self-conditioning plumbing. |
Comments suppressed due to low confidence (1)
scripts/inference.py:604
- The metrics.json config block now stores values under keys like "use_sc" and "geometry_cache", while the CLI flags are "--use_self_cond" and "--geometry_cache_name". This makes the output config inconsistent with the CLI and can confuse downstream consumers. Consider renaming these keys (or writing both old+new keys for backward compatibility) to match the new flag names.
```python
    "checkpoint": args.checkpoint,
    "method": args.method,
    "num_steps": args.num_steps,
    "use_sc": args.use_self_cond,
    "threshold": args.threshold,
    "include_mates": include_mates,
    "water_ratio": args.water_ratio,
    "geometry_cache": geometry_cache_name,
},
```
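One way to address the key mismatch flagged above is to emit both the legacy and the new key names during a transition period. The helper name below is hypothetical, not part of the codebase:

```python
def with_compat_keys(cfg: dict, renames: dict) -> dict:
    """Return a copy of `cfg` that also carries legacy key names.

    `renames` maps each new key to its legacy spelling, so older
    consumers of metrics.json still find the values they expect.
    """
    out = dict(cfg)
    for new_key, legacy_key in renames.items():
        if new_key in out:
            out[legacy_key] = out[new_key]
    return out
```

Writing both keys keeps old downstream consumers working while new tooling migrates to the flag-aligned names.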
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration: defaults | Review profile: CHILL | Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

CLI flags renamed and logging initialization moved into inference.py's main; added log options; type annotation tightened; geometry cache resolution and metric keys updated; train.py RK4 flag renamed and metrics accumulation/conditioning logic hardened.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Actionable comments posted: 4
🧹 Nitpick comments (2)
src/utils.py (1)
187-193: Use the shared RBF defaults here too.

This helper still hardcodes `16` and `8.0`, so callers that rely on its defaults will drift the next time `NUM_RBF` or `RBF_CUTOFF` changes.

♻️ Suggested change:

```diff
 def compute_edge_features(
     pos: Tensor,
     edge_index: Tensor,
     pos_dst: Tensor | None = None,
-    num_gaussians: int = 16,
-    cutoff: float = 8.0,
+    num_gaussians: int = NUM_RBF,
+    cutoff: float = RBF_CUTOFF,
     clamp_min: float = 1e-5,
 ) -> tuple[Tensor, Tensor]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils.py` around lines 187-193, compute_edge_features currently hardcodes num_gaussians=16 and cutoff=8.0; change its signature to use the shared defaults NUM_RBF and RBF_CUTOFF (e.g. num_gaussians: int = NUM_RBF, cutoff: float = RBF_CUTOFF) and add the necessary import of NUM_RBF and RBF_CUTOFF into this module so callers follow the global RBF defaults; keep clamp_min as-is.

src/dataset.py (1)

1115-1127: Consider referencing the NUM_RBF constant in the docstring.

The docstring hardcodes `edge_rbf: (E, 16)`, but this dimension is defined by the `NUM_RBF` constant. Consider documenting it as `edge_rbf: (E, NUM_RBF)` for consistency, especially if this value might change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/dataset.py` around lines 1115 - 1127, The docstring for Dataset.__getitem__ documents edge_rbf as "(E, 16)" but that dimension is defined by the NUM_RBF constant; update the docstring to reference NUM_RBF (e.g., "edge_rbf: (E, NUM_RBF)") so the HeteroData description stays consistent with the actual feature size used by the code (refer to __getitem__, NUM_RBF, and the 'edge_rbf' field).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/inference.py`:
- Around line 428-431: The current logic for determining geometry_cache_name
only checks args.geometry_cache_name and config.get("geometry_cache_name"),
which ignores older saved runs using the legacy "geometry_cache" key; update the
resolution in scripts/inference.py so that geometry_cache_name =
args.geometry_cache_name or config.get("geometry_cache_name") or
config.get("geometry_cache") or "geometry" (i.e., fall back to the legacy
"geometry_cache" key before defaulting to "geometry"), and ensure any references
to geometry_cache_name (e.g., where it's used to locate cache directories)
continue to use the resolved value.
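The fallback chain described in that prompt can be sketched as a small helper. The function name is illustrative; the script would resolve this inline:

```python
def resolve_geometry_cache_name(cli_value, config: dict) -> str:
    """Resolve the cache name: the CLI flag wins, then the new config key,
    then the legacy 'geometry_cache' key from older saved runs, then the
    default 'geometry'."""
    return (
        cli_value
        or config.get("geometry_cache_name")
        or config.get("geometry_cache")  # legacy key for backward compatibility
        or "geometry"
    )
```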
In `@scripts/train.py`:
- Around line 666-670: The call to flow_matcher.training_step uses a
now-mismatched signature: FlowMatcher.training_step(batch, optimizer,
grad_clip=1.0, use_self_conditioning=True). Remove the unsupported
accumulation_steps kwarg and pass the optimizer instance (e.g., the same
optimizer used for training) as the second positional argument; keep or pass
grad_clip and use_self_conditioning explicitly if needed (use args.use_self_cond
and args.grad_clip). If you still need gradient accumulation, implement
accumulation outside training_step by looping/accumulating gradients across
args.grad_accum_steps and only calling optimizer.step()/optimizer.zero_grad()
after that loop rather than relying on an accumulation_steps parameter.
In `@src/dataset.py`:
- Around line 662-669: Move the encoder_type validation to run before any
preprocessing to avoid wasted work: in src/dataset.py, check self.encoder_type
against {"gvp","slae","esm"} and raise the ValueError prior to calling
self._preprocess_all(); specifically, reorder the block so the validation of
encoder_type occurs before the call to self._preprocess_all(), keeping the same
error message and allowed set.
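The fail-fast ordering requested there, validate before preprocessing, looks like this in miniature (the function and names are illustrative, not the dataset's actual code):

```python
ALLOWED_ENCODERS = {"gvp", "slae", "esm"}

def init_dataset(encoder_type: str, preprocess_all) -> None:
    # Validate first so a bad config fails before any expensive work runs.
    if encoder_type not in ALLOWED_ENCODERS:
        raise ValueError(
            f"encoder_type must be one of {sorted(ALLOWED_ENCODERS)}, "
            f"got {encoder_type!r}"
        )
    preprocess_all()  # only reached with a valid configuration
```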
In `@tests/test_dataset.py`:
- Around line 49-88: The fixtures pdb_base_dir, pdb_6eey, pdb_2b5w, pdb_8dzt,
and pdb_1deu in tests/test_dataset.py shadow the portable fixtures defined in
tests/conftest.py (which use ENV_PDB_DIR / tests/test_files); remove these
duplicate fixture definitions from tests/test_dataset.py so the module uses the
conftest-provided fixtures instead (or if special behavior is required, rename
them to avoid shadowing and/or delegate to the conftest fixtures inside the new
fixture implementations such as calling the conftest pdb_base_dir fixture).
---
Nitpick comments:
In `@src/dataset.py`:
- Around line 1115-1127: The docstring for Dataset.__getitem__ documents
edge_rbf as "(E, 16)" but that dimension is defined by the NUM_RBF constant;
update the docstring to reference NUM_RBF (e.g., "edge_rbf: (E, NUM_RBF)") so
the HeteroData description stays consistent with the actual feature size used by
the code (refer to __getitem__, NUM_RBF, and the 'edge_rbf' field).
In `@src/utils.py`:
- Around line 187-193: compute_edge_features currently hardcodes
num_gaussians=16 and cutoff=8.0; change its signature to use the shared defaults
NUM_RBF and RBF_CUTOFF (e.g. num_gaussians: int = NUM_RBF, cutoff: float =
RBF_CUTOFF) and add the necessary import of NUM_RBF and RBF_CUTOFF into this
module so callers follow the global RBF defaults; keep clamp_min as-is.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b8c4a5d4-291a-4ea1-953b-8adacf86c6bc
📒 Files selected for processing (10)
- .github/workflows/build.yml
- scripts/inference.py
- scripts/train.py
- src/constants.py
- src/dataset.py
- src/utils.py
- tests/conftest.py
- tests/test_dataset.py
- tests/test_embedding_generation.py
- tests/test_utils.py
💤 Files with no reviewable changes (1)
- tests/test_embedding_generation.py
```diff
 metrics = flow_matcher.training_step(
     batch,
     use_self_conditioning=args.use_self_cond,
-    accumulation_steps=args.grad_accum_steps,
+    accumulation_steps=args.grad_accum_steps,  # ty: ignore[unknown-argument]
 )
```
This call no longer matches FlowMatcher.training_step().
The current signature in src/flow.py (Lines 418-424) is training_step(batch, optimizer, grad_clip=1.0, use_self_conditioning=True). Line 669 adds an unsupported accumulation_steps keyword, and the required optimizer argument is no longer passed, so training will raise on the first batch instead of entering the accumulation loop.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/train.py` around lines 666 - 670, The call to
flow_matcher.training_step uses a now-mismatched signature:
FlowMatcher.training_step(batch, optimizer, grad_clip=1.0,
use_self_conditioning=True). Remove the unsupported accumulation_steps kwarg and
pass the optimizer instance (e.g., the same optimizer used for training) as the
second positional argument; keep or pass grad_clip and use_self_conditioning
explicitly if needed (use args.use_self_cond and args.grad_clip). If you still
need gradient accumulation, implement accumulation outside training_step by
looping/accumulating gradients across args.grad_accum_steps and only calling
optimizer.step()/optimizer.zero_grad() after that loop rather than relying on an
accumulation_steps parameter.
@vratins please check whether this is true, and if so fix it.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```diff
 )
 p.add_argument(
-    "--use_sc",
+    "--use_self_cond",
```
Are these new names reflected in the Dockerfile?
The Dockerfile only exposes the input and output arguments; everything else is passed in as arguments by the user (and hydra configs soon).
```diff
 p.add_argument("--eval_every", type=int, default=5)
 p.add_argument("--n_eval_samples", type=int, default=3)
-p.add_argument("--rk4_steps", type=int, default=100)
+p.add_argument("--num_steps", type=int, default=100)
```
@vratins please add an issue to add help strings to each argument.
Actually, is this worth an issue, given that we will be switching to hydra configs as in #70?
Well, probably not the same issue, but the config items need to be documented somewhere.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
scripts/inference.py:107
- Changing the flag name from `--geometry_cache` to `--geometry_cache_name` will break existing inference invocations that still pass the old flag. You already handle backward compatibility for the config key later; consider also accepting `--geometry_cache` as an alias that maps to the same destination (`geometry_cache_name`) and (optionally) marking it deprecated.
```python
p.add_argument(
    "--geometry_cache_name",
    type=str,
    default=None,
    help="Subdirectory name within processed_dir specifying which water coordinate set to use. "
    "Options include 'geometry' (filtered waters meeting quality criteria) or "
    "'geometry_unfiltered' (all crystallographic waters). Overrides the model's config if specified.",
)
```
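The alias approach suggested in the comment above can be implemented with argparse's support for multiple option strings on one argument. This sketch uses a fresh parser rather than the script's actual one:

```python
import argparse

p = argparse.ArgumentParser()
p.add_argument(
    "--geometry_cache_name",
    "--geometry_cache",  # legacy alias kept so old invocations keep working
    dest="geometry_cache_name",
    type=str,
    default=None,
)

# Both spellings populate the same destination attribute.
old_style = p.parse_args(["--geometry_cache", "geometry_unfiltered"])
new_style = p.parse_args(["--geometry_cache_name", "geometry"])
```

A deprecation warning could additionally be emitted when the legacy spelling is seen, but argparse has no built-in flag for that, so it would need a small custom action.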
For issue #50