feat: RSCC eval updates #224
Conversation
📝 Walkthrough

This PR extends protein structure evaluation capabilities by introducing new shared utilities for handling alternate conformations (altlocs), refactoring the density computation pipeline, adding two new evaluation scripts for classifying and analyzing altloc regions, parallelizing RSCC grid-search computations, and supporting integration tests. Dependencies include the addition of `joblib`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
actor User
participant Main as rscc_grid_search_script.main()
participant JobLib as joblib.Parallel
participant Group as process_group()
participant Setup as Transformer Setup
participant Trial as Per-Trial Loop
participant Select as Per-Selection Loop
participant Output as Write CSV
User->>Main: invoke with input CSV
Main->>Main: load protein_configs.csv<br/>group by (protein, occ_key)
Main->>JobLib: submit groups in parallel
loop for each (protein, occ_key) group
JobLib->>Group: process_group()
Group->>Setup: load base_xmap<br/>build_density_transformer()
Setup->>Setup: parse reference atoms<br/>initialize group cache
loop for each trial in group
Trial->>Trial: parse/refine alignment
Trial->>Trial: run_density_transformer()<br/>compute full density once
loop for each selection in trial
Select->>Select: extract base/computed<br/>sub-regions (cached)
Select->>Select: compute RSCC
Select->>Trial: append result row<br/>(handle per-selection errors)
end
end
Group->>JobLib: return trial results
end
JobLib->>Main: collect all group results
Main->>Output: write rscc_results.csv
Output->>User: return output path
```
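The fan-out above maps naturally onto joblib. A minimal sketch of the pattern, assuming a hypothetical `process_group` worker and the column names (`protein`, `occ_key`) taken from the diagram, not the script's actual code:

```python
from joblib import Parallel, delayed
import pandas as pd

def process_group(protein: str, occ_key: str, trials: pd.DataFrame) -> list[dict]:
    """Hypothetical stand-in for the per-group worker in the diagram."""
    rows: list[dict] = []
    for _, trial in trials.iterrows():
        # Build the density transformer once per group, then score each trial/selection.
        rows.append({"protein": protein, "occ_key": occ_key, "rscc": float("nan")})
    return rows

configs = pd.read_csv("protein_configs.csv")  # filename taken from the diagram
groups = configs.groupby(["protein", "occ_key"], sort=False)

# One parallel task per (protein, occ_key) group, mirroring the fan-out above.
results = Parallel(n_jobs=8)(
    delayed(process_group)(protein, occ_key, trials)
    for (protein, occ_key), trials in groups
)
flat = [row for group_rows in results for row in group_rows]
pd.DataFrame(flat).to_csv("rscc_results.csv", index=False)
```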
```mermaid
sequenceDiagram
actor User
participant Main as find_max_rmsd_subsegment.main()
participant LoadStruct as Load Structure
participant BuildPair as build_pairwise_altloc_arrays()
participant Window as _find_max_rmsd_window()
participant RMSD as Compute RMSD
participant Output as Write CSV
User->>Main: invoke with input CSV
Main->>Main: load protein_configs.csv
loop for each protein
Main->>LoadStruct: resolve CIF path<br/>load structure with altlocs
LoadStruct->>BuildPair: extract all altloc IDs
BuildPair->>BuildPair: construct filtered atom pairs<br/>for every unordered pair
BuildPair->>Window: return pair_arrays dict
loop for each selection in protein
Window->>Window: parse chain/residue span<br/>enumerate windows
loop for each altloc pair
loop for each window in span
RMSD->>RMSD: check compositional<br/>heterogeneity
alt no heterogeneity
RMSD->>RMSD: compute all-atom RMSD<br/>on window atoms
end
RMSD->>Window: track max RMSD window
end
end
Window->>Main: return best window,<br/>max_rmsd, altloc_pair
end
Main->>Output: aggregate per-protein<br/>selections by semicolon-join
end
Main->>Output: write output CSV<br/>+ optional diagnostic CSV
Output->>User: return output paths
```
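The window scan in this second diagram reduces to a masked-RMSD sweep. A minimal sketch with plain NumPy arrays; the names and the all-atom RMSD formula are illustrative, not the script's code:

```python
import numpy as np

def max_rmsd_window(coords_i, coords_j, res_ids, window_size=3):
    """Scan contiguous residue windows and return (window, rmsd) with the
    highest all-atom RMSD. coords_* are (n_atoms, 3) arrays for two altlocs
    over the same atoms; res_ids gives each atom's residue number."""
    unique_res = np.unique(res_ids)  # sorted residue numbers
    best_window, best_rmsd = None, -1.0
    for start in range(len(unique_res) - window_size + 1):
        window_res = unique_res[start : start + window_size]
        mask = np.isin(res_ids, window_res)
        diff = coords_i[mask] - coords_j[mask]
        rmsd = float(np.sqrt((diff**2).sum(axis=1).mean()))
        if rmsd > best_rmsd:
            best_window, best_rmsd = window_res, rmsd
    return best_window, best_rmsd
```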
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 5
🧹 Nitpick comments (1)
scripts/eval/find_max_rmsd_subsegment.py (1)
38-39: Expand these function docstrings to NumPy style and document side effects.
`_process_structure()` loads CIF data, and `main()` reads/writes CSV files; those side effects should be explicit in the function docs. As per coding guidelines, "Always include NumPy-style docstrings for every function and class" and "ALWAYS annotate complex array shapes and note side effects."
Also applies to: 99-104, 185-185
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/find_max_rmsd_subsegment.py` around lines 38 - 39, Expand the short docstring for _has_compositional_heterogeneity into a full NumPy-style docstring: document Parameters (arr_i: np.ndarray, arr_j: np.ndarray — expected shapes and dtype/contents, mask: np.ndarray[bool] — shape must match arr_i/arr_j), the Returns (bool) and add a Notes/Side effects section stating there are no I/O side effects and the function only inspects arrays; do the same for _process_structure (document parameters, return type, and explicitly state the side effect that it loads CIF data from disk/parses CIF content and any mutation it performs on in-memory structure) and for main (document CLI args, that it reads and writes CSV files, and that it has disk I/O side effects and exit behavior); also update the other function docstrings called out in the review (the functions referenced at the other noted ranges) to NumPy-style including array shapes and explicit side-effect notes.
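For reference, a sketch of the requested docstring shape for `_has_compositional_heterogeneity`; the parameter descriptions are illustrative, not the file's actual text:

```python
import numpy as np

def _has_compositional_heterogeneity(arr_i, arr_j, mask: np.ndarray) -> bool:
    """Check whether two altloc arrays differ in residue composition.

    Parameters
    ----------
    arr_i, arr_j : AtomArray
        Per-altloc arrays with identical atom order and count.
    mask : np.ndarray of bool, shape (n_atoms,)
        Selects the window atoms; must match the length of both arrays.

    Returns
    -------
    bool
        True if the masked residues differ in name between the two arrays.

    Notes
    -----
    Pure function: inspects arrays only, no I/O side effects.
    """
```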
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/find_max_rmsd_subsegment.py`:
- Around line 265-267: The CLI currently accepts non-positive --window-size
values which break the sliding-window logic; add a validation after parsing
(before calling main(args)) that checks args.window_size is an integer > 0 and
rejects otherwise (raise SystemExit or parser.error with a clear message).
Reference the parser variable and the main(args) call in
find_max_rmsd_subsegment.py so the guard runs immediately after
parser.parse_args() and prevents running the sliding-window routine with 0 or
negative sizes.
- Around line 81-88: The current check only ensures arr_i has some atoms for the
window, allowing partial residue coverage to be scored; change the logic to
require complete residue coverage for both altloc arrays before computing RMSD:
compute masks for both arr_i and arr_j (e.g., mask_i = (arr_i.chain_id == chain)
& np.isin(arr_i.res_id, window_res) and mask_j = (arr_j.chain_id == chain) &
np.isin(arr_j.res_id, window_res)), then verify that the set/unique res_id
values present in arr_i[mask_i] and arr_j[mask_j] exactly match window_res (or
that the count of unique res_ids equals len(window_res)); if either side is
incomplete, continue and do not call _has_compositional_heterogeneity or
biotite_rmsd on that window.
In `@scripts/eval/rscc_grid_search_script.py`:
- Line 247: The current exception handler "except (FileNotFoundError, OSError,
ValueError, RuntimeError) as e" can let AttributeError or TypeError raised
during parsing/filtering bubble up and abort the whole RSCC run; update that
except tuple to also include AttributeError and TypeError so trial-level
parsing/attribute/type failures are caught and treated as per-selection errors
(emitting the error row and continuing), i.e., change the handler around "except
(FileNotFoundError, OSError, ValueError, RuntimeError) as e" to "except
(FileNotFoundError, OSError, ValueError, RuntimeError, AttributeError,
TypeError) as e" and ensure the existing per-trial error logging/emission logic
is used for these cases.
In `@src/sampleworks/eval/grid_search_eval_utils.py`:
- Around line 27-39: The code builds Path objects from row["structure"] and
row["structure_pattern"] without guarding against pandas NaN (np.nan) which
truth-tests truthy and can cause TypeError; before constructing Path, normalize
and validate these cells: use pandas.isna(row["structure"]) (or isinstance
check) to treat NaN as missing, coerce non-empty values with
str(row["structure"]).strip() and only call Path(...) when that normalized
string is non-empty, and apply the same normalization/validation for
row["structure_pattern"] (variables p, pattern, and usage of cif_root should
remain the same); if both normalized values are missing raise the intended
ValueError with row.to_dict().
In `@src/sampleworks/utils/density_utils.py`:
- Around line 115-120: The code sets use_cuda_kernels based only on host CUDA
availability which can enable CUDA paths even when callers pass
device=torch.device("cpu"); change the call that constructs
DifferentiableTransformer so use_cuda_kernels is true only when the requested
device is CUDA and CUDA is available (e.g., use_cuda_kernels = (device.type ==
"cuda" and torch.cuda.is_available())); update the instantiation at
DifferentiableTransformer(...) and ensure any downstream code (e.g.,
dilate_atom_centric which calls torch.cuda.synchronize) will only run CUDA paths
when that flag reflects the requested device.
---
Nitpick comments:
In `@scripts/eval/find_max_rmsd_subsegment.py`:
- Around line 38-39: Expand the short docstring for
_has_compositional_heterogeneity into a full NumPy-style docstring: document
Parameters (arr_i: np.ndarray, arr_j: np.ndarray — expected shapes and
dtype/contents, mask: np.ndarray[bool] — shape must match arr_i/arr_j), the
Returns (bool) and add a Notes/Side effects section stating there are no I/O
side effects and the function only inspects arrays; do the same for
_process_structure (document parameters, return type, and explicitly state the
side effect that it loads CIF data from disk/parses CIF content and any mutation
it performs on in-memory structure) and for main (document CLI args, that it
reads and writes CSV files, and that it has disk I/O side effects and exit
behavior); also update the other function docstrings called out in the review
(the functions referenced at the other noted ranges) to NumPy-style including
array shapes and explicit side-effect notes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: eb4437dd-4cc2-47be-b5e4-b7b713559434
📒 Files selected for processing (14)
- .gitignore
- CLAUDE.md
- scripts/eval/classify_altloc_regions.py
- scripts/eval/find_max_rmsd_subsegment.py
- scripts/eval/rscc_grid_search_script.py
- src/sampleworks/eval/grid_search_eval_utils.py
- src/sampleworks/utils/atom_array_utils.py
- src/sampleworks/utils/density_utils.py
- tests/eval/conftest.py
- tests/eval/test_rscc_grid_search_script.py
- tests/resources/1vme/1VME_0.25occA_0.75occB_1.00A.ccp4
- tests/resources/1vme/1VME_0.5occA_0.5occB_1.00A.ccp4
- tests/resources/1vme/1VME_single_001_density_input.cif
- tests/utils/test_density_utils.py
```python
mask = (arr_i.chain_id == chain) & np.isin(arr_i.res_id, window_res)
if mask.sum() == 0:
    continue

if _has_compositional_heterogeneity(arr_i, arr_j, mask):
    continue

rmsd_val = float(biotite_rmsd(arr_i[mask], arr_j[mask]))
```
Require complete residue coverage before scoring a window.
Line 82 only checks for any atoms, so a window with missing residues in either altloc can still be scored and later emitted as the full `best_res[0]`-`best_res[-1]` range. Skip windows unless both masked arrays contain exactly `window_res`.
🐛 Proposed fix

```diff
 mask = (arr_i.chain_id == chain) & np.isin(arr_i.res_id, window_res)
 if mask.sum() == 0:
     continue
+res_ids_i, _ = get_residues(arr_i[mask])
+res_ids_j, _ = get_residues(arr_j[mask])
+if list(map(int, res_ids_i)) != window_res or list(map(int, res_ids_j)) != window_res:
+    continue
+
 if _has_compositional_heterogeneity(arr_i, arr_j, mask):
     continue
 rmsd_val = float(biotite_rmsd(arr_i[mask], arr_j[mask]))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/eval/find_max_rmsd_subsegment.py` around lines 81 - 88, The current
check only ensures arr_i has some atoms for the window, allowing partial residue
coverage to be scored; change the logic to require complete residue coverage for
both altloc arrays before computing RMSD: compute masks for both arr_i and arr_j
(e.g., mask_i = (arr_i.chain_id == chain) & np.isin(arr_i.res_id, window_res)
and mask_j = (arr_j.chain_id == chain) & np.isin(arr_j.res_id, window_res)),
then verify that the set/unique res_id values present in arr_i[mask_i] and
arr_j[mask_j] exactly match window_res (or that the count of unique res_ids
equals len(window_res)); if either side is incomplete, continue and do not call
_has_compositional_heterogeneity or biotite_rmsd on that window.
```python
parser.add_argument("--window-size", type=int, default=3)
args = parser.parse_args()
main(args)
```
Reject non-positive --window-size values.
0 or negative values make the sliding-window calculation degenerate and can silently keep original selections instead of producing a valid max-RMSD subsegment.
🛡️ Proposed fix

```diff
 parser.add_argument("--window-size", type=int, default=3)
 args = parser.parse_args()
+if args.window_size <= 0:
+    parser.error("--window-size must be a positive integer")
 main(args)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
parser.add_argument("--window-size", type=int, default=3)
args = parser.parse_args()
if args.window_size <= 0:
    parser.error("--window-size must be a positive integer")
main(args)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/eval/find_max_rmsd_subsegment.py` around lines 265 - 267, The CLI
currently accepts non-positive --window-size values which break the
sliding-window logic; add a validation after parsing (before calling main(args))
that checks args.window_size is an integer > 0 and rejects otherwise (raise
SystemExit or parser.error with a clear message). Reference the parser variable
and the main(args) call in find_max_rmsd_subsegment.py so the guard runs
immediately after parser.parse_args() and prevents running the sliding-window
routine with 0 or negative sizes.
| f"density shape {computed_xmap.array.shape} does not match base map " | ||
| f"shape {base_xmap.array.shape}" | ||
| ) | ||
| except (FileNotFoundError, OSError, ValueError, RuntimeError) as e: |
Preserve trial-level error isolation for attribute/type failures.
Line 185 raises AttributeError, and parsing/filtering can also raise TypeError; neither is caught here, so one bad trial can still abort the full RSCC run instead of emitting per-selection error rows.
🛡️ Proposed fix

```diff
-        except (FileNotFoundError, OSError, ValueError, RuntimeError) as e:
+        except (
+            FileNotFoundError,
+            OSError,
+            ValueError,
+            RuntimeError,
+            TypeError,
+            AttributeError,
+        ) as e:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
except (
    FileNotFoundError,
    OSError,
    ValueError,
    RuntimeError,
    TypeError,
    AttributeError,
) as e:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/eval/rscc_grid_search_script.py` at line 247, The current exception
handler "except (FileNotFoundError, OSError, ValueError, RuntimeError) as e" can
let AttributeError or TypeError raised during parsing/filtering bubble up and
abort the whole RSCC run; update that except tuple to also include
AttributeError and TypeError so trial-level parsing/attribute/type failures are
caught and treated as per-selection errors (emitting the error row and
continuing), i.e., change the handler around "except (FileNotFoundError,
OSError, ValueError, RuntimeError) as e" to "except (FileNotFoundError, OSError,
ValueError, RuntimeError, AttributeError, TypeError) as e" and ensure the
existing per-trial error logging/emission logic is used for these cases.
```python
transformer = DifferentiableTransformer(
    xmap=xmap_torch,
    scattering_params=scattering_params,
    em=em_mode,
    device=device,
    use_cuda_kernels=torch.cuda.is_available(),
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Inspect how DifferentiableTransformer handles CUDA kernels and device placement.
rg -n -C5 'class\s+DifferentiableTransformer|use_cuda_kernels|cuda' --iglob '*.py'
```

Repository: diff-use/sampleworks
Length of output: 35412

🏁 Script executed:

```bash
# Search for calls to build_density_transformer and device parameter usage
rg -n 'build_density_transformer|def run_density_transformer' src/sampleworks/utils/density_utils.py -A 10
```

Repository: diff-use/sampleworks
Length of output: 2149

🏁 Script executed:

```bash
# Check how device is used in the transformer and if device.type == "cuda" is standard
rg -n 'device.type' src/sampleworks --iglob '*.py' | head -20
```

Repository: diff-use/sampleworks
Length of output: 46

🏁 Script executed:

```bash
# Verify device handling in the DifferentiableTransformer __init__
sed -n '407,475p' src/sampleworks/core/forward_models/xray/real_space_density.py
```

Repository: diff-use/sampleworks
Length of output: 2853

🏁 Script executed:

```bash
# Check what happens when CUDA kernels are called with CPU device
rg -n 'dilate_atom_centric\|_compute_radial_derivatives' src/sampleworks/core/forward_models/xray/real_space_density.py -A 5 | head -40
```

Repository: diff-use/sampleworks
Length of output: 46

🏁 Script executed:

```bash
# Check what happens when CUDA kernels are invoked - look at the forward path
sed -n '560,580p' src/sampleworks/core/forward_models/xray/real_space_density.py
```

Repository: diff-use/sampleworks
Length of output: 906

🏁 Script executed:

```bash
# Check if CPU device is ever passed to build_density_transformer by searching callers
rg -n 'build_density_transformer|device.*cpu|torch.device.*cpu' src/sampleworks --iglob '*.py' -B 2 -A 2 | head -50
```

Repository: diff-use/sampleworks
Length of output: 3595

🏁 Script executed:

```bash
# Look for where atom_reconciler device is used
rg -n 'to\(device\)|\.device' src/sampleworks/utils/density_utils.py -B 2 -A 2
```

Repository: diff-use/sampleworks
Length of output: 1348

🏁 Script executed:

```bash
# Check the compute_density function to see all device paths
sed -n '165,230p' src/sampleworks/utils/density_utils.py
```

Repository: diff-use/sampleworks
Length of output: 2589

🏁 Script executed:

```bash
# Check what error/behavior occurs in dilate_atom_centric when called with CPU tensors
sed -n '280,345p' src/sampleworks/core/forward_models/xray/real_space_density_deps/ops/dilate_points_cuda.py
```

Repository: diff-use/sampleworks
Length of output: 2545

🏁 Script executed:

```bash
# Verify device.type is the correct PyTorch idiom by checking its actual usage in the codebase
rg -n '\.device\.type\s*==' src/sampleworks --iglob '*.py'
```

Repository: diff-use/sampleworks
Length of output: 46

🏁 Script executed:

```bash
# Check if there are tests or explicit CPU device usages that would trigger this issue
rg -n 'device.*cpu\|torch.device.*cpu' src/sampleworks tests --iglob '*.py' | grep -E 'build_density|run_density|compute_density'
```

Repository: diff-use/sampleworks
Length of output: 46

🏁 Script executed:

```bash
# Look at line 338 of dilate_points_cuda more carefully and surrounding context
sed -n '330,345p' src/sampleworks/core/forward_models/xray/real_space_density_deps/ops/dilate_points_cuda.py
```

Repository: diff-use/sampleworks
Length of output: 686
Gate CUDA kernels on the requested device, not just host availability.
Line 120 enables CUDA kernels whenever the host has CUDA, even when callers explicitly pass device=torch.device("cpu"). This causes a runtime error: if use_cuda_kernels=True but the transformer uses CPU, the code at line 568 invokes dilate_atom_centric, which unconditionally calls torch.cuda.synchronize(device) with a CPU device, raising an error. Tie the flag to the requested device instead.
Proposed fix

```diff
     scattering_params=scattering_params,
     em=em_mode,
     device=device,
-    use_cuda_kernels=torch.cuda.is_available(),
+    use_cuda_kernels=device.type == "cuda" and torch.cuda.is_available(),
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
transformer = DifferentiableTransformer(
    xmap=xmap_torch,
    scattering_params=scattering_params,
    em=em_mode,
    device=device,
    use_cuda_kernels=device.type == "cuda" and torch.cuda.is_available(),
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/utils/density_utils.py` around lines 115 - 120, The code sets
use_cuda_kernels based only on host CUDA availability which can enable CUDA
paths even when callers pass device=torch.device("cpu"); change the call that
constructs DifferentiableTransformer so use_cuda_kernels is true only when the
requested device is CUDA and CUDA is available (e.g., use_cuda_kernels =
(device.type == "cuda" and torch.cuda.is_available())); update the instantiation
at DifferentiableTransformer(...) and ensure any downstream code (e.g.,
dilate_atom_centric which calls torch.cuda.synchronize) will only run CUDA paths
when that flag reflects the requested device.
Force-pushed ab2de06 to d2e5ca0
Force-pushed d2e5ca0 to 1b781ea
Force-pushed 4cf1ffc to a657989
@coderabbitai review |
✅ Actions performed

Review triggered.
Actionable comments posted: 1
♻️ Duplicate comments (1)
scripts/eval/rscc_grid_search_script.py (1)
124-124: ⚠️ Potential issue | 🟠 Major: Exception tuples still let `AttributeError`/`TypeError` escape.

Both the setup handler (Line 124) and the per-trial handler (Line 209) catch `(FileNotFoundError, OSError, ValueError, RuntimeError)`, but:

- Line 147 explicitly raises `AttributeError("AtomArray | AtomArrayStack is missing coordinates")`.
- `atom_array.set_annotation(...)` (Line 153) and biotite parsing/attribute access can plausibly raise `AttributeError`/`TypeError` on malformed inputs.

Under the per-trial handler, these uncaught errors currently abort the whole `process_group` call, which, because `joblib.Parallel` re-raises worker exceptions, will terminate the entire RSCC run instead of emitting per-selection `rscc=nan` rows. Apply the same widening to both handlers.

🛡️ Proposed fix

```diff
-    except (FileNotFoundError, OSError, ValueError, RuntimeError) as e:
+    except (
+        FileNotFoundError,
+        OSError,
+        ValueError,
+        RuntimeError,
+        AttributeError,
+        TypeError,
+    ) as e:
         logger.error(f"ERROR setting up group {protein}/{trials[0].altloc_occupancies}: {e}")

-    except (FileNotFoundError, OSError, ValueError, RuntimeError) as e:
+    except (
+        FileNotFoundError,
+        OSError,
+        ValueError,
+        RuntimeError,
+        AttributeError,
+        TypeError,
+    ) as e:
         logger.error(f"ERROR processing trial {trial.trial_dir}: {e}")
```

Also applies to: 209-209
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/rscc_grid_search_script.py` at line 124, The except clauses that currently catch (FileNotFoundError, OSError, ValueError, RuntimeError) should also include AttributeError and TypeError so malformed inputs don't crash the whole run; update the two handlers (the setup handler around the code that may raise AttributeError like the explicit raise and atom_array.set_annotation, and the per-trial handler inside process_group/worker) to catch (FileNotFoundError, OSError, ValueError, RuntimeError, AttributeError, TypeError) as e and preserve the existing fallback behavior (e.g., emit rscc=nan or continue) so errors become per-selection failures rather than terminating the job.
🧹 Nitpick comments (3)
tests/eval/test_rscc_grid_search_script.py (1)
118-118: Unused default argument in `flaky_rscc`.

`_sel=bad_selection` is never referenced inside the function body; the failure is decided purely by the `target_is_next` flag. Drop it to avoid confusion.

♻️ Proposed refactor

```diff
-    def flaky_rscc(a, b, _sel=bad_selection):
+    def flaky_rscc(a, b):
         if flaky_rscc.target_is_next:  # type: ignore[attr-defined]
             flaky_rscc.target_is_next = False  # type: ignore[attr-defined]
             raise RuntimeError("simulated rscc failure")
         return real_rscc(a, b)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/test_rscc_grid_search_script.py` at line 118, The function flaky_rscc currently declares an unused default argument `_sel=bad_selection`; remove this unused parameter from the `flaky_rscc` signature (drop `_sel` and its `bad_selection` default) and update any call sites or test invocations to no longer pass or expect that argument; ensure only `a`, `b`, and the existing `target_is_next` logic remain in `flaky_rscc`.

tests/eval/conftest.py (2)
87-90: Assertion-based validation disappears under `python -O`.

These `assert` calls are the only validation for fixture inputs; they're silently skipped when Python is run with `-O`. Pytest doesn't use `-O` by default, but if anyone invokes it that way, invalid arguments will produce confusing downstream filesystem errors. Prefer explicit `raise ValueError(...)`/`FileNotFoundError(...)` for input validation and reserve `assert` for invariants.

♻️ Proposed refactor

```diff
-    assert _REAL_CIF.exists(), _REAL_CIF
-    assert 1 <= n_groups <= len(_GROUP_CCP4)
-    assert trials_per_group >= 1
-    assert len(selections) >= 1
+    if not _REAL_CIF.exists():
+        raise FileNotFoundError(_REAL_CIF)
+    if not 1 <= n_groups <= len(_GROUP_CCP4):
+        raise ValueError(f"n_groups must be in [1, {len(_GROUP_CCP4)}], got {n_groups}")
+    if trials_per_group < 1:
+        raise ValueError(f"trials_per_group must be >= 1, got {trials_per_group}")
+    if len(selections) < 1:
+        raise ValueError("selections must not be empty")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/conftest.py` around lines 87 - 90, Replace the assertion-based input checks that disappear under python -O with explicit exceptions: check existence of _REAL_CIF and raise FileNotFoundError with a helpful message if missing; validate n_groups against len(_GROUP_CCP4) and raise ValueError when out of range; ensure trials_per_group >= 1 and selections length >= 1 and raise ValueError with clear messages for each invalid condition; update the checks located near the fixture that references variables _REAL_CIF, n_groups, _GROUP_CCP4, trials_per_group, and selections accordingly.
71-78: Symlink-only approach is Linux-centric.

`dst.symlink_to(src)` requires privileges on Windows and is also silently broken inside some sandboxed tempdirs. Given tests under `tests/eval/` depend exclusively on symlinks, consider either:

- Falling back to `shutil.copy2` if `symlink_to` raises `OSError`/`NotImplementedError`, or
- Skipping this module on non-POSIX via a module-level `pytest.importorskip`/`skipif` guard.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/eval/conftest.py` around lines 71 - 78, The _link function currently uses dst.symlink_to(src) which fails on Windows or in sandboxed tempdirs; update _link to catch OSError and NotImplementedError around dst.symlink_to(src) and fall back to copying with shutil.copy2(src, dst) (ensure dst.parent exists and remove any existing dst before the fallback), and alternatively consider adding a module-level guard using pytest.importorskip("posix") or pytest.mark.skipif(not os.name == "posix", reason="symlink-only tests") if you prefer skipping on non-POSIX environments; refer to the _link function name and dst.symlink_to and use shutil.copy2 and pytest.importorskip/skipif in your change.
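A minimal sketch of the copy-fallback option, assuming `_link` takes `(src, dst)` Paths as described in the review comment:

```python
import shutil
from pathlib import Path

def _link(src: Path, dst: Path) -> None:
    """Symlink src to dst, copying instead where symlinks are unavailable."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    if dst.exists() or dst.is_symlink():
        dst.unlink()
    try:
        dst.symlink_to(src)
    except (OSError, NotImplementedError):
        # Windows without privileges, or sandboxed tempdirs: fall back to a copy.
        shutil.copy2(src, dst)
```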
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/sampleworks/utils/atom_array_utils.py`:
- Around line 759-776: Update the function-level docstring to a NumPy-style
docstring: add a Parameters section documenting atom_array (type:
AtomArray-like, expected shape/contents) and altloc_ids (iterable of str), a
Returns section describing that the function returns a dict mapping unordered
altloc id pairs (tuple of str) to tuples of AtomArray (array_i, array_j) and
explicitly note that pairs which raise RuntimeError inside
filter_to_common_atoms are omitted from the returned dict (and that a warning is
emitted), and a Raises section documenting any raised exceptions (e.g.,
propagate unexpected exceptions but note that RuntimeError from
filter_to_common_atoms is handled by omission). Reference
select_altloc(return_full_array=True) and filter_to_common_atoms in the
description so callers know how arrays are built and filtered.
---
Duplicate comments:
In `@scripts/eval/rscc_grid_search_script.py`:
- Line 124: The except clauses that currently catch (FileNotFoundError, OSError,
ValueError, RuntimeError) should also include AttributeError and TypeError so
malformed inputs don't crash the whole run; update the two handlers (the setup
handler around the code that may raise AttributeError like the explicit raise
and atom_array.set_annotation, and the per-trial handler inside
process_group/worker) to catch (FileNotFoundError, OSError, ValueError,
RuntimeError, AttributeError, TypeError) as e and preserve the existing fallback
behavior (e.g., emit rscc=nan or continue) so errors become per-selection
failures rather than terminating the job.
---
Nitpick comments:
In `@tests/eval/conftest.py`:
- Around line 87-90: Replace the assertion-based input checks that disappear
under python -O with explicit exceptions: check existence of _REAL_CIF and raise
FileNotFoundError with a helpful message if missing; validate n_groups against
len(_GROUP_CCP4) and raise ValueError when out of range; ensure trials_per_group
>= 1 and selections length >= 1 and raise ValueError with clear messages for
each invalid condition; update the checks located near the fixture that
references variables _REAL_CIF, n_groups, _GROUP_CCP4, trials_per_group, and
selections accordingly.
- Around line 71-78: The _link function currently uses dst.symlink_to(src) which
fails on Windows or in sandboxed tempdirs; update _link to catch OSError and
NotImplementedError around dst.symlink_to(src) and fall back to copying with
shutil.copy2(src, dst) (ensure dst.parent exists and remove any existing dst
before the fallback), and alternatively consider adding a module-level guard
using pytest.importorskip("posix") or pytest.mark.skipif(not os.name == "posix",
reason="symlink-only tests") if you prefer skipping on non-POSIX environments;
refer to the _link function name and dst.symlink_to and use shutil.copy2 and
pytest.importorskip/skipif in your change.
In `@tests/eval/test_rscc_grid_search_script.py`:
- Line 118: The function flaky_rscc currently declares an unused default
argument `_sel=bad_selection`; remove this unused parameter from the
`flaky_rscc` signature (drop `_sel` and its `bad_selection` default) and update
any call sites or test invocations to no longer pass or expect that argument;
ensure only `a`, `b`, and the existing `target_is_next` logic remain in
`flaky_rscc`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 32784a89-eb17-409e-b26c-e9fc7049a8eb
⛔ Files ignored due to path filters (1)
`pixi.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (13)
- CLAUDE.md
- pyproject.toml
- scripts/eval/classify_altloc_regions.py
- scripts/eval/find_max_rmsd_subsegment.py
- scripts/eval/rscc_grid_search_script.py
- src/sampleworks/utils/atom_array_utils.py
- src/sampleworks/utils/density_utils.py
- tests/eval/conftest.py
- tests/eval/test_rscc_grid_search_script.py
- tests/resources/1vme/1VME_0.25occA_0.75occB_1.00A.ccp4
- tests/resources/1vme/1VME_0.5occA_0.5occB_1.00A.ccp4
- tests/resources/1vme/1VME_single_001_density_input.cif
- tests/utils/test_density_utils.py
✅ Files skipped from review due to trivial changes (2)
- pyproject.toml
- CLAUDE.md
🚧 Files skipped from review as they are similar to previous changes (2)
- src/sampleworks/utils/density_utils.py
- scripts/eval/find_max_rmsd_subsegment.py
| """Return ``{(id_i, id_j): (array_i, array_j)}`` pre-filtered to common atoms. | ||
|
|
||
| For each unordered altloc pair we build the two per-altloc AtomArrays | ||
| via ``select_altloc(return_full_array=True)``, which includes blank-altloc | ||
| atoms as shared context and then run ``filter_to_common_atoms`` so the two | ||
| inputs have identical atom order and count. | ||
|
|
||
| We build per-pair rather than using ``map_altlocs_to_stack`` so residues whose | ||
| altloc set is a subset of those in the whole structure (e.g. 2YL0 res 60–64 | ||
| carry only altlocs A and B, not C) still get scored for the pairs where they | ||
| exist. A stack level ``filter_to_common_atoms`` would drop them entirely. | ||
|
|
||
| TODO: this helper hits the broader issue in how we | ||
| handle structures with >2 altlocs. | ||
| Fixing that upstream would let us replace this helper | ||
| with a direct ``map_altlocs_to_stack`` call and remove a source of | ||
| duplication. | ||
| """ |
🛠️ Refactor suggestion | 🟠 Major
Docstring is missing NumPy-style Parameters/Returns/Raises sections.
The description is clear, but the signature (`atom_array`, `altloc_ids`) and the non-obvious return semantics (pairs that raise `RuntimeError` inside `filter_to_common_atoms` are silently omitted, with a warning) are only documented implicitly. Please add explicit sections so callers don't have to read the implementation to learn that the returned dict can be missing pairs.
♻️ Suggested docstring
```diff
 def build_pairwise_altloc_arrays(
     atom_array: AtomArray, altloc_ids: list[str]
 ) -> dict[tuple[str, str], tuple[AtomArrayStack, AtomArrayStack]]:
     """Return ``{(id_i, id_j): (array_i, array_j)}`` pre-filtered to common atoms.

     For each unordered altloc pair we build the two per-altloc AtomArrays
     via ``select_altloc(return_full_array=True)``, which includes blank-altloc
     atoms as shared context and then run ``filter_to_common_atoms`` so the two
     inputs have identical atom order and count.

     We build per-pair rather than using ``map_altlocs_to_stack`` so residues whose
     altloc set is a subset of those in the whole structure (e.g. 2YL0 res 60–64
     carry only altlocs A and B, not C) still get scored for the pairs where they
     exist. A stack level ``filter_to_common_atoms`` would drop them entirely.

+    Parameters
+    ----------
+    atom_array
+        Input structure with ``altloc_id`` annotations.
+    altloc_ids
+        Altloc identifiers to enumerate over (unordered pairs, ``i < j``).
+
+    Returns
+    -------
+    dict[tuple[str, str], tuple[AtomArrayStack, AtomArrayStack]]
+        Mapping from ordered ``(id_i, id_j)`` pairs (``i < j`` in ``altloc_ids``)
+        to the common-atom-filtered stacks. Pairs for which no common atoms
+        exist are omitted (a warning is logged).
+
     TODO: this helper hits the broader issue in how we
     handle structures with >2 altlocs.
     Fixing that upstream would let us replace this helper
     with a direct ``map_altlocs_to_stack`` call and remove a source of
     duplication.
     """
```

As per coding guidelines: "Always include NumPy-style docstrings for every function and class."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/sampleworks/utils/atom_array_utils.py` around lines 759 - 776, Update the
function-level docstring to a NumPy-style docstring: add a Parameters section
documenting atom_array (type: AtomArray-like, expected shape/contents) and
altloc_ids (iterable of str), a Returns section describing that the function
returns a dict mapping unordered altloc id pairs (tuple of str) to tuples of
AtomArray (array_i, array_j) and explicitly note that pairs which raise
RuntimeError inside filter_to_common_atoms are omitted from the returned dict
(and that a warning is emitted), and a Raises section documenting any raised
exceptions (e.g., propagate unexpected exceptions but note that RuntimeError
from filter_to_common_atoms is handled by omission). Reference
select_altloc(return_full_array=True) and filter_to_common_atoms in the
description so callers know how arrays are built and filtered.
```python
# find_altloc_selections.py appends a combined all altloc selection
# (atomworks-style with " or " clauses) at the end of each row. That one is
# a union over every span we already processed individually, so skip it.
# NOTE: This will need to be addressed when we
```
What will need to be addressed?
```python
# a union over every span we already processed individually, so skip it.
# NOTE: This will need to be addressed when we
# migrate to atomworks-style selections for everything
if " or " in selection_str:
```
We probably want to allow selections that have "or" in them, presumably there will be some that have discontinuous but physically contacting residues that we care about?
```
identifies the contiguous subsegment of that size with the highest
RMSD between any pair of alternate conformations.

Only residues with identical residue names across altlocs are considered.
```
We should fix this. Maybe make an issue.
```python
)


def _has_compositional_heterogeneity(arr_i, arr_j, mask: np.ndarray) -> bool:
```
A couple things:

- Didn't one of us create a method for this somewhere else?
- This will return True if there are simply different sequences that are the same length, which I wouldn't think of as compositional heterogeneity. Do you want to check for modified residues like CYS->CSO?
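If the goal is to flag only known modifications rather than any sequence difference, one sketch against biotite-style `res_name` annotations (a hypothetical helper and lookup table, not code from this PR):

```python
import numpy as np

# Hypothetical table of parent <-> modified residue pairs; extend as needed.
_MODIFIED_EQUIVALENTS = {("CYS", "CSO"), ("CSO", "CYS")}

def is_modified_pair_only(arr_i, arr_j, mask: np.ndarray) -> bool:
    """True when every masked residue-name mismatch is a known modification
    (e.g. CYS vs CSO) rather than a genuinely different sequence.

    Assumes arr_i/arr_j are biotite AtomArrays filtered to a common atom
    order, so res_name annotations can be compared positionally.
    """
    names_i = arr_i.res_name[mask]
    names_j = arr_j.res_name[mask]
    mismatched = names_i != names_j
    return all(
        (a, b) in _MODIFIED_EQUIVALENTS
        for a, b in zip(names_i[mismatched], names_j[mismatched])
    )
```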
```python
if mask.sum() == 0:
    continue

if _has_compositional_heterogeneity(arr_i, arr_j, mask):
```
Is this to prevent an error for biotite_rmsd? If so, again I'd think about whether it might make sense to check for modified but otherwise the same residues (e.g. sulfonates). You could take the rmsd on only the common atoms.
| raise ValueError(f"Input CSV missing required columns: {missing}") | ||
|
|
||
| all_rows: list[dict] = [] | ||
| for _, row in input_df.iterrows(): |
There's a separate module for parallelizing things over rows of a dataframe, I forget what it is called, but it can speed things up a lot if _process_structure is a slow step.
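If that module doesn't turn up, joblib (already a dependency of this PR) gives the same effect. A sketch assuming a `_process_structure(row)` signature; the real function may take different arguments:

```python
from joblib import Parallel, delayed
import pandas as pd

def _process_structure(row: pd.Series) -> dict | None:
    """Stand-in for the script's per-row worker (assumed signature)."""
    return {"protein": row["protein"]}

input_df = pd.read_csv("protein_configs.csv")

# One worker per CSV row; result order matches row order.
row_results = Parallel(n_jobs=-1)(
    delayed(_process_structure)(row) for _, row in input_df.iterrows()
)
all_rows = [r for r in row_results if r is not None]
```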
```python
)
else:
    final_rows = []
    for protein, group in detail_df.groupby("protein", sort=False):
```
You should use a DataFrame.groupby.agg pattern here. No reason to use a for loop.
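A sketch of that pattern, using the semicolon-join described in the walkthrough; the `selection` column name and sample data are assumptions:

```python
import pandas as pd

# detail_df has one row per (protein, selection) result, per the walkthrough.
detail_df = pd.DataFrame(
    {"protein": ["1vme", "1vme", "2yl0"], "selection": ["A:10-12", "A:40-42", "B:60-62"]}
)

final_df = (
    detail_df.groupby("protein", sort=False, as_index=False)
    .agg({"selection": ";".join})  # semicolon-join replaces the explicit loop
)
```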
```
4. Aggregate and visualize results by ensemble size, guidance weight, and scaler type

Depending on the GPU, --n-jobs=8-16 work well. A CUDA RuntimeError in a worker is caught per-trial
(the row gets ``rscc=nan``) but may affect other trials in the same worker.
```
This is cryptic. Why is there always a row with RSCC=nan? It seems like we should just fix that rather than tolerate it.
Or do you mean that if there is a CUDA error, the row gets RSCC=nan?
```python
valid_selections = [s for s in protein_config.selection if s in group_ref_coords]
rows: list[dict] = []

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```
We should make some effort to spread these out over the available GPUs.
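One low-effort approach is a round-robin over visible devices; a sketch with a hypothetical `pick_device` helper, not the script's actual code:

```python
import torch

def pick_device(group_idx: int) -> torch.device:
    """Round-robin worker groups across visible GPUs, falling back to CPU."""
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return torch.device("cpu")
    return torch.device(f"cuda:{group_idx % n_gpus}")

# e.g. inside process_group, replacing the unconditional CUDA pick:
# device = pick_device(group_idx)
```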
```python
if base_xmap is None:
    raise ValueError(f"Failed to load base map from {base_map_path}")

transformer, _ = build_density_transformer(base_xmap, em_mode=False, device=device)
```
Probably this needs a comment to explain what the method does and what the transformer is used for later. This is the object that actually builds the map from coordinates, right?
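If that reading is right, a brief comment at the call site would cover it; wording assumed, not the PR's actual text:

```python
# build_density_transformer wraps base_xmap in a DifferentiableTransformer:
# the object used later to render trial coordinates into a computed density
# map on the same grid, which is then compared against base_xmap for RSCC.
transformer, _ = build_density_transformer(base_xmap, em_mode=False, device=device)
```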
Script for finding the max-RMSD subsegment in altlocs greater than a certain length, and speedups for the RSCC grid search script.
This PR depends on #223 and should only be merged afterwards.
Summary by CodeRabbit
Release Notes
New Features
Tests
Chores