
feat: Completing evaluation suite for v0 and documenting how to run evaluations#186

Merged
marcuscollins merged 3 commits into `main` from `mdc/evaluation-fixes` on Mar 25, 2026

Conversation

@marcuscollins (Collaborator) commented Mar 25, 2026

Addresses two issues: bringing the tortoize script into the sampleworks repo, and documenting how to run evaluations in EVALUATION.md.

fix: add documentation of the evaluation scripts; resolves https://github.com/diff-use/sampleworks/issues/123
feat: add tortoize (backbone and side-chain dihedral z-score) script

Summary by CodeRabbit

  • New Features

    • Added evaluation pipeline that generates residue-level and protein-level statistical quality metrics, outputting analysis results to CSV files.
    • Supports efficient parallel execution of multiple evaluation trials with configurable concurrency.
  • Documentation

    • Added comprehensive evaluation workflow guide covering dependencies, setup procedures, command-line interface usage, and output format specifications.

@coderabbitai bot (Contributor) commented Mar 25, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

A new tortoize evaluation script and documentation are introduced. The script executes tortoize on trial CIF files in parallel, parses JSON results into per-residue and per-protein DataFrames, and writes aggregated statistics to CSV files. The documentation describes the full evaluation workflow, external tool dependencies, and output schemas for all evaluation scripts.
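The parse-and-aggregate step described above can be sketched as follows. Note that the JSON field names (`residues`, `seq_id`, `rama_z`, `torsion_z`) and the helper shown here are hypothetical placeholders for illustration, not tortoize's actual output schema or the PR's real `flatten_residues` implementation.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def flatten_residues(result: dict) -> pd.DataFrame:
    """Flatten per-residue z-scores into one row per residue.

    NOTE: the field names ("residues", "seq_id", "rama_z", "torsion_z")
    are hypothetical placeholders, not tortoize's actual JSON schema.
    """
    return pd.DataFrame(
        [
            {"seq_id": r["seq_id"], "rama_z": r["rama_z"], "torsion_z": r["torsion_z"]}
            for r in result["residues"]
        ]
    )


# Stand-in for one trial's tortoize stdout, already captured as a string:
raw = (
    '{"residues": [{"seq_id": 1, "rama_z": -0.4, "torsion_z": 0.2},'
    ' {"seq_id": 2, "rama_z": 1.1, "torsion_z": -0.7}]}'
)
trial_df = flatten_residues(json.loads(raw))

# Per-trial frames are then concatenated and written to a single CSV:
all_residues = pd.concat([trial_df], ignore_index=True)
out_path = Path(tempfile.mkdtemp()) / "tortoize_residues.csv"
all_residues.to_csv(out_path, index=False)
print(len(all_residues), round(all_residues["rama_z"].mean(), 2))  # 2 0.35
```

The same pattern repeats for the protein-level frame, with one row per trial instead of one per residue.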

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Tortoize Evaluation Script (`scripts/eval/run_and_process_tortoize.py`) | New executable script implementing a parallel evaluation pipeline that validates tortoize availability, runs it on refined CIF files, parses JSON output into residue-level and protein-level DataFrames via helper functions, and writes aggregated results to CSVs with full error handling. |
| Evaluation Documentation (`scripts/eval/EVALUATION.md`) | New documentation describing the end-to-end evaluation workflow, external dependencies (tortoize, phenix.clashscore), CIF file preparation steps, the shared CLI interface via parse_eval_args, and output schemas for all available evaluation scripts, including column definitions and example data structures. |
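The parallel fan-out over trials, together with the safe subprocess invocation the review comments below ask for, can be sketched as follows. This is a minimal illustration: `echo` stands in for the `tortoize` binary, and the trial paths are made up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_one(path: Path) -> str:
    """Run the external tool on one CIF file and return its stdout.

    Passing the command as an argument list (rather than splitting an
    f-string) keeps paths containing spaces intact. `echo` stands in
    for the real `tortoize` binary here.
    """
    return subprocess.check_output(["echo", str(path)], text=True).strip()


trial_paths = [
    Path("/tmp/trial 1/refined.cif"),  # note the space in the directory name
    Path("/tmp/trial2/refined.cif"),
]

# Threads suffice here: each worker just waits on a subprocess, so the
# work is I/O-bound rather than CPU-bound.
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(run_one, trial_paths))

print(outputs[0])  # /tmp/trial 1/refined.cif
```

With `f"tortoize {path}".split()`, the first path would have been split into two bogus arguments; the argument-list form avoids that entirely.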

Sequence Diagram

sequenceDiagram
    participant Script as Evaluation Script
    participant Setup as Parameter Setup
    participant Parallel as Parallel Executor
    participant Tortoize as Tortoize Executable
    participant Parser as JSON Parser
    participant DF as DataFrame Builder
    participant Writer as CSV Writer

    Script->>Script: Validate tortoize availability
    Script->>Setup: setup_evaluation_parameters(args)
    Setup-->>Script: List of trial configs
    Script->>Parallel: Process each trial in parallel
    loop For each trial
        Parallel->>Tortoize: Execute tortoize on refined_cif_path
        Tortoize-->>Parallel: JSON output
        Parallel->>Parser: Parse JSON (residues & proteins)
        Parser->>DF: flatten_residues() & get_protein_level_z_scores()
        DF-->>Parallel: residue_df, protein_df
    end
    Script->>DF: Concatenate all residue results
    Script->>DF: Concatenate all protein results
    Script->>Writer: Write tortoize_residues.csv
    Script->>Writer: Write tortoize_protein_stats.csv
    Writer-->>Script: Completion logged

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

  • PR #157: Eval refactor that introduces parse_eval_args and setup_evaluation_parameters utilities used by the new tortoize evaluation script.

Suggested Reviewers

  • DorisMai

Poem

🐰 A tortoize runs fast, in parallel so neat,
Parsing JSON metrics with twitching rabbit feet,
Residues and proteins in DataFrames bright,
CSV results written through the eval night!

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main changes: adding a tortoize evaluation script and documenting the evaluation workflow. Both changes are described in the raw summary. |
| Docstring coverage | ✅ Passed | Docstring coverage is 100.00%, which meets the required threshold of 80.00%. |


@marcuscollins linked an issue Mar 25, 2026 that may be closed by this pull request

@coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@EVALUATION.md`:
- Line 138: The table row for the configuration key "gd_steps" is missing the
value cell; update the row for gd_steps in the table so it has a third column
entry (e.g., an example value or "N/A") to match the pattern of other rows and
keep the table consistent; edit the row containing "gd_steps" to include the
value cell alongside the existing description "Deprecated." so the Markdown
table has three columns.
- Around line 182-187: The table rows for entries like "rscc", "base_map_path",
"error", "selection", "occupancies", and "avg_silhouette" have
incomplete/misaligned cells and the row for "avg_silhouette" is missing a
trailing pipe and contains a typo ("silhoutte"); update the markdown table so
every row has the same number of pipe-delimited cells, add the missing trailing
pipe on the avg_silhouette line, fill any empty cells with the intended values
or placeholders consistent with surrounding rows, and correct the typo to
"silhouette" (use the column keys as anchors to find and fix each row).
- Around line 155-156: Replace the misspelled token "psuedo" with "pseudo"
throughout the document (e.g., in the phrase "psuedo-distance" and any other
occurrences of "psuedo") in EVALUATION.md so all instances (including the ones
noted around the input reference structure and LDDT score descriptions) read
"pseudo" consistently.
- Around line 33-37: Remove the duplicated bullet about the input PDB CIF files
and keep the more detailed variant that includes the example path and the
`--input-pdb-pattern` usage; specifically, delete the redundant line that simply
states "the input PDB cif files are stored in /home/ubuntu/grid_search_inputs"
and retain the bullet mentioning
`/home/ubuntu/grid_search_inputs/1VME/1VME_original.cif` and the
`--input-pdb-pattern` python format string (using the pdb_id variable).

In `@scripts/eval/run_and_process_tortoize.py`:
- Around line 124-131: get_stats_for_single_path currently builds the shell
command with f"tortoize {path}".split(), which breaks on paths with spaces and
is unsafe; change the subprocess call to pass arguments as a list (e.g.,
["tortoize", str(path)]) or use subprocess.run with capture_output=True and
check=True to get output reliably, and preserve the existing exception handling;
also add a NumPy-style docstring for get_stats_for_single_path describing
parameters, returns (two pandas.DataFrame), and raises/notes per project
conventions.
- Around line 16-23: Update the main function: fix the incorrect comment to say
you’re checking for "tortoize" (not "phenix"); wrap the
subprocess.call("tortoize", ...) in a try/except as it is now but capture the
caught exception (e.g., "err") and re-raise the RuntimeError with exception
chaining using "raise RuntimeError(...) from err"; add a NumPy-style docstring
to main() describing parameters and return value; and ensure argparse is
imported at top of the file since the function signature expects args.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03b672bc-1692-4310-91c1-e1b0a90f5fe9

📥 Commits

Reviewing files that changed from the base of the PR and between b3783d1 and 8ddb2ea.

📒 Files selected for processing (3)
  • EVALUATION.md
  • scripts/eval/run_and_process_tortoize.py
  • scripts/patch_output_cif_files.py

Comment thread EVALUATION.md Outdated
Comment on lines +33 to +37
- the input PDB cif files are stored in `/home/ubuntu/grid_search_inputs` as required for running the
the grid search (see GRID_SEARCH.md)
- the input PDB cif files are stored in `/home/ubuntu/grid_search_inputs` as, e.g.,
`/home/ubuntu/grid_search_inputs/1VME/1VME_original.cif`, see the `--input-pdb-pattern` argument, which
is a python format string which must use the `pdb_id` variable to refer to the RCSB PDB ID.

⚠️ Potential issue | 🟡 Minor

Duplicate bullet point content.

Lines 33-37 repeat the same information as lines 30-32 about input PDB CIF files. This appears to be a copy-paste error.

📝 Proposed fix
 - your sampleworks output is stored in `/home/ubuntu/grid_search_results`, 
 - the output is organized by RCSB PDB ID in directories like `/home/ubuntu/grid_search_results/1VME/...`,
   see the `--rcsb-pattern` argument which is a regex to match the RCSB PDB ID
-- the input PDB cif files are stored in `/home/ubuntu/grid_search_inputs` as required for running the
-  the grid search (see GRID_SEARCH.md)
-- the input PDB cif files are stored in `/home/ubuntu/grid_search_inputs` as, e.g., 
-  `/home/ubuntu/grid_search_inputs/1VME/1VME_original.cif`, see the `--input-pdb-pattern` argument, which
-  is a python format string which must use the `pdb_id` variable to refer to the RCSB PDB ID.
+- the input PDB CIF files are stored in `/home/ubuntu/grid_search_inputs` as, e.g., 
+  `/home/ubuntu/grid_search_inputs/1VME/1VME_original.cif`, see the `--input-pdb-pattern` argument, which
+  is a Python format string which must use the `pdb_id` variable to refer to the RCSB PDB ID.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- the input PDB cif files are stored in `/home/ubuntu/grid_search_inputs` as required for running the
the grid search (see GRID_SEARCH.md)
- the input PDB cif files are stored in `/home/ubuntu/grid_search_inputs` as, e.g.,
`/home/ubuntu/grid_search_inputs/1VME/1VME_original.cif`, see the `--input-pdb-pattern` argument, which
is a python format string which must use the `pdb_id` variable to refer to the RCSB PDB ID.
- the input PDB CIF files are stored in `/home/ubuntu/grid_search_inputs` as, e.g.,
`/home/ubuntu/grid_search_inputs/1VME/1VME_original.cif`, see the `--input-pdb-pattern` argument, which
is a Python format string which must use the `pdb_id` variable to refer to the RCSB PDB ID.

Comment thread EVALUATION.md Outdated
| scaler | the trajectory scaler | pure_guidance |
| ensemble_size | The number of output models in the `refined_cif_path` | 8 |
| guidance_weight | | 0.1 |
| gd_steps | Deprecated. |

⚠️ Potential issue | 🟡 Minor

Table row is missing the value cell.

This row is incomplete—the "value" column is empty. Consider adding an example value or "N/A" for consistency with other rows.

📝 Proposed fix
-| gd_steps             | Deprecated.                                                                        |
+| gd_steps             | Deprecated.                                                                        | N/A                                                                                                                     |
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 138-138: Table column count
Expected: 3; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


Comment thread EVALUATION.md Outdated
Comment on lines +155 to +156
in the input reference structure, using as a
psuedo-distance the LDDT scores computed over the selected atoms in each model.

⚠️ Potential issue | 🟡 Minor

Typo: "psuedo" should be "pseudo".

This typo appears multiple times in the document (lines 155, 164, 166).

📝 Proposed fix for this instance
 The script attempts to assign selections in each of the models in the CIF file to the altlocs defined
-in the input reference structure, using as a 
-psuedo-distance the LDDT scores computed over the selected atoms in each model. 
+in the input reference structure, using as a 
+pseudo-distance the LDDT scores computed over the selected atoms in each model. 
🧰 Tools
🪛 LanguageTool

[grammar] ~155-~155: Ensure spelling is correct
Context: ...e input reference structure, using as a psuedo-distance the LDDT scores computed over ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


Comment thread EVALUATION.md Outdated
Comment on lines +182 to +187
| rscc | unused |
| base_map_path | unused |
| error | any error, if the calculation cannot be completed |
| selection | the selection string that was used, described above (full string will be available, shortened here) | chain_id == 'A' and ((res_id == 1) or ... or (res_id >= 316 and res_id <= 324) or (res_id == 326) or (res_id == 373)) |
| occupancies | space-separated list of occupancies of the identified clustered states | [0.625, 0.375] |
| avg_silhouette | the average silhoutte score for the clustered conformers/altlocs (see above) | 0.16082633177992867

⚠️ Potential issue | 🟡 Minor

Table formatting issues: missing cells and trailing pipe.

Several rows in this table have incomplete cells, and line 187 is missing a trailing pipe. This may cause rendering issues in some Markdown parsers.

📝 Proposed fix
-| rscc                  | unused                                                                                                                                                                     |
-| base_map_path         | unused                                                                                                                                                                     |
-| error                 | any error, if the calculation cannot be completed                                                                                                                          |
+| rscc                  | unused                                                                                                                                                                     | N/A                                                                                                                                        |
+| base_map_path         | unused                                                                                                                                                                     | N/A                                                                                                                                        |
+| error                 | any error, if the calculation cannot be completed                                                                                                                          |                                                                                                                                            |
 | selection             | the selection string that was used, described above (full string will be available, shortened here)                                                                        | chain_id == 'A' and ((res_id == 1) or ... or (res_id >= 316 and res_id <= 324) or (res_id == 326) or (res_id == 373))                      |
 | occupancies           | space-separated list of occupancies of the identified clustered states                                                                                                     | [0.625, 0.375]                                                                                                                             |
-| avg_silhouette        | the average silhoutte score for the clustered conformers/altlocs (see above)                                                                                               | 0.16082633177992867
+| avg_silhouette        | the average silhouette score for the clustered conformers/altlocs (see above)                                                                                              | 0.16082633177992867                                                                                                                        |

Note: Also fixed "silhoutte" → "silhouette" typo on line 187.

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 182-182: Table column count
Expected: 3; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


[warning] 183-183: Table column count
Expected: 3; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


[warning] 184-184: Table column count
Expected: 3; Actual: 2; Too few cells, row will be missing data

(MD056, table-column-count)


[warning] 187-187: Table pipe style
Expected: leading_and_trailing; Actual: leading_only; Missing trailing pipe

(MD055, table-pipe-style)


Comment thread scripts/eval/run_and_process_tortoize.py

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (6)
scripts/eval/run_and_process_tortoize.py (2)

124-131: ⚠️ Potential issue | 🟠 Major

Path with spaces will break; missing docstring; overly broad exception.

  1. f"tortoize {path}".split() will incorrectly split paths containing spaces
  2. Catching bare Exception is too broad (Ruff BLE001) - catch specific exceptions
  3. Missing NumPy-style docstring per coding guidelines
🐛 Proposed fix
 def get_stats_for_single_path(path: Path) -> tuple[DataFrame, DataFrame]:
+    """
+    Run tortoize on a single CIF file and extract residue/protein-level stats.
+
+    Parameters
+    ----------
+    path : Path
+        Path to a CIF file to process.
+
+    Returns
+    -------
+    tuple[DataFrame, DataFrame]
+        (residue_df, protein_df). Both are empty DataFrames on failure.
+    """
     logger.info(f"Processing {path}")
     try:
-        output = subprocess.check_output(f"tortoize {path}".split())
+        output = subprocess.check_output(["tortoize", str(path)])
         result = json.loads(output)
-    except Exception as e:
+    except (subprocess.CalledProcessError, json.JSONDecodeError, OSError) as e:
         logger.error(f"Failed to process {path}: {e}")
         return pd.DataFrame(), pd.DataFrame()

As per coding guidelines: "Always include NumPy-style docstrings for every function and class."


16-23: ⚠️ Potential issue | 🟡 Minor

Incorrect comment, missing type hint and docstring, and exception chaining.

  1. Line 17 comment says "phenix" but the code checks for tortoize
  2. args parameter lacks type annotation
  3. Missing NumPy-style docstring per coding guidelines
  4. Per B904, re-raise with explicit exception chaining (`raise ... from err`) so the original exception is preserved
📝 Proposed fix
+import argparse
+
 # TODO make more general: https://github.com/diff-use/sampleworks/issues/93
-def main(args) -> None:
-    # check that phenix is installed and available, bail early if not.
+def main(args: argparse.Namespace) -> None:
+    """
+    Run tortoize on all trial CIF files and output residue/protein-level CSV stats.
+
+    Parameters
+    ----------
+    args : argparse.Namespace
+        Parsed CLI arguments from parse_eval_args().
+    """
+    # Check that tortoize is installed and available, bail early if not.
     try:
         subprocess.call("tortoize", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
-    except FileNotFoundError:
+    except FileNotFoundError as err:
         raise RuntimeError(
             "tortoize is not available, make sure you have installed it."
-        )
+        ) from err

As per coding guidelines: "Always include NumPy-style docstrings for every function and class."

EVALUATION.md (4)

155-156: ⚠️ Potential issue | 🟡 Minor

Typo: "psuedo" should be "pseudo".

This typo appears on lines 155, 164, and 166.

📝 Proposed fix for all instances
-in the input reference structure, using as a 
-psuedo-distance the LDDT scores computed over the selected atoms in each model.
+in the input reference structure, using as a 
+pseudo-distance the LDDT scores computed over the selected atoms in each model.

Line 164:

-which is reported as the `avg_silhouette` score. We also report a psuedo-silhouette score, which is
+which is reported as the `avg_silhouette` score. We also report a pseudo-silhouette score, which is

Line 166:

-and 0.0 indicates a poor clustering. In the example provided, the psuedo-silhouette score is 0.0034,
+and 0.0 indicates a poor clustering. In the example provided, the pseudo-silhouette score is 0.0034,

138-138: ⚠️ Potential issue | 🟡 Minor

Table row missing value cell.

The gd_steps row has only 2 columns but the table expects 3. Add a value cell (e.g., N/A) for consistency.

📝 Proposed fix
-| gd_steps             | Deprecated.                                                                        |
+| gd_steps             | Deprecated.                                                                        | N/A                                                                                                                     |

33-37: ⚠️ Potential issue | 🟡 Minor

Duplicate bullet point content.

Lines 33-34 and lines 35-37 both describe input PDB CIF file locations. Remove the less detailed duplicate (lines 33-34) and keep the version with the example path and --input-pdb-pattern explanation.

📝 Proposed fix
 - your sampleworks output is stored in `/home/ubuntu/grid_search_results`, 
 - the output is organized by RCSB PDB ID in directories like `/home/ubuntu/grid_search_results/1VME/...`,
   see the `--rcsb-pattern` argument which is a regex to match the RCSB PDB ID
-- the input PDB cif files are stored in `/home/ubuntu/grid_search_inputs` as required for running the
-  the grid search (see GRID_SEARCH.md)
 - the input PDB cif files are stored in `/home/ubuntu/grid_search_inputs` as, e.g., 
   `/home/ubuntu/grid_search_inputs/1VME/1VME_original.cif`, see the `--input-pdb-pattern` argument, which
   is a python format string which must use the `pdb_id` variable to refer to the RCSB PDB ID.

182-188: ⚠️ Potential issue | 🟡 Minor

Table formatting issues: missing cells, trailing pipe, and typo.

  • Lines 182-184: Missing third column (value) cells
  • Line 187: Missing trailing pipe and "silhoutte" → "silhouette" typo
📝 Proposed fix
-| rscc                  | unused                                                                                                                                                                     |
-| base_map_path         | unused                                                                                                                                                                     |
-| error                 | any error, if the calculation cannot be completed                                                                                                                          |
+| rscc                  | unused                                                                                                                                                                     | N/A                                                                                                                                        |
+| base_map_path         | unused                                                                                                                                                                     | N/A                                                                                                                                        |
+| error                 | any error, if the calculation cannot be completed                                                                                                                          |                                                                                                                                            |
 | selection             | the selection string that was used, described above (full string will be available, shortened here)                                                                        | chain_id == 'A' and ((res_id == 1) or ... or (res_id >= 316 and res_id <= 324) or (res_id == 326) or (res_id == 373))                      |
 | occupancies           | space-separated list of occupancies of the identified clustered states                                                                                                     | [0.625, 0.375]                                                                                                                             |
-| avg_silhouette        | the average silhoutte score for the clustered conformers/altlocs (see above)                                                                                               | 0.16082633177992867
+| avg_silhouette        | the average silhouette score for the clustered conformers/altlocs (see above)                                                                                              | 0.16082633177992867                                                                                                                        |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@EVALUATION.md` around lines 182 - 188, The table rows for "rscc" and
"base_map_path" are missing their third (example/value) column—add appropriate
example values (e.g., "0.85" for rscc and a sample path string for
base_map_path) so each row has three columns; fix the "avg_silhouette" row by
adding the trailing pipe and correct the typo "silhoutte" → "silhouette" in the
"avg_silhouette" header/description (also verify "error" row has its third
column present and add a value if missing).
🧹 Nitpick comments (1)
scripts/eval/run_and_process_tortoize.py (1)

95-109: Docstring uses Sphinx-style instead of NumPy-style.

The docstring uses :param and :return: syntax. For consistency with flatten_residues and coding guidelines, use NumPy-style with Parameters and Returns sections.

📝 Proposed fix
 def get_protein_level_z_scores(tortoize_json: dict[str, Any]) -> pd.DataFrame:
     """
     Extracts protein-level z-scores for torsion and Ramachandran angles from tortoize JSON output.
-    See test/1cbs.json for an example of the tortoize output JSON
 
+    See test/1cbs.json for an example of the tortoize output JSON.

-    :param tortoize_json:
-    :return: pd.DataFrame with keys:
-      - model (e.g. "1")
-      - ramachandran_z_score
-      - torsion_z_score
-      - ramachandran_jackknife_sd
-      - torsion_jackknife_sd
-      - residue_count
+    Parameters
+    ----------
+    tortoize_json : dict[str, Any]
+        Parsed JSON output from tortoize command.
+
+    Returns
+    -------
+    pd.DataFrame
+        DataFrame with columns: model, ramachandran_z_score, torsion_z_score,
+        ramachandran_jackknife_sd, torsion_jackknife_sd.
     """

As per coding guidelines: "Always include NumPy-style docstrings for every function and class."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/eval/run_and_process_tortoize.py` around lines 95 - 109, The
docstring for get_protein_level_z_scores currently uses Sphinx-style ":param"
and ":return:" blocks; update it to a NumPy-style docstring matching the style
used by flatten_residues and project guidelines by replacing the param/return
directives with "Parameters" and "Returns" sections, describing tortoize_json:
dict[str, Any] and the returned pd.DataFrame (including keys model,
ramachandran_z_score, torsion_z_score, ramachandran_jackknife_sd,
torsion_jackknife_sd, residue_count) and keeping the brief description and
example reference.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/eval/run_and_process_tortoize.py`:
- Around line 31-42: tortoize_results may contain tuples of empty DataFrames so
before calling pd.concat on all_residue_results and all_protein_results filter
out empty frames (e.g., [df for df in all_residue_results if not df.empty]) and
similarly for all_protein_results, then if either filtered list is empty log an
informative message and skip writing that CSV (or return) to avoid producing
empty CSVs; update the logic around tortoize_results, all_residue_results,
all_protein_results and the pd.concat calls to perform this filtering (similar
to run_and_process_phenix_clashscore.py).

---

Duplicate comments:
In `@EVALUATION.md`:
- Around line 155-156: Replace the misspelled token "psuedo" with the correct
spelling "pseudo" in all occurrences; specifically search for the phrase
"psuedo-distance" (and any other "psuedo" occurrences) in EVALUATION.md and
change them to "pseudo-distance" (three instances reported).
- Line 138: The table row for "gd_steps" is missing the third value cell; update
the markdown row (the line containing "| gd_steps             | Deprecated.     
|") to include a third column value such as "N/A" so the row has three cells,
e.g. change it to "| gd_steps             | Deprecated.                         
| N/A |" to match the table's column count.

In `@scripts/eval/run_and_process_tortoize.py`:
- Around line 124-131: get_stats_for_single_path currently builds the shell
command with f"tortoize {path}".split() (which breaks on paths with spaces),
catches a bare Exception, and lacks a NumPy-style docstring; update the function
to accept a Path safely by calling subprocess.check_output with an argument list
like ["tortoize", str(path)] (or use subprocess.run with check=True), replace
the broad except with specific exception handlers for
subprocess.CalledProcessError and json.JSONDecodeError (and optionally OSError),
and add a proper NumPy-style docstring describing parameters, returns, and
behavior for get_stats_for_single_path.
- Around line 16-23: Fix the main function: correct the inline comment to
reference "tortoize" (not "phenix"), add a proper type annotation for the args
parameter (e.g., argparse.Namespace or Sequence[str] depending on usage), add a
NumPy-style docstring for main describing parameters and return value, and
change the exception raise to use exception chaining suppression (use "raise
RuntimeError(...) from None") when catching FileNotFoundError; locate these
edits around the main function, the subprocess.call("tortoize") invocation, and
the FileNotFoundError handler.
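The safer invocation pattern the prompt describes — passing arguments as a list rather than splitting an f-string, and catching specific exceptions — can be sketched standalone. This is a hypothetical helper, not the repo's actual `get_stats_for_single_path`; `cat` stands in for `tortoize` in the usage example.

```python
import json
import subprocess
from pathlib import Path
from typing import Any


def run_json_tool(tool: str, path: Path) -> dict[str, Any]:
    """Run `tool` on `path` and parse its stdout as JSON.

    Parameters
    ----------
    tool : str
        Name of the executable to invoke.
    path : Path
        Input file passed as the tool's single argument. Passing the
        arguments as a list keeps paths with spaces intact and avoids
        shell-splitting bugs.

    Returns
    -------
    dict[str, Any]
        The parsed JSON output.
    """
    try:
        # Argument list, not f"{tool} {path}".split(): safe for spaces.
        out = subprocess.check_output([tool, str(path)])
        return json.loads(out)
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(f"{tool} exited with {exc.returncode} on {path}") from exc
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"{tool} produced invalid JSON for {path}") from exc
```

For example, `run_json_tool("cat", Path("file with spaces.json"))` would succeed on a path that `f"cat {path}".split()` mangles into two arguments.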


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e6f0adc2-2398-4717-9d90-f415e3b9fa41

📥 Commits

Reviewing files that changed from the base of the PR and between 8ddb2ea and 3c087f1.

📒 Files selected for processing (3)
  • EVALUATION.md
  • scripts/eval/run_and_process_tortoize.py
  • scripts/patch_output_cif_files.py

Comment on lines +31 to +42
if not tortoize_results:
logger.error("No trials successfully processed, check that result files are available.")
return

all_residue_results, all_protein_results = tuple(zip(*tortoize_results, strict=True))

output_file = "tortoize_residues.csv"
pd.concat(all_residue_results).to_csv(output_file, index=False)
logger.info(f"Residue results saved to {output_file}")

output_file = "tortoize_protein_stats.csv"
pd.concat(all_protein_results).to_csv(output_file, index=False)

⚠️ Potential issue | 🟠 Major

Missing filter for empty DataFrames before concatenation.

The check on line 31 only catches an empty list. If all trials fail, tortoize_results will be a non-empty list of tuples containing empty DataFrames, and pd.concat() will silently produce empty CSVs. The run_and_process_phenix_clashscore.py script handles this by filtering empty DataFrames before concat.

🐛 Proposed fix
     if not tortoize_results:
         logger.error("No trials successfully processed, check that result files are available.")
         return

     all_residue_results, all_protein_results = tuple(zip(*tortoize_results, strict=True))

+    # Filter out empty DataFrames from failed trials
+    all_residue_results = [df for df in all_residue_results if not df.empty]
+    all_protein_results = [df for df in all_protein_results if not df.empty]
+
+    if not all_residue_results:
+        logger.error("No trials produced residue output, check that result files are available.")
+        return
+
     output_file = "tortoize_residues.csv"
     pd.concat(all_residue_results).to_csv(output_file, index=False)

@k-chrispens k-chrispens changed the title feat:Completing evaluation suite for v0 and documenting how to run evaluations feat: Completing evaluation suite for v0 and documenting how to run evaluations Mar 25, 2026

@k-chrispens k-chrispens left a comment


A few of the coderabbit comments seem worth addressing (typos, docstrings) but won't hold it up. I also think the EVALUATION.md could be put under scripts/eval, that way if someone is viewing it on GitHub it will pop up and format it prettily.

@marcuscollins marcuscollins merged commit 9c1d738 into main Mar 25, 2026
1 check was pending
@marcuscollins marcuscollins deleted the mdc/evaluation-fixes branch March 25, 2026 22:25


Development

Successfully merging this pull request may close these issues.

Document how to run all of the evaluation scripts
