diff --git a/scripts/eval/EVALUATION.md b/scripts/eval/EVALUATION.md
new file mode 100644
index 00000000..c129c26e
--- /dev/null
+++ b/scripts/eval/EVALUATION.md
@@ -0,0 +1,190 @@
+# Evaluation of SampleWorks Grid Search Results
+
+# External software requirements
+## tortoize
+SampleWorks relies on `tortoize` to compute backbone and sidechain dihedral angle outliers.
+`tortoize` is free software and can be downloaded from https://github.com/PDB-REDO/tortoize.
+You should install it following their instructions and make sure it is available in the environment
+where you run SampleWorks. The script `scripts/eval/run_and_process_tortoize.py` will check for the
+`tortoize` executable before running and will raise an error if it is not available.
+
+## phenix
+Information about the Phenix package can be found at https://phenix-online.org/. Phenix requires a
+license, which is free for academic users; others may have to pay a fee. SampleWorks makes use of
+the `phenix.clashscore` command, and `run_and_process_phenix_clashscore.py` will check for it
+before running, raising an error if it is not available.
+
+# Running the evaluations
+## Preparing the output CIF files
+As of this writing, SampleWorks outputs CIF files that primarily contain the output atomic
+coordinates, not the additional information that many programs, like `tortoize` and
+`phenix.clashscore`, require. Furthermore, many protein structure predictors effectively
+renumber residues. Since our metrics are frequently calculated by comparing selections of atoms or
+residues, we must also align to the original _sequence_ of the protein. Future versions of
+SampleWorks will handle these issues automatically. For now, you should run the script
+`scripts/patch_output_cif_files.py`. This will use the original PDB inputs to reconstruct proper
+output CIF files that are numbered correctly and have all the metadata necessary to reconstruct
+the protein structure.
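+As an illustration of the "check before running" behavior described above, the sketch below shows
+one common way such a check is written in Python using `shutil.which`. This is a minimal,
+hypothetical sketch, not the actual SampleWorks implementation; the function name
+`require_executable` is illustrative.
+
```python
import shutil


def require_executable(name: str) -> str:
    """Return the full path to `name`, raising a clear error if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"Required executable {name!r} was not found on PATH; "
            "install it and make sure it is visible in this environment."
        )
    return path


# Hypothetical usage, mirroring what the eval scripts are described as doing:
# require_executable("tortoize")
# require_executable("phenix.clashscore")
```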
+
+You can run the following command, which assumes:
+- your SampleWorks output is stored in `/home/ubuntu/grid_search_results`,
+- the output is organized by RCSB PDB ID in directories like
+  `/home/ubuntu/grid_search_results/1VME/...`; see the `--rcsb-pattern` argument, a regex that
+  matches the RCSB PDB ID,
+- the input PDB CIF files are stored in `/home/ubuntu/grid_search_inputs`, as required for running
+  the grid search (see GRID_SEARCH.md). The files will have paths like, e.g.,
+  `/home/ubuntu/grid_search_inputs/1VME/1VME_original.cif`. See also the `--input-pdb-pattern`
+  argument, a Python format string which must use the `pdb_id` variable to refer to the
+  RCSB PDB ID.
+
+```shell
+pixi run -e analysis python scripts/patch_output_cif_files.py \
+    --input-dir /home/ubuntu/grid_search_results \
+    --rcsb-pattern 'grid_search_results/(.{4})/...' \
+    --cif-pattern 'refined.cif' \
+    --grid-search-input-dir /home/ubuntu/grid_search_inputs \
+    --input-pdb-pattern '{pdb_id}/{pdb_id}_original.cif'
+```
+
+This script searches recursively for all CIF files under the input directory, by default up to 4
+levels deep. If you organize the output more deeply, you can specify the depth with the `--depth`
+argument. It will write a patched CIF file named `refined-patched.cif` alongside each original
+`refined.cif` file. These `refined-patched.cif` files can be used as input to the remaining
+evaluation scripts.
+
+## Running the scripts
+The evaluation scripts have a common interface defined by the method
+`sampleworks.eval.grid_search_eval_utils.parse_eval_args`. The general form of these commands is:
+
+```shell
+pixi run -e analysis python scripts/eval/