Right now, all of our evaluation scripts require a path to an input data directory, which must contain the maps and input CIF files used for ensemble generation, and a configuration file that, among other things, gives relative paths within that directory to the specific map and input files for each protein. Rows of this configuration file look like
23,5MHX,chain A and resi 158-167,5MHX_single_001_density_input.cif,5MHX_uniform_1.00A.ccp4,processed/5MHX,1.0
(In this example there is only one atom selection string; usually there are several selections, separated by semicolons.) This line defines a ProteinConfig object (https://github.com/diff-use/sampleworks/blob/main/src/sampleworks/eval/grid_search_eval_utils.py#L224). Assuming the input data directory is "/data/inputs", the evaluation scripts look for the input CIF file at /data/inputs/5MHX_single_001_density_input.cif and the input maps used for guidance in /data/inputs/processed/5MHX/ (an additional pattern is used to locate the exact map; see https://github.com/diff-use/sampleworks/blob/main/scripts/eval/rscc_grid_search_script.py#L98).
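To make the current behavior concrete, here is a minimal sketch of how one row maps to paths under the input data directory. The `RowSketch` class, its field names, and the column positions are assumptions read off the example row above; they are not the actual ProteinConfig attributes, and the additional pattern used to pick the exact map is not reproduced here.

```python
import csv
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class RowSketch:
    # Hypothetical field names; the real ProteinConfig may use different ones.
    protein: str
    selections: list
    input_cif: str
    map_dir: str


def parse_row(line):
    """Split one configuration row into the fields used for path lookup."""
    fields = next(csv.reader([line]))
    return RowSketch(
        protein=fields[1],
        # Several selections may be packed into one column, separated by ";".
        selections=[s.strip() for s in fields[2].split(";")],
        input_cif=fields[3],
        map_dir=fields[5],
    )


def resolve_paths(row, input_dir):
    """Join the relative paths from the row onto the input data directory."""
    root = PurePosixPath(input_dir)
    return root / row.input_cif, root / row.map_dir
```

Every path here is rebuilt from the configuration row and the directory argument, so nothing ties the result to the files that were actually used when the ensemble was generated.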
Rather than constructing these paths after the fact, we should obtain them directly from the ensemble generation trial metadata. This data is stored in a job_metadata.json file in each output directory, and will soon be incorporated directly into our output CIF files (#209). The evaluation scripts should read the required paths from those locations and use them as-is.
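A rough sketch of what reading the paths from the metadata could look like. The key names `structure_path` and `density_path` are placeholders, since the actual job_metadata.json schema is not described in this issue, and `load_input_paths` is a hypothetical helper rather than an existing function in the repository.

```python
import json
from pathlib import Path

# Placeholder key names; the actual job_metadata.json schema may differ.
PATH_KEYS = ("structure_path", "density_path")


def load_input_paths(output_dir, keys=PATH_KEYS):
    """Return the input paths recorded in a trial's job_metadata.json."""
    metadata_file = Path(output_dir) / "job_metadata.json"
    with metadata_file.open() as fh:
        metadata = json.load(fh)

    # Fail loudly instead of silently falling back to reconstructed paths.
    missing = [key for key in keys if key not in metadata]
    if missing:
        raise KeyError(f"{metadata_file} is missing keys: {missing}")

    return {key: Path(metadata[key]) for key in keys}
```

Raising on missing keys is a deliberate choice in this sketch: a silent fallback to the old reconstruction logic would hide exactly the mismatches this change is meant to eliminate.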
Note that this depends on the paths in the metadata being valid paths on the working filesystem. Since our jobs are usually run inside Docker containers, the paths stored in the metadata today are ephemeral container paths, not the final locations of the files, which depend on which external volumes are mounted into the container. See #210.
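One possible way to bridge the gap, sketched here only as an illustration and not as the approach chosen in #210, is to translate container paths to host paths using a known table of volume mounts. The `to_host_path` helper and the mount table format (container mount point mapped to host directory) are assumptions.

```python
from pathlib import PurePosixPath


def to_host_path(container_path, mounts):
    """Translate a container path to a host path.

    `mounts` maps container mount points to host directories.
    """
    path = PurePosixPath(container_path)
    # Try the most specific (deepest) mount point first so nested mounts win.
    by_depth = sorted(
        mounts, key=lambda m: len(PurePosixPath(m).parts), reverse=True
    )
    for mount_point in by_depth:
        try:
            relative = path.relative_to(mount_point)
        except ValueError:
            continue  # path is not under this mount point
        return PurePosixPath(mounts[mount_point]) / relative
    raise ValueError(f"{container_path} is not under any known mount point")
```

For example, with a mount table of `{"/workspace/inputs": "/data/inputs"}`, the container path `/workspace/inputs/processed/5MHX` would translate to `/data/inputs/processed/5MHX`. This only works if the mount table is recorded somewhere at job launch, which the metadata does not capture today.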