Set up the directory structure:
project_dir="/data/BIDS-HPC/private/projects/dmi2"
working_dir="/home/weismanal/notebook/2020-06-10/dmi"
mkdir -p "$project_dir" "$working_dir"
cd "$working_dir"
git clone [email protected]:andrew-weisman/target_classification.git "$project_dir/checkout"
mkdir "$project_dir/data"
Note: The effort using the data directly from the TARGET data website (as opposed to the GDC Data Portal) is in the target_data_website branch of this repository.
Download the manifest for all the gene expression quantification files in the TARGET program (click on the blue "Manifest" button):
Place the downloaded manifest file as $project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt.
In addition, click on the blue "Add All Files to Cart" button, go to the cart (top right of page), click on the two blue buttons "Sample Sheet" and "Metadata", and save the resulting two files to $project_dir/data. The two files will be named, e.g., gdc_sample_sheet.2020-07-02.tsv and metadata.cart.2020-07-02.json.
Note that these 5,149 files correspond to 1,192 cases (a "case" here definitely means a person).
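The file and case counts can be cross-checked against the downloaded sample sheet. The following is a minimal sketch, assuming the sample sheet is a tab-separated file with a header row that includes a column named "Case ID" (an assumption about the sample sheet layout; adjust the column name if it differs). The helper name count_unique_in_column is hypothetical and not part of the repository.

```shell
# Hypothetical helper: count distinct values in a named column of a TSV file.
# Usage: count_unique_in_column FILE COLUMN_NAME
count_unique_in_column() {
    local file="$1" column="$2"
    awk -F'\t' -v col="$column" '
        NR == 1 {
            # Locate the requested column in the header row
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        !seen[$idx]++ { n++ }    # count each value the first time it appears
        END { if (idx) print n + 0 }
    ' "$file"
}

# Example (the "Case ID" column name is an assumption):
#   count_unique_in_column "$project_dir/data/gdc_sample_sheet.2020-07-02.tsv" "Case ID"
```

If a single row lists several cases in one field, this sketch counts that combined string as one value, so the result should be read as a rough check rather than an exact case count.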
Download the expression files from the manifest on Helix:
module load gdc-client
mkdir "$project_dir/data/all_gene_expression_files_in_target"
cd !!:1
gdc-client download -m "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt"
Extract the resulting compressed files and link to them from a single folder $project_dir/data/all_gene_expression_files_in_target/links:
mkdir links
cd !!:1
for file in $(find ../ -iname "*.gz"); do gunzip "$file"; done
for file in $(find ../ -type f | grep -v "/logs/\|/annotations.txt"); do ln -s $file; done
ln -s "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt" MANIFEST.txt
Note that
for file in $(ls | grep -v MANIFEST.txt); do echo $file | awk -v FS="." '{print $1}'; done | sort -u | wc -l
shows that, ostensibly, there are 2,481 unique expression files (independent of normalization). This is just based on the filenames, and is not actually correct.
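The filename-based count can also be written without parsing the output of ls. The following is a minimal sketch of the same idea; the helper name count_unique_prefixes is hypothetical. Like the one-liner, it looks only at the text before the first dot in each filename, so it inherits the same limitation and its result should not be taken as the true number of distinct expression files.

```shell
# Hypothetical helper: count distinct filename prefixes (the text before the
# first ".") among the entries of a directory, skipping MANIFEST.txt.
# Usage: count_unique_prefixes [DIR]   (DIR defaults to the current directory)
count_unique_prefixes() {
    local dir="${1:-.}" entry name
    for entry in "$dir"/*; do
        # Skip the literal pattern left behind when the directory is empty
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        name="${entry##*/}"                 # strip the directory part
        [ "$name" = "MANIFEST.txt" ] && continue
        printf '%s\n' "${name%%.*}"         # keep text before the first dot
    done | sort -u | wc -l
}

# Example, run from inside the links directory:
#   count_unique_prefixes .
```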
Start an interactive allocation, using, e.g.,
sinteractive --mem=40g # --mem=20g may be fine
Go through the Python Jupyter notebook /data/BIDS-HPC/private/projects/dmi2/checkout/main.ipynb. Use the conda environment /data/BIDS-HPC/public/software/conda/envs/r_env. (Note that this environment contains pandas version 1.1.0, whereas Biowulf's default python module has pandas version 0.24.2, which is insufficient.) See here for more notes on the environment.
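Before opening the notebook, it may help to confirm which pandas version the active environment provides. The following is a minimal sketch. It assumes a sort that supports -V (as GNU sort does) and uses 1.1.0, the version in r_env, as the threshold; the true minimum required by the notebook is not stated here. The helper names version_at_least and check_pandas_version are hypothetical.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on "sort -V" (version sort), assumed to be available.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: report whether the pandas seen by the current "python"
# meets a threshold (default 1.1.0, an assumed threshold taken from r_env).
check_pandas_version() {
    local required="${1:-1.1.0}" found
    found="$(python -c 'import pandas; print(pandas.__version__)' 2>/dev/null)" || {
        echo "pandas could not be imported by the current python" >&2
        return 2
    }
    if version_at_least "$found" "$required"; then
        echo "pandas $found is at least $required"
    else
        echo "pandas $found is older than $required" >&2
        return 1
    fi
}

# Example, after activating the conda environment:
#   check_pandas_version
```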
