PDGSC Analysis Pipeline

Introduction

A collection of scripts used for the analysis and quality control (QC) of the PDGSC data, along with some scripts that are useful for other analyses.

Summary of the PDGSC pipeline

To reproduce the PDGSC analysis pipeline, do the following:

Clone this GitHub repository
Install required/useful tools

bash ./pdgsc/scripts/shell/install_tools.sh

Download analysis files (VCF etc)

bash ./pdgsc/scripts/shell/download_analysis_files.sh

Generate depth metrics and calculate individual-level depth statistics

bash ./pdgsc/scripts/shell/get_ad_info.sh

Filter out poor quality samples and perform GWAS-style individual-level QC

bash ./pdgsc/scripts/shell/individual_qc.sh

Match ADSP controls to samples from cohorts with no controls

bash ./pdgsc/scripts/shell/casecontrol_matching.sh

Perform variant QC and generate covariates for analysis

bash ./pdgsc/scripts/shell/variant_qc_generate_covariates.sh

Run Rvtests to generate covariance files

bash ./pdgsc/scripts/shell/run_rvtests.sh

Run Raremetals to meta-analyse

bash ./pdgsc/scripts/shell/run_raremetals.sh

Do some post-processing

bash ./pdgsc/scripts/shell/subset_results_postqc.sh

That's it!
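The script steps above can also be chained so that a failure in one step stops the run before later steps consume incomplete output. The following is a minimal sketch, not part of the repository: `PIPELINE_STEPS` and `run_pipeline` are illustrative names, and it assumes the scripts take no arguments, as in the commands listed above.

```shell
# Sketch: run the pipeline scripts in order and stop at the first failure.
# PIPELINE_STEPS and run_pipeline are illustrative names (assumptions).
PIPELINE_STEPS="
install_tools.sh
download_analysis_files.sh
get_ad_info.sh
individual_qc.sh
casecontrol_matching.sh
variant_qc_generate_covariates.sh
run_rvtests.sh
run_raremetals.sh
subset_results_postqc.sh
"

run_pipeline() {
  # First argument: directory holding the scripts (defaults to the cloned repo).
  local script_dir="${1:-./pdgsc/scripts/shell}"
  local step
  for step in ${PIPELINE_STEPS}; do
    echo "[$(date '+%F %T')] Running ${step}" >&2
    if ! bash "${script_dir}/${step}"; then
      echo "Step ${step} failed; stopping." >&2
      return 1
    fi
  done
  echo "All steps finished." >&2
}

# Usage (assumes the repository was cloned into the current directory):
#   run_pipeline ./pdgsc/scripts/shell
```

Running the steps one at a time, as listed above, is still the better choice when you want to inspect the intermediate QC output before continuing.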

Flowchart etc here

Useful scripts for exome analyses

If you want to analyse one or more candidate genes of interest, the following script might be useful. It extracts the gene of interest, keeps only the samples and variants that are well covered in the gene, generates an annotated table of all the variants found in the gene, and performs burden tests on the gene.

bash ./pdgsc/scripts/shell/extract_annotated_variants_for_gene_of_interest.sh

Some general notes for beginners running analyses on the cloud

To connect to Google Cloud from your computer, first install the Google Cloud SDK by following the instructions for your operating system here: https://cloud.google.com/sdk/docs/downloads-interactive

Then, to log in and initialise your account, run:

gcloud init

You should receive a prompt to log in to your Google account, and to connect your account to the PDGSC project. If this doesn't happen, you might need to ask to be added to the project.
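To confirm that the login worked and that a project is selected, something like the following sketch can help. `check_gcloud_setup` is an illustrative helper name, not part of the repository; it relies only on `gcloud config get-value project` and `gcloud auth list`.

```shell
# Sketch: verify that gcloud is installed and a project is active.
# check_gcloud_setup is an illustrative name (assumption).
check_gcloud_setup() {
  if ! command -v gcloud >/dev/null 2>&1; then
    echo "gcloud not found on PATH; install the Google Cloud SDK first." >&2
    return 1
  fi
  local project
  project="$(gcloud config get-value project 2>/dev/null)"
  if [ -z "${project}" ] || [ "${project}" = "(unset)" ]; then
    echo "No active project; run 'gcloud init' again or ask to be added." >&2
    return 1
  fi
  echo "Active project: ${project}"
  # List the credentialed accounts; the active one is marked in the output.
  gcloud auth list
}

# Usage:
#   check_gcloud_setup
```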

Once you are all set up, you can start a virtual machine. Pick a name for your machine, the boot disk size you think you'll need, and a machine type. A list of machine types and their specs can be found here: https://cloud.google.com/compute/docs/machine-types

Then run the following:

VM_NAME="type your machine name here"
DISK_SIZE="type your disk size here"
MACHINE_TYPE="type your machine type here"

gcloud compute instances create "${VM_NAME}" --zone us-central1-f --image-family ubuntu-1804-lts --image-project ubuntu-os-cloud --machine-type "${MACHINE_TYPE}" --maintenance-policy MIGRATE --boot-disk-size "${DISK_SIZE}" --boot-disk-type pd-standard --boot-disk-device-name "${VM_NAME}"

Or if you are using the Google SDK on Windows, then:

SET VM_NAME="type your machine name here"
SET DISK_SIZE="type your disk size here"
SET MACHINE_TYPE="type your machine type here"

gcloud compute instances create %VM_NAME% --zone us-central1-f --image-family ubuntu-1804-lts --image-project ubuntu-os-cloud  --machine-type %MACHINE_TYPE% --maintenance-policy MIGRATE --boot-disk-size %DISK_SIZE% --boot-disk-type pd-standard --boot-disk-device-name %VM_NAME%
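On Linux or macOS, it can be worth checking the machine name before running the create command, because Compute Engine only accepts names of 1 to 63 characters made of lowercase letters, digits and hyphens, starting with a letter and not ending with a hyphen. The sketch below is illustrative only: `valid_vm_name` is a hypothetical helper, and the example name, disk size and machine type are assumptions to be replaced with your own choices.

```shell
# Sketch: check a VM name against Compute Engine's naming rules.
# valid_vm_name is an illustrative helper (assumption), not part of the repository.
valid_vm_name() {
  local name="$1"
  # 1-63 characters: lowercase letters, digits, hyphens; must start with a
  # letter and must not end with a hyphen.
  [ "${#name}" -le 63 ] &&
    printf '%s\n' "${name}" | grep -Eq '^[a-z]([-a-z0-9]*[a-z0-9])?$'
}

# Example values (assumptions; adjust to your analysis):
VM_NAME="pdgsc-analysis-1"
DISK_SIZE="200GB"
MACHINE_TYPE="n1-standard-8"

if valid_vm_name "${VM_NAME}"; then
  echo "OK: ${VM_NAME} (${MACHINE_TYPE}, ${DISK_SIZE})"
else
  echo "Invalid VM name: ${VM_NAME}" >&2
fi
```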

Once you've successfully set up a virtual machine, you can connect to it by running:

gcloud compute ssh ${VM_NAME}

Or on Windows:

gcloud compute ssh %VM_NAME%

And you're good to go! Once you are done with your analysis, remember to delete your virtual machine either by navigating to the virtual machines section on your Google Cloud Dashboard, or by running the following inside the virtual machine:

gcloud compute instances delete ${VM_NAME}

It might be a good idea to add this to the end of your analysis script, so that the virtual machine is deleted once your analysis is finished and is not left running for no reason. Remember to save the output/results of your analysis before you do this though! You can do this by uploading the files into the bucket by running:

OUTPUT_TO_BE_SAVED="name of the file you want to save in the bucket"
BUCKET_ADDRESS="the location in the bucket where you want to save your file"

gsutil -m cp ${OUTPUT_TO_BE_SAVED} ${BUCKET_ADDRESS}
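If the delete command goes at the end of an analysis script, it is safer to delete the machine only after the upload has succeeded. A minimal sketch of that ordering follows; `save_and_delete` is an illustrative name, the zone default matches the create command above, and the file and bucket path in the usage line are placeholders.

```shell
# Sketch: upload results first, and delete the VM only if the upload worked.
# save_and_delete is an illustrative name (assumption), not part of the repository.
save_and_delete() {
  local output="$1" bucket="$2" vm="$3" zone="${4:-us-central1-f}"
  if ! gsutil -m cp "${output}" "${bucket}"; then
    echo "Upload failed; keeping ${vm} so the results are not lost." >&2
    return 1
  fi
  # --quiet skips the interactive confirmation prompt, which would otherwise
  # block an unattended script.
  gcloud compute instances delete "${vm}" --zone "${zone}" --quiet
}

# Usage (placeholder file and bucket path):
#   save_and_delete "${OUTPUT_TO_BE_SAVED}" "${BUCKET_ADDRESS}" "${VM_NAME}"
```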

If you're planning to run scripts from this repository, it is probably easiest to clone the repository onto the VM. It can also be useful to install some basic tools that are often needed for analysis right after setting up the VM (you can edit the install_tools.sh script to add or remove tools according to your needs). Both steps can be done by running:

git clone https://github.com/ipdgc/pdgsc.git
bash ./pdgsc/scripts/shell/install_tools.sh
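When the same setup commands are rerun on an existing VM, `git clone` fails because the target directory already exists. A small guard such as the following sketch avoids that; `clone_if_missing` is an illustrative name and not part of the repository.

```shell
# Sketch: clone the repository only if it is not already present.
# clone_if_missing is an illustrative name (assumption).
clone_if_missing() {
  local url="$1" dir="$2"
  if [ -d "${dir}/.git" ]; then
    echo "Repository already present in ${dir}; skipping clone."
  else
    git clone "${url}" "${dir}"
  fi
}

# Usage:
#   clone_if_missing https://github.com/ipdgc/pdgsc.git ./pdgsc
#   bash ./pdgsc/scripts/shell/install_tools.sh
```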
