Week 10: Recap

Objectives

Recap and revision of what you've learnt this term
Reproducing analyses in either R or Python

Gene set enrichment

Let's do some analyses.

Download these files into your working directory:

data
helper.R

Run this to install/load libraries

source("helper.R") 
load("recap.Rdata")

Gene set enrichment (GSE) in R

A method to identify properties of genes or proteins that are over-represented. GSE is a (relatively) straightforward method to sumamrize the results of your experiment, in particualr when you believe there is some link or association to a known phenotype (e.g., enrichment for dopamine receptors in Parkinson's disease). Uses statistical approaches to identify significantly enriched groups of genes, and the main annotation database is the Gene Ontology (GO) and the Molecular Signatures database (MSigDB).

The most well known method is GSEA: http://www.pnas.org/content/102/43/15545.short, but it can be a little complicated to run on other data (mostly for gene expression experiments) and interpret (old dispute here https://www.ncbi.nlm.nih.gov/pubmed/20048385).
There have been many others, and the simplest is a hypergeometric test to measure overlap.
One such method in R is enrichR: https://cran.r-project.org/web/packages/enrichR/vignettes/enrichR.html

install.packages("enrichR")
library(enrichR)
dbs <- listEnrichrDbs()
dbs <- c("GO_Molecular_Function_2015", "GO_Cellular_Component_2015", "GO_Biological_Process_2015")
enriched <- enrichr(c("Runx1", "Gfi1", "Gfi1b", "Spi1", "Gata1", "Kdr"), dbs)

Or, we can build a gene by function matrix and perform our own gene set enrichment (function is in helper.R)

library(EGAD)
annot = make_annotations( (GO.mouse[,c(1,3)][GO.mouse[,4]!="IEA",] ), unique(GO.mouse[,1]), unique(GO.mouse[,3]) )    
enriched2 <- gene_set_enrichment(c("Runx1", "Gfi1", "Gfi1b", "Spi1", "Gata1", "Kdr"), annot, voc)

How do the results compare?

Multifunctionality

One worry we might have with GSE is gene multifunctionality (genes having many functions).
Multifunctional genes will return many functions in an enrichment assessment (not very useful!).
We can assess our results and genes by calculating how multifunctional genes and gene sets/functions are using the EGAD package

multifunc_assessment <- calculate_multifunc(annot)
plot(multifunc_assessment[,4])
auc_mf <- auc_multifunc(annot, multifunc_assessment[,4])
names(auc_mf) = colnames(annot)
hist <- plot_distribution(auc_mf, xlab="AUROC", med=FALSE, avg=FALSE)

How multifunctional are our results?

m = match( enriched2[,1], names(auc_mf))
hist <- plot_distribution(auc_mf[m], xlab="AUROC", med=FALSE, avg=FALSE)

Replicating a figure in R

Let's take a look at this paper: https://genome.cshlp.org/content/22/4/602

They've made their data available: https://genome.cshlp.org/content/suppl/2012/01/03/gr.130468.111.DC1/Supplemental.Database.primateRNAseq.zip
Let's try reproducing the heatmap (Supplementary figure 5). https://genome.cshlp.org/content/suppl/2012/01/03/gr.130468.111.DC1/SupplementalFigures_08_09_11.pdf

Get data

download.file(url="https://genome.cshlp.org/content/suppl/2012/01/03/gr.130468.111.DC1/Supplemental.Database.primateRNAseq.zip", destfile="primateRNAseq.zip")
unzip("primateRNAseq.zip", files = NULL, list = FALSE, overwrite = TRUE, junkpaths = FALSE, exdir = "primateRNAseq", unzip = "internal", setTimes = FALSE)
unzip("primateRNAseq/Gene.expression.data.zip", files = NULL, list = FALSE, overwrite = TRUE, junkpaths = FALSE, exdir = "primateRNAseq/", unzip = "internal", setTimes = FALSE)
gene_expression_file = "primateRNAseq/Gene.expression.data/Normalized.expression.data.txt" 
exprs = read.table(gene_expression_file, header=T, row.names=1) 
samples = matrix(unlist(strsplit(colnames(exprs)[33:94] , "_" ) ) , byrow=T, ncol=2)

Clean up data

exprs.means = t(sapply(1:dim(exprs)[1], function(i) tapply(  as.numeric(exprs[i,33:94]), samples[,1], mean, na.rm=T) ))
samples.sub = names(tapply(  as.numeric(exprs[1,33:94]), samples[,1], mean, na.rm=T)) 
colnames(exprs.means) = samples.sub
rownames(exprs.means) = rownames(exprs)

Data analysis

Which genes are present in enough species?
Which species does not have enough data?

colSums(is.na(exprs.means ))
filt.species = which.max(colSums(is.na(exprs.means )))
gene.subset  = rowSums(is.na(exprs.means[,-filt.species ] )  ) >5  

samples.cor = cor(exprs.means[gene.subset,-filt.species ], m="s", use="p")

Plot/graph

heatmap.3(samples.cor)
samples.cor = cor(exprs.means[ ,-filt.species], m="s", use="p")
heatmap.3(samples.cor)

Back to the homepage

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

recap.md

recap.md

Week 10: Recap

Objectives

Gene set enrichment

Gene set enrichment (GSE) in R

Multifunctionality

Replicating a figure in R

Files

recap.md

Latest commit

History

recap.md

File metadata and controls

Week 10: Recap

Objectives

Gene set enrichment

Gene set enrichment (GSE) in R

Multifunctionality

Replicating a figure in R