Skip to content

Latest commit

 

History

History
106 lines (86 loc) · 4.78 KB

recap.md

File metadata and controls

106 lines (86 loc) · 4.78 KB

Week 10: Recap

Objectives

  • Recap and revision of what you've learnt this term
  • Reproducing analyses in either R or Python

Gene set enrichment

Let's do some analyses.

Download these files into your working directory:

Run this to install/load libraries

source("helper.R") 
load("recap.Rdata")

Gene set enrichment (GSE) in R

A method to identify properties of genes or proteins that are over-represented. GSE is a (relatively) straightforward method to sumamrize the results of your experiment, in particualr when you believe there is some link or association to a known phenotype (e.g., enrichment for dopamine receptors in Parkinson's disease). Uses statistical approaches to identify significantly enriched groups of genes, and the main annotation database is the Gene Ontology (GO) and the Molecular Signatures database (MSigDB).

install.packages("enrichR")
library(enrichR)
dbs <- listEnrichrDbs()
dbs <- c("GO_Molecular_Function_2015", "GO_Cellular_Component_2015", "GO_Biological_Process_2015")
enriched <- enrichr(c("Runx1", "Gfi1", "Gfi1b", "Spi1", "Gata1", "Kdr"), dbs)
  • Or, we can build a gene by function matrix and perform our own gene set enrichment (function is in helper.R)
library(EGAD)
annot = make_annotations( (GO.mouse[,c(1,3)][GO.mouse[,4]!="IEA",] ), unique(GO.mouse[,1]), unique(GO.mouse[,3]) )    
enriched2 <- gene_set_enrichment(c("Runx1", "Gfi1", "Gfi1b", "Spi1", "Gata1", "Kdr"), annot, voc)
  • How do the results compare?

Multifunctionality

  • One worry we might have with GSE is gene multifunctionality (genes having many functions).
  • Multifunctional genes will return many functions in an enrichment assessment (not very useful!).
  • We can assess our results and genes by calculating how multifunctional genes and gene sets/functions are using the EGAD package
multifunc_assessment <- calculate_multifunc(annot)
plot(multifunc_assessment[,4])
auc_mf <- auc_multifunc(annot, multifunc_assessment[,4])
names(auc_mf) = colnames(annot)
hist <- plot_distribution(auc_mf, xlab="AUROC", med=FALSE, avg=FALSE)
  • How multifunctional are our results?
m = match( enriched2[,1], names(auc_mf))
hist <- plot_distribution(auc_mf[m], xlab="AUROC", med=FALSE, avg=FALSE)

Replicating a figure in R

Let's take a look at this paper: https://genome.cshlp.org/content/22/4/602

  1. Get data
download.file(url="https://genome.cshlp.org/content/suppl/2012/01/03/gr.130468.111.DC1/Supplemental.Database.primateRNAseq.zip", destfile="primateRNAseq.zip")
unzip("primateRNAseq.zip", files = NULL, list = FALSE, overwrite = TRUE, junkpaths = FALSE, exdir = "primateRNAseq", unzip = "internal", setTimes = FALSE)
unzip("primateRNAseq/Gene.expression.data.zip", files = NULL, list = FALSE, overwrite = TRUE, junkpaths = FALSE, exdir = "primateRNAseq/", unzip = "internal", setTimes = FALSE)
gene_expression_file = "primateRNAseq/Gene.expression.data/Normalized.expression.data.txt" 
exprs = read.table(gene_expression_file, header=T, row.names=1) 
samples = matrix(unlist(strsplit(colnames(exprs)[33:94] , "_" ) ) , byrow=T, ncol=2)
  1. Clean up data
exprs.means = t(sapply(1:dim(exprs)[1], function(i) tapply(  as.numeric(exprs[i,33:94]), samples[,1], mean, na.rm=T) ))
samples.sub = names(tapply(  as.numeric(exprs[1,33:94]), samples[,1], mean, na.rm=T)) 
colnames(exprs.means) = samples.sub
rownames(exprs.means) = rownames(exprs)
  1. Data analysis
  • Which genes are present in enough species?
  • Which species does not have enough data?
colSums(is.na(exprs.means ))
filt.species = which.max(colSums(is.na(exprs.means )))
gene.subset  = rowSums(is.na(exprs.means[,-filt.species ] )  ) >5  

samples.cor = cor(exprs.means[gene.subset,-filt.species ], m="s", use="p")
  1. Plot/graph
heatmap.3(samples.cor)
samples.cor = cor(exprs.means[ ,-filt.species], m="s", use="p")
heatmap.3(samples.cor)

Back to the homepage