Skip to content
197 changes: 197 additions & 0 deletions vignettes/analyzing-cna-data.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,197 @@
---
title: "Analyzing CNA Data with {gnomeR}"
output: rmarkdown::html_vignette
author: Allison Reiner, Caroline Kostrzewa
vignette: >
%\VignetteIndexEntry{CNA_Vignette}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

## Introduction

Somatic copy number alterations (sCNA, CNA, CNV) are a predominant source of variation in cancer genomes, which contribute to cancer development, progression and resistance to therapy. A CNA refers to a change in the number of copies of a genomic segment, termed amplifications and deletions. Such changes can be broad (entire chromosome arms) or focal (small chromosomal regions larger than 1000 base pairs). Oncogene activation often occurs due to amplification & TSG deactivation often due to deletions (Nabavi & Zare, 2022). In cancer studies, it is useful to examine the similarities and differences in copy number profiles between clinically meaningful patient subgroups or across study cohorts (See Figure 1 of https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6145837/).

In this vignette, we will introduce the forms of copy number data available on cBioPortal and walk through an example of analyzing CNA data.

## Setting up

Make sure {gnomeR} is installed & loaded. We will also use dplyr for the purposes of this vignette.

```{r setup, message = FALSE, warning=FALSE, echo=T, results='hide'}
install.packages("devtools",repos="http://cran.us.r-project.org")
devtools::install_github("MSKCC-Epi-Bio/gnomeR")
install.packages("BiocManager",repos = "http://cran.us.r-project.org")

library(cluster)
library(BiocManager)
library(gnomeR)
library(tidyverse)

BiocManager::install("iClusterPlus")
library(iClusterPlus)

BiocManager::install("DNAcopy")
library(DNAcopy)

library(cluster)
library(ComplexHeatmap)

install.packages("remotes",repos = "http://cran.us.r-project.org")
remotes::install_github("mskcc/facets")
library(facets)

```

To explore copy number data, we will use a sample of 500 patients from cBioPortal. Data on mutations, CNA, and structural variants for these patients are also available in the gnomeR package (`gnomeR::mutations`, `gnomeR::cna`, `gnomeR::sv`).

Note: To access data from cBioPortal, you can use the {cbioportalR} package:
https://github.com/karissawhiting/cbioportalR.


## Data Formats
There are 2 forms of copy number data in cBioPortal.
1. Discrete copy number data
2. Segmentation data (seg.means)

### How to obtain each form
1. Discrete copy number data
- use cbioportalR::get_cna_by_sample()
2. Segmentation data
- Retrieve from cbioportal.mskcc.org by inputting desired patient or sample IDs
- Select the "CN Segments" tab on the upper left hand side of the screen
- Click the download button with the cloud symbol, above the Save SVG button
- cbioportalR::get_segmeans_by_sample() coming soon

## Key Information about each form
1. Discrete copy number:
- Provides gene-level copy number alteration status


```{r, echo=FALSE}
allowed_cna_levels <- tibble::tribble(
~detailed_coding, ~numeric_coding, ~simplified_coding,
"neutral", "0", "neutral",
"homozygous deletion", "-2", "deletion",
"loh", "-1.5", "deletion",
"hemizygous deletion", "-1", "deletion",
"gain", "1", "amplification",
"high level amplification", "2", "amplification")

allowed_cna_levels %>% knitr::kable()

```

<br>

**IMPORTANT**: Genes that do not have a CNA are not returned from the get_cna_by_sample function
- Use cbioportalR::get_panel_by_sample() to obtain which version of IMPACT panel on which your samples were run

2. Segmentation data:
- Loc.start = genomic location where segment starts
- Loc.end = genomic location where segment ends
- Num.mark = number of SNPs in the segment
- Seg.mean = mean log2(copy number in tumor sample/ copy number in pooled normal sample) for a given segment

## How seg.means data gets translated to gene-level discrete copy number data
- In cbioportal, the threshold for a segment to qualify as part of a gene-level copy number alteration is |0.2|
- Gene level copy number alterations must be “signed out” by the clinical bioinformatics team/pathologist
- Can see sign out notes under the “clinical” tab in cBioPortal (column may be called "SO Comments")

## Steps to create a copy number heatmap across samples using segmentation data

![Copy Number Heatmap Example(Hieronymus et al.,2018)](H:/Biostatistics/ReinerA1/Projects/CNA_Info/Images/CNAheatmapex.jpg)
<br>
Copy Number Heatmap Example (Hieronymus et al.,2018)
<br>
<br>


**1. Obtain seg.means data**

```{r}
segdata <- gnomeR::seg

segdata <- segdata %>% mutate(chrom=as.numeric(chrom))
```

**2. Reduce dimensions of the matrix and convert into segment by sample format**

- We recommend the CNregions function from iClusterPlus package to obtain the minimum set of segments that capture the CNA landscape
- All samples must have segments on every chromosome
- Epsilon=0.005 usually yields satisfactory results

```{r}
segdata_condensed <- iClusterPlus::CNregions(seg=segdata)
```

**3. Obtain purity, ploidy and clonal heterogeneity-adjusted copy number estimates**

- We recommend using the emcncf function from FACETS package

```{r}
# Read in SNP read count matrix
# set.seed(1234)
# datafile = system.file("extdata", "stomach.csv.gz", package="facets")
# rcmat = facets::readSnpMatrix(datafile)

# Preprocess SNP matric
# xx = preProcSample(rcmat)

# Process pre-processed data
# oo=procSample(xx,cval=150)

# Calculate tumor purity, ploidy, cell fraction
# fit=emcncf(oo)
```


**4. Plot copy number data using ComplexHeatmap package**
```{r, echo=T, warning=F,message=F}
# Create matrix of ssample by segment data
subber <- as.matrix(segdata_condensed)

# Specify how you want chromosomes to be ordered in the heatmap
noms <- gsub("\\.[0-9]*-[0-9]*","",colnames(subber))
noms <- as.data.frame(noms)%>% dplyr::mutate(chr_num=parse_number(noms))

# Create heatmap
hm_fin <- ComplexHeatmap::Heatmap(
mat = subber,
name = "log2ratio",
column_order = order(as.numeric(gsub("column", "", colnames(subber)))),
column_split=noms$chr_num,
column_gap = unit(1.5, "mm"),
column_title_gp = gpar(fontsize = 6),
#heatmap_legend_param = Legend(title_gp = gpar(col = "red", fontsize = 6),
#labels_gp = gpar(col = "red", fontsize = 4)),
heatmap_legend_param = list(labels_gp = gpar(fontsize = 4),title_gp = gpar(col = "black", fontsize = 6),grid_width = unit(0.25, "cm")),
show_column_names = F,
show_row_names = F,
row_order = order(as.numeric(gsub("row", "", rownames(subber)))),
border = TRUE,
border_gp = grid::gpar(col = "gray50")

)


```

```{r echo=F, include=F}
png("H:/Biostatistics/ReinerA1/Projects/CNA_Info/Images/examp_heatmap5.png",width=3.5,height=2,units="in",res=300)
draw_hm_fin <- draw(hm_fin)
dev.off()
```


![Heatmap Example](H:/Biostatistics/ReinerA1/Projects/CNA_Info/Images/examp_heatmap5.png)

- each row is a sample; each column is a segment