final-project-report.qmd

---
title: "Skin Cancer Dataset Analysis"
author: "Conor Heffron"
format: 
  html:
   embed-resources: true
   code-fold: true
  pdf: default
---

# Part 1: Analysis

## Summary

- The dataset I have selected for this assignment is from the *CBio Portal for Cancer Genomics* website.

- The full data set is in tar / gzip format that can be downloaded directly from <https://www.cbioportal.org/study/summary?id=mel_mskimpact_2020>. 

- I will focus on the clinical data files (data_clinical_patient.txt, data_clinical_sample.txt, and data_mutations.txt), primarily the clinical patient file. These are tab delimited plain text files that can be accessed after the tar is decompressed.

- There is a mix of categorical (factors), continuous and numerical features (some to be inferred by R code).

- This data represents the targeted sequencing (MSK-IMPACT) of 696 melanoma tumour / normal pairs.

- There is a corresponding published medical journal entitled "Therapeutic Implications of Detecting MAPK-Activating Alterations in Cutaneous and Unknown Primary Melanomas".

### Reference:

- Shoushtari AN, Chatila WK, Arora A, Sanchez-Vega F, Kantheti HS, Rojas Zamalloa JA, et al. . Therapeutic implications of detecting MAPK-activating alterations in cutaneous and unknown primary melanomas. Clin Cancer Res. (2021) 27:2226--35. doi: 10.1158/1078-0432.CCR-20-4189, PMID: - DOI - PMC - PubMed: <https://pubmed.ncbi.nlm.nih.gov/33509808/>

- See PMCID link for journal and `Table 1` referenced throughout `Part 1` and `Part 2` of this report, the direct link to this is <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8046739/#S7title>.

- Skin related & unclassified primary melanomas often conceal changes that can activate the MAPK pathway. 
The disease pathway describes the mechanisms by which it evolves & either persists or is resolved. 
The MAPK pathway activation is a prominent step in melanoma pathogenesis.

- The targeted capture of multi-gene sequencing can detect oncogenic RTK-RAS-MAPK pathway alterations (can cause development of a tumour or tumours) in almost all cutaneous and unknown primary melanomas.

## Load packages & re-usable source code / functions.

```{r}
source('final_proj_imports.R')
source('final_proj_utils.R')# Import and load source code from local R script
```

## Prepare working directory and data for import

- Set working directory to `path_wd`
- The working directory should contain .qmd, *.R and tar/gzip files before executing `final-project.qmd` file.

```{r}
path_wd <- "/Users/conorheffron/Downloads/data-prog-with-r-final-project/"

# setwd('Path/To/Your/Folder')
setwd(path_wd)
```

## Decompress Data Directory

- Untar data downloaded directly from <https://www.cbioportal.org/study/summary?id=mel_mskimpact_2020>

![](images/download.png){width="100%"}

### Untar / decompress tarball / gzip file

```{r}
dir_name <- "mel_mskimpact_2020"
extension <- ".tar.gz"
untar(paste(dir_name, extension, sep=""), files = NULL, list = FALSE, exdir = ".",
      extras = NULL, verbose = FALSE,
      restore_times =  TRUE,
      support_old_tars = Sys.getenv("R_SUPPORT_OLD_TARS", FALSE),
      tar = Sys.getenv("TAR"))
```

### Import Relevant Data with genric function from `final_proj_utils.R`

```{r}
data_files <- import_skin_cancer_data(dir_name, "^data_")
```

### List data sets that were impoprted

```{r}
names(data_files)
```

### Merge data sets (clinical patient, clinical sample & data gene panel matrix as they have the same number of observations at 696 each)

```{r}
# Merge by patient ID 
df_clin <- merge(x = data_files$data_clinical_patient, y = data_files$data_clinical_sample, by = "PATIENT_ID", all = TRUE)

# Merge / Join by Sample ID
df_clin <- merge(x = df_clin, y = data_files$data_gene_panel_matrix, by = "SAMPLE_ID", all = TRUE)
```

### Data manipulation (add columns, add proportion percentages, omit NAs etc.)

```{r}
# Add column to represent proportion as a percentage for the male:female samples
df_clin <- df_clin %>%
    group_by(SEX) %>%
    mutate(gender_prop =  100 * round(sum(n() / length(df_clin$SEX)), 2)) %>% 
    ungroup

# Add column to represent proportion as a percentage  for the primary cancer site 
# samples
df_clin <- df_clin %>%
    group_by(DMT_PRIMARY_SITE) %>%
    mutate(site_prop =  100 * round(sum(n() / length(df_clin$DMT_PRIMARY_SITE)), 2)) %>% 
    ungroup

# Add column to represent proportion as a percentage the primary cancer site 
# classified (generic site) samples
df_clin <- df_clin %>%
    group_by(DMT_PRIMARY_SITE_CLASSIFIED) %>%
    mutate(site_classified_prop =  100 * round(sum(n() / length(df_clin$DMT_PRIMARY_SITE_CLASSIFIED)), 2)) %>% 
    ungroup

# extract OS years from OS months column
df_clin$OS_YEARS <- round(df_clin$OS_MONTHS / 12)

# estimate patient current age by adding age at diagnosis + OS years
df_clin$EST_CURR_AGE <- df_clin$OS_YEARS + df_clin$AGE_AT_INITIAL_DIAGNOSIS

# Extract 2 columns from OS_STATUS where OS_STATUS_INT is 1 or 0 
# and OS_STATUS_GRP is Living or Deceased
df_clin <- df_clin %>% separate_wider_delim(OS_STATUS, ":", names = c("OS_STATUS_INT", "OS_STATUS_GRP"))

# Add column to represent age group - this column is an ordered factor based on estimated current age of patient
df_clin$age_group = cut(df_clin$EST_CURR_AGE, c(0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 95, Inf), c("0-5", "5-15", "15-25", "25-35", "35-45", "45-55", "55-65", "65-75", "75-85", "85-95", ">95"), include.lowest=TRUE, ordered = T)

# There NA values but they do not hamper any calculation that I have replicated so commenting out this line
# na.omit(df_clin)
```

### Journal Statistics (Part 1): Tables / Numeric Summaries of Data

- See link to article `Table 1` statistics that I have replicated in R code. This is the 1st section of data analytics <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8046739/#S7title>.

![](images/table%201%20part%201.png){width="100%"}

### Get main data frame `df_clin` dimensions (696 observations x 39 variables)
- This equates to 696 patients with sample data and gene matrix values mapped.

```{r}
dim(df_clin)
```

### Create summary table with count by group aggregation that adds count variable `n` and count proportion as a percentage (variable `Freq`).

- This is done by calling a genric function from `final_proj_utils.R`
- The grouping field here is `CANCER_TYPE_DETAILED`
- The default settings will be used for optional `count_agg` parameters

```{r}
count_agg(df_clin, "CANCER_TYPE_DETAILED")
```

### Get rounded median value for `AGE_AT_INITIAL_DIAGNOSIS`
- The value of 61 matches `Table 1` from referenced journal.

```{r}
round(median(df_clin$AGE_AT_INITIAL_DIAGNOSIS))
```

### Create summary table with count by group aggregation that adds count variable `n` and count proportion as a percentage (variable `Freq`).

- This is done by calling a genric function from `final_proj_utils.R`
- The grouping field here is `SEX`
- Male samples make up two thirds of the data for analysis

```{r}
count_agg(df_clin, "SEX")
```

### Create summary table with count by group aggregation that adds count variable `n` and count proportion as a percentage (variable `Freq`).

- This is done by calling a genric function from `final_proj_utils.R`
- The grouping field here is `SAMPLE_TYPE`
- 84% of samples are not from the primary or neighbouring / local recurrence site of melanoma

```{r}
count_agg(df_clin, "SAMPLE_TYPE")
```

## Journal Statistics (Part 2)

- See link to article `Table 1` statistics that I have replicated in R code. This is the first section of data analytics in this report - <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8046739/#S7title>.

![](images/table%201%20part%202.png){width="100%"}. 

### Create summary table with count by group aggregation that adds count variable `n` and count proportion as a percentage (variable `Freq`). 
- This is done by calling a genric function from `final_proj_utils.R`
- The grouping field here is `SIMPLIFIED_METASTATIC_SITE`
- The most common sites for cancers to metastasize include the lungs, liver, bones and brain. 
- From this sample set, `Regional Local Recurrences (LR) / In-Transit / Lymph Node (LN)` is by far the most common at 39% followed by `Soft Tissue / Distant LN` and `Lung` at 14% and 13% respectively.
- `NA` here corresponds to `Primary` site in `Table 1`

```{r}
count_agg(df_clin, "SIMPLIFIED_METASTATIC_SITE")
```

### Create summary table with count by group aggregation that adds count variable `n` and count proportion as a percentage (variable `Freq`).

-   This is done by calling a genric function from `final_proj_utils.R`
-   The grouping field here is `DMT_PRIMARY_SITE_CLASSIFIED`
-   The author applies a filter to this data where `CANCER_TYPE_DETAILED == "Cutaneous Melanoma"` before creating summary statistics. This makes sense because we want to rule out the patients with unknown primary cancer site/area as this will skew the result.
- There are 140 patients (20%) with unknown DMT primary cancer site which we want to rule out before reporting the proportion of patients per generic primary cancer site.
-   The `count_agg` function has an optional parameter for digits (number of decimals) which has default value of `0` but here we explicitly set `digits = 2` for a more accurate `Freq` (group size proportion as a percentage based on count variable `n`) to match the counts reported.
- There is a relatively equaly split here where each site has a share between 18% and 32%. 
- `Trunk` and `Head` are the top 2 results from this summary.

```{r}
count_agg(df_clin |> filter(CANCER_TYPE_DETAILED == "Cutaneous Melanoma"), "DMT_PRIMARY_SITE_CLASSIFIED", digits = 2)
```

## Further Exploratory Analysis (*NOT* based on `Table 1`)

### Create summary table with count by group aggregation that adds count variable `n` and count proportion as a percentage (variable `Freq`).

- This is done by calling a genric function from `final_proj_utils.R`
- The grouping field here is `DMT_PRIMARY_SITE`
- The `count_agg` has an optional parameter for digits (number of decimals) which has default value of `0` but here we explicitly set `digits = 2` for a more accurate `Freq` (group size proportion as a percentage based on count variable `n`)
- This summary table goes into more detail on where the site is, giving further detail on the lower and upper extremities in particular.
- Unfortunately, `Unknown Primary` observations account for 20% of the data set.
- However there are more than 20 records for neck, face, scalp, arm/shoulder, leg/hip and trunk. Trunk again is the top result at 26% which may be expected to a certain extent as it is a large component of the human body.

```{r}
count_agg(df_clin, "DMT_PRIMARY_SITE", digits = 2)
```
### Create summary table with count by group aggregation that adds count variable `n` and count proportion as a percentage (variable `Freq`).

-   This is done by calling a genric function from `final_proj_utils.R`
-   This is the first summary of mutations data
-   The grouping field here is `Hugo_Symbol`
-   The `count_agg` has an optional parameter for n_results (number of results returned after sorting results set by `n` in descending order) which by default will return first 20 results (includes each group). There are many `Hugo_Symbol` values and I am only interested in the top 15 occurences so we set `n_results = 15`
-   The `count_agg` has an optional parameter for digits (number of decimals) which has default value of `0` but here we explicitly set `digits = 2` for a more accurate `Freq` (group size proportion as a percentage based on count variable `n`)
- One of the most common changes in melanoma cells is a mutation in the BRAF oncogene, which is found in approx. half of diagnosed melanomas. 
- Other genes that can be affected in melanoma include NRAS, CDKN2A & NF1 (Usually only 1 of these genes is affected).
- We can see BRAF (# 5), NRAS (# 15), CDKN2A (# 4) and NF1 (# 7)
- The top 3 in this data set consist of: TERT, PTPRT and GRIN2A
- It appears the BRAF mutation is associated with aggressive tumor features & can increase the risk of persistent and recurrent disease
- I think this is a promising subset of data to examine further for discerning data features or even prinicpal component analysis in the mutations data frame which has a column of values per patient.

```{r}
count_agg(data_files$data_mutations, "Hugo_Symbol", 15, 2, F)
```
### Get dimensions of mutations data set
- We can see there are 19,479 observations or records
- We can see there is now 122 variables or columns
- The top 5 symbols in order of count were: TERT, PTPRT, GRIN2A, *CDKN2A*, *BRAF*
```{r}
dim(data_files$data_mutations)
```

## Graphical Summaries / Data Visualisation
- I would like to create some data visualisations to further explore the summary information from `Table 1`.
- I am interested in the split for `age_group` (Ordinal factor) and `SEX` (Male or Female - categporical).
- The majority of samples are `Male` and lie between age groups `45-55` -> `75-85`

```{r}
ggplotly(ggplot(df_clin, aes(x=age_group, fill = SEX)) +
  geom_bar())
```

- I am interested in the split for `OS_STATUS` (Living or Deceased) and `age_group` (Ordinal factor)
- The largest group of deceased patient samples are in the age bracket of `65-75`.
```{r}
ggplotly(ggplot(df_clin, aes(x=age_group, fill = OS_STATUS_GRP)) +
  geom_bar())
```
- I am interested in the split for `OS_STATUS_GRP` (Living or Deceased) and `DMT_PRIMARY_SITE_CLASSIFIED` (categorical feature)
- This shows a lot of samples for living patients that were diagnosed with skin cancer in the head area, especially 65-85.
- The area with a high proportion of deceased patients were diagnosed with skin cancer in trunk or torso area across all age groups followed by the Head area. In the 15-25 age group, the deceased appear to all have Head diagnosis as primary site of melanoma.
```{r}
ggplotly(ggplot(df_clin |> filter(CANCER_TYPE_DETAILED == "Cutaneous Melanoma"), aes(x=OS_STATUS_GRP, fill = DMT_PRIMARY_SITE_CLASSIFIED)) +
  geom_bar() + facet_wrap(~age_group))
```
- I am interested in the split for `OS_STATUS_GRP` (Living or Deceased) and `EST_CURR_AGE` (continuous variable that represents the estimated current age of patient - age at diagnosis + OS_MONTHS)
- There are a lot more patients living with skin cancer diagnosis than there are deceased in this data set.
- However, there is a relatively high count of deceased with estimated current age between 40 and 80 years of age (count 174).
```{r}
ggplotly(ggplot(df_clin, aes(x=EST_CURR_AGE, fill = OS_STATUS_GRP)) +
  geom_histogram(binwidth = 20))
```
- I am interested in the split for `OS_STATUS_GRP` (Living or Deceased) and `OS_YEARS` (continuous variable that represents the number of years after initial diagnosis)
- This chart is replicated for each primary cancer site (`DMT_PRIMARY_SITE`) as a sub plot or graph.
- All patients diagnosed with skin cancer in hand/lip and site unspecified as primary site of melanoma are deceased.
- Patients diagnosed with skin cancer where foot was primary site has a 50:50 split for Living:Deceased (3:3).
- There are patients that had diagnosis in the leg/hip and trunk and primary site living long after the initial diagnosis from this data set (up to and just over 30 years).
- There is an outlier for one patient that is living after initial diagnosis of neck skin cancer 43 years after initial diagnosis and counting with estimated current age of 68yrs old.
```{r}
ggplotly(ggplot(df_clin, aes(x=OS_YEARS, y = EST_CURR_AGE, color = OS_STATUS_GRP )) +
  geom_point() +
  facet_wrap(~DMT_PRIMARY_SITE))
```

- I would like to see if there are any outliers based on primary cancer site and gender so I will use a box plot this time.
- I am interested in the split between `SEX` (Male or Female) and `AGE_AT_INITIAL_DIAGNOSIS` (continuous variable that represents the patients age at initial diagnosis)
- This chart is replicated for each generic primary cancer site (`DMT_PRIMARY_SITE_CLASSIFIED`) as a sub plot or graph.
- `Head` and `Upper Extremity` graphs show outliers for males that were diagnosed earlier than normal for Head/Upper Extremity skin cancer.
- This is odd because males on average are diagnosed later with skin cancer except for `Lower Extremity`. This may be a sign that the data is not great. In addition, There are lot more male samples than female. Conversely, it is possible that the samples for female patients may be lacking.
```{r}
ggplotly(ggplot(df_clin, aes(x=SEX, y = AGE_AT_INITIAL_DIAGNOSIS, color = SEX)) +
  geom_boxplot() + facet_wrap(~DMT_PRIMARY_SITE_CLASSIFIED)) 
```

# Part 2: R Packages

- I have selected the `testthat` and `covr` packages for this section. 

## testthat: Unit Testing for R
- The testthat package is for writing unit tests against R code (packages / functions).
- 'testthat' is a testing framework that is easy to learn and use, and integrates with RStudio.

## covr: Test Coverage for Packages
- Track and report code coverage for your package and there is an option to upload the results to a coverage service. The covr package is used to generate reports that highlight the level of unit test coverage by percenatge and number of lines, logic branches (if else etc.) covered. It helps to identify missed lines of code or code that does not have any unit test coverage. It provides information on the code and its structure.

- Clean code has unit tests and unit tests also encourages code improvements and refactoring with piece of mind that the code is still functioning to a good level especially if unit test coverage is at a high percentage such as 85%+.

Please see test_final_proj_utils.R (uni tests / test suite) and final_proj_utils.R (source code) files attached.

See `test_final_proj_utils.R` for a test suite containing 5 tests that roll out to 12 test cases.
The structure of the tests is the de-facto "Given, When, Then" formula for test driven development (TDD).
- `Given` a set of test variables/constants/parameters for a given test scenario
- `When` we call a method or unit of code with these configurations / arguments
- `Then` we assert or check the result is what we expect. This is the most important step as we need to test for something tangible. Running a test and not verifying the result would be pointless.

- I have demonstrated some of the methods available from the `testthat` package for asserting the result of the unit test is as expected with functions: `expect_null`, `expect_false`, `expect_equal`, and `expect_failure`. There are other notable functions where you can expect s3 or s4 objects amongst others. This is very easy to use and from previous experience in unit testing with other languages, `testthat` package in R matches up quite well with JUnit / PyTest etc.

- I have used the `covr::file_coverage` method to run the unit tests with coverage to allow for code coverage report generation but you also run the tests as a stand alone directly from the test script in RStudio as long as `testthat` library is loaded.See screenshots for further information on this.
```{r}
# Run tests with coverage
covr <- file_coverage("final_proj_utils.R", "test_final_proj_utils.R")

# Generate HTML coverage report
report(covr)
```

- The report function runs the coverage for the source and test code files passed as arguments. This can also be done for a whole package via `package_coverage`. There are methods to convert the report to cobertura code coverage or sonarqube reports in XML format which often plug-in to build servers like Jenkins for code coverage and static code scanning analysis. More information on the functions available at <https://cran.r-project.org/web/packages/covr/covr.pdf>.

- The report generated is in HTML format and can be viewed directly in RStudio or via internet browser such as chrome.

- The test script with the `testthat` framework installed and loaded allows 'Run Tests' option on top RHS of the code editor. The bottom left pain shows the results of running test suite (pass/fail/issues).

![](images/rstudio run tests2.png){width="100%"}. 
![](images/rstudio run tests1.png){width="100%"}. 

- The bottom right 'viewer' displays report generated by 'report' function.

![](images/covr%201.png){width="100%"}. 

![](images/covr%202.png){width="100%"}. 


# Part 3: Functions / Programming

## S3 Class / Objects / Methods

- Create `AggregationArgs` class from object instance `aggregation_drv_class`
- The code creates a list called `aggregation_drv_class` with 4 elements: df, col, n_results and digits.
- The values of these elements are `df_clin`, "DRIVER_CLASS", 3, and 1 respectively.
- Then, the class() function is used to assign a class to the aggregation_drv_class list.
- The class is set to "AggregationArgs".
```{r}
aggregation_drv_class <- list(df = df_clin, col = "DRIVER_CLASS", n_results=3, digits = 1)
class(aggregation_drv_class) <- "AggregationArgs"
```
- Define `countAgg` function / method.
- The function takes one argument, "object".
- The function then uses the "UseMethod" function to dispatch the method that matches the class of the "object" argument.
- This is an example of OOP in R, where different methods can be defined for different classes of objects.
- This code defines a function called `countAgg` that takes an object as its argument.
- Within the function, it uses the `count_agg` function that I previously defined in `final_proj_itls.R` to print a summary table with count and Frequency info for the data frame and `group_by` column passed as arguments, it also takes 2 parameters (`n_results` and `digits`) that are normally optional (more strict version).
```{r}
countAgg <- function(object) {
  UseMethod("countAgg")
}

countAgg.AggregationArgs <- function(object) {
  cat("Entering countAgg.AggregationArgs method with params", object$col, object$n_results, object$digits)
  count_agg(object$df, object$col, n_results = object$n_results, digits = object$digits)
}
```
- Initially countAgg prints the object components except for the data frame as it is a large object (too much information for debugging).
- Then it prints to console the summary table as expected in `knitr::kable format`
```{r}
countAgg(aggregation_drv_class)
```
- This time I call the same S3 function with different arguments which is straight forward as the OOP part is already defined now and reusable.
```{r}
aggregation_tmb <- list(df = df_clin, col = "TMB_NONSYNONYMOUS", n_results=5, digits = 2)
class(aggregation_tmb) <- "AggregationArgs"
countAgg(aggregation_tmb)
```
Finally, I want to verify that the method is S3.
```{r}
isS3method("countAgg.AggregationArgs")

isS3stdGeneric(countAgg)
```

## S4 Class / Objects / Methods
- The setup for S4 is a bit different.
- First I want to create the equivalent S4 class with defined slots (object or list components by type instead of value)
```{r}
setClass("countAggregation", slots=list(df='data.frame', col="character", n_results="numeric", digits = "numeric"))
```
- Then I create an instance of my class object
- I want to verify the object is a valid object and that it is a valid instance of an S4 class
```{r}
obj1 <- new("countAggregation",df=df_clin, col="SIMPLIFIED_METASTATIC_SITE", n_results=6, digits=4)
is.object(obj1)
isS4(obj1)
```

- Instead of the `$` / dollar sign symbol in S3, I can use the `@` symbol to access components of an S4 object
```{r}
obj1@col
obj1@n_results
obj1@digits
```

- Now, I would like to set the method to be called to show the output of this aggregation.
```{r}
setMethod("show", "countAggregation", function(object) {
  cat("Entering show.countAggregation method with params", object@col, object@n_results, object@digits)
  print(count_agg(object@df, object@col, n_results = object@n_results, digits = object@digits))
})
```
Now I would like to test the method is called when I show/print the object to console
```{r}
obj1
```
- Instead of creating another object, I am just going to modify the existing S4 object to generate another summary table for `DRIVER_CLASS`
```{r}
obj1@col <- "DRIVER_CLASS"
```
Finally, I would like to test the method is called when I show/print the *UPDATED* object to console
```{r}
obj1
```
- This is an example of how useful OOP programming is for re-usable code whether using S3/S4. S4 is newer and cleaner so I think I prefer S4 (less code after setup).