@@ -0,0 +1,65 @@
---
title: "Monte Carlo Simulation"
author: "Emilia Löscher"
date: '2022-09-14'
output:
pdf_document: default
html_document: default
---

# 8
## Monte Carlo simulation exercise

Perform a small simulation that does the following:
a. Sample 100 samples from a standard normal distribution.

```{r}
library(ggplot2)
library(dplyr)
set.seed(1409)
N <- 100  # number of samples
n <- 1000 # size of each sample
# each of the N columns holds one sample of size n
samples <- matrix(rnorm(n * N), nrow = n, ncol = N)
```

b. For each of these samples, calculate the following statistics for the mean (the formulas used are sketched after this list):

- absolute bias
- standard error
- lower bound of the 95% confidence interval
- upper bound of the 95% confidence interval
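
For a sample $x_1, \dots, x_n$ with mean $\bar{x}$ and standard deviation $s$, and a true population mean of $\mu = 0$, these statistics follow the usual normal-approximation formulas:

$$
\text{absolute bias} = |\bar{x} - \mu|, \qquad \text{SE} = \frac{s}{\sqrt{n}}, \qquad \text{95\% CI} = \bar{x} \pm 1.96 \cdot \text{SE}.
$$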

```{r}
# pre-allocate one result vector entry per sample
sample_mean <- abs_bias <- std_error <- low_CI <- up_CI <- numeric(N)
for(i in 1:ncol(samples)){
  sample_mean[i] <- mean(samples[, i])
  abs_bias[i]    <- abs(sample_mean[i] - 0) # the true mean is 0
  std_error[i]   <- sd(samples[, i]) / sqrt(n)
  low_CI[i]      <- sample_mean[i] - 1.96 * std_error[i]
  up_CI[i]       <- sample_mean[i] + 1.96 * std_error[i]
}

mcs_list <- list(mean = sample_mean, abs_bias = abs_bias,
                 std_error = std_error, low_CI = low_CI, up_CI = up_CI)
```
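
A quick look at the first few computed values (just a convenience peek at the vectors filled in above):

```{r}
head(data.frame(mean = sample_mean, abs_bias, std_error, low_CI, up_CI))
```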


c. Create a plot that demonstrates the following:
“A replication of the procedure that generates a 95% confidence interval that is centered around the sample mean would cover the population value at least 95 out of 100 times” (Neyman, 1934)
```{r}
data <- data.frame(Sample = 1:N, mean = sample_mean, lower = low_CI, upper = up_CI)

# plot each sample's mean and 95% CI; the red line marks the population mean of 0
ggplot(data, aes(mean, Sample)) +
  geom_point() +
  geom_errorbar(aes(xmin = lower, xmax = upper)) +
  geom_vline(xintercept = 0, color = "red", size = 0.5)
```

d. Present a table containing all simulated samples for which the resulting confidence interval does not contain the population value.
```{r}
data %>% filter(lower > 0 | upper < 0)
```

Five of the 100 intervals do not contain the population value, which matches the expected 5\% miss rate of a 95\% confidence interval.
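
As a quick sanity check, the empirical coverage can also be computed directly from the stored bounds; it should print the complement of the five misses above, i.e. 0.95 here.

```{r}
# proportion of the 100 intervals that contain the true mean of 0
mean(low_CI < 0 & up_CI > 0)
```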


Binary file not shown.
@@ -0,0 +1,5 @@
Personal Information

I am currently learning Dutch,
love travelling by train to places where you can go for long hikes,
and I have been singing in different choirs since I was five years old.
Binary file added Practicals/Emilia Loscher/Assignment 5.zip
Binary file not shown.
Binary file added Practicals/Emilia Loscher/Beamer_Emilia.pdf
Binary file not shown.
91 changes: 91 additions & 0 deletions Practicals/Emilia Loscher/Beamer_Emilia.tex
@@ -0,0 +1,91 @@
\documentclass[aspectratio=169]{beamer}

\usecolortheme{beaver}
\usepackage{geometry}
\usepackage{amsmath}
\setbeamertemplate{navigation symbols}{}

\begin{document}

\title{Example document to recreate with \texttt{beamer} in \LaTeX}
\author{Emilia Löscher}
\date[My data]{\vspace{1cm} \\\large FALL 2022\\ \large Markup Languages and Reproducible Programming in Statistics}

%title page

\maketitle


%outline
\begin{frame}{Outline}
Working with equations \\
\begin{itemize}
\item[] Aligning the same equations\\
\item[] Omit equation numbering \\
\item[] Ugly alignment\\
\end{itemize}

\vspace{1cm}
Discussion
\end{frame}

\begin{frame}{Working with equations}
We define a set of equations as
\begin{gather}
a = b + c^2,\\
a - c^2 = b,\\
\text{left side = right side},\\
\text{left side + something} \geq \text{right side},
\end{gather}


for all something $>$ 0.
\end{frame}

\begin{frame}{Aligning the same equations}
Aligning the equations at the equals sign gives a much clearer view of the placement
of the separate equation components.
\begin{align}
a &= b + c^2,\\
a - c^2 &= b,\\
\text{left side} &= \text{right side},\\
\text{left side + something} &\geq \text{right side},
\end{align}

\end{frame}


\begin{frame}{Omit equation numbering}
Alternatively, the equation numbering can be omitted.
\begin{align*}
a &= b + c^2\\
a - c^2 &= b\\
\text{left side} &= \text{right side}\\
\text{left side + something} &\geq \text{right side}
\end{align*}

\end{frame}

\begin{frame}{Ugly alignment}
Some components do not look good when aligned, especially equations with different
heights and spacing. For example,
\begin{align}
E &= mc^2,\\
m &= \frac{E}{c^2},\\
c &= \sqrt{\frac{E}{m}}
\end{align}
Take that into account.
\end{frame}

\begin{frame}{Discussion}
This is where you’d normally give your audience a recap of your talk, where you could
discuss, for example, the following:
\begin{itemize}
\item Your main findings
\item The consequences of your main findings
\item Things to do
\item Any other business not currently investigated, but related to your talk
\end{itemize}

\end{frame}
\end{document}
183 changes: 183 additions & 0 deletions Practicals/Emilia Loscher/E2_EmiliaLoscher.Rmd
@@ -0,0 +1,183 @@
---
title: "Exercise 2"
author: "Emilia Löscher"
date: "2022-09-28"
output:
pdf_document: default
html_document: default
---

# Aim
The aim is to compare different classification approaches based on the MSE, with the help of cross-validation. The models we are going to compare are the following:

- logistic regression (lr)
- linear discriminant analysis (lda)
- random forest (rf)

# Set up
We are using the cardiovascular disease data set of 253 patients (from the SLV course). The data set is split into five parts, so that during cross-validation the training set always consists of 80\% and the test set of 20\% of the original data. As this split is random, it requires random number generation, which is why a seed is set below.

Using cross-validation, the three models are fit to the respective training set, leaving out one fifth of the data in turn. Predictions are then made for the 20\% of the data that make up the test set. By comparing these predictions to the "true" outcomes, the MSE (defined below) is obtained.

Since the logistic regression is scored in two ways (with a 0.5 cut-off and with Bernoulli draws, see the code below), we obtain a 4x5 matrix of results. By taking the mean across the five folds for each method, we can identify the method that performs best, i.e. the one with the lowest mean MSE.
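
Here, for the $n_{\text{test}}$ test outcomes $y_i$ and predictions $\hat{y}_i$ (both coded 0/1, as in the code below), the MSE is

$$
\text{MSE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (y_i - \hat{y}_i)^2,
$$

which for 0/1 classifications is simply the misclassification rate.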

# Setting the seed

```{r}
set.seed(2010)
```

# Packages required and data
```{r}
library(ggplot2)
library(randomForest)
library(MASS)
library(tidyverse)
library(utils)


cardio <- read.csv("cardiovascular_treatment.csv") %>%
  mutate(severity = as.factor(severity),
         gender   = as.factor(gender),
         dose     = as.factor(dose),
         response = as.factor(response))
```
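
As a small standalone sketch of how the five folds are assigned inside `myCV()` below (assuming the 253 rows mentioned above):

```{r}
# randomly assign each of the 253 rows to one of 5 folds
folds <- sample(rep(1:5, length.out = 253))
table(folds) # three folds of 51 and two folds of 50 observations
```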

# Code for the cross validation and MSE

```{r}
#MSE function
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)

myCV <- function(k = 5, data = cardio){
  mse_mat <- matrix(NA, ncol = k, nrow = 4)

  # randomly assign every observation to one of k (nearly) equal-sized folds
  data$split <- sample(rep(1:k, length.out = nrow(data)))
  for(i in 1:k){
    # split the data for each fold into train and test,
    # dropping the fold indicator so it is not used as a predictor
    data_train <- data[data$split != i, names(data) != "split"]
    data_test  <- data[data$split == i, names(data) != "split"]

    # fit the three models
    lr  <- glm(response ~ ., data = data_train, family = "binomial")
    lda <- lda(response ~ ., data = data_train)
    rf  <- randomForest(response ~ ., data = data_train)

    # predicted probabilities from the logistic regression
    lr_prob <- predict(lr, newdata = data_test, type = "response")

    # four prediction rules, all coded 0/1
    pred_list <- list()
    pred_list[[1]] <- ifelse(lr_prob < 0.5, 0, 1)           # 0.5 cut-off
    pred_list[[2]] <- as.numeric(rbernoulli(nrow(data_test), p = lr_prob))
    pred_list[[3]] <- as.numeric(predict(lda, newdata = data_test)$class) - 1
    pred_list[[4]] <- as.numeric(predict(rf, newdata = data_test)) - 1

    # "true" outcomes, recoded from factor levels to 0/1
    y_true <- as.numeric(data_test$response) - 1
    for(j in 1:4){
      mse_mat[j, i] <- mse(y_true, pred_list[[j]])
    }
  }
  return(mse_mat)
}
mse_res <- myCV()
rownames(mse_res) <- c("lr_cut", "lr_bern", "lda", "rf")

rowMeans(mse_res)
```

# Presentation, visualization, and discussion of results
```{r}
results <- data.frame(names = rownames(mse_res), mse = rowMeans(mse_res))
ggplot(results, aes(x = names, y = mse, fill = names)) +
  geom_bar(stat = "identity") +
  labs(
    x = "Method",
    y = "Mean squared error",
    title = "Comparing classification method prediction performance")
```
It can be seen that, regarding the MSE, the linear discriminant analysis performs best (MSE = 0.3756), with the random forest close behind (MSE = 0.3871). The logistic regression variants perform worst in this run, and it makes little difference whether classification uses a 0.5 cut-off or Bernoulli draws with the predicted probabilities: the MSEs are 1.8227 (cut-off) and 1.8388 (Bernoulli).
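
For completeness, the best-performing method according to the mean MSE can also be extracted programmatically (a small convenience sketch using the `results` data frame from above):

```{r}
# method with the lowest mean MSE across the five folds
results$names[which.min(results$mse)]
```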

# Replication of the analysis
We now run the same code as before with another seed to see whether we obtain the same results. The values will be slightly different, but the performance ranking of the methods should stay the same.

```{r}
set.seed(2110)

library(ggplot2)
library(randomForest)
library(MASS)
library(tidyverse)
library(utils)

cardio <- read.csv("cardiovascular_treatment.csv") %>%
  mutate(severity = as.factor(severity),
         gender   = as.factor(gender),
         dose     = as.factor(dose),
         response = as.factor(response))

#MSE function
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)

myCV <- function(k = 5, data = cardio){
  mse_mat <- matrix(NA, ncol = k, nrow = 4)

  # randomly assign every observation to one of k (nearly) equal-sized folds
  data$split <- sample(rep(1:k, length.out = nrow(data)))
  for(i in 1:k){
    # split the data for each fold into train and test,
    # dropping the fold indicator so it is not used as a predictor
    data_train <- data[data$split != i, names(data) != "split"]
    data_test  <- data[data$split == i, names(data) != "split"]

    # fit the three models
    lr  <- glm(response ~ ., data = data_train, family = "binomial")
    lda <- lda(response ~ ., data = data_train)
    rf  <- randomForest(response ~ ., data = data_train)

    # predicted probabilities from the logistic regression
    lr_prob <- predict(lr, newdata = data_test, type = "response")

    # four prediction rules, all coded 0/1
    pred_list <- list()
    pred_list[[1]] <- ifelse(lr_prob < 0.5, 0, 1)           # 0.5 cut-off
    pred_list[[2]] <- as.numeric(rbernoulli(nrow(data_test), p = lr_prob))
    pred_list[[3]] <- as.numeric(predict(lda, newdata = data_test)$class) - 1
    pred_list[[4]] <- as.numeric(predict(rf, newdata = data_test)) - 1

    # "true" outcomes, recoded from factor levels to 0/1
    y_true <- as.numeric(data_test$response) - 1
    for(j in 1:4){
      mse_mat[j, i] <- mse(y_true, pred_list[[j]])
    }
  }
  return(mse_mat)
}
mse_res <- myCV()
rownames(mse_res) <- c("lr_cut", "lr_bern", "lda", "rf")

print(rowMeans(mse_res))

results <- data.frame(names = rownames(mse_res), mse = rowMeans(mse_res))
ggplot(results, aes(x = names, y = mse, fill = names)) +
  geom_bar(stat = "identity") +
  labs(
    x = "Method",
    y = "Mean squared error",
    title = "Comparing classification method prediction performance")
```

We see that the ranking of the methods is similar, while the MSE values vary somewhat:

| Run       | lr_cut    | lr_bern   | lda       | rf        |
|-----------|-----------|-----------|-----------|-----------|
| Seed 2010 | 1.8226667 | 1.8388235 | 0.3756078 | 0.3871373 |
| Seed 2110 | 1.6040784 | 1.5529412 | 0.3597647 | 0.4466667 |

There is now a larger difference between the linear discriminant analysis and the random forest. The LDA still performs best with respect to the MSE, and the cut-off logistic regression now performs worse than the Bernoulli logistic regression.


# Session info
```{r}
sessionInfo()
```

Binary file added Practicals/Emilia Loscher/E2_EmiliaLoscher.pdf
Binary file not shown.
13 changes: 13 additions & 0 deletions Practicals/Emilia Loscher/Emilia Loscher.Rproj
@@ -0,0 +1,13 @@
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX