@@ -0,0 +1,65 @@
---
title: "Monte Carlo Simulation"
author: "Emilia Löscher"
date: '2022-09-14'
output:
pdf_document: default
html_document: default
---

# 8
## Monte Carlo simulation exercise

Perform a small simulation that does the following:
a. Sample 100 samples from a standard normal distribution.

```{r}
library(ggplot2)
library(dplyr)
set.seed(1409)
N <- 100  # number of samples
n <- 1000 # size of each sample
# each of the N columns holds one sample of size n
samples <- matrix(rnorm(n * N), nrow = n, ncol = N)
```

b. For each of these samples, calculate the following statistics for the mean (the formulas used are sketched after this list):

- absolute bias
- standard error
- lower bound of the 95% confidence interval
- upper bound of the 95% confidence interval
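
For a sample $x_1, \dots, x_n$ with mean $\bar{x}$ and standard deviation $s$, and a true population mean of $\mu = 0$, these statistics follow the usual normal-approximation formulas:

$$
\text{absolute bias} = |\bar{x} - \mu|, \qquad \text{SE} = \frac{s}{\sqrt{n}}, \qquad \text{95\% CI} = \bar{x} \pm 1.96 \cdot \text{SE}.
$$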

```{r}
# pre-allocate one result vector entry per sample
sample_mean <- abs_bias <- std_error <- low_CI <- up_CI <- numeric(N)
for(i in 1:ncol(samples)){
  sample_mean[i] <- mean(samples[, i])
  abs_bias[i]    <- abs(sample_mean[i] - 0) # the true mean is 0
  std_error[i]   <- sd(samples[, i]) / sqrt(n)
  low_CI[i]      <- sample_mean[i] - 1.96 * std_error[i]
  up_CI[i]       <- sample_mean[i] + 1.96 * std_error[i]
}

mcs_list <- list(mean = sample_mean, abs_bias = abs_bias,
                 std_error = std_error, low_CI = low_CI, up_CI = up_CI)
```
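
A quick look at the first few computed values (just a convenience peek at the vectors filled in above):

```{r}
head(data.frame(mean = sample_mean, abs_bias, std_error, low_CI, up_CI))
```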


c. Create a plot that demonstrates the following:
“A replication of the procedure that generates a 95% confidence interval that is centered around the sample mean would cover the population value at least 95 out of 100 times” (Neyman, 1934)
```{r}
data <- data.frame(Sample = 1:N, mean = sample_mean, lower = low_CI, upper = up_CI)

# plot each sample's mean and 95% CI; the red line marks the population mean of 0
ggplot(data, aes(mean, Sample)) +
  geom_point() +
  geom_errorbar(aes(xmin = lower, xmax = upper)) +
  geom_vline(xintercept = 0, color = "red", size = 0.5)
```

d. Present a table containing all simulated samples for which the resulting confidence interval does not contain the population value.
```{r}
data %>% filter(lower > 0 | upper < 0)
```

Five of the 100 intervals do not contain the population value, which matches the expected 5\% miss rate of a 95\% confidence interval.
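
As a quick sanity check, the empirical coverage can also be computed directly from the stored bounds; it should print the complement of the five misses above, i.e. 0.95 here.

```{r}
# proportion of the 100 intervals that contain the true mean of 0
mean(low_CI < 0 & up_CI > 0)
```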


Binary file not shown.
@@ -0,0 +1,5 @@
Personal Information

I am currently learning Dutch,
love travelling by train to places where you can go for long hikes,
and I have been singing in different choirs since I was five years old.
Binary file added Practicals/Emilia Loscher/Assignment 5.zip
Binary file not shown.
Binary file added Practicals/Emilia Loscher/Beamer_Emilia.pdf
Binary file not shown.
91 changes: 91 additions & 0 deletions Practicals/Emilia Loscher/Beamer_Emilia.tex
@@ -0,0 +1,91 @@
\documentclass[aspectratio=169]{beamer}

\usecolortheme{beaver}
\usepackage{geometry}
\usepackage{amsmath}
\setbeamertemplate{navigation symbols}{}

\begin{document}

\title{Example document to recreate with \texttt{beamer} in \LaTeX}
\author{Emilia Löscher}
\date[My data]{\vspace{1cm} \\\large FALL 2022\\ \large Markup Languages and Reproducible Programming in Statistics}

%title page

\maketitle


%outline
\begin{frame}{Outline}
Working with equations \\
\begin{itemize}
\item[] Aligning the same equations\\
\item[] Omit equation numbering \\
\item[] Ugly alignment\\
\end{itemize}

\vspace{1cm}
Discussion
\end{frame}

\begin{frame}{Working with equations}
We define a set of equations as
\begin{gather}
a = b + c^2,\\
a - c^2 = b,\\
\text{left side = right side},\\
\text{left side + something} \geq \text{right side},
\end{gather}


for all something $>$ 0.
\end{frame}

\begin{frame}{Aligning the same equations}
Aligning the equations at the equals sign gives a much clearer view of the placement
of the separate equation components.
\begin{align}
a &= b + c^2,\\
a - c^2 &= b,\\
\text{left side} &= \text{right side},\\
\text{left side + something} &\geq \text{right side},
\end{align}

\end{frame}


\begin{frame}{Omit equation numbering}
Alternatively, the equation numbering can be omitted.
\begin{align*}
a &= b + c^2\\
a - c^2 &= b\\
\text{left side} &= \text{right side}\\
\text{left side + something} &\geq \text{right side}
\end{align*}

\end{frame}

\begin{frame}{Ugly alignment}
Some components do not look good when aligned, especially equations with different
heights and spacing. For example,
\begin{align}
E &= mc^2,\\
m &= \frac{E}{c^2},\\
c &= \sqrt{\frac{E}{m}}
\end{align}
Take that into account.
\end{frame}

\begin{frame}{Discussion}
This is where you’d normally give your audience a recap of your talk, where you could
discuss, for example, the following:
\begin{itemize}
\item Your main findings
\item The consequences of your main findings
\item Things to do
\item Any other business not currently investigated, but related to your talk
\end{itemize}

\end{frame}
\end{document}
183 changes: 183 additions & 0 deletions Practicals/Emilia Loscher/E2_EmiliaLoscher.Rmd
@@ -0,0 +1,183 @@
---
title: "Exercise 2"
author: "Emilia Löscher"
date: "2022-09-28"
output:
pdf_document: default
html_document: default
---

# Aim
The aim is to compare different classification approaches based on the MSE, with the help of cross-validation. The models we are going to compare are the following:

- logistic regression (lr)
- linear discriminant analysis (lda)
- random forest (rf)

# Set up
We are using the cardiovascular disease data set of 253 patients (from the SLV course). The data set is split into five parts, so that during cross-validation the training set always consists of 80\% and the test set of 20\% of the original data. As this split is random, it requires random number generation, which is why a seed is set below.

Using cross-validation, the three models are fit to the respective training set, leaving out one fifth of the data in turn. Predictions are then made for the 20\% of the data that make up the test set. By comparing these predictions to the "true" outcomes, the MSE (defined below) is obtained.

Since the logistic regression is scored in two ways (with a 0.5 cut-off and with Bernoulli draws, see the code below), we obtain a 4x5 matrix of results. By taking the mean across the five folds for each method, we can identify the method that performs best, i.e. the one with the lowest mean MSE.
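
Here, for the $n_{\text{test}}$ test outcomes $y_i$ and predictions $\hat{y}_i$ (both coded 0/1, as in the code below), the MSE is

$$
\text{MSE} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} (y_i - \hat{y}_i)^2,
$$

which for 0/1 classifications is simply the misclassification rate.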

# Setting the seed

```{r}
set.seed(2010)
```

# Packages required and data
```{r}
library(ggplot2)
library(randomForest)
library(MASS)
library(tidyverse)
library(utils)


cardio <- read.csv("cardiovascular_treatment.csv") %>%
  mutate(severity = as.factor(severity),
         gender   = as.factor(gender),
         dose     = as.factor(dose),
         response = as.factor(response))
```
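
As a small standalone sketch of how the five folds are assigned inside `myCV()` below (assuming the 253 rows mentioned above):

```{r}
# randomly assign each of the 253 rows to one of 5 folds
folds <- sample(rep(1:5, length.out = 253))
table(folds) # three folds of 51 and two folds of 50 observations
```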

# Code for the cross validation and MSE

```{r}
#MSE function
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)

myCV <- function(k = 5, data = cardio){
  mse_mat <- matrix(NA, ncol = k, nrow = 4)

  # randomly assign every observation to one of k (nearly) equal-sized folds
  data$split <- sample(rep(1:k, length.out = nrow(data)))
  for(i in 1:k){
    # split the data for each fold into train and test,
    # dropping the fold indicator so it is not used as a predictor
    data_train <- data[data$split != i, names(data) != "split"]
    data_test  <- data[data$split == i, names(data) != "split"]

    # fit the three models
    lr  <- glm(response ~ ., data = data_train, family = "binomial")
    lda <- lda(response ~ ., data = data_train)
    rf  <- randomForest(response ~ ., data = data_train)

    # predicted probabilities from the logistic regression
    lr_prob <- predict(lr, newdata = data_test, type = "response")

    # four prediction rules, all coded 0/1
    pred_list <- list()
    pred_list[[1]] <- ifelse(lr_prob < 0.5, 0, 1)           # 0.5 cut-off
    pred_list[[2]] <- as.numeric(rbernoulli(nrow(data_test), p = lr_prob))
    pred_list[[3]] <- as.numeric(predict(lda, newdata = data_test)$class) - 1
    pred_list[[4]] <- as.numeric(predict(rf, newdata = data_test)) - 1

    # "true" outcomes, recoded from factor levels to 0/1
    y_true <- as.numeric(data_test$response) - 1
    for(j in 1:4){
      mse_mat[j, i] <- mse(y_true, pred_list[[j]])
    }
  }
  return(mse_mat)
}
mse_res <- myCV()
rownames(mse_res) <- c("lr_cut", "lr_bern", "lda", "rf")

rowMeans(mse_res)
```

# Presentation, visualization, and discussion of results
```{r}
results <- data.frame(names = rownames(mse_res), mse = rowMeans(mse_res))
ggplot(results, aes(x = names, y = mse, fill = names)) +
  geom_bar(stat = "identity") +
  labs(
    x = "Method",
    y = "Mean squared error",
    title = "Comparing classification method prediction performance")
```
It can be seen that, regarding the MSE, the linear discriminant analysis performs best (MSE = 0.3756), with the random forest close behind (MSE = 0.3871). The logistic regression variants perform worst in this run, and it makes little difference whether classification uses a 0.5 cut-off or Bernoulli draws with the predicted probabilities: the MSEs are 1.8227 (cut-off) and 1.8388 (Bernoulli).
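
For completeness, the best-performing method according to the mean MSE can also be extracted programmatically (a small convenience sketch using the `results` data frame from above):

```{r}
# method with the lowest mean MSE across the five folds
results$names[which.min(results$mse)]
```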

# Replication of the analysis
We now run the same code as before with another seed to see whether we obtain the same results. The values will be slightly different, but the performance ranking of the methods should stay the same.

```{r}
set.seed(2110)

library(ggplot2)
library(randomForest)
library(MASS)
library(tidyverse)
library(utils)

cardio <- read.csv("cardiovascular_treatment.csv") %>%
  mutate(severity = as.factor(severity),
         gender   = as.factor(gender),
         dose     = as.factor(dose),
         response = as.factor(response))

#MSE function
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)

myCV <- function(k = 5, data = cardio){
  mse_mat <- matrix(NA, ncol = k, nrow = 4)

  # randomly assign every observation to one of k (nearly) equal-sized folds
  data$split <- sample(rep(1:k, length.out = nrow(data)))
  for(i in 1:k){
    # split the data for each fold into train and test,
    # dropping the fold indicator so it is not used as a predictor
    data_train <- data[data$split != i, names(data) != "split"]
    data_test  <- data[data$split == i, names(data) != "split"]

    # fit the three models
    lr  <- glm(response ~ ., data = data_train, family = "binomial")
    lda <- lda(response ~ ., data = data_train)
    rf  <- randomForest(response ~ ., data = data_train)

    # predicted probabilities from the logistic regression
    lr_prob <- predict(lr, newdata = data_test, type = "response")

    # four prediction rules, all coded 0/1
    pred_list <- list()
    pred_list[[1]] <- ifelse(lr_prob < 0.5, 0, 1)           # 0.5 cut-off
    pred_list[[2]] <- as.numeric(rbernoulli(nrow(data_test), p = lr_prob))
    pred_list[[3]] <- as.numeric(predict(lda, newdata = data_test)$class) - 1
    pred_list[[4]] <- as.numeric(predict(rf, newdata = data_test)) - 1

    # "true" outcomes, recoded from factor levels to 0/1
    y_true <- as.numeric(data_test$response) - 1
    for(j in 1:4){
      mse_mat[j, i] <- mse(y_true, pred_list[[j]])
    }
  }
  return(mse_mat)
}
mse_res <- myCV()
rownames(mse_res) <- c("lr_cut", "lr_bern", "lda", "rf")

print(rowMeans(mse_res))

results <- data.frame(names = rownames(mse_res), mse = rowMeans(mse_res))
ggplot(results, aes(x = names, y = mse, fill = names)) +
  geom_bar(stat = "identity") +
  labs(
    x = "Method",
    y = "Mean squared error",
    title = "Comparing classification method prediction performance")
```

We see that the ranking of the methods is similar, while the MSE values vary somewhat:

| Run       | lr_cut    | lr_bern   | lda       | rf        |
|-----------|-----------|-----------|-----------|-----------|
| Seed 2010 | 1.8226667 | 1.8388235 | 0.3756078 | 0.3871373 |
| Seed 2110 | 1.6040784 | 1.5529412 | 0.3597647 | 0.4466667 |

There is now a larger difference between the linear discriminant analysis and the random forest. The LDA still performs best with respect to the MSE, and the cut-off logistic regression now performs worse than the Bernoulli logistic regression.


# Session info
```{r}
sessionInfo()
```

Binary file added Practicals/Emilia Loscher/E2_EmiliaLoscher.pdf
Binary file not shown.
13 changes: 13 additions & 0 deletions Practicals/Emilia Loscher/Emilia Loscher.Rproj
@@ -0,0 +1,13 @@
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX