---
title: "Boston Housing"
author: "Tristan-Mihai Radulescu, Hugo Ponthieu, Benoit Planche"
output:
  html_document:
    df_print: paged
---
# **Introduction**
This report analyzes the `BostonHousing` dataset, which contains information about housing in Boston. We will explore the relationships between several variables and the median home value (`medv`).
---
```{r setup, include=FALSE}
# Load necessary libraries
library(ggplot2)
library(knitr)
library(tidyverse)
library(broom)
library(GGally)
library(MASS)
library(randomForest)
library(rsample)
library(dplyr)
library(ipred)
library(caret)
library(rpart)
library(rpart.plot)
knitr::opts_chunk$set(echo = TRUE)
```
```{r, echo=FALSE}
# Load the dataset
BostonHousing <- read.csv("./BostonHousing.csv")
# Display first rows
head(BostonHousing)
# Structure of the dataset
# str(BostonHousing)
```
## **1. Distribution of Key Variables**
We analyze the distribution of home prices (`medv`).
```{r valeur-mediane, echo=FALSE}
ggplot(BostonHousing, aes(x = medv)) +
geom_histogram(fill = "blue", bins = 30, alpha = 0.7) +
labs(title = "Distribution of Home Prices",
x = "MEDV (in $1000s)",
y = "Frequency")
```
---
## **2. Correlations Between Variables**
We compute the correlation matrix and display values associated with `medv`.
```{r, echo=FALSE}
# Compute correlation matrix
cor_matrix <- cor(BostonHousing)
# Display correlations with MEDV
cor_matrix["medv", ]
```
**Observations**:
- **Positive correlation** with `rm` (number of rooms).
- **Negative correlation** with `lstat` (percentage of lower-status population).
- **`tax` and `crim` have a negative relationship** with `medv`.
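To make these observations concrete, here is a quick sketch (our addition) ranking the predictors by the absolute value of their correlation with `medv`:
```{r, echo=FALSE}
# Rank predictors by the absolute strength of their correlation with medv
medv_cor <- cor_matrix["medv", ]
sort(abs(medv_cor[names(medv_cor) != "medv"]), decreasing = TRUE)
```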
### **2.1 Pairs**
To get a first grasp of the data and of the relationships between variables, we can use the `pairs` function.
```{r pairs, echo=FALSE}
pairs(BostonHousing)
```
Even though there are many variables, we can see a positive relationship between `rm` and `medv`, and negative relationships between `lstat` and `medv` and between `crim` and `medv`. We will therefore fit a linear regression for each of these variables.
---
### **2.2 Key Relationships**
```{r nb-piece, echo=FALSE}
# Relationship between RM (number of rooms) and MEDV
ggplot(BostonHousing, aes(x = rm, y = medv)) +
geom_point(alpha = 0.6, color = "blue") +
geom_smooth(method = "lm", col = "red") +
labs(title = "Relationship Between RM and MEDV", x = "Average Number of Rooms", y = "Median Home Value (in $1000s)")
```
Here we can see a clear positive correlation between `rm` and `medv`, despite a few outliers. Also, based on the description of the dataset on Kaggle, some data scientists suspect that the `medv` variable was censored, since `medv` often takes the value 50, which is also its maximum. This property makes us question the reliability of the dataset, but we proceed anyway.
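As a quick check on the censoring hypothesis, a small sketch counting how many observations sit exactly at the apparent cap of 50:
```{r, echo=FALSE}
# Number of observations at the suspected censoring cap
sum(BostonHousing$medv == 50)
```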
```{r pop-pauvre, echo=FALSE}
# Relationship between LSTAT (% lower-status population) and MEDV
ggplot(BostonHousing, aes(x = lstat, y = medv)) +
geom_point(alpha = 0.6, color = "purple") +
geom_smooth(method = "lm", col = "red") +
labs(title = "Relationship Between LSTAT and MEDV", x = "Lower-Status Population (%)", y = "Median Home Value (in $1000s)")
```
Here we can see that the relationship between `lstat` and `medv` is negative. The hypothesis would be that neighborhoods with a higher percentage of lower-status population have lower home values.
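To quantify this relationship, a minimal sketch fitting a simple regression of `medv` on `lstat`; the slope estimate should be negative if the hypothesis holds:
```{r, echo=FALSE}
# Simple linear regression of medv on lstat
summary(lm(medv ~ lstat, data = BostonHousing))
```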
---
## **3. Impact of Crime and Taxation on Home Prices**
### **3.1 Distribution of `medv`, `crim` and `tax`**
```{r tax-crim, echo=FALSE}
ggplot(BostonHousing, aes(x = crim)) +
geom_histogram(fill = "red", bins = 30, alpha = 0.7) +
labs(title = "Distribution of Crime Rate", x = "CRIM", y = "Frequency") +
scale_x_log10()
ggplot(BostonHousing, aes(x = tax)) +
geom_histogram(fill = "blue", bins = 30, alpha = 0.7) +
labs(title = "Distribution of Tax Rate", x = "TAX", y = "Frequency")
```
**Observations**:
- **`crim` is highly skewed**: most neighborhoods have low crime rates, but some areas have extreme values.
- **`tax` shows distinct peaks**, suggesting fixed tax rates in certain areas.
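We can check the second observation by tabulating the most frequent `tax` values; a handful of dominant values would explain the peaks (a quick sketch):
```{r, echo=FALSE}
# Most frequent tax values in the dataset
head(sort(table(BostonHousing$tax), decreasing = TRUE))
```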
---
### **3.2 Relationship Between Crime Rate and Home Prices**
```{r tax-crim2, echo=FALSE}
ggplot(BostonHousing, aes(x = crim, y = medv)) +
geom_point(alpha = 0.5, color = "red") +
geom_smooth(method = "lm", col = "black") +
scale_x_log10() +
labs(title = "Relationship Between CRIM and MEDV",
x = "Crime Rate (log)",
y = "Median Home Value (in $1000s)")
```
**Observation**: Home prices **decrease as crime rates increase**.
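To quantify this on the log scale used in the plot, a quick sketch:
```{r, echo=FALSE}
# Correlation between log10(crim) and medv, matching the plot's log scale
cor(log10(BostonHousing$crim), BostonHousing$medv)
```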
---
### **3.3 Relationship Between Taxation and Home Prices**
```{r tax-crim3, echo=FALSE}
ggplot(BostonHousing, aes(x = tax, y = medv)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "lm", col = "black") +
labs(title = "Relationship Between TAX and MEDV",
x = "Tax Rate",
y = "Median Home Value (in $1000s)")
```
**Observation**: A **higher tax rate is associated with lower home prices**.

To quantify the joint effect of crime and taxation on home prices, we fit a multiple linear regression of `medv` on `crim` and `tax`.
```{r reg-multiple, echo=FALSE}
reg_multi <- lm(medv ~ crim + tax, data = BostonHousing)
summary(reg_multi)
```
**Results**:
- **The coefficients for `crim` and `tax` are negative**, confirming their negative impact on `medv`.
- **p-values < 0.05** indicate these effects are statistically significant.
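As a complement to the p-values, a small sketch showing the 95% confidence intervals of these coefficients; both intervals should lie below zero if the effects are reliably negative:
```{r, echo=FALSE}
# 95% confidence intervals for the intercept, crim and tax coefficients
confint(reg_multi)
```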
---
## 4. Linear regressions
### 4.1 Backward selection
To start, we will use the backward selection algorithm to search for a good linear regression model for predicting `medv`. But first, let's fit a model with all the variables to see what happens.
```{r, echo=FALSE}
model <- lm(medv ~ ., data = BostonHousing)
summary(model)
```
Obviously, some variables are not significant (e.g. `indus` or `age`). So let's use the backward selection algorithm to find a better model.
```{r, echo=FALSE}
backward_model <- step(model, direction = "backward", trace = FALSE)
summary(backward_model)
plot(backward_model)
```
We can see that the model gives a very low overall `p-value` (< 2.2e-16) with a residual standard error of 4.736. This error is hard to interpret on its own because the variables are on very different scales; later we will normalize them so the indicator is easier to read.
However, the Q-Q plot of the model diagnostics shows that the residuals are not normally distributed, which suggests that a plain linear regression is not well suited to our data. Still, we will try the forward selection algorithm to showcase our skills 🕶️.
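To back the Q-Q plot reading with a formal test, here is a quick sketch (our addition, not part of the original analysis) applying the Shapiro-Wilk normality test to the residuals; a small p-value would support the non-normality reading:
```{r, echo=FALSE}
# Shapiro-Wilk test of normality on the backward model's residuals
shapiro.test(residuals(backward_model))
```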
### 4.2 Multiple linear regression with normalized data and backward selection
```{r, echo=FALSE}
# Center and scale every variable (mean 0, standard deviation 1)
normalized_data <- preProcess(BostonHousing, method = c("center", "scale"))
# Apply the transformation to obtain the normalized data frame
normalized_data <- predict(normalized_data, BostonHousing)
```
#### 4.2.1 Boxplots
```{r, echo=FALSE}
boxplot(BostonHousing)
boxplot(normalized_data)
```
The goal of these boxplots is to show how the data is distributed. On the second boxplot we can see that every variable now has a mean of 0 and a comparable range.
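A quick numerical sketch confirming the centering and scaling:
```{r, echo=FALSE}
# After centering and scaling, each variable should have mean ~0 and sd ~1
round(sapply(normalized_data, mean), 3)
round(sapply(normalized_data, sd), 3)
```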
#### 4.2.2 Modeling
```{r, echo=FALSE}
model <- lm(medv ~ ., data = normalized_data)
summary(model)
backward_model <- step(model, direction = "backward", trace = FALSE)
summary(backward_model)
plot(backward_model)
```
On our normalized data, the model didn't change much. The only output that changed is the residual standard error, which is now 0.515 (in standard deviations of `medv`). This value is still quite high, which confirms that this model is not the best for our use case.
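One payoff of normalization is that coefficient magnitudes become directly comparable. As a small sketch, we rank the absolute values of the backward-selected coefficients (intercept dropped):
```{r, echo=FALSE}
# Standardized coefficients, ranked by absolute magnitude
sort(abs(coef(backward_model)[-1]), decreasing = TRUE)
```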
---
### 4.3. Forward selection
#### 4.3.1. Sampling data
```{r, echo=FALSE}
# Split the data into training (80%) and test (20%) sets
set.seed(123)
sample_size <- floor(0.8 * nrow(BostonHousing))
train_indices <- sample(seq_len(nrow(BostonHousing)), size = sample_size)
train_data <- BostonHousing[train_indices, ]
test_data <- BostonHousing[-train_indices, ]
initial_model <- lm(medv ~ 1, data = train_data)
```
This time we will use the `stepAIC` function (from the `MASS` package) to apply the forward selection algorithm. We will use the intercept-only model as the lower bound of the search and the model with all the variables as the upper bound. Previously we used the base `step` function instead.
#### 4.3.2. Applying the algorithm
```{r, echo=FALSE}
forward_model <- stepAIC(initial_model, direction = "forward",
                         scope = list(lower = initial_model,
                                      upper = ~ crim + zn + indus + chas + nox + rm +
                                        age + dis + rad + tax + ptratio + b + lstat))
```
```{r, echo=FALSE}
plot(forward_model, which=2)
```
On this Q-Q plot, we can see that the residuals are not normally distributed. This means that the model provided by the forward selection algorithm is not appropriate for our data either.
```{r, echo=FALSE}
summary(forward_model)
test_predictions <- predict(forward_model, newdata = test_data)
actual_vs_predicted <- data.frame(Actual = test_data$medv, Predicted = test_predictions)
residuals <- data.frame(Predicted = test_predictions, Residuals = test_data$medv - test_predictions)
ggplot(residuals, aes(x = Predicted, y = Residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(title = "Residual Plot", x = "Predicted Values", y = "Residuals") +
theme_minimal() +
# Ensure the full range of predicted values is shown
coord_cartesian(xlim = range(test_predictions))
```
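To summarize test-set performance in a single number, a quick sketch computing RMSE and MAE (the names `rmse_fwd` and `mae_fwd` are ours):
```{r, echo=FALSE}
# Test-set error metrics for the forward-selected model
rmse_fwd <- sqrt(mean((test_data$medv - test_predictions)^2))
mae_fwd <- mean(abs(test_data$medv - test_predictions))
c(RMSE = rmse_fwd, MAE = mae_fwd)
```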
#### 4.3.3. Correlation Matrix
```{r, echo=FALSE}
correlation_matrix <- cor(train_data)
ggplot(reshape2::melt(correlation_matrix), aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), space = "Lab", name = "Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
coord_fixed()
```
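The heatmap suggests that some predictors are strongly correlated with one another. As a sketch, we list the pairs whose absolute correlation exceeds 0.7 (a threshold we chose for illustration):
```{r, echo=FALSE}
# Predictor pairs with |correlation| > 0.7, a sign of potential multicollinearity
high_cor <- which(abs(correlation_matrix) > 0.7 & upper.tri(correlation_matrix),
                  arr.ind = TRUE)
data.frame(var1 = rownames(correlation_matrix)[high_cor[, 1]],
           var2 = colnames(correlation_matrix)[high_cor[, 2]],
           cor = round(correlation_matrix[high_cor], 2))
```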
## 5. Random forest model
```{r, echo=FALSE}
set.seed(123)
rf_model <- randomForest(medv ~ ., data = train_data, importance = TRUE)
rf_predictions <- predict(rf_model, newdata = test_data)
# Evaluate the model
actual_vs_predicted_rf <- data.frame(Actual = test_data$medv, Predicted = rf_predictions)
# Plot actual vs predicted values
ggplot(actual_vs_predicted_rf, aes(x = Actual, y = Predicted)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Actual vs Predicted Values (Random Forest)", x = "Actual Values", y = "Predicted Values") +
theme_minimal()
```
As we can see, the model is not very good. We can try to improve on it with a decision tree. Intuitively, the data appears segmented into three groups: one with the lowest `medv`, one with the highest `medv`, and a third in between. Let's use a tree to segment these groups explicitly.
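Before moving to the tree, a quick sketch comparing test-set RMSE with the forward-selected model (reusing `rmse_fwd` from above):
```{r, echo=FALSE}
# Compare test-set RMSE: forward-selected linear model vs. random forest
rmse_rf <- sqrt(mean((test_data$medv - rf_predictions)^2))
c(forward_selection = rmse_fwd, random_forest = rmse_rf)
```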
### 5.1. Decision Tree
```{r, echo=FALSE}
# Plot a tree using rpart
tree_model <- rpart(medv ~ ., data = train_data)
rpart.plot(tree_model, cex = 0.8)
```
We can see that the two variables `lstat` and `rm` are the most important splitters, as we deduced intuitively before.
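As a cross-check, a small sketch of the random forest's variable-importance plot, which should also rank `lstat` and `rm` highly if our reading is correct:
```{r, echo=FALSE}
# Variable importance as measured by the random forest
varImpPlot(rf_model, main = "Random Forest Variable Importance")
```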