---
title: "Boston Housing"
author: "Tristan-Mihai Radulescu, Hugo Ponthieu, Benoit Planche"
output:
  html_document:
    df_print: paged
---
# **Introduction**
This report analyzes the `BostonHousing` dataset, which contains information about housing in Boston. We will explore the relationships between several variables and the median home value (`medv`).
---
```{r setup, include=FALSE}
# Load necessary libraries
library(ggplot2)
library(knitr)
library(tidyverse)
library(broom)
library(GGally)
library(MASS)
library(randomForest)
library(rsample)
library(dplyr)
library(ipred)
library(caret)
library(rpart)
library(rpart.plot)
knitr::opts_chunk$set(echo = TRUE)
```
```{r, echo=FALSE}
# Load the dataset
BostonHousing <- read.csv("./BostonHousing.csv")
# Display first rows
head(BostonHousing)
# Structure of the dataset
# str(BostonHousing)
```
## **1. Distribution of Key Variables**
We analyze the distribution of home prices (`medv`).
```{r valeur-mediane, echo=FALSE}
ggplot(BostonHousing, aes(x = medv)) +
geom_histogram(fill = "blue", bins = 30, alpha = 0.7) +
labs(title = "Distribution of Home Prices",
x = "MEDV (in $1000s)",
y = "Frequency")
```
---
## **2. Correlations Between Variables**
We compute the correlation matrix and display values associated with `medv`.
```{r, echo=FALSE}
# Compute correlation matrix
cor_matrix <- cor(BostonHousing)
# Display correlations with MEDV
cor_matrix["medv", ]
```
**Observations**:
- **Positive correlation** with `rm` (number of rooms).
- **Negative correlation** with `lstat` (percentage of lower-status population).
- **`tax` and `crim` have a negative relationship** with `medv`.
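To make these observations concrete, here is a quick sketch (our addition) ranking the predictors by the absolute value of their correlation with `medv`:
```{r, echo=FALSE}
# Rank predictors by the absolute strength of their correlation with medv
medv_cor <- cor_matrix["medv", ]
sort(abs(medv_cor[names(medv_cor) != "medv"]), decreasing = TRUE)
```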
### **2.1 Pairs**
To get a first grasp of the data and of the relationships between variables, we can use the `pairs` function.
```{r pairs, echo=FALSE}
pairs(BostonHousing)
```
Even though there are many variables, we can see a positive relationship between `rm` and `medv`, and negative relationships between `lstat` and `medv` and between `crim` and `medv`. We will therefore fit a linear regression for each of these variables.
---
### **2.2 Key Relationships**
```{r nb-piece, echo=FALSE}
# Relationship between RM (number of rooms) and MEDV
ggplot(BostonHousing, aes(x = rm, y = medv)) +
geom_point(alpha = 0.6, color = "blue") +
geom_smooth(method = "lm", col = "red") +
labs(title = "Relationship Between RM and MEDV", x = "Average Number of Rooms", y = "Median Home Value (in $1000s)")
```
Here we can see a clear positive correlation between `rm` and `medv`, despite a few outliers. Also, based on the description of the dataset on Kaggle, some data scientists suspect that the `medv` variable was censored, since `medv` often takes the value 50, which is also its maximum. This property makes us question the reliability of the dataset, but we proceed anyway.
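As a quick check on the censoring hypothesis, a small sketch counting how many observations sit exactly at the apparent cap of 50:
```{r, echo=FALSE}
# Number of observations at the suspected censoring cap
sum(BostonHousing$medv == 50)
```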
```{r pop-pauvre, echo=FALSE}
# Relationship between LSTAT (% lower-status population) and MEDV
ggplot(BostonHousing, aes(x = lstat, y = medv)) +
geom_point(alpha = 0.6, color = "purple") +
geom_smooth(method = "lm", col = "red") +
labs(title = "Relationship Between LSTAT and MEDV", x = "Lower-Status Population (%)", y = "Median Home Value (in $1000s)")
```
Here we can see that the relationship between `lstat` and `medv` is negative. The hypothesis would be that neighborhoods with a higher percentage of lower-status population have lower home values.
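To quantify this relationship, a minimal sketch fitting a simple regression of `medv` on `lstat`; the slope estimate should be negative if the hypothesis holds:
```{r, echo=FALSE}
# Simple linear regression of medv on lstat
summary(lm(medv ~ lstat, data = BostonHousing))
```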
---
## **3. Impact of Crime and Taxation on Home Prices**
### **3.1 Distribution of `medv`, `crim` and `tax`**
```{r tax-crim, echo=FALSE}
ggplot(BostonHousing, aes(x = crim)) +
geom_histogram(fill = "red", bins = 30, alpha = 0.7) +
labs(title = "Distribution of Crime Rate", x = "CRIM", y = "Frequency") +
scale_x_log10()
ggplot(BostonHousing, aes(x = tax)) +
geom_histogram(fill = "blue", bins = 30, alpha = 0.7) +
labs(title = "Distribution of Tax Rate", x = "TAX", y = "Frequency")
```
**Observations**:
- **`crim` is highly skewed**: most neighborhoods have low crime rates, but some areas have extreme values.
- **`tax` shows distinct peaks**, suggesting fixed tax rates in certain areas.
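We can check the second observation by tabulating the most frequent `tax` values; a handful of dominant values would explain the peaks (a quick sketch):
```{r, echo=FALSE}
# Most frequent tax values in the dataset
head(sort(table(BostonHousing$tax), decreasing = TRUE))
```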
---
### **3.2 Relationship Between Crime Rate and Home Prices**
```{r tax-crim2, echo=FALSE}
ggplot(BostonHousing, aes(x = crim, y = medv)) +
geom_point(alpha = 0.5, color = "red") +
geom_smooth(method = "lm", col = "black") +
scale_x_log10() +
labs(title = "Relationship Between CRIM and MEDV",
x = "Crime Rate (log)",
y = "Median Home Value (in $1000s)")
```
**Observation**: Home prices **decrease as crime rates increase**.
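To quantify this on the log scale used in the plot, a quick sketch:
```{r, echo=FALSE}
# Correlation between log10(crim) and medv, matching the plot's log scale
cor(log10(BostonHousing$crim), BostonHousing$medv)
```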
---
### **3.3 Relationship Between Taxation and Home Prices**
```{r tax-crim3, echo=FALSE}
ggplot(BostonHousing, aes(x = tax, y = medv)) +
geom_point(alpha = 0.5, color = "blue") +
geom_smooth(method = "lm", col = "black") +
labs(title = "Relationship Between TAX and MEDV",
x = "Tax Rate",
y = "Median Home Value (in $1000s)")
```
**Observation**: A **higher tax rate is associated with lower home prices**.

To quantify the joint effect of crime and taxation on home prices, we fit a multiple linear regression of `medv` on `crim` and `tax`.
```{r reg-multiple, echo=FALSE}
reg_multi <- lm(medv ~ crim + tax, data = BostonHousing)
summary(reg_multi)
```
**Results**:
- **The coefficients for `crim` and `tax` are negative**, confirming their negative impact on `medv`.
- **p-values < 0.05** indicate these effects are statistically significant.
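As a complement to the p-values, a small sketch showing the 95% confidence intervals of these coefficients; both intervals should lie below zero if the effects are reliably negative:
```{r, echo=FALSE}
# 95% confidence intervals for the intercept, crim and tax coefficients
confint(reg_multi)
```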
---
## 4. Linear regressions
### 4.1 Backward selection
To start, we will use the backward selection algorithm to search for a good linear regression model for predicting `medv`. But first, let's fit a model with all the variables to see what happens.
```{r, echo=FALSE}
model <- lm(medv ~ ., data = BostonHousing)
summary(model)
```
Obviously, some variables are not significant (e.g. `indus` or `age`). So let's use the backward selection algorithm to find a better model.
```{r, echo=FALSE}
backward_model <- step(model, direction = "backward", trace = FALSE)
summary(backward_model)
plot(backward_model)
```
We can see that the model gives a very low overall `p-value` (< 2.2e-16) with a residual standard error of 4.736. This error is hard to interpret on its own because the variables are on very different scales; later we will normalize them so the indicator is easier to read.
However, the Q-Q plot of the model diagnostics shows that the residuals are not normally distributed, which suggests that a plain linear regression is not well suited to our data. Still, we will try the forward selection algorithm to showcase our skills 🕶️.
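To back the Q-Q plot reading with a formal test, here is a quick sketch (our addition, not part of the original analysis) applying the Shapiro-Wilk normality test to the residuals; a small p-value would support the non-normality reading:
```{r, echo=FALSE}
# Shapiro-Wilk test of normality on the backward model's residuals
shapiro.test(residuals(backward_model))
```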
### 4.2 Multiple linear regression with normalized data and backward selection
```{r, echo=FALSE}
# Center and scale every variable (mean 0, standard deviation 1)
normalized_data <- preProcess(BostonHousing, method = c("center", "scale"))
# Apply the transformation to obtain the normalized data frame
normalized_data <- predict(normalized_data, BostonHousing)
```
#### 4.2.1 Boxplots
```{r, echo=FALSE}
boxplot(BostonHousing)
boxplot(normalized_data)
```
The goal of these boxplots is to show how the data is distributed. On the second boxplot we can see that every variable now has a mean of 0 and a comparable range.
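A quick numerical sketch confirming the centering and scaling:
```{r, echo=FALSE}
# After centering and scaling, each variable should have mean ~0 and sd ~1
round(sapply(normalized_data, mean), 3)
round(sapply(normalized_data, sd), 3)
```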
#### 4.2.2 Modeling
```{r, echo=FALSE}
model <- lm(medv ~ ., data = normalized_data)
summary(model)
backward_model <- step(model, direction = "backward", trace = FALSE)
summary(backward_model)
plot(backward_model)
```
On our normalized data, the model didn't change much. The only output that changed is the residual standard error, which is now 0.515 (in standard deviations of `medv`). This value is still quite high, which confirms that this model is not the best for our use case.
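One payoff of normalization is that coefficient magnitudes become directly comparable. As a small sketch, we rank the absolute values of the backward-selected coefficients (intercept dropped):
```{r, echo=FALSE}
# Standardized coefficients, ranked by absolute magnitude
sort(abs(coef(backward_model)[-1]), decreasing = TRUE)
```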
---
### 4.3. Forward selection
#### 4.3.1. Sampling data
```{r, echo=FALSE}
# Split the data into training (80%) and test (20%) sets
set.seed(123)
sample_size <- floor(0.8 * nrow(BostonHousing))
train_indices <- sample(seq_len(nrow(BostonHousing)), size = sample_size)
train_data <- BostonHousing[train_indices, ]
test_data <- BostonHousing[-train_indices, ]
initial_model <- lm(medv ~ 1, data = train_data)
```
This time we will use the `stepAIC` function (from the `MASS` package) to apply the forward selection algorithm. We will use the intercept-only model as the lower bound of the search and the model with all the variables as the upper bound. Previously we used the base `step` function instead.
#### 4.3.2. Applying the algorithm
```{r, echo=FALSE}
forward_model <- stepAIC(initial_model, direction = "forward",
                         scope = list(lower = initial_model,
                                      upper = ~ crim + zn + indus + chas + nox + rm +
                                        age + dis + rad + tax + ptratio + b + lstat))
```
```{r, echo=FALSE}
plot(forward_model, which=2)
```
On this Q-Q plot, we can see that the residuals are not normally distributed. This means that the model provided by the forward selection algorithm is not appropriate for our data either.
```{r, echo=FALSE}
summary(forward_model)
test_predictions <- predict(forward_model, newdata = test_data)
actual_vs_predicted <- data.frame(Actual = test_data$medv, Predicted = test_predictions)
residuals <- data.frame(Predicted = test_predictions, Residuals = test_data$medv - test_predictions)
ggplot(residuals, aes(x = Predicted, y = Residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(title = "Residual Plot", x = "Predicted Values", y = "Residuals") +
theme_minimal() +
# Ensure the full range of predicted values is shown
coord_cartesian(xlim = range(test_predictions))
```
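To summarize test-set performance in a single number, a quick sketch computing RMSE and MAE (the names `rmse_fwd` and `mae_fwd` are ours):
```{r, echo=FALSE}
# Test-set error metrics for the forward-selected model
rmse_fwd <- sqrt(mean((test_data$medv - test_predictions)^2))
mae_fwd <- mean(abs(test_data$medv - test_predictions))
c(RMSE = rmse_fwd, MAE = mae_fwd)
```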
#### 4.3.3. Correlation Matrix
```{r, echo=FALSE}
correlation_matrix <- cor(train_data)
ggplot(reshape2::melt(correlation_matrix), aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), space = "Lab", name = "Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
coord_fixed()
```
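The heatmap suggests that some predictors are strongly correlated with one another. As a sketch, we list the pairs whose absolute correlation exceeds 0.7 (a threshold we chose for illustration):
```{r, echo=FALSE}
# Predictor pairs with |correlation| > 0.7, a sign of potential multicollinearity
high_cor <- which(abs(correlation_matrix) > 0.7 & upper.tri(correlation_matrix),
                  arr.ind = TRUE)
data.frame(var1 = rownames(correlation_matrix)[high_cor[, 1]],
           var2 = colnames(correlation_matrix)[high_cor[, 2]],
           cor = round(correlation_matrix[high_cor], 2))
```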
## 5. Random forest model
```{r, echo=FALSE}
set.seed(123)
rf_model <- randomForest(medv ~ ., data = train_data, importance = TRUE)
rf_predictions <- predict(rf_model, newdata = test_data)
# Evaluate the model
actual_vs_predicted_rf <- data.frame(Actual = test_data$medv, Predicted = rf_predictions)
# Plot actual vs predicted values
ggplot(actual_vs_predicted_rf, aes(x = Actual, y = Predicted)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Actual vs Predicted Values (Random Forest)", x = "Actual Values", y = "Predicted Values") +
theme_minimal()
```
As we can see, the model is not very good. We can try to improve on it with a decision tree. Intuitively, the data appears segmented into three groups: one with the lowest `medv`, one with the highest `medv`, and a third in between. Let's use a tree to segment these groups explicitly.
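Before moving to the tree, a quick sketch comparing test-set RMSE with the forward-selected model (reusing `rmse_fwd` from above):
```{r, echo=FALSE}
# Compare test-set RMSE: forward-selected linear model vs. random forest
rmse_rf <- sqrt(mean((test_data$medv - rf_predictions)^2))
c(forward_selection = rmse_fwd, random_forest = rmse_rf)
```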
### 5.1. Decision Tree
```{r, echo=FALSE}
# Plot a tree using rpart
tree_model <- rpart(medv ~ ., data = train_data)
rpart.plot(tree_model, cex = 0.8)
```
We can see that the two variables `lstat` and `rm` are the most important splitters, as we deduced intuitively before.
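As a cross-check, a small sketch of the random forest's variable-importance plot, which should also rank `lstat` and `rm` highly if our reading is correct:
```{r, echo=FALSE}
# Variable importance as measured by the random forest
varImpPlot(rf_model, main = "Random Forest Variable Importance")
```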