---
knit: bookdown::preview_chapter
---
# Pedestrian counting sensor data
```{r ped-pkgs, echo = FALSE}
library(tidyverse)
```
## Overview
The City of Melbourne has sensors set up in strategic locations across the inner city to keep hourly tallies of pedestrians. The data are updated monthly and available for download from the [Melbourne Open Data Portal](https://data.melbourne.vic.gov.au/Transport/Pedestrian-Counting-System-2009-to-Present-counts-/b2ak-trbp). The **rwalkr** package provides an API in R to easily access sensor counts and geographic locations. In this case study, we focus on foot traffic during 2018 at four sensors.
```{r ped-data, eval=FALSE}
library(rwalkr)
library(lubridate)
library(purrr)
library(dplyr)
sensors <- c("Southern Cross Station",
             "Melbourne Central",
             "Flinders Street Station Underpass",
             "Birrarung Marr")
start_date <- make_date(2018, 1:12)
end_date <- rollforward(start_date)
peds_2018 <- map2_dfr(start_date, end_date, melb_walk, .progress = TRUE) %>%
  filter(Sensor %in% sensors) %>%
  mutate(Count = as.numeric(Count))
save(peds_2018, file = "data/peds_2018.rda")
```
```{r ped-locations}
library(readr)
library(lubridate)
library(dplyr)
sensors <- c("Southern Cross Station",
             "Melbourne Central",
             "Flinders Street Station Underpass",
             "Birrarung Marr")
# Data from https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-sensor-locations/export/
ped_loc <- read_csv("data/pedestrian-counting-system-sensor-locations.csv") %>%
  dplyr::filter(year(Installation_date) < 2019,
                Status == "A")
```
```{r ped-sensor, eval=FALSE}
library(ggmap)
melb_bbox <- c(min(ped_loc$Longitude) - 0.001, min(ped_loc$Latitude) - 0.001,
               max(ped_loc$Longitude) + 0.001, max(ped_loc$Latitude) + 0.001)
melb_map <- get_map(location = melb_bbox,
                    source = "osm")
```
```{r ped-loc-map, echo=FALSE}
library(ggmap)
library(ggplot2)
load("data/peds_2018.rda")
ped_map <- readRDS(here::here("data/ped_map.rds"))
selected <- ped_loc %>%
  dplyr::filter(Sensor_description %in% sensors)
nonselected <- ped_loc %>%
  dplyr::filter(!(Sensor_description %in% sensors))
ggmap(ped_map) +
  geom_point(
    data = nonselected, aes(x = Longitude, y = Latitude),
    colour = "#2b8cbe", alpha = 0.6, size = 3
  ) +
  geom_point(
    data = selected, aes(x = Longitude, y = Latitude, colour = Sensor_description),
    size = 4, shape = 17
  ) +
  xlab("Longitude") +
  ylab("Latitude") +
  scale_colour_brewer(palette = "Dark2", name = "Sensor") +
  guides(col = guide_legend(nrow = 2, byrow = TRUE)) +
  theme(legend.position = "bottom")
```
The map above gives a snapshot of `r nrow(ped_loc)` sensors active in 2018, with the four selected sensors highlighted. The selection covers several types of human activity: commuting at Southern Cross Station and Flinders Street Station, a mix of shopping and commuting at Melbourne Central, and cultural events around Birrarung Marr.
## Different temporal patterns
```{r eval = FALSE}
peds_2018 %>%
  ggplot(aes(x = Date_Time, y = Count)) +
  geom_line(size = 0.3) +
  facet_grid(Sensor ~ ., labeller = labeller(Sensor = label_wrap_gen(20))) +
  scale_x_datetime(date_labels = "%d %b %Y", date_minor_breaks = "1 month") +
  xlab("Date Time")
```
We're primarily interested in exploring pedestrian patterns at various time resolutions and across different locations. In light of people's daily schedules, we plot the counts against time of day, shown in Figure \@ref(fig:ped-lineplots). At the train stations, two distinct clusters of days stand out.
```{r ped-lineplots, fig.cap = "Counts plotted against time of the day, faceted by sensors."}
peds_2018 %>%
  ggplot(aes(x = Time, y = Count, group = Date, colour = Sensor)) +
  geom_line(size = 0.3, alpha = 0.3) +
  facet_wrap(~ Sensor, labeller = labeller(Sensor = label_wrap_gen(20))) +
  scale_colour_brewer(palette = "Dark2", name = "Sensor") +
  theme(legend.position = "none")
```
We further separate work days from non-work days to explain this variation. Except at Birrarung Marr, the dominant pattern is driven by the workforce, with commuter spikes at 8am and 5pm and a lunchtime rush. These spikes are absent on weekends and public holidays.
Atypical days, however, remain to be uncovered.
```{r}
library(lubridate)
library(tsibble)
hol2018 <- tsibble::holiday_aus(2018, state = "VIC") %>%
  bind_rows(tibble(holiday = "AFL", date = ymd("20180929")))
workday <- fct_inorder(c("Work day", "Non-work day"))
peds_2018 <- peds_2018 %>%
  mutate(
    Day = wday(Date_Time, label = TRUE, week_start = 1),
    Workday = if_else(
      (Date %in% hol2018$date) | Day %in% c("Sat", "Sun"),
      workday[2], workday[1])
  )
```
```{r}
peds_2018 %>%
  ggplot(aes(x = Time, y = Count, group = Date, colour = Sensor)) +
  geom_line(size = 0.3, alpha = 0.3) +
  facet_grid(Sensor ~ Workday, labeller = labeller(Sensor = label_wrap_gen(20))) +
  scale_colour_brewer(palette = "Dark2", name = "Sensor") +
  theme(legend.position = "none")
```
To locate those unusual moments, the Flinders Street Station data is laid out in calendar format using the **sugrrants** package. The calendar plot unfolds day-to-day foot traffic in a fresh way. The spike on the night of February 17 corresponds to the White Night event. Several weeks of data are missing in mid-August. The underpass was shut from midday on November 19 until the end of the following day.
```{r}
library(sugrrants)
flinders <- peds_2018 %>%
  dplyr::filter(Sensor == "Flinders Street Station Underpass")
flinders_cal <- flinders %>%
  frame_calendar(x = Time, y = Count, date = Date)
gg_cal <- flinders_cal %>%
  ggplot(aes(x = .Time, y = .Count, colour = Workday, group = Date)) +
  geom_line()
prettify(gg_cal) +
  theme(legend.position = "bottom")
```
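As a quick cross-check of the anomalies spotted in the calendar plot, a short summary can list the days with incomplete records and the largest late-night counts. This is a sketch only; it assumes, as elsewhere in this chapter, that missing hours appear as `NA` counts in the `flinders` data.
```{r eval = FALSE}
# Days with one or more missing hourly observations (e.g. the August gap),
# assuming gaps are recorded as NA counts
flinders %>%
  group_by(Date) %>%
  summarise(n_missing = sum(is.na(Count))) %>%
  filter(n_missing > 0)

# Largest counts recorded between 10pm and 5am, e.g. the White Night spike
flinders %>%
  filter(Time >= 22 | Time <= 5) %>%
  arrange(desc(Count)) %>%
  slice(1:10)
```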
## Does the weather make a difference to the number of people out walking?
```{r eval = FALSE}
stations <- read_table(
  "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
  col_names = c("ID", "lat", "lon", "elev", "state", "name",
                "v1", "v2", "v3"), skip = 353, n_max = 17081)
oz <- map_data("world", xlim = range(stations$lon), ylim = range(stations$lat))
ggplot(oz, aes(x = long, y = lat)) +
  geom_path(aes(group = group)) +
  geom_point(data = stations,
             aes(x = lon, y = lat), colour = "red", alpha = 0.5) +
  coord_quickmap()
```
```{r eval = FALSE}
# melb_stns <- stations %>%
#   filter(
#     lon > min(ped_loc$longitude), lon < max(ped_loc$longitude),
#     lat > min(ped_loc$latitude), lat < max(ped_loc$latitude))
melb_stns <- stations %>%
  filter(ID == "ASN00086282") # Melbourne airport
library(vroom)
ghcn2018 <- vroom(
  "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2018.csv.gz",
  col_names = FALSE, col_select = 1:4)
melb_ghcn <- ghcn2018 %>%
  dplyr::filter(X1 == melb_stns$ID, X3 %in% c("PRCP", "TMAX", "TMIN")) %>%
  rename_all(~ c("station", "date", "variable", "value")) %>%
  mutate(date = ymd(date), value = value / 10) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename_all(tolower)
write_rds(melb_ghcn, file = "data/melb_ghcn.rds", compress = "gz")
```
Time of day and day of week are the predominant drivers of pedestrian numbers, as the previous plots show. Beyond these temporal factors, weather conditions could also affect how many people walk in the city: people are likely to stay indoors when the day is too hot, too cold, or raining hard. Daily meteorological records, available from the [National Climatic Data Center](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/), are used as a separate source and joined to the main pedestrian table by date. Rather than using the original numeric measurements, binary variables are created to capture these tipping points and to avoid multicollinearity issues in the modelling later.
```{r}
melb_ghcn <- read_rds("data/melb_ghcn.rds")
high_prcp_fct <- fct_inorder(c("none", "rain"))
high_temp_fct <- fct_inorder(c("not", "hot"))
low_temp_fct <- fct_inorder(c("not", "cold"))
melb_ghcn <- melb_ghcn %>%
  mutate(
    high_prcp = if_else(prcp > 5, high_prcp_fct[2], high_prcp_fct[1]),
    high_temp = if_else(tmax > 33, high_temp_fct[2], high_temp_fct[1]),
    low_temp = if_else(tmin < 6, low_temp_fct[2], low_temp_fct[1])
  )
peds_weather <- peds_2018 %>%
  left_join(melb_ghcn, by = c("Date" = "date"))
```
We fit a Poisson model to the hourly counts, regressed on a three-way interaction among `Time`, `Workday`, and `Sensor`, plus the three meteorological variables. The Poisson model is an appropriate choice, since the response takes only non-negative integer values. All of the variables have a significant effect on the counts. The coefficients of the weather variables are all negative, meaning that very hot, very cold, and rainy days tend to see lower counts than other days.
```{r}
peds_fit <- peds_weather %>%
  mutate(Time = as_factor(Time)) %>%
  glm(Count ~ Time * Workday * Sensor + high_prcp + high_temp + low_temp,
      data = ., family = poisson(link = "log"))
coef(peds_fit)[c("high_prcprain", "high_temphot", "low_tempcold")]
```
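Because the model uses a log link, the weather coefficients are easier to read as multiplicative effects on the expected hourly count. The sketch below (assuming the `peds_fit` object above) exponentiates them; values below 1 correspond to the negative coefficients discussed above.
```{r eval = FALSE}
# Multiplicative effect of each weather indicator on the expected hourly count
weather_terms <- grep("prcp|temp", names(coef(peds_fit)), value = TRUE)
exp(coef(peds_fit)[weather_terms])
```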
To examine the goodness of fit, we plot the fitted values against the observed values for Flinders Street Station, laid out on the calendar again. The pink lines show the fitted values. The model performs well in general, except for a few misfits due to the unusual events spotted earlier.
```{r echo = FALSE, eval=FALSE}
peds_aug <- broom::augment(peds_fit,
                           # drop rows with missing counts so the rows line up
                           # with the observations used to fit the model
                           data = peds_weather %>% filter(!is.na(Count)),
                           type.predict = "response")
flinders_aug <- peds_aug %>%
  filter(Sensor == "Flinders Street Station Underpass")
flinders_aug_cal <- flinders_aug %>%
  frame_calendar(x = Time, y = vars(Count, .fitted), date = Date)
gg_cal_aug <- flinders_aug_cal %>%
  ggplot(aes(x = .Time, group = Date)) +
  geom_line(aes(y = .Count)) +
  geom_line(aes(y = ..fitted), colour = "hotpink")
prettify(gg_cal_aug) +
  theme(legend.position = "bottom")
```
## If setting up a coffee business
Melbourne is world-renowned for its coffee culture. Location matters a great deal to a small business, particularly a coffee shop: an ideal site should provide a constant stream of potential customers throughout the week. To help decide where to open the coffee shop, Flinders Street Station or Melbourne Central, we run a simulation study using the fitted model under the following assumptions:
* At Flinders Street Station, the proportion of passing pedestrians who will buy a coffee is 0.1 between 7am-10am, 0.05 between 10am-4pm, and 0.01 between 4pm-8pm. At Melbourne Central, the proportions are 0.08, 0.06, and 0.02 for the same periods. No purchases occur at any other time.
* Each coffee purchase is \$4.
* One attendant costs \$100/hour, two attendants \$150/hour, three attendants \$200/hour, and four attendants \$250/hour.
* Each attendant can serve 30 customers per hour; any customers beyond what the staff can handle walk away without purchasing. (A small worked example follows this list.)
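As a quick back-of-the-envelope check of these assumptions, suppose (hypothetically) that 400 pedestrians pass Flinders Street Station at 8am and two attendants are rostered on.
```{r eval = FALSE}
# Hypothetical hour: 400 pedestrians at Flinders Street Station at 8am
count <- 400
rate <- 0.1                      # purchase rate between 7am and 10am
attendants <- 2
customers <- min(round(count * rate), 30 * attendants)  # 40 buyers, capacity 60
profit <- customers * 4 - (100 + 50 * (attendants - 1)) # $160 revenue - $150 wages
profit
```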
The `compute_profit()` function calculates the profit at the two locations for a given date, a set of hours, and the number of attendants at each site, based on one set of counts simulated from the fitted model.
```{r}
compute_profit <- function(date, time, flinders = 1, central = 1) {
  peds_sim <- peds_weather %>%
    filter(!is.na(Count)) %>%
    mutate(.simulated = simulate(peds_fit, seed = 2020)$sim_1)
  # hourly purchase rates at each site, indexed by hour of the day
  fl_rate <- c(rep(0, 6), rep(0.1, 3), rep(0.05, 6), rep(0.01, 4), rep(0, 5))[time]
  mc_rate <- c(rep(0, 6), rep(0.08, 3), rep(0.06, 6), rep(0.02, 4), rep(0, 5))[time]
  peds_sim <- peds_sim %>%
    filter(Date == date, Time %in% time)
  fl_count <- peds_sim %>%
    filter(Sensor == "Flinders Street Station Underpass") %>%
    pull(.simulated)
  mc_count <- peds_sim %>%
    filter(Sensor == "Melbourne Central") %>%
    pull(.simulated)
  # customers are capped at 30 per attendant per hour
  fl_ncustomers <- pmin(round(fl_count * fl_rate, 0), 30 * flinders)
  mc_ncustomers <- pmin(round(mc_count * mc_rate, 0), 30 * central)
  # $4 per coffee, minus staffing costs of $100 plus $50 per extra attendant
  fl_profit <- fl_ncustomers * 4 - (100 + 50 * (flinders - 1))
  mc_profit <- mc_ncustomers * 4 - (100 + 50 * (central - 1))
  tibble(time = time, flinders = fl_profit, central = mc_profit)
}
```
With three attendants at Flinders Street Station and two attendants at Melbourne Central, hourly earnings and losses are reported for Wednesday 2018-03-28.
```{r}
compute_profit("2018-03-28", 7:19, flinders = 3, central = 2)
```
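To compare the two sites over the whole day, one might total the hourly figures; a small sketch using the function above:
```{r eval = FALSE}
# Total daily profit at each site for the same day and staffing levels
compute_profit("2018-03-28", 7:19, flinders = 3, central = 2) %>%
  summarise(across(c(flinders, central), sum))
```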
## Your turn
Modify `compute_profit()` to take account of time-varying attendant numbers, and generate 10 simulated data sets to obtain a plausible range of profits.
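As a starting point (a hint, not a full solution), `simulate()` can draw several replicates from the fitted model in one call, and a time-varying staffing plan can be expressed as a vector with one entry per hour:
```{r eval = FALSE}
# Ten simulated sets of hourly counts, one per column (sim_1, ..., sim_10)
sims <- simulate(peds_fit, nsim = 10, seed = 2020)

# A hypothetical staffing plan for the hours 7:19, with extra staff in the morning
flinders_staff <- c(3, 3, 3, rep(1, 10))
```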
<!--
We can investigate the workforce pattern a bit more closely by location. Rescaling the counts for each location reveals the differences in patterns much more clearly and makes the three workforce spikes more pronounced (cf. Figure \@ref(fig:peds-adjcounts)).
```{r peds-adjcounts, fig.cap="Adjusted counts of pedestrians reveals the three workforce spikes on weekdays more clearly.", out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE, eval=FALSE}
weekhourlies <- peds %>% group_by(Sensor, Weekday, Time) %>%
summarize(Count = median(Count, na.rm=TRUE))
weekhourlies <- weekhourlies %>% group_by(Sensor) %>%
mutate(adjCount = scale(Count))
ggplot(aes(x=Time, y=adjCount), data=weekhourlies) + facet_grid(.~Weekday) + geom_line(aes(group=Sensor))
```
Figure \@ref(fig:loc-cluster) gives an overview of the hourly pattern observed at each location. Locations are grouped by their pattern, resulting in three groups that can be described mostly by their pedestrian counts at 8am, noon, and 5pm. One of the groups shows a strong morning and afternoon peak, with only a slight increase during lunch. The other two groups are not nearly as affected by the pedestrian rush hours. One shows almost the same pattern on weekdays as on weekends (with tiny spikes added on weekdays), while pedestrian traffic for the last group is generally higher on weekdays than on weekends and increases during the day until peaking at 5pm.
```{r loc-cluster, fig.cap="Location of sensors clustered by observed patterns of pedestrian counts.", out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE, eval=FALSE}
hourlies <- weekhourlies %>% group_by(Time, Sensor) %>% summarize(
adjCount = mean(adjCount, na.rm=TRUE)
)
adjcounts <- hourlies %>% spread(Time, adjCount)
dists <- dist(adjcounts[,-1])
pedclust <- hclust(dists)
adjcounts$Group <- cutree(pedclust, k=3)
weekhourlies <- merge(weekhourlies[1:nrow(weekhourlies),], adjcounts[, c("Sensor", "Group")], by="Sensor", all.x=TRUE)
ggplot(aes(x=Time, y=Count), data=weekhourlies) + facet_grid(Group~Weekday) + geom_line(aes(group=Sensor, colour=factor(Group))) + scale_colour_brewer(palette="Dark2")
```
The [geographic location of the sensors](https://data.melbourne.vic.gov.au/Transport-Movement/Pedestrian-Sensor-Locations/ygaw-6rzq) is made available through the Melbourne Data initiative. A copy of the data is available locally. What we would like to do with this data, is to plot the sensors on a map of Melbourne, coloured by the grouping that we just identified to get an idea of whether the groupings have a geographical interpretation as well.
```{r eval=FALSE}
sensors <- read.csv("data/Pedestrian_Sensor_Locations.csv")
```
Unfortunately, we cannot match the names of the locations directly, because they are formatted (slightly) differently between the two sources. The sensor location data set, for example, contains the string Lygon St (West), whereas the pedestrian count data encodes the same location as Lygon.St..West. (note the dots introduced by R as substitutes for special characters such as white space in variable names).
In order to match these locations, we make use of fuzzy matching as implemented in `adist`, which is based on the generalized Levenshtein distance:
```{r eval=FALSE}
src1 <- as.character(sensors$Sensor.Description)
src2 <- as.character(unique(weekhourlies$Location))
dist.name<-adist(src1, src2, partial = TRUE, ignore.case = TRUE)
dim(dist.name)
```
This distance is essentially an integer count of the differences between two character strings. We pick the minimum for each pair to match the location strings.
```{r eval=FALSE}
mins <- apply(dist.name, MARGIN=1, FUN=which.min)
sensors$Location <- unique(weekhourlies$Location)[mins]
```
We should also investigate the actual distance values to make sure that we did not accidentally match things that we should not have matched:
```{r eval=FALSE}
sensors$MatchQuality <- apply(dist.name, MARGIN=1, FUN=min)
summary(sensors$MatchQuality)
sensors <- sensors[order(sensors$MatchQuality),]
tail(sensors)[,c("Location", "Sensor.Description")]
```
These matches all look good except for two: `Lonsdale St-Spring St (West)` and `Fitzroy Gardens Visitor Centre` should probably not be matched at all (these two sensors are not actually included in the pedestrian count data at this time). We will set those two locations to NA, and then match via Location to include the grouping information:
```{r eval=FALSE}
sensors$Location[sensors$Sensor.Description %in% c("Lonsdale St-Spring St (West)","Fitzroy Gardens Visitor Centre")] <- NA
sensors <- merge(sensors, weekhourlies[,c("Location", "Group")], by="Location")
```
Now we want to put this information on a map:
```{r ped-map, fig.cap="Map of inner Melbourne. Locations of sensors are indicated by dots, colour indicates their group.", out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE, eval=FALSE}
library(ggmap)
library(ggthemes)
melb <- get_map("Melbourne, Australia", zoom=14) # we need to set the zoom - the auto-zoom includes too much.
ggmap(melb, extent="normal") +
geom_point(aes(x=Longitude, y=Latitude, colour=factor(Group)), data=sensors, size=3) + theme_map() +
scale_colour_brewer(palette="Dark2")
```
***Does the grouping shown in Figure \@ref(fig:ped-map) make sense to somebody who knows Melbourne?***
Coming back to the general pattern of the graphic we started out with in Figure \@ref(fig:peds-counts), we see that
generally, things are quiet at 5 am, particularly on weekdays. There are, however, some notable exceptions with pedestrian counts of more than 1000 between 5 and 6 in the morning:
```{r eval=FALSE}
subset(mpeds, Hour==5 & Count > 1000)
```
In the current data set (Dec 2015 - Feb 2016) these counts occurred in eight locations on February 21, 2016, which is when Melbourne hosted its annual White Night in 2016.
Besides New Year's morning on the corner of Flinders and Swanston Street, City Square was the place to be on December 28 and 29 at five in the morning. *** not sure what was going on on those two dates ***
```{r eval=FALSE}
days <- unique(subset(mpeds, Hour==5 & Count > 1000)[, c("Date", "Location")])
# I want to get the remaining hours for each of the locations. I also want to get the Feb 20 data.
```
```{r eval=FALSE}
ggplot(aes(x=Hour, y=Count, group=interaction(Location, Day)),
data=mpeds) +
facet_grid(.~Weekday) +
geom_line(alpha=0.3)
```
-->