-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy path08-case-study-accidents.Rmd
403 lines (312 loc) · 24.2 KB
/
08-case-study-accidents.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
---
knit: bookdown::preview_chapter
---
# Accidents
The US Department of transportation is keeping a detailed track of every accident that results in a fatality. This data are part of FARS, the [Fatality Analysis Reporting System](http://www.nhtsa.gov/FARS). Data is available in annual releases going back to 1975.
Accident forms change over time, and the corresponding databases with them - e.g. cell phones as a potential contributor to an accident were not on the list back in the 80s. This often makes an analysis of trends over time a bit tricky.
```{r map-2014, echo=FALSE, fig.cap='Overview of where fatal accidents occurred in 2014', out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE}
library(ggplot2)
library(foreign)
accidents <- read.dbf("data/FARS2014-DBF/accident.dbf")
subset(accidents, dplyr::between(LONGITUD, -130, 0)) %>%
ggplot() +
geom_point(aes(x = LONGITUD, y = LATITUDE), size = .5, alpha = .5)
```
Here, we first focus on a single year's worth of data. The data is released in different formats. Besides SAS, DBF seems to be a staple at the DOT. DBF stands for 'data base format', and in R we can read files in this format using the function ```read.dbf```, which is part of the package ```foreign``` [@R-foreign]. Files in SAS format can be read with package ```sas7bdat``` [@R-sas7bdat]. Generally, files in SAS format are a bit smaller than the ones in DBF, but the ```read.dbf``` function is much faster than the ```read.sas7bdat``` function. However, the SAS files come with a file called ```Format.sas```, which contains a translation of numerical codes to string values, making results much easier to interpret. Using a bit of our magical text skills, we can get the relevant information out of this file (see Your Turn question #1).
***the annoying thing about this database is that in Format14.sas the SAS variable names are used, which have not much to do with the DBF variable names, so the applicability of the getLevels function is a bit limited here. ***
***Give some idea of the complexity of the data files***
Figure \@ref(fig:map-2014) gives an overview of all geographical locations where an accident resulting in at least one fatality was recording in 2014. A pretty detailed roadmap of the US is the result.
## When do people die on the roads?
Figure \@ref(fig:when) gives an overview of when fatal accidents happen on the road: on weekdays, the number of deaths peaks twice a day during times when people commute. During the morning commute the number of deaths shows a small peak with a mode around 6am. A second, much bigger mode follows the afternoon commute with a peak at around 6pm. These peaks in the number of detahs is by far outdone by the number of deaths from fatal accidents on weekend nights and early mornings starting with Friday afternoon and lasting until Sunday around 10pm.
```{r when, message=FALSE, warning = FALSE, fig.cap='Number of fatalities from car accidents by time of the day and hour of the week.', out.width='100%', fig.asp=.65, fig.align='center'}
library(dplyr)
drunk_by_day <- accidents %>% group_by(DAY_WEEK, HOUR) %>%
summarize(
accidents = n(),
deaths = sum(FATALS),
drunk = sum(DRUNK_DR*FATALS > 0)
)
drunk_by_day$drunkPerc <- with(drunk_by_day, drunk/deaths*100)
drunk_by_day$HOUR[drunk_by_day$HOUR == 99] <- NA
drunk_by_day$DAY_WEEK <- factor(drunk_by_day$DAY_WEEK)
levels(drunk_by_day$DAY_WEEK) <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
drunk_by_day$DAY_WEEK <- factor(drunk_by_day$DAY_WEEK, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
ggplot(data = drunk_by_day) +
geom_bar(aes(x = HOUR, weight = deaths), colour = NA) +
facet_wrap(~DAY_WEEK, ncol = 4)
```
The variable ```DRUNK_DR``` is the number of drivers involved in the crash who tested as being above the legal limit of alcohol, i.e. were legally drunk.
Days of the week are encoded as 1 for Sunday through 7 for Saturday, the dots in Figure \@ref(fig:drunk) show the percentage of accidents in which at least one of the drivers was drunk. On Saturdays and Sundays the percentage of drunk drivers is generally higher, but on all days of the week, fatal accidents that occur after 8 pm and before 5 am have an over 30% chance of involving a drunk driver. During the very early hours of the morning this rate spikes to well over 50% even on weekdays!
```{r drunk, message=FALSE, warning = FALSE, fig.cap='Percentage of accidents where at least one of the drivers was legally drunk by day of the week and time of the day. Late nights and early mornings have particularly high frequencies.', out.width='80%', fig.asp=.75, fig.align='center'}
ggplot(data = drunk_by_day, aes(x = HOUR, y = drunkPerc, group = DAY_WEEK, colour = DAY_WEEK)) +
geom_point() +
theme_bw() +
geom_smooth(se = FALSE)
```
## What contributes to fatal accidents?
```CF``` is a variable encoding contributing factors. For more than 90% of all accidents no contributing factors are recorded, but for the remainder, the most frequent factors are (14) falling cargo followed by (20) police pursuit (see Figure \@ref(fig:what) ).
```{r, echo=FALSE}
getLevels <- function(varname, file) {
formats <- readLines(con=file)
found <- grep(varname, formats)
if (length(found) == 0) {
stop(sprintf("Error: no level information found for variable %s\n", varname))
}
# semicolons indicate end of lines
semicolons <- grep(";", formats)
#read lines from found to next semicolon
endOfRead <- semicolons[which(semicolons - found > 0)[1]]
dframe <- read.table(file, header=FALSE, sep= "=", skip = found, nrows=endOfRead-found-1)
dframe
}
# getLevels("MFACTOR", "data/FARS2014-DBF/Format14.sas")
# getLevels("STATE", "data/FARS2014-DBF/Format14.sas")
```
```{r what, fig.cap='Number of accidents by contributing factor. Contributing factor 0 stands for "none".', out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE}
enc <- getLevels("ARF14F", "data/FARS2014-DBF/Format14.sas")
names(enc) <- c("CF1", "Label")
ggplot(data = accidents,
aes(x = reorder(CF1, CF1, length))) +
geom_bar() +
coord_flip() +
geom_text(aes(label = Label),
y = 1000, data = enc, hjust = 0)
```
Was the driver distracted? - The answer to this question is recorded in the variable ```MDRDSTRD``` and summarised in Figure @\ref(fig:MDRDSTRD).
```{r MDRDSTRD, fig.cap='Number of accidents by factors distracting the driver', out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE}
distract <- read.dbf("data/FARS2014-DBF/Distract.dbf")
enc2 <- getLevels("DRDIS14F", "data/FARS2014-DBF/Format14.sas")
names(enc2) <- c("MDRDSTRD", "Label")
ggplot(data = distract, aes(x = reorder(MDRDSTRD, MDRDSTRD, length))) +
geom_bar(aes(fill = MDRDSTRD %in% c("05", "06", "15"))) +
coord_flip() +
scale_fill_discrete("Cell phone related distraction") +
geom_text(aes(label = Label), y = 1000, data = enc2, hjust = 0)
# This is really hard to see...
```
0 stands for 'Not distracted'. 99 is code for 'unknown',
96 is a non-report, and 93 stands for a non-specific 'inattention'.
## Who is involved in fatal accidents?
Assuming, individuals on the front left seat are drivers, we see in Figure \@ref(fig:drivers) that on weekdays there is not a big difference in the drunk driving pattern between male and female drivers, but starting Friday afternoon and early evening a gap opens up: a higher percentage of men involved in a fata accident are driving drunk than women in the evenings and early morning hours until during Monday mornings the gap closes again.
```{r drivers, fig.cap='Percentages of drunk drivers involved in a fatal accident by gender, time of the day and day of the week.', out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE}
person <- read.dbf("data/FARS2014-DBF/person.dbf")
driver <- subset(person, SEAT_POS == 11) # front left seat
driver$Date <- with(driver, as.Date(sprintf("2014/%s/%s", MONTH, DAY)))
driver$DAY_WEEK <- lubridate::wday(driver$Date, label=TRUE)
drunk <- driver %>% group_by(HOUR, SEX, DAY_WEEK) %>%
summarize(
n = n(),
drunk = sum(DRINKING==1)
)
drunk$HOUR[drunk$HOUR==99] <- NA
drunk$drunkPerc <- with(drunk, drunk/n*100)
drunk$SEX[drunk$SEX > 2] <- NA
drunk$SEX <- factor(drunk$SEX)
levels(drunk$SEX) <- c("Male", "Female")
ggplot(data = na.omit(drunk),
aes(x = HOUR, y = drunkPerc, colour = factor(SEX),
group = interaction(SEX, DAY_WEEK))) +
geom_point() +
theme_bw() +
geom_smooth(se = F) +
facet_wrap(~DAY_WEEK)
```
Do we see this gap between genders because women are more sensible with alcohol or do they just let somebody else drive?
Figure \@ref(fig:seats) gives a comparison of the number of men and women involved in fatal accidents. Overall, there are about twice as many men involved in a fatal accident than there are women. On the driver seat, this difference is even more pronounced: if there is only one person in the car, the odds of that person being a man are about 3:1. When there are two people in the car, the ratio changes again. On the driver seat the odds of male:female change to about 2:1, while on the passenger seat more women are found than man.
```{r seats, fig.cap='Number of men and women involved in fatal accidents grouped by number of people in the car and seat.', out.width='100%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE}
person <- read.dbf("data/FARS2014-DBF/person.dbf")
frontseats <- subset(person, SEAT_POS %in% c(11, 13))
frontseats$SEX[frontseats$SEX > 2] <- NA
frontseats$SEX <- factor(frontseats$SEX)
levels(frontseats$SEX) <- c("Male", "Female")
enc4 <- getLevels("DRINK14F", "data/FARS2014-DBF/Format14.sas")
names(enc4) <- c("DRINKING", "DRINKINGR")
frontseats <- frontseats %>% group_by(ST_CASE, VEH_NO) %>% mutate(PERSON=n())
frontseats <- subset(frontseats, PERSON < 3)
frontseats <- subset(frontseats, SEX %in% c("Male", "Female"))
frontseats <- subset(frontseats, (SEAT_POS == 11) | (PERSON == 2 & SEAT_POS ==13) )
frontseats$SEAT_POS <- factor(frontseats$SEAT_POS)
levels(frontseats$SEAT_POS) <- c("Driver", "Passenger")
frontseats$DRINKING <- factor(frontseats$DRINKING)
levels(frontseats$DRINKING) <- as.vector(enc4$DRINKINGR)
ggplot(data=frontseats) +
geom_bar(aes(x = DRINKING, fill= SEX), position="dodge") +
facet_grid(PERSON~SEAT_POS, labeller="label_both") +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
```
Figure \@ref(fig:seats-perc) shows the percentages of men involved in fatal accidents by position in the car and number of persons. In all three situations, the odds of men versus women change towards a higher percentage of men if alcohol is involved.
```{r seats-perc, fig.cap='Number of men and women involved in fatal accidents grouped by number of people in the car and seat.', out.width='100%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE}
ggplot(data=frontseats) +
geom_bar(aes(x = DRINKING, fill= SEX), position="fill") +
facet_grid(PERSON~SEAT_POS, labeller="label_both") +
theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
```
When we further investigate the demographics of persons involved in fatal accidents, we find some more imbalances between the genders. Figure \@ref(fig:age) shows the marginal distributions of people involved by age. Once somebody can get a licens (at around age 18), there are huge spikes in the numbers for both genders. A secondary spike at around age 50 is much more pronounced in men than women.
```{r age, fig.keep=TRUE, fig.cap='Age of persons involved in fatal accidents. Color shows gender and outcome. Both genders show a huge spike in numbers at around age 18. The second spike around age 50 is more pronounced in men than in women. ', out.width='80%', fig.asp=.5, fig.align='center', message=FALSE, warning = FALSE}
person$Alive <- person$DEATH_YR > 2015
person$SEX[person$SEX > 2] <- NA
person$AGE[person$AGE > 199] <- NA
person$GenderAlive <- with(person, interaction(SEX,Alive))
levels(person$GenderAlive) <- c("Male/died", "Female/died", "Male/alive", "Female/alive")
person$SEX <- factor(person$SEX)
levels(person$SEX) <- c("Male", "Female")
cols <- RColorBrewer::brewer.pal(n=6, name="Paired")
ggplot(aes(AGE, fill=GenderAlive), data=subset(person, !is.na(GenderAlive))) +
geom_histogram( binwidth=1) +
scale_fill_manual(values=cols[c(2,6,1,5)]) +
theme(legend.position="none") +
geom_hline(yintercept=0.5, colour="white") + facet_wrap(~SEX)
```
Figure \@ref(fig:age-b) shows essentially the same picture, but focuses on outcome of the accident. Light colors indicate survival. Death rates vary a lot by age, and there are some interesting differences between genders: for the first 17 years, death rates for both genders are low and very similar (and probably related to position in the car). Rates then jump to much higher rates: for men aged between 20 and 60 the chance of dying in the car accident when involved in a fatal car crash is almost 50%. For women this probability is about 40%. After age 60 the rates go up steeply. Rates for men stay above the rates for women up to about age 80, when the gender differences disappear again (but there is also not much data to support this statement).
```{r age-b, fig.keep=TRUE, fig.cap='Death rates of persons involved in fatal accidents by age and gender. For both genders there is a huge increase in the rate at age 18. Starting around age 50 the rates of death increase. Generally, the rate for a man dying are higher than for a woman.', out.width='80%', fig.asp=.5, fig.align='center', message=FALSE, warning = FALSE}
ggplot(aes(AGE, fill=GenderAlive), data=subset(person, !is.na(GenderAlive))) +
geom_histogram( binwidth=1, position="fill", alpha=0.8) +
scale_fill_manual(values=cols[c(2,6,1,5)]) + ylab("proportion") +
theme(legend.position="none") + facet_wrap(~SEX)
```
Figure \@ref(fig:age-c) puts all of these percentages together. The figure gives an overview of the multinomial distributions conditioned on age. Between ages 20 and 50 men are by far overrepresented in fatal accidents.
```{r age-c, fig.keep=TRUE, fig.cap='Age of persons involved in fatal accidents. Color shows gender and outcome.', out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE}
person$GenderAlive <- factor(person$GenderAlive, levels=c("Male/died", "Male/alive", "Female/alive", "Female/died"))
ggplot(aes(AGE, fill=GenderAlive), data=person) +
geom_histogram(position="fill", binwidth=1, alpha=0.8) +
scale_fill_manual(values=cols[c(2,1,5,6)]) +
theme(legend.position="none") + ylab("proportion") +
geom_hline(yintercept=0.5, colour="white")
```
## Are roads getting safer?
First we read in data at five year intervals starting with 1975. In 2000, we switch to yearly updates, because, as can be seen in Figure \@ref(fig:fatalities-trend) during this time frame things change most dramatically in terms of the number of accidents. At the same time (see Figure \@ref(fig:fatalities-rates)), the rate at which people die in fatal accidents drops as well.
```{r, eval=FALSE}
require(foreign)
files <- dir("data/FARS-allyears/", pattern="dbf", ignore.case=TRUE)
accidents <- list()
for (fidx in seq_along(files)) {
accidents[[fidx]] <- read.dbf(file.path("data/FARS-allyears", files[fidx]))
}
accidents[[length(accidents)+1]] <- read.dbf("data/FARS2014-DBF/accident.dbf")
library(tidyr)
library(dplyr)
numbers <- data.frame(
year = accidents %>% purrr::map_int(.f = function(x) {
year <- x$YEAR[1]
if (year < 2000) return(as.integer(1900+year[1]))
as.integer(year)
}),
fatalities = accidents %>% purrr::map_dbl(.f = function(x) {
sum(x$FATALS)
}),
n = accidents %>% purrr::map_dbl(nrow)
)
write.csv(numbers, "data/numbers.csv", row.names=FALSE)
```
```{r fatalities-trend, fig.keep=TRUE, fig.cap="Number of accidents (black points) and fatalities (red points) over time.", out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE}
numbers <- read.csv("data/numbers.csv")
ggplot(data = numbers) +
geom_point(aes(x = year, y = n)) +
geom_point(aes(x = year, y = fatalities), colour = "red")
```
```{r fatalities-rates, fig.cap="Rate of fatalities per accident in which at least one fatality occurred. The rate decreases over time.", out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE}
ggplot(data = numbers) +
geom_point(aes(x = year, y = fatalities/n))
```
The drop in both numbers of fatal accidents over time and the rates of fatalities is even more dramatic, when put into the perspective of how many more cars are on the roads. This kind of information is made available by RITA.
RITA got renamed to BTS - none of the websites below exist any more. Need to find them again.
<!-- The [table of the number of cars on US roads](http://www.rita.dot.gov/bts/sites/rita.dot.gov.bts/files/publications/national_transportation_statistics/html/table_01_11.html) is available online. -->
<!-- While we can read in the data (see below), it is in a fairly awful shape: -->
<!-- ```{r} -->
<!-- rita <- read.table("http://www.rita.dot.gov/bts/sites/rita.dot.gov.bts/files/table_01_11_3.csv", skip=1, header=TRUE, sep=",", na.strings="NA") -->
<!-- head(rita[,1:10]) -->
<!-- ``` -->
<!-- But after some basic cleaning, we can extract the number of vehicles registered annually: -->
<!-- ```{r} -->
<!-- library(tidyr) -->
<!-- names(rita)[1] <- "mode" -->
<!-- rm <- gather(rita, key=year, value=number, -mode, na.rm=TRUE) -->
<!-- rm$year <- as.numeric(gsub("X","", rm$year)) -->
<!-- rm$number <- gsub(",","", rm$number) # get rid of all the comma -->
<!-- rm$number <- trimws(gsub("(R)", "", rm$number, fixed=TRUE)) -->
<!-- rm$number <- as.numeric(rm$number) -->
<!-- rm <- subset(rm, !is.na(rm$number)) -->
<!-- vehicles <- subset(rm, mode %in% "Highway, total (registered vehicles)") -->
<!-- ``` -->
Figure @\ref(fig:registration) shows the number (in millions) of registered vehicles over time. Between 1965 and 1993 the number of annual registrations doubles from 100 million to 200 million. Until most recently, the number of registrations increases linearly with an average of about 3.5 Million more vehicles registered each year. Unfortunately, this does not tell us eaxactly how many cars are on the roads each year - for that we would need to also take into account, for how many years people use the same vehicle on average. For the most meaningful measure we would also need to take the number of miles into account that people drive each year. This number would allow us to estimate a rate of fatal accidents per miles driven.
*** we could follow up on this, there are a number of interesting sites that deal with estimating the number of annually driven miles - e.g. the travel monitoring at [https://www.fhwa.dot.gov/policyinformation/travel_monitoring/tvt.cfm](https://www.fhwa.dot.gov/policyinformation/travel_monitoring/tvt.cfm)
and the federal reserve bank of St Louis [https://research.stlouisfed.org/fred2/release?rid=254](https://research.stlouisfed.org/fred2/release?rid=254)
***
<!-- ```{r registration, fig.cap="Number of registered vehicles (in Millions) by year. The number of registered cars increases linearly over time.", out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE} -->
<!-- ggplot(aes(x=year, y=number/10^6), data=vehicles) + -->
<!-- geom_smooth(method="lm") + -->
<!-- geom_point() + -->
<!-- ylab("Number of registered vehicles (in Millions)") -->
<!-- ``` -->
Figure \@ref(fig:adjusted-rates) shows the rate of accident fatalities per million registered vehicles over time. Since 1975 this rate has been cut into less than half from over 300 to just over 125 fatalities per million registrations. However, since about 2009 not much progress was made to further reduce the rate (which might go hand in hand with the sub-linear increase in the number of registrations for that same timeframe).
<!-- ```{r adjusted-rates, fig.cap="Rate of accident fatalities by million of registered vehicles. ", out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE} -->
<!-- numbers <- merge(numbers, vehicles, by="year", all.x=TRUE) -->
<!-- qplot(year, fatalities/number*10^6, data=numbers) + ylab("Rate of accident fatalities per million registered vehicles.") -->
<!-- ``` -->
## Your Turn
+ Use the structure you find in file [Format14.sas](data/FARS2014-DBF/Format14.sas) as the basis to write a function ```getLevels(varname, file)``` that takes a variable name and a sas file, and returns a dataframe with all levels of the variable in a numeric format and the level names as strings.
+ Was there something wrong with the vehicle? Investigate variable ```MFACTOR``` in the ```Factor.dbf```. Make sure to re-code the numbers to strings.
+ Include gender in the discussion of Figure \@ref(fig:when). (Warning! This question looks harmless, but requires some work.)
+ Where people are when they get involved in an accident matters a lot. Investigate numbers and rates of death by seat position (```SEAT_POS```). How do age and gender factor into this? Make sure to investigate only combinations with enough data support.
+ Fatal accidents by state over time:
This is a two-part question. First, look at the code below and explain the pieces of the data processing pipeline:
```{r, eval=FALSE}
state.numbers <- accidents %>% purrr:::map_df(
.f = function(x) x %>% group_by(STATE) %>% summarize(
year= YEAR[1],
fatalities = sum(FATALS),
accidents = n()
)
)
idx <- which(state.numbers$year < 1900)
state.numbers$year[idx] <- 1900+state.numbers$year[idx]
write.csv(numbers, "data/state-numbers.csv", row.names=FALSE)
```
Second, load the data set ```state-numbers.csv``` into your R session and investigate. For interpretation, it will be much easier to change the fips codes to state names. What is the dominant pattern in this plot? How would it be possible to reveal the actual rates of fatalities in each state over time? Try to find a source and investigate how rates of fatalities change for each state over time.
## Solutions
+ ```getLevels()``` included above but not shown.
+ Figure \@ref(fig:MFACTOR) shows the most common contributing factors. Besides 0 'none' and 99 'unkown', the most factors are defects to the tires (1).
```{r MFACTOR, fig.cap='Number of accidents by pre-existing conditions pertaining the vehicles involved', out.width='80%', fig.asp=.75, fig.align='center', message=FALSE, warning = FALSE}
factor <- read.dbf("data/FARS2014-DBF/Factor.dbf")
enc3 <- getLevels("FACTR14F", "data/FARS2014-DBF/Format14.sas")
names(enc3) <- c("MFACTOR", "Label")
ggplot(data = factor, aes(x = reorder(MFACTOR, MFACTOR, length))) +
geom_bar() +
coord_flip() +
geom_text(aes(label = Label), y = 1000, data = enc3, hjust = 0)
```
+ ***Where people are when they get involved in an accident matters a lot. Investigate numbers and rates of death by seat position (```SEAT_POS```). How do age and gender factor into this? Make sure to investigate only combinations with enough data support.***
Investigating seat position first:
- change encoding to text
- reduce the number of categories
- include deaths
```{r}
enc5 <- getLevels("SEATP14F", "data/FARS2014-DBF/Format14.sas")
names(enc5) <- c("SEAT_POS", "Seat")
person <- merge(person, enc5, by="SEAT_POS", all.x=TRUE)
person$SEAT_POS <- with(person, reorder(SEAT_POS, SEAT_POS, length))
ggplot(data = person) +
coord_flip() +
geom_bar(aes(x = SEAT_POS)) +
geom_label(aes(x = SEAT_POS, label = Seat), y = 1000,
data = unique(person[,c("SEAT_POS", "Seat")]),
hjust = 0, alpha = 0.5)
```
We will require each category of ```SEAT_POS``` to have at least 500 records:
```{r}
# SEAT_POS is already ordered by number
person$SEAT_POS_cat <- as.character(person$Seat)
person$SEAT_POS_cat[as.numeric(person$SEAT_POS) < 20] <- "Other"
person$SEAT_POS_cat <- with(person, reorder(SEAT_POS_cat, SEAT_POS_cat, length))
```
```{r}
person$Alive <- person$DEATH_YR > 2015
# qplot(data=person, SEAT_POS_cat, fill=Alive) + coord_flip()
ggplot(data=person, aes(x=SEAT_POS_cat, fill=Alive)) + coord_flip() +
geom_bar(position="fill")
ggplot(data=person, aes(x=SEX, fill=Alive)) + coord_flip() +
geom_bar(position="fill") + facet_wrap(~SEAT_POS_cat) + xlab("proportion")
person$AGE_cat <- cut(person$AGE, breaks=c(0,18, 60, 105))
ggplot(data=person, aes(x=AGE_cat, fill=Alive)) + coord_flip() +
geom_bar(position="fill") + facet_wrap(~SEAT_POS_cat) + xlab("proportion")
```