06-case-study-airlines.Rmd

---
knit: bookdown::preview_chapter
---

# US flights


HH: what I like to include here:

- discussion of airlines: number of passengers, centralized/de-centralization (hubs or not), fuel consumption
- discussion of delays: by time of day, day of week, airport, airline
- finding exceptions: delays of more than 24h, balloons at 600 mph, ...
- making a movie of the flights

I would also like to use new data rather than the Expo data.


There were 445,827 national flights recorded for January 2016:
```{r, message=FALSE, warning=FALSE}
library(ggplot2)
library(ggthemes)
# load ontime data into the working directory
load("data/airlines/ontime/january-flights.RData") 
dim(ontime)
library(dplyr)

tails <- ontime %>% group_by(TailNum) %>% summarize(
  numFlights = n(),
  carrier = UniqueCarrier[1],
  miles = sum(Distance)
)
ggplot(data = tails) + 
  geom_boxplot(aes(x = reorder(carrier, miles), 
                   y = miles, fill = carrier), varwidth = TRUE) + 
  ylim(c(0,150000)) 

ggplot(data = tails) + 
  geom_histogram(aes(x = miles, fill = carrier)) + 
  xlim(c(0, 150000)) + 
  facet_wrap(~carrier, scales = "free_y")
```
Southwest (WN) flies the heck out of its planes, but it is beaten at this game by a lot of the very small airline carriers, at the front of which is Spirit Airlines (NK).
What we cannot see from the boxplots is that some of these airline carriers have several modes for the number of miles that their planes fly. United and American seem to have a couple of planes that fly only very rarely. Carrier AS has two modes: one mode at a very high number of miles, the other at around 75,000 miles.  

All times in the ontime data set are in local time. This is problematic for almost any data analysis. Even ordering flights by their departure time over the course of a day is not possible unless we are only considering flights in the same time zone. 

Because all of the locations we are interested in, are airports, we can make use of the
[National Transportation Atlas Database](http://www.rita.dot.gov/bts/sites/rita.dot.gov.bts/files/publications/national_transportation_atlas_database/2011/index.html) provided by the Bureau of Transportation Statistics. In this database, airports are listed with geographic location in latitude and longitude. 
```{r, message=FALSE}
library(foreign)
airports <- read.dbf("data/airlines/airports/airports.dbf")
# we can reduce our data set to airports only 
table(airports$LAN_FA_TY)

airports <- subset(airports, LAN_FA_TY=="AIRPORT")
ggplot(data = airports) + 
  geom_point(aes(x = LONGITUDE, y = LATITUDE))
# qplot(LONGITUDE, LATITUDE, data=airports)
```

There are a slew of different approaches and (semi-free) online services available to resolve issues around time zones. Here, we are using the Google Maps Time Zone API to get time zones based on geographic location for every US airport. 
In order to run the following example, you will have to sign up for this service to get an API key. Make yourself familiar with any limitations of the service so you don't accidentally ramp up any charges.

```{r, eval=FALSE}
timestamp <- 1452427200 # epoch time of Jan 10, 2016 12:00 pm

i <- 1
stringi <- sprintf("https://maps.googleapis.com/maps/api/timezone/json?location=%s,%s&timestamp=1452427200&key=PASTE-YOUR-KEY-HERE", airports$LATITUDE[i], airports$LONGITUDE[i])

library(rjson)
json_data <- fromJSON(file=stringi)
```

```{r, echo=FALSE, eval=FALSE}
timestamp <- 1452427200 # epoch time of Jan 10, 2016 12:00 pm
library(rjson)

# airport_tz <- NULL
for (i in 1:nrow(airports)) {
  stringi <- sprintf("https://maps.googleapis.com/maps/api/timezone/json?location=%s,%s&timestamp=1452427200&key=AIzaSyCYXzAUh7Tq9CQ9dHvNFdkZ7QBdrfKNS3U", airports$LATITUDE[i], airports$LONGITUDE[i])
  cat(i)
  cat("\n")
  json <- fromJSON(file=stringi)
  if (json$status =="OK") {
    temp <- data.frame(ID = airports$LOCID[i], json)
    airport_tz <- rbind(airport_tz, temp)
  } else {
    cat(json$status)
    cat("\n")
    if (json$status == "OVER_QUERY_LIMIT") return()
  }
}
write.csv(airport_tz, file="data/airport-timezones.csv", row.names=FALSE)
```
This results in a dataset indexed by airport FAA ID (the three letter acronym for each airport, such as `ORD` for Chicago O'Hare or `DFW` for Dallas/Fort Worth) with a timezone ID and the timezone Name. 
```{r}
airport_tz <- read.csv("data/airport-timezones.csv")
head(airport_tz)
```
We use this information to calculate the time shift for each one of the time zones compared to a reference timezone and date. We make UTC our reference time zone and use 12 pm on a date in January 2016 to calculate the difference in hours of the day. Because there is no change from Standard to Daylight Savings Time in January it does not matter which exact day we use. 
```{r}
library(lubridate)
# reference date
pb.date <- ymd_hm("2016-01-24 12:00", tz="UTC")
hour(pb.date)

airport_tz$timeZoneId <- as.character(airport_tz$timeZoneId)
airport_tz$tz_hours <- 12 - sapply(airport_tz$timeZoneId, function(x) hour(format(pb.date, tz=x, usetz=TRUE)))

airports <- merge(airports, airport_tz[, c("ID", "timeZoneName", "timeZoneId", "tz_hours")], all.x=TRUE, by.x="LOCID", by.y="ID")
ggplot() + 
  geom_point(data = airports, aes(x = LONGITUDE, y = LATITUDE, colour = factor(tz_hours))) + 
  geom_point(data = subset(airports, !is.na(tz_hours)), aes(x = LONGITUDE, y = LATITUDE, colour = factor(tz_hours)))
# qplot(data=airports, x = LONGITUDE, y = LATITUDE, colour=factor(tz_hours)) +
#   geom_point(data=subset(airports, !is.na(tz_hours)))
```

Using the time zone information we  transform local times at airports into global times referenced to UTC, which we then use to order the ontime data according to the time that flights took place ***needs a more precise definition***.
```{r}
# merge time zone data into ontime data by  origin and destination 
# adjust local departure times by origin and arrival times by destination

```
Order flights to identify ghost flights
```{r}
ontime$Departure <- as.POSIXct(lubridate::ymd(as.character(ontime$FlightDate)))
hour(ontime$Departure) <- ontime$CRSDepTime %/% 100
minute(ontime$Departure) <- ontime$CRSDepTime %% 100  # this is where the timezones have to come in
```

This is how we get to ghost flights: for each individual plane (using tail number `TailNum` as an identifier), we weed out all flights that got cancelled (because presumably the plane won't move if its flight gets cancelled). We then sort flights according to their (scheduled) departure date and time. `LastDest` introduces a variable specifying the last destination a plane has flown into - if things go well, this should be the same as the origin from where the next flight is scheduled to take off. If `LastDest` and `Origin` are different airports, this indicates that at least one ghost flight has taken place, because the plane had to move somehow from its last destination to the new origin. 
```{r}
ghosts <- ontime %>% filter(TailNum != "") %>% 
  group_by(TailNum) %>% 
  select(
    TailNum, Departure, FlightDate, Origin, Dest, Cancelled, UniqueCarrier
  ) %>% filter(Cancelled == 0) %>%  
  arrange(Departure) %>%
  mutate(
    LastDest = lag(Dest),
    Ghost = !(Origin == LastDest)
  ) %>% filter(Ghost)
dim(ghosts)
```
There are 9651 ghost flights in January 2016 (*** not adjusted by timezone - this might change a couple of flights ***).

HH: Caveat: if a plane is doing an international flight, it might create an 'artificial' ghost flight.
HH: what I would like to know: number of ghost miles flown. Fuel costs for ghost flights. 

```{r ghostrate01}
ghostsCarrier <- data.frame(xtabs(~UniqueCarrier, data=ghosts))
names(ghostsCarrier)[2] <- "Ghost"
flightsCarrier <- data.frame(xtabs(~UniqueCarrier, data=subset(ontime, Cancelled==0)))
names(flightsCarrier)[2] <- "Flight"

carrierStats <- merge(ghostsCarrier, flightsCarrier, by="UniqueCarrier")
ggplot(data = carrierStats) + 
  geom_point(aes(x = Ghost/Flight * 1000, y = reorder(UniqueCarrier, Ghost/Flight), size = Flight)) + 
  ggtitle("Number of Ghosts for every 1000 Flights") +
  xlab("Number of ghost flights per 1000 flights") + 
  ylab("Airline Carrier")
```

Figure \@ref(fig:ghostrate01) shows the rate of ghost flights per each 1000 flights for each carrier. The size of the dots is proportional to the number of flights the airline carried out in January 2016, i.e. larger carriers are shown by larger sized dots. Generally, larger carriers have larger ghost rates, but there are some remarkable exceptions. Southwest (WN) has the lowest rate of ghost flights among the larger carriers, while Frontier Airlines (F9) has a very high ghost rate for such a small carrier.  

```{r ghostmap01, fig.cap="Ghost flights of three national airline carriers. Airports with at least 2.5\\% of the ghost traffic are labelled.", fig.height=4, fig.width=12, warning=FALSE, message=FALSE}
ghosts <- merge(ghosts, airports[,c("LOCID", "LONGITUDE", "LATITUDE")], by.x="Origin", by.y="LOCID", all.x=TRUE)
names(ghosts)[grep("LONGITUDE", names(ghosts))] <- "orig.long"
names(ghosts)[grep("LATITUDE", names(ghosts))] <- "orig.lat"

ghosts <- merge(ghosts, airports[,c("LOCID", "LONGITUDE", "LATITUDE")], by.x="LastDest", by.y="LOCID", all.x=TRUE)
names(ghosts)[grep("LONGITUDE", names(ghosts))] <- "dest.long"
names(ghosts)[grep("LATITUDE", names(ghosts))] <- "dest.lat"

ghosts$LastDest <-as.character(ghosts$LastDest)
ghosts$Origin <-as.character(ghosts$Origin)
ghostPorts <- ghosts %>% split(.$UniqueCarrier) %>% purrr::map_df(
  function(x) {
    dframe <- data.frame(
      UniqueCarrier = x$UniqueCarrier[1], 
      ID = with(x, c(LastDest, Origin)))
    dframe %>% group_by(UniqueCarrier, ID) %>% summarize(
      ghosts = n())
  }
)
ghostPorts <- merge(ghostPorts, airports[,c("LOCID", "LONGITUDE", "LATITUDE")], by.x="ID", by.y="LOCID", all.x=TRUE)
ghostPorts <- ghostPorts %>% group_by(UniqueCarrier) %>% mutate(
  ghostPerc = ghosts / sum(ghosts) *100
)

states <- map_data("state")
ggplot() + theme_bw() + 
  geom_polygon(aes(long, lat, group=group), data=states, fill="grey80") +
  geom_segment(aes(x=orig.long, xend=dest.long, y=orig.lat, yend=dest.lat),
               data=subset(ghosts, UniqueCarrier %in% c("OO", "EV", "F9")),
               colour="darkorange", alpha=0.25) + facet_wrap(~UniqueCarrier) +
  xlim(c(-125, -60)) + ylim(c(25, 50)) + 
  theme_map() + 
  geom_label(
    aes(LONGITUDE, LATITUDE, label=ID), alpha=0.5,
    data=subset(ghostPorts, 
                UniqueCarrier %in% c("OO", "EV", "F9") & ghostPerc > 2.5))
```

## Flight delays

There are two main kind of delays in the US flight data: departure and arrival delays. Obviously, the two delays are correlated - at least at first sight. Figure \@ref(fig:delays) shows two scatterplots of arrival and departure delays for each of the 400,000 flights using progressive zooms into the data.  The  scatterplot on the left shows all departure and arrival delays. The correlation is very strong, particularly for large delays. But if we zoom closer into the actually practically relevant region of delays up to 120 mins, we see a more complex relationship. While there is still a strong linear relationship (on average planes make up 6 mins of departure delay on arrival), a second pattern emerges for planes that leave on time or are even early but land late. 

```{r delays01, echo=FALSE, fig.caption="Scatterplots of departure delay and arrival delay. Overall, there is a strong linear relationship. Zooming in, a second pattern emerges of planes departing on time but arriving late.", fig.width=6, fig.height=6, warning=FALSE}

ggplot(data = ontime) + 
  geom_point(aes(x = DepDelay, y = ArrDelay))

ggplot(data = subset(ontime, ArrDelay < 120)) + 
  geom_point(aes(x = DepDelay, y = ArrDelay), alpha = .05)
```

### Cancellations?


The Department of Transportation is keeping track of four reasons for cancellations: Carrier delays, Weather delays, Security delays and delays due to the National Air System. Three of these reasons were given for cancelled flights in January 2016, as can be seen in Figure \@ref(fig:cancellation-codes). Obviously, cancellations due to weather are most prominent for this timeframe. 
```{r cancellation-codes01}
cancellations <- subset(ontime, Cancelled == 1)
levels(cancellations$CancellationCode) <- c("", "Carrier", "Weather", "National Air System")
ggplot(data = cancellations) + 
  geom_bar(aes(x = CancellationCode))
```

During January 2016 there is a period of four days (see Figure \@ref(fig:day-delays)) that caused massive, mainly weather-related cancellations. 

```{r day-delays01}
cancelDay <- cancellations %>% group_by(DayofMonth, CancellationCode) %>% summarize(
  n = n()
)
ggplot(data = cancelDay) + 
  geom_point(aes(x = DayofMonth, y = n)) + 
  facet_wrap(~CancellationCode)
```

Looking closer, these cancellations occur mainly in the North East, where winter storm Jonas dropped 2 feet of snow between Jan 22 and Jan 24.
```{r}
fourdays <- cancellations %>% filter(between(DayofMonth, 22, 25)) %>%
  group_by(Origin) %>% summarize(
    n = n()
  )
fourdays <- merge(fourdays, 
                      unique(airports[, c("LOCID", "LONGITUDE", "LATITUDE")]),
                      by.x="Origin", by.y="LOCID")
ggplot(data = fourdays) + 
  geom_point(aes(x = LONGITUDE, y = LATITUDE, size = n))
```


HH: this still needs some work: 

- it seems that some airports are listed multiple times (where does that happen? it's not in the original data, some merge went wrong)
- number of cancellations are less interesting than the rates of cancellations.


Let us investigate arrival delays by airline and other 


```{r airtime-delays01, echo=FALSE}
ggplot(aes(x=CRSElapsedTime, y=ActualElapsedTime), data=ontime) +
  geom_hex(binwidth=c(5,5))
```


### Fuel

RITA publishes monthly fuel costs and consumptions of airline carriers  on its website at [http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=294&DB_Short_Name=Air%20Carrier%20Financial](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=294&DB_Short_Name=Air%20Carrier%20Financial).
We can download the data from there and combine it into one data set for the time frame of the ASA data expo ontime data:

```{r}
fuel <- read.csv("data/airlines/fuel consumption/fuel.csv")
ggplot(data = fuel) + 
  geom_point(aes(x = MONTH, y = TDOMT_GALLONS, colour = UNIQUE_CARRIER)) + 
  facet_wrap(~YEAR)
```

HH: we also need to get the number of miles flown into this picture, as well as the number of passengers, if possible.