# (PART) Extending the Model {-}
# Beta binomial regression {#regression}
```{r, echo = FALSE}
library(knitr)
opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, tidy = FALSE, fig.height = 5, fig.width = 6.67, out.height = "3in", out.width = "4in")
options(digits = 3)
```
```{r cache = FALSE, echo = FALSE}
library(ggplot2)
theme_set(theme_bw())
```
In this book we've been using the empirical Bayes method to estimate batting averages of baseball players. Empirical Bayes is useful in these examples because when we don't have a lot of information about a batter, their estimate is "shrunken" towards the average across all players, as a natural consequence of the beta prior.
But there's a complication that we haven't yet approached. **When players are better, they are given more chances to bat!**[^wickham] That means there's a relationship between the number of at-bats (AB) and the true batting average. For reasons we'll see in a moment, this makes our estimates systematically inaccurate.
[^wickham]: Hat tip to Hadley Wickham for pointing this complication out to me.
In this chapter, we'll adjust our model to a new one where each batter has his own prior, using a method called **beta-binomial regression**. We show that this new model lets us adjust for the confounding factor while still relying on the empirical Bayes philosophy. We also note that this gives us a general framework for allowing a prior to depend on known information, which will become important in Chapter \@ref(hierarchical-modeling).
## Setup
As usual, we start with code that sets up the variables analyzed in this chapter.
```{r lahman_07}
library(dplyr)
library(tidyr)
library(Lahman)
library(ggplot2)
theme_set(theme_bw())

# grab career batting average of non-pitchers
# (allow players that have pitched <= 3 games, like Ty Cobb)
pitchers <- Pitching %>%
  group_by(playerID) %>%
  summarize(gamesPitched = sum(G)) %>%
  filter(gamesPitched > 3)

career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(pitchers, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)

# add player names
career <- Master %>%
  tbl_df() %>%
  dplyr::select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID")

# values estimated by maximum likelihood in Chapter 3
alpha0 <- 101.4
beta0 <- 287.3
prior_mu <- alpha0 / (alpha0 + beta0)

# for each player, update the beta prior based on the evidence
# to get posterior parameters alpha1 and beta1
career_eb <- career %>%
  mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0)) %>%
  mutate(alpha1 = H + alpha0,
         beta1 = AB - H + beta0) %>%
  arrange(desc(eb_estimate))
```
Recall that the `eb_estimate` column gives us an estimate of each player's batting average, combining that player's record with the beta prior parameters fit across everyone ($\alpha_0$, $\beta_0$). For example, a player with only a single at-bat and a single hit ($H = 1; AB = 1; H / AB = 1$) will have an empirical Bayes estimate of
$(H + \alpha_0) / (AB + \alpha_0 + \beta_0) = (1 + `r round(alpha0, 1)`) / (1 + `r round(alpha0, 1)` + `r round(beta0, 1)`) = `r (1 + alpha0) / (1 + alpha0 + beta0)`$
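In code, this shrinkage is a one-line calculation. Here's a quick sketch using a small helper function (the name `shrink` is ours, just for illustration):
```{r shrink_helper, dependson = "lahman_07"}
# shrink a raw record towards the beta prior fit in Chapter 3
shrink <- function(H, AB) (H + alpha0) / (AB + alpha0 + beta0)

shrink(1, 1)      # a 1-for-1 batter is pulled nearly all the way to the prior mean
shrink(300, 1000) # a .300 hitter with 1000 at-bats moves only slightly
```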
```{r abaveragescatter, dependson = "lahman_07", echo = FALSE, fig.cap = "Relationship between the number of at-bats (AB) and the raw batting average (H / AB) across all players with at least 10 at-bats."}
career %>%
  filter(AB >= 10) %>%
  ggplot(aes(AB, average)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() +
  labs(x = "Number of at-bats (AB)",
       y = "Raw batting average (H / AB)")
```
Now, here's the complication. Let's compare at-bats (on a log scale) to the raw batting average (Figure \@ref(fig:abaveragescatter)). Notice that batters with low AB show more variance in our estimates: that's a familiar pattern, because we have less information about them and the raw estimate is noisier.
But notice a second trend: as the number of at-bats increases, the batting average *also* increases. Unlike the variance, this is *not* an artifact of our measurement: it's a result of the choices of baseball managers! Better batters get played more: they're more likely to be in the starting lineup and to spend more years playing professionally.
Now, there are many other factors that are correlated with a player's batting average (year, position, team, etc.). But this one is particularly important, because it confounds our ability to perform empirical Bayes estimation.
```{r abaveragescattershrink, dependson = "lahman_07", echo = FALSE, fig.cap = "Scatter plot of the relationship AB has with raw batting average (left) and with empirical Bayes shrunken estimates (right). The prior mean .261 is shown as a horizontal dashed red line; fit lines are shown in blue."}
career_eb %>%
  filter(AB >= 10) %>%
  gather(type, value, average, eb_estimate) %>%
  mutate(type = plyr::revalue(type, c(average = "Raw",
                                      eb_estimate = "With EB Shrinkage"))) %>%
  ggplot(aes(AB, value)) +
  geom_point() +
  scale_x_log10() +
  geom_hline(color = "red", lty = 2, size = 1.5, yintercept = prior_mu) +
  facet_wrap(~type) +
  ylab("average") +
  geom_smooth(method = "lm")
```
```{r echo = FALSE}
median_lt_20 <- career_eb %>%
  filter(AB >= 10, AB <= 20) %>%
  summarize(average = median(H / AB))
```
Figure \@ref(fig:abaveragescattershrink) shows how empirical Bayes shrinkage affects the estimates of player batting averages. The horizontal red line shows the prior mean that we're "shrinking" towards ($\frac{\alpha_0}{\alpha_0 + \beta_0} = `r prior_mu`$). Notice that it is too high for the low-AB players. For example, the median batting average for players with 10-20 at-bats is `r median_lt_20$average`, and they get shrunk *way* towards the overall average! The high-AB crowd basically stays where they are, because each of those batters comes with a lot of evidence.
Since low-AB batters are getting overestimated while high-AB batters stay where they are, we're working with a biased estimate that systematically *overestimates* batter ability. If we were working for a baseball manager (like in [Moneyball](https://en.wikipedia.org/wiki/Moneyball)), that's the kind of mistake we could get fired for!
## Accounting for AB in the model
How can we fix our model? We'll need to have AB influence our priors: specifically, the prior mean batting average. In particular, we want the typical batting average to depend linearly on $\log(\mbox{AB})$.
First we should write out what our current model is, in the form of a **generative process**, in terms of how each of our variables is generated from particular distributions. Defining $p_i$ to be the true probability of hitting for batter $i$ (that is, the "true average" we're trying to estimate), we're assuming
$$p_i \sim \mbox{Beta}(\alpha_0, \beta_0)$$
$$H_i \sim \mbox{Binom}(\mbox{AB}_i, p_i)$$
(We're letting the totals $\mbox{AB}_i$ be fixed and known per player.) We set up this model in [one of the first posts in this series](http://varianceexplained.org/r/empirical_bayes_baseball) and have been using it since.
I'll point out that there's another way to write the $p_i$ calculation, by **re-parameterizing** the beta distribution. Instead of parameters $\alpha_0$ and $\beta_0$, let's write it in terms of $\mu_0$ and $\sigma_0$:
$$p_i \sim \mbox{Beta}(\mu_0 / \sigma_0, (1 - \mu_0) / \sigma_0)$$
Here, $\mu_0$ represents the mean batting average, while $\sigma_0$ represents how spread out the distribution is (note that $\sigma_0 = \frac{1}{\alpha_0+\beta_0}$). When $\sigma_0$ is high, the beta distribution is very wide (a less informative prior), and when $\sigma_0$ is low, it's narrow (a more informative prior). Way back in [my first post about the beta distribution](http://varianceexplained.org/statistics/beta_distribution_and_baseball), this is basically how I chose the parameters: I wanted $\mu = .27$, and then I picked a $\sigma$ that would give a distribution lying mostly between .210 and .350, our expected range of batting averages.
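To see that this is just a change of variables, here's a quick sketch converting the Chapter 3 prior back and forth (the names `mu0` and `sigma0` are ours, for illustration):
```{r reparam, dependson = "lahman_07"}
# from (alpha0, beta0) to the (mu, sigma) parameterization
mu0 <- alpha0 / (alpha0 + beta0)
sigma0 <- 1 / (alpha0 + beta0)

# and back again: these recover alpha0 and beta0
mu0 / sigma0
(1 - mu0) / sigma0
```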
Now that we've written our model in terms of $\mu$ and $\sigma$, it becomes easier to see how a model could take AB into consideration. We simply define $\mu$ so that it includes $\log(\mbox{AB})$ as a linear term[^linear]:
[^linear]: If you have some experience with regressions, you might notice a problem: $\mu$ can theoretically go below 0 or above 1, which is impossible for a $\beta$ distribution. Thus in a real model we would use a "link function", such as the [logistic function](https://en.wikipedia.org/wiki/Logistic_function), to keep $\mu$ between 0 and 1. I used a linear model (and `mu.link = "identity"` in the `gamlss` call) to make the math in this introduction simpler, and because for this particular data it leads to almost exactly the same answer (try it).
$$\mu_i = \mu_0 + \mu_{\mbox{AB}} \cdot \log(\mbox{AB}_i)$$
$$\alpha_{0,i} = \mu_i / \sigma_0$$
$$\beta_{0,i} = (1 - \mu_i) / \sigma_0$$
Then we define the batting average $p_i$ and the observed $H_i$ just like before:
$$p_i \sim \mbox{Beta}(\alpha_{0,i}, \beta_{0,i})$$
$$H_i \sim \mbox{Binom}(\mbox{AB}_i, p_i)$$
This particular model is called **beta-binomial regression**. We already had each player represented with a binomial whose parameter was drawn from a beta, but now we're allowing the expected value of that beta to be influenced by each player's AB.
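To make the generative process concrete, here's a minimal simulation sketch. The coefficients `sim_mu_0`, `sim_mu_AB`, and `sim_sigma` are made-up values for illustration only; we'll estimate the real ones in the next section.
```{r sim_bb}
set.seed(2017)

# made-up prior coefficients, for illustration only
sim_mu_0 <- 0.14
sim_mu_AB <- 0.015
sim_sigma <- 0.001

# simulate 1000 players: the prior mean of p rises with log(AB),
# and hits are then drawn from a binomial
sim_players <- tibble::tibble(AB = round(10 ^ runif(1000, 1, 4))) %>%
  mutate(mu = sim_mu_0 + sim_mu_AB * log(AB),
         p = rbeta(n(), mu / sim_sigma, (1 - mu) / sim_sigma),
         H = rbinom(n(), AB, p))
```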
## Step 1: Fit the model across all players
Going back to the basics of empirical Bayes from Chapter \@ref(empirical-bayes), our first step is to fit these prior parameters: $\mu_0$, $\mu_{\mbox{AB}}$, and $\sigma_0$. When doing so, it's OK to momentarily "forget" we're Bayesians: we picked our $\alpha_0$ and $\beta_0$ using maximum likelihood, so it's OK to fit these parameters the same way. We can use the [gamlss](https://cran.r-project.org/package=gamlss) package [@R-gamlss] to fit a beta-binomial regression by maximum likelihood.
```{r gamlss_fit, dependson = "lahman_07", results = 'hide'}
library(gamlss)

fit <- gamlss(cbind(H, AB - H) ~ log(AB),
              data = career_eb,
              family = BB(mu.link = "identity"))
```
We can pull out the coefficients using `tidy()` from my broom package.[^broom]
[^broom]: The broom package provides methods for "tidying" model objects into data frames. See `?gamlss_tidiers` for documentation on this particular tidier.
```{r td, dependson = "gamlss_fit"}
library(broom)
td <- tidy(fit)
td
```
This gives us our three parameters: $\mu_0 = `r td$estimate[1]`$, $\mu_{\mbox{AB}} = `r td$estimate[2]`$, and (since `sigma` has a log-link) $\sigma_0 = \exp(`r td$estimate[3]`) = `r exp(td$estimate[3])`$.
This means that our new prior beta distribution for a player *depends on* the value of AB. Figure \@ref(fig:abpriors) shows the prior distributions for several values of $\mbox{AB}$.
```{r abpriors, dependson = "td", echo = FALSE, fig.cap = "The density of the prior distribution for players with particular numbers of at-bats."}
mu_0 <- td$estimate[1]
mu_AB <- td$estimate[2]
sigma <- exp(td$estimate[3])

crossing(x = seq(0.08, .35, .001), AB = c(1, 10, 100, 1000, 10000)) %>%
  mutate(density = dbeta(x, (mu_0 + mu_AB * log(AB)) / sigma,
                         (1 - (mu_0 + mu_AB * log(AB))) / sigma)) %>%
  mutate(AB = factor(AB)) %>%
  ggplot(aes(x, density, color = AB, group = AB)) +
  geom_line() +
  xlab("Batting average") +
  ylab("Prior density")
```
Notice that there is still uncertainty in our prior: a player with 10,000 at-bats could have a batting average ranging from about .22 to .35. But where that range sits shifts greatly with the number of at-bats: any player with AB = 10,000 is almost certainly better than one with AB = 10.
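We can make that comparison concrete with prior quantiles. Here's a sketch (the `prior_interval` helper is ours, for illustration), pulling the fitted coefficients out of `td`:
```{r prior_intervals, dependson = "td"}
mu_0 <- td$estimate[1]
mu_AB <- td$estimate[2]
sigma_0 <- exp(td$estimate[3])

# 95% prior interval for a player with a given number of at-bats
prior_interval <- function(AB) {
  mu <- mu_0 + mu_AB * log(AB)
  qbeta(c(.025, .975), mu / sigma_0, (1 - mu) / sigma_0)
}

prior_interval(10)
prior_interval(10000)
```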
## Step 2: Estimate each player's average using this prior
Now that we've fit our overall model, we repeat the second step of the empirical Bayes method. Instead of using a single $\alpha_0$ and $\beta_0$ as the prior for everyone, we choose a prior for each player based on their AB. We then update using their $H$ and $AB$ just like before.
Here, all we need to calculate are the `mu` (that is, $\mu_i = \mu_0 + \mu_{\mbox{AB}} \cdot \log(\mbox{AB}_i)$) and `sigma` ($\sigma_0$) parameters for each person. (Here, `sigma` will be the same for everyone, but that may not be true in more complex models.) This can be done using the `fitted` method on the gamlss object.
```{r mu_sigma, dependson = "gamlss_fit"}
mu <- fitted(fit, parameter = "mu")
sigma <- fitted(fit, parameter = "sigma")
head(mu)
head(sigma)
```
Now we can calculate $\alpha_0$ and $\beta_0$ parameters for each player, according to $\alpha_{0,i}=\mu_i / \sigma_0$ and $\beta_{0,i}=(1-\mu_i) / \sigma_0$. From that, we can update based on $H$ and $AB$ to calculate new $\alpha_{1,i}$ and $\beta_{1,i}$ for each player.
```{r career_eb_wAB, dependson = "mu_sigma"}
career_eb_wAB <- career_eb %>%
  dplyr::select(name, H, AB, original_eb = eb_estimate) %>%
  mutate(mu = mu,
         alpha0 = mu / sigma,
         beta0 = (1 - mu) / sigma,
         alpha1 = alpha0 + H,
         beta1 = beta0 + AB - H,
         new_eb = alpha1 / (alpha1 + beta1))
```
```{r bbmodelscatter, dependson = "career_eb_wAB", echo = FALSE, fig.cap = "The relationship between the original empirical Bayes shrunken estimates and the values under the beta-binomial regression model."}
ggplot(career_eb_wAB, aes(original_eb, new_eb, color = AB)) +
  geom_point() +
  geom_abline(color = "red") +
  xlab("Original EB Estimate") +
  ylab("EB Estimate w/ AB term") +
  scale_color_continuous(trans = "log", breaks = 10 ^ (0:4))
```
We visualize how this changes our estimates in Figure \@ref(fig:bbmodelscatter). Notice that relative to the previous empirical Bayes estimates (shown in Figure \@ref(fig:ebestimatescatter)), this model gives lower estimates for batters with low AB and about the same for high-AB batters. This fits our earlier diagnosis: we had been systematically overestimating batting averages.
We can also revisit the AB/estimate relationship examined in Figure \@ref(fig:abaveragescatter), and see whether the new method fixes the problem of shrinking low-AB batters towards an estimate that is too high (Figure \@ref(fig:abthreemethods)).
```{r abthreemethods, dependson = "career_eb_wAB", echo = FALSE, fig.cap = "The relationship between AB and the estimate for three methods: raw batting average, shrunken batting average, and averages shrunk towards a relationship found through regression."}
library(tidyr)

lev <- c(raw = "Raw H / AB", original_eb = "EB Estimate", new_eb = "EB w/ Regression")

career_eb_wAB %>%
  filter(AB >= 10) %>%
  mutate(raw = H / AB) %>%
  gather(type, value, raw, original_eb, new_eb) %>%
  mutate(mu = ifelse(type == "original_eb", prior_mu,
                     ifelse(type == "new_eb", mu, NA))) %>%
  mutate(type = factor(plyr::revalue(type, lev), lev)) %>%
  ggplot(aes(AB, value)) +
  geom_point() +
  geom_line(aes(y = mu), color = "red") +
  scale_x_log10() +
  facet_wrap(~type) +
  xlab("At-Bats (AB)") +
  ylab("Estimate")
```
Notice that while our original estimate ("EB Estimate") shrank batters towards the overall average, the new estimate ("EB w/ Regression") shrinks them towards the overall *trend* fit by the beta-binomial regression, represented by the red line.[^edgeR]
Don't forget that this change in the posteriors won't just affect shrunken estimates. It will affect all the ways we've used posterior distributions in previous chapters: credible intervals, posterior error probabilities, and A/B comparisons. Improving the model by taking AB into account can help all these results more accurately reflect reality.
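For example, per-player credible intervals now come straight from the updated `alpha1` and `beta1` parameters; a minimal sketch:
```{r new_intervals, dependson = "career_eb_wAB"}
# 95% credible intervals under the AB-dependent prior
career_eb_wAB %>%
  mutate(low = qbeta(.025, alpha1, beta1),
         high = qbeta(.975, alpha1, beta1)) %>%
  dplyr::select(name, AB, new_eb, low, high) %>%
  head()
```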
[^edgeR]: If you work in my old field of gene expression, you may be interested to know that empirical Bayes shrinkage towards a trend is exactly what some differential expression packages such as [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) do with per-gene dispersion estimates. It's a powerful concept that allows a balance between individual observations and overall expectations.