---
output:
  html_document:
    fig_caption: yes
    number_sections: yes
    theme: readable
    toc: yes
    toc_depth: 3
---
# Linear Regression {#chLinReg}
## Learning Objectives for Chapter
1. State the null and alternative hypotheses for simple linear regression (SLR).
1. Describe how variance is partitioned in SLR.
1. Calculate the slope and intercept and their confidence intervals.
1. Interpret what the slope and intercept mean in real, biological terms.
1. Calculate and interpret the coefficient of determination.
1. List examples of uses for simple linear regression.
1. Write the model and assumptions for simple linear regression.
1. List strategies to increase the precision of slope estimates.
1. Determine if a problem requires an estimate for the expected value or for a single observation.
1. Calculate and interpret an ANOVA table for SLR.
1. Discuss the difference between estimating parameters and making predictions in SLR.
1. Build confidence intervals for predictions of expected value and individual responses.
## Introduction to Simple Linear Regression
### Goals of Simple linear regression
Simple linear regression can be used for multiple purposes, the vast majority of which can be classified into two categories:
1. Prediction. Emphasis is on values of Y expected for given levels of X.
2. Explanation or understanding. Emphasis is on regression coefficients.
In the case of **prediction**, the goal of SLR is to provide a means to estimate values of the response variable $Y$, or expected values of the response variable $\mu_{Y|X}$, for values of the independent variable that may not have been observed. In this application we are not really concerned with why Y is related to X or whether they have a cause-and-effect relationship. For example, we may use the relationship between femur length and age to estimate the age at death of fossilized mammals. Obviously, femur length does not cause animals to be younger or older. Similarly, we may use the thickness of the triceps skin pinch to estimate a person's body fat content. The thickness of the pinch does not cause people to have more or less fat.
In the case of **understanding**, the goal is to determine how, and by how much, a causal variable affects another, and how strong their relationship is. For example, we may determine the relationship between total food consumed and weight gain in a beef steer. The food consumed is the cause of the gain, and the relationship is important because it establishes how much gain there is per unit of food consumed: the slope of the relationship is the food conversion efficiency. Here we are specifically interested in having good estimates of the slope and its sampling variance. We are also interested in the variance of $\epsilon$, because it is an indicator of the effect of other factors on gain and food conversion efficiency.
### Fixed and Random X cases
There are two types of situations in simple linear regression and in regression in general: fixed X and random X. The main point of this section is that all equations presented in this chapter are strictly valid for the fixed X case, and are approximations in the random X case.
**Fixed X** means that the values of the predictor or independent variable X are known without error and fixed values. X is not a random variable and its values did not result from a random process or experiment. The theory and calculations presented in this chapter are specifically valid for this situation.
According to [@RossetTibshirani2018], the fixed X case is the most commonly used one, and it is characterized by the following:
1. The covariate values $X_1, \ldots, X_k$ used to estimate parameters are not random but constitute a set of selected or designed values, and the only randomness in the estimates is due to the responses $Y_1, \ldots, Y_k$.
2. The covariate values used for prediction exactly match $X_1, \ldots, X_k$, and the corresponding responses are independent realizations of $Y_1, \ldots, Y_k$, respectively.
Examples of fixed X cases are:
- Application of prescribed quantities of fertilizer to field plots to study yield.
- Previously selected levels of feeding for animals to study weight gain.
- Drug dose given to rats to determine effect on metabolic rate.
- Dose of active principle per acre to study effect on pest density.
**Random X** means that the values of the predictor X are not set by the experimenter, but instead are random values resulting from a random process. The random process may be the process of measurement and/or the process of sampling the individuals or experimental units on which the response and the predictor are measured.
Examples of random X cases are:
- Measure the available P concentration in soil and yield in field plots.
- Measure triceps pinch thickness and body fat content of several people.
- Measure total income and food expenses in a set of randomly selected families.
- Measure the concentration of a hormone in blood and the milk yield of dairy cows.
When there is measurement error in the values of X, one consequence is that the estimate of the slope will be biased toward zero (attenuated). When possible, selecting a wide range of X values relative to the variance of the measurement error ameliorates the problem. Other things being equal, a wide range of X values also increases the precision of the estimates, because all the equations for the variances have the sum of squares of X in a denominator.
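The attenuation effect can be illustrated with a small simulation (a sketch with arbitrary parameter values and variable names, not part of the examples used later). The true slope is 1; X is measured with error, once over a narrow range and once over a wide range, and the average estimated slope is compared. With the wide range of X the average estimated slope should be close to 1, while the narrow range should yield slopes pulled strongly toward 0.
```{r}
set.seed(123)
true.slope <- 1
sigma.eps  <- 1    # sd of the regression errors
sigma.u    <- 2    # sd of the measurement error in X

sim.slope <- function(x.true) {
  y     <- 5 + true.slope * x.true + rnorm(length(x.true), sd = sigma.eps)
  x.obs <- x.true + rnorm(length(x.true), sd = sigma.u)  # X measured with error
  coef(lm(y ~ x.obs))[2]                                 # estimated slope
}

narrow <- replicate(1000, sim.slope(runif(50, 0, 5)))    # narrow range of X
wide   <- replicate(1000, sim.slope(runif(50, 0, 50)))   # wide range of X
c(mean(narrow), mean(wide))    # both attenuated, but much less so for the wide range
```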
### Population Relationship
In simple linear regression, we are interested in the relationship between two continuous variables. More specifically, we are interested in how one variable (Y, the dependent variable) changes with each unit change in another variable (X, the independent variable). The mathematical model that describes how the mean of Y changes as a function of X is that of a straight line. Thus, simple linear regression applies to populations where the relationship between the expected (mean) value of Y and X is of the following form:
<br>
$$Y_i = \beta_0 + \beta_1X_i + \epsilon_i = \mu_{Y|X_i} + \epsilon_i\\[25pt]
\epsilon_{i} \sim N(0, \ \sigma_{\epsilon}^2)$$
<br>
where $Y_i$ is the observed value of variable $Y$ for observation $i$, and $X_i$ is the corresponding value of the predictor variable for that observation. The variable $\epsilon_i$ represents the random vertical deviation of the observed $Y_i$ from the true line that represents the population relationship between Y and X. The "error" $\epsilon$ is normally distributed with mean 0 and variance $\sigma_{\epsilon}^2$.
Note that the true values of Y for each X are not on the line; they have a normal distribution about the line for any given X. That normal distribution has mean equal to $\beta_0 + \beta_1X_i$ and variance $\sigma_{\epsilon}^2$. Thus, the points on the true line are the means of normal populations, and there is one of these normal populations for each possible level of X, not just those levels sampled. Regardless of the mean and the value of X, all the distributions are independent and identical normal distributions with the same variance $\sigma_{\epsilon}^2$.
If the population were known, we could define a region about the line containing the central 95% of the population values. For any level of X, say $X_h$, that region is centered at $\mu_{Y|X_h} = \beta_0 + \beta_1X_h$ and ranges from $\mu_{Y|X_h} - z_{0.025} \ \sigma_{\epsilon}$ to $\mu_{Y|X_h} + z_{0.025} \ \sigma_{\epsilon}$. The point of this is to emphasize that even when the population values are known, there is variation of individual observations about the line.
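As a quick numerical illustration (a minimal sketch using the same assumed population values as the figure below: $\beta_0 = 5$, $\beta_1 = 0.9$, $\sigma_{\epsilon} = 1$), the central 95% region at a chosen $X_h$ is obtained directly from the normal distribution:
```{r}
beta0.pop <- 5     # assumed population intercept
beta1.pop <- 0.9   # assumed population slope
sigma.eps <- 1     # assumed population standard deviation of epsilon
Xh <- 6                                      # a chosen level of X
mu.h <- beta0.pop + beta1.pop * Xh           # population mean of Y at Xh
mu.h + c(-1, 1) * qnorm(0.975) * sigma.eps   # central 95% of individual Y values at Xh
```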
$\beta_0$ and $\beta_1$ are called *regression coefficients*, and each has its own special meaning. $\beta_0$ is the **intercept**, the value of $\mu_{Y|X}$ when $X$ is equal to 0. $\beta_1$ is the **slope**, and it represents the change in the mean of Y for each unit change in X. For example, in Figure \@ref(fig:slrPop) the slope is 0.9, which means that when X increases by 1, $\mu_{Y|X}$ increases by 0.9. It is extremely important to consider the units of Y, X, $\beta_0$ and $\beta_1$, both for prediction and for understanding.
<br>
```{r slrPop, echo = F, fig.cap = "Scatterplot of population values and true regression line. Each dot represents a pair of X and Y values from the population. Imagine that the whole population of dots is plotted. The blue line is the true population line, which represents the mean of the response variable Y for each value of X. Each point deviates vertically from the mean for that level of X (the point on the line) by an amount that has a normal distribution and is called epsilon. The epsilon for one point is represented in red. The slope of the line, called beta, is the change in the mean of Y per unit change in X. Usually, only a sample of points is available, so the blue line and the parameters are not known."}
library(latex2exp)
library(shape)
set.seed(400)
x <- 1:5000/500
Y <- 5 + 0.9 * x +rnorm(5000, sd = 1)
plot(x, Y, pch = ".", # xaxt = "n", yaxt = "n",
xlab = "X", ylab = "Y",
xlim = c(0,10), ylim = c(0,17),
bty = "l", cex.lab = 1.5, cex = 1,
main = "Population line and X-Y values")
lines(x = c(6, 6), y = c(8, 10.4), col = "red", lty = 1, lwd = 2)
points(6, 8, pch = 20)
lines(0:10, 5 + 0.9 * 0:10, lwd = 2, col = "blue")
text(7.5, 9, labels = TeX("$\\epsilon_i = \\mu_{Y|X} - Y_i$"),cex = 1.5, col = "red")
arrows(x0 = 0.5, y0 = 5.45, x1 = 4, y1 = 5.45, code = 0, lwd = 2, lty = 2)
arrows(x0 = 4, x1 = 4, y0 = 5.45, y1 = 8.6, code = 0, lwd = 2, lty = 2)
text(2.5, 4.5, labels = expression(paste(Delta,"X")), cex = 1.5)
text(4.3, 7, labels = expression(paste(Delta,"Y")), cex = 1.5)
text(4.2, 4,expression(beta[1] == over(Delta*Y,Delta*X)), cex = 1.5)
text(2, 14, labels =TeX("$\\mu_{Y|X} = \\beta_0 + \\beta_1 X$"), cex = 1.7, col = "blue")
text(2, 12.5, TeX("$\\mu_{Y|X} = 5 + 0.9 X$"), cex = 1.6, col = "blue")
Arrows(2.2, 11.7, 3, 8.1, lwd = 1.5, col = "blue")
```
<br>
**Units**
Imagine that the leaf expansion rate of a grass species is studied as a function of temperature. If you were told that the slope of the relationship is 50, you would not know whether temperature has a strong or weak effect on leaf expansion, because the number given has no units. The number means something only when given with its proper units. Suppose that the units are $mm \cdot month^{-1} \cdot degree \ C^{-1}$. This means that for each degree C that the temperature increases, the expansion rate increases by 50 mm per month, which is quite a large effect. In this case the units are:
<br>
$$Y: \quad mm \cdot mo^{-1} \\[20pt]
X: \quad degrees \ C \\[20pt]
\beta_1: \quad mm \cdot mo^{-1} \ degrees \ C^{-1}\\[20pt]
\beta_0: \quad mm \cdot mo^{-1}$$
<br>
## Sample relationship and sampling distributions
The problem in simple linear regression is to estimate $\beta_0$, $\beta_1$ and $\sigma_{\epsilon}^2$ using a random sample of $\{X, Y\}$ pairs. With those estimates we can estimate the position of the line for any $X$ and our uncertainty about that position, as well as the uncertainty about any single population value for that value of $X$.
We use observed values of X and Y to estimate the parameters $\beta_0$ and $\beta_1$ with the method called **ordinary least squares**. This method chooses the values of $\hat{\beta_0}$ and $\hat{\beta_1}$ that minimize the sum of squares of the residuals, the vertical distances from the observations to the estimated line. You can think of this as trying many, many lines and picking the one that yields the smallest SSE. Figure \@ref(fig:leastsquaresline) illustrates a least squares line versus alternative lines we could use to describe the relationship between X and Y.
<br>
```{r leastsquaresline, echo = F, warning = F, fig.cap = "The solid line is the line of 'best fit' that minimizes the sum of the squared deviations between the line and the observations, which are represented by the points. The dashed lines represent candidate lines with different regression coefficients ($\\beta_0$ and $\\beta_1$) that do not minimize the SSE."}
set.seed(400)
x <- 0:10
Y <- x + rnorm(length(x))
plot(x, Y, pch = 20,
xaxt = "n", yaxt = "n",
xlab = "X", ylab = "Y",
xlim = c(0,10), ylim = c(0,10),
bty = "l")
lines(0:10, 0:10, lwd = 2)
slopes <- seq(0, 1, .20)
b0 <- sort(seq(0,5, 1), decreasing = T)
for(i in 1:6){
y <- b0[i] + x*slopes[i]
lines(0:10, y, lty = 3, lwd = 1.5)
}
```
<br>
You may recall that the sample average is the least squares estimate of the population mean. The same principle is applied here, except that because the population mean changes with X, a different algorithm is necessary to find the least squares estimates. The solution consists of expressing the SSE as a function of estimated parameter values, calculating the partial derivatives of SSE with respect to each parameter, setting those to zero and solving for parameter values. The values obtained are estimated parameter values that change from sample to sample.
The ordinary least squares procedure results in the following equations to estimate the regression coefficients using a sample with *k* observations or X-Y pairs:
<br>
\begin{equation}
\hat{\beta_1} = \frac{\sum_{i=1}^k (X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^k (X_i-\bar{X})^2} (\#eq:beta1)
\end{equation}
<br>
The equation for $\beta_1$ includes $\sum_{i=1}^k (X_i-\bar{X})(Y_i-\bar{Y})$, called the **sum of cross-products**, which is involved in the calculation of the covariance between random variables.
<br>
\begin{equation}
\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \ \bar{X}
(\#eq:beta0)
\end{equation}
$$\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1} \ X_i \qquad \quad \text{estimated Y}$$
<br>
The equation to calculate $\hat{\beta_0}$ determines that the estimated line always passes through the point $\{\bar{X},\bar{Y}\}$. Because they are calculated using the sample values, $\hat{\beta_0}$, $\hat{\beta_1}$ and $\hat{Y}$ are random variables that vary from sample to sample. These estimates have the following properties:
**Unbiasedness**
The mean or expectation of the estimated parameters over many, many independent samples equals the true value of the parameter.
<br>
$$E\{\hat{\beta_0}\} = \beta_0 \\[20pt]
E\{\hat{\beta_1}\} = \beta_1 \\[20pt]
E\{\hat{Y_i}\} = \mu_{Y|X_i}$$
<br>
**Normality**
Each of the estimates is a random variable with a normal distribution.
<br>
$$\hat{\beta_0} \sim N \left (\beta_0, \ \sigma^2_{\hat{\beta_0}} \right ) \\[20pt]
\hat{\beta_1} \sim N \left (\beta_1, \ \sigma^2_{\hat{\beta_1}} \right ) \\[20pt]
\hat{Y}_{X_h} \sim N \left (\mu_{Y|X_h}, \ \sigma^2_{\hat{Y}} \right ) $$
<br>
**Variance**
<br>
$$V\{\hat{\beta_0}\} = \sigma^2_{\hat{\beta_0}} = \sigma^2_{\epsilon} \ \left [ \frac{1}{k} + \frac{\bar{X}^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ] \\[30pt]
V\{\hat{\beta_1}\} = \sigma^2_{\hat{\beta_1}} = \frac{\sigma^2_{\epsilon}}{\sum_1^k \ (X_i - \bar{X})^2} $$
<br>
The variance of $\hat{Y}$ depends on whether the estimated Y is for the expected value of Y given a value of X or for a single value of Y given X. The first case is an **estimation**, because we are guessing the value of a fixed quantity or parameter: $\mu_{Y|X}$. The second case is a **prediction**, because we are guessing the value of a random variable $Y_i|X$.
**Estimation** of $\mu_{Y|X_h}$. The uncertainty comes only from the uncertainty about $\beta_0$ and $\beta_1$, which is represented by the sampling variances of $\hat{\beta_0}$ and $\hat{\beta_1}$.
<br>
$$V\{\hat{\mu}_{Y|X_h}\} = \sigma^2_{\epsilon} \ \left [ \frac{1}{k} + \frac{(X_h - \bar{X})^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ]$$
<br>
**Prediction** of $Y_h|X_h$, a single new observation at $X_h$. In this case there is uncertainty about the unknown $\beta_0$ and $\beta_1$, plus the uncertainty due to the variance of $\epsilon$. Even if $\beta_0$ and $\beta_1$ were known, guessing $Y_h|X_h$ would amount to predicting the value of a random normal number drawn from a population with mean $\mu_{Y|X_h}$ and variance $\sigma^2_{\epsilon}$. Thus, the equation for the variance of $\hat{Y}_h|X_h$ includes an extra $\sigma^2_{\epsilon}$.
<br>
$$V\{\hat{Y}_h|X_h\} = \sigma^2_{\epsilon} \ \left [ 1 + \frac{1}{k} + \frac{(X_h - \bar{X})^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ]$$
<br>
Usually, the true population variance $\sigma^2_{\epsilon}$ is not known, so we estimate it with the mean squared error, calculated as follows. Once the estimated line is obtained, there is a vertical distance between each observation and the line; these distances are called *residuals* or *errors*. The residuals are estimates of the true but unobservable $\epsilon$ for each observation. The sum of squares of the residuals, called SSE, divided by its degrees of freedom is the best estimate of the variance of $\epsilon$.
<br>
$$Y_i = \hat{\beta_0} + \hat{\beta_1} \ X_i + e_i = \hat{Y}_i + e_i \\[25pt]
SSE = \sum^k_{i=1} \ e_i^2 = \sum^k_{i=1} \ (Y_i - \hat{Y}_i)^2 \\[25pt]
dfe = k - 2 \\[25pt]
\hat{\sigma}^2_{\epsilon} = MSE = \frac{SSE}{dfe}$$
<br>
Therefore, the variances of the estimated coefficients and responses are estimated as:
$$\hat{V}\{\hat{\beta_0}\} = \hat{\sigma}^2_{\hat{\beta_0}} = MSE \ \left [ \frac{1}{k} + \frac{\bar{X}^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ] \\[35pt]
\hat{V}\{\hat{\beta_1}\} = \hat{\sigma}^2_{\hat{\beta_1}} = \frac{MSE}{\sum_1^k \ (X_i - \bar{X})^2} \\[35pt]
\hat{V}\{\hat{\mu}_{Y|X_h}\} = MSE \ \left [ \frac{1}{k} + \frac{(X_h - \bar{X})^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ] \\[35pt]
\hat{V}\{\hat{Y}_h|X_h\} = MSE \ \left [ 1 + \frac{1}{k} + \frac{(X_h - \bar{X})^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ]$$
<br>
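The sketch below (simulated data with arbitrary names and values, not from the example that follows) computes the least squares estimates, the MSE, and each of the estimated variances above, and checks the coefficient standard errors against the output of `lm()`.
```{r}
set.seed(42)
k  <- 20
xs <- runif(k, 0, 10)                    # hypothetical X values
ys <- 5 + 0.9 * xs + rnorm(k, sd = 1)    # hypothetical responses

SSX <- sum((xs - mean(xs))^2)
b1  <- sum((xs - mean(xs)) * (ys - mean(ys))) / SSX   # estimated slope
b0  <- mean(ys) - b1 * mean(xs)                       # estimated intercept
MSE <- sum((ys - (b0 + b1 * xs))^2) / (k - 2)         # estimate of the error variance

Xh <- 7                                                  # a level of X of interest
var.b0   <- MSE * (1 / k + mean(xs)^2 / SSX)             # estimated variance of the intercept
var.b1   <- MSE / SSX                                    # estimated variance of the slope
var.mean <- MSE * (1 / k + (Xh - mean(xs))^2 / SSX)      # variance for the estimated mean at Xh
var.ind  <- MSE * (1 + 1 / k + (Xh - mean(xs))^2 / SSX)  # variance for a predicted individual at Xh

sqrt(c(var.b0, var.b1))                     # standard errors computed "by hand"
coef(summary(lm(ys ~ xs)))[, "Std. Error"]  # the same standard errors reported by lm()
```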
## Fitting the least squares line
An example of a linear relationship between an independent continuous variable (X) and a dependent continuous variable (Y) would be the relationship between annual rainfall and rice crop yield (kg/ha) in India. Figure \@ref(fig:rice) shows observed rice crop yields in India from 1990--2000 against annual rainfall levels.
```{r rice, echo = F, warning = F, fig.cap="Rice yields and rainfall in India from 1990 to 2000."}
rice <- read.csv("Datasets/rainYield.csv")
plot(rice$rain.mm, rice$yield.kg.ha, xlab = "Annual rainfall (mm)", ylab = "Rice crop yield (kg/ha)", cex.lab = 1.5, cex = 1.25)
abline(lm(yield.kg.ha ~ rain.mm, rice))
```
In Table \@ref(tab:ricetab), we show the values and calculations needed to find the regression coefficients of the least squares regression line.
```{r ricetab, echo = F}
library(kableExtra)
X <- rice$rain.mm
Y <- rice$yield.kg.ha
meanX <- round(mean(X),1)
meanY <- round(mean(Y),1)
deltaX <- round(X-meanX,1)
deltaY <- round(Y-meanY,1)
cross <- round(deltaX*deltaY,1)
SSX <- round(deltaX^2,1)
sums <- c("","","", "Sums", round(sum(cross),1), round(sum(SSX),1))
df <- data.frame(X = X, Y = Y, deltaX = deltaX, deltaY = deltaY, XY = cross, SSX = SSX)
df <- rbind(df, sums)
knitr::kable(
df,
col.names = c("$X_i$", "$Y_i$", "$(X_i - \\overline{X})$", "$(Y_i - \\overline{Y})$", "$\\Delta X$$\\Delta Y$", "$\\Delta X^2$"),
caption = "Table shows the rice crop yields in India (Y) from 1990--2000, along with annual rainfall (X), the difference between each X$_i$ and \\overline{X}, the difference between each Y$_i$ and \\overline{Y}, the product of the X and Y deviations from their respective means ($\\Delta X$$\\Delta Y$), and the squared deviations of X$_i$ from \\overline{X}",
align = rep("r", 6)
) %>% row_spec(12, bold = T)
```
We can now use equation \@ref(eq:beta1) to find the slope of the regression line that minimizes the error in the observations around the line:
\begin{equation}
\begin{aligned}
\sum_{i=1}^k (X_i-\overline{X})(Y_i-\overline{Y}) & = -60241.5 \\
\sum_{i=1}^k (X_i-\overline{X})^2 & = 93019.4 \\
\hat{\beta}_1 & = \frac{\sum (X_i-\overline{X})(Y_i-\overline{Y})}{\sum (X_i-\overline{X})^2} \\
\hat{\beta}_1 & = -60241.5/93019.4 \\
\hat{\beta}_1 & = -0.65
\end{aligned}
\end{equation}
And we can use equation \@ref(eq:beta0) to find the intercept:
\begin{equation}
\begin{aligned}
\hat{\beta}_0 & = \overline{Y} - \hat{\beta}_1\overline{X} \\
\overline{Y} & = 2064 \\
\hat{\beta}_1 & = -0.65 \\
\overline{X} & = 1130.7 \\
\hat{\beta}_0 & = 2064 - (-0.65) \times 1130.7 \\
\hat{\beta}_0 & = 2799
\end{aligned}
\end{equation}
This gives us the fitted regression equation:
\begin{equation}
\hat{Y}_i = 2799 - 0.65 \times X_i
(\#eq:fittedrice)
\end{equation}
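These hand calculations can be verified in R using the `rice` data frame loaded above; small differences from the rounded hand values are expected.
```{r}
X <- rice$rain.mm
Y <- rice$yield.kg.ha
(b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2))  # estimated slope
(b0 <- mean(Y) - b1 * mean(X))                                     # estimated intercept
```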
And we can plot this line on the original data scatterplot:
```{r riceline, echo = F, warning = F, fig.cap="Observed rice crop yields in India from 1990–2000 against annual rainfall levels and the least squares regression line ($\\beta_0 = 2799$kg/ha, $\\beta_1 = -0.65$)"}
rice <- read.csv("./Datasets/rainYield.csv")
plot(rice$rain.mm, rice$yield.kg.ha, xlab = "Annual rainfall (mm)", ylab = "Rice crop yield (kg/ha)", cex.lab = 1.5, cex = 1.25)
yhat <- 2799 - 0.65*rice$rain.mm
lines(rice$rain.mm, yhat, lwd = 2)
```
We have shown you above how to fit a simple linear regression line to describe the relationship between a predictor variable X and a dependent variable Y. We can state the direction of the relationship (negative) and report an effect size based on the estimated slope: for every additional mm of annual rainfall, predicted rice crop yield decreases by 0.65 kg/ha. We can even state what we would expect the crop yield to be in a year with no rain at all, which is simply the intercept (2,799 kg/ha), although a prediction that far outside the range of observed rainfall is an extrapolation and should be treated with caution.
But how do we know whether the relationship that our regression line describes is significant? If you observe the scatter of the observations around the line in Figure \@ref(fig:riceline), you can see that the line does not perfectly predict crop yield given the annual rainfall. Those deviations are the residuals, $e_i = Y_i - \hat{Y}_i$, which estimate the errors $\epsilon_i$ in the population model. The residual error is the random variation in crop yield (Y) that is not explained by our regression line; in other words, it is the variation in Y that is not explained by X. Is the residual error so large that our model of the relationship between X and Y is useless? In the next section, we will demonstrate how to test whether the relationship between X and Y is significant, given the variation we see in Y.
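Before moving on, note that the residuals of the fitted rice model can be extracted directly in R; the short sketch below shows that they are simply the differences between observed and fitted yields, and that their sum of squares is the residual sum of squares used in the next section.
```{r}
rice.fit <- lm(yield.kg.ha ~ rain.mm, data = rice)
e <- resid(rice.fit)       # residuals: observed minus fitted yields
head(cbind(observed = rice$yield.kg.ha, fitted = fitted(rice.fit), residual = e))
sum(e^2)                   # residual sum of squares
```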
## Analysis of variance of regression
In Chapter 9, we introduced you to a method called the analysis of variance. The general idea of analysis of variance is that we can partition the total variance of the dependent variable Y into random variation and variation explained by the independent variable X, in order to test whether there is a significant effect of X on Y. In simple linear regression, we partition the total sum of squares of Y (SSY) into the sum of squares explained by the regression line (SSR) and the sum of squares not explained by the regression line, also called the residual sum of squares (RSS): $SSY = SSR + RSS$. In Figure \@ref(fig:anovaslr), we show you each component of variation in Y. The total variation in Y comes from the deviation of each $Y_i$ from $\overline{Y}$. The regression line explains some of that variation through the deviations of each $\hat{Y}_i$ from $\overline{Y}$. The remaining, residual variation is in the deviations of each observation $Y_i$ from $\hat{Y}_i$.
```{r anovaslr, echo = F, fig.cap = "Components of variation in Y. The total sum of squares represents the total variation in Y (SSY $= \\sum (Y_i - \\overline{Y})^2$). The sum of squares of the regression represents the variation that is explained by the regression line, and by extension, X (SSR $= \\sum (\\hat{Y}_i - \\overline{Y})^2$). The residual sum of squares represents the random variation in the scatter of Y observations around the regression line (RSS $= \\sum (Y_i - \\hat{Y}_i)^2$)."}
########make a figure and label it slr#####
set.seed(400)
x <- 1:10
Y <- x +rnorm(10)
plot(x, Y, pch = 20, xaxt = "n", yaxt = "n", xlab = "X", ylab = "Y", xlim = c(0,10), ylim = c(0,10), bty = "l", cex.lab = 1.5)
lines(0:10, rep(mean(Y),11), lty = 5, lwd = 1.5)
text(1, mean(Y)+1, expression(bar(Y)), cex = 1.5)
arrows(x0 = 2, y0 = Y[2], y1 = mean(Y), code = 0, lwd = 1.15, lty = 3)
text(1, 3.75, expression(Y[i] - bar(Y)), cex = 1.5)
text(-1,5, "Y" , cex = 1.75)
text(5, -1.5,"X", cex = 1.75 )
lines(0:10, 0:10)
arrows(x0 = 6, y0 = 6, y1 = Y[6], code = 0,lty = 3, lwd = 1.15)
text(6.5, 3.75, labels = expression(Y - hat(Y) == epsilon[i]),cex = 1.5)
arrows(x0 = 8, y0 =mean(Y), y1 = 8, code = 0, lty=3, lwd = 1.15)
text(9, 7.25, labels = expression(hat(Y) - bar(Y)),cex = 1.5)
```
From the sums of squares (SS), we can obtain estimates of the **variances**, which are called the mean squares (MS). A MS is simply the SS divided by its corresponding degrees of freedom. In a simple linear regression, we are interested in comparing the variance explained by the regression (MSR) with the residual variance (MSr). Specifically, we wish to obtain an F-ratio, which we calculate as the ratio of the variance explained by the regression to the residual variance, $\frac{MSR}{MSr}$.
We can summarize this information in an analysis of variance table.
```{r aovtable, echo = F, warning = F}
source <- c("Total", "Regression", "Residual")
df <- c("k-1", 1, "k-2")
SS <- c("$\\sum (Y_i - \\overline{Y})^2$", "$\\sum (\\hat{Y}_i - \\overline{Y})^2$", "$\\sum (Y_i - \\hat{Y}_i)^2$")
MS <- c("","MSR", "MSr")
Finfo <- c("", "F","")
df <- cbind(source, df, SS, MS, Finfo)
knitr::kable(df,
col.names = c("source", "df", "SS", "MS", "F-ratio"),
caption = "Analysis of variance table.")
```
We can use the F-ratio to determine the significance of the regression.
Specifically, the null hypothesis that we are interested in testing is that the regression coefficient ($\beta_1$) = 0. If $\beta_1$ is equal to 0, then for each unit change in X, Y does not change, and there is no relationship between X and Y.
We will now demonstrate how to do an analysis of variance on the dataset from the previous section on rice crop yields in India.
First, we must calculate the different sources of variation in Y:
```{r SS, warning = F, echo = F}
library(kableExtra)
X <- rice$rain.mm
Y <- rice$yield.kg.ha
meanX <- mean(X)
meanY <- mean(Y)
TS <- round((Y-meanY)^2, 1)
yhat <- 2799 - 0.65*rice$rain.mm
SR <- round((yhat-meanY)^2, 1)
RS <- round((Y - yhat)^2, 1)
sumlabels <- c("","","","SSY", "SSR", "RSS")
sums <- c("","","", round(sum(TS),1), round(sum(SR),1),round(sum(RS),1))
df <- data.frame(X, Y, yhat, TS, SR, RS)
df <- rbind(df, sumlabels, sums)
knitr::kable(
df,
col.names = c("X$_i$", "Y$_i$", "$\\hat{Y}_i$", "$(Y_i - \\overline{Y})^2$", "$(\\hat{Y}_i - \\overline{Y})^2$", "$(Y_i - \\hat{Y}_i)^2$"),
caption = "Table shows the rice crop yields in India (Y) from 1990--2000, along with annual rainfall (X), predicted crop yields based on regression equation \\@ref(eq:fittedrice) ($\\hat{Y}_i$), the squared deviations in Y$_i$ from $\\overline{Y}$, the squared deviations between Y$_i$ and $\\hat{Y}_i$, and the squared deviations between $\\overline{Y}$ and $\\hat{Y}$. The final row shows the sums of the squared deviations for the corresponding columns, representing the total sum of squares (SSY), sum of squares due to regression (SSR), and residual sum of squares (RSS)",
align = rep("r", 6)
) %>% row_spec(12:13, bold = T)
```
Now we can construct an analysis of variance table:
```{r riceaov, echo = F, warning = F}
source <- c("Total", "Regression", "Residual")
df <- c(10, 1, 9)
SS <- c(190919.4, 39300.9, 151906)
MS <- c("",39300.9, 16878.4)
Finfo <- c("", 2.3,"")
df <- cbind(source, df, SS, MS, Finfo)
knitr::kable(df,
col.names = c("source", "df", "SS", "MS", "F-ratio"),
caption = "Analysis of variance table.")
```
We now have an F-ratio which we can compare to a critical F-ratio given a significance probability threshold, from an F distribution with the appropriate degrees of freedom. The degrees of freedom that are relevant are from the numerator ($df_1$) and denominator ($df_2$) of the F-ratio. In the example above, $df_1 = 1$ and $df_2 = 9$.
We can plot an F distribution with those degrees of freedom in R, using the following code:
Create a sequence of F values which will be plotted on the X axis of the F probability density distribution:
```{r}
x <- seq(0.25,6,0.1)
```
Generate probability densities for the F values in x:
```{r}
pd <- df(x, df1 = 1, df2 = 9, ncp = 0)
```
Plot the probability density of an F distribution with df1 = 1, df2 = 9:
```{r}
plot(x, pd,
type = "l",
xlab = "F value", ylab = "Probability density")
```
We will use the typical threshold of probability for significance, where $\alpha = 0.05$ and obtain the critical F value corresponding to that probability:
```{r}
criticalF <- qf(p = 1 - 0.05,
df1 = 1, df2 = 9)
```
Note that we specified 0.05 instead of $\alpha/2$ = 0.025 because the F test for a linear regression is one-tailed.
Add a red dashed vertical line where that critical F value is located and a mark where our test F-ratio falls. Note that the "col" argument changes the color of the line, and "lty = 2" changes the line type to a dashed line.
```{r}
plot(x, pd,
type = "l",
xlab = "F value", ylab = "Probability density")
abline(v = criticalF,
col = "red",
lty = 2)
points(2.3, df(2.3,1,9), pch = 4)
text(2.3, 0.15, "F-ratio")
```
As you can see, the F-ratio we calculated from our dataset on rice crop yields and annual rainfall was not significant: the test F-ratio was much smaller than the critical F value. We can calculate a p value for our test F-ratio using the following code.
Note that below we specify `lower.tail = F` so that R gives us the probability in the upper tail, to the right of where our F-ratio falls on the probability density function.
```{r}
pf(2.3,
df1 = 1,
df2 = 9,
lower.tail = F)
```
This p value indicates that there is a 0.16 probability of seeing the results that we saw, or more extreme results, if the null hypothesis were true. This is larger than $\alpha = 0.05$. We therefore fail to reject our null hypothesis that the slope of our regression line is equal to 0, and state that the relationship was not significant.
We can also use a built-in statistics function in R to do the simple linear regression all at once:
```{r}
X <- rice[,1] #annual rainfall
Y <- rice[,2] #crop yield
summary(lm(Y ~ X))
```
Note that the F-ratio, the p value, and the regression coefficients are the same as what we calculated above. The estimate of (Intercept) is the same as $\beta_0$, though it differs slightly from our hand calculations due to rounding error. The estimate of the "X" coefficient is the same as $\beta_1$.
The summary output from the lm() function in R also reports a t value and a p value, Pr(>|t|), for each of the coefficients. We are generally only interested in the p value for the predictor variable coefficients, and can ignore the t test result for the $\beta_0$ coefficient. The p value reported for the "X" coefficient is the same as the p value reported for the F test. This is because there is only one X predictor in a simple linear regression, so all of the explained variance in Y is from the single X variable, annual rainfall. A t test on the $\hat{\beta}_1$ coefficient for X in this case therefore yields the same result as the F test.
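This equivalence is easy to verify from the fitted model object (a quick check using the `rice` data frame loaded earlier): the square of the t value for the slope equals the F statistic.
```{r}
fit  <- lm(yield.kg.ha ~ rain.mm, data = rice)
tval <- coef(summary(fit))["rain.mm", "t value"]          # t value for the slope
c(t.squared = tval^2, summary(fit)$fstatistic["value"])   # t^2 equals the F statistic
```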
## Confidence Interval for $\hat{\beta_1}$
We can construct confidence intervals for an estimate of $\beta_1$ using the calculated MSr as an estimate of the population variance. The standard error of the regression coefficient is:
\begin{equation}
s_b = \sqrt{\frac{MSr}{\sum (X_i - \overline{X})^2}}
(\#eq:sb)
\end{equation}
Note the two components of the standard error of our estimate of $\beta_1$: the MSr and the sum of squares of X. If we want to increase our certainty in the slope of the regression, we can reduce the residual variance (for example, by controlling extraneous sources of variation) or increase $\sum (X_i - \overline{X})^2$ by taking more observations or by measuring Y over a wider range of X values.
In addition to $s_b$, we need a t value to construct a confidence interval around the estimate of $\beta_1$. We use the degrees of freedom from the residual variance (k-2) and calculate a t value under the null hypothesis that the true population $\beta_1$ = 0. We can use the following formula to obtain the t value:
\begin{equation}
t = \frac{\hat{\beta}_1 - \beta_1}{s_b},
df = k-2
(\#eq:tsb)
\end{equation}
The confidence interval can then be calculated as:
\begin{equation}
\hat{\beta}_1 \pm t_{\alpha, k-2} \times s_b
(\#eq:CIb)
\end{equation}
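As a sketch of these calculations for the rice example (values will differ slightly from the rounded hand computations), the 95% CI for the slope can be computed from equations \@ref(eq:sb) and \@ref(eq:CIb) and checked against `confint()`:
```{r}
fit <- lm(yield.kg.ha ~ rain.mm, data = rice)
MSr <- sum(resid(fit)^2) / df.residual(fit)          # residual mean square
SSX <- sum((rice$rain.mm - mean(rice$rain.mm))^2)    # sum of squares of X
sb  <- sqrt(MSr / SSX)                               # standard error of the slope
coef(fit)["rain.mm"] + c(-1, 1) * qt(0.975, df.residual(fit)) * sb  # 95% CI "by hand"
confint(fit)["rain.mm", ]                            # same interval from confint()
```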
## Confidence Interval for $\hat{\beta}_0$
We can also construct confidence intervals for an estimate of $\beta_0$. The standard error of the intercept is:
\begin{equation}
s_a = \sqrt{MSr\Bigg(\frac{1}{k} + \frac{\overline{X}^2}{\sum (X_i - \overline{X})^2}\Bigg)}
(\#eq:sa)
\end{equation}
where k is the number of observations.
Just as in the calculation of a t value for the $\beta_1$ confidence intervals, we use the degrees of freedom from the residual variance (k-2) to obtain a t value under the null hypothesis that the true population intercept = 0. We can use the following formula to obtain the t value:
\begin{equation}
t = \frac{\hat{\beta}_0 - \beta_0}{s_a},
df = k-2
(\#eq:tsa)
\end{equation}
The confidence interval can then be calculated as:
\begin{equation}
\hat{\beta}_0 \pm t_{\alpha, k-2} \times s_a
(\#eq:CIa)
\end{equation}
## Confidence Intervals for $\hat{Y}_i$
The predicted value for Y at a given X$_i$ can be considered in two different ways. For X$_i$, $\hat{Y}_i$ can represent the expected mean of Y, or it can represent the expected value of an individual Y. The predicted value will be the same for either case, but our certainty in the estimate will be different.
Predicting a single Y means incorporating the random variation of observation i. Recall that this variation is represented by $\epsilon_i$ in the regression model, and that visually it is represented by the scatter of points around the regression line on a scatterplot. We must account for the individual variance of the Y observations at X$_i$, so the interval for an individual $\hat{Y_i}$ will be wider.
Just as in the calculations for the standard error of $\hat{\beta}_1$ and $\hat{\beta}_0$, we will use the degrees of freedom of the MSr (k-2) in our calculations.
The standard error of $\hat{Y_i}$ for an expected mean of Y is:
\begin{equation}
s_{\hat{\overline{Y}}} = \sqrt{MSr\Bigg(\frac{1}{k} + \frac{(x_i - \overline{x})^2}{\sum (X_i - \overline{X})^2}\Bigg)}
(\#eq:symean)
\end{equation}
while the standard error of $\hat{Y_i}$ for an individual Y is
\begin{equation}
s_{\hat{Y}} = \sqrt{MSr\Bigg(1 + \frac{1}{k} + \frac{(x_i - \overline{x})^2}{\sum (X_i - \overline{X})^2}\Bigg)}
(\#eq:sy)
\end{equation}
Note the addition of 1 in the calculation of $s_{\hat{Y}}$ to account for the additional residual variance when predicting an individual Y$_i$.
The confidence interval for $\hat{\overline{Y}}$ is
\begin{equation}
\hat{\overline{Y}} \pm t_{\alpha, k-2} \times s_{\hat{\overline{Y}}}
(\#eq:CIsymean)
\end{equation}
and the confidence interval for $\hat{Y}$ is
\begin{equation}
\hat{Y} \pm t_{\alpha, k-2} \times s_{\hat{Y}}
(\#eq:CIsy)
\end{equation}
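In R, both intervals can be obtained with `predict()`. The sketch below uses an arbitrary rainfall value of 1100 mm, within the observed range, to contrast the confidence interval for the mean yield with the wider prediction interval for an individual year.
```{r}
fit    <- lm(yield.kg.ha ~ rain.mm, data = rice)
newdat <- data.frame(rain.mm = 1100)                     # a rainfall value within the observed range
predict(fit, newdata = newdat, interval = "confidence")  # interval for the mean yield at X = 1100
predict(fit, newdata = newdat, interval = "prediction")  # wider interval for an individual year
```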
## Coefficient of Determination, $R^2$
The coefficient of determination ($R^2$) is the proportion of variance that is explained by the regression line. If a large proportion of the variation in Y is not due to the regression line, then even if you obtain significant results, X is not a very good predictor of Y. The formula for $R^2$ is
\begin{equation}
R^2 = \frac{SSR}{SSY}
(\#eq:r2)
\end{equation}
For a simple linear regression with only one X predictor, $R^2$ can also simply be referred to as $r^2$.
In some cases, we may choose to adjust the $R^2$ value ($R^2_{adj}$) for the degrees of freedom used by the regression, which is only 1 in a simple linear regression.
\begin{equation}
R^2_{adj} = 1 - \frac{MSr}{MSY}
(\#eq:radj)
\end{equation}
where $MSY = SSY/(k-1)$ is the total mean square of Y.
Using the example of rice crop yields in India, we can calculate an $R^2$ of
\begin{equation}
\begin{aligned}
R^2 & = \frac{39300.9}{190919.4} \\
& = 0.21
\end{aligned}
(\#eq:r2rice)
\end{equation}
We can multiply $R^2$ by 100 to state that 21% of the variation in annual rice crop yields was explained by annual rainfall.
We can also calculate $R^2_{adj}$ as
\begin{equation}
\begin{aligned}
R^2_{adj} & = 1 - \frac{MSr}{MSY} \\
R^2_{adj} & = 1 - \frac{16878.4}{190919.4/10} \\
& = 0.12
\end{aligned}
(\#eq:radjrice)
\end{equation}
There is a large reduction from $R^2$ to $R^2_{adj}$ because of the small number of observations ($k = 11$). The $R^2_{adj}$ from datasets with large sample sizes will match $R^2$ more closely.
Let's look again at the summary output from the lm() function in R:
```{r}
X <- rice[,1] #annual rainfall
Y <- rice[,2] #crop yield
summary(lm(Y ~ X))
```
The lm() function reports that the "Multiple R-squared" = 0.2043, while the "Adjusted R-squared" is 0.12, which are equivalent to our calculations above. The small difference between our calculated $R^2$ and the "Multiple R-squared" from R is again due to rounding error.
Our results were not significant, but if they were, then $R^2 = 20.4\%$ and $R^2_{adj} = 11.6\%$ would indicate that a fairly large proportion of the variance in Y is not explained by X. This is one way of ascertaining lack of fit in the regression model.
## Correlation Coefficient, r
The correlation coefficient, r, is a measure of the linear association between two variables. It does not distinguish between a fixed X variable and a Y dependent variable the way regression does, and it does not imply a causal relationship.
We calculate r using the following formula:
\begin{equation}
r = \frac{\sum(X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum (X_i - \overline{X})^2 \times \sum (Y_i - \overline{Y})^2}}
(\#eq:r)
\end{equation}
The correlation coefficient ranges from -1 to 1. A perfect positive linear relationship between two variables results in $r = 1$, while a perfect negative linear relationship results in $r = -1$. No linear relationship is indicated by $r = 0$.
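For the rice data, r can be computed from equation \@ref(eq:r) or with the built-in `cor()` function; the short check below also confirms that, in SLR, $r^2$ equals the $R^2$ reported by `lm()`.
```{r}
r <- cor(rice$rain.mm, rice$yield.kg.ha)   # correlation between rainfall and yield
c(r = r, r.squared = r^2)                  # r^2 equals the R-squared from the regression
cor.test(rice$rain.mm, rice$yield.kg.ha)   # test of H0: the population correlation is 0
```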
## Fitness of the regression model
Another aspect of using linear regression to test the significance of a relationship between X and Y is whether or not the model is appropriate for the data.
If there is more than one observation for at least some levels of X, we can perform a lack-of-fit test to determine whether the deviations from the straight line are larger than expected from the residual variation alone, and thus whether the straight-line model should be rejected.
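One common way to carry out a lack-of-fit test in R, when Y is measured more than once at the same X levels, is to compare the straight-line model with a model that fits a separate mean at each X level. The sketch below uses simulated data with replicated X values (the rice example has no replicates, so the data and names here are purely illustrative).
```{r}
set.seed(1)
x.rep <- rep(c(1, 2, 3, 4, 5), each = 4)                # replicated X levels
y.rep <- 2 + 3 * x.rep + rnorm(length(x.rep), sd = 1)   # responses generated from a straight line
line.fit <- lm(y.rep ~ x.rep)           # straight-line model
full.fit <- lm(y.rep ~ factor(x.rep))   # separate mean for each X level
anova(line.fit, full.fit)               # lack-of-fit test; a small p-value rejects the line
```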
## Exercises and Solutions
## Homework
## Laboratory Exercises {#LabCh13PLS}
Prepare an .Rmd document starting with the following text, where you substitute the corresponding information for author name and date.
```
---
title: "Lab13 Simple Linear Regression"
author: "YourFirstName YourLastName"
date: "enter date here"
output: html_document
---
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Plant Sciences
In simple linear regression (SLR) we study the relationship between a *predictor, explanatory, or independent* variable (X, horizontal axis) and a *response or dependent* variable (Y, vertical axis). Different goals may be achieved with SLR. We may be interested in determining how much Y changes as X changes, or we may be interested in estimating the value of Y for values of X that were not observed. All goals require that we estimate parameters of a linear function for a straight line that best fits the data, which is represented by the linear model:
$$Y_{i} = \beta_{0} + \beta_{1}X_{i} + \epsilon_{i} \\ \epsilon_{i} \sim N(0, \ \sigma^2_{\epsilon})$$
Best fit is achieved by minimizing the sum of squares of residuals, the vertical distances between the observations and the fitted line. The estimated parameters are the *slope* ($\beta_{1}$), the tangent of the angle between the line and the horizontal, and the *intercept* ($\beta_{0}$), the height of the line when X = 0. Numerically, the *slope* is the rate of change of the Y (dependent) variable with the X (predictor) variable, or in other words, how the values of Y change with each unit change in the X variable. The *intercept* is the value the Y variable takes when the X variable is equal to 0.
In this exercise we use the clover.txt data set used in lab02. These data represent the mass of clover plants grown for different periods at three different temperatures. Temperatures are in the first column coded as 1 for 5-15 C, 2 for 10-20 C and 3 for 15-25 C. The second column contains the number of days of growth and the last column contains the log of the plant mass in grams (g). Column names are in the first row of the file, so we specify header = TRUE in the line of code to read the data. Data are placed in a data frame named "clover".
The relationship between plant size (Y, log transformed) and plant age (X, in days) is modeled as a straight line whose slope is the relative growth rate (RGR) of the plants. Relative growth rate is the amount of growth per unit time and per unit size of the plant. Relative growth rate is to plant size as interest rate is to a debt.
FOR THIS LAB WE IGNORE THE TEMPERATURE GROUPS.
RGR means relative growth rate and is estimated by the slope of the regression.
#### Part 1. Plot of data and estimation of parameters. [25 points]
1A) Make a scatterplot of plant size vs. age, ignoring the temperature groups.
1B) Obtain estimates for the slope and intercept and interpret the results in terms of plant growth.
```{r}
clover <- read.csv("./Datasets/Lab01clover.txt", header = TRUE)
str(clover)
plot(lnwt ~ days, data = clover)
slr1 <- lm(formula = lnwt ~ days, data = clover)
summary(slr1)
```
**ANSWER THE FOLLOWING QUESTIONS:**
1B) What are the estimates for the slope and intercept? Interpret the results in terms of plant growth.
#### Part 2. Test of null hypothesis and R-square. [30 points]
Test the null hypothesis that RGR = 0 by performing an ANOVA for regression. Do the calculations "by hand" and then repeat using R functions.
```{r}
coef(slr1) # Use help(coef) to understand what this function does
intcpt <- coef(slr1)[1] # extract estimated intercept, or "a" as it is called in the SG textbook
slope <- coef(slr1)[2] # Code needed here. Extract the estimated slope or b, just like we did with the intercept.
lnwt.hat <- intcpt + slope * clover$days # creates a vector of estimated or predicted lnwt’s for each of the measured days using the regression model coefficients
errors <- clover$lnwt - lnwt.hat # creates a vector of the differences between the sample lnwt measurements and the predicted weights from the model, i.e., these are the “residuals”
(TotalSS <- sum((clover$lnwt - mean(clover$lnwt)) ^ 2)) # total sums of squares
(SSErrors <- sum(errors ^ 2)) # Code needed here. Type in the vector of the residuals to calculate residual sum of squares
(SSReg <- TotalSS - SSErrors) # regression sums of squares
n <- length(clover$lnwt) # Code needed here. Complete the code with the column that contains the observations
(dfe <- n - length(coef(slr1))) # df of the residuals
(dfTotal <- n - 1) # Complete the code to calculate the total df
(dfReg <- 1) # Complete the code to calculate the df of the regression
(MSReg <- SSReg / dfReg)
(MSE <- SSErrors / dfe) # Complete the code to calculate the MSE
(Fcalc <- MSReg / MSE) # Calculated F statistic from the data
(Fcritical <- qf(0.95, dfReg, dfe)) # Complete the function arguments with the degrees of freedom for the numerator and the denominator to get the F critical value.
anova(slr1) # check your previous “by hand” calculation against the anova table.
```
**ANSWER THE FOLLOWING QUESTIONS:**
2A) Do you reject the Ho: $\beta_{1}$ = 0? Why? What does this mean in terms of plant growth, in non-technical terms.
2B) What proportion of the variance of lnwt is explained by days, the predictor?
#### Part 3. Make a 95% confidence interval for the RGR or slope. [25 points]
First, do the calculations by hand, using the formulas from the textbook (pages 179-180). Then use R functions.
```{r}
# Hint: check your answers with the anova table. It should be the same if your calculations are correct
(tcritical.slr1 <- qt(0.975, dfe)) # Code needed here. Enter the probability for a 2-tailed test with alpha = 0.05
SSX <- sum((clover$days - mean(clover$days)) ^ 2) # get sum of square of X
(se.beta <- sqrt(SSErrors / (dfe * SSX))) # calculate the s.e. for beta
beta.lo <- slope - tcritical.slr1 * se.beta # lower CI boundary.
beta.hi <- slope + tcritical.slr1 * se.beta # Code needed here. Complete the formula for the the upper CI boundary
c(beta.lo, beta.hi) # Display the lower and upper CI extremes
confint(slr1)[2,] # easier way to get the CI using the function from R
```
**ANSWER THE FOLLOWING QUESTIONS:**
3A) What is the 95% confidence interval for the relative growth rate (RGR) of the plants?
#### Part 4. Make a 95% confidence interval for mean plant size at a given age. [20 points]
Assume that you are interested in using the fitted regression model (the line) to estimate the mean mass of plants when they are 33 days old. Use the regression equation to make the estimate and then calculate the standard error of the estimate to make a confidence interval for it. Do detailed calculations first and then use R functions.
You will first calculate the predicted mean of Y for 33 days and its standard error and make a confidence interval for it. Then you will calculate the standard error for a predicted individual Y and compare it.
```{r}
(lnwt.hat.33 <- intcpt + slope * 33) # In other words, Y = a + bX, where Y, or the predicted value, is the weight of plants at 33 days.
(mass.hat.33 <- exp(lnwt.hat.33)) # Since the response variable is log transformed, here we need to do a back transformation.
#Calculate the standard error of the predicted mean of Y:
se.lnwt.33.m <- sqrt(MSE * (1 / n + (33 - mean(clover$days)) ^ 2 / SSX)) # SG book pg. 192 and pg. 205 No. 6
# Complete the code to calculate the upper and lower CI for the mass of plants at 33 days:
(CI.lnwt.up <- lnwt.hat.33 + tcritical.slr1 * se.lnwt.33.m)
(CI.lnwt.lo <- lnwt.hat.33 - tcritical.slr1 * se.lnwt.33.m)
pred.data <- data.frame(lnwt = NA, group = NA, days = 33) # Here we create a data frame that we will use to “predict” or estimate lnwt in the next line. Only the value for days is included since that is the X, independent, or predictor variable.
(ci.r <- predict(slr1, newdata = pred.data, interval = "confidence")) # this line uses predict() to predict or estimate lnwt and its CI, as we did by hand above.
# The interval = “confidence” argument makes the function return a confidence interval for the estimated log of plant mass.
exp(ci.r) # Complete the code to do a back transformation on the predicted values.
# Calculate the standard error of the predicted individual Y:
se.lnwt.33.ind <- sqrt(MSE * (1 + 1 / n + (33 - mean(clover$days)) ^ 2 / SSX)) # SG book pg. 205 No. 7
```
**ANSWER THE FOLLOWING QUESTIONS:**
4a) Explain why we need to use exp().
4b) Report the upper and lower extremes of the CI for the predicted mean of Y
4c) Is the standard error estimate for a predicted mean of Y or a predicted individual Y based on the model bigger? Why?
### Animal Sciences
The same instructions for completion and submission of work used in previous labs apply here. Refer to previous labs for the details.
In simple linear regression (SLR) we study the relationship between a *predictor, explanatory or independent* variable (X, horizontal axis) and a *response or dependent* variable (Y, vertical axis). Different goals may be achieved with SLR. We may be interested in determining how much Y changes as X changes, or we may be interested in estimating the value of Y for values of X that were not observed. All goals require that we estimate parameters of a linear function for a straight line that best fits the data:
$$Y_{i} = \beta_{0} + \beta_{1}X_{i} + \epsilon_{i} \\ \epsilon_{i} \sim N(0, \ \sigma^2_{\epsilon})$$
Best fit is achieved by minimizing the sum of squares of residuals, the vertical distances between the observations and the fitted line. The estimated parameters are the *slope* ($\beta_{1}$), the tangent of the angle between the line and the horizontal, and the *intercept* ($\beta_{0}$), the height of the line when X = 0. Numerically, the *slope* is the rate of change of the Y (dependent) variable with the X (predictor) variable, or in other words, how the values of Y change with each unit change in the X variable. The *intercept* is the value the Y variable takes when the X variable is equal to 0.
In this exercise we will examine the relationship between body weight and heart weight in cats. These data were originally published in a short paper by R.A. Fisher in 1947 (see paper in CANVAS). R.A. Fisher was a British statistician who developed the foundations of modern statistical theory.
Data for the exercise consist of body weight (Bwt) in kilograms and heart weight (Hwt) in grams of 144 cats. We will examine the relationship between heart weight and body weight in cats.
#### Part 1. Plot of data and estimation of parameters. [25 points]
1A) Make a scatterplot of body weight vs. heart weight.
1B) Obtain estimates for the slope and intercept and interpret the results in terms of heart weight.
```{r}
cats <- read.csv("./Datasets/Lab13Cats.csv")
str(cats)
plot(Bwt_Kg ~ Hwt_g, data = cats)
slr1 <- lm(formula = Bwt_Kg ~ Hwt_g, data = cats)
summary(slr1)
```