---
output:
html_document:
fig_caption: yes
number_sections: yes
theme: readable
toc: yes
toc_depth: 3
---
```{r setup0, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE, include=FALSE}
# load packages for chapter
options(scipen = 999)
options(digits = 10)
library(bookdown)
library(emmeans)
library(ggplot2)
library(dplyr)
library(kableExtra)
library(knitr)
library(tables)
library(pander)
library(multcomp)
library(agricolae)
library(nlme)
library(car)
library(tidyr)
```
# Analysis of Variance {#chAnova}
## Learning Objectives for Chapter
1. Translate scientific questions into null and alternative hypotheses.
1. State the null and alternative hypotheses for ANOVA.
1. State the logical structure of testing differences among means using ANOVA.
1. Write down the model for ANOVA and label its components.
1. Define experimental unit and determine number of experimental units in an experiment.
1. State the typical assumptions for ANOVA.
1. Test ANOVA assumptions.
1. Describe how variance is partitioned in a simple one-way ANOVA.
1. Calculate sum of squares and the corresponding degrees of freedom.
1. Compare ANOVA with t-tests for hypothesis testing.
1. Calculate experimental error with and without subsamples.
1. Run an ANOVA and interpret the results in terms of the original scientific question.
## Introduction to ANOVA
In the previous chapter we used t-tests to test the null hypothesis that two means are equal. The general idea was that if sample means differ much more than expected when the true means are equal, we conclude that the means are not equal. In each test we incurred a risk of making a mistake, either by rejecting a true null hypothesis or by failing to reject a false one. People are particularly keen on making sure that they do not reject true null hypotheses^[There may be some reasons to favor concern about type I vs. type II error, but neither error is inherently more troubling. It is more useful to assess the relative importance of errors by thinking about their consequences. What happens if you fail to reject a false hypothesis? What happens if you reject one that is true?]. This means that we want to make sure that we control the probability of making a type I error, called $\alpha$, and that we keep it at the selected value, usually 0.05.
However, there are situations when researchers want to test more than two treatments at the same time. For example, a researcher may want to test the effects of four different fertilizers on grain yield, or of four different diets on milk quality. If we used independent t-tests to assess the hypothesis that all means are equal, with four treatments (A, B, C, D) we would have to do 6 tests (AB, AC, AD, BC, BD, CD), each with a probability of error of 0.05. If the tests are independent, the probability of making at least one error in the set of comparisons is $1 - 0.95^6 = `r round(1-0.95^6, 3)`$, which is much greater than the nominal 0.05. In an experiment with 10 treatments, one can make 45 comparisons. If the comparisons are independent, which is the worst-case scenario, the probability of making at least one type I error is $1 - 0.95^{45} = 0.90$!
Analysis of variance helps correct this problem by performing a single test of the null hypothesis, regardless of the number of treatments. Preview: the remedy is limited if the null hypothesis is rejected (Be the first to ask why in lecture and get an Easter egg!! Make sure to mention that you want your Easter egg.)
```{block, type='mydef'}
The probability of making an error in one test is $\alpha$ and it is called the *test-wise* error rate.\
The probability of making at least one error in a set or "family" of tests is called *family-wise* error rate.\
For all tests done in an experiment, the error rate is called *experiment-wise*.
```
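The error-rate arithmetic above is easy to verify directly in R; a quick sketch (the numbers match the examples in the text):

```{r fwer-check}
# Probability of at least one type I error among m independent tests
alpha <- 0.05
m4  <- choose(4, 2)    # 6 pairwise comparisons among 4 treatments
m10 <- choose(10, 2)   # 45 pairwise comparisons among 10 treatments
round(1 - (1 - alpha)^c(m4, m10), 2)   # 0.26 and 0.90
```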
Before we present the method in detail, we illustrate it with an intuitive example. The null hypothesis of equality of all means is rejected if the variation among groups or sample averages is much larger than expected on the basis of the variation within samples. All objects in each of the square groups below were randomly selected from a single larger set called *parent* set. The set of objects in any square is called a *group*. Does it look like all groups have the same parent?
```{r anova.intuition1, echo=FALSE, out.width='60%', fig.cap="Does it look like all groups come from the same parent population?"}
# Code to generate 4 groups from two populations.
library(randomcoloR)
set.seed(1005)
par(mfrow = c(2,2), mai = c(1, 0.1, 0.1, 0.1))
for (i in 1:4) {
plot(rep(1:3, 3),
rep(1:3, each = 3),
pch = 21,
cex = 6,
xlim = c(-1, 5),
ylim = c(0.5, 3.5),
bg = randomColor(count = 9,
hue = c(rep("orange", 3), "yellow")[i],
luminosity = c(rep("light", 3), "bright")[i]),
axes = FALSE,
xlab = paste("GROUP", i, sep = " "),
ylab = "")
}
```
Is the answer the same for the next figure?
```{r anova.intuition2, echo=FALSE, out.width='60%', fig.cap="Does it look like all groups come from the same parent population?"}
# Code to generate 4 groups from a single population.
set.seed(3833)
par(mfrow = c(2,2), mai = c(1, 0.1, 0.1, 0.1))
for (i in 1:4) {
plot(rep(1:3, 3),
rep(1:3, each = 3),
pch = 21,
cex = 6,
xlim = c(-1, 5),
ylim = c(0.5, 3.5),
bg = randomColor(count = 9,
hue = "orange",
luminosity = "light"),
axes = FALSE,
xlab = paste("GROUP", i, sep = " "),
ylab = "")
}
```
Chances are that you thought one group was different from the rest in the first case, whereas no group stood out in the second case. Your eye (actually, mostly your brain) automatically assessed the variability between and within groups and compared them. One group in the first picture differed more from the rest than expected based on the variation of color within groups. This is the basis for ANOVA.
The intuitive recognition of group differences above is probably not the same for all people because of variation in the way people perceive and process color. The subjective method to assess differences is not reliable. Therefore, a formal and objective method has to be used. Most of the rest of this chapter consists of explaining how the intuition is formalized to allow calculations that are not subjective and that lead to the same results, regardless of the person doing the test.
In analysis of variance we use data to calculate the following:
1. Total variation among observations, variation between groups and variation within groups.
2. Ratio of between-group to within-group variance and its expected statistical distribution.
3. Probability of obtaining a ratio equal to or more extreme than the observed one if means are all equal.
If the ratio (calculated F-value) observed is too extreme, we reject the null hypothesis.
The main idea behind ANOVA is based directly on the most important equation in this course:
<br>
\begin{equation}
\sigma^2_{\bar{Y}} = \frac{\sigma_Y^2}{n} \quad \text{ for populations} \\[20pt]
S^2_{\bar{Y}} = \frac{S_Y^2}{n} \quad \text{ for samples} \\[20pt]
\end{equation}
<br>
```{block, type = 'stattip'}
- The variance of the average is the variance of the original random variable divided by the size of the sample used to calculate the average.
```
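The stat tip above can be checked with a short simulation (the mean, standard deviation, and sample size below are chosen only for illustration):

```{r var-avg-sim}
# Simulate many sample averages and compare their variance to sigma^2 / n
set.seed(123)
sigma <- 3
n <- 25
avgs <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
var(avgs)   # should be close to 3^2 / 25 = 0.36
```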
When there are several treatments, and IF THE TREATMENTS HAVE NO EFFECTS on the response variable, the observations in each treatment are a sample from the same population, and all samples come from the same population. Therefore, there is one sample from that common population for each treatment in the experiment. Each sample provides an average, so we have as many averages as there are treatments. Because we have many averages, we can estimate the variance among averages $\sigma^2_{\bar{Y}}$ directly, using the usual formula for the estimated variance of a random variable applied to the k independent averages:
$$S^2_{\bar{Y}} = \frac{1}{k-1} \sum_{i=1}^k (\bar{Y}_i - \bar{Y}_{..})^2$$
We obtain an estimate of the variance of Y by multiplying the estimated variance of the averages by the sample size used to calculate each average, i.e., the number of replicates r:
$$S^2_Y = r \ S^2_{\bar{Y}}$$
Note that this calculation results in a valid estimate of the variance ONLY if all means and variances are equal.
We can also estimate the variance using the pooled variance:
$$S^2_Y = \frac{1}{k \ (r - 1)} \sum_{i=1}^k \sum_{j=1}^r (Y_{ij} - \bar{Y}_{i.})^2$$
The pooled variance is the average of the squared deviations of each observation from its sample average, and thus it does not require the means to be equal to be a valid estimate of the common variance for all treatments. This pooled variance is based on the same concept used for comparing two means with independent samples in [Chapter 8](#Case1). Pooling variances is desirable because it leads to better precision and narrower confidence intervals. A pooled estimated variance is based on more observations and is therefore more stable (the variance of the estimated variance is smaller!). Moreover, the variance of Student's t distribution decreases as the degrees of freedom increase. Pooling leads to smaller critical t-values, which means more power (lower probability of a type II error) and narrower confidence intervals.
If indeed the first estimate of the variance above, based on the variation among treatment averages, is a valid one, the quotient of the first estimate over the second, based on the pooled variance, should follow an F distribution with values about 1.0. We saw this when testing for [differences between independent variance estimates](#compare2Variances). The statistical distribution resulting from the quotient of two variance estimates is called an F-distribution, which was fully described in the *F Distribution* section of the [Probability chapter](#chProb). If the calculated quotient, called the "calculated F-value," is too far from 1.0, we conclude that the two variances are not equal; therefore, the estimate of the variance based on the averages is not valid, and the averages are considered too different to have come from the same population. Thus, the reasoning goes, at least two means must be different, and the null hypothesis is rejected. This is further illustrated with a numerical example below.
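A small simulation sketches this logic: when all samples really do come from one population, the two variance estimates agree and their ratio hovers around 1. The values of k, r, the mean, and the standard deviation below are illustrative:

```{r two-estimates-sim}
# Two estimates of the common variance when H0 is true
set.seed(42)
k <- 4
r <- 6
y <- matrix(rnorm(k * r, mean = 10, sd = 2), nrow = r)  # one column per sample
est.among  <- r * var(colMeans(y))    # based on variation among averages
est.pooled <- mean(apply(y, 2, var))  # pooled within-sample variance
est.among / est.pooled                # F-like ratio, near 1 under H0
```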
## Model and Partitioning of Variance
When the different components of an experiment are identified and the effects are additive, then we can use a linear equation to calculate the value of any observation in the experiment. When we assign each of k treatments randomly to r replicates, we have an experimental design called "Completely Randomized Design" or CRD. The model for this design states that each observation is a random draw from a normal distribution that has a potentially different mean for each treatment, but with a common variance for all treatments. Moreover, the treatment mean is expressed as an overall mean plus a treatment effect. The model is written as follows:
<br>
\begin{align}
&Y_{ij} = \mu_{..} + \tau_i + \epsilon_{ij} \quad \quad i = 1, \dots, k, \quad j = 1, \dots, r \\[10pt]
&\epsilon_{ij} \sim iid \ N(0, \sigma^2) \\[15pt]
&\quad \quad \text{where} \\[15pt]
&\mu_{..} \quad \text{is the overall mean} \quad \mu_{..} = \frac{1}{k}\sum_{i = 1}^{k} \mu_{i.}\\[15pt]
&\tau_i \quad \text{is the effect of treatment i } \\[15pt]
&\epsilon_{ij} \quad \text{is the experimental error for treatment i and replicate j} \\[15pt]
& iid \quad \text{means "independent and identically distributed"}
(\#eq:modelCRD)
\end{align}
<br>
When we do not know the parameters of the actual distributions, which is usually the case, parameters are estimated and each observation is partitioned as follows:
<br>
\begin{align}
&Y_{ij} = \hat{\mu}_{..} + \hat{\tau}_i + \hat{\epsilon}_{ij} \\[10pt]
&= \bar{Y}_{..} + \hat{\tau}_i + e_{ij} \\[15pt]
&= \bar{Y}_{i.} + e_{ij} \quad \quad \text{where} \\[15pt]
&\bar{Y}_{..} \quad \text{is the overall average calculated as } \frac{1}{k}\sum_{i = 1}^{k} \bar{Y}_{i.}\ \\[15pt]
&\bar{Y}_{i.} \quad \text{is the average for treatment i} \\[15pt]
&e_{ij} \quad \text{is the residual error for treatment i and repetition j} \\[15pt]
(\#eq:estModelCRD)
\end{align}
<br>
Based on the model above, each observation is partitioned into components:
- overall mean, which is estimated with the average of treatment averages,
- treatment effects, which are estimated as the differences between treatment averages and overall average,
- error, which is estimated with the residual or difference between observation and treatment average.
<br>
\begin{align}
Y_{ij} &= \bar{Y}_{..} + \hat{\tau}_i + \hat{\epsilon}_{ij} \\[15pt]
Y_{ij}- \bar{Y}_{..} &= \hat{\tau}_i + \hat{\epsilon}_{ij} \\[15pt]
Y_{ij}- \bar{Y}_{..} &= (\bar{Y}_{i.} - \bar{Y}_{..})+ (Y_{ij} - \bar{Y}_{i.}) \\[15pt]
(\#eq:obsPartCRD)
\end{align}
<br>
The total deviation from observation to overall average is partitioned into the deviation from treatment average to overall average plus residual, where the residual is the difference between observation and treatment average. It can be shown that the sum of all total deviations squared equals the sum of squared estimated treatment effects plus the sum of squared residuals: total sum of squares equals treatment sum of squares plus residual sum of squares.
<br>
\begin{align}
\sum_{i=1}^k \sum_{j=1}^r (Y_{ij}- \bar{Y}_{..})^2 &= \sum_{i=1}^k \sum_{j=1}^r (\bar{Y}_{i.} - \bar{Y}_{..})^2 + \sum_{i=1}^k \sum_{j=1}^r (Y_{ij} - \bar{Y}_{i.})^2 \\[15pt]
TSS &= SST + SSE\\[15pt]
\text{where} \\[15pt]
TSS &= \sum_{i=1}^k \sum_{j=1}^r (Y_{ij}- \bar{Y}_{..})^2 \\[15pt]
SST &= \sum_{i=1}^k \sum_{j=1}^r (\bar{Y}_{i.} - \bar{Y}_{..})^2 = \ r \ \sum_{i=1}^k (\bar{Y}_{i.} - \bar{Y}_{..})^2\\[15pt]
SSE &= \sum_{i=1}^k \sum_{j=1}^r (Y_{ij} - \bar{Y}_{i.})^2\\[15pt]
(\#eq:ssPartCRD)
\end{align}
<br>
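The additivity of the sums of squares is easy to verify numerically; a sketch with a small made-up data set of three groups of four observations:

```{r ss-partition-check}
# Numerical check that TSS = SST + SSE
y   <- c(1, 2, 2, 3,  2, 3, 2, 3,  3, 4, 3, 2)   # 3 groups of 4 (made up)
grp <- rep(c("A", "B", "C"), each = 4)
gm  <- mean(y)        # overall average
tm  <- ave(y, grp)    # treatment average repeated for each observation
TSS <- sum((y - gm)^2)
SST <- sum((tm - gm)^2)
SSE <- sum((y - tm)^2)
c(TSS = TSS, SST = SST, SSE = SSE)   # 7 = 2 + 5
```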
## Degrees of freedom
Degrees of freedom are quantities associated with sums of squares. Each sum of squares has an associated number that is called its degrees of freedom. We operationally define **degrees of freedom** associated with a sum of squares as the number of different terms in the sum minus the number of independent parameters estimated to be used in the calculation.
In reality, degrees of freedom refer to the number of dimensions in which vectors can vary. The basic mathematics of statistics involves high-dimensional spaces and subspaces. In short, and for those who are curious, a sample of size N is a vector in N dimensions and has N degrees of freedom because it can vary freely over N dimensions. Once a sample is set, the overall average is a vector in N dimensions with each coordinate equal to the average of the observations. The average vector and the observation vector define a hyperplane in N-1 dimensions where the difference between observation and average resides. The difference $Y_{i} - \bar{Y}_{.}$ is the vector of deviations or residuals (for the simplest model $Y_i = \mu + \epsilon_i$) from observation to average. Given a fixed sample, this vector of residuals has only N-1 degrees of freedom, because it must stay in that subspace. The restriction comes from the fact that, given the overall average and N-1 of the observation coordinates, the last one can be calculated as $Y_N = N \ \bar{Y}_. - \sum_{i=1}^{N-1} Y_i$.
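The constraint in the last sentence is easy to demonstrate with a tiny sample (the numbers are arbitrary):

```{r df-constraint}
# With the average and N-1 observations fixed, the last observation is determined
y <- c(3, 4, 8)
N <- length(y)
N * mean(y) - sum(y[-N])   # recovers the last observation
```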
This concept is illustrated for a sample consisting of three observations, so we can plot them in a 3D figure.
<br>
```{r 3Dsample, echo=FALSE, warning=FALSE, fig.cap="Three-dimensional representation of a sample with 3 observations. The longest arrow (hypotenuse) represents the sample vector. The base of the triangle is the average and the third side is the vector of deviations from the average to the observation."}
# library(rgl)
smpl <- c(3, 4, 8)
avg <- rep(mean(smpl), 3)
rsdl <- smpl - avg
xyz0 <- c(0, 0, 0)
xyz.tips <- rbind(smpl, avg, smpl)
xyz.orig <- rbind(xyz0, xyz0, avg)
colnames(xyz.tips) <- c("obs1", "obs2", "obs3")
colnames(xyz.orig) <- c("obs1", "obs2", "obs3")
ssr <- rsdl %*% rsdl
#
# plot3d(xyz.tips,
# xlab = "observation 1",
# ylab = "observation 2",
# zlab = "observation 3",
# box = FALSE,
# xlim = c(0, 10),
# ylim = c(0, 10),
# zlim = c(0, 10))
#
#
# arrow3d(xyz0, xyz.tips[1,], type = "rotation", col = "grey", barblen = 0.05, width = 0.4)
# arrow3d(xyz0, xyz.tips[2,], type = "rotation", col = "lightblue", barblen = 0.05, width = 0.4)
# arrow3d(xyz.tips[1,], xyz.tips[2,], type = "rotation", col = "maroon", barblen = 0.05, width = 0.4)
# planes3d(a = -4,
# b = 5,
# c = -1,
# d = 0,
# alpha = 0.4)
library(plot3D)
arrows3D(x0 = xyz.orig[, 1],
y0 = xyz.orig[, 2],
z0 = xyz.orig[, 3],
x1 = xyz.tips[, 1],
y1 = xyz.tips[, 2],
z1 = xyz.tips[, 3],
xlim = c(0, 10),
ylim = c(0, 10),
zlim = c(0, 10),
phi = 35,
theta = 450,
lwd = 2,
d = 3,
bty = "g",
ticktype = "detailed")
```
<br>
Note that the three arrows form a right triangle with the observation vector as the hypotenuse and the average and residual vectors as the other two sides. Therefore, the squared length of the observation vector equals the squared length of the average vector plus the squared length of the residual vector. Applying the Pythagorean theorem, the squared length of the observation vector is the sum of squares of the individual observations; the squared length of the average vector is 3 times the squared average; and the squared length of the residual vector is the sum of squared residuals or deviations. The squared length of the average vector is called the "correction factor" because it is used to calculate the total variation about the average by subtracting it from the squared length of the observation vector.
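This decomposition can be verified numerically for a three-observation sample like the one plotted above:

```{r pythagoras-check}
# Squared length of observation vector = correction factor + residual SS
y    <- c(3, 4, 8)
avg  <- rep(mean(y), length(y))   # average vector
rsdl <- y - avg                   # residual vector
c(sum(y^2), sum(avg^2) + sum(rsdl^2))   # both equal 89
```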
A really good reading to understand the concept of degrees of freedom is Walker [@Walker1940]. The following quote is taken from that paper:
>"A canoe or an automobile moves over a two-dimensional surface which lies upon a three-dimensional space, is a section of a three-dimensional space. At any given moment, the position of the canoe, or auto, can be given by two coordinates. Referred to a four-dimensional space-time universe, three coordinates would be needed to give its location, and its path would be a space of three dimensions, lying upon one of four."
>
> --- Walker (1940)
Thus, the car and canoe move in a 4-dimensional space but they have only 3 degrees of freedom, because they must move on a preset surface. Given the surface, their position in space and time is completely determined by 3 numbers. Analogously, the residuals exist in an N-dimensional space, but they must lie on the hyperplane defined by the observation and the mean vector, which has N-1 dimensions or degrees of freedom. The geometric interpretation of sums of squares is applicable to all linear models, and in particular to the few that we study in this course. Although realistic graphical representations can only be made for the simplest models, the graphical analogies are valid for all models. If you find the geometric interpretation easier to understand than the equations, let your instructors know and they will help you pursue better understanding using the geometric approach.
**Total degrees of freedom**
Consider an experiment where r plots are assigned randomly to each of k treatments (i indexes treatments and j indexes replicates or plots). Such an experiment is a completely randomized design and has a total of N = r k plots or experimental units. The concept of experimental unit is explained in detail in the [Chapter about Experimental Design](#chEdesign). For now, just think of an experimental unit as a unit of experimental material (e.g., animal, group of animals, pot, land plot, Petri dish, etc.) that yields an observation independent of the rest of the observations.
When we use the average to estimate the overall mean, the set of deviations from the overall average has only rk-1 degrees of freedom (df). Thus, the total sum of squares or length of the vector of residuals has rk-1 df. This is called the total degrees of freedom associated with the total sum of squares of a set of samples.
<br>
\begin{align}
\text{Total sum of squares } &= TSS = \sum_{i=1}^k \sum_{j=1}^r (Y_{ij} - \bar{Y}_{..})^2 \\[15pt]
\text{Total df } &= r \ k \quad \text{observations} - 1 \ \text{estimated mean}
\end{align}
<br>
Using the operational definition above we can calculate the total df as follows: the summation has rk independent terms (one for each observation), but we use one parameter estimate, the estimate of the overall mean. Therefore the total df are rk-1.
**Degrees of freedom of treatments**
<br>
\begin{align}
&\text{Sum of squares of Treatments } \\[15pt]
= SST &= \sum_{i=1}^k \sum_{j=1}^r (\bar{Y}_{i.} - \bar{Y}_{..})^2 \\[15pt]
&= r \sum_{i=1}^k (\bar{Y}_{i.} - \bar{Y}_{..})^2 \\[15pt]
\text{Treatment df } &= k \quad \text{treatments} - 1 \ \text{estimated mean}
\end{align}
<br>
The summation has k independent terms, one for each treatment, and it uses the estimate of the overall mean. As a result, the treatment degrees of freedom equals k-1.
**Degrees of freedom of residuals**
<br>
\begin{align}
\text{Residual sum of squares } &= SSE = \sum_{i=1}^k \sum_{j=1}^r (Y_{ij} - \bar{Y}_{i.})^2 \\[15pt]
\text{Residual df } &= r \ k \quad \text{observations} - k \ \text{estimated treatment means }\\[15pt]
= dfe &= k(r-1)
\end{align}
<br>
The summation has rk independent terms, one for each observation, and it uses the k estimates of the treatment means. Therefore, the residual degrees of freedom are rk - k = k(r-1).
Both sum of squares and degrees of freedom are additive in the sense that the total is equal to the sum of the components:
<br>
\begin{align}
TSS &= SST + SSE \\[15pt]
\text{df Total } &= \text{df Treatment }+ dfe
\end{align}
<br>
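The bookkeeping above can be sketched in a few lines; k and r below are illustrative values:

```{r df-additivity}
# Degrees of freedom for a CRD with k treatments and r replicates
k <- 3
r <- 4
df.total <- r * k - 1     # 11
df.trt   <- k - 1         # 2
df.error <- k * (r - 1)   # 9
df.total == df.trt + df.error   # additivity holds
```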
## ANOVA Table
The equations above show how each observation and the total variation are partitioned into treatment and residual. Total sum of squares is partitioned into SS of treatments and SS of residual or error ^[We try to use the term "residual" for the calculated values, which are estimates of the unknown true errors. We mention the term "error" because most books use "SSE" to refer to the SS of residuals. The "E" comes from "error."]. Each sum of squares has an associated number of degrees of freedom. By dividing the SS by the df we obtain **mean squares**. The mean square of residuals (MSE) is the best estimate of the common variance for all samples, and it does not depend on treatment means being equal. The mean square of treatments (MST) has an expectation equal to the common variance PLUS a component that is directly proportional to the differences among means. Only when means are equal is the MST another estimate of the variance of the error. See the optional section on Expected Mean Squares for details.
<br>
|Source | df | Sum of Squares (SS) | Mean Squares (MS) | Calculated F |
|:----------------:|:--------:|:-------------------:|:-----------------:|:------------:|
|Total | k r - 1 | TSS | | |
|Among Treatments | k - 1 | SST | MST | MST / MSE |
|Within Treatments | k (r-1) | SSE | MSE | |
<br>
Sums of squares, mean squares, and the calculated F are all statistics. The values based on one sample are specific realized values of random variables, which have their own expected values, variances, and distributions. If the null hypothesis that all treatment means are equal is true and all assumptions are correct, the calculated F has an F distribution with mean or expectation equal to $dfe/(dfe - 2)$, where dfe is the residual or error degrees of freedom.
```{block, type = 'stattip'}
- When MST is much greater than MSE, calculated F will be much greater than expected if treatment means were equal.
- If calculated F is greater than critical F value, or equivalently, if the probability of an F more extreme than calculated F is smaller than $\alpha$, we REJECT Ho.
```
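The critical value and the expectation of the F distribution are available directly in R; the degrees of freedom below are illustrative (e.g., k = 3 treatments with r = 4 replicates):

```{r f-crit}
df1 <- 2   # treatment df, k - 1
df2 <- 9   # error df, k(r - 1)
qf(0.95, df1, df2)   # critical F at alpha = 0.05 (about 4.26)
df2 / (df2 - 2)      # expectation of the F distribution (about 1.29)
```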
## Assumptions of ANOVA
In order for the ANOVA results and interpretation to be valid, the following assumptions must be true:
a. the variable of interest must be normally distributed
b. each experimental unit must be independent from any other experimental unit
c. variances of the different treatments must be all equal ("homogeneous")
d. treatment and residual effects must be additive
These assumptions are part of the model above and are important for ANOVA, although some deviation from certain assumptions is not always a grave problem. The main part of the model states clearly that each observation is composed of three additive parts: overall mean, treatment effect, and error. Further, the errors (one for each observation) are assumed to be independent and identically distributed, following a normal distribution with mean 0 and common variance $\sigma^2$ (or standard deviation $\sigma$).
<br>
```{block, type = 'stattip'}
ANOVA involves two tests of equality of variances:
1. The focal test that MST and MSE both estimate the same variance, which is a test of $H_0$.
2. The ancillary test that variances within treatments are equal, which is a corroboration of the assumption of homogeneity of variances.
```
<br>
## ANOVA example
We will study the ANOVA procedure with an example. First, the analysis is presented with the theory but as concisely as possible, from research question to conclusion, using R functions. Then, each step is explained in detail and calculations are presented step by step.
This is a fictitious example with elements based on actual research. Alfalfa geneticists and breeders created two transgenic varieties with genetic modifications of the metabolic pathways that synthesize lignin. Lignin is indigestible fiber that reduces the nutritional quality of forages. A common, unmodified variety and the two GMO varieties need to be compared to determine if the modification leads to better nutrition of cattle. Twelve animals were randomly selected from a large herd of beef cattle for the experiment. Varieties were labeled A, B and C, where A is the unmodified one. Twelve identical ping-pong balls were used for the randomization of treatment assignment to experimental units (animals). Four balls were labelled "A," four "B" and four "C." Balls were put into a bucket and mixed, then a blindfolded operator drew them randomly, one at a time and without replacement. The random sequence of letters was written down. Animals were lined up haphazardly and assigned the treatments from left to right in the order given by the random sequence. Each animal was housed in an individual pen and fed the corresponding variety of alfalfa for 30 days. Weight gain in lbs per day was calculated for the last 10 days of the experiment and used as the response variable.
<br>
```{r TblAnovaExmpl1, echo=FALSE, warning=FALSE}
aaAnovaExmpl1.dat <- read.table(header = TRUE, text = "
variety wgain
A 1.0
B 2.0
C 3.0
A 2.0
B 3.0
C 4.0
A 2.0
B 2.0
C 3.0
A 3.0
B 3.0
C 2.0
"
)
knitr::kable(
aaAnovaExmpl1.dat,
digits = 2,
align = "cc",
caption = 'Fictitious data. Weight gain in lbs/day of steers fed different varieties of alfalfa. The experiment was a completely randomized design with four replicates of each treatment'
) %>%
kable_styling(bootstrap_options = c("condensed"),
font_size = 11,
full_width = F,
position = "center") %>%
column_spec(c(1,2), width = "1.5in")
```
<br>
**Null hypothesis**: Weight gain is the same for all alfalfa varieties tested.
\begin{align}
H_0: \quad &\mu_A = \mu_B = \mu_C \quad &\text{or equivalently:} \\
&\tau_A = \tau_B = \tau_C = 0 \quad &\text{treatment effects are zero}
\end{align}
**Alternative hypothesis** There is at least one pair of varieties that differs in the resulting weight gains.
\begin{align}
H_A: \quad &\mu_A \ne \mu_B \quad \text{or} \quad \mu_A \ne \mu_C \quad \text{or} \quad \mu_B \ne \mu_C \\
\end{align}
### Formulas and calculations in R
We use the `lm` and `anova` functions of R. The `lm` function finds the optimal values of the estimated parameters by minimizing the residual sum of squares. The `anova` function uses the estimated parameters to partition the total variation and gives the resulting calculated F with the corresponding probability of observing a value at least that extreme if indeed all varieties result in the same weight gain.
```{r, echo = TRUE, warning=FALSE}
transg.aa.m1 <- lm(wgain ~ variety, data = aaAnovaExmpl1.dat)
summary(transg.aa.m1)
```
The output shows the formula used for the model, the quartiles for the residuals, and a table of coefficients with the estimated mean of the reference variety A (Intercept), the estimated differences of varieties B and C from A (varietyB and varietyC), and the corresponding standard errors and t-tests for the null hypotheses that each coefficient is zero. Note that with R's default parameterization the Intercept is the mean of the first treatment level, not the overall mean, and varietyB and varietyC estimate $\tau_B - \tau_A$ and $\tau_C - \tau_A$. We are not interested in testing that the coefficients are zero, so the t-tests can be ignored. The bottom part of the report shows the standard deviation of residuals and their degrees of freedom, the $R^2$ and adjusted $R^2$, and finally the calculated F statistic with its degrees of freedom and the probability of observing a value equal to that or more extreme if $H_0$ were true. The $R^2$ is the proportion of the total variation represented by the variation among treatments. We will ignore the adjusted $R^2$ for now.
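As a self-contained check of how to read these coefficients (the data are re-entered here so the chunk runs on its own): under R's default treatment contrasts, the Intercept equals the average of variety A, and the other coefficients are differences from A.

```{r}
# Rebuild the example data so this sketch is self-contained.
d <- data.frame(
  variety = rep(c("A", "B", "C"), times = 4),
  wgain   = c(1, 2, 3,  2, 3, 4,  2, 2, 3,  3, 3, 2)
)
m <- lm(wgain ~ variety, data = d)

coef(m)["(Intercept)"]  # 2.0 = average of variety A
coef(m)["varietyB"]     # 2.5 - 2.0 = 0.5, difference of B from A
coef(m)["varietyC"]     # 3.0 - 2.0 = 1.0, difference of C from A
```

With sum-to-zero contrasts (`contr.sum`) the Intercept would instead estimate the overall mean.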
Before we move further, we need to inspect the residuals to make sure that there are no deviations from the assumptions. If residuals clearly do not meet assumptions, we cannot use ANOVA and the results above are not relevant.
```{r AnovaResPlot1, echo = TRUE, warning=FALSE, out.width = '90%', fig.align='center', fig.cap ="Left: Plot of residuals against estimated treatment means for the transgenic alfalfa experiment. Points were jittered horizontally because several residuals were equal. Right: quantile-quantile plot of residuals. If residuals have a normal distribution they should not deviate significantly from the straight line."}
par(mfrow = c(1,2))
plot(residuals(transg.aa.m1) ~ jitter(fitted(transg.aa.m1)),
ylab = "Residuals",
xlab = "Treatment averages (lbs/day)")
qqp(residuals(transg.aa.m1), id = FALSE)
```
<br>
The graph on the left shows that the vertical variation of points does not differ extremely among treatments, which are depicted as different values in the horizontal axis. Therefore, we do not have sufficient information to question the assumption of homogeneity of variance. The graph on the right shows that the observed quantiles do not deviate significantly (they are within the dashed curves) from the normal quantiles, so we cannot question the assumption of normality. In this data we do not have enough information to assess the independence of the observations, so that assumption is based on the way that animals were treated and kept separately, and it is not challenged. One possible way to challenge the independence of residuals would be to determine if observations have more similar residuals when animals are in closer pens, but the spatial distribution of pens is not known (the data are fictitious and no spatial distribution was created)^[It is strongly recommended that experimenters always record the operator ID, date, time and spatial location of all observations. For lab experiments, location on benches, batches, runs etc. should be recorded with each observation. The time-place-set-group information will allow experimenters to test for lack of homogeneity in variance and lack of independence in the residuals, as well as for nuisance effects of groupings or batches].
Although there are other more formal tests to assess whether assumptions are met, a visual inspection of the residuals is usually sufficient. Formal methods to test for deviation from assumptions are beyond the scope of this course. This data set is very small and the residuals do not show major problems, so the analysis is pursued. Note that when doing real data analysis, this step should be more detailed.
```{r, echo = TRUE, warning=FALSE}
kable(anova(transg.aa.m1), digits = c(0, 0, 3, 3, 4)) %>%
kable_styling(full_width = FALSE)
```
The analysis of variance table has columns for source of variation, degrees of freedom, sum of squares, mean square, calculated F-value and probability of observing an F that large or larger if the null hypothesis were true. Of course, the F value and the probability are the same as in the previous table. The probability reported is the probability that an F-distributed random variable with 2 df in the numerator and 9 df in the denominator will take values equal to or greater than the calculated F. If the means are indeed all the same, then the calculated F should behave as an F-distributed random variable with 2 df in the numerator and 9 df in the denominator. If the probability is too low, we reject the premise that the calculated F follows that distribution, which cascades back to rejecting $H_0$.
Because the probability associated with the calculated F-value is larger than 0.05, we cannot reject the null hypothesis and the study is inconclusive. This is the same as saying that the calculated F is smaller than the critical $F$. We do not have enough evidence to say that the varieties lead to different weight gains, because the variation among treatment means is not sufficiently larger than the variation within treatments due to experimental error (variation among experimental units due to other, unmeasured causes).
Keep in mind that not being able to reject the null hypothesis is not the same as accepting it. The reason for this is that *it is very easy not to reject a null hypothesis*: all you have to do is not try hard enough, for example by obtaining a very small sample. If not being able to reject the null meant that you accept it, then you would be able to accept any null hypothesis, which obviously is not very useful. It would simply lead you to be wrong most of the time.
If the above is not clear, consider the following. Say you took a multiple-choice test and obtained 75% correct. The questions are a sample of your knowledge. Your professor hypothesizes that your real knowledge of the material is at 73%. With the sample size available, he cannot reject his null hypothesis. Will you accept a grade of 73%?
```{block, type = 'stattip'}
- Failure to reject the null hypothesis DOES NOT mean that you accept it. It means the study is inconclusive, unless you did a calculation of power.
```
Even though treatments were not significantly different, we can calculate confidence intervals for the means and we can estimate the minimum difference between means that could be detected with the present variance and sample size. In all cases, the best estimate of the variance of the error we have is the MSE with dfe degrees of freedom.
**Confidence interval for a treatment mean**
<br>
\begin{align}
&{U \atop L} = \hat{\mu}_{i.} \pm t_{(1-\alpha/2), dfe} \ S_{\hat{\mu}_{i.}} \quad \text{ where} \\[15pt]
& \hat{\mu}_{i.} = \bar{Y}_{i.} \quad \text{is the estimated treatment mean or average} \\[15pt]
&t_{(1-\alpha/2), dfe} \quad \text{ is the } \quad 1-\alpha/2 \quad \text{ t quantile with dfe degrees of freedom} \\[15pt]
&S_{\hat{\mu}_{i.}} = S_{\bar{Y}_{i.}} \quad \text{is the standard deviation of the estimated treatment mean} \\[15pt]
&S_{\bar{Y}_{i.}}^2 = MSE/r \\[15pt]
(\#eq:CiTrtMeanCRD)
\end{align}
<br>
The key step in the calculation is finding $S_{\bar{Y}_{i.}}$. We recall that the treatment average is the average of all replicates for the treatment and apply the most important equation of this course. If the estimated variance of $Y$ is $S^2_Y$, then the estimated variance of $\bar{Y}_{i.}$ is $S^2_Y/r_i$ where $r_i$ is the number of replicates in treatment i. If the experiment is balanced and all treatments have the same number of replications, then $r = r_i$ which in the example is 4 replications. For $S^2_Y$ we use the best estimate of the variance of the error we have, which is the MSE. Note that the MSE is preferred to using the estimated variance based just on the r replicates for the treatment because it has more degrees of freedom. This works if the assumption of homogeneity of variance is correct.
For the calculation of confidence intervals we use the `emmeans` function in the `emmeans` [@Lenth2018] package of R. The `emmeans` function computes the estimated treatment means, their standard errors and confidence intervals. Note that the code below repeats the call in order to print the table in a nice format using the functions `kable` and `kable_styling`. Thus, you can see the table first as direct output from R and then as formatted by the software used to write this book.
<br>
```{r aaAnovaTbl}
emmeans::emmeans(transg.aa.m1, "variety")
emmeans(transg.aa.m1, "variety") %>%
kable(caption = "Estimated mean weight gains for animals
fed each variety of alfalfa, with the corresponding
standard errors and confidence intervals.",
digits = c(0, 1, 4, 0, 4, 4)) %>%
kable_styling(full_width = FALSE)
```
<br>
Table \@ref(tab:aaAnovaTbl) shows that all estimated treatment means have the same SE and df of the error, which leads to identical confidence interval widths. The t-value for the confidence intervals, known as "critical" t or colloquially "t from table" (as opposed to the calculated t), is $t_{\ 0.975, \ 9} =$ `r qt(0.975,9)`.
**Comparing two treatment means**
In this specific example the single F-test did not lead to rejection of $H_0$, so comparisons among means should not be pursued. However, we present the calculations that are necessary for comparing means using the *Least Significant Difference* (LSD) procedure. In this procedure, we calculate the smallest difference necessary to conclude that any two means are different. Any two means that differ by more than the LSD are considered to be "significantly" different, which means that we reject the null that they are the same.
As we saw in [Chapter 8: Two populations](#ch2pops), when we test the $H_0$ that two means are equal we have two random variables: each of the two estimated means. Because experimental units were randomly assigned to treatments and treated independently, the estimates should be, and are assumed to be, independent random variables. The test of equality becomes a test of whether the difference between the two random variable realizations is significantly different from zero. That is, we ask "is the difference much greater than would be expected if the two means were actually equal?" The difference between two normal random variables is also a normal random variable with mean equal to the difference in the true means and variance equal to the sum of the variances of the two estimated means. Because we are assuming that there is homogeneity of variance, the two variances are equal, so we use a pooled estimated variance. In the two-sample case in Chapter 8 we pooled the two estimated variances. When we have several treatments, we use the estimated variance pooling all within-treatment variation, which is the MSE.
<br>
\begin{align}
&H_0: \mu_{1.} = \mu_{2.} \quad \equiv \quad \mu_{1.} - \mu_{2.} = 0 \\[15pt]
&H_A: \mu_{1.} \ne \mu_{2.}\\[15pt]
& \bar{d} = \bar{Y}_{1.} - \bar{Y}_{2.} \sim N(\mu_{\bar{d}}, \sigma^2_{\bar{d}}) \\[15pt]
& \sigma^2_{\bar{d}} = \sigma^2_{\bar{Y}_{1.}} + \sigma^2_{\bar{Y}_{2.}} = 2 \ \sigma^2_Y/r \quad \text{ which is estimated by } \quad S^2_{\bar{d}} = 2 \ MSE/r\\[15pt]
&\text{ therefore, if } \quad \mu_{\bar{d}} = 0 \implies t_{calc} = \frac{\bar{d}}{S_{\bar{d}}} \sim t_{dfe} \\[15pt]
(\#eq:DifTrtMeanCRD)
\end{align}
<br>
The decision rule is to compare the calculated t with the critical "two-tailed" t for $\alpha$. If the calculated t is greater than the critical t, then $H_0$ is rejected and the treatment means are said to be "significantly different." Note that if $$\bar{d}/S_{\bar{d}} \gt t_{(1-\alpha/2)}$$ then $$\bar{d} \gt t_{(1-\alpha/2)} \ S_{\bar{d}}$$ so the product of the critical t times the standard error of the difference between two averages is the smallest difference that will lead to a rejection of the null hypothesis.
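Before calling a packaged routine, the LSD can be sketched directly from these formulas, using the MSE (5/9, with 9 df) and r = 4 from this experiment:

```{r}
mse <- 5/9    # mean square error from the ANOVA table
dfe <- 9      # error degrees of freedom
r   <- 4      # replicates per treatment

se_diff <- sqrt(2 * mse / r)     # standard error of a difference of two means
t_crit  <- qt(0.975, df = dfe)   # two-tailed critical t for alpha = 0.05
lsd     <- t_crit * se_diff

se_diff  # 0.527
t_crit   # 2.262
lsd      # 1.192
```

Any pair of treatment averages that differ by more than this LSD would be declared significantly different at $\alpha = 0.05$.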
```{r}
library(agricolae) #load package necessary for LSD test
LSD.test(transg.aa.m1,
"variety",
console = TRUE,
p.adj = "none")
```
The output reports:
- MSE,
- a table with estimated treatment means with their group standard deviations (without pooling), number of observations in each group, confidence intervals (same as reported by `emmeans` above) and rank of each extreme of the confidence intervals,
- $\alpha$, dfe and critical t for the given $\alpha$, and dfe,
- least significant difference,
- table with treatment averages connected with letters.
The last part of the output, a table of averages followed by letters, is a standard way to present the results. Means that share at least one letter are not significantly different from each other. In this case a single letter follows all the means because there are no significant differences.
### Detailed calculations
In this section we repeat some of the analysis above in more detail, showing the calculations that are "behind the scenes" when we use the high-level R functions.
There are two principal goals for the calculations:
1. get the calculated-F to compare to the "F from table," and
1. get the standard error of estimated means and differences between estimated means to test pairs of means or make confidence intervals.
To obtain the calculated F we need to complete the ANOVA table, which contains the MSE. The standard error of estimated means and of their differences is calculated using the MSE.
Sums of squares, the central components of the ANOVA table, can be calculated by making a table with all components for each of the observations and then squaring and summing the components.
```{r}
# First calculate treatment averages and add them to each corresponding observation
# in a new column. We use the aggregate function.
var.avg <- aggregate(wgain ~ variety,
data = aaAnovaExmpl1.dat,
FUN = mean)
names(var.avg)[2] <- "avg.wgain" # Change name to avoid conflict during merge.
aaAnovaExmpl1.dat <- merge(aaAnovaExmpl1.dat, var.avg, all = TRUE)
# Calculate treatment effects
aaAnovaExmpl1.dat$trt.fx <- aaAnovaExmpl1.dat$avg.wgain - mean(aaAnovaExmpl1.dat$wgain)
# Calculate residuals
aaAnovaExmpl1.dat$rsdl <- aaAnovaExmpl1.dat$wgain - aaAnovaExmpl1.dat$avg.wgain
# Calculate total deviations from average
aaAnovaExmpl1.dat$tot.dev <- with(aaAnovaExmpl1.dat, trt.fx + rsdl)
print(aaAnovaExmpl1.dat)
```
The data frame now has the original observations plus their decomposition into overall average ($\bar{Y}_{..} =$ `r mean(aaAnovaExmpl1.dat$wgain)`), treatment effect and residual. The total deviation from each observation to the overall average is the sum of treatment effect and residual; for example, for two observations of variety A:
\begin{equation}
Y_{11} = 1.0 = 2.5 + (-0.5) + (-1.0) \\[15pt]
Y_{14} = 3.0 = 2.5 + (-0.5) + 1.0
\end{equation}
Total sum of squares is the sum of the squared values in the `tot.dev` column:
\begin{align}
TSS &= (-1.5)^2 + 0.5^2 + (-0.5)^2 + (-0.5)^2 + 0.5^2 + (-0.5)^2 \\
&+ 0.5^2 + (-0.5)^2 + 0.5^2 + 1.5^2 + 0.5^2 + (-0.5)^2 \\
&= 7.0
\end{align}
Treatment and residual SS can be found in a similar way using the corresponding columns.
\begin{align}
SST &= (-0.5)^2 + (-0.5)^2 + (-0.5)^2 + (-0.5)^2 + 0^2 + 0^2 \\
&+ 0^2 + 0^2 + 0.5^2 + 0.5^2 + 0.5^2 + 0.5^2 \\
&= 2.0
\end{align}
\begin{align}
SSE &= (-1)^2 + 1^2 + 0^2 + 0^2 + 0.5^2 + (-0.5)^2 + 0.5^2 \\
&+ (-0.5)^2 + 0^2 + 1^2 + 0^2 + (-1)^2 \\
&= 5.0
\end{align}
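These sums can be verified numerically by squaring and summing the decomposition columns; a self-contained sketch (the data are re-entered so the chunk runs on its own):

```{r}
# Re-enter the observations, grouped by variety A, B, C.
wgain   <- c(1, 2, 2, 3,  2, 3, 2, 3,  3, 4, 3, 2)
variety <- rep(c("A", "B", "C"), each = 4)

trt.avg <- ave(wgain, variety)     # treatment average for each observation
trt.fx  <- trt.avg - mean(wgain)   # treatment effects
rsdl    <- wgain - trt.avg         # residuals

TSS <- sum((trt.fx + rsdl)^2)      # total sum of squares: 7
SST <- sum(trt.fx^2)               # treatment sum of squares: 2
SSE <- sum(rsdl^2)                 # error sum of squares: 5
```

Note that the partition holds exactly: TSS = SST + SSE = 2 + 5 = 7.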
Degrees of freedom are as follows: for SSE there are 12 terms in the summation, and we estimated 3 treatment means to calculate it. Therefore, dfe = 12 - 3 = 9. Using the geometric approach, the observation vector is in a 12-dimensional space and the vector of treatment means is in a 3-dimensional subspace, so the residual vector is restricted to the remaining 9 dimensions.
For TSS there are 12 independent terms and one estimate of the overall mean. Thus, total df = 12 - 1 = 11. Treatment or variety SS has 3 independent terms that repeat over each of the replications and it uses one estimate for the overall mean. Thus, df treatments = 3 - 1 = 2.
$$ MSE = \frac{SSE}{dfe} = \frac{5.0} {9} = 0.55556$$
$$MST = \frac{SST}{df_{treat}} = \frac{2.0} {2} = 1.0$$
$$F_{calc} = \frac{MST}{MSE} = 1.8 \quad \text{with 2 df in the numerator and 9 df in the denominator}$$
Critical F and the probability associated with the observed F are calculated in R as follows. Recall that the value of the critical F is a quantile, more specifically the $1 - \alpha$ quantile:
```{r}
# Critical F quantile
qf(p = 0.05, df1 = 2, df2 = 9, lower.tail = FALSE)
# Probability of F > Fcalc
pf(q = 1.8, df1 = 2, df2 = 9, lower.tail = FALSE)
```
This completes all the information in the ANOVA table. Now we proceed to make a confidence interval for an estimated treatment mean, for example for variety C. The standard error of the estimated mean is the square root of the MSE divided by the number of observations used to estimate the mean. In this case, all treatments have the same sample size or number of replications, and thus all treatment standard errors are the same. However, in other experiments sample sizes can differ among treatments, and the standard error will be larger for the treatments with fewer observations.
$$S_C = \sqrt{\frac{MSE}{r}} = \sqrt{\frac{0.55556}{4}} = 0.372678$$
The critical value of t for the CI is obtained using the `qt` function in R. For a confidence interval we always use the two-tailed quantile, which leaves $\alpha/2$ in each tail:
```{r}
# Critical t
qt(p = 0.05/2, df = 9, lower.tail = FALSE)
# Average for variety C
mean(aaAnovaExmpl1.dat$wgain[aaAnovaExmpl1.dat$variety == "C"])
```
We corroborate that the value obtained is the same as above when we used the `LSD.test` function.
$${U \atop L} = 3.0 \ \pm \ 2.262157 \cdot 0.372678 = {3.843056 \atop 2.156944}$$
Finally, we construct a test and calculate the LSD to compare the means of varieties A and C. Because we are calculating a difference between two estimated means, the variance is twice that of each estimated mean. Critical t is the same, given that we use the same $\alpha = 0.05$ and that the dfe are the same as before. The mean of variety A is 2.0.
$$S_{\bar{d}} = S_{C-A} = \sqrt{\frac{2MSE}{r}} = \sqrt{\frac{2\cdot0.55556}{4}} = 0.5270463$$
$$\bar{d} = 3.0 - 2.0 = 1.0$$
$$t_{calc} = \frac{\bar{d}}{S_{\bar{d}}} = \frac{1.0}{0.5270463} = 1.897367$$
The probability associated with the calculated t is found in R taking into account that the test is two-tailed, so the probability for one tail has to be multiplied by 2:
```{r}
2 * pt(1.897367, df = 9, lower.tail = FALSE)
```
The LSD is the product of the critical t and $S_{\bar{d}}$:
$$ LSD = 2.262157 \cdot 0.5270463 = 1.192262$$
We cannot reject the null hypothesis that the means for varieties A and C are equal because of three equivalent facts (in addition to the non-significant ANOVA test):
1. The probability associated with calculated t is greater than 0.05.
1. Calculated t is smaller than critical t.
1. The difference between averages is smaller than the LSD.
## Exercises and Solutions
1. Explain what analysis of variance means in practical terms. How can one use variances to compare the means of different treatments?
2. What are the assumptions of the analysis of variance and why are each of them required prior to analysis?
3. Data is collected on 4 groups of 30 Romney ewes that were clipped and shorn on different days across 3 replicated time periods in New Zealand. Mean clean fleece weight (g) is shown in the table below.
<br>
Table: (\#tab:SheepWool) Mean clean fleece weight (g) is measured from ewes that were clipped and shorn on different days across three consecutive 112 day intervals (Bigham, 1974)
| Treatment | Rep 1 | Rep 2 | Rep 3 | Total | Mean |
|--------------:|:--------------:|:--------------:|:--------------:|:---------:|:---------:|
| A(28,28) | 1.44 | 1.13 | 1.93 | 4.50 | |
| B(56,56) | 1.34 | 1.11 | 1.87 | 4.32 | |
| C(112,112) | 1.30 | 1.03 | 1.88 | 4.21 | |
| D(28,112) | 1.33 | 1.00 | 1.87 | 4.20 | |
<br>
```{r}
sheep.wool <- data.frame(
'A' = c(1.44, 1.13, 1.93),
'B' = c(1.34, 1.11, 1.87),
'C' = c(1.30, 1.03, 1.88),
'D' = c(1.33, 1.00, 1.87))
```
a. What is the MST and MSE of this data?
b. What is the calculated F-value of this data?
c. Is there a significant difference among the treatments at a 5% significance level?
d. Is there a significant difference among the treatments at a 1% significance level?
e. Is there a significant difference among the treatments at a 10% significance level?
4. If treatments are considered significantly different at the 5% level, will they also be considered significantly different at the 10% level? At the 1% level?
5. Data is collected on the yield (bushels/acre) in four different varieties of wheat replicated at a single location.
<br>
Table: (\#tab:WheatYield) Wheat yield (bushels/acre) is measured from four different varieties.
| Treatment | Rep 1 | Rep 2 | Rep 3 | Rep 4 | Rep 5 | Rep 6 |
|--------------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| A | 60 | 40 | 45 | 55 | | |
| B | 25 | 30 | 25 | 35 | 30 | 25 |
| C | 55 | 50 | 40 | 60 | 55 | |
| D | 60 | 55 | 55 | 45 | 50 | |
<br>
```{r}
wheat.yield <- data.frame(
'A' = c(60, 40, 45, 55, NA, NA),
'B' = c(25, 30, 25, 35, 30, 25),
'C' = c(55, 50, 40, 60, 55, NA),
'D' = c(60, 55, 55, 45, 50, NA))
```
a. What is the MST and MSE of this data?
b. What is the calculated F-value?
c. Is there a significant difference among the treatments at the 5% significance level?
## Homework
### Introduction to ANOVA
This set of exercises introduces the need for and shows the use of a method called Analysis of Variance (ANOVA). This method is used to determine if there are differences in the means of several populations. Instead of two populations, now there are many. We start with a type of ANOVA called One-way ANOVA. The term "One-way" means that the populations may differ in only one factor. For example, the factor can be the amount of fertilizer applied to a crop or the type of diet fed to an animal. Conversely, "Two-way" ANOVA involves more than one factor, for example when plots differ not only in the amount of fertilizer but also in the amount of irrigation. For now we focus on one-way ANOVA.
### Need for ANOVA
We already know how to compare two means. When we have many means and want to determine if any of them are different, we could simply test all possible pairs. Why is this not a good idea unless we take precautions? The first part of this homework illustrates why.
### Comparison of Diets
You are looking for diets that reduce blood cholesterol in rats. You created 14 test diets and did an experiment in which 4 randomly selected rats were independently fed each diet. Blood cholesterol was measured on each rat at the end of 30 days of feeding. Because you do this before knowing about ANOVA, you decide to test for differences using the independent observations procedure for two population means, taking two diets at a time. You use $\alpha = 0.05$ for each test. Assume that the tests are all independent.
1. How many different tests of $H_0: \mu_1 =\mu_2$ do you make? Consider that each time you make a test you are choosing 2 out of 14 means. The order of the means does not matter, because we are making 2-tailed tests.
2. What is the probability that you make a type I error in each test separately?
3. What is the probability that you make a type I error in just one of all the tests?
4. What is the probability that you make a type I error in none of the tests?
5. What is the probability that you make a type I error in at least one test (family-wise error rate)?
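As a generic illustration of the logic behind these questions (using a made-up number of means, not the one in the exercise): with k means there are $\binom{k}{2}$ pairwise tests, and for m independent tests at level $\alpha$ the family-wise error rate is $1 - (1 - \alpha)^m$.

```{r}
k     <- 5      # number of means (illustrative only, not the homework value)
alpha <- 0.05   # per-test type I error rate

m      <- choose(k, 2)    # number of pairwise tests: 10
p_none <- (1 - alpha)^m   # probability of no type I error in any test
fwer   <- 1 - p_none      # probability of at least one type I error

m     # 10
fwer  # about 0.40
```

Even with only 5 means, the chance of at least one false rejection is roughly 40%, far above the nominal 5%.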
### Use of ANOVA
When you make a lot of tests, your chance of making at least one mistake can be quite large. ANOVA avoids the problem with your "all-tests" approach by doing a single test that simultaneously checks for evidence of at least one difference. Interestingly, although the test is to determine if there are differences among means, it actually is a test of equality of variances estimated in two different ways. So first we practice testing for equality (or difference) of variances. When two random variables have normal distributions with equal variance, the ratio of sample variances from each population has an F distribution (see [F Distribution](#FDist)).
### Test of Equality of Variances
The data below are samples of forage yield (ton/ha) from a variety of fescue grown with and without irrigation. Test the hypothesis that irrigated and rainfed yields have the same variance.
<br>
```{r}
test.of.eq.var <- data.frame(
rainfed = c(4.55531, 3.28384,3.69763, 4.97093, 4.86167, 3.51156, 5.37135, 4.17079, 3.51320, 4.02560, 2.79041, 3.55350, 3.22606, 5.89298, 1.66526, 4.16665, 3.50212, 3.72783, 2.36477, 3.30673, 4.40777, 3.66480, 3.35171, 3.10409, 2.85976, 3.81750, 5.39221, 4.99803,4.11388, 4.11004),
irrigated = c(13.10275, 11.66548, 8.40485, 8.37005, 10.32271, 10.22372, 8.08630, 12.99069, 11.70000, 12.10000, 7.87410, 10.22707, 12.10700, 11.00452, 10.64577, 8.68308, 6.83079, 12.25721, 10.20000, 12.52602, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA))
```
<br>
```{block, type = 'rtip'}
Data frames are rectangular tables with data. All columns must have the same length. Because our two treatments do not contain the same number of samples, `NA` values are included to make columns of equal length that can be joined into a data frame.
```
<br>
For the following calculations, assume an $\alpha = 0.05$, and $df_{rainfed} = 29$ and $df_{irrigated} = 19$ to calculate these values.
6. Calculate the sample variances and report the "calculated" or sample F value.
7. What is the probability of observing an F-value larger than the one observed?
8. What is the critical F "from table?"
9. Interpret the results of the test of equality of variances.
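As a pattern for this kind of test (using made-up numbers, not the forage data): compute the ratio of the two sample variances and compare it to F quantiles with the corresponding degrees of freedom.

```{r}
y1 <- c(4.1, 3.6, 5.0, 4.4, 3.2, 4.8, 3.9, 4.5)      # made-up sample 1
y2 <- c(9.0, 12.5, 7.8, 11.2, 13.4, 8.1, 10.6, 12.9) # made-up sample 2

f_calc <- var(y2) / var(y1)   # larger sample variance in the numerator
df1 <- length(y2) - 1
df2 <- length(y1) - 1

f_crit <- qf(0.975, df1, df2)                          # upper critical F
p_two  <- 2 * pf(f_calc, df1, df2, lower.tail = FALSE) # two-tailed p-value
```

The same test is packaged in R as `var.test(y2, y1)`, which reports the same ratio and two-sided p-value.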
### Estimation of the Variance Between and Within Samples
For this section of the exercise we will use yield (ton/ha) data about 10 fescue varieties, all grown in rainfed conditions. We assume that the variances are the same for all varieties. This is an important point to understand: we assume that the variance is the same for all varieties, regardless of their means. We are referring to the variances within each sample. Because the variances are assumed to be the same, we can follow the idea used in Case 1 of two populations means (textbook page 88) and pool the estimates to get a better overall estimate of the variance.
These are randomly simulated data. All variances are equal to 1 and the true means are below. These values are given just FYI, to see how samples differ.
```{r}
fescue.samples <- data.frame(
'var1' = c(3.10, 3.88, 2.37, 3.95, 2.42, 1.72, 3.28, 2.51, 2.44),
'var2' = c(3.84, 2.65, 3.10, 2.83, 1.81, 3.86, 4.25, 4.21, 5.44),
'var3' = c(5.80, 5.87, 4.25, 2.56, 4.32, 3.05, 4.31, 3.67, 2.36),
'var4' = c(5.03, 2.18, 3.49, 4.50, 1.99, 2.95, 3.48, 2.74, 3.89),
'var5' = c(3.90, 4.24, 4.60, 5.00, 4.04, 4.85, 4.65, 4.26, 5.47),
'var6' = c(7.57, 3.76, 6.66, 5.87, 5.69, 6.41, 5.12, 6.74, 6.04),
'var7' = c(2.47, 4.23, 2.63, 4.68, 3.22, 3.88, 3.17, 3.62, 3.22),
'var8' = c(3.23, 4.15, 4.12, 6.05, 3.66, 4.90, 3.88, 3.76, 3.15),
'var9' = c(4.01, 5.46, 4.31, 5.63, 5.22, 4.38, 4.75, 4.09, 4.60),
'var10' = c(6.96, 5.79, 6.76, 6.25, 6.83, 7.24, 7.26, 5.00, 4.66))
#Actual Population Means
pop.means <- data.frame(
'var1' = 3.0,
'var2' = 3.5,
'var3' = 4.0,
'var4' = 3.0,
'var5' = 5.0,
'var6' = 6.0,
'var7' = 3.2,
'var8' = 3.9,
'var9' = 4.5,
'var10' = 6.0)
df <- data.frame('var1' = 8, 'var2' = 8, 'var3' = 8, 'var4' = 8, 'var5' = 8, 'var6' = 8, 'var7' = 8, 'var8' = 8, 'var9' = 8, 'var10' = 8)
#Sample Averages for Each Variety
sample.avg <- sapply(fescue.samples, mean)
#Sample Variances for Each Variety
sample.var <- sapply(fescue.samples, var)
```
For the following calculations, assume an $\alpha = 0.05$, and $df_{var_i} = 8$ to calculate these values.
$S^2_{\bar{Y}} = \frac{S^2_{Y}}{r}$
$S^2 = \frac{(r_1 - 1)S^2_{1} + (r_2 - 1)S^2_{2} +(r_3 - 1)S^2_{3} + (r_4 - 1)S^2_{4} + (r_5 - 1)S^2_{5} + (r_6 - 1)S^2_{6} + (r_7 - 1)S^2_{7} + (r_8 - 1)S^2_{8} + (r_9 - 1)S^2_{9} + (r_{10} - 1)S^2_{10}}{(r_1 -1) + (r_2 -1) + (r_3 -1) + (r_4 -1) + (r_5 -1) + (r_6 -1) + (r_7 -1) + (r_8 -1) + (r_9 -1) + (r_{10} -1)}$
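The pooled variance above is a weighted average of the group variances, with degrees of freedom as weights; a generic sketch on made-up groups (not the fescue data, and with unequal sizes to show the general case):

```{r}
# Three made-up samples with different sizes
g1 <- c(3.1, 2.8, 3.6, 3.3)
g2 <- c(4.0, 4.4, 3.7, 4.1, 4.3)
g3 <- c(2.9, 3.5, 3.2)

groups <- list(g1, g2, g3)
df_i  <- sapply(groups, function(g) length(g) - 1)  # r_i - 1 for each group
var_i <- sapply(groups, var)                        # S^2_i for each group

pooled_var <- sum(df_i * var_i) / sum(df_i)         # df-weighted average
```

When all groups have the same size, this reduces to the simple average of the group variances.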
10. Estimate the pooled variance for the 10 varieties of fescue.
Now we recall that the variance of averages for samples from a population is the variance of individual observations divided by sample size. If it is true that all means are the same for all varieties, then we can obtain another estimate of the individual variance by calculating the variance among averages and multiplying by sample size.
11. Estimate the variance of yield using the variance among averages and sample size. Refer to the most important equation for this course.
12. What is the overall average?
13. What is the pooled within variance?
14. What is the variance of averages?
15. What is the second variance estimate?
16. What is $F_{calc}$?
17. What is $F_{critical}$?
18. What is P($F$ > $F_{calc}$)?
19. What is $df_{numerator}$?
20. What is $df_{denominator}$?
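The logic of the questions above can be illustrated on simulated data (not the fescue samples): when all population means are equal, $r$ times the variance of the group averages and the pooled within-group variance both estimate $\sigma^2_Y$, so their ratio behaves as an F random variable.

```{r}
set.seed(42)                      # made-up simulated data, equal true means
ntrt <- 6                         # number of groups (illustrative)
nrep <- 9                         # replicates per group
y <- matrix(rnorm(ntrt * nrep, mean = 5, sd = 1), nrow = nrep)

avg_j  <- colMeans(y)             # group averages
within <- mean(apply(y, 2, var))  # pooled within-group variance (equal r)
among  <- nrep * var(avg_j)       # r times the variance of the averages

f_calc <- among / within          # compare to F with ntrt - 1, ntrt*(nrep - 1) df
p_val  <- pf(f_calc, df1 = ntrt - 1, df2 = ntrt * (nrep - 1),
             lower.tail = FALSE)
```

This ratio is exactly the F statistic that `anova(lm(...))` would report for these data arranged in long format.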
## Laboratory Exercises
### Plant Sciences {#LabCh09PLS}
Prepare an .Rmd document starting with the following text, where you substitute the corresponding information for author name and date.
"---" Unquote the three dashes
title: "Lab05 CRD Anova"
author: "YourFirstName YourLastName"
date: "enter date here"
output: html_document
"---" Unquote the three dashes
```{r setupPS, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(pander)
library(ggplot2)
```
#### Instructions
For this lab you will modify this file and submit this file with the file name changed so it has your email ID (the part before @) in lower case instead of "email." Do not add spaces to the file name.
This is a markdown document. You will type your code and run it one line at a time as you add it in the lines indicated below. Add code **ONLY** in the areas between "\```{r}" and "\```". These areas are highlighted with a light grey color. Run each line and parts to learn and experiment until you get the result you want. Keep the lines that worked and move on. At any time you can see if your document "knits" or not by clicking on the Knit HTML icon at the top. Once you have completed all work, knit your document and save the html file produced with the same file name but with an html extension (Lab01email.html).
**Submit BOTH files for your lab report using the appropriate Canvas tool**
For each part and question below, type your code in the grey area below, between the sets of back-ticks (```) to perform the desired computation and get output. Type your answers **below** the corresponding grey area.
In this exercise we analyze the data resulting from an experiment at the UC Davis Plant Sciences research fields. The data and description of the experimental setup have been simplified to facilitate understanding. For the purpose of this lab we will consider that the experiment was designed as a Completely Randomized Design in which 12 treatments were applied randomly to 13 plots each. We will also assume that the variances of the errors or residuals are the same in all treatments. One medusahead plant was grown in each plot and its total seed production was recorded at maturity (seedMass.g).
Treatments resulted from combining two levels of nitrogen fertilization (n: no fertilizer added and N: nitrogen added), two levels of watering (w: no water other than rain, W: with water added) and three environments (s: areas without addition of seed representing the typical California Annual Grassland, S: areas where native perennial grasses were seeded, E: edge between seeded and unseeded areas). Based on previous experience and plant biology, we expect medusahead seed production to be lowest without water or fertilizer and when exposed to competition by native perennial grasses.
This exercise has four parts. First, we read in and explore the data with some descriptive statistics. We will use a transformation to normalize the distribution of the response variable. Second, each observation is partitioned into the components indicated by the model and then the corresponding sums of squares (SS) and degrees of freedom (df) are computed. The ANOVA table and tests are done with the resulting values of SS and df. Third, we repeat the analysis using pre-existing R functions to do the ANOVA tables and tests directly. Finally, we calculate confidence intervals for treatment means and back-transform them to be able to interpret results in the original units of grams of seed produced per plant.
#### Part 1. Inspection and summary of data [25 points]
Get a histogram of the data. Make a graph showing box-plots of the data for each treatment. Notice that the data have a highly skewed distribution, so a logarithmic transformation is necessary. Create a new column called logSmass that contains the log of seed mass after adding 0.3 to the mass. We add 0.3 to avoid problems with the zeroes, because log(0) is not defined. Plot histograms and box-plots of the new variable. Inspect the box-plots and determine if there appear to be any effects of treatments on the seed production of the invasive exotic weed. Create a table of averages and standard errors by treatment.
In the first R chunk we read in the data.
```{r}
# seed <- read.csv(file = "Lab05SeedMassData.txt", header = TRUE)
seed <- read.table(header = TRUE, text = "
id block Treatment seedMass.g
1 1 WNE 0.8452
2 1 WNE 1.628599896
3 1 WNE 1.71330012
5 1 WNE 0.605599925
63 2 WNE 1.478700073
64 2 WNE 1.655799925
65 2 WNE 3.579000285
66 2 WNE 2.52810012
67 2 WNE 0.676500068
119 3 WNE 0.84060003
120 3 WNE 0.455200004
121 3 WNE 0.642700076
122 3 WNE 0
6 1 wNE 2.39799978
7 1 wNE 3.630199812