---
output:
html_document:
fig_caption: yes
number_sections: yes
theme: readable
toc: yes
toc_depth: 3
---
# Chi-square: Goodness of Fit and Test of Independence {#chChisq}
## Learning Objectives for Chapter
<!-- Jen's: -->
<!-- 1. Construct a contingency table and calculate its degrees of freedom. -->
<!-- 2. State if a larger or smaller chi square value indicates stronger dependency between variables in a test of independence. -->
<!-- 3. Describe the theory behind how we calculate the expected values in a test of independence. -->
1. Define independence between random variables.
1. Identify problems that can be addressed as goodness of fit or independence tests.
1. Define degrees of freedom for the $\chi^2$ distribution.
1. Test for independence of categorical random variables.
1. Test the goodness of fit of theoretical distributions to data.
1. Look up quantiles and probabilities for the $\chi^2$ distribution.
## The $\chi^2$ distribution
The chi-square ($\chi^2$) distribution is widely used in a variety of statistical tests. It describes the distribution of the sum of the squares of $k$ independent standard normal random variables. Two of the most important applications of the $\chi^2$ distribution are the test for goodness of fit of an observed distribution to a theoretical one and the test of independence between two categorical random variables.
In both tests, we calculate a $\chi^2$ test statistic based on the differences between observed and expected frequencies under some theoretical model.
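As a quick visual reference, here is one way to sketch the $\chi^2$ density for a few example degrees of freedom using base R (the particular df values here are our choice, purely for illustration):
```{r}
#Chi-square densities for df = 2, 4, and 8
x <- seq(0, 20, 0.1)
plot(x, dchisq(x, df = 2), type = "l",
     xlab = expression(chi^2), ylab = "Density")
lines(x, dchisq(x, df = 4), lty = 2)
lines(x, dchisq(x, df = 8), lty = 3)
legend("topright", legend = paste("df =", c(2, 4, 8)), lty = 1:3)
```
Note how the distribution shifts right and flattens as the degrees of freedom increase.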
## Goodness of Fit: Discrete distributions
In goodness of fit tests for discrete distributions, we are dealing with a single random variable with discrete classes. Suppose we take a sample from a population in which the random variable has $k$ different classes. We have an expected theoretical probability distribution across the $k$ classes and want to test the "goodness of fit" of the population to the theoretical distribution using our sample.
Our data would consist of $k$ observed frequencies ($O_1, O_2, \ldots, O_k$), which we would compare to $k$ expected frequencies ($E_1, E_2, \ldots, E_k$). Then we can calculate $\chi^2$ as
<br>
\begin{equation}
\chi^2 = \sum_{i = 1}^k \frac{(O_i - E_i)^2}{E_i} \\[20pt]
df = k-1
(\#eq:chi)
\end{equation}
<br>
If there are only 2 classes involved (i.e., k = 2, df = 1), then we use an adjusted $\chi^2$ formula:
\begin{equation}
\chi^2_{adj} = \sum_{i = 1}^2 \frac{(\mid{O_i - E_i}\mid - 1/2)^2}{E_i}
,\; df = 1
(\#eq:chiadj)
\end{equation}
The smaller the differences between the observed frequencies and those expected under our theoretical model, the smaller the $\chi^2$ value. The larger the $\chi^2$ value, the less likely it is that the population follows our theoretical distribution. This deviates from previous tests, where our general desire was to reject the null hypothesis, because rejection meant there was a significant effect of some variable of interest on another. Here, our null hypothesis is that the variable in our population of interest follows our theoretical distribution, and rejecting the null hypothesis would suggest that it does not.
An example of a discrete theoretical distribution for a single variable comes from Mendel's experiments with the common pea. He studied seven traits in total, but we will focus on two: seed shape and seed color. Mendel produced an F1 generation by crossing true-breeding parental pea plants producing wrinkled seeds (the recessive trait) with true-breeding parental plants producing round seeds (the dominant trait). All of the F1 produced round seeds. He then produced an F2 generation through self-fertilization of the F1 hybrids, and this time he observed a 3:1 phenotypic ratio of round to wrinkled seeds. This 3:1 ratio of dominant to recessive phenotypes held for all seven traits he studied. We can use his dataset on pea seed shape to test the goodness of fit to a 3:1 ratio.
These were the observed frequencies for each seed shape phenotype:
```{r, echo = F, warning = F}
mendel <- read.csv("./Datasets/MendelPeas.csv")
roundO <- sum(mendel[mendel$shape == "round", "count"])
wrinkledO <- sum(mendel[mendel$shape == "wrinkled", "count"])
(Observed <- c(roundO, wrinkledO))
```
We first sum the total number of pea observations to calculate the expected frequencies under a 3:1 ratio.
$423 + 133 = 556$
Then we can figure out the expected values given a 3:1 ratio:
```{r}
Er <- (556/4)*3
Ew <- (556/4)*1
(Expected <- c(Er, Ew))
```
Now, we can combine the expected frequencies with the observed frequencies in a table:
```{r}
mTab <- data.frame(Observed, Expected)
rownames(mTab) <- c("round", "wrinkled")
mTab
```
We can calculate a $\chi^2$ value for both seed shapes, but we will need to use the $\chi^2_{adj}$ formula because we only have $k = 2$ classes.
```{r}
(chiAdj <- sum((abs(Observed - Expected) - 0.5)^2 / Expected))
```
We now have a $\chi^2_{adj}$ test statistic that we can compare to a critical $\chi^2$ value with $df = 1$ at $\alpha = 0.05$. In R, we can use the qchisq() function to get a critical $\chi^2$ value, and the pchisq() function to obtain a p value for our test statistic:
```{r}
(criticalChi <- qchisq(p = 1 - 0.05, df = 1))
(pval <- pchisq(chiAdj, df = 1, lower.tail = F))
```
As you can see, our $\chi^2_{adj}$ value of 0.29 is much smaller than the critical $\chi^2$ value of 3.84, and our p value is 0.59: there is a 59% probability of seeing our results, or more extreme results, if the null hypothesis were true. We therefore fail to reject the null hypothesis and conclude that our observed phenotypic ratio is consistent with the expected ratio of 3:1.
We can use the same data to test the assumption that the combination of two traits will have a phenotypic ratio of 9:3:3:1 under Mendel's law of independent assortment. We will look at the combination of seed shape and color, with round/yellow (RY), round/green (RG), wrinkled/yellow (WY), and wrinkled/green (WG) as the possible phenotypes. For seed color, yellow is the dominant trait. We would expect the phenotypes to follow a 9:3:3:1 ratio in the order RY:RG:WY:WG.
```{r}
mendel
(Observed2 <- mendel[,c(3:4)])
#Just as before, the total number of plants is 556.
ExpRY <- 9*(556/16)
ExpRG <- 3*(556/16)
ExpWY <- 3*(556/16)
ExpWG <- 1*(556/16)
(Expected2 <- c(ExpRY, ExpRG, ExpWY, ExpWG))
mTab2 <- data.frame(Observed2, Expected2)
mTab2
```
We can use the regular $\chi^2$ formula to obtain our test statistic because we have more than 2 classes:
```{r}
(chi2 <- sum((Observed2$count - Expected2)^2 / Expected2))
```
And use R to find a critical $\chi^2$ value and p value again:
```{r}
k <- 4 # We have 4 classes
(criticalChi2 <- qchisq(p = 1 - 0.05, df = k-1))
(pval2 <- pchisq(chi2, df = k-1, lower.tail = F))
```
We once again fail to reject the null hypothesis because our p value is much greater than 0.05.
We can also use the chisq.test() function in R to perform these tests very easily. We will demonstrate how to do this using the test for two traits. The function requires two arguments for a goodness of fit test:
- x: a vector of observed frequencies
- p: a vector of expected probabilities
```{r}
x <- Observed2$count
p <- c(9/16, 3/16, 3/16, 1/16)
chisq.test(x = x, p = p)
```
You can see that all results from the chisq.test() function match our calculations exactly.
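One caveat: for the two-class seed shape test, chisq.test() would not apply the 1/2 continuity correction we used in $\chi^2_{adj}$ (its correct argument only affects 2x2 contingency tables), so its statistic differs slightly from our hand-calculated value. A sketch using the Observed vector from earlier:
```{r}
#Goodness of fit for the two seed shape classes under a 3:1 ratio.
#No continuity correction is applied here, so the statistic is a bit
#larger than our hand-calculated chi-square adjusted value of 0.29.
chisq.test(x = Observed, p = c(3/4, 1/4))
```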
## Goodness of Fit: Continuous distributions
We often wish to know whether our data follow a particular continuous probability distribution. One common application of a $\chi^2$ goodness of fit test for a continuous distribution is testing whether continuous data follow a normal distribution.
To perform this test, we bin the continuous data into discrete classes of equal width, calculate a $\chi^2$ contribution for each interval, and sum these to obtain an overall $\chi^2$ test statistic. We will use the 144 adult cat heart weights in the cats dataset that comes with the MASS package in R to test whether cat heart weights are normally distributed.
First, we sort the observations from smallest to largest; later we will set up a vector recording the endpoints of 9 discrete bins spanning the data range:
```{r}
library(MASS)
cwt <- cats
#sorting the heart weights into a vector
weights <- sort(cwt$Hwt)
```
Now, we can do a quick check to see if the data appears normally distributed before even doing our goodness of fit test:
```{r}
hist(weights)
```
The data looks right-skewed, so we might consider log-transforming the weights to make them more normally distributed:
```{r}
hist(log(weights))
```
That looks much better. We will use the log-transformed heart weights instead to test for normality.
First, we will set up a vector containing the endpoints of 9 discrete intervals of equal length within the range of observed log(weight) values:
```{r}
k <- 9 #number of intervals
#log transform
logwt <- log(weights)
#Total range of values (named rng to avoid masking base R's range()):
rng <- max(logwt) - min(logwt)
#Length of each bin interval when k = 9:
binInt <- rng/k
#The endpoints of the bin intervals:
endpoints <- seq(min(logwt), max(logwt), binInt)
```
The next part is a little trickier: we need the frequency of observed values falling in each bin interval. We will show you how to do this using a for loop in R.
```{r}
#set up a vector of NAs that will hold the observed frequencies
Ofreq <- rep(NA, 9)
#For each interval i, count the logwt values greater than the previous
#endpoint and less than or equal to endpoint i. We start at i = 2
#because the first endpoint is the minimum logwt.
for(i in 2:10){
  Ofreq[i-1] <- sum(logwt > endpoints[i-1] &
                    logwt <= endpoints[i])
}
#add 1 to Ofreq[1] because the minimum logwt itself is excluded by the
#strict inequality above.
Ofreq[1] <- Ofreq[1] + 1
Ofreq
```
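As an aside, the same binning can be done more compactly with cut() and table(); this sketch assumes the logwt and endpoints objects created above:
```{r}
#Right-closed intervals matching the loop above; include.lowest = TRUE
#pulls the minimum logwt into the first interval
as.numeric(table(cut(logwt, breaks = endpoints, include.lowest = TRUE)))
```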
Now that we have all of our observed frequencies, we need to work out the expected frequencies under the null hypothesis that our data are normally distributed. We first find the z score of each interval endpoint using the mean and standard deviation of the log-transformed weights:
```{r}
meanWt <- mean(logwt)
sdWt <- sd(logwt)
(zscores <- (endpoints[-1] - meanWt) / sdWt)
```
We can now use the z scores to find the probability within each interval from a standard normal probability distribution. We do this by finding the left-tailed cumulative probability at each endpoint and subtracting the left-tailed cumulative probability at the previous endpoint. For example, suppose we know the left-tailed probability at a z score of -1:
```{r, echo = F}
pnorm(-1)
x <- seq(-4, 4, 0.1)
y <- dnorm(x)
plot(x, y, type = "l", ylab = "", xlab = "")
polygon(c( x[x<=-1], -1 ), c(y[x<=-1],0 ), col="blue")
text(-3, 0.1, "p = 0.16 at z = -1")
```
And we know the left-tailed probability at z = -2:
```{r, echo = F}
pnorm(-2)
x <- seq(-4, 4, 0.1)
y <- dnorm(x)
plot(x, y, type = "l", ylab = "", xlab = "")
polygon(c( x[x<=-2], -2 ), c(y[x<=-2],0 ), col="gray")
text(-3, 0.1, "p = 0.02 at z = -2")
```
Then the probability between $z = -2$ and $z = -1$ is $p(z \leq{-1}) - p(z \leq{-2})$:
```{r, echo = F}
pnorm(-1) - pnorm(-2)
x <- seq(-4, 4, 0.1)
y <- dnorm(x)
plot(x, y, type = "l", ylab = "", xlab = "")
polygon(c(-2, x[x<=-1 & x >= -2], -1 ), c(0,y[x<=-1 & x >= -2],0 ), col="red")
text(-2.25, 0.1, "p = 0.14")
```
To calculate this in R, we can use the pnorm() function:
```{r}
pnorm(-1) - pnorm(-2)
```
We will do this for each interval endpoint in our logwt dataset using a for loop:
```{r}
probs <- rep(NA, 9)
probs[1] <- pnorm(zscores[1])
for(i in 2:9){
probs[i] <- pnorm(zscores[i]) - pnorm(zscores[i-1])
}
```
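As an aside, the same probabilities can be obtained in one line with diff(): prepending $-\infty$ (where the cumulative probability is 0) makes the first element equal pnorm(zscores[1]), reproducing the loop above:
```{r}
#Differences of cumulative probabilities across the interval endpoints
probsAlt <- diff(pnorm(c(-Inf, zscores)))
all.equal(probs, probsAlt)
```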
Now that we have the expected probabilities for each interval if our data were normally distributed, we can calculate expected frequencies by multiplying the expected probabilities by the total number of logwts in our dataset:
```{r}
(Efreq <- probs * sum(Ofreq))
```
For each interval, we can calculate the squared deviation of the observed frequency from the expected frequency divided by the expected frequency, and then sum these values up to obtain our $\chi^2$ statistic:
```{r}
(chiSq <- (sum((Ofreq - Efreq)^2/Efreq)))
```
And as per usual, we can compare this $\chi^2$ statistic to a critical $\chi^2$ and obtain a p value.
The number of degrees of freedom is, in general, the number of dimensions in which the results are free to vary. For the goodness of fit test, the degrees of freedom are the total number of observed values used in the calculation of $\chi^2$ minus the number of calculated values or parameters used to estimate the expected values. In this example, we have 9 observed frequencies, and in order to calculate the expected frequencies we used the total number of observations (144), the estimated mean (2.34), and the estimated standard deviation (0.22). Therefore, $df = 9 - 3 = 6$.
```{r}
k <- 9
#Critical chi-squared value at alpha = 0.05
qchisq(p = 1 - 0.05, df = k-3)
#p value at our test statistic
pchisq(chiSq, k-3, lower.tail = F)
```
We can see that our $\chi^2$ test statistic is much smaller than the critical value and that p > 0.05, so we fail to reject the null hypothesis that the log of the cat heart weight data is normally distributed.
## The $\chi^2$ Test of Independence: Contingency Tables
The test of independence is related to the goodness of fit test. Just as in goodness of fit tests, we compare observed frequencies to frequencies expected under a theoretical model. In the test of independence, the theoretical model is specifically that two categorical variables are independent. The categorical variables each have a number of classes. When we classify our observations according to the two variables in a two-way table with $j$ rows and $k$ columns, we create a **contingency table**. For such a contingency table, $df = (j-1)(k-1)$.
We show an example of a contingency table in Table \@ref(tab:cont).
```{r cont, echo = F, warning = F}
planttab <- read.csv("./Datasets/rareplants2.csv")[-1]
knitr::kable(x = planttab,
col.names = c("Species type", "Dry", "Wet", "Wet or Dry", "Total"),
caption = "Observed frequencies of common versus rare plant species in South Australia and Victoria, Australia according to habitat type (dry only, wet only, and wet or dry)."
)
```
Does habitat (wet or dry) influence the prevalence of common or rare plant species in South Australia and Victoria?
The null hypothesis of our test is that habitat type and species type are independent of one another. We have observed frequencies in the table above, but how do we calculate expected frequencies? The key is that if the two variables are independent, then the joint probability in each cell of our contingency table is the product of the corresponding row and column marginal probabilities. In other words, the joint probability of a plant species being both common and found in a dry-only habitat would equal the probability of it being found in a dry-only habitat multiplied by the probability of it being common, *if* habitat type and species type are independent of one another.
We first build a table that contains the estimated joint probabilities for each cell under the null hypothesis of independence.
```{r}
#create a table from the original data table, with appropriate dimensions just to hold joint probabilities:
plantP <- planttab[1:2,2:4]
#total sample size:
Grand <- as.numeric(planttab[3,5])
#Create vectors of the marginal probabilities
(HabP <- as.numeric(planttab[3,2:4])/Grand)
(SpP <- as.numeric(planttab[1:2,5])/Grand)
#multiply the row and column marginal probabilities to get the joint probabilities in each cell:
for(j in 1:2){
for(k in 1:3){
plantP[j,k] <- SpP[j] * HabP[k]
}
}
rownames(plantP) <- c("common", "rare")
knitr::kable(x = round(plantP,2),
col.names = c("Dry", "Wet", "Wet or Dry"),
caption = "Joint probabilities given the marginal probabilities in rows and columns"
)
```
Now that we have theoretical probabilities, we can multiply them by the total sample size (n = 694) to obtain expected frequencies in each cell of our contingency table.
```{r}
Exp <- plantP * Grand
knitr::kable(x = Exp,
col.names = c("Dry", "Wet", "Wet or Dry"),
caption = "Expected frequencies if the rareness of plant species was independent of habitat type"
)
```
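Equivalently, the expected frequencies are simply the outer product of the marginal probabilities, scaled by the grand total; a one-line version using the SpP, HabP, and Grand objects above:
```{r}
#Rows follow the species type marginals, columns the habitat marginals
outer(SpP, HabP) * Grand
```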
We know that we have $(2-1) \times (3-1) = 2$ degrees of freedom, so there is no need to use the $\chi^2_{adj}$ formula. We can find the sum of each cell contribution to the test statistic using the normal $\chi^2$ formula, equation \@ref(eq:chi):
```{r}
#Matrices of the observed and expected frequencies:
Obs <- apply(planttab[1:2, 2:4], c(1,2), "as.numeric")
Exp
(chiSq <- sum((Obs - Exp)^2/Exp))
```
Our critical $\chi^2$ value and p value are:
```{r}
(crit <- qchisq(1-0.05, df = 2))
(pval <- pchisq(chiSq, df = 2, lower.tail = F))
```
We can also use the chisq.test() function in R to perform the same test:
```{r}
chisq.test(Obs)
```
You can see we obtain identical results.
We see that our test statistic is much larger than our critical value (18.06 > 5.99), and that our p value ($p = 0.00012$) is much smaller than 0.05. We reject the null hypothesis and conclude that habitat type *does* have a relationship with the rareness of plant species in South Australia and Victoria.
If we take a closer look at the individual contributions to the $\chi^2$ statistic:
```{r}
(Obs - Exp)^2/Exp
```
we see that the frequencies of rare plants in wet and in wet-or-dry habitats in particular show larger deviations of observed from expected frequencies.
Let's look again at our original observed frequencies and compare them with our expected frequencies.
Observed:
```{r, echo = F}
knitr::kable(x = planttab[1:2, 1:4],
col.names = c("Species type", "Dry", "Wet", "Wet or Dry")
)
```
Expected:
```{r, echo = F}
rownames(Exp) <- NULL
Exp$spt <- c("common", "rare")
Exp <- cbind(Exp$spt, round(Exp[1:3],0))
knitr::kable(x = Exp,
col.names = c("Species type", "Dry", "Wet", "Wet or Dry")
)
```
It appears that rare species are found less often than expected in wet-or-dry habitats, but slightly more often than expected in wet-only habitats. This suggests that wet-only habitats might offer better conditions for rare plant species than habitats with variable moisture.
## Exercises and Solutions
## Homework Problems
## Laboratory Exercises {#LabCh14PLS}
```
---
title: "Lab14 Chi Square: Contingency Tables"
author: "YourFirstName YourLastName"
date: "enter date here"
output: html_document
---
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Plant Sciences
We can use the $\chi^{2}$ distribution to test for a relationship between two categorical variables. This type of test is called a **test of independence**. We specifically test the null hypothesis that there is statistical independence between the two categorical variables. We will be working with a dataset published by @PierantozziPierluigi2012PaCI. They were interested in the effect of olive plant cultivar type on the number of seeds produced per olive fruit. Olive fruit can contain one seed, two seeds, or none: one seed is the common case, two seeds occur rarely, and some fruit contain no seeds at all due to embryo death. In 2007, they collected 1,072 olives from different cultivar plants in Central Italy, noting the number of seeds each fruit contained.
To perform this test of independence, we will (1) create a contingency table, (2) calculate the frequencies we would expect to observe if cultivar type and number of seeds are independent, (3) calculate a $\chi^{2}$ test statistic that contains information about the deviation of our observations from what is expected under $H_{0}$, and finally (4) compare the $\chi^{2}$ test statistic against the critical $\chi^{2}$ value to decide if we reject or fail to reject $H_{0}$.
#### Part 1. Create contingency table
The first step in performing a test of independence is to build a **contingency table** with j rows and k columns. One variable is classed into j factor levels in the rows, and the second variable is classed into k factor levels in the columns. The frequencies of each combination of attributes between the two variables fill the table and are used to calculate the marginal totals for each attribute.
1A) Read in the olive data set as a csv file
1B) Create a contingency table using the table() function.
1C) Calculate the marginal totals of each attribute type of each of the two variables.
```{r}
olives <- read.csv("./Datasets/olives.csv")
##use table() to calculate the frequencies of each cultivar/No.seeds combination in the data.
(olivect <- table(olives))
#marginal totals:
(cultSums <- rowSums(olivect))
(seedSums <- colSums(olivect))
```
**ANSWER THE FOLLOWING QUESTIONS:**
1D) State the null and alternative hypotheses.
#### Part 2. Expected Frequencies
We have a table of our observed frequencies, but how do we calculate what we would have expected if No.Seeds is independent of cultivar type? Under $H_{0}$, we would expect the joint probability of each cell in our contingency table to be based on the probability of two independent "events" (represented by the row and column categories) happening together. The probability of two independent events occurring simultaneously is simply the product of the probabilities of each event. For example, if $H_{0}$ is true, the probability that an olive fruit randomly sampled from this population of olive plants is both an Ascolana Tenera cultivar **and** contains 2 seeds is equal to the probability of being an Ascolana Tenera cultivar type multiplied by the probability of having 2 seeds. We estimate these probabilities using the marginal totals we calculated above, divided by the total number of olive fruits we sampled.
2A) Calculate the marginal probabilities of each variable attribute.
2B) Using the marginal probabilities from 2A, create a table of expected joint probabilities for each combination of cultivar type and number of seeds.
2C) Using the joint probabilities from 2B, calculate the expected frequencies of each combination of cultivar type and number of seeds.
```{r}
#First, calculate the total number of olive fruits used in this study:
(total <- sum(cultSums)) #or
total <- nrow(olives)
#2A:
#Calculate the marginal probabilities of being a particular cultivar type:
(cultp <- cultSums/total)
#Calculate the marginal probabilities of having a certain number of seeds:
(seedp <- seedSums/total)
#2B:
#Fill a table with the calculated joint probabilities for each combination of cultivar and number of seeds:
##Copy the dimensions of the cont. table to a new table. We will do this by simply copying the original contingency table, then replacing the original observed frequencies with joint probabilities:
jprobs <- as.matrix(olivect)
#Fill the "Ascolana Tenera" cultivar row with joint probabilities by multiplying the marginal "Ascolana Tenera" probability by all the marginal probabilities from seedp. Since the probabilities from seedp are in the same order as the jprobs columns, this will work fine:
jprobs["Ascolana Tenera", ] <- cultp["Ascolana Tenera"] * seedp
#Do the same for the other three cultivar types:
jprobs["Carolea", ] <- cultp["Carolea"] * seedp
jprobs["Leccino", ] <- cultp["Leccino"] * seedp
jprobs["Picholine", ] <- cultp["Picholine"] * seedp
#2C:
#Calculate the expected frequencies based on the joint probabilities under the null hypothesis multiplied by the total number of fruits sampled.
(oliveExp <- jprobs * total)
```
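As a sanity check, the same expected frequencies can be obtained in one line from the marginal totals, without building the jprobs table at all:
```{r}
#Expected cell count = (row total * column total) / grand total
outer(cultSums, seedSums) / total
```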
**ANSWER THE FOLLOWING QUESTIONS:**
2D) Explain in your own words how we use probability theory to calculate expected frequencies if number of seeds was independent of cultivar type.
#### Part 3. Calculate $\chi^{2}$
The $\chi^{2}$ values are calculated from the square of the differences between the observed and expected frequencies, weighted by the expected frequencies.
3A) Use the following formula to calculate the $\chi^{2}$ test statistic for our contingency table:
$$\chi^{2} = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
```{r}
#Calculate chi square values for each cell:
(chisqtable <- (olivect - oliveExp)^2 / oliveExp)
(chisq <- sum(chisqtable))
```
**ANSWER THE FOLLOWING QUESTIONS:**
3B) Looking at the chisqtable table object that holds individual $\chi^{2}$ values, are there any cultivar/No.Seeds combinations that stand out as particularly different from what we would have expected under $H_{0}$? How did you determine this?
3C) What was the direction of difference between the observed and expected frequencies in the cells you named for 3B? Based on this observation, if it is desirable to have more seeds per olive because this yields a larger fruit with a higher pulp/pit ratio [@PierantozziPierluigi2012PaCI], what cultivar(s) would you recommend to olive farmers to maximize profits?
#### Part 4. Results
4A) Calculate the degrees of freedom
4B) Calculate the critical $\chi^{2}$ value
4C) Obtain the p value
4D) Use a built in R function to do a test of independence
```{r}
#Calculate the degrees of freedom
j <- nrow(olivect); k <- ncol(olivect)
df <- (j - 1) * (k - 1)
#Calculate the critical value
(crit <- qchisq(0.95, df))
#Is our test statistic equal to or greater than our critical value?
chisq >= crit
#You could also simply get the p value to use as your decision rule:
(pvalue <- pchisq(chisq, df, lower.tail = F))
#We just performed a test of independence by hand, but R actually has a single function that will do it all at once for you:
chisq.test(olivect)
```
**ANSWER THE FOLLOWING QUESTIONS:**
4E) Write a conclusion. Include the test statistic, p value, whether or not you reject the null hypothesis, and a statement or two about what you would recommend to olive producers based on these results.
### Animal Sciences
We can use the $\chi^{2}$ distribution to test for a relationship between two categorical variables. This type of test is called a *test of independence*. We specifically test the null hypothesis that there is statistical independence between the two categorical variables. A sample of 169 mice was divided into three groups: 57 that received a standard dose of pathogenic bacteria followed by antiserum, 58 that received the same dose of bacteria followed by an experimental treatment, and a control group of 54 that received the bacteria but no treatment. After sufficient time had elapsed for an incubation period and for the disease to run its course, 45 dead mice and 124 survivors were counted. Of those that died, 13 had received bacteria and antiserum, 7 had received the experimental treatment, and 25 had received bacteria only. A question of interest is whether the antiserum protected the mice in any way, so that there were proportionally more survivors in that group.
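For reference, the counts in the paragraph above fully determine the contingency table; this sketch builds it directly from those numbers as a cross-check against the csv you will read in below:
```{r}
#Deaths per group from the narrative; survivors = group size - deaths
dead   <- c(antiserum = 13, experimental = 7, control = 25)
groupN <- c(antiserum = 57, experimental = 58, control = 54)
cbind(dead = dead, alive = groupN - dead)
```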
To perform this test of independence, we will (1) create a contingency table, (2) calculate the frequencies we would expect to observe if treatment and survival outcome are independent, (3) calculate a $\chi^{2}$ test statistic that contains information about the deviation of our observations from what is expected under $H_{0}$, and finally (4) compare the $\chi^{2}$ test statistic against the critical $\chi^{2}$ value to decide if we reject or fail to reject $H_{0}$.
#### Part 1. Create contingency table
The first step in performing a test of independence is to build a *contingency table* with j rows and k columns. One variable is classed into j factor levels in the rows, and the second variable is classed into k factor levels in the columns. The frequencies of each combination of attributes between the two variables fill the table and are used to calculate the marginal totals for each attribute.
1A) Read in the mice data set as a csv file
1B) Create a contingency table using the table() function.
1C) Calculate the marginal totals of each attribute type of each of the two variables.
```{r}
mice <- read.csv("./Datasets/mice.csv")
##use table() to calculate the frequencies of each treatment/result combination in the data.
(micect <- table(mice))
#marginal totals:
(treatSums <- rowSums(micect))
(survSums <- colSums(micect))
```
**ANSWER THE FOLLOWING QUESTIONS:**
1D) State the null and alternative hypotheses.
#### Part 2. Expected Frequencies
We have a table of our observed frequencies, but how do we calculate what we would have expected if the survival of mice is independent of treatment? Under $H_{0}$, we would expect the joint probability of each cell in our contingency table to be based on the probability of two independent "events" (represented by the row and column categories) happening together. The probability of two independent events occurring simultaneously is simply the product of the probabilities of each event. For example, if $H_{0}$ is true, the probability that a mouse randomly sampled from this population of mice would receive the control treatment *and* remain alive is equal to the probability of receiving the control treatment multiplied by the probability of remaining alive. We estimate these probabilities using the marginal totals we calculated above, divided by the total number of mice we sampled.
2A) Calculate the marginal probabilities of each variable attribute.
2B) Using the marginal probabilities from 2A, create a table of expected joint probabilities for each combination of treatment type and survival outcome.
2C) Using the joint probabilities from 2B, calculate the expected frequencies of each combination of treatment type and survival outcome.
```{r}
#First, calculate the total number of mice used in this study:
(total <- sum(treatSums)) #or
total <- nrow(mice)
#2A:
#Calculate the marginal probabilities of getting a particular treatment:
(treatp <- treatSums/total)
#Calculate the marginal probabilities of surviving or not:
(survp <- survSums/total)
#2B:
#Fill a table with the calculated joint probabilities for each combination of treatment type and survival outcome:
##Copy the dimensions of the cont. table to a new table. We will do this by simply copying the original contingency table, then replacing the original observed frequencies with joint probabilities:
jprobs <- as.matrix(micect)
#Fill the "antiserum" treatment row with joint probabilities by multiplying the marginal "antiserum" probability by all the marginal probabilities from survp. Since the probabilities from survp are in the same order as the jprobs columns, this will work fine:
jprobs["antiserum", ] <- treatp["antiserum"] * survp
#Do the same for the other two treatment types:
jprobs["experimental", ] <- treatp["experimental"] * survp
jprobs["control", ] <- treatp["control"] * survp
#2C:
#Calculate the expected frequencies based on the joint probabilities under the null hypothesis multiplied by the total number of mice sampled.
(miceExp <- jprobs * total)
```
**ANSWER THE FOLLOWING QUESTIONS:**
2D) Explain in your own words how we use probability theory to calculate expected frequencies if survival outcome was independent of treatment type.
#### Part 3. Calculate $\chi^{2}$
The $\chi^{2}$ values are calculated from the square of the differences between the observed and expected frequencies, weighted by the expected frequencies.
3A) Use the following formula to calculate the $\chi^{2}$ test statistic for our contingency table:
$$\chi^{2} = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
```{r}
# Calculate chi square values for each cell:
(chisqtable <- (micect - miceExp)^2 / miceExp)
(chisq <- sum(chisqtable))
```
**ANSWER THE FOLLOWING QUESTIONS:**
3B) Looking at the chisqtable table object that holds individual $\chi^{2}$ values, are there any treatment/survival outcome combinations that stand out as particularly different from what we would have expected under $H_{0}$? How did you determine this?
3C) What was the direction of difference between the observed and expected frequencies in the cells you named for 3B? Based on this observation, what treatment type would you recommend to control the pathogenic bacteria?
#### Part 4. Results
4A) Calculate the degrees of freedom
4B) Calculate the critical $\chi^{2}$ value
4C) Obtain the p value
4D) Use a built in R function to do a test of independence
```{r}
# Calculate the degrees of freedom
j <- nrow(micect); k <- ncol(micect)
df <- (j - 1) * (k - 1)
# Calculate the critical value
(crit <- qchisq(0.95, df))
# Is our test statistic equal to or greater than our critical value?
chisq >= crit
# You could also simply get the p value to use as your decision rule:
(pvalue <- pchisq(chisq, df, lower.tail = F))
# We just performed a test of independence by hand, but R actually has a single
# function that will do it all at once for you:
chisq.test(micect)
```
**ANSWER THE FOLLOWING QUESTIONS:**
4E) Write a conclusion. Include the test statistic, p value, whether or not you reject the null hypothesis, and a statement or two about what recommendations you would make based on your results.