---
output:
  html_document:
    fig_caption: yes
    number_sections: yes
    theme: readable
    toc: yes
    toc_depth: 3
---
# Linear Regression {#chLinReg}
## Learning Objectives for Chapter
1. State the null and alternative hypotheses for simple linear regression (SLR).
1. Describe how variance is partitioned in SLR.
1. Calculate the slope and intercept and their confidence intervals.
1. Interpret what the slope and intercept mean in real, biological terms.
1. Calculate and interpret the coefficient of determination.
1. List examples of uses for simple linear regression.
1. Write the model and assumptions for simple linear regression.
1. List strategies to increase the precision of slope estimates.
1. Determine if a problem requires an estimate for the expected value or for a single observation.
1. Calculate and interpret an ANOVA table for SLR.
1. Discuss the difference between estimating parameters and making predictions in SLR.
1. Build confidence intervals for predictions of expected value and individual responses.
## Introduction to Simple Linear Regression
### Goals of Simple linear regression
Simple linear regression can be used for multiple purposes, the vast majority of which can be classified into two categories:
1. Prediction. Emphasis is on values of Y expected for given levels of X.
2. Explanation or understanding. Emphasis is on regression coefficients.
In the case of **prediction**, the goal of SLR is to provide a means to estimate values of the response variable $Y$, or expected values of the response variable $\mu_{Y|X}$, for values of the independent variable that may not have been observed. In this application we are not really concerned with why Y is related to X or whether they have a cause-and-effect relationship. For example, we may use the relationship between femur length and age to estimate the age at death of fossilized mammals. Obviously, femur length does not cause animals to be younger or older. Similarly, we may use the thickness of the triceps skin pinch to estimate a person's body fat content. The thickness of the pinch does not cause people to have more or less fat.
In the case of **understanding**, the goal is to determine how, and by how much, a causal variable affects another, and how strong their relationship is. For example, we may determine the relationship between total food consumed and weight gain in a beef steer. The food consumed is the cause of the gain, and the relationship is important because it establishes how much gain there is per unit of food consumed: the slope of the relationship is the food conversion efficiency. Here we are specifically interested in having good estimates of the slope and its sampling variance. We are also interested in the variance of $\epsilon$, because it is an indicator of the effect of other factors on gain and food conversion efficiency.
### Fixed and Random X cases
There are two types of situations in simple linear regression and in regression in general: fixed X and random X. The main point of this section is that all equations presented in this chapter are strictly valid for the fixed X case, and are approximations in the random X case.
**Fixed X** means that the values of the predictor or independent variable X are known without error and fixed values. X is not a random variable and its values did not result from a random process or experiment. The theory and calculations presented in this chapter are specifically valid for this situation.
According to [@RossetTibshirani2018], the fixed X case is the most commonly used one, and it is characterized by the following:
1. The covariate values $X_1, \ldots, X_k$ used to estimate parameters are not random but constitute a set of selected or designed values, and the only randomness in the estimates is due to the responses $Y_1, \ldots, Y_k$.
2. The covariate values used for prediction exactly match $X_1, \ldots, X_k$, and the corresponding responses are independent realizations of $Y_1, \ldots, Y_k$, respectively.
Examples of fixed X cases are:
- Application of prescribed quantities of fertilizer to field plots to study yield.
- Previously selected levels of feeding for animals to study weight gain.
- Drug dose given to rats to determine effect on metabolic rate.
- Dose of active principle per acre to study effect on pest density.
**Random X** means that the values of the predictor X are not set by the experimenter, but instead are random values resulting from a random process. The random process may be the process of measurement and/or the process of sampling the individuals or experimental units on which the response and the predictor are measured.
Examples of random X cases are:
- Measure the available P concentration in soil and yield in field plots.
- Measure triceps pinch thickness and body fat content of several people.
- Measure total income and food expenses in a set of randomly selected families.
- Measure the concentration of a hormone in blood and the milk yield of dairy cows.
When there is measurement error in the values of X, one consequence is that the estimate of the slope will be biased toward zero (attenuated). When possible, selecting a wide range of X values relative to the variance of the measurement error ameliorates the problem. Other things being equal, a wide range of X values also increases the precision of the estimates, because all the equations for the variances have the sum of squares of X in a denominator.
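The attenuation effect can be illustrated with a small simulation (a sketch with arbitrary parameter values and variable names, not part of the examples used later). The true slope is 1; X is measured with error, once over a narrow range and once over a wide range, and the average estimated slope is compared. With the wide range of X the average estimated slope should be close to 1, while the narrow range should yield slopes pulled strongly toward 0.
```{r}
set.seed(123)
true.slope <- 1
sigma.eps  <- 1    # sd of the regression errors
sigma.u    <- 2    # sd of the measurement error in X

sim.slope <- function(x.true) {
  y     <- 5 + true.slope * x.true + rnorm(length(x.true), sd = sigma.eps)
  x.obs <- x.true + rnorm(length(x.true), sd = sigma.u)  # X measured with error
  coef(lm(y ~ x.obs))[2]                                 # estimated slope
}

narrow <- replicate(1000, sim.slope(runif(50, 0, 5)))    # narrow range of X
wide   <- replicate(1000, sim.slope(runif(50, 0, 50)))   # wide range of X
c(mean(narrow), mean(wide))    # both attenuated, but much less so for the wide range
```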
### Population Relationship
In simple linear regression, we are interested in the relationship between two continuous variables. More specifically, we are interested in how one variable (Y, the dependent variable) changes with each unit change in another variable (X, the independent variable). The mathematical model that describes how the mean of Y changes as a function of X is that of a straight line. Thus, simple linear regression applies to populations where the relationship between the expected (mean) value of Y and X is of the following form:
<br>
$$Y_i = \beta_0 + \beta_1X_i + \epsilon_i = \mu_{Y|X_i} + \epsilon_i\\[25pt]
\epsilon_{i} \sim N(0, \ \sigma_{\epsilon}^2)$$
<br>
where $Y_i$ is the observed value of variable $Y$ for observation $i$, and $X_i$ is the corresponding value of the predictor variable for that observation. The variable $\epsilon_i$ represents the random vertical deviation of the observed $Y_i$ from the true line that represents the population relationship between Y and X. The "error" $\epsilon$ is normally distributed with mean 0 and variance $\sigma_{\epsilon}^2$.
Note that the true values of Y for each X are not on the line; they have a normal distribution about the line for any given X. That normal distribution has mean equal to $\beta_0 + \beta_1X_i$ and variance $\sigma_{\epsilon}^2$. Thus, the points on the true line are the means of normal populations, and there is one of these normal populations for each possible level of X, not just those levels sampled. Regardless of the mean and the value of X, all the distributions are independent and identical normal distributions with the same variance $\sigma_{\epsilon}^2$.
If the population were known, we could define a region about the line containing the central 95% of the population values. For any level of X, say $X_h$, that region is centered at $\mu_{Y|X_h} = \beta_0 + \beta_1X_h$ and ranges from $\mu_{Y|X_h} - z_{0.025} \ \sigma_{\epsilon}$ to $\mu_{Y|X_h} + z_{0.025} \ \sigma_{\epsilon}$. The point of this is to emphasize that even when the population values are known, there is variation of individual observations about the line.
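As a quick numerical illustration (a minimal sketch using the same assumed population values as the figure below: $\beta_0 = 5$, $\beta_1 = 0.9$, $\sigma_{\epsilon} = 1$), the central 95% region at a chosen $X_h$ is obtained directly from the normal distribution:
```{r}
beta0.pop <- 5     # assumed population intercept
beta1.pop <- 0.9   # assumed population slope
sigma.eps <- 1     # assumed population standard deviation of epsilon
Xh <- 6                                      # a chosen level of X
mu.h <- beta0.pop + beta1.pop * Xh           # population mean of Y at Xh
mu.h + c(-1, 1) * qnorm(0.975) * sigma.eps   # central 95% of individual Y values at Xh
```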
$\beta_0$ and $\beta_1$ are called *regression coefficients*, and each has its own special meaning. $\beta_0$ is the **intercept**, the value of $\mu_{Y|X}$ when $X$ is equal to 0. $\beta_1$ is the **slope**, and it represents the change in the mean of Y for each unit change in X. For example, in Figure \@ref(fig:slrPop) the slope is 0.9, which means that when X increases by 1, $\mu_{Y|X}$ increases by 0.9. It is extremely important to consider the units of Y, X, $\beta_0$ and $\beta_1$, both for prediction and for understanding.
<br>
```{r slrPop, echo = F, fig.cap = "Scatterplot of population values and true regression line. Each dot represents a pair of X and Y values from the population. Imagine that the whole population of dots is plotted. The blue line is the true population line, which represents the mean of the response variable Y for each value of X. Each point deviates vertically from the mean for that level of X (the point on the line) by an amount that has a normal distribution and is called epsilon. The epsilon for one point is represented in red. The slope of the line, called beta, is the change in the mean of Y per unit change in X. Usually, only a sample of points is available, so the blue line and the parameters are not known."}
library(latex2exp)
library(shape)
set.seed(400)
x <- 1:5000/500
Y <- 5 + 0.9 * x +rnorm(5000, sd = 1)
plot(x, Y, pch = ".", # xaxt = "n", yaxt = "n",
xlab = "X", ylab = "Y",
xlim = c(0,10), ylim = c(0,17),
bty = "l", cex.lab = 1.5, cex = 1,
main = "Population line and X-Y values")
lines(x = c(6, 6), y = c(8, 10.4), col = "red", lty = 1, lwd = 2)
points(6, 8, pch = 20)
lines(0:10, 5 + 0.9 * 0:10, lwd = 2, col = "blue")
text(7.5, 9, labels = TeX("$\\epsilon_i = \\mu_{Y|X} - Y_i$"),cex = 1.5, col = "red")
arrows(x0 = 0.5, y0 = 5.45, x1 = 4, y1 = 5.45, code = 0, lwd = 2, lty = 2)
arrows(x0 = 4, x1 = 4, y0 = 5.45, y1 = 8.6, code = 0, lwd = 2, lty = 2)
text(2.5, 4.5, labels = expression(paste(Delta,"X")), cex = 1.5)
text(4.3, 7, labels = expression(paste(Delta,"Y")), cex = 1.5)
text(4.2, 4,expression(beta[1] == over(Delta*Y,Delta*X)), cex = 1.5)
text(2, 14, labels =TeX("$\\mu_{Y|X} = \\beta_0 + \\beta_1 X$"), cex = 1.7, col = "blue")
text(2, 12.5, TeX("$\\mu_{Y|X} = 5 + 0.9 X$"), cex = 1.6, col = "blue")
Arrows(2.2, 11.7, 3, 8.1, lwd = 1.5, col = "blue")
```
<br>
**Units**
Imagine that the leaf expansion rate of a grass species is studied as a function of temperature. If you were told that the slope of the relationship is 50, you would not know whether temperature has a strong or weak effect on leaf expansion, because the number given has no units. The number means something only when given with its proper units. Suppose that the units are $mm \cdot month^{-1} \cdot degree \ C^{-1}$. This means that for each degree C that the temperature increases, the expansion rate increases by 50 mm per month, which is quite a large effect. In this case the units are:
<br>
$$Y: \quad mm \cdot mo^{-1} \\[20pt]
X: \quad degrees \ C \\[20pt]
\beta_1: \quad mm \cdot mo^{-1} \ degrees \ C^{-1}\\[20pt]
\beta_0: \quad mm \cdot mo^{-1}$$
<br>
## Sample relationship and sampling distributions
The problem in simple linear regression is to estimate $\beta_0$, $\beta_1$ and $\sigma_{\epsilon}^2$ using a random sample of $\{X, Y\}$ pairs. With those estimates we can estimate the position of the line for any $X$ and our uncertainty about that position, as well as the uncertainty about any single population value for that value of $X$.
We use observed values of X and Y to estimate the parameters $\beta_0$ and $\beta_1$ with the method called **ordinary least squares**. This method chooses the values of $\hat{\beta_0}$ and $\hat{\beta_1}$ that minimize the sum of squares of the residuals, the vertical distances from the observations to the estimated line. You can think of this as trying many, many lines and picking the one that yields the smallest SSE. Figure \@ref(fig:leastsquaresline) illustrates a least squares line versus alternative lines we could use to describe the relationship between X and Y.
<br>
```{r leastsquaresline, echo = F, warning = F, fig.cap = "The solid line is the line of 'best fit' that minimizes the sum of the squared deviations between the line and the observations, which are represented by the points. The dashed lines represent candidate lines with different regression coefficients ($\\beta_0$ and $\\beta_1$) that do not minimize the SSE."}
set.seed(400)
x <- 0:10
Y <- x + rnorm(length(x))
plot(x, Y, pch = 20,
xaxt = "n", yaxt = "n",
xlab = "X", ylab = "Y",
xlim = c(0,10), ylim = c(0,10),
bty = "l")
lines(0:10, 0:10, lwd = 2)
slopes <- seq(0, 1, .20)
b0 <- sort(seq(0,5, 1), decreasing = T)
for(i in 1:6){
y <- b0[i] + x*slopes[i]
lines(0:10, y, lty = 3, lwd = 1.5)
}
```
<br>
You may recall that the sample average is the least squares estimate of the population mean. The same principle is applied here, except that because the population mean changes with X, a different algorithm is necessary to find the least squares estimates. The solution consists of expressing the SSE as a function of estimated parameter values, calculating the partial derivatives of SSE with respect to each parameter, setting those to zero and solving for parameter values. The values obtained are estimated parameter values that change from sample to sample.
The ordinary least squares procedure results in the following equations to estimate the regression coefficients using a sample with *k* observations or X-Y pairs:
<br>
\begin{equation}
\hat{\beta_1} = \frac{\sum_{i=1}^k (X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^k (X_i-\bar{X})^2} (\#eq:beta1)
\end{equation}
<br>
The equation for $\beta_1$ includes $\sum_{i=1}^k (X_i-\bar{X})(Y_i-\bar{Y})$, called the **sum of cross-products**, which is involved in the calculation of the covariance between random variables.
<br>
\begin{equation}
\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \ \bar{X}
(\#eq:beta0)
\end{equation}
$$\hat{Y_i} = \hat{\beta_0} + \hat{\beta_1} \ X_i \qquad \quad \text{estimated Y}$$
<br>
The equation to calculate $\hat{\beta_0}$ determines that the estimated line always passes through the point $\{\bar{X},\bar{Y}\}$. Because they are calculated using the sample values, $\hat{\beta_0}$, $\hat{\beta_1}$ and $\hat{Y}$ are random variables that vary from sample to sample. These estimates have the following properties:
**Unbiasedness**
The mean or expectation of the estimated parameters over many, many independent samples equals the true value of the parameter.
<br>
$$E\{\hat{\beta_0}\} = \beta_0 \\[20pt]
E\{\hat{\beta_1}\} = \beta_1 \\[20pt]
E\{\hat{Y_i}\} = \mu_{Y|X_i}$$
<br>
**Normality**
Each of the estimates is a random variable with a normal distribution.
<br>
$$\hat{\beta_0} \sim N \left (\beta_0, \ \sigma^2_{\hat{\beta_0}} \right ) \\[20pt]
\hat{\beta_1} \sim N \left (\beta_1, \ \sigma^2_{\hat{\beta_1}} \right ) \\[20pt]
\hat{Y}_{X_h} \sim N \left (\mu_{Y|X_h}, \ \sigma^2_{\hat{Y}} \right ) $$
<br>
**Variance**
<br>
$$V\{\hat{\beta_0}\} = \sigma^2_{\hat{\beta_0}} = \sigma^2_{\epsilon} \ \left [ \frac{1}{k} + \frac{\bar{X}^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ] \\[30pt]
V\{\hat{\beta_1}\} = \sigma^2_{\hat{\beta_1}} = \frac{\sigma^2_{\epsilon}}{\sum_1^k \ (X_i - \bar{X})^2} $$
<br>
The variance of $\hat{Y}$ depends on whether the estimated Y is for the expected value of Y given a value of X or for a single value of Y given X. The first case is an **estimation**, because we are guessing the value of a fixed quantity or parameter: $\mu_{Y|X}$. The second case is a **prediction**, because we are guessing the value of a random variable $Y_i|X$.
**Estimation** of $\mu_{Y|X_h}$. The uncertainty comes only from the uncertainty about $\beta_0$ and $\beta_1$, which is represented by the sampling variances of $\hat{\beta_0}$ and $\hat{\beta_1}$.
<br>
$$V\{\hat{\mu}_{Y|X_h}\} = \sigma^2_{\epsilon} \ \left [ \frac{1}{k} + \frac{(X_h - \bar{X})^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ]$$
<br>
**Prediction** of $Y_h|X_h$, a single new observation at $X_h$. In this case there is uncertainty about the unknown $\beta_0$ and $\beta_1$, plus the uncertainty due to the variance of $\epsilon$. Even if $\beta_0$ and $\beta_1$ were known, guessing $Y_h|X_h$ would amount to predicting the value of a random normal number drawn from a population with mean $\mu_{Y|X_h}$ and variance $\sigma^2_{\epsilon}$. Thus, the equation for the variance of $\hat{Y}_h|X_h$ includes an extra $\sigma^2_{\epsilon}$.
<br>
$$V\{\hat{Y}_h|X_h\} = \sigma^2_{\epsilon} \ \left [ 1 + \frac{1}{k} + \frac{(X_h - \bar{X})^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ]$$
<br>
Usually, the true population variance $\sigma^2_{\epsilon}$ is not known, so we estimate it with the mean squared error, calculated as follows. Once the estimated line is obtained, there is a vertical distance between each observation and the line; these distances are called *residuals* or *errors*. The residuals are estimates of the true but unobservable $\epsilon$ for each observation. The sum of squares of the residuals, called SSE, divided by its degrees of freedom is the best estimate of the variance of $\epsilon$.
<br>
$$Y_i = \hat{\beta_0} + \hat{\beta_1} \ X_i + e_i = \hat{Y}_i + e_i \\[25pt]
SSE = \sum^k_{i=1} \ e_i^2 = \sum^k_{i=1} \ (Y_i - \hat{Y}_i)^2 \\[25pt]
dfe = k - 2 \\[25pt]
\hat{\sigma}^2_{\epsilon} = MSE = \frac{SSE}{dfe}$$
<br>
Therefore, the variances of the estimated coefficients and responses are estimated as:
$$\hat{V}\{\hat{\beta_0}\} = \hat{\sigma}^2_{\hat{\beta_0}} = MSE \ \left [ \frac{1}{k} + \frac{\bar{X}^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ] \\[35pt]
\hat{V}\{\hat{\beta_1}\} = \hat{\sigma}^2_{\hat{\beta_1}} = \frac{MSE}{\sum_1^k \ (X_i - \bar{X})^2} \\[35pt]
\hat{V}\{\hat{\mu}_{Y|X_h}\} = MSE \ \left [ \frac{1}{k} + \frac{(X_h - \bar{X})^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ] \\[35pt]
\hat{V}\{\hat{Y}_h|X_h\} = MSE \ \left [ 1 + \frac{1}{k} + \frac{(X_h - \bar{X})^2}{\sum_1^k \ (X_i - \bar{X})^2}\right ]$$
<br>
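The sketch below (simulated data with arbitrary names and values, not from the example that follows) computes the least squares estimates, the MSE, and each of the estimated variances above, and checks the coefficient standard errors against the output of `lm()`.
```{r}
set.seed(42)
k  <- 20
xs <- runif(k, 0, 10)                    # hypothetical X values
ys <- 5 + 0.9 * xs + rnorm(k, sd = 1)    # hypothetical responses

SSX <- sum((xs - mean(xs))^2)
b1  <- sum((xs - mean(xs)) * (ys - mean(ys))) / SSX   # estimated slope
b0  <- mean(ys) - b1 * mean(xs)                       # estimated intercept
MSE <- sum((ys - (b0 + b1 * xs))^2) / (k - 2)         # estimate of the error variance

Xh <- 7                                                  # a level of X of interest
var.b0   <- MSE * (1 / k + mean(xs)^2 / SSX)             # estimated variance of the intercept
var.b1   <- MSE / SSX                                    # estimated variance of the slope
var.mean <- MSE * (1 / k + (Xh - mean(xs))^2 / SSX)      # variance for the estimated mean at Xh
var.ind  <- MSE * (1 + 1 / k + (Xh - mean(xs))^2 / SSX)  # variance for a predicted individual at Xh

sqrt(c(var.b0, var.b1))                     # standard errors computed "by hand"
coef(summary(lm(ys ~ xs)))[, "Std. Error"]  # the same standard errors reported by lm()
```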
## Fitting the least squares line
An example of a linear relationship between an independent continuous variable (X) and a dependent continuous variable (Y) would be the relationship between annual rainfall and rice crop yield (kg/ha) in India. Figure \@ref(fig:rice) shows observed rice crop yields in India from 1990--2000 against annual rainfall levels.
```{r rice, echo = F, warning = F, fig.cap="Rice yields and rainfall in India from 1990 to 2000."}
rice <- read.csv("Datasets/rainYield.csv")
plot(rice$rain.mm, rice$yield.kg.ha, xlab = "Annual rainfall (mm)", ylab = "Rice crop yield (kg/ha)", cex.lab = 1.5, cex = 1.25)
abline(lm(yield.kg.ha ~ rain.mm, rice))
```
In Table \@ref(tab:ricetab), we show the values and calculations needed to find the regression coefficients of the least squares regression line.
```{r ricetab, echo = F}
library(kableExtra)
X <- rice$rain.mm
Y <- rice$yield.kg.ha
meanX <- round(mean(X),1)
meanY <- round(mean(Y),1)
deltaX <- round(X-meanX,1)
deltaY <- round(Y-meanY,1)
cross <- round(deltaX*deltaY,1)
SSX <- round(deltaX^2,1)
sums <- c("","","", "Sums", round(sum(cross),1), round(sum(SSX),1))
df <- data.frame(X = X, Y = Y, deltaX = deltaX, deltaY = deltaY, XY = cross, SSX = SSX)
df <- rbind(df, sums)
knitr::kable(
df,
col.names = c("$X_i$", "$Y_i$", "$(X_i - \\overline{X})$", "$(Y_i - \\overline{Y})$", "$\\Delta X$$\\Delta Y$", "$\\Delta X^2$"),
caption = "Table shows the rice crop yields in India (Y) from 1990--2000, along with annual rainfall (X), the difference between each X$_i$ and \\overline{X}, the difference between each Y$_i$ and \\overline{Y}, the product of the X and Y deviations from their respective means ($\\Delta X$$\\Delta Y$), and the squared deviations of X$_i$ from \\overline{X}",
align = rep("r", 6)
) %>% row_spec(12, bold = T)
```
We can now use equation \@ref(eq:beta1) to find the slope of the regression line that minimizes the error in the observations around the line:
\begin{equation}
\begin{aligned}
\sum_{i=1}^k (X_i-\overline{X})(Y_i-\overline{Y}) & = -60241.5 \\
\sum_{i=1}^k (X_i-\overline{X})^2 & = 93019.4 \\
\hat{\beta}_1 & = \frac{\sum (X_i-\overline{X})(Y_i-\overline{Y})}{\sum (X_i-\overline{X})^2} \\
\hat{\beta}_1 & = -60241.5/93019.4 \\
\hat{\beta}_1 & = -0.65
\end{aligned}
\end{equation}
And we can use equation \@ref(eq:beta0) to find the intercept:
\begin{equation}
\begin{aligned}
\hat{\beta}_0 & = \overline{Y} - \hat{\beta}_1\overline{X} \\
\overline{Y} & = 2064 \\
\hat{\beta}_1 & = -0.65 \\
\overline{X} & = 1130.7 \\
\hat{\beta}_0 & = 2064 - (-0.65) \times 1130.7 \\
\hat{\beta}_0 & = 2799
\end{aligned}
\end{equation}
This gives us the fitted regression equation:
\begin{equation}
\hat{Y}_i = 2799 - 0.65 \times X_i
(\#eq:fittedrice)
\end{equation}
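These hand calculations can be verified in R using the `rice` data frame loaded above; small differences from the rounded hand values are expected.
```{r}
X <- rice$rain.mm
Y <- rice$yield.kg.ha
(b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2))  # estimated slope
(b0 <- mean(Y) - b1 * mean(X))                                     # estimated intercept
```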
And we can plot this line on the original data scatterplot:
```{r riceline, echo = F, warning = F, fig.cap="Observed rice crop yields in India from 1990–2000 against annual rainfall levels and the least squares regression line ($\\beta_0 = 2799$kg/ha, $\\beta_1 = -0.65$)"}
rice <- read.csv("./Datasets/rainYield.csv")
plot(rice$rain.mm, rice$yield.kg.ha, xlab = "Annual rainfall (mm)", ylab = "Rice crop yield (kg/ha)", cex.lab = 1.5, cex = 1.25)
yhat <- 2799 - 0.65*rice$rain.mm
lines(rice$rain.mm, yhat, lwd = 2)
```
We have shown you above how to fit a simple linear regression line to describe the relationship between a predictor variable X and a dependent variable Y. We can state the direction of the relationship (negative) and report an effect size based on the estimated slope: for every additional mm of annual rainfall, predicted rice crop yield decreases by 0.65 kg/ha. We can even state what we would expect the crop yield to be in a year with no rain at all, which is simply the intercept (2,799 kg/ha), although a prediction that far outside the range of observed rainfall is an extrapolation and should be treated with caution.
But how do we know whether the relationship that our regression line describes is significant? If you observe the scatter of the observations around the line in Figure \@ref(fig:riceline), you can see that the line does not perfectly predict crop yield given the annual rainfall. Those deviations are the residuals, $e_i = Y_i - \hat{Y}_i$, which estimate the errors $\epsilon_i$ in the population model. The residual error is the random variation in crop yield (Y) that is not explained by our regression line; in other words, it is the variation in Y that is not explained by X. Is the residual error so large that our model of the relationship between X and Y is useless? In the next section, we will demonstrate how to test whether the relationship between X and Y is significant, given the variation we see in Y.
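Before moving on, note that the residuals of the fitted rice model can be extracted directly in R; the short sketch below shows that they are simply the differences between observed and fitted yields, and that their sum of squares is the residual sum of squares used in the next section.
```{r}
rice.fit <- lm(yield.kg.ha ~ rain.mm, data = rice)
e <- resid(rice.fit)       # residuals: observed minus fitted yields
head(cbind(observed = rice$yield.kg.ha, fitted = fitted(rice.fit), residual = e))
sum(e^2)                   # residual sum of squares
```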
## Analysis of variance of regression
In Chapter 9, we introduced you to a method called the analysis of variance. The general idea of analysis of variance is that we can partition the total variance of the dependent variable Y into random variation and variation explained by the independent variable X, in order to test whether there is a significant effect of X on Y. In simple linear regression, we partition the total sum of squares of Y (SSY) into the sum of squares explained by the regression line (SSR) and the sum of squares not explained by the regression line, also called the residual sum of squares (RSS): $SSY = SSR + RSS$. In Figure \@ref(fig:anovaslr), we show you each component of variation in Y. The total variation in Y comes from the deviation of each $Y_i$ from $\overline{Y}$. The regression line explains some of that variation through the deviations of each $\hat{Y}_i$ from $\overline{Y}$. The remaining, residual variation is in the deviations of each observation $Y_i$ from $\hat{Y}_i$.
```{r anovaslr, echo = F, fig.cap = "Components of variation in Y. The total sum of squares represents the total variation in Y (SSY $= \\sum (Y_i - \\overline{Y})^2$). The sum of squares of the regression represents the variation that is explained by the regression line, and by extension, X (SSR $= \\sum (\\hat{Y}_i - \\overline{Y})^2$). The residual sum of squares represents the random variation in the scatter of Y observations around the regression line (RSS $= \\sum (Y_i - \\hat{Y}_i)^2$)."}
########make a figure and label it slr#####
set.seed(400)
x <- 1:10
Y <- x +rnorm(10)
plot(x, Y, pch = 20, xaxt = "n", yaxt = "n", xlab = "X", ylab = "Y", xlim = c(0,10), ylim = c(0,10), bty = "l", cex.lab = 1.5)
lines(0:10, rep(mean(Y),11), lty = 5, lwd = 1.5)
text(1, mean(Y)+1, expression(bar(Y)), cex = 1.5)
arrows(x0 = 2, y0 = Y[2], y1 = mean(Y), code = 0, lwd = 1.15, lty = 3)
text(1, 3.75, expression(Y[i] - bar(Y)), cex = 1.5)
text(-1,5, "Y" , cex = 1.75)
text(5, -1.5,"X", cex = 1.75 )
lines(0:10, 0:10)
arrows(x0 = 6, y0 = 6, y1 = Y[6], code = 0,lty = 3, lwd = 1.15)
text(6.5, 3.75, labels = expression(Y - hat(Y) == epsilon[i]),cex = 1.5)
arrows(x0 = 8, y0 =mean(Y), y1 = 8, code = 0, lty=3, lwd = 1.15)
text(9, 7.25, labels = expression(hat(Y) - bar(Y)),cex = 1.5)
```
From the sums of squares (SS), we can obtain estimates of the **variances**, which are called the mean squares (MS). A MS is simply the SS divided by its corresponding degrees of freedom. In a simple linear regression, we are interested in comparing the variance explained by the regression (MSR) with the residual variance (MSr). Specifically, we wish to obtain an F-ratio, which we calculate as the ratio of the variance explained by the regression to the residual variance, $\frac{MSR}{MSr}$.
We can summarize this information in an analysis of variance table.
```{r aovtable, echo = F, warning = F}
source <- c("Total", "Regression", "Residual")
df <- c("k-1", 1, "k-2")
SS <- c("$\\sum (Y_i - \\overline{Y})^2$", "$\\sum (\\hat{Y}_i - \\overline{Y})^2$", "$\\sum (Y_i - \\hat{Y}_i)^2$")
MS <- c("","MSR", "MSr")
Finfo <- c("", "F","")
df <- cbind(source, df, SS, MS, Finfo)
knitr::kable(df,
col.names = c("source", "df", "SS", "MS", "F-ratio"),
caption = "Analysis of variance table.")
```
We can use the F-ratio to determine the significance of the regression.
Specifically, the null hypothesis that we are interested in testing is that the regression coefficient ($\beta_1$) = 0. If $\beta_1$ is equal to 0, then for each unit change in X, Y does not change, and there is no relationship between X and Y.
We will now demonstrate how to do an analysis of variance on the dataset from the previous section on rice crop yields in India.
First, we must calculate the different sources of variation in Y:
```{r SS, warning = F, echo = F}
library(kableExtra)
X <- rice$rain.mm
Y <- rice$yield.kg.ha
meanX <- mean(X)
meanY <- mean(Y)
TS <- round((Y-meanY)^2, 1)
yhat <- 2799 - 0.65*rice$rain.mm
SR <- round((yhat-meanY)^2, 1)
RS <- round((Y - yhat)^2, 1)
sumlabels <- c("","","","SSY", "SSR", "RSS")
sums <- c("","","", round(sum(TS),1), round(sum(SR),1),round(sum(RS),1))
df <- data.frame(X, Y, yhat, TS, SR, RS)
df <- rbind(df, sumlabels, sums)
knitr::kable(
df,
col.names = c("X$_i$", "Y$_i$", "$\\hat{Y}_i$", "$(Y_i - \\overline{Y})^2$", "$(\\hat{Y}_i - \\overline{Y})^2$", "$(Y_i - \\hat{Y}_i)^2$"),
caption = "Table shows the rice crop yields in India (Y) from 1990--2000, along with annual rainfall (X), predicted crop yields based on regression equation \\@ref(eq:fittedrice) ($\\hat{Y}_i$), the squared deviations in Y$_i$ from $\\overline{Y}$, the squared deviations between Y$_i$ and $\\hat{Y}_i$, and the squared deviations between $\\overline{Y}$ and $\\hat{Y}$. The final row shows the sums of the squared deviations for the corresponding columns, representing the total sum of squares (SSY), sum of squares due to regression (SSR), and residual sum of squares (RSS)",
align = rep("r", 6)
) %>% row_spec(12:13, bold = T)
```
Now we can construct an analysis of variance table:
```{r riceaov, echo = F, warning = F}
source <- c("Total", "Regression", "Residual")
df <- c(10, 1, 9)
SS <- c(190919.4, 39300.9, 151906)
MS <- c("",39300.9, 16878.4)
Finfo <- c("", 2.3,"")
df <- cbind(source, df, SS, MS, Finfo)
knitr::kable(df,
col.names = c("source", "df", "SS", "MS", "F-ratio"),
caption = "Analysis of variance table.")
```
We now have an F-ratio which we can compare to a critical F-ratio given a significance probability threshold, from an F distribution with the appropriate degrees of freedom. The degrees of freedom that are relevant are from the numerator ($df_1$) and denominator ($df_2$) of the F-ratio. In the example above, $df_1 = 1$ and $df_2 = 9$.
We can plot an F distribution with those degrees of freedom in R, using the following code:
Create a sequence of F values which will be plotted on the X axis of the F probability density distribution:
```{r}
x <- seq(0.25,6,0.1)
```
Generate probability densities for the F values in x:
```{r}
pd <- df(x, df1 = 1, df2 = 9, ncp = 0)
```
Plot the probability density of an F distribution with df1 = 1, df2 = 9:
```{r}
plot(x, pd,
type = "l",
xlab = "F value", ylab = "Probability density")
```
We will use the typical threshold of probability for significance, where $\alpha = 0.05$ and obtain the critical F value corresponding to that probability:
```{r}
criticalF <- qf(p = 1 - 0.05,
df1 = 1, df2 = 9)
```
Note that we specified 0.05 instead of $\alpha/2$ = 0.025 because the F test for a linear regression is one-tailed.
Add a red dashed vertical line where that critical F value is located and a mark where our test F-ratio falls. Note that the "col" argument changes the color of the line, and "lty = 2" changes the line type to a dashed line.
```{r}
plot(x, pd,
type = "l",
xlab = "F value", ylab = "Probability density")
abline(v = criticalF,
col = "red",
lty = 2)
points(2.3, df(2.3,1,9), pch = 4)
text(2.3, 0.15, "F-ratio")
```
As you can see, the F-ratio we calculated from our dataset on rice crop yields and annual rainfall was not significant: the test F-ratio was much smaller than the critical F value. We can calculate a p value for our test F-ratio using the following code.
Note that below we specify `lower.tail = F` so that R gives us the probability in the upper tail, to the right of where our F-ratio falls on the probability density function.
```{r}
pf(2.3,
df1 = 1,
df2 = 9,
lower.tail = F)
```
This p value indicates that there is a 0.16 probability of seeing the results that we saw, or more extreme results, if the null hypothesis were true. This is larger than $\alpha = 0.05$. We therefore fail to reject our null hypothesis that the slope of our regression line is equal to 0, and state that the relationship was not significant.
We can also use a built-in statistics function in R to do the simple linear regression all at once:
```{r}
X <- rice[,1] #annual rainfall
Y <- rice[,2] #crop yield
summary(lm(Y ~ X))
```
Note that the F-ratio, the p value, and the regression coefficients are the same as what we calculated above. The estimate of (Intercept) is the same as $\beta_0$, though it differs slightly from our hand calculations due to rounding error. The estimate of the "X" coefficient is the same as $\beta_1$.
The summary output from the lm() function in R also reports a t value and a p value, Pr(>|t|), for each of the coefficients. We are generally only interested in the p value for the predictor variable coefficients, and can ignore the t test result for the $\beta_0$ coefficient. The p value reported for the "X" coefficient is the same as the p value reported for the F test. This is because there is only one X predictor in a simple linear regression, so all of the explained variance in Y is from the single X variable, annual rainfall. A t test on the $\hat{\beta}_1$ coefficient for X in this case therefore yields the same result as the F test.
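This equivalence is easy to verify from the fitted model object (a quick check using the `rice` data frame loaded earlier): the square of the t value for the slope equals the F statistic.
```{r}
fit  <- lm(yield.kg.ha ~ rain.mm, data = rice)
tval <- coef(summary(fit))["rain.mm", "t value"]          # t value for the slope
c(t.squared = tval^2, summary(fit)$fstatistic["value"])   # t^2 equals the F statistic
```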
## Confidence Interval for $\hat{\beta_1}$
We can construct confidence intervals for an estimate of $\beta_1$ using the calculated MSr as an estimate of the population variance. The standard error of the regression coefficient is:
\begin{equation}
s_b = \sqrt{\frac{MSr}{\sum (X_i - \overline{X})^2}}
(\#eq:sb)
\end{equation}
Note the two components of the standard error of our estimate of $\beta_1$: the MSr and the sum of squares of X. If we want to increase our certainty in the slope of the regression, we can reduce the residual variance (for example, by controlling extraneous sources of variation) or increase $\sum (X_i - \overline{X})^2$ by taking more observations or by measuring Y over a wider range of X values.
In addition to $s_b$, we need a t value to construct a confidence interval around the estimate of $\beta_1$. We use the degrees of freedom from the residual variance (k-2) and calculate a t value under the null hypothesis that the true population $\beta_1$ = 0. We can use the following formula to obtain the t value:
\begin{equation}
t = \frac{\hat{\beta}_1 - \beta_1}{s_b},
df = k-2
(\#eq:tsb)
\end{equation}
The confidence interval can then be calculated as:
\begin{equation}
\hat{\beta}_1 \pm t_{\alpha, k-2} \times s_b
(\#eq:CIb)
\end{equation}
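As a sketch of these calculations for the rice example (values will differ slightly from the rounded hand computations), the 95% CI for the slope can be computed from equations \@ref(eq:sb) and \@ref(eq:CIb) and checked against `confint()`:
```{r}
fit <- lm(yield.kg.ha ~ rain.mm, data = rice)
MSr <- sum(resid(fit)^2) / df.residual(fit)          # residual mean square
SSX <- sum((rice$rain.mm - mean(rice$rain.mm))^2)    # sum of squares of X
sb  <- sqrt(MSr / SSX)                               # standard error of the slope
coef(fit)["rain.mm"] + c(-1, 1) * qt(0.975, df.residual(fit)) * sb  # 95% CI "by hand"
confint(fit)["rain.mm", ]                            # same interval from confint()
```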
## Confidence Interval for $\hat{\beta}_0$
We can also construct confidence intervals for an estimate of $\beta_0$. The standard error of the intercept is:
\begin{equation}
s_a = \sqrt{MSr\Bigg(\frac{1}{k} + \frac{\overline{X}^2}{\sum (X_i - \overline{X})^2}\Bigg)}
(\#eq:sa)
\end{equation}
where k is the number of observations.
Just as in the calculation of a t value for the $\beta_1$ confidence intervals, we use the degrees of freedom from the residual variance (k-2) to obtain a t value under the null hypothesis that the true population intercept = 0. We can use the following formula to obtain the t value:
\begin{equation}
t = \frac{\hat{\beta}_0 - \beta_0}{s_a},
df = k-2
(\#eq:tsa)
\end{equation}
The confidence interval can then be calculated as:
\begin{equation}
\hat{\beta}_0 \pm t_{\alpha, k-2} \times s_a
(\#eq:CIa)
\end{equation}
## Confidence Intervals for $\hat{Y}_i$
The predicted value for Y at a given X$_i$ can be considered in two different ways. For X$_i$, $\hat{Y}_i$ can represent the expected mean of Y, or it can represent the expected value of an individual Y. The predicted value will be the same for either case, but our certainty in the estimate will be different.
Predicting a single Y means incorporating the random variation of observation i. Recall that this variation is represented by $\epsilon_i$ in the regression model, and that visually it is represented by the scatter of points around the regression line on a scatterplot. We must account for the individual variance of the Y observations at X$_i$, so the interval for an individual $\hat{Y_i}$ will be wider.
Just as in the calculations for the standard error of $\hat{\beta}_1$ and $\hat{\beta}_0$, we will use the degrees of freedom of the MSr (k-2) in our calculations.
The standard error of $\hat{Y_i}$ for an expected mean of Y is:
\begin{equation}
s_{\hat{\overline{Y}}} = \sqrt{MSr\Bigg(\frac{1}{k} + \frac{(x_i - \overline{x})^2}{\sum (X_i - \overline{X})^2}\Bigg)}
(\#eq:symean)
\end{equation}
while the standard error of $\hat{Y_i}$ for an individual Y is
\begin{equation}
s_{\hat{Y}} = \sqrt{MSr\Bigg(1 + \frac{1}{k} + \frac{(x_i - \overline{x})^2}{\sum (X_i - \overline{X})^2}\Bigg)}
(\#eq:sy)
\end{equation}
Note the addition of 1 in the calculation of $s_{\hat{Y}}$ to account for the additional residual variance when predicting an individual Y$_i$.
The confidence interval for $\hat{\overline{Y}}$ is
\begin{equation}
\hat{\overline{Y}} \pm t_{\alpha, k-2} \times s_{\hat{\overline{Y}}}
(\#eq:CIsymean)
\end{equation}
and the confidence interval for $\hat{Y}$ is
\begin{equation}
\hat{Y} \pm t_{\alpha, k-2} \times s_{\hat{Y}}
(\#eq:CIsy)
\end{equation}
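In R, both intervals can be obtained with `predict()`. The sketch below uses an arbitrary rainfall value of 1100 mm, within the observed range, to contrast the confidence interval for the mean yield with the wider prediction interval for an individual year.
```{r}
fit    <- lm(yield.kg.ha ~ rain.mm, data = rice)
newdat <- data.frame(rain.mm = 1100)                     # a rainfall value within the observed range
predict(fit, newdata = newdat, interval = "confidence")  # interval for the mean yield at X = 1100
predict(fit, newdata = newdat, interval = "prediction")  # wider interval for an individual year
```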
## Coefficient of Determination, $R^2$
The coefficient of determination ($R^2$) is the proportion of variance that is explained by the regression line. If a large proportion of the variation in Y is not due to the regression line, then even if you obtain significant results, X is not a very good predictor of Y. The formula for $R^2$ is
\begin{equation}
R^2 = \frac{SSR}{SSY}
(\#eq:r2)
\end{equation}
For a simple linear regression with only one X predictor, $R^2$ can also simply be referred to as $r^2$.
In some cases, we may choose to adjust the $R^2$ value ($R^2_{adj}$) for the degrees of freedom used by the regression, which is only 1 in a simple linear regression.
\begin{equation}
R^2_{adj} = 1 - \frac{MSr}{MSY}
(\#eq:radj)
\end{equation}
where $MSY = SSY/(k-1)$ is the total mean square of Y.
Using the example of rice crop yields in India, we can calculate an $R^2$ of
\begin{equation}
\begin{aligned}
R^2 & = \frac{39300.9}{190919.4} \\
& = 0.21
\end{aligned}
(\#eq:r2rice)
\end{equation}
We can multiply $R^2$ by 100 to state that 21% of the variation in annual rice crop yields was explained by annual rainfall.
We can also calculate $R^2_{adj}$ as
\begin{equation}
\begin{aligned}
R^2_{adj} & = 1 - \frac{MSr}{MSY} \\
R^2_{adj} & = 1 - \frac{16878.4}{190919.4/10} \\
& = 0.12
\end{aligned}
(\#eq:radjrice)
\end{equation}
There is a large reduction from $R^2$ to $R^2_{adj}$ because of the small number of observations ($k = 11$). The $R^2_{adj}$ from datasets with large sample sizes will match $R^2$ more closely.
Let's look again at the summary output from the lm() function in R:
```{r}
X <- rice[,1] #annual rainfall
Y <- rice[,2] #crop yield
summary(lm(Y ~ X))
```
The lm() function reports that the "Multiple R-squared" = 0.2043, while the "Adjusted R-squared" is 0.12, which are equivalent to our calculations above. The small difference between our calculated $R^2$ and the "Multiple R-squared" from R is again due to rounding error.
Our results were not significant, but if they were, then $R^2 = 20.4\%$ and $R^2_{adj} = 11.6\%$ would indicate that a fairly large proportion of the variance in Y is not explained by X. This is one way of ascertaining lack of fit in the regression model.
## Correlation Coefficient, r
The correlation coefficient, r, is a measure of the linear association between two variables. It does not distinguish between a fixed X variable and a Y dependent variable the way regression does, and it does not imply a causal relationship.
We calculate r using the following formula:
\begin{equation}
r = \frac{\sum(X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum (X_i - \overline{X})^2 \times \sum (Y_i - \overline{Y})^2}}
(\#eq:r)
\end{equation}
The correlation coefficient ranges from -1 to 1. A perfect positive linear relationship between two variables results in $r = 1$, while a perfect negative linear relationship results in $r = -1$. No linear relationship is indicated by $r = 0$.
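For the rice data, r can be computed from equation \@ref(eq:r) or with the built-in `cor()` function; the short check below also confirms that, in SLR, $r^2$ equals the $R^2$ reported by `lm()`.
```{r}
r <- cor(rice$rain.mm, rice$yield.kg.ha)   # correlation between rainfall and yield
c(r = r, r.squared = r^2)                  # r^2 equals the R-squared from the regression
cor.test(rice$rain.mm, rice$yield.kg.ha)   # test of H0: the population correlation is 0
```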
## Fitness of the regression model
Another aspect of using linear regression to test the significance of a relationship between X and Y is whether or not the model is appropriate for the data.
If there is more than one observation for at least some levels of X, we can perform a lack-of-fit test to determine whether the deviations from the straight line are larger than expected from the residual variation alone, and thus whether the straight-line model should be rejected.
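One common way to carry out a lack-of-fit test in R, when Y is measured more than once at the same X levels, is to compare the straight-line model with a model that fits a separate mean at each X level. The sketch below uses simulated data with replicated X values (the rice example has no replicates, so the data and names here are purely illustrative).
```{r}
set.seed(1)
x.rep <- rep(c(1, 2, 3, 4, 5), each = 4)                # replicated X levels
y.rep <- 2 + 3 * x.rep + rnorm(length(x.rep), sd = 1)   # responses generated from a straight line
line.fit <- lm(y.rep ~ x.rep)           # straight-line model
full.fit <- lm(y.rep ~ factor(x.rep))   # separate mean for each X level
anova(line.fit, full.fit)               # lack-of-fit test; a small p-value rejects the line
```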
## Exercises and Solutions
## Homework
## Laboratory Exercises {#LabCh13PLS}
Prepare an .Rmd document starting with the following text, where you substitute the corresponding information for author name and date.
```
---
title: "Lab13 Simple Linear Regression"
author: "YourFirstName YourLastName"
date: "enter date here"
output: html_document
---
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Plant Sciences
In simple linear regression (SLR) we study the relationship between a *predictor, explanatory, or independent* variable (X, horizontal axis) and a *response or dependent* variable (Y, vertical axis). Different goals may be achieved with SLR. We may be interested in determining how much Y changes as X changes, or we may be interested in estimating the value of Y for values of X that were not observed. All goals require that we estimate parameters of a linear function for a straight line that best fits the data, which is represented by the linear model:
$$Y_{i} = \beta_{0} + \beta_{1}X_{i} + \epsilon_{i} \\ \epsilon_{i} \sim N(0, \ \sigma^2_{\epsilon})$$
Best fit is achieved by minimizing the sum of squares of residuals, the vertical distances between the observations and the fitted line. The estimated parameters are the *slope* ($\beta_{1}$), the tangent of the angle between the line and the horizontal, and the *intercept* ($\beta_{0}$), the height of the line when X = 0. Numerically, the *slope* is the rate of change of the Y (dependent) variable with the X (predictor) variable, or in other words, how the values of Y change with each unit change in the X variable. The *intercept* is the value the Y variable takes when the X variable is equal to 0.
In this exercise we use the clover.txt data set used in lab02. These data represent the mass of clover plants grown for different periods at three different temperatures. Temperatures are in the first column coded as 1 for 5-15 C, 2 for 10-20 C and 3 for 15-25 C. The second column contains the number of days of growth and the last column contains the log of the plant mass in grams (g). Column names are in the first row of the file, so we specify header = TRUE in the line of code to read the data. Data are placed in a data frame named "clover".
The relationship between plant size (Y, log transformed) and plant age (X, in days) is modeled as a straight line whose slope is the relative growth rate (RGR) of the plants. Relative growth rate is the amount of growth per unit time and per unit size of the plant. Relative growth rate is to plant size as interest rate is to a debt.
FOR THIS LAB WE IGNORE THE TEMPERATURE GROUPS.
RGR means relative growth rate and is estimated by the slope of the regression.
#### Part 1. Plot of data and estimation of parameters. [25 points]
1A) Make a scatterplot of plant size vs. age, ignoring the temperature groups.
1B) Obtain estimates for the slope and intercept and interpret the results in terms of plant growth.
```{r}
clover <- read.csv("./Datasets/Lab01clover.txt", header = TRUE)
str(clover)
plot(lnwt ~ days, data = clover)
slr1 <- lm(formula = lnwt ~ days, data = clover)
summary(slr1)
```
**ANSWER THE FOLLOWING QUESTIONS:**
1B) What are the estimates for the slope and intercept? Interpret the results in terms of plant growth.
#### Part 2. Test of null hypothesis and R-square. [30 points]
Test the null hypothesis that RGR = 0 by performing an ANOVA for regression. Do the calculations "by hand" and then repeat using R functions.
```{r}
coef(slr1) # Use help(coef) to understand what this function does
intcpt <- coef(slr1)[1] # extract estimated intercept, or "a" as it is called in the SG textbook
slope <- coef(slr1)[2] # Code needed here. Extract the estimated slope or b, just like we did with the intercept.
lnwt.hat <- intcpt + slope * clover$days # creates a vector of estimated or predicted lnwt’s for each of the measured days using the regression model coefficients
errors <- clover$lnwt - lnwt.hat # creates a vector of the differences between the sample lnwt measurements and the predicted weights from the model, i.e., these are the “residuals”
(TotalSS <- sum((clover$lnwt - mean(clover$lnwt)) ^ 2)) # total sums of squares
(SSErrors <- sum(errors ^ 2)) # Code needed here. Type in the vector of the residuals to calculate residual sum of squares
(SSReg <- TotalSS - SSErrors) # regression sums of squares
n <- length(clover$lnwt) # Code needed here. Complete the code with the column that contains the observations
(dfe <- n - length(coef(slr1))) # df of the residuals
(dfTotal <- n - 1) # Complete the code to calculate the total df
(dfReg <- 1) # Complete the code to calculate the df of the regression
(MSReg <- SSReg / dfReg)
(MSE <- SSErrors / dfe) # Complete the code to calculate the MSE
(Fcalc <- MSReg / MSE) # Calculated F statistic from the data
(Fcritical <- qf(0.95, dfReg, dfe)) # Complete the function arguments with the degrees of freedom for the numerator and the denominator to get the F critical value.
anova(slr1) # check your previous “by hand” calculation against the anova table.
```
**ANSWER THE FOLLOWING QUESTIONS:**
2A) Do you reject the Ho: $\beta_{1}$ = 0? Why? What does this mean in terms of plant growth, in non-technical terms.
2B) What proportion of the variance of lnwt is explained by days, the predictor?
#### Part 3. Make a 95% confidence interval for the RGR or slope. [25 points]
First, do the calculations by hand, using the formulas from the textbook (pages 179-180). Then use R functions.
```{r}
# Hint: check your answers with the anova table. It should be the same if your calculations are correct
(tcritical.slr1 <- qt(0.975, dfe)) # Code needed here. Enter the probability for a 2-tailed test with alpha = 0.05
SSX <- sum((clover$days - mean(clover$days)) ^ 2) # get sum of square of X
(se.beta <- sqrt(SSErrors / (dfe * SSX))) # calculate the s.e. for beta
beta.lo <- slope - tcritical.slr1 * se.beta # lower CI boundary.
beta.hi <- slope + tcritical.slr1 * se.beta # Code needed here. Complete the formula for the the upper CI boundary
c(beta.lo, beta.hi) # Display the lower and upper CI extremes
confint(slr1)[2,] # easier way to get the CI using the function from R
```
**ANSWER THE FOLLOWING QUESTIONS:**
3A) What is the 95% confidence interval for the relative growth rate (RGR) of the plants?
#### Part 4. Make a 95% confidence interval for mean plant size at a given age. [20 points]
Assume that you are interested in using the fitted regression model (the line) to estimate the mean mass of plants when they are 33 days old. Use the regression equation to make the estimate and then calculate the standard error of the estimate to make a confidence interval for it. Do detailed calculations first and then use R functions.
You will first calculate the predicted mean of Y for 33 days and its standard error and make a confidence interval for it. Then you will calculate the standard error for a predicted individual Y and compare it.
```{r}
(lnwt.hat.33 <- intcpt + slope * 33) # In other words, Y = a + bX, where Y, or the predicted value, is the weight of plants at 33 days.
(mass.hat.33 <- exp(lnwt.hat.33)) # Since the response variable is log transformed, here we need to do a back transformation.
#Calculate the standard error of the predicted mean of Y:
se.lnwt.33.m <- sqrt(MSE * (1 / n + (33 - mean(clover$days)) ^ 2 / SSX)) # SG book pg. 192 and pg. 205 No. 6
# Complete the code to calculate the upper and lower CI for the mass of plants at 33 days:
(CI.lnwt.up <- lnwt.hat.33 + tcritical.slr1 * se.lnwt.33.m)
(CI.lnwt.lo <- lnwt.hat.33 - tcritical.slr1 * se.lnwt.33.m)
pred.data <- data.frame(lnwt = NA, group = NA, days = 33) # Here we create a data frame that we will use to “predict” or estimate lnwt in the next line. Only the value for days is included since that is the X, independent, or predictor variable.
(ci.r <- predict(slr1, newdata = pred.data, interval = "confidence")) # this line uses predict() to predict or estimate lnwt and its CI, as we did by hand above.
# The interval = “confidence” argument makes the function return a confidence interval for the estimated log of plant mass.
exp(ci.r) # Complete the code to do a back transformation on the predicted values.
# Calculate the standard error of the predicted individual Y:
se.lnwt.33.ind <- sqrt(MSE * (1 + 1 / n + (33 - mean(clover$days)) ^ 2 / SSX)) # SG book pg. 205 No. 7
```
**ANSWER THE FOLLOWING QUESTIONS:**
4a) Explain why we need to use exp().
4b) Report the upper and lower extremes of the CI for the predicted mean of Y
4c) Is the standard error estimate for a predicted mean of Y or a predicted individual Y based on the model bigger? Why?
### Animal Sciences
The same instructions for completion and submission of work used in previous labs apply here. Refer to previous labs for the details.
In simple linear regression (SLR) we study the relationship between a *predictor, explanatory or independent* variable (X, horizontal axis) and a *response or dependent* variable (Y, vertical axis). Different goals may be achieved with SLR. We may be interested in determining how much Y changes as X changes, or we may be interested in estimating the value of Y for values of X that were not observed. All goals require that we estimate parameters of a linear function for a straight line that best fits the data:
$$Y_{i} = \beta_{0} + \beta_{1}X_{i} + \epsilon_{i} \\ \epsilon_{i} \sim N(0, \ \sigma^2_{\epsilon})$$
Best fit is achieved by minimizing the sum of squares of residuals, the vertical distances between the observations and the fitted line. The estimated parameters are the *slope* ($\beta_{1}$), the tangent of the angle between the line and the horizontal, and the *intercept* ($\beta_{0}$), the height of the line when X = 0. Numerically, the *slope* is the rate of change of the Y (dependent) variable with the X (predictor) variable, or in other words, how the values of Y change with each unit change in the X variable. The *intercept* is the value the Y variable takes when the X variable is equal to 0.
In this exercise we will examine the relationship between body weight and heart weight in cats. These data were originally published in a short paper by R.A. Fisher in 1947 (see paper in CANVAS). R.A. Fisher was a British statistician who developed the foundations of modern statistical theory.
Data for the exercise consist of body weight (Bwt) in kilograms and heart weight (Hwt) in grams of 144 cats. We will examine the relationship between heart weight and body weight in cats.
#### Part 1. Plot of data and estimation of parameters. [25 points]
1A) Make a scatterplot of body weight vs. heart weight.
1B) Obtain estimates for the slope and intercept and interpret the results in terms of heart weight.
```{r}
cats <- read.csv("./Datasets/Lab13Cats.csv")
str(cats)
plot(Bwt_Kg ~ Hwt_g, data = cats)
slr1 <- lm(formula = Bwt_Kg ~ Hwt_g, data = cats)
summary(slr1)
```