---
output:
html_document:
fig_caption: yes
number_sections: yes
theme: readable
toc: yes
toc_depth: 3
---
```{r message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE, include=FALSE}
# load packages for chapter
options(digits = 10)
library(bookdown)
library(kableExtra)
library(knitr)
library(tables)
library(pander)
library(car)
library(tidyr)
library(ggplot2)
library(latex2exp)
library(arrangements)
library(plotrix)
options(scipen = 999)
```
# Random Variables, Sampling and Distributions {#chSampling}
## Learning Objectives for Chapter
1. Define random variables and their distributions.
1. Define and identify population, sample and observation.
1. Explain why we need to sample populations.
1. Describe and use a method to obtain random samples with equal probabilities.
1. Define "parameter" and "random variable" and discuss their relationship to one another
1. Describe and implement stratified random sampling.
1. Explain the difference between sampling with and without replacement.
1. Write and use equations for estimators of population mean and variance.
1. Distinguish between estimators and definitions of parameters.
1. List the most important distributions in elementary statistics.
1. Define and calculate bias of an estimator of a parameter.
1. Explain when a Student's t distribution should be used instead of a normal distribution for samples.
1. Define statistical bias and give a few examples of situations that may lead to a biased estimator
1. Sketch an example sampling distribution (r = 10) for observations drawn from a normally distributed population, and sketch the corresponding normal pdf, using your sampling distribution to estimate its parameters.
- List the differences and similarities between the two sketches.
- Explain why the normal pdf representing the population from which the samples were drawn might not look like the normal pdf you drew.
## Random variables
Random variables are those variables whose values are determined by random processes or experiments. Any time we conduct a measurement or observe the result of an experiment we obtain specific values for the random variables. These specific values are termed realizations. For example, consider the result of rolling a pair of dice. Define the random variable as the total number of dots in both dice. The random variable can take integer values between 2 and 12. Say that in one roll you got a 3 and a 5. The realization of the random variable is 8.
In reality, a random variable is a real-valued function defined over the sample space of a random experiment or process. We are free to define the function in any way we want, even with a table that shows the value for each outcome, but because it has to be a function, each result of the random process cannot be associated with more than a single value of each random variable.
```{block rvDef, type = 'mydef'}
- A random variable is a function that associates a single real number with each outcome in the sample space of a random experiment.
```
Do not confuse this function property with the possibility of having random variables grouped into vectors, where each vector contains more than one value, each value being the realization of a different random variable. For example, in the roll of two dice we can define the random vector {number of dots in the first die, number of dots in the second die}. The importance of thinking of a random variable as a function is that it gives a lot of flexibility to deal with the results of random experiments. Instead of just counting dots on the face of a die we can use complicated functions that allow us to test hypotheses about more complex processes, such as administering medicines to animals, or comparing the yield of multiple varieties of food plants.
A more formal definition and an intuitive physical model of random variable is given in [@286171]. Make sure to also explore the explanations given by W. Huber in [@54894].
**Examples of Random Variables**
- Number of dots in three dice rolled once.
- Number of plants within 50 cm of a point placed randomly in a wheat field.
- Time that it takes to find a bird of a certain species in a forest plot.
- Your diastolic blood pressure measured every day at noon.
- Weight of a randomly selected mature dairy cow in California.
- Number of atoms that decay in 1 second in 1 mg of Uranium-238.
- Pain level reported by a patient who consults the doctor due to a headache.
- Weight of a ball randomly selected from a set of two balls.
- Mass of carbon in a soil sample divided by the total sample dry mass.
- Natural log of nonstructural carbohydrates concentration in CA almond tree branches on 1 Mar every year.
### Types of Variables and Notation
The choice of statistical method to answer a specific question depends strongly on the types of random variables involved in the question. The type of variable involved matters primarily because it determines the type of statistical distribution they can have.
There are several ways to classify variables, but the following is perhaps the most useful and general. Statistical variables can be:
**Categorical** For example, plant species or cultivar. These are non-numeric variables whose values are not ordered. The order of values has no meaning.
**Ordinal** For example, plant growth stages or disease condition. Ordinal variables are non-numeric but their values are ordered and the order has meaning. We know that an animal in stable disease condition is worse than healthy and better than gravely ill, but it is not possible to say whether "healthy" is 10 or 1.5 times better than gravely ill.
**Quantitative** These are random variables that take numerical values and that can be used in arithmetic operations such as addition and multiplication. Quantitative variables can be discrete or continuous.
**Discrete** Discrete quantitative variables can take integer values, for example, the number of branches in a plant, or the number of eggs that a hen produces. The values come from the set of Integer Numbers. Discrete random variables can have finite or infinite sampling spaces. For example, the number of heads in 37 coin tosses is discrete and has a finite sampling space $S = \{0, 1, \dots , 37\}$, whereas the number of tosses until 3 heads are in a row is discrete but has a countable infinite sampling space $S = \{3, 4, \dots \infty \}$.
**Continuous** Continuous quantitative variables can take values between any other two values. For example, the mass of wheat produced per unit area, or the volume of milk produced by a cow per day. These variables take values that are Real Numbers. Note that most of the specific variables we analyze in agricultural and environmental sciences are positive real numbers.
### Using random variables
Once we define a useful random variable, we can apply the rules of probability to calculate the probability of any event in terms of the random variable, but first we have to know the statistical distribution of the random variable. Imagine that you are breeding animals and need to get at least three individuals with a specific genotype of class G that is expected to appear in 20% of the offspring. What are the chances (probability) that you will get the 3 G's if you get 5 offspring? We can *model* this situation assuming that the genotypes of all individuals are independent, and that the probability of each individual having G is 0.20. Let's use the letter "O" for "other" to designate individuals that are not G. You will have the 3 you need if you get 3, 4 or all 5 with G. Because getting exactly 3 excludes getting 4 or 5, the events are disjoint, so we sum their probabilities to get the probability of at least 3 G's. To calculate the probability of each of the events, we recall that individuals are independent, so the probability of a given sequence, like GGGOO, is the product of the individual probabilities, in this case $0.20^3 \cdot 0.80^2$. Finally, we take into account that the event GGGOO is not the same as GGOOG, but each has probability $0.20^3 \cdot 0.80^2$. The enumeration of the combinations and probabilities shows how a pattern emerges:
<br>
Table: (\#tab:deriveBinomialExmpl) Calculation of the probability of obtaining Y successes in 5 independent trials with constant probability of success.
| Y | sequence | no. sequences | Probability of each sequence |
|:-:|----------------------------------------------------------------------|:-------------:|---------------------------------------------|
| 0 | OOOOO | 1 = $C_0^5$ | $0.20^0 \cdot 0.80^5 = `r 0.80^5`$ |
| 1 | GOOOO, OGOOO, OOGOO, OOOGO, OOOOG | 5 = $C_1^5$ | $0.20^1 \cdot 0.80^4 = `r 0.20^1 * 0.80^4`$ |
| 2 | GGOOO, GOGOO, GOOGO, GOOOG, OGGOO, OGOGO, OGOOG, OOGGO, OOGOG, OOOGG | 10 = $C_2^5$ | $0.20^2 \cdot 0.80^3 = `r 0.20^2 * 0.80^3`$ |
| 3 | GGGOO, GGOGO, GGOOG, GOGGO, GOGOG, GOOGG, OGGGO, OGGOG, OGOGG, OOGGG | 10 = $C_3^5$ | $0.20^3 \cdot 0.80^2 = `r 0.20^3 * 0.80^2`$ |
| 4 | GGGGO, GGGOG, GGOGG, GOGGG, OGGGG | 5 = $C_4^5$ | $0.20^4 \cdot 0.80^1 = `r 0.20^4 * 0.80^1`$ |
| 5 | GGGGG | 1 = $C_5^5$ | $0.20^5 \cdot 0.80^0 = `r 0.20^5`$ |
<br>
There are many ways to get any number, say r, G's in a sequence. That number equals the number of different sets of r positions or slots that can be chosen from the 5 slots available, without replacement. This is a case where the order in which the slots are chosen does not matter (having G's in positions 1 and 3 is the same as having G's in positions 3 and 1) and there is no replacement (once a slot is occupied it cannot be selected again). Therefore, we can use combinations of 5 slots taken in sets of r = Y to determine the number of ways that a given number of successes can be obtained. Notice that whenever you get r slots with G's you also get 5-r slots for O's. That's why the number of sequences is symmetric from top to bottom of the table.
If we use the letters $p$ for the probability of G or "success," $q = 1-p$ for failure or O and $n$ for the total number of trials we can write a compact equation for the probability that we obtain Y = r successes. We use the letter $Y$ for the random variable "number of successes" to emphasize that it is a response variable, but we also use the letter $r$ to link this to Equation \@ref(eq:combinations) for combinations and to designate a specific realization of Y.
<br>
\begin{align}
P(Y = r ) = C^n_r \ p^r \ q^{n-r}
(\#eq:binomialPMF)
\end{align}
<br>
This equation is the function that associates a probability to each value of the random variable and it is the *probability mass function* for the **Binomial Distribution**. The term "binomial" comes from the fact that the equation gives all the terms in the expansion of the power of a sum of the form $(p + q)^n$. The binomial expansion states that $(p + q)^n = \sum^n_{r = 0} \ C^n_r \ p^r \ q^{n-r}$. This equation can be used to answer any question that involves a Binomial distribution. Specifically, the probability of getting at least 3 G's in the example is $P(Y \ge 3) = \sum^5_{r = 3} \ C^5_r \ 0.20^r \ 0.80^{5-r}$.
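As a quick numerical check, the same probability can be computed with the base R Binomial functions described later in this chapter (a minimal sketch):

```{r atLeast3G}
# Probability of at least 3 G's among 5 offspring with p = 0.20
sum(dbinom(x = 3:5, size = 5, prob = 0.20))
# Same probability from the upper tail of the CDF; note that q = 2 stays in the lower tail
pbinom(q = 2, size = 5, prob = 0.20, lower.tail = FALSE)
```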
**Functions of random variables**
Suppose that in the example above, animals with G require 2 tons of feed to reach maturity, whereas animals with O require 1.5 tons. How much feed do you need to order to have a probability of 0.90 or better that you will have enough to feed all animals to maturity? We can define a new random variable based on the first one:
<br>
$$X = 2 \ r + 1.5 \ (5-r) = 7.5 + 0.5 \ r$$
<br>
```{r rvFunc1}
an.dat <- data.frame(Y = 0:5) # list of possible values of rv Y
an.dat$Prob <- dbinom(an.dat$Y, size = 5, prob = 0.20) # corresponding probabilities
an.dat$X <- 7.5 + 0.5 * an.dat$Y # list of possible values of rv X
an.dat$cdf <- cumsum(an.dat$Prob) # cumulative probabilities
kable(an.dat, caption = "Probability mass function and cumulative distribution function (CDF) for a binomial random variable Y with p = 0.20 and n = 5. Random variable X is a linear function of Y") %>%
kable_styling(full_width = FALSE)
```
<br>
Using the relationship between X and Y we obtain the probability mass function for X, which is just the same as the original but translated (by adding 7.5) and shrunk (by multiplying by 0.5). By progressively adding the probability of each value of X, starting from the lowest one, we obtain the cumulative distribution function, which gives us the probability that X is equal to or less than a given value x: $F(x) = P(X \le x)$. We can see that in order to have a probability greater than or equal to 0.90 of having enough feed we need to order 8.5 tons of feed (Table \@ref(tab:rvFunc1)).
<!-- A global view of the process followed to solve the feed problem above is revealing, because it shows **TO BE COMPLETED** -->
## Probability Distributions
Probability distributions are functions that describe random variables and have the following properties:
**Discrete Random Variables** have probability mass functions $P(Y = y)$
<br>
\begin{align}
P(Y = y)& \quad \text{is the probability of the event that results in }Y = y \\[15pt]
P(Y = y)& \ge 0 \\[15pt]
\sum_{y = -\infty}^{y = +\infty} P(Y = y)& = 1\\[15pt]
(\#eq:pmf)
\end{align}
<br>
The corresponding **cumulative distribution function** is
<br>
\begin{align}
F_Y(y) = P(Y \le y) = \sum_{i = -\infty}^{i = y} P(Y = i)
(\#eq:cdfDiscrete)
\end{align}
<br>
**Continuous Random Variables** have probability density functions $f_Y(y)$
<br>
\begin{align}
f(y)& \quad \text{is the probability "per unit change of Y"} \\[15pt]
P(Y = y)& = 0 \quad \text{the probability of any specific value y is zero!} \\[15pt]
P(a \le Y \le b)& = \int_a^b f_Y(y) \ dy \\[15pt]
\int_{\mathbb{R}} f_Y(y) \ dy& = 1
(\#eq:pdf)
\end{align}
<br>
The corresponding **cumulative distribution function** is
<br>
\begin{align}
F_Y(y) = P(Y \le y) = \int_{-\infty}^{y} f_Y(y) \ dy
(\#eq:cdfCont)
\end{align}
<br>
The symbol $\int_\mathbb{R}$ represents the *integral* of the function over all real values of Y, and $\int_a^b f_Y(y) \ dy$ is the integral between $Y = a$ and $Y = b$. If you are not familiar with the concept of integral, no worries. Imagine that $f(y)$ is an absolutely continuous function of the value $y$. The integral between any two $y$ values a and b is the area under the curve between the two points (Figure \@ref(fig:Sum2Integral)). One way to calculate the area approximately is to divide the interval [a, b] into $(b-a)/dy$ little segments. The area of each segment is the base $dy$ times the height, which can be set at $f(y_{lo} + dy/2)$ where $y_{lo}$ is the left end of each segment.
<br>
```{r Sum2Integral, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '90%', fig.align='center', echo=FALSE, fig.cap ="The integral of a continuous probability density function (pdf) can be thought of as the limit of the sum of rectangles with base width $dy$ and height equal to the height of the pdf function at the center of the base, as $dy$ approaches zero. The area of each rectangle is an approximation of the probability that Y takes values within the base of the rectangle."}
knitr::include_graphics("images/Sum2Integral.png")
```
<br>
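The rectangle approximation described above is easy to reproduce numerically. The following sketch (using the standard Normal pdf, a distribution introduced later in this chapter) compares the sum of rectangle areas with the exact area obtained from the CDF:

```{r riemannSketch}
# Approximate P(a <= Y <= b) for a standard Normal by summing rectangle areas
a <- -1; b <- 1; dy <- 0.001
y.lo <- seq(a, b - dy, by = dy)   # left end of each little segment
sum(dnorm(y.lo + dy / 2) * dy)    # sum of base times height at the segment centers
pnorm(b) - pnorm(a)               # exact area from the CDF
```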
- Probability distributions are functions that define the probability of any event occurring in a random experiment where the outcome is the value of a random variable.
- For example, if the random experiment is flipping a fair coin the probability of X = 1 (heads) is the same as probability of X= 0 (tails) and it is 0.50. Then, X is a random variable that has a Bernoulli distribution with parameter p = 0.5.
- Once we know the true distribution of a random variable, we can know everything there is to know about the random variable.
- True distributions for real random variables are rarely known. Even if we assume that we know the distribution family we never know the parameter values.
- In the case of the coin, we know the distribution of a theoretical coin that is fair, perfect and eternal. We use it as a model for real coins, but we know it is not a perfect model. Some people are able to flip any number of heads or tails they want. We can also suspect that each type, or even each individual coin, has a probability of heads that is different and not exactly 0.50. The same can be said of dice. Recall that dice can be loaded and/or imperfect. However, the usual distributions used for them are *very good models* for the corresponding random experiments and variables.
## Parameters and Moments
Distribution functions can usually be written down as equations that contain certain constants, which keep the same value for all values of the random variable. The set of all possible values of the random variable is called the *support*. In the example about breeding animals above, Equation \@ref(eq:binomialPMF) has two parameters: n, the number of trials, and p, the probability of success; q does not count as a different parameter because it is simply $1-p$. All distributions are characterized by their parameters and other properties such as the mean or expected value and the variance. The mean, variance, skewness and kurtosis are called the **moments** of a distribution, because they all arise from variations of the same equation.
This section presents the definitions of the expectation, variance and parameters of distributions using equations that may be a little scary at first. Students are not expected to be able to derive these equations, and in many cases not even to memorize them. The main point in presenting them is for students to realize and remember that moments and parameters are characteristics of the whole distribution, or of populations, whereas averages and sample variances are calculated from samples and use **different equations**. Moreover, for any given random variable, the parameters and moments of its distribution are *constants*, whereas averages and variances estimated with samples from the distribution are themselves random variables that change from sample to sample. This is a very important concept.
### Expectation, Mean or First Moment
The expected value of a random variable is the long-run average of values resulting from repeating the random experiment a very large number of times. According to the **Law of Large Numbers**, the average of a sequence of independent draws from the same random variable will tend towards the mean of the random variable as the number of draws increases. The expectation, represented by $E\{\}$ or mean is more formally defined as follows:
**Expectation of discrete random variables**
<br>
\begin{align}
E\{Y\} = \sum_i y_i \ P(Y = y_i)
(\#eq:DiscreteExpectDef)
\end{align}
<br>
The expectation of the number of animals with genotype class G above is:
\begin{align}
E\{Y\} = \mu_Y &= 0 \times 0.32768 \\
&+ \ 1 \times 0.40960 \\
&+ \ 2 \times 0.20480 \\
&+ \ 3 \times 0.05120 \\
&+ \ 4 \times 0.00640 \\
&+ \ 5 \times 0.00032 = `r an.dat$Y %*% an.dat$Prob`
\end{align}
**Expectation of continuous random variable**
<br>
\begin{equation}
E\{Y\} = \int_\mathbb{R} y \ f_Y(y) \ dy
(\#eq:ContinuousExpectDef)
\end{equation}
<br>
The integral appears again. In this case, think of it as a weighted average of values of Y at the center of a large number of small intervals that cover the whole horizontal axis. The weighting factor is the area under the curve inside each narrow column, which is approximated by the product of the width of the interval and the height of the curve at the center. As you make the intervals $dy$ smaller and smaller, the calculation approaches the integral.
As an example of the calculation of the mean of a continuous variable, let's use the continuous uniform distribution between real numbers a and b. The probability density function of this distribution is rather simple:
<br>
\begin{equation}
f(y) =
\begin{cases}
\frac{1}{b-a} & \quad \text{if } a \leq y \leq b \\
0 & \quad \text{otherwise }
\end{cases}
(\#eq:unifPdf2)
\end{equation}
<br>
This pdf is zero outside the interval $[a,b]$ and constant within the interval. Because the total area has to be 1.0, the height of the rectangle, and thus the value of the function, is constant at $1/(b-a)$. Using a bit of calculus or intuition we obtain the expectation:
<br>
\begin{equation}
E\{Y\} = \int_a^b \frac{y}{b-a} \ dy = \frac{1}{b-a} \int _a^b y \ dy = \frac{1}{b-a} \ \left . \frac{y^2} {2} \right |_a^b = \frac{b^2 - a^2}{2 \ (b-a)} = \frac{a + b}{2}
\end{equation}
<br>
Thus, the expectation or mean of a random variable with a continuous uniform distribution between 3 and 7 is 5, the midpoint of the distribution.
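A quick numerical check of this result (a sketch using base R functions: `dunif()` is the uniform pdf and `integrate()` performs numerical integration):

```{r unifMeanCheck}
# E{Y} for Y ~ Uniform(3, 7), computed by numerical integration of y * f(y)
integrate(function(y) y * dunif(y, min = 3, max = 7), lower = 3, upper = 7)
# The long-run average of random draws approaches the same value (Law of Large Numbers)
mean(runif(n = 100000, min = 3, max = 7))
```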
### Variance or Second Moment
The variance of a distribution, represented by $V\{Y\} = \sigma^2_Y$, is defined as the expectation of the squared deviations from the mean:
<br>
\begin{equation}
V\{Y\} = E\{(Y - E\{Y\})^2\} = E\{(Y - \mu)^2\}
(\#eq:VarDef)
\end{equation}
<br>
This definition is short and easy to remember, and it will also remind us of the equation used to estimate the variance from a sample, where we divide the sum of squared deviations by the number of observations minus one.
<br>
```{block varDefBlock, type='mydef'}
- The variance is the expectation of squared deviations from the mean.
- The standard deviation is the square root of the variance.
- Both are always positive.
```
<br>
An important property of the expectation is that the expected value of a function $g(Y)$ of a random variable can be obtained simply by using the definition of expectation, where instead of Y, we plug in $g(Y)$. This is called the *Law of the Unconscious Statistician* or LOTUS. The application of Equation \@ref(eq:VarDef) and the LOTUS, with $g(Y) = (Y - \mu)^2$ to discrete distributions yields the definition of variance for discrete distributions:
<br>
\begin{equation}
V\{Y\} = E\{(Y - E\{Y\})^2\} = E\{(Y - \mu)^2\} = \sum_i (y_i - \mu)^2 \ P(Y = y_i)
(\#eq:VarDefDiscrete)
\end{equation}
<br>
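Applied to the breeding example, this definition gives the Binomial variance directly (a minimal sketch; the closed form $n \ p \ (1-p)$ is given later in this chapter):

```{r discreteVarCheck}
# Variance of the number of G's (n = 5, p = 0.20) from the definition for discrete RVs
y <- 0:5
p.y <- dbinom(y, size = 5, prob = 0.20)
mu <- sum(y * p.y)      # E{Y} = n p = 1
sum((y - mu)^2 * p.y)   # V{Y} = n p (1 - p) = 0.8
```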
In the case of continuous distributions, the variance becomes
<br>
\begin{equation}
V\{Y\} = E\{(Y - \mu)^2\} = \int_\mathbb{R} (y - \mu)^2 \ f_Y(y) \ dy
(\#eq:VarDefContinuous)
\end{equation}
<br>
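As before, the continuous definition can be checked numerically for the uniform example with mean 5 (a sketch; the closed form $(b-a)^2/12$ is given later in this chapter):

```{r contVarCheck}
# V{Y} for Y ~ Uniform(3, 7): integral of (y - mu)^2 * f(y), with mu = 5
integrate(function(y) (y - 5)^2 * dunif(y, min = 3, max = 7), lower = 3, upper = 7)
# Closed form: (7 - 3)^2 / 12
16 / 12
```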
### Covariance and Correlation between two RV's
Recall the definitions of marginal, joint and conditional probability from the section on [Probability of two events](#Pof2Events). Consider two random variables together, and think of the values they take as the events. We can then calculate the joint and conditional probabilities for any combination of values from the two variables. This is easily illustrated with discrete finite random variables, but the concepts extend to any pair of random variables.
Imagine that we inspect the tidal zone in a segment of coast (your population) and count the number of starfish (X) and large sea snails (Y) in each 2 m $\times$ 2 m quadrat of the tidal zone during low tide. Because we define the area at the time of measurement as our whole population, we know the complete population and we can calculate the moments. The table of (fictitious) results, expressed as joint relative frequencies, is:
<br>
```{r starfish, echo=FALSE}
set.seed(8) # generate some random but reasonable data.
x1 <- rpois(n = 10000, lambda = 0.2)
x2 <- rpois(n = 10000, lambda = 1)
x0 <- rpois(n = 10000, lambda = 0.8)
starfish <- x1 + x0
snails <- x2 + x0
# Remove rows with too many counts
ss.data <- data.frame(starfish = starfish, snails = snails)
ss.data <- ss.data[(ss.data$starfish < 7) & (ss.data$snails < 8), ]
# cor(ss.data)
total <- sum(table(ss.data$snails,ss.data$starfish))
ss.tbl <- table(snails = ss.data$snails, starfish = ss.data$starfish)
st.sn <- as.data.frame(ss.tbl / total)
knitr::kable(addmargins((ss.tbl / total)),
digits = 4,
caption = "Joint and marginal probabilities of number of starfish (columns) and number of sea snails (rows) in each 2 m x 2 m quadrat of a tidal zone. (Fictitious data)") %>%
kable_styling(full_width = FALSE)
st.marg <- aggregate(Freq ~ starfish, data = st.sn, FUN = sum)
names(st.marg)[2] <- "Freq.st"
#st.marg$Freq.st %*% 0:6
sn.marg <- aggregate(Freq ~ snails, data = st.sn, FUN = sum)
names(sn.marg)[2] <- "Freq.sn"
#sn.marg$Freq.sn %*% 0:7
```
<br>
The mean or expectation for number of starfish (X) is the sum of each number of starfish multiplied by its marginal frequency:
\begin{align}E\{X\} = \sum_{x=0}^6 \ x \ P(X=x) &= 0 \times `r round(st.marg[1,2], 4)` \\
&+ 1 \times `r round(st.marg[2,2], 4)` \\
&+ 2 \times `r round(st.marg[3,2], 4)` \\
&+ 3 \times `r round(st.marg[4,2], 4)` \\
&+ 4 \times `r round(st.marg[5,2], 4)` \\
&+ 5 \times `r round(st.marg[6,2], 4)` \\
&+ 6 \times `r round(st.marg[7,2], 4)` = `r round(st.marg$Freq.st %*% 0:6, 4)`
\end{align}
As an exercise, fully expand the calculation for the snails mean, ($E\{Y\} = `r sn.marg$Freq.sn %*% 0:7`$).
Inspection of Table \@ref(tab:starfish) reveals that the events are not independent. The joint probabilities are not the product of the marginal probabilities. We can get the conditional probabilities for each number of snails given the number of starfish by dividing each column by the marginal column sum at the bottom of Table \@ref(tab:starfish). The results are displayed in a graph so the differences can be seen at once.
<br>
```{r starfishPlot, echo=FALSE, fig.cap="Conditional probability of number of snails, given a number of starfish. The distributions are discrete and have no height for axis values between integers. Lines are added just to facilitate the identification of the different sets of points."}
st.sn <- merge(st.sn, st.marg)
st.sn$condP.sn <- st.sn$Freq / st.sn$Freq.st
ggplot(data = st.sn,
aes(y = condP.sn,
x = snails,
group = starfish,
color = starfish)) +
geom_line(size = 0.5) +
geom_point(size = 2) +
xlab(" Number of snails") +
ylab("Conditional probability of no. of snails") +
theme(axis.title.x = element_text(size = 14, face = "bold"),
axis.title.y = element_text(size = 14, face = "bold"))
```
<br>
As the number of starfish increases, the distribution of the number of snails moves to the right; therefore, the two random variables are not independent. The level of dependence between two random variables is measured by their **covariance**. The covariance between two random variables is defined as the expectation of the cross product of the deviations of each variable from its mean.
**Covariance**
```{block covBlock, type='mydef'}
- The covariance between two random variables is the expectation of the product of the deviations of each variable from its own mean.
```
<br>
\begin{align}
V\{X,Y\} &= \sigma\{X,Y\} = cov\{X,Y\} = E\{(X - E\{X\}) \cdot (Y - E\{Y\})\} \\[15pt]
&= E\{(X - \mu_X) \cdot (Y - \mu_Y)\}
(\#eq:covDef)
\end{align}
<br>
In the case of discrete random variables like the number of starfish and snails, we apply the definition of expectation for discrete variables to obtain:
<br>
$$V\{X,Y\} = \sum_i \sum_j (x_i -\mu_x) (y_j-\mu_y) \ P(X = x_i, Y = y_j)$$
<br>
Plugging in some of the values from the example we get
<br>
\begin{align}
V\{X,Y\} &= (0 - 1.006) \times (0 - 1.805) \times 0.1363 \\
&+ (1 - 1.006) \times (0 - 1.805) \times 0.0266 \\
&+ (2 - 1.006) \times (0 - 1.805) \times 0.0033 \\
&\dots \\
&+ (0 - 1.006) \times (1 - 1.805) \times 0.1360 \\
&+ (1 - 1.006) \times (1 - 1.805) \times 0.1311 \\
&+ (2 - 1.006) \times (1 - 1.805) \times 0.0250 \\
&\dots \\
&\dots \\
&+ (4 - 1.006) \times (7 - 1.805) \times 0.0006 \\
&+ (5 - 1.006) \times (7 - 1.805) \times 0.0003 \\
&+ (6 - 1.006) \times (7 - 1.805) \times 0.0002 = 0.8088
\end{align}
<br>
As an exercise, complete the calculation using R and corroborate that the result 0.8088 is correct. The covariance is positive, indicating that both variables tend to increase or decrease together. This can be seen in a graph showing the **bivariate** distribution of number of starfish and snails.
```{r DiscreteBivariatePlot, message=FALSE, echo=FALSE, fig.align='center', fig.cap="Two perspectives of the bivariate distribution of two random variables, number of starfish and number of snails in all quadrats of a tidal zone. Although bars are used to represent the joint probabilities, in reality the bars should only have height, with null width and length, because both random variables are discrete."}
library(plot3D)
par(mfrow = c(1, 2))
hist3D(x = 0:7, y = 0:6, z = ss.tbl / total,
bty = "g", phi = 20, theta = 45,
ltheta = 160 , lphi = 70,
xlab = "no. snails", ylab = "no. starfish", zlab = "", main = "P(X,Y)",
col = "lightblue", border = "black", shade = 0.7,
ticktype = "detailed", space = 0.7, d = 1.5, zlim = c(0, 0.15))
hist3D(x = 0:7, y = 0:6, z = ss.tbl / total,
bty = "g", phi = 20, theta = 110,
ltheta = 160 , lphi = 70,
xlab = "no. snails", ylab = "no. starfish", zlab = "", main = "P(X,Y)",
col = "lightblue", border = "black", shade = 0.7,
ticktype = "detailed", space = 0.7, d = 1.5, zlim = c(0, 0.15))
par(mfrow = c(1, 1))
```
When variables are continuous, the application of the expectation operator to the definition of covariance yields equations with double integrals. We extend the idea of integral in one dimension, explained in Figure \@ref(fig:Sum2Integral), to two dimensions. The probability density function is now a surface in two dimensions and the volume under it is divided into columns whose widths tend to zero. The probability of any region of X and Y values is the volume under the surface inside the area defined by the region.
<br>
$$V\{X,Y\} = \int_x \int_y (x - \mu_x) \ (y - \mu_y) \ f(x, y) \ dy \ dx$$
<br>
<br>
```{r biNormal, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '90%', fig.align='center', echo=FALSE, fig.cap ="Two perspectives of the joint probability density function for two normal random variables that have a correlation of 0.70. The height of the surface is the probability density (probability per unit area of the X-Y plane); the probability of a region is the volume under the surface above that region. The intersection between any vertical plane and the surface is the pdf of a normal distribution. In particular, intersections with planes perpendicular to either one of the axes (variables) are the conditional distributions of the other variable."}
knitr::include_graphics("images/bivariateNormal.png")
```
<br>
**Correlation**
The covariance has a drawback as a measure of the degree of association between two random variables: it is a quantity with units, and its numerical value depends on the magnitude of the random variables and on the units used. Therefore, covariances cannot be compared when variables have different units or when they have different means. The **correlation** coefficient is a better measure of association between two variables because it has no units and can be compared across variables and situations.
```{block corBlock, type='mydef'}
- The correlation between two random variables equals their covariance divided by the product of their standard deviations.
```
\begin{align}
cor\{X,Y\} = r_{X,Y} = \frac {V\{X,Y\}} {\sqrt{V\{X\} \ V\{Y\}}}
(\#eq:corDef)
\end{align}
When random variables are standardized (see below), their covariance and correlation are equal.
<br>
```{block indepBlock, type='stattip'}
- When random variables are independent, their covariance and correlation are zero, but the converse is not true. Random variables with zero correlation may have nonlinear dependence.
```
<br>
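To illustrate the second part of the tip above, here is a small simulation sketch in which `y` is completely determined by `x` and yet their correlation is essentially zero (the variables are hypothetical, chosen only for illustration):

```{r zeroCorNonlinear}
set.seed(1)
x <- rnorm(10000)   # a symmetric random variable
y <- x^2            # y depends on x, but not linearly
cor(x, y)           # close to 0 despite the perfect nonlinear dependence
```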
### Parameters are not necessarily the moments
Distribution functions have constant values that are part of the function, called **parameters**. In order to know everything about any given distribution, one must know all the parameters. The Uniform distribution seen before has two parameters, a and b. The Binomial distribution also has two parameters, n and p. As we will see in more detail below, the Poisson distribution has a single parameter. In the case of the Uniform and Binomial distributions, their means and variances are not the same as the parameters. In the case of the Poisson distribution, the mean and the variance both are equal to the single parameter. Other distributions, like the generalized Gamma distribution, have more than two parameters. The Normal distribution has two parameters, its mean and its variance; it is a very special case.
The point is that parameters are not necessarily the same as mean and variance. The Normal distribution is a special case in which there are only two parameters and they happen to be the mean and the variance. Therefore, in order to know everything about a Normal distribution one only needs two numbers: mean and variance.
### Properties of Mean and Variance
The following properties can be easily derived directly from the definitions of mean and variance, and they are very useful when working with samples. Consider that $X$, $Y$ and $Z$ are random variables with any distribution that has mean and variance^[Amazingly, some distributions do not have mean or variance, for example because the mean or variance involve sum of series that do not converge.].
<br>
\begin{align}
&E\{a + b \ Y\} = a + b \ E\{Y\} \\[15pt]
&V\{a + b \ Y\} = b^2 \ V\{Y\} \\[25pt]
&\text{Let X be a linear combination of random variables Y and Z: } \\[15pt]
& X = a + b \ Y + c \ Z \quad \text{then,}\\[15pt]
&E\{X\} = a + b \ E\{Y\} +c \ E\{Z\} \\[15pt]
&V\{X\} = b^2 \ V\{Y\} + c^2 \ V\{Z\} + 2 \ b \ c \ V\{Y,Z\} \\[15pt]
&\text{Special cases for independent variables: } \\[15pt]
&V\{Y + Z\} = V\{Y - Z\} = V\{Y\} + V\{Z\}
(\#eq:propertiesOfMean)
\end{align}
<br>
The last equation is extremely important in statistics. In most real applications of statistics we have to use models that involve estimation of more parameters than just mean and variance. For example, in linear regression we will estimate intercept and slope. **Estimated parameters** are based on samples of random variables, so they are themselves random variables. When we use the estimated intercept and slope to make a prediction, we are using a linear combination of estimated parameters. The last line in Equation \@ref(eq:propertiesOfMean) allows us to estimate the variance of the prediction and thus make confidence intervals or test hypotheses that involve linear combinations of estimated parameters.
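The following simulation sketch (with arbitrary constants chosen only for illustration) checks these properties for independent normal random variables:

```{r lincombVarCheck}
set.seed(2)
Y <- rnorm(100000, mean = 10, sd = 2)
Z <- rnorm(100000, mean = 5, sd = 3)   # generated independently of Y
a <- 1; b <- 2; cc <- -1
X <- a + b * Y + cc * Z
c(mean(X), a + b * mean(Y) + cc * mean(Z))   # both close to 1 + 2 * 10 - 5 = 16
c(var(X), b^2 * var(Y) + cc^2 * var(Z))      # both close to 4 * 4 + 1 * 9 = 25
```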
<br>
```{block LCofEstimatedParameters, type='stattip'}
- A main goal of statistics is to make statements about linear combinations of estimated parameters.
- The variance of a linear combination of estimated parameters depends on the variances and covariances of the estimated parameters.
```
<br>
### Standardized variables
Random variables are frequently standardized to obtain new random variables with more desirable properties. Standardization is a linear transformation or function of the random variable by which the mean is subtracted and the result is divided by the standard deviation.
Frequently we use the lower case version of a letter to refer to the standardized variable. The letter "z" is typically used to refer to the standard Normal distribution.
The mean and variance of a standardized variable are obtained directly by the properties of the mean and variance treated in the previous section. Standardized random variables have mean zero and variance one.
<br>
\begin{align}
&X \ \text{ is an RV with } \ E\{X\} = \mu_X \ \text{and} \ V\{X\} = \sigma^2_X \\[15pt]
&\implies x = \frac{X - \mu_X}{\sigma_X} = - \left( \frac{\mu_X}{\sigma_X} \right) + \left( \frac{1}{\sigma_X} \right) \ X\\[15pt]
&\text{ is an RV with } \ E\{x\} = 0 \ \text{and} \ V\{x\} = 1
(\#eq:standardize)
\end{align}
<br>
The function of X used to get x is a *linear* function, where $\mu_X$ and $\sigma_X$ are constants that do not depend on X. Linear functions simply translate (move along the axis) and uniformly stretch or shrink the width and height of the distribution function.
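As a quick illustration, here is a sketch in which the population mean and standard deviation are replaced by their sample counterparts, which is what we usually have to do in practice:

```{r standardizeSketch}
set.seed(3)
X <- rnorm(1000, mean = 50, sd = 10)   # a variable with mean 50 and sd 10
x <- (X - mean(X)) / sd(X)             # subtract the mean, divide by the standard deviation
round(c(mean = mean(x), variance = var(x)), 10)   # mean 0 and variance 1
```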
## Common distributions
The list of distribution functions is infinite. In theory, random variables can have any distribution, for as long as they comply with the definition, which is rather general. In practice, we manage to make useful models of the world with very few prototypical families of distributions. Loosely, a family of distributions is a set of distributions that differ only in the values of their parameters, but have the same functional form. For example, the Bernoulli distribution, which has value 1 with probability $p$ and 0 with probability $1-p$, is a Binomial distribution where $n = 1$.
Some distributions can be derived on the basis of a few facts or assumptions. For example, the Binomial distribution results from performing a set number of independent trials with constant probability of success. Other distributions may be the result of complicated combinations of basic random experiments.
In this section we describe the distributions that we will use in the rest of the book. Wikipedia has entries for about 30 discrete and 100 continuous distributions, where they are nicely organized and described.
### Discrete Uniform Distribution {#DUnifDist}
The random experiment consisting of rolling one die can be modeled with a discrete uniform distribution where the random variable is the number of dots facing up. There are n values possible between integers a and b inclusive, and each value has a probability equal to $1/n$.
<br>
**Parameters:**
$$a, \quad b, \quad n = b - a + 1$$
**Support:**
Every integer between a and b, including a and b.
**PMF:**
$$P(Y = y) = 1/n$$
**CDF:**
$$F_Y(k) = \frac{{\lfloor k \rfloor} - a + 1}{b - a + 1}$$
**Mean:**
$$\mu_Y = \frac{a + b}{2}$$
**Variance:**
$$V\{Y\} = \sigma^2_Y = \frac{n^2 -1}{12}$$
<br>
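A die-rolling simulation (a minimal sketch) recovers the mean and variance given above for $a = 1$, $b = 6$, $n = 6$:

```{r dieUnifSketch}
set.seed(4)
rolls <- sample(x = 1:6, size = 100000, replace = TRUE)   # simulated fair die rolls
mean(rolls)   # close to (a + b) / 2 = 3.5
var(rolls)    # close to (n^2 - 1) / 12 = 35 / 12 = 2.9167
```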
### Continuous Uniform Distribution {#CUnifDist}
This is the distribution where a continuous variable has equal probability of taking values in any segment of equal width between a and b. Its pdf was given in Equation \@ref(eq:unifPdf2).
<br>
**Parameters:**
$a, \quad b$
**Support:**
Every real number between a and b, including a and b.
**PDF:**
$$f_Y(y) = 1/(b - a) \quad \forall \quad a \le y \le b, \quad 0 \quad \text{elsewhere}$$
**CDF:**
\begin{equation}
F_Y(y) = \begin{cases}
0 & \quad \text{if } y \ \lt a\\
\frac{y - a}{b - a} & \quad \text{if } \ a \leq y \leq b \\
1 & \quad \text{if } \ y \gt b
\end{cases}
\end{equation}
**Mean:**
$$\mu_Y = \frac{a + b}{2}$$
**Variance:**
$$V\{Y\} = \sigma^2_Y = \frac{(b - a)^2}{12}$$
<br>
### Binomial Distribution {#BinDist}
This is the distribution of a random variable obtained by summing n independent Bernoulli random variables. The binomial random variable is the number of "successes" in a set of n independent trials with constant probability of success p. The following random variables are usually modeled with a Binomial distribution:
- Number of heads in n coin tosses.
- Number of seeds that germinate out of a set of 10.
- Number of bird nests found in a forest plot that actually has 5 nests.
- Number of salmon that survive and return to a hatchery stream out of 1000 young released.
- Number of apples with worms out of 20 selected randomly from a batch of 10000.
<br>
**Parameters:**
$$n = \text{no. of trials} \quad \quad p = \text{probability of success}$$
**Support:**
Natural numbers between 0 and n (it's a discrete distribution).
**PMF:**
$$f_Y(r) = P(Y = r) = C_r^n \ p^r \ q^{n-r}\quad \forall \quad 0 \le r \le n, \quad 0 \quad \text{elsewhere}$$
**CDF:**
$$ F_Y(k) = P(Y \leq k) = \sum_{i = 0}^ {\lfloor k \rfloor }\ C_i^n \ p^i \ q^{n-i}$$
**Mean:**
$$\mu_Y = n \ p$$
**Variance:**
$$V\{Y\} = \sigma^2_Y = n \ p \ (1 - p)$$
<br>
In the binomial calculations, the combinations are combinations of positions where "success" happens. We use combinations because having "success" in positions 1 and 5 for example, is exactly the same as having success in position 5 and 1. Just note that the order of successes and failures matters but the order of the positions of the successes and failures does not! Coin tosses HHTT are not the same as HTHT. However, having H in positions 1 and 2 is the same as having H in positions 2 and 1.
There are four basic functions in R to do calculations related to the Binomial distribution:
- `dbinom(x, size, prob)` returns the value of the PMF.
- `pbinom(q, size, prob, lower.tail = TRUE)` returns the CDF left tail.
- `qbinom(p, size, prob, lower.tail = TRUE)` returns the quantile.
- `rbinom(n, size, prob)` returns random values with a Binomial distribution.
Analogous functions are available for several distributions in R. The functions are illustrated in the R chunk below.
```{r binomialPlot, message=FALSE}
# Probability of 3 heads in 5 coin tosses
dbinom(x = 3, size = 5, prob = 0.5)
# Probability of k heads in 5 tosses
(df <- data.frame(k = 0:5, P = dbinom(x = 0:5, size = 5, prob = 0.5)))
plot(df, type = "h", col = "skyblue", lwd = 4)
# Probability of 3 heads as a function of number of tosses
(df <- data.frame(size = 0:15, P = dbinom(x = 3, size = 0:15, prob = 0.5)))
plot(df, type = "h", col = "skyblue", lwd = 4)
# Cumulative distribution function for number of 5's in 10 die rolls
(df <- data.frame(k = 0:10, P = pbinom(q = 0:10, size = 10, prob = 1/6)))
plot(df, type = "S", col = "skyblue", lwd = 4)
# P(more than 2) infected animals in a sample of 20 from a population where p = 0.02
# One way; note that q is included in the lower tail
pbinom(q = 2, size = 20, prob = 0.02, lower.tail = FALSE)
# or another
1 - dbinom(x = 0, size = 20, prob = 0.02) -
dbinom(x = 1, size = 20, prob = 0.02) -
dbinom(x = 2, size = 20, prob = 0.02)
```
### Poisson Distribution {#PoisDist}
The Poisson distribution is typically used to describe the number of events or occurrences that take place over periods of time or segments of space when the probability of an event is constant over space and/or time, and events are independent. The probability that an event happens in a specific segment of space or time does not depend on whether there are events in other segments. The following are examples of random variables that can be modeled with a Poisson distribution:
- Number of cars that pass by a corner every 10 minutes between 8 and 9 am every weekday.
- Number of birds observed in a forest plot during 1 hour.
- Number of plants of a given species of weed that are within each square meter in an alfalfa field.
- Number of water pumps that break down in California every minute.
- Number of times a machine is expected to fail in 5000 hours of operation.
- Number of weak points per 100 ft length of rope.
The parameter $\lambda$, which is the mean and the variance of the Poisson, is the mean number of events per unit space or time.
<br>
**Parameters:**
$$\lambda = \mu = \sigma^2$$
**Support:**
Natural numbers between 0 and $\infty$ (it's a discrete distribution).
**PMF:**
$$f_Y(y) = P(Y = y) = \frac{e^{-\lambda} \ \lambda^y}{y!} \quad \forall \quad 0 \le y, \quad 0 \quad \text{elsewhere}$$
**CDF:**
$$ F_Y(k) = P(Y \leq k) = e^{-\lambda}\sum_{i = 0}^ {\lfloor k \rfloor } \ \frac{\lambda^i}{i!}$$
**Mean:**
$$\mu_Y = \lambda$$
**Variance:**
$$V\{Y\} = \sigma^2_Y = \lambda$$
<br>
For the practical application of the Poisson distribution, if $\lambda$ is the mean rate of events per unit time or space, then the mean of the distribution is $\lambda \cdot s$, where $s$ is the size (length, duration, area, volume, etc.) of the interval considered. Suppose that the mean number of plants per unit area is 6 $m^{-2}$. What is the probability of observing 3 or more plants in a quadrat that is 0.5 m on each side?
The area of the quadrat is 0.25 $m^2$, thus the expected number of plants per quadrat is 6 $m^{-2} \ \times$ 0.25 $m^2 \ quadrat^{-1}$ = 1.5 plants $quadrat^{-1}$, or 1.5 plants per quadrat.
$$P(Y \ge 3) = 1 - P(Y \le 2) = 1 - P(Y = 0) - P(Y = 1) - P(Y = 2) \\[15pt]
= 1 - \frac{e^{-1.5} \ 1.5^0}{0!} - \frac{e^{-1.5} \ 1.5^1}{1!} - \frac{e^{-1.5} \ 1.5^2}{2!} \\[15pt]
= 1 - `r round(dpois(0, 1.5), 4)` - `r round(dpois(1, 1.5), 4)` - `r round(dpois(2, 1.5), 4)` = `r 1 - round(ppois(2, 1.5), 4)`$$
Obviously, the probability that there are 3 or more plants in a 1-$m^2$ quadrat is going to be larger than in a smaller quadrat. Both distributions are plotted in Figure \@ref(fig:PoissonPmfPlot).
<br>
```{r PoissonPmfPlot, message=FALSE, fig.cap="Poisson distributions for number of plants per 0.25 and 1 m^2^ quadrat in a population where the average number of plants per square meter is 6. Both distributions have probabilities that are positive for any positive integer, the distributions are truncated at values of Y for which the probability is too small to be seen in the plot. Abscissa values have been slightly offset so both distributions can be seen without overlap.", out.width = '70%'}
plot(0:20 + 0.05, dpois(x = 0:20, lambda = 1.5), type = "h", lwd = 4, col = "orange", ylab = "PROBABILITY", xlab = "NUMBER OF PLANTS IN QUADRAT")
points(0:20 - 0.05, dpois(x = 0:20, lambda = 6), type = "h", lwd = 4, col = "black")
```
<br>
Note that when asked for the probability of values greater than something in a Poisson, you have to use the probability of the complement, otherwise you would have to sum an infinite number of terms.
There are four basic functions in R to do calculations related to the Poisson distribution:
- `dpois(x, lambda)` returns the value of the PMF.
- `ppois(q, lambda, lower.tail = TRUE)` returns the CDF left tail.
- `qpois(p, lambda, lower.tail = TRUE)` returns the quantile.
- `rpois(n, lambda)` returns random values with a Poisson distribution.
The functions are illustrated in the R chunk below. The number of crossovers that happen in a region of a chromosome during meiosis can be modeled with a Poisson distribution. Suppose that we consider a region where the mean crossover rate is 0.8, and that each meiosis represents an independent event.
```{r poisonPlot, message=FALSE}
# Probability of 0 crossovers
dpois(x = 0, lambda = 0.8)
# Probability of 2 crossovers
dpois(x = 2, lambda = 0.8)
# Probability of y crossovers
(df <- data.frame(y = 0:7, P = dpois(x = 0:7, lambda = 0.8)))
plot(df, type = "h", lwd = 4)
# Cumulative distribution function for number of crossovers
(df <- data.frame(y = 0:7, P = ppois(q = 0:7, lambda = 0.8)))
plot(df, type = "S", col = "skyblue", lwd = 4)
# Expected proportion of gametes with more than 3 crossovers
# One way; note that q is included in the lower tail
ppois(q = 3, lambda = 0.8, lower.tail = FALSE)
# or another
1 - dpois(x = 0, lambda = 0.8) -
dpois(x = 1, lambda = 0.8) -
dpois(x = 2, lambda = 0.8) -
dpois(x = 3, lambda = 0.8)
```
### Normal Distribution {#NormDist}
The Normal or Gaussian distribution is the most important distribution in statistics and science, because it is an accurate model for a multitude of measured variables. As sample size increases, many estimated parameters tend to have a normal distribution, regardless of the original distribution from which the sample is obtained. The normal distribution describes diffusion of gas particles in a volume, noise in an electromagnetic signal, values in chaotic systems, and all sorts of variables measured in the empirical sciences.
For any given variance and mean, the Normal distribution is the continuous distribution with maximum entropy. This means that when one knows the mean and variance of a continuous distribution but nothing else about it, assuming that it has a Normal distribution is the choice that imposes or adds the least prior information.
<br>
**Parameters:**
$$\mu \ \text{and } \sigma^2$$
**Support:**
Real numbers between $-\infty$ and $\infty$ (it's a continuous distribution).
**PDF:**
$$f_Y(y) = \frac{1}{\sqrt{2 \ \pi \ \sigma^2}} \ e^{-\frac{(y - \mu)^2}{2 \ \sigma^2}} = \\[20pt]
= \frac{1}{\sqrt{2 \ \pi \ \sigma^2}} \ exp \left[ -0.5 \ \left( \frac{y - \mu}{\sigma} \right)^2 \right]$$
Note that the probability density depends on the square of the deviation from the mean. As a result, the Normal distribution is symmetric about the mean, the height of the curve at $y = \mu + a$ is equal to the height at $y = \mu - a$: $f_Y(\mu + a) = f_Y(\mu - a)$
**CDF:**
$$\Phi(y) = \int_{-\infty}^y \ \frac{1}{\sqrt{2 \ \pi \ \sigma^2}} \ exp \left[ -0.5 \ \left( \frac{u - \mu}{\sigma} \right)^2 \right] \ du$$
The CDF of the Normal distribution is a well-defined function, but it cannot be written down in terms of elementary functions. We use the Greek letter $\Phi$ to refer to it.
**Mean:**
$$\mu_Y$$
**Variance:**
$$V\{Y\} = \sigma^2_Y$$
<br>
Although the Normal pdf has positive values for all real numbers, we frequently use it as an approximation for random variables that are strictly positive, such as mass, length, and other biophysical quantities that cannot be negative. For the approximation to be a good one, the mean has to be at least 4 times the standard deviation. This way, less than 1/30,000 of the distribution is expected to be below zero, making errors of little practical consequence. In cases where the mean is closer to zero than 3-4 standard deviations, we should consider other distributions whose support is the positive real numbers, including 0. The log-normal and members of the Gamma family of distributions would be good choices.
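The 1/30,000 figure is easy to verify (a minimal sketch using `pnorm()`, the Normal CDF in R):

```{r normApproxCheck}
# Probability below zero when the mean is 4 standard deviations above zero
pnorm(0, mean = 4, sd = 1)   # about 3.2e-05, i.e., less than 1/30,000
# Same value for any mean and sd with mean = 4 * sd, via the standard Normal
pnorm(-4)
```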
**Calculations with the Normal distribution**
Several of the calculations we present here for the Normal distribution can be performed with any distribution, particularly with symmetric continuous distributions. Most of the time we are interested in areas under the Normal pdf $f(y)$ between values of the random variable, which is the same as differences in the height of $\Phi(y)$ at the values of Y.
```{r normDistFig1, message=FALSE, echo=FALSE, fig.cap="Probability density function (blue) and cumulative distribution function (black) for the standard Normal distribution. The shaded area under the pdf is the probability that the random variable Y takes values less than 0, which is the height of the cumulative distribution function at Y = 0.", out.width='80%'}
curve(pnorm(x),
from = -3.5,
to = 3.5,
lwd = 2,
xlab = "Y",
ylab = "")
# Create data for the area to shade
cord.x <- c(-3.5, seq(-3.5, 0, 0.01), 0)
cord.y <- c(0, dnorm(seq(-3.5, 0, 0.01)), 0)
# Add the shaded area.
polygon(cord.x, cord.y, col = "lightblue")
# Redraw lines over shading
curve(pnorm(x),
from = -3.5,
to = 3.5,
lwd = 2,
xlab = "Y",
ylab = "",
add = TRUE)
curve(dnorm(x),
from = -3.5,
to = 3.5,
add = TRUE,
lwd = 2,
col = "blue")
# Add lines to axes and text
lines(x = 0, y = pnorm(0), type = "h", lty = 2)
lines(x = c(-3.75, 0), y = rep(pnorm(0), 2), lty = 2)
text(x = 2, y = 0.12,
labels = "f(Y)",
col = "blue",
cex = 1.5)
text(x = 2, y = 0.90,
labels = TeX("$\\Phi(Y)$"),
col = "black",
cex = 1.5)
text(x = -0.6,
y = 0.07,
labels = "Area = 0.5",
col = "blue",
cex = 1)
text(x = 0.85, y = 0.5,
labels = "<- Height = 0.5",
col = "black",
cex = 1)
```
<br>
Any Normal RV can be mapped to the Standard Normal Distribution (SND) by the linear transformation consisting in subtracting the mean and dividing by the standard deviation. The standard normal distribution is a normal distribution with mean equal to 0 ($\mu = 0$) and variance equal to 1 ($\sigma^2 = 1$).
$$P(Y \le y) = P\left ( Z = \frac{Y - \mu}{\sigma} \le \frac{y - \mu}{\sigma} = z \right )$$
If Y has a Normal distribution with mean $\mu$ and variance $\sigma^2$ and Z has a SND, which we write as $Z \sim N(0,1)$, then
$$P(y_1 \le Y \le y_2) = P \left( \frac{y_1 - \mu}{\sigma} \le Z \le \frac{y_2 - \mu}{\sigma} \right)$$
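In R, either side of this identity can be computed with `pnorm()`; here is a brief sketch with arbitrary illustrative numbers (mean 10, standard deviation 3, interval [8, 12]):

```{r pnormEquivalence}
# Direct calculation with the mean and sd of Y
pnorm(12, mean = 10, sd = 3) - pnorm(8, mean = 10, sd = 3)
# Same probability after standardizing the limits
pnorm((12 - 10) / 3) - pnorm((8 - 10) / 3)
```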
What is the probability that $Y \sim N(\mu = 5, \sigma^2 = 4)$ takes values between 6 and 9, or that it takes values lower than 3? Those probabilities are represented as areas under the curve of $Y \sim N(5,4)$ and under the SND $Z \sim N(0,1)$ in Figure \@ref(fig:normDistFig2).
<br>
```{r normDistFig2, echo=FALSE, message=FALSE, fig.cap="Equivalence of areas under the curve in a normal distribution with mean 5 and standard deviation 2 (variance 4), and the standard Normal Distribution.", fig.height=10}
par(mfrow = c(2,1))
curve(dnorm(x, mean = 5, sd = 2),
from = -2.5,
to = 12.5,
lwd = 2,
xlab = "Y",
ylab = "",
col = "blue",
xaxp = c(-1, 11, 12))
# Create data for the area to shade between 6 and 9
cord.x <- c(6, seq(6, 9, 0.01), 9)
cord.y <- c(0, dnorm(seq(6, 9, 0.01), mean = 5, sd = 2), 0)
# Add the shaded area.
polygon(cord.x, cord.y, col = "skyblue")
# Create data for the area to shade below 3
cord.x <- c(-2.5, seq(-2.5, 3, 0.01), 3)
cord.y <- c(0, dnorm(seq(-2.5, 3, 0.01), mean = 5, sd = 2), 0)
# Add the shaded area.
polygon(cord.x, cord.y, col = "grey")
curve(dnorm(x, mean = 5, sd = 2),
from = -2.5,
to = 12.5,
lwd = 2,
xlab = "Y",
ylab = "",
col = "blue",
xaxp = c(-1, 11, 12),
add = TRUE)
text(x = 7.5, y = 0.015,
labels = TeX("$P(6 \\leq Y \\leq 9)$"),
col = "blue",
cex = 1)