-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathLab02Key.Rmd
618 lines (337 loc) · 18.4 KB
/
Lab02Key.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
---
title: "Lab02: Probability and distributions"
author: "YourFirstName YourLastName"
date: "Today's date"
output: html_document
---
```{r loadPackagesLab, message = FALSE, warning=FALSE}
# install.packages("prob")
library(prob)
library(arrangements)
library(knitr)
library(kableExtra)
```
### 1. Labels with letters [5]
You are preparing to perform an experiment that will result in the collection of no more than 10,000 laboratory specimens. Each specimen has to be labeled uniquely. What is the minimum number of characters that you can use in the labels if you use 26 letters? Perform the calculation adding one character at a time until you have a set that will work.
```{r LabelsLetters}
x <- 26 # fill in this number!
unique_labs <- x^(1:5) # if you pull between 1 and 5 times from x options, how many combinations can you get?
min(which(unique_labs >= 10000))
```
What is the minimum number of characters needed?
> Your answer here: 3
Is the selection of letters with or without replacement?
> Your answer here: with replacement
Does order matter?
> Your answer here: yes
What Combinatorics case is this?
> Your answer here: Permutations with replacement
### 2. Poker hands [5]
How many possible 5-card poker hands are possible from a 52-card deck?
What Combinatorics case is this? Use the equation given in the notes and then use the `ncombinations` function of the `arrangements` package.
```{r PokerHands}
### the question says that we should first use the equation from the notes ###
# answer with equations
factorial(52)/(factorial(52-5)*factorial(5))
# fill in the values of "k" and "n" to identify the number of possible outcomes
ncombinations( k = 5 , n = 52 , replace = FALSE )
### the question says that use the equation from the notes
# answer with equations
factorial(52)/(factorial(52-5)*factorial(5))
```
What Combinatorics case is this?
> Your answer here: Combination without replacement
### 3. Tagged fish recapture [10]^[This is the Hypergeometric distribution]
A manager needs to estimate the population of trout in a small Sierra Nevada lake. She catches 10 trout, tags them, and releases them back into the lake. After trout return to normal, the manager returns to the lake and catches 7 trout. She does not release the trout until she is done with the catching and counting. What is the probability of getting 2 tagged out of 7 trout if the lake has a total of 50, including the 10 marked ones?
The total number of possible sets of 7 trout from the lake is $C^{50}_7$. This can be better visualized if you think that each trout is a different individual, or a different person and the question is how many different groups of people can you make choosing from 50 people (it is a large number, so do not try to list all possibilities). Once a person (e.g. Peter) is selected, he cannot be selected again (no replacement), and it does not matter if Peter is selected first , second or 7th. It just matters whether he is in the group (order does not matter).
Repeating the rationale for the 10 tagged trout, there are $C^{10}_2$ ways to get 2 out of the 10 tagged trout. Thus, there are $C^{40}_5$ ways to get the other 5 untagged trout out of the 40 untagged ones available. Each set of tagged trout can be combined with each set of untagged trout, so the total number of possible sets of 2 tagged with 5 untagged trout is the product of the number of tagged and untagged sets: $C^{10}_2 \times C^{40}_5$
```{r TaggedFishRecapture}
total50 <- ncombinations(k = 7, n = 50, replace = FALSE) #the combinations of 7 trout caught of 50 total trout without replacement
tagged50 <- ncombinations(k = 2, n = 10, replace = FALSE) #of the 10 tagged trout, 2 tagged trout are caught
untagged50 <- ncombinations(k = 5 , n = 40 , replace = FALSE) #of the 40 untagged trout, 5 untagged trout are caught
tag.untag50 <- tagged50 * untagged50 #total possible combinations of 2 tagged trout and 5 untagged trout out of 50 total trout (10 tagged/40 untagged)
prob50 <- tag.untag50 / total50 #probability of 2 tagged trout and 5 untagged trout out of 50 total trout (10 tagged/40 untagged)
#total possible combinations of 2 tagged trout in the set of 7 caught if there were 40 total trout
total40 <- ncombinations(k = 7, n = 40, replace = FALSE)
tagged40 <- ncombinations(k = 2, n = 10 , replace = FALSE )
untagged40 <- ncombinations(k = 5, n = 30 , replace = FALSE )
tag.untag40 <- tagged40 * untagged40
prob40 <- tag.untag40/total40
#total possible combinations of 2 tagged trout in the set of 7 caught if there were 34 total trout
total34 <- ncombinations(k = 7 , n = 34 , replace = FALSE )
tagged34 <- ncombinations(k = , n = 10 , replace = FALSE )
untagged34 <- ncombinations(k = 5 , n = 24 , replace = FALSE )
tag.untag34 <- tagged34 * untagged34
prob34 <- tag.untag34 / total34
```
What is then, the probability of getting exactly 2 tagged in the set of 7 caught if there were 50 total in the lake?
> Your answer here: 0.2964463
What is then, the probability of getting exactly 2 tagged in the set of 7 caught if there were 40 total in the lake?
> Your answer here: 0.343967
What is then, the probability of getting exactly 2 tagged in the set of 7 caught if there were 34 total in the lake?
> Your answer here: 0.3555421
Based on these results, which of these three options has the highest probability of getting exactly 2 tagged in the set of 7 caught trout?
> Your answer here: 34 total trout
### 4. Probability of one-card events [10]
Use the `prob` package to perform the following calculations. Corroborate your results using basic calculations instead of the high-level R functions.
What is the sample space for the random experiment: draw a random card from a deck?
> Your answer here: The 52 cards of the deck {4 suits of Spades, Hearts, Diamonds, and Clubs. Each suit contains 13 cards: Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King.}
What is P(Heart & even)?
> Your answer here: 0.09615385
What is P(Heart or even)?
> Your answer here: 0.5384615
```{r}
help(prob)
# Make the sample space with all outcomes and equal probabilities
(Scard <- cards(makespace = TRUE))
# Define Heart event.
(Heart <- subset(Scard, suit == "Heart"))
# Define even event
(Even <- subset(Scard, rank %in% c(1:5 * 2)))
Prob(Heart) # probability of a Heart
Prob(Even) # probability of an even cards
#complete code below
Prob(Heart, given = Even) # conditional probability of Heart given even
Prob(Even, given = Heart) # conditional probability of Even given
#calculate by hand
PHGE <- (Prob(Heart)*(Prob(Even))/Prob(Even))
# Applying the definition of conditional probability:
Prob(intersect(Heart, Even)) / Prob(Even)
Prob(intersect(Heart, Even)) # 5 cards out of 52
Prob(union(Heart, Even))
# Using the equation for either one of two events:
Prob(Heart) + Prob(Even) - Prob(intersect(Heart, Even))
```
Define an event "Odd" for odd numbered card.
Define an event "G7" for rank greater than 7. Note that rank is a factor and cannot be treated as a number. You have to specify each element in the subset, for example `c("10", "Q", "A")`.
Calculate P(Odd), P(G7), P(G7 & Odd), P(G7 OR Odd), P(Odd | G7) and P(G7 | Odd). *(oxford comma neeeded?)*
> P(Odd): 0.3076923
> P(G7): 0.5384615
> P(G7 & Odd): 0.07692308
>P(G7 OR Odd): 0.7692308
> P(Odd | G7): 0.1428571
> P(G7 | Odd): 0.25
> P(Odd | G7) and P(G7 | Odd): 0.03571429
Are the events "Odd" and "G7" independent? Why?
> No. They are not independent because the probability of drawing an odd card is affected by the condition that the card is greater than seven. Similarly, the probability of getting a card greater than seven is affected by the condition that the card is odd
> For each suit(13 cards), G7{8,9,10,J,Q,K,A}, Odd{3,5,7,9}; G7 and Odd{9}
> P(Odd) = 4/13 ≠ P(Odd|G7) = 1/7
> P(G7) = 7/13 ≠ P(G7|Odd) = 1/4
> Besides, the P(Odd and G7) ≠ P(Odd) * P(G7)
```{r}
# Define event odd
(Odd<- subset(Scard, rank %in% c(3,5,7,9)))
#Cards greater than 7
(G7 <- subset(Scard, rank %in% c(8,9,10, 'J', 'Q', 'K', 'A')))
# Probability of odd
Prob(Odd)
#Probability of card greater than 7
Prob(G7)
#Probability of G7 and odd
#complete code below
Prob(intersect(G7,Odd))
4/52 # only the nines are odd and greater than seven
#Probability of G7 or odd
#complete code below
Prob(union(G7,Odd))
#Probability of Odd given G7 and G7 given odd
(Prob(Odd, given = G7))
(Prob(G7, given = Odd))
((Prob(Odd, given = G7)) * (Prob(G7, given = Odd)))
```
### 5. Rolling the dice [10 points]
A random experiment consists of rolling a die three times.
What is the sample space? How many outcomes are there?
> the combination (with replacement) of 6, 6 and 6. There are 6^3 outcomes = 216
What is the probability of observing the ordered sequence 2, 3 in the three rolls?
> 16/216
What is the probability of observing a 2 and a 3 regardless of order?
> 30/216
```{r}
# Make and view the sample space; 6^3 = 216
(S <- rolldie(times = 3, makespace = TRUE))
# This function checks if a sequence or subset appears inside another sequence
sum(TwoThreeOrdered <- isin(S, c(2,3), ordered = TRUE))
sum(TwoThreeNoOrder <- isin(S, c(2,3), ordered = FALSE))
# Subsets or events are made according to whether they have the sequence or values
# We reuse the names
TwoThreeOrdered <- subset(S, TwoThreeOrdered)
TwoThreeNoOrder <- subset(S, TwoThreeNoOrder)
Prob(TwoThreeOrdered) # 16/216
Prob(TwoThreeNoOrder) # 30/216
```
<br>
### 6. Conditional probability, independence and Bayes' rule. [15]
A very large sample of deer mice were tested for presence of a genetic marker and Hantavirus. The total proportion of mice with Hantavirus (H) is 0.17, whereas 63% of the mice carry the genetic marker (M). Five percent of the mice carried the marker and were positive for Hantavirus (H M).
Make a data frame with the joint probabilities for all cases. Hint:
$$P(+H) = 0.17 = P(+H\cap+M) + P(+H\cap-M) = 0.05 + P(+H\cap-M)$$
$$P(+H\cap-M) = 0.17 - 0.05 = 0.12$$
```{r Hantavirus}
# Create a data frame with the probabilities for each event
hvirus <- expand.grid(Hanta = c("h", "H"), marker = c("m", "M"))
P.H <- 0.17
P.h <- 1 - P.H
P.M <- 0.63
P.m <- 1 - P.M
P.HM <- 0.05 #complete code
P.Hm <- P.H - P.HM
P.hM <- P.M - P.HM
P.hm <- 1 - P.HM - P.Hm - P.hM
hvirus$probs <- c(P.hm, P.Hm, P.hM, P.HM)
```
Calculate the marginal probabilities for presence of marker and Hantavirus.
```{r hmarginals}
marginal(hvirus, vars = "marker")
marginal(hvirus, vars = "Hanta")
```
Determine if marker and hantavirus are independent.
```{r hmIndep}
# Is the product of the marginals equal to the joint probability?
P.H * P.M == P.HM
P.h * P.m == P.hm
```
Calculate the probability that a mouse known to have the marker has Hantavirus.
```{r hmConditional}
# Create events for virus and marker
H <- subset(hvirus, Hanta == "H")
M <- subset(hvirus, marker == "M")
#Complete code below
Prob(H, given = M)
# Corroborate that sum of conditionals is 1
h <- subset(hvirus, Hanta == "h")
Prob(h, given = M)
Prob(H, given = M) + Prob(h, given = M)
```
The relationship between marker and Hantavirus remains constant across mice populations. This means that the probability that a mouse known to have the marker has Hantavirus is the same across populations. A random mouse is selected from a new population where Hantavirus incidence is 30% and it is determined that it has the marker. What is the probability that it has Hantavirus? Use Bayes' rule. For the denominator, use the law of total probability. You are asked to calculate the posterior P(H | M) with prior P(H) = 0.30.
```{r HvPosterior}
P.H <- 0.30 # New prior probability
# make sure to use P.H and 1 - P.h
P.H * Prob(M, given = H)/(P.H * Prob(M, given = H) + (1 - P.H) * Prob(M, given = h))
```
### 7. Binomial Distribution. [10]
You want to have 10 tomato seedlings to transplant. You place 50 seeds in germinations trays. It is known that the probability that the seeds will germinate and become seedlings is 0.20. What is the probability that you will have enough seedlings? Keep in mind that if 10 or more than 10 seeds germinate, you will have enough seedlings.
We assume that the probability of germination is constant, and that seeds behave independently.
```{r BinomLab}
help(pbinom)
# Use pbinom() because it gives the cumulative probability
# Keep in mind that the lower tail includes the q specified
# Complete the code below
pbinom(q = 9, size = 50, prob = 0.2, lower.tail = FALSE)
## q is 9 because to have the right tail of the pdf we need to set lower.tail = FALSE,
## which means that the returned value is the P[X > x]. As we want the probability of 10 or more, we ask for greater than 9.
```
> Your answer here: 0.5562596
Plot the binomial distribution for p = 0.20 and n = 50.
```{r BinomPlotLab}
# Complete the code below
plot(dbinom(x = 0:50, size = 50, prob = 0.5) ~ c(0:50), type = "h")
```
What is the smallest number of seeds you need to start with if you want to have at least 95% probability of having enough seedlings? You can add seeds to the calculation until you get the probability needed.
```{r numberSeeds}
# Complete the code below
# For 50 seeds the probability of 10 or more is
pbinom(q = 9, size = 50, prob = 0.2, lower.tail = FALSE)
# Graphical solution
plot(
pbinom(q = 9,
size = 60:80,
prob = 0.2,
lower.tail = FALSE) ~
c(60:80),
type = "S",
ylab = "P(10 or more seeds germinating)")
# Numerical solution
c(60:80)[which(
pbinom(q = 9,
size = 60:80,
prob = 0.2,
lower.tail = FALSE) >=
0.95)]
```
> Your answer here: 76 seeds
### 8. Poisson Distribution. [10]
High tensile wire wire for fences can have weak points where it breaks when fencing tension is applied. Wire is sold in 1000 ft rolls. Assuming that the number of weak points has a Poisson distribution with $\lambda = 0.00015$ weak spots per foot, what is the probability of getting a roll with no weak spots?
```{r PoissonWire1}
lambda <- 0.00015 # per foot
s = 1000 # feet per roll
#complete the code below
rate <- s * lambda # rate per roll
dpois(x = 0, lambda = rate)# Hint: use help(dpois)!
```
> Your answer here: 0.860708
How low should be the rate of weak spots, as controlled by the manufacturing process, to make sure that less than five rolls per 100 have one or more weak spots?
Recall $$P(0) = \frac{e^{-\mu} \ \mu^{-0}}{0!}$$, therefore,
```{r PoissonWire2}
# Desired P(0) >= 0.95
-log(0.95)/s # analytical solution using Poisson distribution equation
dpois(x = 0, lambda = 100:1/200)
max(c(100:1/200)[which(dpois(x = 0, lambda = 100:1/200) >= 0.95)])
# finer numerical search for solution
guesses <- seq(from = 0.05, to = 0.055, by = 0.001)
max(c(guesses)[which(dpois(x = 0, lambda = guesses) >= 0.95)])/s
```
> Your answer here: 5.1e-05
### 9. Normal distribution. [10]
Plot three normal distributions on the same axes. One with mean 0 and variance 1, one with mean 0 and variance 2 and one with mean 1 and variance 1. Describe the effects of mean and variance on the shape and location of the curve.
```{r normDist1}
# check the help file of the dnorm function to see which arguments are set by default.
curve(dnorm, from = -5.5, to = 5.5)
curve(dnorm(x, sd = 2), from = -5.5, to = 5.5, add = TRUE, col = "chocolate", lwd = 2)
curve(dnorm(x, mean = 1), from = -5.5, to = 5.5, add = TRUE, col = "skyblue", lwd = 2)
```
> Your answer here: By increasing the mean, the curve moves to the right without changing its shape. By increasing the variance, we make the curve wider and shorter.
What is the probability that a normal random variable with mean 5 and variance 2 has values that are either below 2 or above 7? Between 1 and 4? Between 1 and 4 or more than 7? Between 1 and 4 or more than 3?
```{r normDist2}
# P(Z < 2 or Z > 7)
# Mutually exclusive events. Add the separate probabilities.
# Complete the code below
left.area <- pnorm(q = 2 , mean = 5, sd = sqrt(2))
right.area <- pnorm(q = 7, mean = 5, sd = sqrt(2), lower.tail = FALSE)
left.area + right.area
# Now for other values:
# P(1 < Z < 4)
# between 1 and 4 is the complementary of lower than one and higher than 4
left.area <- pnorm(q = 1 , mean = 5, sd = sqrt(2))
right.area <- pnorm(q = 4, mean = 5, sd = sqrt(2), lower.tail = FALSE)
1 - (left.area + right.area)
#or:
pnorm(q = 4, mean = 5, sd = sqrt(2)) - pnorm(q = 1 , mean = 5, sd = sqrt(2))
# P(1 < Z < 4 or Z > 7)
# we only need to add the P( Z > 7)
pH7 <- pnorm(q = 7, mean = 5, sd = sqrt(2), lower.tail = FALSE)
1 - (left.area + right.area) + pH7
# P(1 < Z < 4 or Z > 3)
# here we need to add the P( Z > 3) and substract the P (3 < Z < 4)
# or simplify it to P (Z > 1)
pnorm(q = 1, mean = 5, sd = sqrt(2), lower.tail = FALSE)
```
> Your answers here:
> either below 2 or above 7: 0.09559703
> Between 1 and 4: 0.2374112
> Between 1 and 4 or more than 7: 0.3160608
> Between 1 and 4 or more than 3: 0.9976611
Make and print a table of probabilities for the Standard Normal distribution. Make z values vary between 0 and 3.5 in steps of 0.01. This will be a table of probabilities representing the area in the right tail of the curve. Bring it to the tests.
```{r normDist3}
zTable <- matrix(1-pnorm(seq(0, 3.49, 0.01)), nrow = 35, ncol = 10, byrow = T)
row.names(zTable) <- 0:34/10
colnames(zTable) <- 0:9/100
kable(zTable, digits = 6, caption = "Right tail areas under the standard normal distribution. For left-tail areas and other calculations consider that the function is symmetric about 0.") %>%
kable_styling(full_width = FALSE )
```
### 10. Student's t distribution [10]
Make and print a table of quantiles for Student's t distribution. The table should have those values of t that leave areas under the right tail equal to 0.15, 0.10. 0.05. 0.025, 0.01, and 0.005, for 1 to 30 degrees of freedom in steps of 1 and 30 to 42 df in steps of 2.
```{r tTable}
# insert the p-values (areas under the right tail of the curve)
p.values <- c(0.15, 0.10, 0.05, 0.025, 0.01, 0.005)
df <- c(1:30, 16:21*2)
tTable <- matrix(nrow = length(df), ncol = length(p.values))
for(j in 1:6){ #do the instructions below for j = 1, 2, 3, and so on
tTable[,j] <- qt(p.values[j], df = df) #fill the column with the values for this p-value.
}
row.names(tTable) <- df
colnames(tTable) <- p.values
kable(tTable, digits = 5, caption = "Values of Student's t with row degrees of freedom that have the column probability under the right tail. For left-tail areas and other calculations consider that the function is symmetric about 0.") %>%
kable_styling(full_width = FALSE )
```
> **Remember to save the table, print it and bring it to the test!**
### Knit this file into html. [5]