From 8c487ba63be59f9412538f40733ce5e2d44f8306 Mon Sep 17 00:00:00 2001 From: Matthew Brett Date: Fri, 4 Aug 2023 15:32:56 +0100 Subject: [PATCH 1/9] Updating correlation_causation --- source/correlation_causation.Rmd | 275 +++++++++++++++---------------- 1 file changed, 129 insertions(+), 146 deletions(-) diff --git a/source/correlation_causation.Rmd b/source/correlation_causation.Rmd index 8be035ac..b8a4db7e 100644 --- a/source/correlation_causation.Rmd +++ b/source/correlation_causation.Rmd @@ -248,147 +248,130 @@ difference* refer to *groups of individuals* , and in such a situation it does make sense to ask whether or not two groups are observably different from each other. -**Example 23-1: Is Athletic Ability Directly Related to Intelligence?** -(Is There Correlation Between Two Variables or Are They Independent?) -(Program "Ability1") - -A scientist often wants to know whether or not two characteristics go -together, that is, whether or not they are correlated (that is, related -or associated). For example, do youths with high athletic ability tend -to also have high I.Q.s? - -Hypothetical physical-education scores of a group of ten high-school -boys are shown in Table 23-1, ordered from high to low, along with the -I.Q. score for each boy. The ranks for each student's athletic and I.Q. -scores are then shown in columns 3 and 4. - -Table 23-1 - -**Hypothetical Athletic and I.Q. Scores for High School Boys** - - -------------------- ---------------- ------------------- -------------- - **Athletic Score** **I.Q. Score** **Athletic Rank** **I.Q.Rank** - **(1)** **(2)** **(3)** **(4)** - 97 114 1 3 - 94 120 2 1 - 93 107 3 7 - 90 113 4 4 - 87 118 5 2 - 86 101 6 8 - 86 109 7 6 - 85 110 8 5 - 81 100 9 9 - 76 99 10 10 - -------------------- ---------------- ------------------- -------------- - -We want to know whether a high score on athletic ability tends to be -found along with a high I.Q. score more often than would be expected by -chance. 
Therefore, our strategy is to see how often high scores on -*both* variables are found by chance. We do this by disassociating the -two variables and making two separate and independent universes, one -composed of the athletic scores and another of the I.Q. scores. Then we -draw pairs of observations from the two universes at random, and compare -the experimental patterns that occur by chance to what actually is -observed to occur in the world. - -The first testing scheme we shall use is similar to our first approach -to the pig rations — splitting the results into just "highs" and "lows." -We take ten cards, one of each denomination from "ace" to "10," shuffle, -and deal five cards to correspond to the first five athletic ranks. The -face values then correspond to the - -I.Q. ranks. Under the benchmark hypothesis the athletic ranks will not -be associated with the I.Q. ranks. Add the face values in the first five -cards in each trial; the first hand includes 2, 4, 5, 6, and 9, so the -sum is 26. Record, shuffle, and repeat perhaps ten times. Then compare -the random results to the sum of the observed ranks of the five top -athletes, which equals 17. - -The following steps describe a slightly different procedure than that -just described, because this one may be easier to understand: - -**Step 1.** Convert the athletic and I.Q. scores to ranks. Then -constitute a universe of spades, "ace" to "10," to correspond to the -athletic ranks, and a universe of hearts, "ace" to "10," to correspond -to the IQ ranks. - -**Step 2.** Deal out the well-shuffled cards into pairs, each pair with -an athletic score and an I.Q. score. - -**Step 3.** Locate the cards with the top five athletic ranks, and add -the I.Q. rank scores on their paired cards. Compare this sum to the -observed sum of 17. If 17 or less, indicate "yes," otherwise "no." (Why -do we use "17 or less" rather than "less than 17"? Because we are asking -the probability of a score *this low or lower* .) 
- -**Step 4.** Repeat steps 2 and 3 ten times. - -**Step 5.** Calculate the proportion "yes." This estimates the -probability sought. - -In Table 23-2 we see that the observed sum (17) is lower than the sum of -the top 5 ranks in all but one (shown by an asterisk) of the ten random -trials (trial 5), which suggests that there is a good chance (9 in 10) -that the five best athletes will not have I.Q. scores that high by -chance. But it might be well to deal some more to get a more reliable -average. We add thirty hands, and thirty-nine of the total forty hands -exceed the observed rank value, so the probability that the observed -correlation of athletic and I.Q. scores would occur by chance is about - -.025. In other words, if there is no real association between the -variables, the probability that the top 5 ranks would sum to a number -this low or lower is only 1 in 40, and it therefore seems reasonable to +### Example: Is Athletic Ability Directly Related to Intelligence? + +::: {.notebook name="athlete_iq" title="Athletic ability and IQ"} + +Question: **is There Correlation Between Two Variables or Are They +Independent?** + +A scientist often wants to know whether or not two characteristics go together, +that is, whether or not they are correlated (that is, related or associated). +For example, do youths with high athletic ability tend to also have high I.Q.s? + +Hypothetical physical-education scores of a group of ten high-school boys are +shown in @tbl-physical-mental, ordered from high to low, along with the I.Q. +score for each boy. The ranks for each student's athletic and I.Q. scores are +then shown in the third and fourth columns: + +| Athletic Score | I.Q. 
Score | Athletic Rank | I.Q. Rank |
+|--------------------|----------------|-------------------|--------------|
+| 97                 | 114            | 1                 | 3            |
+| 94                 | 120            | 2                 | 1            |
+| 93                 | 107            | 3                 | 7            |
+| 90                 | 113            | 4                 | 4            |
+| 87                 | 118            | 5                 | 2            |
+| 86                 | 101            | 6                 | 8            |
+| 86                 | 109            | 7                 | 6            |
+| 85                 | 110            | 8                 | 5            |
+| 81                 | 100            | 9                 | 9            |
+| 76                 | 99             | 10                | 10           |
+
+: Hypothetical athletic and I.Q. scores for high school boys {#tbl-physical-mental}
+
+We want to know whether a high score on athletic ability tends to be found
+along with a high I.Q. score more often than would be expected by chance.
+Therefore, our strategy is to see how often high scores on *both* variables are
+found by chance. We do this by disassociating the two variables and making two
+separate and independent universes, one composed of the athletic scores and
+another of the I.Q. scores. Then we draw pairs of observations from the two
+universes at random, and compare the experimental patterns that occur by chance
+to what actually is observed to occur in the world.
+
+The first testing scheme we shall use is similar to our first approach to the
+pig rations — splitting the results into just "highs" and "lows." We take ten
+cards, one of each denomination from "ace" to "10," shuffle, and deal five
+cards to correspond to the first five athletic ranks. The face values then
+correspond to the I.Q. ranks. Under the benchmark hypothesis the athletic ranks
+will not be associated with the I.Q. ranks. Add the face values in the first
+five cards in each trial; the first hand includes 2, 4, 5, 6, and 9, so the sum
+is 26. Record, shuffle, and repeat perhaps ten times. Then compare the random
+results to the sum of the observed ranks of the five top athletes, which equals
+17.
+
+The following steps describe a slightly different procedure than that just
+described, because this one may be easier to understand:
+
+* **Step 1.** Convert the athletic and I.Q. scores to ranks. 
Then constitute a + universe of spades, "ace" to "10," to correspond to the athletic ranks, and a + universe of hearts, "ace" to "10," to correspond to the IQ ranks. +* **Step 2.** Deal out the well-shuffled cards into pairs, each pair with an + athletic score and an I.Q. score. +* **Step 3.** Locate the cards with the top five athletic ranks, and add the + I.Q. rank scores on their paired cards. Compare this sum to the observed sum + of 17. If 17 or less, indicate "yes," otherwise "no." (Why do we use "17 or + less" rather than "less than 17"? Because we are asking the probability of a + score *this low or lower* .) +* **Step 4.** Repeat steps 2 and 3 ten times. +* **Step 5.** Calculate the proportion "yes." This estimates the probability + sought. + +In @tbl-ability-trials we see that the observed sum (17) is lower than the sum +of the top 5 ranks in all but one (shown in bold) of the ten random +trials (trial 5), which suggests that there is a good chance (9 in 10) that the +five best athletes will not have I.Q. scores that high by chance. But it might +be well to deal some more to get a more reliable average. We add thirty hands, +and thirty-nine of the total forty hands exceed the observed rank value, so the +probability that the observed correlation of athletic and I.Q. scores would +occur by chance is about .025. In other words, if there is no real association +between the variables, the probability that the top 5 ranks would sum to a +number this low or lower is only 1 in 40, and it therefore seems reasonable to believe that high athletic ability tends to accompany a high I.Q. 
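The 1-in-40 estimate from dealing hands can also be checked exactly. Under the benchmark hypothesis every set of five I.Q. ranks is equally likely, so we can simply enumerate all of them. This short Python sketch is an editorial aside, not one of the book's programs:

```python
from itertools import combinations

# Every equally likely hand of 5 I.Q. ranks dealt from the ranks 1 through 10.
all_hands = list(combinations(range(1, 11), 5))
# Hands whose rank sum is as low as the observed 17, or lower.
low_hands = [hand for hand in all_hands if sum(hand) <= 17]
exact_p = len(low_hands) / len(all_hands)
print(len(low_hands), 'of', len(all_hands), 'hands; exact p =', exact_p)
```

Only 4 of the 252 possible hands sum to 17 or less, an exact probability of about 0.016 — the same order as the roughly .025 estimated from 40 dealt hands.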
-Table 23-2 - -**Results of 40 Random Trials of The Problem "Ability"** - -(Note: Observed sum of IQ ranks: 17) - - ----------- --------------------- --------------- - **Trial** **Sum of IQ Ranks** **Yes or No** - 1 26 No - 2 23 No - 3 22 No - 4 37 No - \* 5 16 Yes - 6 22 No - 7 22 No - 8 28 No - 9 38 No - 10 22 No - 11 35 No - 12 36 No - 13 31 No - 14 29 No - 15 32 No - 16 25 No - 17 25 No - 18 29 No - 19 25 No - 20 22 No - 21 30 No - 22 31 No - 23 35 No - 24 25 No - 25 33 No - 26 30 No - 27 24 No - 28 29 No - 29 30 No - 30 31 No - 31 30 No - 32 21 No - 33 25 No - 34 19 No - 35 29 No - 36 23 No - 37 23 No - 38 34 No - 39 23 No - 40 26 No - ----------- --------------------- --------------- +| Trial | Sum of IQ Ranks | <= observed (17) | +|-----------|---------------------|------------------| +| 1 | 26 | No +| 2 | 23 | No +| 3 | 22 | No +| 4 | 37 | No +| **5** | **16** | **Yes** +| 6 | 22 | No +| 7 | 22 | No +| 8 | 28 | No +| 9 | 38 | No +| 10 | 22 | No +| 11 | 35 | No +| 12 | 36 | No +| 13 | 31 | No +| 14 | 29 | No +| 15 | 32 | No +| 16 | 25 | No +| 17 | 25 | No +| 18 | 29 | No +| 19 | 25 | No +| 20 | 22 | No +| 21 | 30 | No +| 22 | 31 | No +| 23 | 35 | No +| 24 | 25 | No +| 25 | 33 | No +| 26 | 30 | No +| 27 | 24 | No +| 28 | 29 | No +| 29 | 30 | No +| 30 | 31 | No +| 31 | 30 | No +| 32 | 21 | No +| 33 | 25 | No +| 34 | 19 | No +| 35 | 29 | No +| 36 | 23 | No +| 37 | 23 | No +| 38 | 34 | No +| 39 | 23 | No +| 40 | 26 | No + +: 40 random trials of the problem athletic / IQ problem {#tbl-ability-trials} The RESAMPLING STATS program "Ability1" creates an array containing the I.Q. rankings of the top 5 students in athletics. The SUM of these I.Q. @@ -1440,13 +1423,13 @@ The players did all kinds of fancy calculating, using a wild variety of assumptions — although there was no possible way that the figuring could help them. -When I stopped the game before completing the 10 buy-andsell sessions -they expected, subjects would ask that the game go on. 
Then I would tell -them that there was no basis to believe that there were patterns in the -data, because the "prices" were just randomly-generated numbers. Winning -or losing therefore did not depend upon the subjects' skill. -Nevertheless, they demanded that the game not stop until the 10 "weeks" -had been played, so they could find out whether they "won" or "lost." +When I stopped the game before completing the 10 buy-and-sell sessions they +expected, subjects would ask that the game go on. Then I would tell them that +there was no basis to believe that there were patterns in the data, because the +"prices" were just randomly-generated numbers. Winning or losing therefore did +not depend upon the subjects' skill. Nevertheless, they demanded that the game +not stop until the 10 "weeks" had been played, so they could find out whether +they "won" or "lost." This study of batting illustrates how one can test for independence among various trials. The trials are independent if each observation is From 748f46a3151d6315a88af48589025edd3e169cc2 Mon Sep 17 00:00:00 2001 From: Matthew Brett Date: Sat, 5 Aug 2023 10:57:19 +0100 Subject: [PATCH 2/9] Start code for correlation causation --- source/correlation_causation.Rmd | 327 ++++++++++++++++++------------- source/images/27-Chap-23_000.png | Bin 37982 -> 0 bytes 2 files changed, 191 insertions(+), 136 deletions(-) delete mode 100644 source/images/27-Chap-23_000.png diff --git a/source/correlation_causation.Rmd b/source/correlation_causation.Rmd index b8a4db7e..7b9aa84f 100644 --- a/source/correlation_causation.Rmd +++ b/source/correlation_causation.Rmd @@ -1,22 +1,19 @@ - --- +--- +resampling_with: + ed2_fname: 27-Chap-23 jupyter: jupytext: - metadata_filter: - notebook: - additional: all - excluded: - - language_info + notebook_metadata_filter: all,-language_info + split_at_heading: true text_representation: extension: .Rmd format_name: rmarkdown - format_version: '1.0' - jupytext_version: 0.8.6 + format_version: '1.2' + 
jupytext_version: 1.14.6 kernelspec: - display_name: Python 3 + display_name: Python 3 (ipykernel) language: python name: python3 -resampling_with: - ed2_fname: 27-Chap-23 --- ```{r setup, include=FALSE} @@ -250,10 +247,8 @@ other. ### Example: Is Athletic Ability Directly Related to Intelligence? -::: {.notebook name="athlete_iq" title="Athletic ability and IQ"} - -Question: **is There Correlation Between Two Variables or Are They -Independent?** +Question: **Is there correlation between two variables or are they +independent?** A scientist often wants to know whether or not two characteristics go together, that is, whether or not they are correlated (that is, related or associated). @@ -311,7 +306,7 @@ described, because this one may be easier to understand: I.Q. rank scores on their paired cards. Compare this sum to the observed sum of 17. If 17 or less, indicate "yes," otherwise "no." (Why do we use "17 or less" rather than "less than 17"? Because we are asking the probability of a - score *this low or lower* .) + score *this low or lower*.) * **Step 4.** Repeat steps 2 and 3 ten times. * **Step 5.** Calculate the proportion "yes." This estimates the probability sought. @@ -330,142 +325,202 @@ believe that high athletic ability tends to accompany a high I.Q. 
| Trial | Sum of IQ Ranks | <= observed (17) | |-----------|---------------------|------------------| -| 1 | 26 | No -| 2 | 23 | No -| 3 | 22 | No -| 4 | 37 | No -| **5** | **16** | **Yes** -| 6 | 22 | No -| 7 | 22 | No -| 8 | 28 | No -| 9 | 38 | No -| 10 | 22 | No -| 11 | 35 | No -| 12 | 36 | No -| 13 | 31 | No -| 14 | 29 | No -| 15 | 32 | No -| 16 | 25 | No -| 17 | 25 | No -| 18 | 29 | No -| 19 | 25 | No -| 20 | 22 | No -| 21 | 30 | No -| 22 | 31 | No -| 23 | 35 | No -| 24 | 25 | No -| 25 | 33 | No -| 26 | 30 | No -| 27 | 24 | No -| 28 | 29 | No -| 29 | 30 | No -| 30 | 31 | No -| 31 | 30 | No -| 32 | 21 | No -| 33 | 25 | No -| 34 | 19 | No -| 35 | 29 | No -| 36 | 23 | No -| 37 | 23 | No -| 38 | 34 | No -| 39 | 23 | No -| 40 | 26 | No +| 1 | 26 | No | +| 2 | 23 | No | +| 3 | 22 | No | +| 4 | 37 | No | +| **5** | **16** | **Yes** | +| 6 | 22 | No | +| 7 | 22 | No | +| 8 | 28 | No | +| 9 | 38 | No | +| 10 | 22 | No | +| 11 | 35 | No | +| 12 | 36 | No | +| 13 | 31 | No | +| 14 | 29 | No | +| 15 | 32 | No | +| 16 | 25 | No | +| 17 | 25 | No | +| 18 | 29 | No | +| 19 | 25 | No | +| 20 | 22 | No | +| 21 | 30 | No | +| 22 | 31 | No | +| 23 | 35 | No | +| 24 | 25 | No | +| 25 | 33 | No | +| 26 | 30 | No | +| 27 | 24 | No | +| 28 | 29 | No | +| 29 | 30 | No | +| 30 | 31 | No | +| 31 | 30 | No | +| 32 | 21 | No | +| 33 | 25 | No | +| 34 | 19 | No | +| 35 | 29 | No | +| 36 | 23 | No | +| 37 | 23 | No | +| 38 | 34 | No | +| 39 | 23 | No | +| 40 | 26 | No | : 40 random trials of the problem athletic / IQ problem {#tbl-ability-trials} -The RESAMPLING STATS program "Ability1" creates an array containing the -I.Q. rankings of the top 5 students in athletics. The SUM of these I.Q. -rankings constitutes the observed result to be tested against -randomly-drawn samples. We observe that the actual I.Q. rankings of the -top five athletes sums to 17. 
The more frequently that the sum of 5
-randomly-generated rankings (out of 10) is as low as this observed
-number, the higher is the probability that there is no relationship
-between athletic performance and I.Q. based on these data.
+In fact we can apply an even simpler procedure to get the same result, by reasoning about the individual trial.
 
-First we record the NUMBERS "1" through "10" into vector
+One trial in our procedure is:
 
-A. Then we SHUFFLE the numbers so the rankings are in a random order.
-Then TAKE the first 5 of these numbers and put them in another array, D,
-and SUM them, putting the result in E. We repeat this procedure 1000
-times, recording each result in a scorekeeping vector: Z. Graphing Z, we
-get a HIS- TOGRAM that shows us how often our randomly assigned sums are
-equal to or below 17.
+* **Step 2.** Deal out the well-shuffled cards into pairs, each pair with an
+  athletic score and an I.Q. score.
+* **Step 3.** Locate the cards with the top five athletic ranks, and add the
+  I.Q. rank scores on their paired cards. Compare this sum to the observed sum
+  of 17. If 17 or less, indicate "yes," otherwise "no." (Why do we use "17 or
+  less" rather than "less than 17"? Because we are asking the probability of a
+  score *this low or lower*.)
+
+Now consider the 5 IQ rank cards. In the procedure above, we found these by
+*first* pairing the athletic ranks and the IQ ranks, *then* selecting the IQ
+ranks corresponding to the top 5 athletic ranks. A little thought may persuade
+you that, by doing this, we have a *random selection of 5 IQ ranks*. We got
+that random selection by pairing, selecting on athletic rank — but the initial
+pairing and selection will do nothing other than give us one particular set of
+randomly chosen 5 IQ rank cards. So we can simplify our procedure even further
+by leaving out the pairing and selecting-by-rank steps; we can just shuffle the
+IQ rank cards and deal out 5 to be our randomly selected IQ ranks.
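A quick simulation makes this equivalence concrete. The sketch below (an editorial aside, not one of the book's programs) runs both procedures — pairing and then selecting on athletic rank, and plain shuffling — and shows that they give essentially the same probability of a rank sum of 17 or less:

```python
import random

def sum_top5_via_pairing():
    # Pair a shuffled deck of I.Q. ranks with athletic ranks 1-10, then
    # add up the I.Q. ranks that landed with athletic ranks 1 through 5.
    iq = list(range(1, 11))
    random.shuffle(iq)
    paired = dict(zip(range(1, 11), iq))  # athletic rank -> I.Q. rank
    return sum(paired[athletic] for athletic in range(1, 6))

def sum_top5_direct():
    # Skip the pairing: shuffle the I.Q. ranks and deal out the first 5.
    iq = list(range(1, 11))
    random.shuffle(iq)
    return sum(iq[:5])

n = 20_000
p_paired = sum(sum_top5_via_pairing() <= 17 for _ in range(n)) / n
p_direct = sum(sum_top5_direct() <= 17 for _ in range(n)) / n
print(p_paired, p_direct)  # The two proportions should be nearly identical.
```

Both proportions come out at around 2%, differing from each other only by sampling variation.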
+::: {.notebook name="athlete_iq" title="Athletic ability and IQ"}
+
+::: nb-only
+Question: **Is there correlation between two variables or are they
+independent?**
+
+We answer this by permuting the athletic ability and IQ *ranks*, pairing them,
+and seeing whether the sum of the IQ ranks (randomly) paired with the top 5
+athletic scores is less than or equal to the observed value of 17.
+:::
+
+To simulate this problem in {{< var lang >}}, we first create
+{{< var an_array >}} containing the I.Q. rankings of the top 5 students in
+athletics. The `sum` of these I.Q. rankings constitutes the observed result to
+be tested against randomly-drawn samples. We observe that the actual I.Q.
+rankings of the top five athletes sum to 17. The more frequently that the sum
+of 5 randomly-generated rankings (out of 10) is as low as this observed number,
+the higher is the probability that there is no relationship between athletic
+performance and I.Q. based on these data.
+
+First we record the numbers 1 through 10 into {{< var array >}} `iq_ranks`.
+Then we shuffle the numbers so the rankings are in a random order. Then we
+select the first 5 of these numbers and put them in another
+{{< var array >}}, `top_5`, and `sum` them, putting the result in
+`top_5_sum`. We repeat this procedure `N = 10000` times, recording each result
+in a scorekeeping vector: `z`. Graphing `z`, we get a histogram that shows us
+how often our randomly assigned sums are equal to or below 17.
+
+```{python opts.label="py_ed"}
+import numpy as np
+rnd = np.random.default_rng()
+import matplotlib.pyplot as plt
+```
+
+```{python opts.label="py_ed"}
+# Number of repeats.
+N = 10000
+# The IQ ranks, ready for shuffling.
+iq_ranks = np.arange(1, 11)  # 1 through 10.
+# Scorekeeping array.
+z = np.zeros(N)
+
+for i in range(N):
+    # Shuffle the ranks.
+    shuffled = rnd.permuted(iq_ranks)
+    # Take the first 5 ranks.
+    top_5 = shuffled[:5]
+    # Sum those ranks.
+    top_5_sum = np.sum(top_5)
+    # Keep track of the result of each trial.
+    z[i] = top_5_sum
+    # End the experiment, go back and repeat.
+
+# Produce a histogram of trial results.
+# Make the bins be the integers from 10 through 45.
+plt.hist(z, bins=np.arange(10, 46))
+plt.title('Sums of 5 ranks selected at random');
+```
+
+```{r opts.label="r_ed"}
+# Number of repeats.
+N <- 10000
+# The IQ ranks, ready for shuffling.
+iq_ranks <- 1:10  # 1 through 10.
+# Scorekeeping array.
+z <- numeric(N)
+
+for (i in 1:N) {
+  # Shuffle the ranks.
+  shuffled <- sample(iq_ranks)
+  # Take the first 5 ranks.
+  top_5 <- shuffled[1:5]
+  # Sum those ranks.
+  top_5_sum <- sum(top_5)
+  # Keep track of the result of each trial.
+  z[i] <- top_5_sum
+  # End the experiment, go back and repeat.
+}
+
+# Produce a histogram of trial results.
+# Make the bins be the integers from 10 through 45.
+hist(z, breaks=10:45, main='Sums of 5 ranks selected at random')
+```
+
+```{r echo=FALSE, eval=TRUE}
+le_pct <- round(sum(get_var('z') <= 17) / get_var('N') * 100)
+```
+
+We see that in only about `r le_pct`% of the trials did random selection of
+ranks produce a total of 17 or lower. 
{{< var lang >}} can calculate this for us directly: +```{python} +# Determine how many trials produced sums of ranks <= 17 by chance. +k = np.sum(z <= 17) +# The proportion. +kk = k / N +# Show the result +kk +``` - - -```matlab -' Program file: "ability1.rss" - -COUNT z <= 17 k -' Determine how many trials produced sums of ranks \<= 17 by chance. -DIVIDE k 1000 kk -' Convert to a proportion. -PRINT kk -' Print the results. - -' Note: The file "ability1" on the Resampling Stats software disk contains -' this set of commands. +```{r} +# Determine how many trials produced sums of ranks <= 17 by chance. +k <- sum(z <= 17) +# The proportion. +kk <- k / N +# Show the result +kk ``` -Why do we sum the ranks of the first *five* athletes and compare them -with the second five athletes, rather than comparing the top three, say, -with the bottom seven? Indeed, we could have looked at the top three, -two, four, or even six or seven. The first reason for splitting the -group in half is that an even split uses the available information more -fully, and therefore we obtain greater efficiency. (I cannot prove this -formally here, but perhaps it makes intuitive sense to you.) A second -reason is that getting into the habit of always looking at an even split -reduces the chances that you will pick and choose in such a manner as to -fool yourself. For example, if the I.Q. ranks of the top five athletes -were 3, 2, 1, 10, and 9, we would be deceiving ourselves if, after -looking the data over, we drew the line between athletes 3 and 4. (More -generally, choosing an appropriate measure before examining the data -will help you avoid fooling yourself in such matters.) - -A simpler but less efficient approach to this same problem is to -classify the top-half athletes by whether or not they were also in the -top half of the I.Q. scores. Of the first five athletes actually -observed, *four* were in the top five I.Q. scores. 
We can then shuffle -five black and five red cards and see how often four or more (that is, -four or five) blacks come up with the first five cards. The proportion -of times that four or more blacks occurs in the trial is the probability -that an association as strong as that observed might occur by chance -even if there is no association. Table 23-3 shows a proportion of five +Why do we sum the ranks of the first *five* athletes rather than taking the sum +of the top three, say? Indeed, we could have looked at the top three, two, +four, or even six or seven. The first reason for splitting the group in half is +that an even split uses the available information more fully, and therefore we +obtain greater efficiency. (I cannot prove this formally here, but perhaps it +makes intuitive sense to you.) A second reason is that getting into the habit +of always looking at an even split reduces the chances that you will pick and +choose in such a manner as to fool yourself. For example, if the I.Q. ranks of +the top five athletes were 3, 2, 1, 10, and 9, we would be deceiving ourselves +if, after looking the data over, we drew the line between athletes 3 and 4. +(More generally, choosing an appropriate measure before examining the data will +help you avoid fooling yourself in such matters.) + +A simpler but less efficient approach to this same problem is to classify the +top-half athletes by whether or not they were also in the top half of the I.Q. +scores. Of the first five athletes actually observed, *four* were in the top +five I.Q. scores. We can then shuffle five black and five red cards and see how +often four or more (that is, four or five) blacks come up with the first five +cards. The proportion of times that four or more blacks occurs in the trial is +the probability that an association as strong as that observed might occur by +chance even if there is no association. Table 23-3 shows a proportion of five trials out of twenty. 
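The black-and-red card scheme is easy to simulate as well. This Python sketch (again an editorial aside, not one of the book's programs) deals the shuffled cards many times and counts how often four or five blacks land with the top five athletes:

```python
import random

# Five black cards mark the athletes who are also in the top half of I.Q.;
# five red cards mark those who are not.
deck = ["black"] * 5 + ["red"] * 5
n = 10_000
count_4_or_more = 0
for _ in range(n):
    random.shuffle(deck)
    dealt = deck[:5]  # The cards falling with the top five athletes.
    if dealt.count("black") >= 4:  # As strong as the observed 4 out of 5.
        count_4_or_more += 1
proportion = count_4_or_more / n
print(proportion)
```

With many trials the proportion settles near 0.10; the twenty hand-dealt trials reported in the text give the rougher estimate of 5/20.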
In the RESAMPLING STATS program "Ability2" we first note that the top 5 @@ -572,12 +627,12 @@ arbitrarily be placed on the x-axis, whether you intend to suggest causation or not.) -```{r fig-hypot_iq_athletic_1, opts.label='svg_fig', fig.cap="Hypothetical Scores for I.Q. and Athletic Ability — 1"} +```{r fig-hypot_iq_athletic_1, opts.label="svg_fig", fig.cap="Hypothetical Scores for I.Q. and Athletic Ability — 1"} include_svg('diagrams/hypot_iq_athletic_1.svg') ``` -```{r fig-hypot_iq_athletic_2, opts.label='svg_fig', fig.cap="Hypothetical Scores for I.Q. and Athletic Ability — 2"} +```{r fig-hypot_iq_athletic_2, opts.label="svg_fig", fig.cap="Hypothetical Scores for I.Q. and Athletic Ability — 2"} include_svg('diagrams/hypot_iq_athletic_2.svg') ``` @@ -587,7 +642,7 @@ than in @fig-hypot_iq_athletic_1. On the basis of @fig-given_iq_athletic alone, one can say only that there *might* be some association between the two variables. -```{r fig-given_iq_athletic, opts.label='svg_fig', fig.cap="Given Scores for I.Q. and Athletic Ability"} +```{r fig-given_iq_athletic, opts.label="svg_fig", fig.cap="Given Scores for I.Q. 
and Athletic Ability"}
 include_svg('diagrams/given_iq_athletic.svg')
 ```
 
diff --git a/source/images/27-Chap-23_000.png b/source/images/27-Chap-23_000.png
deleted file mode 100644
index bc428ca05bd7dd792c22edc716d5a52aad14a7c3..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001
z&Uoc@5s}yzDF^0YoiPDxHv+IPF`+Mmy~M;4No$%PRbM32gU9(gw^qSLHn7J`RJ%R{ zdy#;Va;?{&lJ`)~F++(+m!q8g5*!asP&!;hf#Wb7hS{_!kmwKw>!?q1^Tza|Z6(lm zfu?jil@8Yc?5Dt@C9qrso)c`BU2Q9t&&ob@^8&!xlMjzAFqpCT3F;6vr&9) ztvQ3CFRX+r;0rGV3=Iu2t33RO#fXjPUeTHSd_22Khe|9J_s~K8jpO}Nqsc5k&pIFFA@bW8raNu;saWz9Q$Ua`%yMO|f zV*#N}fe{@MB+UaRvLS2^%nO51N^AjdJ~ovAm9RMASqAXL=KvtevL47R=epXvBk|8W zH9mEHCBesno+kAP82GvM$~epICc|M*H%WBf6Gq1iCRbG!su?WC3tR-)^Z~438SpTO z4ZHA;|)eaP`KG3$uwfp%+0;$OI;+NszS3|E4(5kEWn)lv(yNz%~-Qmvs2M_P|I^ zUVaArfcGGUYzVkdy3$?(*+ zI_1-xH(MqE(Em*_uTf3X>B**l^GHNs1g5_{hd1ptkq9_|aSW(^62JPX zG@SxnP#%8lK`Dtk@}r@kcmYDfH^k)RcjEPoUM1Lc8Ym{jHZJ76sEnAL4fz4Fz!=K{ zKCEDwuG+}ITPVZRO*wae5AW(($po|eL?f)72jHC+Ka2m-wKihrrtp^50Dej-`)IdRNDmwi8UhXBtDY6&iOKs|D2XLffuJ3aA)W#8c+8IPCoFotb;2uBW}*>Yga7b)R^yd44u`yr%qJTxm% z{ZO|t1IR6LqyY~%JW(5<_HA$ohi>-IwX=AWLf^p$q&>jWGG~(G+BH0c>=m}?;@sdG z`!+WGm{P(xW`k>U&F5izKggK=XQ?uW)91&^Y^bs~pHb_qo@)TFr40<^A&f`C^H9Jl zNa}`Y(G1}X5xu&4dRh=*0b3LCLnCH;xP@m{HWjnC7*{RPhAB}MQJ9bco1bx|`AXWO z*RbJKspx68TBUMjp3SxAGv7Zq<-VT-d#*{xn`>qMXMR4%+wDKYFCVG2KY{L{WvKlw zP1AzQmPm@$P)?DE3gfqRG-}PsHejtn6v_;ap%rpg^+tJ+r{<>Z&SgV%FJMYv^CXc}?zpoOUk5CJQ z0Uxo0SdJl{5prfQAYO1MfYdtKrSF3LDfiEhefV&3aB%PtdY8~{>vT(ZNuJ2Yhfd-jg@Aid@?RF zj;g=Mr(gO25v(6#Fy+h1p`0M#oF^d7fRF^`UtO_rluhRpv?=t11bEc<6?&)o{4@K0 zDq85{!XDpFFcWqt!F|T?3*S7C@`J98G@+*iP@|lKsCZq+I57)`UKWEQbifo`%MnUle*k>Jq8|O2qd;?t>&kr0+-p5MIOLjfov{Z>n&wD&C zD`$S0;+>d12K2dk*nm2P_=Ei0=8FlA%Znwr?q=g z2UcGv?F+t6-5L4UYX662*Tk{<2jQbR{1El8L()@0t)4scigwWcg&2^b)2<)x!!2cM zQw#u8e_`!DgZs*0(4#m`)2S>!a4wg^%3nk{vgve~o?~bqOP-mTc|Q3lT_Ev)T}a@t z0yO%|+_JAeF%6ownjtnWy&@#1nT~(^wh5kp5wy_pgF)b3fY_f0Xy*u0VSo_;t68uz zkE31p<4(HrYbTV7WC%#o_e3U9;d^-!AztkPPTn3lf=V1`QG{PN@k3}MD^^vV z?t+;yZwi)gFGQku(60(CzL2__8d^X=WxW}e@f^UdBs1U$I>F5`1LMDoxIwO30o2Lq zUlIXk!FF+BAq?#S=}`(6wG2Qf<}IMmkk|l?T{TBQ$9BkcLg-ikx#S&Wy-R{CV3+Tp zx~VKUG&uO&tA5HpocstqI`1~B&hqR><=SyWk=BJlm8vVT38ecfoCSG_o)JDIrF(qq zPRUhUk8}le7TE1f6OY(1g(r&K5;DT%91LP(P17yC7PlxU{;UjzbS3abD8}BzpYZgA 
zrIZPcW%!tj?JX@xA{!~SA_tb0tLqwDCXfrTEF-#g5V8^hgdp8rOn>_TT26w=n3dr9 zNjz#2Lbm|*fZE*GW|Le7(DNAj28IHOs2pHYpRlTkZku@<8dD2s79Xy*H6R69HV)Se zu70TF44_QRFnam^+)ZK^N`9I`7b9y|TkcnD@TZX@WmUbu0C>_yF&;IudrEV04TD2N zj$m5fUXAoIOT_y&T57d1w*{?8-}AM}q1TBhRoa^d(DtP_DT#?8KvQMGla&{M==~o0 zFA$R`NUbv9!a!AeihnYy^>Ts6Vyv`x0#q}$GJq{J;9A>t6c0`n8-bb`glzQ@t^m&O zvX^gA)3Cb0`y>k*xvej(wy1C4RFNuK%w|x|d;K}j#+$=Q>jlSy)o3>Vs!0TPAdEKj zVh|w8AE*TSHFY`}I*-mj)3=MIb|mJcet4xOPJ4Ma?%qQ7jfyFy%lMBr6INWw)aa!A zHGx(n0h&+=A_)?Ukopq-N$p?exoe4(@esbs|7k)G%`1XSO_1DU{J-WKf;PMjlo8Tk ziuXt5&dDukctaXfx0xdipmrezg%DmUWj!LW>l^z?Yg3SS?6o@fg~#ma(_-Kl{#_EM zc>?`s5csyBA)0W@3*lP55A4Q@OwUix7Gh*_xXpaTUw@Tf6FtQrgT9ycR0-GFy*(Ew z(s?Uv$M+MAEid;Jhugf9g5uB))YS@`p3=`+CV(pk?&|?+3JRxf;QY|HZ^u6# znund_0Bt15ceH$Z=5}LcTmgv?!62#bK*SUm5{hY!b%Ve`2yyw13cx7`{!WQt<4TA3 zT+R;AgGU6Hld}Vjg*~8XU)X~7op;F)QvgPb=$BbHLeH&#@}zxG zp@9|ZE`aqXCFY$0Emy554TBxKxbxT}@>Bo|Bdp3LM$rsNA5p_yldF?s<`N{)wBi-s z@|Srrg~BgWHCz7`qRJpz2B;hPr=;L%8~A4=Kr`&kQA`o6&U-b5LUmA!+z5X(7F=<-6dvFiGgHv zhZY65e;+`o*>>YQ;9;|1?IeMn2;seap|XgE-yMX;$QKYk$M$)l!KW5B@x#ZD*&T1s zm@AB546jsW7&3iVJiqY&UVw>IAJylv|HcY)`#79xL>|*{vf88ZN|ven!k`R-LJnG4 zh@s=|r%*;o=%|uH*6Q_r>a~^(_cDl3zQ+R5B_dmo@s2Hi zdCB^%+N}x|0<_8L)~}ihXk2_W(Y2qIU5+i7GSBCZ;hDgsfrL=+EBo4%?mG~Wd8+Tv z!2nn>e-%Li|9GKaYL}G#=FPW%Oh_o6fPriPXtBDvD9Lkz@F+mcWB?jMSWM{kW_Cyb zVLQ_+qCA*R; z(r^7|*!U?hP3DVtsLoADRSjp#lOU@NfT}DYgBDh4FhW?56kU6F%f@W7`uDrC)HZoq z**r&D-%F@SNQWjsPy!F8TXN%*ZfP6K6>4>yuF}G{#tVvmDL*e(Cl7^bwX9@E{AbLM zss1_u)k7>*!BdZ9zL$JP>*6p~r(j;%F7!X)nU1Oi%`OX?!gRj;7w>Be6jb-jS3hH7vYao_;L4uhyZR)cs5!F>lwY6uLCw%(v9T|sIQxED_VU!nrw z3WWE;YgrJHbSmsp|9t1PhrEq6fJ3ueB?PTU1);)9+D+G@ ztdHP3AA9i1@#=u$e8F|ROO@Ffb2z8FnyHnW7K<*AtV?()zUpsDm?k<(07XH4B>4$N ztgi9&<~R>nhk)R0x3515MnrDj#IMjz4+Sit&()WqA0fegha)9wPWH#$9CSBwY<7S~ z4FJ2%$US>H0+<$PR=WUX<#{-JLbRTkn%e}l5Tx(2>&|tRo*raN0SC8D@1BA<8yuh) z%M8jZAJ}&R?C#Kx{mb{j5c+02>RRX+7?j|AjeGz7$6X8bU(sxE=(jpwZ|E((3{W;` zYEuCiZ{YMQVb9S?dzBqx1Dw>ZxVkgCYko!lmiYAmpf-5WWIyRwl>fTP(4g}xA!uz` 
zhm+gH%o&moLanI7wXG~I&mH#jeAc*!Yr4_xL@dM6Gh1$VRc?~^dc=;a7vU@Qc6BRf z&62l{U4Nl2jbOy&^Giu#;q-w{z?w_o^Ck2tsRA4u8W#2*atV!bpjwra1q<5Y>SBL- z|EyIywDzwA(Hh+~GGZyzuZU{x&p>KW|FeHnLg2RwVEhkOf*jBeE#SS2IWr+iN^Wjy z`UvW%vK&%ew&;67gb4$$!obHTSq?`#7T&YDuyYaW6xvwlK-ek&1mo}#7+5BH`cA|m z3<>&^%mnNxztvov;D6sg2E@LF{5R*lSvx$CwlmT>$GPvkqd&^g;wbEQdSnIQoDkYP zxi&_M&%=)GvdVD{kV91$TrUqh`?(KGRuNg5@bPfUf*~FzS0pmB-$sjS+ycujBLN&= zp%3g` zHw=zoFoi5MOC(7~PTo!;P;(cO;NI!K&GF&Mu#3QH$$|$E_?ak_kE5d_a%93UuQk43 zrB=}&uj5U_+Pe(Ml4W6WajeuCB5f5M9f8QZsWQG}i`0ya5y^i&``vf9_c|T#lg;w zoY1j-`w^QuNrv*~{BvOywkA-hI73D+Ol_m^*SASJ)XMXJUY}?SYq9 zb!&81X6Bo?xR$fsJ7==ce93-(e<^(%vbi(RH*mTO>DmW6Wx0>~75refX+ljO{Ij)2 z`c$D@-2hW4Os5U14>(YD;0Y0%pXI5@mYf{KtXW)|7cpo{rS(p-F1y@-!YIj@&<(K2 zieC*0lEPM}h3?mLmcb1q2vwTqyRXZuQ-lI86fcCh+GF7H1G}dm8QsM_QDOq!qp6=b z7FyObiLEuKLnAE|F0YxVao7C3E6J5ZOjCtfsNs$jCa1_!9~@-ALjBvsG5nudYGkkN z)&idPjOxr4*Mql}c;D6v>k$`X*)JzKxj+CYSkV3b-bs6lMsSQmlnVli?I|c|fQm~U zkuZrkG!S;i`-chs6Q~Zh8_tuP00&nIc*;ecUEJa!7f z0!{~=k#JZ9@6-{_Ff3f=M)cQ8K7ub&qBO62$Wf7Clv6V>U;=-}Jt(t+ws@WHTBV-< zh`mR1^JnLH$zX%O(cNkGkruCbWwFRSFPz%|+0Sxi+#1h8qJ_kzG%CzwLpR!j28o~rj4!DY7 z)u1IM$GsZyDZpdei14SdcZ;BZIN<34loFAwcUa)|S^>73&%vQDEwHPL{DD*iri+Fj z8mSv!k;{aFX&;3=#Gk)BnJI;h`$s;1I#vgfeu5#DJ2POYy`g?cX$x6< z0NQ}{O3aCLH=EA~K&7w?ql{)}$49iwtI9=o+u5b1<2uV8f+CDS;yeJOW!3#4Vg$g$ zX$+i2ftQn4gx`$haOaJxBsQY{kf2=}KdULXu@Sr<;?4UzJSv-LBB!UB+MykPrPU9! 
zpXmDJM;KZ&?BQ)oryX#!bo2vz$11}FWJLoA9%%otez`yeR^K={4It{6uylR27{-7O z&iX+JT%Ox**M!wb-^$5>0_HgXxnexN%~c{sdNjU#+7O)n<*x^0Zb;CFP6 zwup`<*M3`liQ$(R%`?&7b|StauBv&0Ey<@9w>z0Ur}wMedciNRLH!SHs_b{;;oJMw~-#X9z z?(^LHoO_O(z5oBU-u1q}X)S0_$;Rw5iwtE3jM)9^6_w1oj+d$)z+rP5tJn=CI{(Ta zvqRok#UCeDAz|Uwh<$i`HW>(JhVhe!%}H8Le4*ic zo((*1s!16GPx)Dpi@ZFjdWGWLoYoqz7Sv%oV*MI9Io_CBwj)*KMbR&({t*E~MUhXS zG-jR;?iG-M#rT@6a{f%oX@do(?4Vf7Uf-4<6OX6b$3PFVj%Fg0hpgMhnZ6CCT7Vp& zvs#D#yapXxPBIcdQhs-q@Iq?^71e>g^m{veNZlsOp#GZA(S_DEH1uN5x4>yWof33WGJj3Zst@(EX`(em z@(d7~j(3)CHuUGC_O18A|1!m&uMCTn+si>y?rVLg7w+US_~W+v)4Z1VBJY0vvIp)K zBR@;1A?WQ;N<0bnDI?1$@fgXwgQYxv*i=ttNO zUQBo!^z(bCN9f@>A@6(U4~yJx`T3nCKYZ!?+Uuw&3D;FBC?8&WUMYhr`#(0Mp`pO1 z3}n?$95blpD~3Y~-@JjHi5*P;yZc8%ug~}Gm@fGfunMk*qqo*d1BPb^?G;MbEGqU z1GGiSC$Dc{-R{cG$r%U)y6)M@Q>Tdj8RFvQNP+X7?A%V-@9>Qa8&d&&0FXe@i9AGb z0I(T-z0H9#<0)veb?a6jHc>m|+`It`hM#%7j?#^-bPOk={Iz+bm~%j)WFT%tnS2c> z?85ZZB;BCj>h2~BkA9gBmRGj5Z2{s6@^(3X)1Ett*$kg0P5e3O(dmln>KjcOF6J>T zns*mkX-tpmw-ox7zciY%u?tb-1Zo>8(1US@gf^+ytYutIntvR5>v8f#KfTSvLz34} z!3d3`>dx+KrMlbM**X7Wr>uCRy362^rP&Q=eU=TbU{}4bDX>{lS@|s5nuPfH-JhIX zT=;BV%+T$01v-cmT>mSi0%{r>e2owrDBh{E5buUigwNn4Mf0vW1A0!lUija{EZ<8> z#z9X=#Z8+x&tmxEYa1IKtpH-;hDL*o&zkK+Q{m1Z9-r%GX*!7Q%miQ318iqH*4brG z*ZrsE`wT_OI)+ybF^4ELIw~eUwI7%VcLHBU*bQq_3Lax&-+Ey=8md+%UU zlB7W!KTj1pk)FQ3sD|ji)yftt8Ol*MZun#N#xGybtq5 z9@l@V0&CgDDNq$Pt7ucY2L^6cR4B|rYD2&^pCz*&T7W2(@#zN&^vdqv-$4ckKq+a< z87MfS8KG8Dva`!FNbSr&T}{Sz&@B7|zvUqoA@NbdfLjo$7vcWAq~t6P#(XNJjOS3= zqt6ABgC?n6bZX!V3gf|p2cNxof%9+qo;`a^rlg`Z-?!MbZRPNxKVIv9q_Mr)Tx@Mo zWY~;Y=xpnmnSq!D zOEljpEiK)Mt$|@`3{4}?wyY+bR2Y6FS$Z&R>2Ql16PND6N7> z$H@Dl(xMxWyz2e?4G53%=3hK+$4g2enJ^XsB*}DDS0C@#_GgQG*1(1A&{nSpZGY$< zxADoyvWD_8;8coFLA7FFR*lhusnfa|epNH$pr`SE@>Ngx^@}2PnxE7@&s;h0&|PR) z?D5xLrstE>*XeF4_O)aI608lw^tIXXw6?n0-Wq0t@q%PHvnSq?S5Q!ph#D}s;ehqg zj}m3Ma1t7KNQHz1zh|JaB5EO6og8kxZ)#!zuY*yN&SiV9QiI|+{83Ew7 zpf~BM8x2W#q{Pv@UKVfNnx0W6=0_j+Np(#vn3iSAyyHDBvH&js;BQF1Z{EC_^BJB( zj>W}AWV)qb#Woaf-@YB=vkeTlJurDj+HPWTMP>=fD}waFs3!+S4&|iN&@o;^0D2DA 
z@remz=%*;ZKR@REhLoBWjErgqM$S7K7#NHSZCEaLm5EL4i_;*0ANyKH@=+s9ah9j9 zK5Q&U=^h?dC!h#9m@&S_`^3txr!m&gskgK(>}fvk6BNXJ)ywPF8arkUr?mBoPiZ>+LZX+K*BH7QRc1?ep$_H85IzEZ8bB~%gLGnIqAqAZTZ4E#2OT(yrl+QA zfYmrctYIi_eePUi(1p8kctFq5LiYFc+yJK^F|h40gJS465_|>BVi1F&#C^0SDX3cF zoFO#cCpX!4$a}Qf{WvAuS?azS2|6(YwhwAxp|qEg)D_Z}pFPi!m&>3S^w2~5(7gZy z)y#@b>!z;fzKEFv^3elf;smzJhit>WH>-y2+6?<>KQbdTN@E;bttN-QRM`f_2$?gd zyX5Y_DUuh>C|pJVbk4(_$5`Zg*Yit$2*eU9|Ks85iM4kVL8@bDNp{K0Z&zqCb3$Dk zot}P(HtGy!28nlA73f71!=s~bVRs0xxg#$x&jQ>7xP{RR7vZk*>*lvtDSWG^$y!AC zlbMMGlc9yHh3`A|+hBWjwGSPWHFSDt*s8%c4HP5xtFpG12`t(Gj3X+c3zL}Y>FASP zmKMEU)z&`Y_g0!a4m@cuMlZ1@r#El7P*kuOdQr4Xq0`O2X81Q+#oT_Xh)rz2rZ7jV zTB??vl3JBoFO@nwJNwLiLIC~JEzG!efBF!Rb`@LDV`lV$fqQO~#)^|~TWB)6+cOh+ z95OEktxDGuuq*SBD(uL)W!!qhLyl^tINn-q$y}1L&aQJO&*bmUZP%_{W8Jz{@cuV7 zYAdWDE0~z7$5#mUP2!}2zQ74W*Tat8lsDE-G`ebPR>BST8V8urp+lzM!Gn9PSg}IK z(vq+9`6Xq#^6%fjk=bbkNy&~$G^UX}e6g6=yn@;HEmOmzZ_71}UABBVaqfdd;wEiuc$aKjT5Gah3b8JFD4nZp_X z-d0vJ;3M`4XvSLrX_}m$z4RQDpjEr2L=Sd>wC8k=M(x|TA&`S9l`kzG0tR;6L)OBl zvI3vE3W%8>#)SMcJ?pWL;2*i6s=*CRAv5{M$%FELzw-G^7UTwhI5$9i&g#vD$Tv%fUlDJuF=IFiLohCJ!AJPx^j#+?FIrbS%g z{$*@h;k$woijq#~O>CBtvf6KTjRY1U%lHj+X$WacwmeFm0W23D6C=>#;^;{7IVB_| zk%~(FPUpQB;13v*iHP{5Fg6rl)mqN-0Dc5y4J`Ai+!51VhY~-_Z}Z4y(Hs$W#Tf8piQ4V!IVuHL|UNsS)462@w?8%U1>_>Q!;uX1<(LC2UBPhzV)4lsJtL144)7gSiw5#?)AYy(cUxsu9mbVbR4q z*M;m%`5}9;!ygQ};#}7ZeC|&#u3|h@Fj&EdT_5D-!?+&Z( zC6wf!pYnE~2?_z8blO$xDnOj>k&zH=i``f1AE0uRW~>mkhfCf=f6&EuT;D9H|71d0 zOZMLK>LBjU!ibCnoaHz>N3pHRc14+fY!a5dGAiydU?t-5#VHj4`EYe$Xh*){{8U0X z?cT4nU35WbGaet&J8~nHXB|9$4V=%;d8wr__Y!^{bRe&H`==u5Z`G<*75EToPx{z} zOuX2(ZmlIFGxn2}M@*OzGMf9gNQ3EC(9zk5H*QyyyFXCi*C?9OUf_F$^fss-pOuvK zK@ZP2u>L7#yW?oQNFXX=|Mtpsb#`t?kyix3mD6u4#W{wRmxAe7PbDYX9`DB+!M{Y0 z?tBf@>_%)#=t-KOG`DZiQ!gKB z8AgwJz-sG$<5M`LEV0{ko#7hI_C2mLKm&X9C-hr3%kYOU|4@DYk4?`0jzGu2nT(|` zRhg-og1o}W&tXMFs_f!Y=G(Rj89PaP&JCkPPizm>MzX-iX} zWeFfz0|?9GN~Arp)+E&xx^oB>Z%_WzR7zn1FY?TFnC<$++k>=1;@Bfn1B5Qp)Gf1t zArQSTmQFWz`uKT2zFrVQJ%Bh1vRc5!fDs9UX5Gp)UkMw@4M6EU#2z1DK^AV2h|{<$ 
z>$)3)hO|%ZhW(kpnN3rQGs-fKP-pYxn$L4Q{3f)90;+~|i|J`;Jy3*r=jH9?;^HFh zoZR(AE+T=&DJT>=7TwYF<3}x8X+`W)UC5+ZwsqEW6@>@+X7W=4lFT@A?`#|x>|yOO z_|ZO>YiJm7qTg{UN=;;oVG`;u!SUTmZ4VFM6`P4jUv!zDoo_z~#{3wq9L#MMm8PkEdI1z)aSHKh?W`8gzZA+ddCWfNGJG?xFHcdAf*Ktze{;NVS)vr7S4ImFV zFEZUiP-v9`7V0%#WfM9Xg(;8h%QGemB176%Ub^N1T(^!=btX;U>f~Bpym)aH2n)RW z8N7&Im>G4k6vZLD1NwRv@tl+W^_y`nmp8B4BGk!p^8L@Kb!@`LdMb7`2AsY$)1_R? zy;{1)@+q_{#W4}Ag2L`1(dr7Jr{ntVpfJEl89?%-@jEz>v&w|`;Ry3D@!DE(BE1@x z21{uD)$xszZ2~UhNFf*t4v_jkzZIJPE&xy>QvBvI%@wR8Ry$Z(Ux8^&!P(pZw&{*; zAAE|V7z(Z2J>1=QLH-9=vKKx>gYP{*f7YSVRR%#85gDo4VcqD=CVzN-^f)=3oNKph zYimPcO*-ys%v1DK#Yc}G*@s8!B$>PxkvM79^?B%u6^BnaZ>X z70|hfy!TSZtIojdCr}-E;dn;z~>2 zCc+UAY|^^JV#{dH4)J_xDZVc8V(ZiGOR>^H!>NJQ z>Or-rD!K1GJy8Sp;T8ZA%fk9a`N`9#2LYSrL(>R@cmxl99TpeKLpPT>NrwvR?tu%O zKrC(n6s(*5_U&87$>$PxIp6^dD=*o$mJx^WfBhdINmUM8Lre|WNkl@Cl_8IZ&Ct@NR#9c(5F)TQ^XrEVVQ>_lx|=9u9C6nkaM&=&)ns6HD+I`B)_! zkJ8@M zEe;w9u_x{J1n+Q1R`^K_wwIlqd*OL+T!=sv>Wv#W=6|(I8Vf+q-wVy$nt#d*rrBLZ zk9P5-3S`uLNa9~**Z;)PnSpP`F(E8}*#ytc!hqwQ73Uk`lt zc}sIQb%6k@0XXwMk~K`86>X6yeenb@9ERc!+owH@Odd%xmjp0Zb_Fg1Cwhh=LE3c@ zZah13vjj=IoxxmPQ0eLJK89KW(^NR+${Xdmo3OC_}N_5U=S&+g{S18;atKr zFKIk|CDI+LwJ6 zI@3`iV!=dTSBL{t9I*Ja=g*1xA#8nOa`~s_A5i?>HO#ICY>lLB691Ak9EtITSzT2ed%C`OliW5#lKlrfrV3jeQ{z33~P}Cj!YEA zu4=5VMl?z}Qge`2bw_F+K0S3&lj;i-6%i&3kjC@!sVLjVyANT!GZJ)9NOfB!0DFv;m3?Cq^0KMt*2HBd1Y1o+JduZZ<+%RO{lUduvcEq)OsOYF4H^a^=SFZRXGzHz* z4B&~%H?K~*tWk#rTY*aMYKV9TxClyxO@MPqOvb=;lRg5d;?c^I8x(--G~WM?7mv^c znoZ}6qR9)Vw6^TlPljWaz@N`oZXqsnhte@dQqHc<osf=K!-fY5d@AizS+CdqJm%q0GK8SI*7@>;94v$yTG2BM72sy zORFDuado{_UVh{h^68;o6}X?IPHn|#J=9t9wDgF#J~odZq_MUj(F@|wPsN#>YhBRF z;)|BdU{*V5#4Frr$B`QIm!-+z)AF2@M^RRMqx2WcWNE6tr+lkFdhNd;mM~OcGT&Ah z=u`R8A&tj>G$S!cg)wH)g5v1t*o_Q)xvMi9p^nvW2xcG!-VKd3wu{)0Sit`{^9LZe zNJ59SE;Faeh6?8Ri2JxN2F*t&Evvu?4Pe*RiR)Opoy^h~h6#KfF*iYi9=EhuQ2izt z=Q9GV>CXRbNGAAG`}R#pF}GjcW<~imGV&H(_-7>OnBFwK8`o|ju5yni9%#K}#SJnp z9k9vx{Op7b0>9K;Fz%OPke0=JNU#-WC!vIhb|%DXJ)NKDBAh!Am%ReV^$PACD`QVL zw^2-^IU0)UJEH!!ExTqo*0m9cT#afa 
z;Nak(rlm!@xJ?nfU@~faGhuxv2T#NmhlXwdg>(b;jna5oZw2eik`0_v{zUqYb=L$O zE81-a3Jq2+E)DG@x$p^~jzKsEucFPflo5~crn~(Ged!=JZ;ZytUkNJfw7Jt(;qaW$tA^ zN`xi2R>%T{Pt;vyfA=Rk5a0co`~?^=PcfcfK|#){W@Gp)2+LBb_)qrAVuS$ znLr?-HLESeck=Vs04V?fy=u)G^shJDmT{o(esk~OhQo&sI~ykpJxj%s!?9C?&z9ye zvzAry(Ml#JURdxj1x60Ld3X?&F|m;og<7`*)6EUIn90n{3<87@t|ldYm*7^wh8XK} zI)(vqV_VyIl_f$ges1bQCmxE|)vF|DNo9gU^@WP)G~_)%`or63t-w4n4Un6gj>kt? z?g=ae7Pk#|q&$k*i+XYsUwafLiJt%%8p_zFOH3T7(;0@^mwH0#hO_h9f(xTgI9TrD zA$bUBp>PGlG4v3^A2kZU$@6P@t~1nsZI&W(XLD0T-uPn;eM9x>>2Cc{_UTs83uI=* zl=&D4)qwBz4ciU`lUoT86sr5~b_$wQb|hNz^YizBWRTDoju=OEokS5oI$I|1Hflk1 z6R7n0D_(A-%$7dIa`^x(-^$0u#pM-V7Ltq%h!vC4zki5CH|YRrc=hU4*ra??#!re0 zr4V9Ocm8dy^Ug0 zV8v+qx(ki?ij^ymw)4^H+L}nl?7@UKD6YtXiO1FP$Gh#nI*QcIB(x_r# zR=$CPfrdk>vklXqeNL!rO?R_K8-8laI=A7}fV{4;XTSCATS|9-|0@ixuhC-8LMeAU zbv>gG0{}2amL8mp&F>CSh}`M~q-Xqsf_>oIBu=4C$J6V<&cfAEDWxL0l@`gszq=vr zvZzrKT1x?3Ma3ULWJY7m*;Kxhj#sYysH!>!FPz_`%sT2TA|jpp`$rpg-qc_l5s0iw zf(K?BV$dzPnvjq{#2nY=XCC+SS*Bbc&V5QkPT=3N9PDPcw#H6JCkbhD`6k zCWUd8o64cD%P8d97Vx&wmm3YrvPPkduN${}Afxk{Gr9F|bon}y2Au?!6Vr ziP2Fq#OOd4d~IRMFk48fo(U6EEG*~zes$<}jsiwS4%)dbX2ZXK-yt0~6auqQPl>)o zwp_^eLotvptV09;G~9m_Pbvl3TEHvhvb#Bxh32P_xjQqMe!u^HFh$(@4LfMNGzcmp z^~{L5eI2{FHyC~Gz>(q}bX(}Js;13-D!nnAKZf1_vm>jX3zy_2 zoWU%doDIyilPwtGu`xu-W4+2LCH*pKcay7=jvZO* zS*edwqSN`6uAJFtp=&?e9gjQC)y)l(m@s5xAU|} za)uCV#csGIWA>VB1xDNJld6ED0-(7f2A(JCD}Lhh;~CNPwtT_Z;} zu!dBjzo|vu!wt03jfE#qp4^B*00L;$&6g<@pGpNB>qNqE7OEW5rz?UOmodS*FYF5u zJ&sWZtUVhH$*K(*@|4kuiQ{;s(C@8iT%6dsd+rT83(HX?t6H{yjQ6Hr`3n^&D+cdE z7@y)|0I487Bc!L=PhBF)>E+n#ZLp-J0x9T4l+xv?lBC6;2IIDHxf?b&%7o1Il@imp z=+KYehkI}a5C}>8CY}nJ{rlT4Ud!{*9Zr&G5cU|n-};p4M|o2W$_p`1g#vK{!<+lN zv&`!r%r7%IWof!LWE6z!54?L1W@P>ihY|llW?+W9n1aRnN-zCbuZ_}*wZz91uA*y$ z`u80_|L*;>uU8V;~Jej&r;!G44 zQhe-eHz=9u21A$j~gcZgr@p_e2_`CEW0 zQuk$D0|7bK;+hq4;!G`Z8RStEDHt$W9PTPFZ+EVEc*=v~88zveaKp^&C44!2Pb^2L zcaIHAFLQlxX~fm7rpDI6UF_URBO`K=)DN_Hm)zW*&EnP31`w-Bu#&4#-CMD8 zl1qAUzl=jiVL;j`TQR9wY@x`M4^gG?xW0pQwHrr^iQgvSt 
zw+OQR954eEX(^e!cJksz+!J@(=Q?#~&4+aj=JFn;Ox=B8Kr;xX411MF6;6(3+6K

<+YHMzf0VkPj|%H;5+7W2gZc#!m#^$L|kf{ZZ5h-(MFuN5-jo+keT=vu_HD%Sv(~sHcP~U`VfXJlxc;Q#j3S%>u}6^CQL-%UE`ah%jRUb}lE` zOW?9<@WA!94OHr8ceQz(`PA-gXT{cOxXD1Wb=!p%s+U#>DmHVjF5*0u!19^#@IIUE z>H1vp)4@aZu9}MTisis! zNCyv!l|$$14RYQRzz<^^Zp~s`o#(OyYWWoq29+_}Fp`2#3|0hekeBa#w3n)Os!+mB zzFKU(E{kTON5PdR^MXRYXF_?lrU%PwsTymEn0skAf|-z>YzR_vrSavC?~Ygd!Az%n zN8t;R24XzPKIX-nH`0K?iRv(bm4yZQGulLJgQ+x4+hPUl_T8J<*j^*WbUV&tVaffR zA#!9LP#@HXK4we#fpHH}J~s#|(U}71j*Qx5+!d`9qXg2jHKB2Y@Un`4lr#UfYy|M2 zfL>K-*RG>$pcg#`gXTVHEe5${pN9aY0E3LqZ8ccK{>#f1OPY(Jqs5QkbeZ8}QB}5R zL3zlna>J&_F#Z#G1jN$%#o_bw^WqpRt-8GZ>sneOu?EIq>8L1hRQ`P*I=Y?6o{O0n z<6BUG;18|thN}pjr792DSDFE}_eY1^YGOAs&HRhs$|UyV#*1 z{m!?dmyv-r1G5evb+uL0o=oBn*weN*f}_Al%K+lRHKS+ut6xVr>Eic$4sOurEj%VPRKS1t1FO{-2gx%`!W8A^vB!q=k zfS~Tfj}}P%_j$MvUgSFBF9f*R{?txF)+DS6=kvyt$nm@oQA;U$d|1+L&+r&BFm4EOF_^FsdqgckZsN? zA@{Gsepzs{BUN?t^ysJptpBNhFYp`#Q7;V64f8WC2Wmd;%?z^fPXj9iip_iqKrL<`vVCxFr$Hhp)I6C zgGwB_rQ8$>N<;xwf*QE6G( zDocsS{Pe+SWk>#X;bUJWghUr&BE(7~7`{vxJ$jgF2JX`(otX)6x?JXPTUbE;VCDf;v&|G5SB4e z2z&1I?Ovg#BsQKhVGE-)ahvqU2AJYWiuTG6B-b8|YYmzP{%56c-+EkH*Ffi!35m;?3SHW zDp9vATed9!@`%aFva7})H!v|VZDfJl=4i}M=@@dC!Co@;!9JkkEpA=N0w~OF^){Fe`sQGYet>51#rvfFn#OIhzq}6gx;> z01N^=PukZG4m0RX#@WFR<+eC4gF%k&BHQ2HO=c}BpfXaTkJ!TZu@Y0cQK{QwgVaRu zS`dG{#-w1WPQCY)318qwNy#CM8NjoTb5Ft}7Tn$aUo;|)R362rP7pIyoGi4{camM# zQG6kYVL9VY=uS3?B@4;!LBCAa7oqK4D=a0PFxe_l-2kurs<<168Ww_~cN}91zsXu} z=nzjE*ZBg^8O7MO48ES<7A0&~Mu#+jW?l7AUELXyj0jCHQMA1t`1LE8 z9Qs%kfZ7GJyJKL=9|P1vQfxtJ=v+LDcr7=e6gV@VM1Llb6ug8`)$T&hv`T}+0L2l< z3UPLILK6I2Te$fm2p0Re{Ie7=hyE}M?c(M2!O!|*;B=43gGilS#+h%VtK7Wmm4fcc z3Xr_Q@Gl}5V^~BxlE~Ze!lMp-rK99eB0SOwzrL;S`{*3h)0M~@02CR0)} z>V6B_fDbs`2|6FT<48=08m0m%oTShu6nR`Z$i8;f+O>#3;0J;?J5~-LW(H8AI1Zm{ zlRv}dwwV#y4WbH;wT#cVZxXwH{kpp{_rRVPh0zV9X=>?)-xm2<;Sb3JLCLNtw29ww?10^yY% z5~8LqidIuUhnhY%7`8kyyj|QNGH7jORbMOk3BnJ!N0U(~=olQF+t7SEVag=vIM_4a z2v;zzFbg+AAeBz^Ci3!${;L2QP;prl3{~Qe-3rfd;={{c`5lZQ0{!*`PD53X@)5{1 zP;A4HS%+Q5K*3C|i#`JJ*ZyE67bIDWt{wj%I=+@S!4QHDuT 
z*4Ig#l}%`mHnOoP{l~nK>&Maj0a;7IzI6o8QrNOi^w4)ltc=;YIXxT~1I3>*beH31 z5Jmhwl0^ef8{+wuI8>48LC!967jjCI&{-0s!zOB^fbUKGfFx?bJ>n0TO=P6s7;xgG z2tksMf=7i8|1spnM&A#wqCB;03q;~caA~RBMwYuA$lM}=D`@ULSFK;Kq@xoStwvfA zT)`b;P`+FNg@N1*U-$mKGSHZT_hI;bS6%ntH3^8ye8woao7dx)!N>cca54Srz|gWw zczz*KZ`Rbsd}~Ca>xHovI+3GFk7&+A8;47fAPi|pTwBqQ5xqe`zc99Ri2hRZz;vyGdb?1Jgn^etjKE~cG$6O8XuU0bVsZSM4U!z0&)R|jx;e{_)N zwDaT15;c&Ds(&8UP2MSHV#rPh&6QZg=b^PtywAnD~03 zI?S3abAAa*hDZ{{8nxRBqol**)ywc^K^QGbov9e#~}hD2QXXn795i z(U1fPMN^tuP?t8Y$`lmzbW}5nlAdY^ats{G7|8UKL_RW#^n{vG=1b`|coi~ZJe}}) z>hpq5=$ln#rvJ4|@L(@;473S!R4d$$A%mB{@97UO(?s(G8A<`4*g*l0r-lMnW$pI2 z@7>!5u5L=r%Q`!KBP$ylk+L_R4b7LMpbdI}%IkRP(&l&jR9xcX&>m&rOc)&>pL^iIR z>4n@c)#>8nr`TppA4Hccs!hs1D|9>G87ZF8Xm^^J9cyaYv`j}>$BL8<3vE8z`OwOT z+dfqbw;c>zGsN)N?|}-t$2)~i@!S{1oO!qExdnGdKS|4PkP~|@=Qc{gcrI4*d=up) zmK7$6T_7xG7qqn>H5@qK`Ba&$c*bVxzDsD_ASR-DFu++*{QaAoyXjB1CusZI@VF6> zL@7mgBIwQ1vh;#rm)sLEfSi3WwZmRexA_&Bbh`+=|Ly7+wlh`TmMoEnlJ64Qc+WTac)#JHHCwC zdVSF08x<~rS$rZwDc~}8a#$8XD8drH=ZJg){qWYF=BiA7LER$1rrwx&q_Fq&~l%Ntmo6rYQH+J^EXl~ zf$m&WnW=xx)p-9qeAmb(n^#cE;|Nkj-^ZV+sm5j(>@m{p{Ln;V(_ndwSfepCnhYXzC`>(p8QmCM5iF zU(@U&-XJGmT2r}KuetXFqc<)f3`1K^LSra8jT1{UTJmuLzqhcl5o8RkpAR~Xjal-a zryL`84@U33XfDzSV4U0fiSf$=3U{S8M6xc>mTl%pdl^u-xSVUzc9YCjVe=5xQxBhw z%n82WwLkNXp=*0#)$J42uk&r zFZa@NU$*8i#!4&RRkUjRM5RJISpnjC9}W!Iv={-D#Y=~%Evp$>n_IhSM1XCt>SQ5z z%}8dp%bq1m-zR>tvgQrhW-n!lb_)u>UYt$H1$neFeca@auu9&n%RdC(szp{EK-M_RfLfEJ(I~qoD#8^}ZqaF!f0?Zu_;UUR`kd~2I7?uwD zqD>z+8AsPHDDO4L3ff8uFdYL0OS~3O=ydz-HCVv;g9(#v!m}9k+xDbgwU<%eubC`F z^*~*M!^mfH5lRXpqv6$0F`{k12-1Qq1u0v&e%9oDRO)QPq_jSEC z<~qEjbYy9B!t}>2os#+dgZp&NzA)=$rRVJJO{+1Da>&k!`SDLsm0BV_RrtGI;&yK# zJdpBtM=soVZhgEp{o7B=4O&e<4Y?BNb0*S?9+@8vUyhqdPj$d1sy|^6W7v_xtu}el>da4*MLFT z4|qb~t+#FS!m~n7_} zx$}9iQuwL42mBkdzU16Gp?+9xPVF#Hk{xGp6yU>c&pf7f$BG(9h<^I-;?rc17MIXo zLy4A&ieh1{rm3ljt{CB;VfB_gpjqJU-F4{DHr#;o0NrKT*~*WLliF2F@c=o?0GHhWMMAOETf5fg7M9#=h3Oj< zl`^8K3ul7zXh*U}D%l*Ynp>FU57P^jGQ+r8#J<-sNS(>xWWH*eZRLbpnR*#Tk#B^&_o;`sH2sjVdE3uD(Bd{T1%r%9m= 
z1FL3T6DhpFanMm8+z!n=7z`-9Rv>`Mai~>TToQje5*RSA@(_048HAjZ-y9Dr9*NX% zA;;+-9IS^pH1bOH>Hl)?_DMn@wH(qkm*I~g-sR;S=Vk2D>&^|WoUXu3=i={2T7W&XFht;MdHEGeqdYqU?% zHvAdxZq-w2yS|YHD3kjMrIg1uf|7V4#77My9Qm{$inrryJ)mt|edI0&<~9V>Z3h6y zsI06^peV>^6fi_q$pT@aLoYXuhL`aB802m@+Ykj$2O_3n%VIG4W0CSH-DI2+)%Ehu`r$#UlT+5PZBJCh(Nj7yLmkV`s0-CFoJuID;jl&b z@yhg%aCH((=`*blw8J71-0;$?pTGiYZ){`+31-bh2W-~|-PTWJGh<}tF^s-ON&(o) zqadwVxVY9P+mug1DL4evJl^&VsG@ Date: Sat, 5 Aug 2023 23:42:49 +0100 Subject: [PATCH 3/9] Working through correlation / causation --- source/correlation_causation.Rmd | 176 +++++++++++++++++++------------ 1 file changed, 109 insertions(+), 67 deletions(-) diff --git a/source/correlation_causation.Rmd b/source/correlation_causation.Rmd index 7b9aa84f..c6b2d829 100644 --- a/source/correlation_causation.Rmd +++ b/source/correlation_causation.Rmd @@ -245,7 +245,7 @@ difference* refer to *groups of individuals* , and in such a situation it does make sense to ask whether or not two groups are observably different from each other. -### Example: Is Athletic Ability Directly Related to Intelligence? +### Example: Is Athletic Ability Directly Related to Intelligence? {#sec-athletic-iq-ranks} Question: **Is there correlation between two variables or are they independent?** @@ -311,6 +311,25 @@ described, because this one may be easier to understand: * **Step 5.** Calculate the proportion "yes." This estimates the probability sought. +:::{.callout-note} +## Sum of the top ranks and the difference in sums + +Notice that we could also find the sum of the IQ ranks for the *top* five +athletes, as above, and then *subtract* the sum of the *bottom* five. The +result will be negative if the athletes have lower-numbered (higher) IQ ranks +on average, and positive otherwise. But gives us no more information than the +sum of the top five. The sum of all the ranks ($1 + 2 + ... + 10$) is $55$. 
+Say we find the sum of the top five ranks is 18; then the sum of the bottom
+five ranks must be $55 - 18 = 37$, and the difference of the two is
+$18 - (55 - 18) = 18 * 2 - 55 = -19$. In general, for any sum of the top five
+ranks $T$, the sum of the bottom five ranks $B$ is $55 - T$, and the difference
+is $T - B = T - (55 - T) = T * 2 - 55$. So, if the top-five sum $T$ is smaller,
+the difference $T - B$ will also be more negative. Put another way, if we know
+the sum of the top five ranks $T$ is 18, we also know the difference is
+$2 * 18 - 55 = -19$; when $T = 19$, the difference $T - B$ is $2 * 19 - 55 =
+-17$; when $T = 20$, $T - B = -15$, and so on.
+:::
+
 In @tbl-ability-trials we see that the observed sum (17) is lower than the
 sum of the top 5 ranks in all but one (shown in bold) of the ten random
 trials (trial 5), which suggests that there is a good chance (9 in 10) that the
@@ -523,16 +542,74 @@ the probability that an association as strong as that observed might occur by
 chance even if there is no association. Table 23-3 shows a proportion of five
 trials out of twenty.
 
-In the RESAMPLING STATS program "Ability2" we first note that the top 5
-athletes had 4 of the top 5 I.Q. scores. So we constitute the set of 10
-IQ rankings (vector A). We then SHUFFLE A and TAKE 5 I.Q. rankings (out
-of 10). We COUNT how many are in the top 5, and keep SCORE of the
-result. After REPEATing 1000 times, we find out how often we select 4 of
-the top 5.
+In the {{< var lang >}} program below, we first note that the top 5 athletes
+had 4 of the top 5 I.Q. scores. So we constitute the set of 10 IQ rankings
+as the {{< var array >}} `a`. We then shuffle `a` and take 5 I.Q. rankings
+(out of 10). We count how many are in the top 5, and keep score of the result.
+After repeating 10,000 times, we find out how often we select 4 of the top 5.
 
-Table 23-3
+```{python opts.label="py_ed"}
+# Number of repeats.
+N = 10000
+# The IQ ranks, ready for shuffling.
+iq_ranks = np.arange(1, 11)  # 1 through 10.
+# Scorekeeping array.
+z = np.zeros(N)
-**Results of 20 Random Trials of the Problem "ABILITY2"**
+
+for i in range(N):
+    # Shuffle the ranks.
+    shuffled = rnd.permuted(iq_ranks)
+    # Take the first 5 ranks.
+    top_5 = shuffled[:5]
+    # Everything up until this point is the same as the code above.
+    # Here is the difference.
+    # Check whether the selected IQ ranks are in the top 5.
+    are_top = top_5 <= 5
+    # Count how many were in the top 5.
+    n_are_top = np.sum(are_top)
+    # Keep track of the result of each trial.
+    z[i] = n_are_top
+    # End the experiment, go back and repeat.
+
+# Determine how many trials produced 4 or more top ranks by chance.
+k = np.sum(z >= 4)
+# Convert to a proportion.
+kk = k / N
+# Show the result.
+kk
+```
+
+```{r opts.label="r_ed"}
+# Number of repeats.
+N <- 10000
+# The IQ ranks, ready for shuffling.
+iq_ranks <- 1:10  # 1 through 10.
+# Scorekeeping array.
+z <- numeric(N)
+
+for (i in 1:N) {
+  # Shuffle the ranks.
+  shuffled <- sample(iq_ranks)
+  # Take the first 5 ranks.
+  top_5 <- shuffled[1:5]
+  # Everything up until this point is the same as the code above.
+  # Here is the difference.
+  # Check whether the selected IQ ranks are in the top 5.
+  are_top <- top_5 <= 5
+  # Count how many were in the top 5.
+  n_are_top <- sum(are_top)
+  # Keep track of the result of each trial.
+  z[i] <- n_are_top
+  # End the experiment, go back and repeat.
+}
+
+# Determine how many trials produced 4 or more top ranks by chance.
+k <- sum(z >= 4)
+# Convert to a proportion.
+kk <- k / N
+# Show the result.
+kk
+```
 
 Observed Score: 4
 
@@ -560,58 +637,26 @@ Observed Score: 4
 
          20          4           Yes
 ----------- ----------- ---------------
+: Results of 20 random trials of the top-5 rank counts {#tbl-top-rank-counts}
 
-
-
-```matlab
-' Program file: "ability2.rss"
-
-REPEAT 1000
-    ' Do 1000 experiments.
-    NUMBERS 1,10 a
-    ' Constitute the set of I.Q. ranks.
-    SHUFFLE a b
-    ' Shuffle them.
-    TAKE b 1,5 c
-    ' Take the first 5 ranks.
-    COUNT c between 1 5 d
-    ' Of those 5, count how many are among the top half of the ranks (1-5).
-    SCORE d z
-    ' Keep track of that result in z
-END
-' End one experiment, go back and repeat until all 1000 are complete.
-COUNT z >= 4 k
-' Determine how many trials produced 4 or more top ranks by chance.
-DIVIDE k 1000 kk
-' Convert to a proportion.
-PRINT kk
-' Print the result.
-
-' Note: The file "ability2" on the Resampling Stats software disk contains
-' this set of commands.
-```
-
-So far we have proceeded on the theory that if there is *any*
-relationship between athletics and I.Q., then the better athletes have
-higher rather than lower I.Q. scores. The justification for this
-assumption is that past research suggests that it is probably true. But
-if we had *not* had the benefit of that past research, we would then
-have had to proceed somewhat differently; we would have had to consider
-the possibility that the top five athletes could have I.Q. scores either
-higher *or* lower than those of the other students. The results of the
-"two-tail" test would have yielded odds weaker than those we observed.
-**Example 23-2: Athletic Ability and I.Q. a Third Way.**
+
+So far we have proceeded on the theory that if there is *any* relationship
+between athletics and I.Q., then the better athletes have higher rather than
+lower I.Q. scores. The justification for this assumption is that past research
+suggests that it is probably true. But if we had *not* had the benefit of that
+past research, we would then have had to proceed somewhat differently; we would
+have had to consider the possibility that the top five athletes could have I.Q.
+scores either higher *or* lower than those of the other students. The results
+of the "two-tail" test would have yielded odds weaker than those we observed.
 
-(Program "Ability3").
+### Example: Athletic ability and I.Q. a third way
 
-Example 23-1 investigated the relationship between I.Q.
-and athletic score by ranking the two sets of scores. But ranking of
-scores loses some efficiency because it uses only an "ordinal"
-(rank-ordered) rather than a "cardinal" (measured) scale; the numerical
-shadings and relative relationships are lost when we convert to ranks.
-Therefore let us consider a test of correlation that uses the original
-cardinal numerical scores.
+The example in @sec-athletic-iq-ranks investigated the relationship between
+I.Q. and athletic score by ranking the two sets of scores. But ranking of
+scores loses some efficiency because it uses only an "ordinal" (rank-ordered)
+rather than a "cardinal" (measured) scale; the numerical shadings and relative
+relationships are lost when we convert to ranks. Therefore let us consider a
+test of correlation that uses the original cardinal numerical scores.
 
 First a little background: @fig-hypot_iq_athletic_1 and
 @fig-hypot_iq_athletic_2 show two hypothetical cases of very high association.
 (It is an unfortunate deficiency of such diagrams that *some* variable must
 arbitrarily be placed on the x-axis, whether you intend to suggest causation
 or not.)
 
-
 ```{r fig-hypot_iq_athletic_1, opts.label="svg_fig", fig.cap="Hypothetical Scores for I.Q. and Athletic Ability — 1"}
 include_svg('diagrams/hypot_iq_athletic_1.svg')
 ```
 
-
 ```{r fig-hypot_iq_athletic_2, opts.label="svg_fig", fig.cap="Hypothetical Scores for I.Q. and Athletic Ability — 2"}
 include_svg('diagrams/hypot_iq_athletic_2.svg')
 ```
 
 ```{r fig-given_iq_athletic, opts.label="svg_fig"}
 include_svg('diagrams/given_iq_athletic.svg')
 ```
 
 ## Correlation: sum of products
 
-Now let us take advantage of a handy property of numbers. The more
-closely two sets of numbers match each other in order, the higher the
-sums of their products. Consider the following arrays of the numbers 1,
-2, and 3:
+Now let us take advantage of a handy property of numbers. The more closely two
+sets of numbers match each other in order, the higher the sums of their
+products.
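This claim is easy to check by brute force for small arrays. The sketch below (plain Python; the dictionary `sums` is our own name, not from the text) forms the sum of products of $(1, 2, 3)$ against every reordering of itself:

```python
from itertools import permutations

a = (1, 2, 3)
# Sum of products of a against each possible reordering of itself.
sums = {p: sum(x * y for x, y in zip(a, p)) for p in permutations(a)}

print(max(sums, key=sums.get), max(sums.values()))  # (1, 2, 3) 14
print(min(sums, key=sums.get), min(sums.values()))  # (3, 2, 1) 10
```

The matched order $(1, 2, 3)$ gives the largest sum, 14, and the fully reversed order $(3, 2, 1)$ the smallest, 10.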
+Consider the following arrays of the numbers 1, 2, and 3:
 
 1 x 1 = 1
 2 x 2 = 4
 3 x 3 = 9
 
 SUM = 14
 
 1 x 2 = 2
 2 x 3 = 6
 3 x 1 = 3
 
 SUM = 11
 
-I will not attempt a mathematical proof, but the reader is encouraged to
-try additional combinations to be sure that the highest sum is obtained
-when the order of the two columns is the same. Likewise, the lowest sum
-is obtained when the two columns are in perfectly opposite order:
+I will not attempt a mathematical proof, but the reader is encouraged to try
+additional combinations to be sure that the highest sum is obtained when the
+order of the two columns is the same. Likewise, the lowest sum is obtained when
+the two columns are in perfectly opposite order:
 
 1 x 3 = 3
 
From 980ade55e8e8bc7f8ced78a0ed84f010dbb1217e Mon Sep 17 00:00:00 2001
From: Ben Herbst
Date: Wed, 25 Oct 2023 19:40:26 +0200
Subject: [PATCH 4/9] Add code and more sections"

---
 source/correlation_causation.Rmd | 760 ++++++++++++++++++++++++++++++-
 1 file changed, 756 insertions(+), 4 deletions(-)

diff --git a/source/correlation_causation.Rmd b/source/correlation_causation.Rmd
index c6b2d829..0d873191 100644
--- a/source/correlation_causation.Rmd
+++ b/source/correlation_causation.Rmd
@@ -39,7 +39,744 @@ We will remove this warning when the page has adequate formatting, and we have
 ported the code.
 :::
 
+## Preview
+
+In this chapter we'll reflect on the basic principles behind resampling in
+particular, and statistics or probability in general. One way of approaching
+this is to ask: is there anything that we cannot do with resampling? The
+answer is an emphatic *yes*! In order to be able to sample, the probability
+distributions to sample *from* have to be available. There are many
+situations where this is simply not true. Any time a manager at a financial
+institution asks you whether a particular client is likely to default if
+granted a loan, it takes you outside the realm of sampling.
In this case you'll need to build a probabilistic *model*
+that will estimate the *probability of default*.
+
+We'll return to this question and explain how one can develop models that answer questions like these.
+
+Another question one can ask is whether everything one could ever want to know can in principle be learned from data.
+In this case the answer is not so straightforward. Ultimately, it is a deep philosophical question, one that we'll not pursue.
+However, we can approach this question from the data point of view and examine examples where data cannot provide the
+answers.
+
+The topic of this chapter is to understand the limitations of resampling, and of data in general, and to give an indication of what approaches
+are available beyond resampling.
+
+First we need to provide some background.
+
+## An excursion into linear algebra.
+
+:::{.callout-note}
+## On solving linear systems of equations.
+
+If you are familiar with the normal equations for solving linear least squares equations, you can safely skip this
+section.
+:::
+
+Let's set it as our goal to solve any linear system of equations that can be written in matrix form as
+
+$$
+A \mathbf{x} = \mathbf{y},
+$$
+where $A$ is an $m\times n$ matrix, $\mathbf{x}$ is an $n\times 1$ (column) vector, and $\mathbf{y}$ is an $m\times 1$
+(column) vector. Note that we want to find a solution for any such system - there are no conditions other than the
+constraints on the dimensions that we have listed. Note in particular that $m$ need not equal $n$, and if $m=n$ we don't
+demand that the determinant of $A$ be non-zero. Let's give examples of the typical situations that one encounters in general,
+although we'll mainly be interested in the second example.
+
+
+1. The first example represents all systems of equations where $m=n$ with non-zero determinant. In all these cases the equation is *solvable*, for example by using
+   Gaussian elimination with partial pivoting. (Not Cramer's rule!)
+   $$ x = 1.$$
+   This equation is so trivial that we almost don't recognize it as an equation. In this case $A = [1]$, i.e.
+   a $1\times 1$ matrix, $\mathbf{x} = 1$ and $\mathbf{y} = 1$.
+2. The second example, and the one we are primarily interested in, represents all overdetermined systems of
+   linear equations, i.e. systems of equations where we have more equations than unknowns.
+   \begin{eqnarray*}
+   x &=& 1\\
+   x &=& 1.1\\
+   x &=& 1.2.
+   \end{eqnarray*}
+   This system clearly has a problem. How can $x$ simultaneously be equal to 1, 1.1, and 1.2? The system has no solution!
+   At least, not in the ordinary sense. It turns out that this system is of the form of our generic equation. We see this by rewriting
+   it in matrix form,
+
+   $$
+   \left[
+   \begin{array}{c}
+   1\\1\\1
+   \end{array}
+   \right]
+   x = \left[
+   \begin{array}{c}
+   1.0\\1.1\\1.2
+   \end{array}
+   \right].
+   $$
+   This equation represents all equations where there are more equations than unknowns, i.e. the system is *overdetermined*.
+   It can only be solved in the least squares sense. Instead of looking for an exact solution, which does not exist, we look
+   instead for a solution that best fits the system in the least squares sense. Technically, we are looking for the solution
+   that minimizes the squared error. Please take our word for it that the solution is given by the *normal equations*
+   (of course you should not take our word for it but do a little investigation yourself!),
+
+   $$
+   A^T A \mathbf{x} = A^T \mathbf{y}.
+   $$
+   Here $A^T$ is the transpose of $A$, $A^TA$ is a square, $n\times n$ matrix and $A^T \mathbf{y}$ is an $n\times 1$ (column)
+   vector. Moreover, if $A$ is of full rank, it can be shown that $A^TA$ has an inverse, a situation that we assume to be true.
+
+   Before we continue with the example let's point out that this situation is common in practice. When Worsely measured the
+   height of the sun above the horizon 10 times and got 10 different answers, he was in exactly this position. 
Recall
+   that he resorted to the expedient of calculating the mean of his observations. Let's calculate the solution of our system
+   above using the normal equations,
+
+   $$
+   \left[ \begin{array}{ccc}
+   1 & 1 & 1 \end{array} \right]
+   \left[
+   \begin{array}{c}
+   1\\1\\1
+   \end{array}
+   \right]
+   x = \left[ \begin{array}{ccc}
+   1 & 1 & 1 \end{array} \right]
+   \left[
+   \begin{array}{c}
+   1.0\\1.1\\1.2
+   \end{array}
+   \right].
+   $$
+   Multiplying out we get
+
+   $$
+   3 x = 1.0+1.1+1.2,
+   $$
+   or
+   $$ x = \frac{1.0+1.1+1.2}{3}.$$
+   This is simply the mean of the 'measurements', exactly what Worsely used!
+3. The third example represents all systems of equations where we have an infinite number of solutions.
+   $$
+   x + y = 1.
+   $$
+   In contrast with the previous system of equations that did not have any solutions in the strict sense, this one suffers
+   from the opposite problem - it has an infinite number of solutions. Yet it is of the form of our generic system with
+   $ A = [1 1]$, $\mathbf{x} = \left[ \begin{array}{c}x\\y \end{array} \right]$ and $\mathbf{y} = [1]$. The problem now is
+   to identify a natural solution among the infinity of available solutions. To find this solution one has to calculate
+   the generalized inverse that will take us too far from our core focus. But it turns out that it can again be cast as an
+   optimization problem. In this case we are looking for the solution with the smallest norm. In this case it is given by
+   $$
+   \mathbf{x} = \left[\begin{array}{c} \frac12\\ \frac12\end{array} \right].
+   $$
+   It should be obvious that this is indeed a solution of the equation. What is more, it is the solution
+   that is the closest to the origin, i.e. out of the infinite number of solutions, this is the one with the
+   shortest length.
+
+
+Armed with the normal equations we can explain the linear correlation between variables. 
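The arithmetic of the worked example above is easy to check numerically. The sketch below is our own illustration, not part of the original argument: it solves the overdetermined system of example 2 both through the normal equations and through NumPy's least-squares routine, and compares the two answers with the plain mean:

```python
import numpy as np

# Overdetermined system from example 2: x = 1.0, x = 1.1, x = 1.2.
A = np.array([[1.0], [1.0], [1.0]])
y = np.array([1.0, 1.1, 1.2])

# Solve the normal equations A^T A x = A^T y directly.
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Solve the same system with NumPy's least-squares routine.
x_lstsq, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

# Both agree with the mean of the "measurements".
print(x_normal[0], x_lstsq[0], y.mean())
```

Both routes give the same answer, 1.1, because `np.linalg.lstsq` minimizes exactly the squared error that the normal equations characterize.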
+
+## Linear correlation
+
+Later in this chapter, we'll discuss the general correlation or association between variables. The notion that variables can be associated or correlated
+with one another is very important. In general terms it means that one can infer something about one variable by observing
+another, correlated variable. The simplest correlation is the linear correlation given by
+
+$$ y = ax + b,$$
+for two variables $x$ and $y$. Clearly, given $a$ and $b$, we can infer the value of $y$ if we are
+given the value of $x$. If $a\not=0$ then we can also infer the value of $x$, given the value of $y$.
+In this case, one can have an exact correlation - there is no doubt about the value of either of the variables,
+given the value of the other variable.
+
+In practice things are never this easy - if they were we would not have had any need for statistics or
+probability - variables are never exactly correlated. If we know the value of one variable, we can deduce
+an approximate value for the other, correlated, variable. The better the correlation, the more accurate
+the approximation.
+
+Let's illustrate the ideas by means of an example. For this, we first need to generate data, i.e.
+values for $x$ and corresponding values for $y$.
+
+First we import the necessary libraries. 
+
+
+```{python}
+import numpy as np
+from scipy import stats
+from scipy.stats import uniform, pearsonr
+
+from numpy.random import Generator, PCG64
+
+seed = 12311 # This is a random seed so that we can reproduce the results from one run to the next
+
+numpy_randomGen = Generator(PCG64(seed))
+scipy_randomGen = uniform
+scipy_randomGen.random_state=numpy_randomGen
+```
+Now we choose random values for $x$ in the interval $(0, 1)$ and corresponding values for $y$ in the
+same interval:
+
+```{python}
+import numpy as np
+from scipy import stats
+from scipy.stats import uniform, pearsonr
+
+from numpy.random import Generator, PCG64
+
+seed = 12311 # This is a random seed so that we can reproduce the results from one run to the next
+
+numpy_randomGen = Generator(PCG64(seed))
+scipy_randomGen = uniform
+scipy_randomGen.random_state=numpy_randomGen
+
+A = 0.33 # The chosen slope of the linear relationship
+B = 1.0 # The chosen intercept of the linear relationship
+NOISE = 0.1 # This determines the amount of noise we add to the simulation
+
+# Generate 50 random samples for x.
+x = scipy_randomGen.rvs(size=50)
+
+# Generate the corresponding values for y.
+# Noise is added to ensure that the linear relationship is not exact.
+y = A*x + B + NOISE*scipy_randomGen.rvs(size=50)
+
+# Show the relationship
+import matplotlib.pyplot as plt
+fig, ax = plt.subplots(1, 1)
+ax.plot(x, y, '.r')
+```
+Two questions now arise: Given a value for $x$, what is the estimate for $y$? And
+how good is this estimate? Let's answer the first question first. The data values are clearly clustered
+around a straight line. In fact, we have the equation for this line, the one we used to generate the
+samples. In practice we don't have the line, only the samples, and we would like to find the line that
+best fits the data.
+
+For every measurement $x_i,\;\; i=1,\ldots,n$ we have a corresponding $y_i,\;\; i=1,\ldots,n$. We are looking
+for a linear relationship between the $x_i$ and $y_i$. 
That is, we are looking for $a$ and $b$ such that
+$$
+y_i = ax_i + b,\;\; i =1,\ldots, n.
+$$
+This gives us $n$ equations in the two unknowns, $a$ and $b$. We are back at example 2 above! We cannot
+solve these equations exactly but we can solve them in a least squares sense. Writing the system in
+matrix form gives
+
+$$
+\left[ \begin{array}{cc}
+x_1 & 1\\
+\vdots & \\
+x_n & 1
+\end{array} \right]
+\left[\begin{array}{c}
+a\\b \end{array}
+\right]
+=
+\left[ \begin{array}{c} y_1\\ \vdots \\ y_n\end{array} \right].
+$$
+The normal equations are given by
+$$
+\left[ \begin{array}{ccc}
+x_1 & \cdots & x_n\\
+1 & \cdots & 1 \end{array} \right]
+\left[ \begin{array}{cc}
+x_1 & 1\\
+\vdots & \\
+x_n & 1
+\end{array} \right]
+\left[\begin{array}{c}
+a\\b \end{array}
+\right]
+=
+\left[ \begin{array}{ccc}
+x_1 & \cdots & x_n\\
+1 & \cdots & 1 \end{array} \right]
+\left[ \begin{array}{c} y_1\\ \vdots \\ y_n\end{array} \right],
+$$
+or, after multiplying,
+
+$$
+\left[ \begin{array}{cc}
+\sum_i x_i^2 & \sum_i x_i\\
+\sum_i x_i & n \end{array}\right]
+\left[\begin{array}{c}
+a\\b \end{array}
+\right]
+=
+\left[ \begin{array}{c}
+\sum_i x_i y_i\\
+\sum_i y_i \end{array} \right].
+$$
+Let's solve these equations for the numerical example above and see what we find. The key function is
+the scipy function `stats.linregress(x,y)`. It gives us the slope and intercept of the best linear fit
+of the data.
+
+```{python}
+import numpy as np
+from scipy import stats
+from scipy.stats import uniform
+
+from numpy.random import Generator, PCG64
+
+seed = 12311 # This is a random seed so that we can reproduce the results from one run to the next
+
+numpy_randomGen = Generator(PCG64(seed))
+scipy_randomGen = uniform
+scipy_randomGen.random_state=numpy_randomGen
+
+A = 0.33 # The chosen slope of the linear relationship
+B = 1.0 # The chosen intercept of the linear relationship
+NOISE = 0.1 # This determines the amount of noise we add to the simulation
+
+# Generate 50 random samples for x. 
+x = scipy_randomGen.rvs(size=50)
+
+# Generate the corresponding values for y.
+# Noise is added to ensure that the linear relationship is not exact.
+y = A*x + B + NOISE*scipy_randomGen.rvs(size=50)
+
+slope, intercept, _, _, _ = stats.linregress(x,y)
+
+print(' slope: ', slope, '\n', 'intercept: ', intercept)
+
+# The coordinates for drawing the best linear fit.
+x_fit = np.array([0, 1])
+y_fit = slope*x_fit + intercept
+
+import matplotlib.pyplot as plt
+fig, ax = plt.subplots(1, 1)
+ax.plot(x,y,'.r', x_fit, y_fit, 'b')
+```
+The Figure shows that we indeed have a good linear fit of the data. Moreover, the slope and intercept
+calculated from the data are close to the values we used to produce the data.
+
+Let's now simplify the normal equations and see what we can learn from that. We can always normalize the data
+by subtracting the means of both the $x_i$ and the $y_i$ before we proceed with the calculations.
+This means that
+
+$$
+\sum_i x_i = 0 = \sum_i y_i.
+$$
+The normal equations now simplify to
+
+$$
+\left[ \begin{array}{cc}
+\sum_i x_i^2 & 0\\
+0 & n \end{array}\right]
+\left[\begin{array}{c}
+a\\b \end{array}
+\right]
+=
+\left[ \begin{array}{c}
+\sum_i x_i y_i\\
+0 \end{array} \right].
+$$
+This gives us $b=0$. Therefore, if we fit a straight line to the data *after* removal of the means,
+the intercept is 0.
+
+Let's do another normalization and divide the data by their standard deviations. Since we know that the
+means are 0, the squared standard deviations are just the squared lengths of the data vectors,
+
+$$
+\sigma_x^2 = \sum_i x_i^2
+$$
+and
+$$
+\sigma_y^2 = \sum_i y_i^2.
+$$
+The squares of the standard deviations, $\sigma_x^2$ and $\sigma_y^2$, also have a special name: the
+*variance*.
+
+Now if we assume that we have already subtracted the means and also divided by the standard deviations,
+we can set $\sum_i x_i^2 = 1$ in the normal equations. Since we already know that $b=0$, we find that
+
+$$
+a = \sum_i x_i y_i. 
+$$
+The right-hand side should be familiar from linear algebra - it is the inner product between the two data vectors
+$$
+\begin{eqnarray*}
+\mathbf{x} & = & \left[ \begin{array}{c}
+x_1 \\ \vdots \\x_n \end{array} \right]\\
+\mathbf{y} & = & \left[ \begin{array}{c}
+y_1 \\ \vdots \\y_n \end{array} \right].
+\end{eqnarray*}
+$$
+But we have already normalized the two data vectors so that their respective lengths are 1. Therefore
+the inner product equals the cosine of the angle between the two vectors.
+
+We have therefore arrived at an important conclusion:
+
+*If two data vectors are normalized by first subtracting their means and then dividing by their
+lengths (standard deviations), the slope and intercept of the best linear fit have special meanings: the
+intercept $b = 0$, i.e. the best linear fit passes through the origin, and the slope equals the cosine of
+the angle between the two data vectors.*
+
+Let's reflect on the meaning of the slope. The cosine is a measure of how much the two vectors differ from
+one another. If the angle between the two is zero, the cosine is 1; if the angle is 180 degrees, so that the
+vectors point in opposite directions, the cosine is -1. If the two vectors are perpendicular, the cosine is 0.
+
+If the cosine of the angle is either +1 or -1 we say that the two datasets are perfectly correlated. In general,
+if two datasets are positively correlated ($a>0$, not necessarily perfectly), it means that if the values of one dataset
+have an upward trend, the values of the other dataset will also have an upward trend. On the other hand,
+if the two datasets are negatively correlated ($a<0$), it means that if the values of one dataset have an upward
+trend, the other dataset will have a downward trend.
+
+The cosine of the angle between the two datasets is called the *correlation coefficient*. 
It is also
+called *Pearson's* correlation coefficient after Karl Pearson, who developed it from ideas proposed by Francis
+Galton and actually first formulated by Auguste Bravais about 40 years earlier. We'll stick to calling
+it the *correlation coefficient*!
+
+:::{.callout-note}
+## The cosine measure.
+
+In machine learning the cosine of the angle between two vectors is commonly used as a measure of how similar
+the two vectors are.
+:::
+
+The formula for the correlation coefficient between two variables, $x$ and $y$, is given by
+$$
+r_{xy} = \frac{ \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) }
+         { \sqrt{ \sum_{i=1}^n (x_i - \bar{x})^2} \times \sqrt{ \sum_{i=1}^n (y_i - \bar{y})^2} },
+$$
+where the means are given by
+$$
+\bar{x} = \frac1n\sum_{i=1}^n x_i,
+$$
+$$
+\bar{y} = \frac1n \sum_{i=1}^n y_i.
+$$
+
+We'll now put everything together in code.
+
+```{python}
+import numpy as np
+from scipy import stats
+from scipy.stats import uniform, pearsonr
+
+from numpy.random import Generator, PCG64
+
+seed = 12311 # This is a random seed so that we can reproduce the results from one run to the next
+
+numpy_randomGen = Generator(PCG64(seed))
+scipy_randomGen = uniform
+scipy_randomGen.random_state=numpy_randomGen
+
+A = 0.33 # The chosen slope of the linear relationship
+B = 1.0 # The chosen intercept of the linear relationship
+NOISE = 0.1 # This determines the amount of noise we add to the simulation
+
+# Generate 50 random samples for x.
+x = scipy_randomGen.rvs(size=50)
+
+# Generate the corresponding values for y.
+# Noise is added to ensure that the linear relationship is not exact. 
+y = A*x + B + NOISE*scipy_randomGen.rvs(size=50)
+
+# Normalize the data
+# Subtract the means
+x = x - np.mean(x)
+y = y - np.mean(y)
+
+# Calculate the standard deviations using scipy's tstd() method and normalize
+std_x = stats.tstd(x)
+std_y = stats.tstd(y)
+x = x/std_x
+y = y/std_y
+
+
+# Calculate the best linear fit
+slope, intercept, _, _, _ = stats.linregress(x,y)
+
+# The coordinates for drawing the best linear fit.
+x_fit = np.array([-2, 2])
+y_fit = slope*x_fit + intercept
+
+# Plot the result
+from matplotlib import pyplot as plt
+fig, ax = plt.subplots(1, 1)
+ax.plot(x,y,'.r', x_fit, y_fit, 'b')
+
+# We are particularly interested in the values of the slope and intercept.
+# We also want to compare the slope with the correlation coefficient as calculated
+corr = pearsonr(x,y) # The correlation coefficient
+
+print(' intercept: ', intercept, '\n', 'slope: ', slope, '\n', 'corr coeff: ', corr.statistic)
+```
+This is close enough to the theoretical values to convince the most sceptical among us!
+
+### Example: Is Athletic Ability Directly Related to Intelligence? {#sec-athletic-iq-ranks}
+
+Question: **Is there correlation between the two variables?**
+
+A scientist often wants to know whether or not two characteristics go together,
+that is, whether or not they are correlated (that is, related or associated).
+For example, do youths with high athletic ability tend to also have high I.Q.s?
+
+Hypothetical physical-education scores of a group of ten high-school boys are
+shown in @tbl-physical-mental, ordered from high to low, along with the I.Q.
+score for each boy. The ranks for each student's athletic and I.Q. scores are
+then shown in the third and fourth columns:
+
+| Athletic Score | I.Q. 
Score | Athletic Rank | I.Q. Rank |
+|--------------------|----------------|-------------------|--------------|
+| 97 | 114 | 1 | 3 |
+| 94 | 120 | 2 | 1 |
+| 93 | 107 | 3 | 7 |
+| 90 | 113 | 4 | 4 |
+| 87 | 118 | 5 | 2 |
+| 86 | 101 | 6 | 8 |
+| 86 | 109 | 7 | 6 |
+| 85 | 110 | 8 | 5 |
+| 81 | 100 | 9 | 9 |
+| 76 | 99 | 10 | 10 |
+
+: Hypothetical athletic and i.q. scores for high school boys {#tbl-physical-mental}
+
+From a casual inspection of the Athletic scores and the IQ scores it appears as if there is indeed a correlation or
+association. But we don't need to guess, we can calculate the correlation coefficient.
+
+```{python}
+import numpy as np
+from scipy.stats import pearsonr
+
+athl_score = np.array([
+97, 94, 93, 90, 87, 86, 86, 85, 81, 76
+])
+iq_score = np.array([
+114, 120, 107, 113, 118, 101, 109, 110, 100, 99
+])
+
+corr = pearsonr(athl_score, iq_score)
+print('correlation: ', corr.statistic)
+```
+The correlation coefficient indicates that there might indeed be a strong (positive) relationship between athletic
+ability and I.Q. The reason for the cautionary note is that this is a tiny sample and there might be many other reasons
+for the relationship, such as recruitment policy, etc.
+
+This brings us naturally to a discussion of causality.
+
+## Causality
+
+Before commencing on a discussion of causality proper, let's summarize what we have just learned. Our discussion above
+centered around *linear correlation*. We developed the notion of the *correlation coefficient*, which is a measure of
+how well two variables are linearly correlated. The question then arises: what about more general associations between
+variables, such as nonlinear relationships between variables? On that, linear correlation has nothing to contribute.
+If there is a nonlinear association between variables it will not be possible to determine that using linear correlation.
+
+Then there is the matter of statistical dependence or independence. 
Intuitively, two variables are statistically
+independent if knowledge about one variable does not tell us anything about the other variable. Here a formula makes it
+clear. $x$ and $y$ are statistically independent if
+$$
+P(x | y) = P(x).
+$$
+This means that the probability distribution of $x$ remains the same, no matter what the value of $y$ might be. One cannot
+do inference given statistically independent variables.
+
+Are variables that are statistically independent, linearly uncorrelated? The answer is yes, because there is no
+association between the variables, linear or nonlinear. On the other hand, are variables that are statistically dependent
+also linearly correlated? In this case the answer is no. There might be a nonlinear relationship between the variables
+that is nevertheless linearly uncorrelated. We can easily construct a simple example.
+
+```{python}
+import numpy as np
+from scipy import stats
+from scipy.stats import uniform, pearsonr
+
+from numpy.random import Generator, PCG64
+
+seed = 12311 # This is a random seed so that we can reproduce the results from one run to the next
+
+numpy_randomGen = Generator(PCG64(seed))
+scipy_randomGen = uniform
+scipy_randomGen.random_state=numpy_randomGen
+
+NOISE = 0.1 # This determines the amount of noise we add to the simulation
+
+# Generate data
+t = np.linspace(0, 2*np.pi, 101)
+x = np.cos(t) + NOISE*scipy_randomGen.rvs(size=101)
+y = np.sin(t) + NOISE*scipy_randomGen.rvs(size=101)
+
+slope, intercept, _, _, _ = stats.linregress(x,y)
+
+print(' slope: ', slope, '\n', 'intercept: ', intercept)
+
+# The coordinates for drawing the best linear fit. 
+x_fit = np.array([-1.1, 1.1])
+y_fit = slope*x_fit + intercept
+
+import matplotlib.pyplot as plt
+fig, ax = plt.subplots(1, 1)
+ax.plot(x,y,'.r', x_fit, y_fit, 'b')
+
+# The correlation coefficient
+corr = pearsonr(x, y)
+print('corr coeff: ', corr.statistic)
+
+```
+Although we can learn much about the value of $y$ by knowing the value of $x$, the correlation coefficient is tiny
+and gives no hint of the strong association between the two variables. This example shows us that looking for
+*linear* associations does us no good if we are dealing with nonlinear associations. In practice of course, the most
+interesting associations are nonlinear, requiring techniques that are able to uncover these relationships. That is
+exactly what the huge breakthroughs in deep neural networks since about 2012 have provided for us. You can think of the
+different (nonlinear) layers of a deep neural network as a nonlinear transformation of the data into a linear
+relationship where we can apply our linear techniques.
+
+In statistics, everything used to be about *inference*, based on the association between variables. The goal was to
+extract the maximum amount of information possible from the data. To give an idea of just how wildly successful this
+program has been, one has only to refer to the thoroughly impressive generative methods, such as ChatGPT, that are solely
+based on data. So successful has this been that it is becoming increasingly difficult to distinguish between the
+generative model and a human response. At first sight it would appear as if we have arrived at true machine intelligence.
+But appearances can be misleading, and humans are gullible. One should not mistake a huge, perfect memory for intelligence
+or understanding. The generative model is just as likely to repeat the latest conspiracy theory as to provide useful advice.
+Granted, this depends on the prevalence of data in specific fields. 
In some areas conspiracy theories are more prevalent
+than the genuine article, in which case the generative model is more likely to regurgitate the nonsense than the sense.
+We don't mean to disparage generative models - their achievements are spectacular. And there is no doubt that they will change
+the general landscape in significant ways. To quote a lost author (sadly lost to me), it is not hard to envisage a future
+where the principal task of the programmer will be to adapt a generative model to specific situations. Something like:
+get rid of your conspiracy theories, in this application they are not welcome!
+
+What lies beyond data, what are the questions we cannot answer from data?
+
+### Example: A statistical view.
+
+Many years ago a statistician friend of mine made the comment that we know that the sun will rise tomorrow morning
+because we have watched it rise for thousands, if not millions, of years. As a budding applied mathematician who knew
+about gravity and things like that, the comment bothered me. Apart from the fact that it probably reflected a specific
+philosophical view of the nature of science, I was hard-fetched to find an adequate response.
+
+Today my response will perhaps be along the lines of, there is a reason, beyond the observations, why the sun rises
+every morning. And if we ignore the reasons, we lose all understanding, and all the benefits that go with understanding.
+
+With data there *must* be a reason *why* we observe what we observe. If there is no reason, no coherent mechanism
+that produces the data, there is little if anything the data will tell us. Of course, the mechanism is often too
+complex to understand in any useful way. In that case we have no choice but to try and learn as much as we can from the
+data. But that inevitably limits our understanding. 
+
+It is not unreasonable to ask the question whether there is any reason to believe that the information that is needed
+to answer a specific question is actually present in the data. Practical experience has taught us that one should indeed
+ask this question. But it has also taught us that an answer is not always easily forthcoming. Often, the information is
+disguised and only present via proxy.
+
+The next example is one where the information really is not in the data.
+
+### Example: A shoe sale.
+
+A potential client approached our company asking for a model that would optimize their profit when putting their unsold shoes
+on sale at season's end. The problem is clear: If they give the shoes away, they will get rid of them but without any profit.
+If they charge too much, they won't sell any and will be left with a lot of surplus shoes, and no profit. Our sales people
+said sure, just give us your historical data as far back as you can of prices versus sales. Our clever experts will
+develop a model in no time. Wrong! The clever experts had no idea how to approach this problem. Fortunately they could
+refer the sales people to The Book of Why, where Judea Pearl and Dana Mackenzie wrote, *Many scientists have been quite
+traumatized to learn that none of the methods they learned in statistics will answer a simple question like "What happens
+if we double the price"?* Or, as in our case, halve the price.
+
+Note that nothing in the historical data informs us about the outcome of our actions. The last time the price was half
+the present price may have been before inflation (or Covid!). Or all the other stores had an even lower or higher
+price. The best-known way, and still the prevalent approach, to answer questions like these, is through carefully controlled
+experiments. 
However, there is another way due to the more recent studies into cause-effect relationships - *causality* -
+and that is to develop a model of consumer behaviour, including present marketing conditions. The next step is to find
+surrogate data that can serve as a proxy for the real thing (see Pearl and Mackenzie).
+
+### Example: Covid
+
+During the Covid pandemic there was an association between the number of Covid tests that were performed and the number
+of positive test cases. The more tests that were performed, the more positive test cases were recorded. Also, an association
+was recorded between Covid-related deaths and the number of tests that were performed. That did not make
+a politician too happy - it reflected badly on the administration. So, the politician decided to intervene and stop the
+Covid tests. Sure enough the positive test cases dropped to zero.
+The politician was therefore right in asserting that there was a causal relationship between the tests and the positive
+test cases. However, the number of Covid-related deaths showed no response to the stopping of the tests. Not surprisingly,
+this shows that, although there was an association between the number of tests and Covid-related deaths, there is no causal
+relationship - the tests did not cause the Covid-related deaths.
+
+The previous examples are all rooted in *causality* - the concept that was for a long time banned from statistics.
+Although most of us have a pretty good idea what is meant by *causality*, a precise definition eludes us; it might be
+impossible not to end up in a circular argument.
+We'll therefore treat it as a primitive concept that we intuitively understand, much like points or lines in geometry.
+In short, we assume that you know what we are talking about when we talk about *causality*. 
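The Covid story above can be caricatured in a few lines of simulation. Every number below is invented purely for illustration (the rates, counts and seed are ours, not from any real data): recorded positives are driven by how many tests are run, while deaths depend only on the unobserved infections. Under observation, tests and positives are strongly associated; the intervention of stopping the tests removes the recorded positives but leaves the deaths untouched:

```python
import numpy as np

rng = np.random.default_rng(42)
days = 200

# Unobserved true infections, and a testing rate that varies day by day.
infections = rng.poisson(1000, size=days)
tests = rng.integers(1000, 10000, size=days)

# Recorded positives scale with the number of tests; deaths do not.
positives = rng.binomial(tests, 0.1)
deaths = rng.binomial(infections, 0.01)

# Observation: tests and recorded positives are strongly associated.
print(np.corrcoef(tests, positives)[0, 1])

# Intervention: stop all testing.  Recorded positives vanish ...
tests_stopped = np.zeros(days, dtype=int)
positives_stopped = rng.binomial(tests_stopped, 0.1)
print(positives_stopped.sum())  # 0

# ... but deaths, which never depended on testing, are unchanged.
deaths_stopped = rng.binomial(infections, 0.01)
print(deaths.mean(), deaths_stopped.mean())
```

The point of the sketch is that the observed association between tests and positives says nothing about what an intervention on testing will do to deaths; only the (here, known) mechanism does.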
+
+Let's now look at the key ideas of causality as illustrated by the previous examples. A very readable and popular introduction by one
+of the pioneers of the field is *The Book of Why*.
+
+### Association
+
+*Association* is just that: from data we learn that if we observe the value of some variable, we can get a quantitative
+idea of what the value of another, associated, variable can be. There is no understanding, no explanation of how or why this
+might have come about. In order to understand, to explain, to establish causal relationships, one has to understand
+the *why* of the observed data. That is, we need to understand the mechanism that produced the data.
+Even if it is not in detail, we need to have some understanding of it.
+
+And to do this, we need to look outside the data.
+
+
+One cannot explain gravitation by observing at what time the sun rises in the morning. We cannot build
+an optimum sales model based on the historical data of sales prices and sales.
+
+### Intervention
+
+In order to make any progress beyond learning from data, we have not only to observe, but also to act, to intervene.
+We need to move beyond being a computer that consumes the data we are given, to being human, to take charge.
+
+Recall the formula for statistical independence: if
+$$
+P(x|y) = P(x)
+$$
+then also
+$$
+P(y|x) = P(y).
+$$
+
+Now things are about to change (for us, the ideas have been around for quite a while!). First of all we need to make a
+distinction between *association* and *causality*. If we talk about association, or perhaps better *correlation*, in general
+(not linear) terms between variables, we are not allowed to infer any *causal* relationship between the variables. 
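The symmetry of statistical independence expressed by the two formulas above is easy to check on simulated data. A minimal sketch (the sample size and seed are arbitrary choices of ours): for two independently generated binary variables, the empirical conditional probabilities agree with the marginals in both directions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two binary variables generated independently of each other.
x = rng.integers(0, 2, size=n)
y = rng.integers(0, 2, size=n)

# P(x = 1) versus P(x = 1 | y = 1): conditioning on y changes nothing.
p_x = np.mean(x == 1)
p_x_given_y = np.mean(x[y == 1] == 1)
print(p_x, p_x_given_y)

# And symmetrically, P(y = 1 | x = 1) matches P(y = 1).
p_y = np.mean(y == 1)
p_y_given_x = np.mean(y[x == 1] == 1)
print(p_y, p_y_given_x)
```

With this many samples the conditional and marginal estimates agree to within sampling noise, in both directions, which is exactly what independence promises.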
+ + + + + + + + + + - + + + + - + \ No newline at end of file From 11d1abeb6150395f9914dff412f9f89279b59801 Mon Sep 17 00:00:00 2001 From: Stefan van der Walt Date: Wed, 25 Oct 2023 11:46:42 -0700 Subject: [PATCH 5/9] Remove draft banner; fix inline math --- source/correlation_causation.Rmd | 21 ++------------------- 1 file changed, 2 insertions(+), 19 deletions(-) diff --git a/source/correlation_causation.Rmd b/source/correlation_causation.Rmd index 0d873191..b896cce6 100644 --- a/source/correlation_causation.Rmd +++ b/source/correlation_causation.Rmd @@ -22,23 +22,6 @@ source("_common.R") # Correlation and Causation -:::{.callout-warning} -## Draft page partially ported from original PDF - -This page is an automated and partial import from the [original second-edition -PDF](https://resample.com/content/text/27-Chap-23.pdf). - -We are in the process of updating this page for formatting, and porting any -code from the original [RESAMPLING-STATS -language](http://www.statistics101.net) to Python and R. - -Feel free to read this version for the sense, but expect there to be multiple -issues with formatting. - -We will remove this warning when the page has adequate formatting, and we have -ported the code. -::: - ## Preview In this chapter we'll reflect on the basic principles behind resampling in particular, and statistics or probability in @@ -162,7 +145,7 @@ although we'll mainly be interested in the second example. $$ In contrast with the previous system of equations that did not have any solutions in the strict sense, this one suffers from the opposite problem - it has an infinite number of solutions. Yet it is of the form of our generic system with - $ A = [1 1]$, $\mathbf{x} = \left[ \begin{array}{c}x\\y \end{array} \right]$ and $\mathbf{y} = [0]$. The problem now is + $A = [1 1]$, $\mathbf{x} = \left[ \begin{array}{c}x\\y \end{array} \right]$ and $\mathbf{y} = [0]$. The problem now is to identify a natural solution among the infinity of available solutions. 
To find this solution one has to calculate the generalized inverse, which would take us too far from our core focus.
But it turns out that the problem can again be cast as an optimization problem: among all the solutions, we look for
the one with the smallest norm, which in this case is given by
@@ -2466,4 +2449,4 @@ sum-of-products method, and expect to get the same result.
 problem.
 2. Write a computer program to implement those steps.
 
---> 
\ No newline at end of file
+--> 

From cfc29a3760de7505748cb294065a3cf45f4d7da5 Mon Sep 17 00:00:00 2001
From: Ben Herbst
Date: Thu, 26 Oct 2023 09:19:01 +0200
Subject: [PATCH 6/9] * Slight improvement on formulation

* On gravity
* On athletic/IQ scores

---
 source/correlation_causation.Rmd | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/source/correlation_causation.Rmd b/source/correlation_causation.Rmd
index 0d873191..b5988cb9 100644
--- a/source/correlation_causation.Rmd
+++ b/source/correlation_causation.Rmd
@@ -512,6 +512,13 @@ A scientist often wants to know whether or not two characteristics go
 together, that is, whether or not they are correlated (that is, related
 or associated). For example, do youths with high athletic ability tend
 to also have high I.Q.s?
 
+:::{.callout-note}
+## The data
+
+These scores are hypothetical and should not be used to draw any important conclusions. For one thing, this is a very
+small dataset; in a larger dataset the scores might well follow a different distribution.
+:::
+
 Hypothetical physical-education scores of a group of ten high-school boys
 are shown in @tbl-physical-mental, ordered from high to low, along with the
 I.Q. score for each boy. The ranks for each student's athletic and I.Q. scores are
@@ -530,9 +537,9 @@ then shown in the third and fourth columns:
 | 81 | 100 | 9 | 9 |
 | 76 | 99 | 10 | 10 |
 
-: Hypothetical athletic and i.q. scores for high school boys {#tbl-physical-mental}
+: Hypothetical athletic and I.Q.
scores for high school boys {#tbl-physical-mental}
 
-From a casual inspection of the Athletic scores and the IQ scores it appears as if there is indeed a correlation or
+From a casual inspection of the Athletic scores and the I.Q. scores it appears as if there is indeed a correlation or
 association. But we don't need to guess, we can calculate the correlation coefficient.
 
 ```{python}
@@ -642,8 +649,10 @@ What lies beyond data, what are the questions we cannot answer from data?
 
 Many years ago a statistician friend of mine made the comment that we know that the sun will rise tomorrow
 morning because we have watched it rise for thousands, if not millions, of years. As a budding applied mathematician who knew
-about gravity and things like that, the comment bothered me. Apart from the fact that it probably reflected a specific
-philosophical view of the nature of science, I was hard-fetched to find an adequate response.
+about gravity, the elliptic orbits of the planets and things like that, the comment was bothersome. Apart from the fact
+that it probably reflected a specific philosophical view of the nature of science, I was hard-pressed to find an adequate
+response. Isn't science just about observation, developing models to fit the facts, to be discarded when they come into
+conflict with the facts and replaced with another ephemeral model?
 
 Today my response would perhaps be along these lines: there is a reason, beyond the observations, why the sun rises
 every morning. And if we ignore the reasons, we lose all understanding, and all the benefits that go with understanding.
@@ -653,6 +662,10 @@ that produces the data, there is little if anything the data will tell us. Of co
 complex to understand in any useful way. In that case we have no choice but to try and learn as much as we can from the
 data. But that inevitably limits our understanding.
 
+It does sound odd, though, that one should discard hard-earned knowledge and submit to the limitations of data.
Our
+understanding of gravity, that the earth revolves around the sun and spins on its own axis, did not come without
+considerable effort, and at a price. Just ask Copernicus or Galileo. There is no reason to pretend that we don't need it!
+
 It is not unreasonable to ask whether there is any reason to believe that the information needed to
 answer a specific question is actually present in the data. Practical experience has taught us that one should indeed
 ask this question. But it has also taught us that an answer is not always easily forthcoming. Often, the information is
@@ -718,7 +731,7 @@ an optimum sales model, based on the historical data of sales prices and sales.
 ### Intervention
 
 In order to make any progress beyond learning from data, we have not only to observe, but also to act, to intervene.
-We need to move beyond being a computer that consumes the dat we are given, to being human, to take charge.
+We need to move beyond being a computer that consumes the data we are given, to being human, to take charge.

From 628c2f6b247a735a9de72f1937a3de2f6935d5e4 Mon Sep 17 00:00:00 2001
From: Ben Herbst
Date: Tue, 31 Oct 2023 21:13:13 +0200
Subject: [PATCH 7/9] * Add to causality

* Need to iterate on this, not add to it.

---
 source/correlation_causation.Rmd | 72 +++++++++++++++++++++++++-------
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/source/correlation_causation.Rmd b/source/correlation_causation.Rmd
index 5e6bc197..0f91d3af 100644
--- a/source/correlation_causation.Rmd
+++ b/source/correlation_causation.Rmd
@@ -714,22 +714,62 @@ an optimum sales model, based on the historical data of sales prices and sales.
 ### Intervention
 
 In order to make any progress beyond learning from data, we have not only to observe, but also to act, to intervene.
-We need to move beyond being a computer that consumes the dat we are given, to being human, to take charge.
-
-
-
-$$
-P(x|y) = P(x)
-$$
-then also
-$$
-P(y|x) = P(y).
-$$
-
-Now things are about to change (for us, the ideas have been around for quite a while!). First of all we need to make a
-distinction about *association* and *causality*. If we talk about the association, better perhaps, *correlation* in general
-(not linear) terms, between variables, we are not allowed to infer any *causal* relationship between the variables.
-
+We need to move beyond being a computer that consumes the data we are given, to being human, to take charge. By
+*intervention* we mean that we are actively 'turning a knob' and seeing how that affects the value that we are
+interested in. If the active intervention changes the value of the observation, we conclude that there is a causal
+effect. Turning the knob *causes* the value to change. If you press down on the accelerator of your car, it goes
+faster. Your action causes the increase in speed.
+
+Note that this is quite different from, say, linear regression, which might show the relationship between how hard you
+press on the accelerator and how fast you go. Given how hard one presses, it is possible to *infer* the speed for that
+pressure. However, it is not possible to conclude any causal relationship. For that we need to *understand* the
+relationship between force and acceleration.
+
+It is not possible to infer causal relationships from data alone, mainly due to the possibility of *confounders*:
+common causes of two observed variables. We already alluded to this when we examined the strong relationship between
+I.Q. and athletic ability. One cannot conclude that the one is the cause of the other, because of confounders. It is
+quite possible that the relationship is produced by the recruitment program: only talented athletes with a high I.Q.
+are recruited, with the individuals who score high on both abilities being the most sought after. There may also be
+other confounders.
+
+One way to assess whether athletic ability causes a high I.Q.
is to study an intervention, such as an injury,
+and see whether it has a measurable effect on I.Q.
+
+### Counterfactuals
+
+Pearl calls this the highest level of causation. At this level one starts to ask questions about situations that have
+not been observed, even situations that cannot be observed. One such question, of particular importance for someone
+my age, is: what is my life expectancy? It is really important to know if you need to survive after retirement on the
+money you saved during your working life. But you have no observations to guide you. Note that this question is very
+different from the one that insurance companies ask their actuaries to calculate: what is the expected or average
+life expectancy of someone who doesn't smoke or drink excessively, has no known health issues and does not engage in
+any high-risk sport, etc. In that case one can use lots of data and come up with an estimate of the expected life
+expectancy for different *categories* of the population. For me as an individual, it is just about meaningless.
+
+Physicists, among others, ask and answer counterfactual questions all the time. We don't need to observe a particular
+planet orbiting a particular star to be able to predict what its position will be a thousand years from now. In this
+case, as in many others, a few data points together with a deep understanding of the processes involved give us the
+answer.
+
+### Summary
+
+"In God we trust. All others [must] have data." This is the provocative title of one of the chapters in *The Emperor
+of All Maladies: A Biography of Cancer* by Siddhartha Mukherjee. In this chapter he addresses the question of why
+radical surgery on carcinoma of the breast, following the principle that "the disease, even in its early stage, is such
+a formidable enemy that it is my duty to carry out as radical an operation as the ... anatomy permits", persisted for
+so long.
Tragically, proponents of radical surgery refused to run clinical trials, perhaps in fear of what the data
+would tell them. When trials were eventually run, the results were unambiguous: surgery beyond local surgery held
+no benefit. Countless women had been maimed.
+
+This example illustrates both the importance of data and the importance of determining causal effects. What, if any,
+were the benefits of radical surgery?
+
+Extracting associations from data is powerful. However, it is also limited: there are questions that cannot be
+answered from the data alone, because the information isn't always available in the data. Instead of despairing, this
+author takes comfort from it. It means that there are processes that are either hard or impossible to automate. Human
+ingenuity is required. Instead of describing, we also have to understand.
+
+The takeaway message is, perhaps: *think* carefully about your data, and try to *understand* as much as is humanly
+possible about the mechanisms that produced the data.
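
The confounder point made earlier in the chapter can be made concrete with a small simulation; this is an invented
sketch (the "recruitment" mechanism and all numbers are hypothetical), showing that a hidden common cause can make two
variables strongly correlated even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(2023)
n = 10_000

# Hidden confounder: a hypothetical "recruitment" score that drives
# both simulated athletic ability and simulated I.Q.
recruit = rng.normal(size=n)
athletic = recruit + rng.normal(scale=0.5, size=n)
iq = recruit + rng.normal(scale=0.5, size=n)

# The two variables are strongly correlated (about 0.8 here), even
# though neither causes the other; the confounder alone produces
# the association.
r = np.corrcoef(athletic, iq)[0, 1]
print(round(r, 2))
```

Intervening on one variable in this simulation, for example setting `athletic` directly without touching `recruit`,
would leave `iq` unchanged, which is exactly the observation/intervention distinction drawn above.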