EAL integrated a lot of chapters and repositioned sections. All chapters knit and the book compiles as a whole. Ch 4 and 8 need revision by EAL. Ch 10 and 11 have to be expanded. After this push is merged with master, authors must get a new branch or clone and work according to instructions in TODO.txt. (#23)
PLS120BookTeam · Sep 24, 2018 · 224f965
1 parent a93cd75
commit 224f965
Showing 55 changed files with 4,115 additions and 4,841 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/.Rhistory (1) b/.Rhistory (1)
diff --git a/00.1_FrontMatter.Rmd b/00.1_FrontMatter.Rmd
@@ -1,6 +1,6 @@
 ---
 title: "PLS 120. Introduction to Applied Statistics"
-author: "Emilio A. Laca, Jennifer Brazeal, Cale Miller, Stephanie, Zullo"
+author: "Emilio A. Laca, Jennifer Brazeal, Cale Miller, Stephanie Zullo"
 date: "2018-02-16"
 site: bookdown::bookdown_site
 documentclass: book

diff --git a/01.0_Intro2Stats.Rmd b/01.0_Intro2Stats.Rmd
diff --git a/02.0_Rcomputation.Rmd b/02.0_Rcomputation.Rmd
diff --git a/03.0_MathSymbols.Rmd b/03.0_MathSymbols.Rmd
@@ -5,6 +5,7 @@ output:
     number_sections: yes
     theme: readable
     toc: yes
+    toc_depth: 3
 ---
 
 # Required Math Skills and Symbols {#chMath}
@@ -381,7 +382,7 @@ where $e = \hat{\epsilon} = (Y_i-\hat{\mu})$ are the deviations of each cow from
 \end{equation} 
 <br>
 
-Suppose that the estimated mean and variance were $28.3 \ kg \ day^{-1}$ and $25 \ kg^2 \ day^{-2}$. USDA reported an average production of $28.3 \ kg \ day^{-1}$ of milk per milking cow in 2016 (https://www.nass.usda.gov/Publications/Ag_Statistics/2017/Chapter08.pdf). The variance was not reported, so we use a fictitious number: $\hat\sigma^2 = 25$. The area under the curve (Figure \@ref(fig:propLtMean)) is calculated in R as follows (note how the code is written in several lines and indented for better readability):
+Suppose that the estimated mean and variance were $28.3 \ kg \ day^{-1}$ and $25 \ kg^2 \ day^{-2}$. USDA reported an average production of $28.3 \ kg \ day^{-1}$ of milk per milking cow in 2016. <a href="https://www.nass.usda.gov/Publications/Ag_Statistics/2017/Chapter08.pdf" target="_blank">See report here</a>. The variance was not reported, so we use a fictitious number: $\hat\sigma^2 = 25$. The area under the curve (Figure \@ref(fig:propLtMean)) is calculated in R as follows (note how the code is written in several lines and indented for better readability):
 
 ```{r}
 
@@ -904,7 +905,10 @@ The shape of polynomials can be modified by changing their parameter values. The
 ```
 <br>
 <br>
+
+
 ## Symbols and Terms{#mathSymbls}
+
 Term or symbol| Explanation
 ------------| ----------------------------------------------------------
 Observation | Result of measuring one or several variables in one unit of experimental material. For example, the milk production of one cow in a day.

diff --git a/04.0_DataExploration.Rmd b/04.0_DataExploration.Rmd
@@ -1,14 +1,17 @@
 ---
-output:
-  html_document: default
-  pdf_document: default
+output: 
+  html_document: 
+    fig_caption: yes
+    number_sections: yes
+    theme: readable
+    toc: yes
+    toc_depth: 3
 ---
 
 ```{r setup0, include=FALSE}
 knitr::opts_chunk$set(echo = TRUE)
 ```
 
-#Updated 9-16
 ```{r message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE, include=FALSE}
 # load packages for chapter
 
@@ -45,11 +48,32 @@ The next one has to be explained more:
 
 ## Data curation
 
-Within a statistical framework, the term population refers to an entire set of measurments or a specific characteristic that is being examined. A descriptive measure for a population is refered to as a parameter. For exxample, popualtion parameters can be the number of all diary cows in California (population size = N) or the mean body weight of all diary cow calves in California. To determine these parameters a survey would need to be conducted, which is often not fesible. In such cases, inferential statistics is used by taking representative, random samples, of the popualtion and infering conclusions about the entire popualtion.
+See the Wikipedia entry on <a href="https://en.wikipedia.org/wiki/Data_curation" target = "_blank">Data Curation</a>.
 
-When conducting random samples from a popualtion, collected parameters from a sample are refered to as statistics. Using the example above where *N* (captial letter) referred to the entire popualtion of dairy cows in California, a random sample of dairy cows in California can be designated with the varibale *n* (lower-case). 
 
-In the figure below (\@ref(fig:Map_cows)) two random samples were drawn from the California dairy cow popualtion to infer the average weight of newborn calves for the entire popualtion. In this example the sample statistic is weight of newborn dairy cow calves.  
+See message from Duncan Lang and Vessela Ensberg:
+
+The content that you want to cover is very similar to one of my seminars on keeping data tidy and organized (slides here) and data sharing (slides attached). If you prefer, I can adapt this content for your class. I have presented data management sessions for a number of graduate seminar classes.
+
+For references, I’d recommend starting with FAIR data:
+https://www.nature.com/articles/sdata201618
+https://www.force11.org/group/fairgroup/fairprinciples
+
+For tidy data, Data carpentry has some good guidelines:
+http://www.datacarpentry.org/semester-biology/materials/tidy-data/
+
+New England Collaborative Data Management Curriculum has extensive material and lesson plans on data management:
+https://library.umassmed.edu/necdmc/modules
+
+
+
+## Data comes from samples
+
+Within a statistical framework, the term population refers to an entire set of measurements of a specific characteristic that is being examined. A descriptive measure of a population is referred to as a parameter. For example, population parameters can be the number of all dairy cows in California (population size = *N*) or the mean body weight of all dairy calves in California. To determine these parameters, a survey of the entire population would need to be conducted, which is often not feasible. In such cases, inferential statistics is used: representative random samples are taken from the population and used to infer conclusions about the entire population.
+
+When random samples are drawn from a population, descriptive measures calculated from a sample are referred to as statistics. Using the example above, where *N* (capital letter) referred to the size of the entire population of dairy cows in California, the size of a random sample of dairy cows in California can be designated with the variable *n* (lower-case).
+
+In the figure below (Figure \@ref(fig:Map_cows)), two random samples were drawn from the California dairy cow population to infer the average weight of newborn calves for the entire population. In this example the sample statistic is the mean weight of newborn dairy calves.
 
 <br>
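
The distinction between a population parameter and a sample statistic can be illustrated with a minimal sketch in R. All numbers below are assumptions made for illustration only: the population of calf birth weights is simulated, and the values chosen for `N`, `n`, the mean, and the standard deviation do not come from real data.

```r
set.seed(120)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: birth weights (kg) of N calves, simulated here
N <- 10000
population <- rnorm(N, mean = 40, sd = 5)
mu <- mean(population)  # population parameter

# Two independent random samples of size n, drawn without replacement
n <- 25
sample1 <- sample(population, size = n)
sample2 <- sample(population, size = n)

# Sample statistics estimate the parameter and differ between samples
c(parameter = mu, sample1 = mean(sample1), sample2 = mean(sample2))
```

The two sample means will generally differ from each other and from the population mean, which is why inference has to account for sampling variation.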
 
@@ -65,7 +89,7 @@ knitr::include_graphics("images/Map_cows.pdf")
 
 The measure of central tendency is a summary measure that represents the center point of a whole data set and indicates the typical value of the data. The three measures of central tendency are the mean, median, and mode. 
 
-In the above example the mean weight of newborn dairy cows in California was collected from a random sample of the entire popualtion. In statistics, the **mean** is the average value from the collected data set. Advantages of using the mean are that most other statitcs such as variance and standard deviations can be determined algebraically using the mean. The variable that represents the mean for a popualtion is 
+In the above example the mean weight of newborn dairy calves in California was estimated from a random sample of the entire population. In statistics, the **mean** is the average value of the collected data set. An advantage of using the mean is that most other statistics, such as the variance and standard deviation, can be determined algebraically using the mean. The symbol that represents the mean for a population is 
 
 $$\mu$$ 
 
@@ -77,17 +101,16 @@ The **median** of a sample data set is the middle value when the data are arrang
 
 The **mode** of a sample data set is the value that occurs most frequently. Since several values can tie for the highest frequency, a data set may have more than one mode. Alternatively, if no value occurs more often than the others, the data set has no mode. 
 
-For the sample data set in table \@ref(tab:PR_seagrass)) the mean, median, and mode can be calculated as:
+For the sample data set in Figure \@ref(fig:photoRateSeagrass), the sample mean, sample median, and sample mode can be calculated as:
 
-$$ \bar{x} = \frac{y_1 + y_2 + y_3 + y_4 + y_5 ... y_{10}} {n}$$
-For the data in table 1.1:
-$$mean = \frac{180 + 166 + 226 + 226 + 206 ... 154}{10}.$$
+$$ \bar{y} = \frac{y_1 + y_2 + y_3 + y_4 + y_5 + \ldots + y_{10}} {n}$$
+For the data in Figure \@ref(fig:photoRateSeagrass):
+$$mean = \frac{180 + 166 + 226 + 206 + 197 + \ldots + 154}{10} = 194.9$$
 
-$${y_1 , y_2 , y_3 , y_4, y_5, y_6, y_7}$$
 
-$${median = y_4}$$
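+$$median = \frac{y_{(5)} + y_{(6)}}{2}, \quad \text{where } y_{(k)} \text{ is the } k\text{th smallest of the 10 values}$$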
 
-For the data in table 1.1:
+For the data in Figure \@ref(fig:photoRateSeagrass):
 $${180, 166, 226, 206, 197, 180, 108, 243, 289, 154}$$
 
 $$median = \frac{197 + 180}{2}.$$
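
As a quick check of the hand calculations, the three measures can be computed in R from the ten values listed above. This is a minimal sketch; the object names (`y`, `y_mean`, `y_median`, `y_mode`, `counts`) are chosen here for illustration and are not part of the chapter.

```r
# Photosynthetic rates of the 10 shoots, as listed in the text
y <- c(180, 166, 226, 206, 197, 180, 108, 243, 289, 154)

n <- length(y)
y_mean <- sum(y) / n   # same result as mean(y)
y_median <- median(y)  # n is even: average of the two middle sorted values

# Base R has no function for the statistical mode, so tabulate the values
# and keep the one(s) with the highest count
counts <- table(y)
y_mode <- as.numeric(names(counts)[counts == max(counts)])

c(mean = y_mean, median = y_median, mode = y_mode)
```

For these data the code gives a mean of 194.9, a median of 188.5, and a mode of 180. If several values tied for the highest count, `y_mode` would contain all of them.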
@@ -97,7 +120,7 @@ $$ mode = 180$$
 
 <br>
 
-```{r PR_seagrass, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '50%', fig.align='center', echo=FALSE, fig.cap ="Sample data set of photosynthetic rate for 10 different shoots of the seagrass *Zostera marina*."}
+```{r photoRateSeagrass, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '30%', fig.align='center', echo=FALSE, fig.cap ="Sample data set of photosynthetic rate for 10 different shoots of the seagrass *Zostera marina*."}
 
 knitr::include_graphics("images/PR_seagrass.pdf")
 
@@ -206,21 +229,6 @@ The CV then describes the variability---in the form of the standard deviation---
 
 It should be noted that the CV should only be used when the measurement scale of a data set is on the ratio scale rather than the interval scale. Interval scales can have arbitrary zero values, which can produce nonsensical CV values. In addition, the CV is sensitive to mean values close to zero, which can result in an inflated CV value. 
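
The effect of an arbitrary zero can be seen in a short sketch. The temperature readings below are hypothetical, and `cv()` is a helper defined here for illustration, not a base R function.

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

# Hypothetical temperatures: the same five readings on two scales
temp_celsius <- c(1, 2, 3, 4, 5)       # interval scale, arbitrary zero
temp_kelvin <- temp_celsius + 273.15   # ratio scale, true zero

cv(temp_celsius)  # about 0.53
cv(temp_kelvin)   # about 0.0057
```

The spread of the readings is identical on both scales, yet the CV differs by roughly two orders of magnitude because the Celsius mean lies close to an arbitrary zero.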
 
-## Data curation
-
-input
-output
-formatting
-variable types
-naming variables
-
-
-
-
-## Numerical  and graphical summaries
-
-## Data description vs. Estimation vs. Inference
-
 
 ## Exercises and Solutions
 
@@ -232,12 +240,12 @@ naming variables
 
 Prepare an .Rmd document starting with the following text, where you substitute the corresponding information for author name and date.
 
----
+"---" Unquote the three dashes
 title: "Lab01: Data Exploration and Summaries"
 author: "YourFirstName YourLastName"
 date: "today's date here"
 output: html_document
----
+"---" Unquote the three dashes
 
 ```{r setup, include=FALSE}
 
@@ -470,7 +478,7 @@ Your answer here:
 
 Make a data frame and a nicely formatted table containing the following statistics: mean, median, range, minimum, maximum, standard deviation, coefficient of variation and sample size for each of the measurements (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) in myiris data. What variable has the most variation relative to the average? [15 points]
 
-```{r summary.table, echo=FALSE}
+```{r summary.table1, echo=FALSE}
 
 mymeans <- sapply(X = myiris[ , 1:4], FUN = mean) # put means in a column
 
@@ -515,7 +523,7 @@ What variable has the most variation relative to the average? Your answer here:
 
 Make and print a nicely formatted frequency table for the sepal length of the flowers. Calculate the number of classes (bins) using the formula in the Lab01 instructions. Make a histogram using your calculations "by-hand" and then a new histogram using the hist() function. [15 points]
 
-```{r iris.hist, echo = TRUE, include = TRUE}
+```{r iris.hist1, echo = TRUE, include = TRUE}
 
 (sample.size <- length(myiris$Sepal.Length))
 
@@ -538,7 +546,7 @@ hist(myiris$Sepal.Length) # by default it uses Sturges rule for bins.
 
 Create two box and whisker plots for the sepal length. Label at least three of the elements in the plot. Create a new box and whisker plot with a *range* argument value of 0.5. [15 points]
 
-```{r iris.box, echo = TRUE, include = TRUE}
+```{r iris.box1, echo = TRUE, include = TRUE}
 
 boxplot(myiris$Sepal.Length) # enter the data frame name, $ and the variable name to designate the data used by boxplot
 
@@ -564,7 +572,7 @@ Your answer here:
 
 What function allows you to **combine** numbers to make numeric vectors? Use the function to make a vector with the numbers 1, 3, 4, 9 and call it "sa.vector." [5 points]
 
-```{r iris.c, echo = TRUE, include = TRUE}
+```{r iris.c1, echo = TRUE, include = TRUE}
 
 
 
@@ -577,20 +585,20 @@ What function allows you to **combine** numbers to make numeric vectors? Your an
 
 Knit this file into html. [10 points]
 
-##----------------END PLANT SCIENCES LAB-----------------###
+<!-- ##----------------END PLANT SCIENCES LAB-----------------### -->
 
 ### Animal Sciences Lab
 
 Prepare an .Rmd document starting with the following text, where you substitute the corresponding information for author name and date.
 
----
+"---" Unquote the three dashes
 title: "Lab01: Data Exploration and Summaries"
 author: "YourFirstName YourLastName"
 date: "today's date here"
 output: html_document
----
+"---" Unquote the three dashes
 
-```{r setup, include=FALSE}
+```{r setup1, include=FALSE}
 
 knitr::opts_chunk$set(echo = TRUE)
 
@@ -641,11 +649,12 @@ Load the Heifer data into a data frame called myheifer, and get its structure. U
 
 ```{r}
 
-myheifer <- read.table('Lab01HeiferData.csv', header=TRUE, sep=',') # put heifer data into dataframe object or container called myheifer
+myheifer <- read.table("Datasets/Lab01HeiferData.csv", header = TRUE, sep = ",") # put heifer data into dataframe object or container called myheifer
 
 str(myheifer)
 
-#Just reapting the data import process but instead of reading a CSV file, you can directly enter the data in text format
+# Instead of reading a CSV file, you can directly enter the data in text format
+
 myheifer <- read.table(header = TRUE, text = "
 Birth_weight	Wean_weight	Yearling_weight
 81	660	902