EAL integrated a lot of chapters and repositioned sections. All chapters knit and the book compiles as a whole. Ch 4 and 8 need revision by EAL. Ch 10 and 11 have to be expanded. After this push is merged with master, authors must get a new branch or clone and work according to instructions in TODO.txt. (#23)
emilioalaca authored Sep 24, 2018
1 parent a93cd75 commit 224f965
Showing 55 changed files with 4,115 additions and 4,841 deletions.
Binary file modified .DS_Store
Binary file not shown.
1,022 changes: 511 additions & 511 deletions .Rhistory (1)

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion 00.1_FrontMatter.Rmd
@@ -1,6 +1,6 @@
---
title: "PLS 120. Introduction to Applied Statistics"
author: "Emilio A. Laca, Jennifer Brazeal, Cale Miller, Stephanie, Zullo"
author: "Emilio A. Laca, Jennifer Brazeal, Cale Miller, Stephanie Zullo"
date: "2018-02-16"
site: bookdown::bookdown_site
documentclass: book
62 changes: 38 additions & 24 deletions 01.0_Intro2Stats.Rmd

Large diffs are not rendered by default.

118 changes: 68 additions & 50 deletions 02.0_Rcomputation.Rmd

Large diffs are not rendered by default.

6 changes: 5 additions & 1 deletion 03.0_MathSymbols.Rmd
@@ -5,6 +5,7 @@ output:
number_sections: yes
theme: readable
toc: yes
toc_depth: 3
---

# Required Math Skills and Symbols {#chMath}
@@ -381,7 +382,7 @@ where $e = \hat{\epsilon} = (Y_i-\hat{\mu})$ are the deviations of each cow from
\end{equation}
<br>

Suppose that the estimated mean and variance were $28.3 \ kg \ day^{-1}$ and $25 \ kg^2 \ day^{-2}$. USDA reported an average production of $28.3 \ kg \ day^{-1}$ of milk per milking cow in 2016 (https://www.nass.usda.gov/Publications/Ag_Statistics/2017/Chapter08.pdf). The variance was not reported, so we use a fictitious number: $\hat\sigma^2 = 25$. The area under the curve (Figure \@ref(fig:propLtMean)) is calculated in R as follows (note how the code is written in several lines and indented for better readability):
Suppose that the estimated mean and variance were $28.3 \ kg \ day^{-1}$ and $25 \ kg^2 \ day^{-2}$. USDA reported an average production of $28.3 \ kg \ day^{-1}$ of milk per milking cow in 2016. <a href="https://www.nass.usda.gov/Publications/Ag_Statistics/2017/Chapter08.pdf" target="_blank">See report here</a>. The variance was not reported, so we use a fictitious number: $\hat\sigma^2 = 25$. The area under the curve (Figure \@ref(fig:propLtMean)) is calculated in R as follows (note how the code is written in several lines and indented for better readability):

```{r}

@@ -904,7 +905,10 @@ The shape of polynomials can be modified by changing their parameter values. The
```
<br>
<br>


## Symbols and Terms{#mathSymbls}

Term or symbol| Explanation
------------| ----------------------------------------------------------
Observation | Result of measuring one or several variables in one unit of experimental material. For example, the milk production of one cow in a day.
95 changes: 52 additions & 43 deletions 04.0_DataExploration.Rmd
@@ -1,14 +1,17 @@
---
output:
html_document: default
pdf_document: default
output:
html_document:
fig_caption: yes
number_sections: yes
theme: readable
toc: yes
toc_depth: 3
---

```{r setup0, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

#Updated 9-16
```{r message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE, include=FALSE}
# load packages for chapter

@@ -45,11 +48,32 @@ The next one has to be explained more:

## Data curation

Within a statistical framework, the term population refers to the entire set of units, or of measurements of a specific characteristic, that is being examined. A descriptive measure of a population is referred to as a parameter. For example, population parameters can be the number of all dairy cows in California (population size = *N*) or the mean body weight of all dairy calves in California. To determine these parameters, a survey of the entire population would need to be conducted, which is often not feasible. In such cases, inferential statistics is used: representative random samples are taken from the population and used to infer conclusions about the entire population.
See Wikipedia entry on <a href="https://en.wikipedia.org/wiki/Data_curation" target = "_blank">Data Curation</a>.

When random samples are taken from a population, descriptive measures calculated from a sample are referred to as statistics. Whereas *N* (capital letter) refers to the size of the entire population of dairy cows in California, the size of a random sample of those cows is designated by *n* (lower-case).

In the figure below (\@ref(fig:Map_cows)) two random samples were drawn from the California dairy cow population to infer the average weight of newborn calves for the entire population. In this example the sample statistic is the mean weight of newborn dairy calves in each sample.
See message from Duncan Lang and Vessela Ensberg:

The content that you want to cover is very similar to one of my seminars on keeping data tidy and organized (slides here) and data sharing (slides attached). If you prefer, I can adapt this content for your class. I have presented data management sessions for a number of graduate seminar classes.

For references, I’d recommend starting with FAIR data:
https://www.nature.com/articles/sdata201618
https://www.force11.org/group/fairgroup/fairprinciples

For tidy data, Data carpentry has some good guidelines:
http://www.datacarpentry.org/semester-biology/materials/tidy-data/

New England Collaborative Data Management Curriculum has extensive material and lesson plans on data management:
https://library.umassmed.edu/necdmc/modules



## Data comes from samples

Within a statistical framework, the term population refers to the entire set of units, or of measurements of a specific characteristic, that is being examined. A descriptive measure of a population is referred to as a parameter. For example, population parameters can be the number of all dairy cows in California (population size = *N*) or the mean body weight of all dairy calves in California. To determine these parameters, a survey of the entire population would need to be conducted, which is often not feasible. In such cases, inferential statistics is used: representative random samples are taken from the population and used to infer conclusions about the entire population.

When random samples are taken from a population, descriptive measures calculated from a sample are referred to as statistics. Whereas *N* (capital letter) refers to the size of the entire population of dairy cows in California, the size of a random sample of those cows is designated by *n* (lower-case).

In the figure below (\@ref(fig:Map_cows)) two random samples were drawn from the California dairy cow population to infer the average weight of newborn calves for the entire population. In this example the sample statistic is the mean weight of newborn dairy calves in each sample.

<br>
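
The difference between a parameter and a statistic can be illustrated with a small simulation. The sketch below does not use real data: the population of calf birth weights is simulated, and the mean of 40 kg, the standard deviation of 5 kg, the population size and the sample size are all assumptions chosen only for illustration.

```{r}
# Hypothetical population of N calf birth weights (kg); all values are simulated
set.seed(120)
N <- 10000
population <- rnorm(N, mean = 40, sd = 5)

# Parameter: computed from every unit in the population
mu <- mean(population)

# Statistics: computed from random samples of size n
n <- 25
sample1 <- sample(population, size = n)
sample2 <- sample(population, size = n)
ybar1 <- mean(sample1)
ybar2 <- mean(sample2)

# Compare the parameter with the two sample estimates
c(mu = mu, ybar1 = ybar1, ybar2 = ybar2)
```

The two sample means will generally differ from each other and from the population mean, which is why a statistic is treated as an estimate of the parameter rather than as its exact value.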

@@ -65,7 +89,7 @@ knitr::include_graphics("images/Map_cows.pdf")

A measure of central tendency is a summary value that represents the center of a data set and indicates a typical value of the data. The three measures of central tendency are the mean, median, and mode.

In the above example the mean weight of newborn dairy calves in California was estimated from a random sample of the entire population. In statistics, the **mean** is the average value of the collected data set. An advantage of the mean is that most other statistics, such as the variance and standard deviation, can be determined algebraically from it. The variable that represents the mean for a population is
In the above example the mean weight of newborn dairy calves in California was estimated from a random sample of the entire population. In statistics, the **mean** is the average value of the collected data set. An advantage of the mean is that most other statistics, such as the variance and standard deviation, can be determined algebraically from it. The symbol that represents the mean for a population is

$$\mu$$

@@ -77,17 +101,16 @@ The **median** of a sample data set is the middle value when the data are arrang

The **mode** of a sample data set is the value that occurs most frequently. Because several values can tie for the highest frequency, a data set may have more than one mode. If no value occurs more often than the others, the data set has no mode.

For the sample data set in table \@ref(tab:PR_seagrass)) the mean, median, and mode can be calculated as:
For the sample data set in table \@ref(fig:photoRateSeagrass) the average, sample median, and sample mode can be calculated as:

$$ \bar{x} = \frac{y_1 + y_2 + y_3 + y_4 + y_5 ... y_{10}} {n}$$
For the data in table 1.1:
$$mean = \frac{180 + 166 + 226 + 226 + 206 ... 154}{10}.$$
$$ \bar{x} = \frac{y_1 + y_2 + y_3 + y_4 + y_5 + \ldots + y_{10}} {n}$$
For the data in table \@ref(fig:photoRateSeagrass):
$$mean = \frac{180 + 166 + 226 + 206 + 197 + \ldots + 154}{10} = 194.9$$

$${y_1 , y_2 , y_3 , y_4, y_5, y_6, y_7}$$

$${median = y_4}$$
$${median = 206}$$

For the data in table 1.1:
For the data in table:
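$$\{180, 166, 226, 206, 197, 180, 108, 243, 289, 154\}$$

Arranged in ascending order, the values are $\{108, 154, 166, 180, 180, 197, 206, 226, 243, 289\}$. With an even number of observations, the median is the average of the two middle values:

$$median = \frac{180 + 197}{2} = 188.5$$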
@@ -97,7 +120,7 @@ $$ mode = 180$$

<br>
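
The same three measures can be checked in R. This is a minimal sketch using the ten photosynthetic rate values listed above; the object names `photo_rate`, `freq` and `sample_mode` are chosen here for illustration. Base R has no built-in function for the statistical mode (`mode()` returns the storage type of an object), so the mode is obtained from a frequency table.

```{r}
# Photosynthetic rates of the 10 seagrass shoots from the sample data set
photo_rate <- c(180, 166, 226, 206, 197, 180, 108, 243, 289, 154)

mean(photo_rate)    # 194.9
median(photo_rate)  # 188.5, the average of the two middle values 180 and 197

# Mode: the value(s) with the highest frequency in the frequency table
freq <- table(photo_rate)
sample_mode <- as.numeric(names(freq)[freq == max(freq)])
sample_mode         # 180
```

If several values tie for the highest frequency, `sample_mode` contains all of them. When every value occurs exactly once, this code returns every value, which should be read as the data set having no mode.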

```{r PR_seagrass, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '50%', fig.align='center', echo=FALSE, fig.cap ="Sample data set of photosynthetic rate for 10 different shoots of the seagrass *Zostera marina*."}
```{r photoRateSeagrass, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '30%', fig.align='center', echo=FALSE, fig.cap ="Sample data set of photosynthetic rate for 10 different shoots of the seagrass *Zostera marina*."}

knitr::include_graphics("images/PR_seagrass.pdf")

@@ -206,21 +229,6 @@ The CV then describes the variability---in the form of the standard deviation---

The CV should be used only for data measured on a ratio scale, not on an interval scale. Interval scales have arbitrary zero points, which can produce nonsensical CV values. In addition, the CV is sensitive to mean values close to zero, which can inflate it.
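
The effect of an arbitrary zero can be seen with a short example. The sketch below assumes the CV is expressed as a percentage, $CV = 100 \times s / \bar{x}$, and uses five made-up temperature readings. The same readings are expressed in degrees Celsius (an interval scale) and in kelvin (a ratio scale).

```{r}
# Hypothetical temperature readings; the values are invented for illustration
temp_c <- c(10, 12, 15, 11, 14)   # degrees Celsius: the zero point is arbitrary
temp_k <- temp_c + 273.15         # kelvin: the zero point is a true zero

# Adding a constant does not change the standard deviation ...
sd(temp_c)
sd(temp_k)

# ... but it changes the mean, and therefore the CV
cv_c <- sd(temp_c) / mean(temp_c) * 100
cv_k <- sd(temp_k) / mean(temp_k) * 100
c(cv_celsius = cv_c, cv_kelvin = cv_k)
```

The variability of the readings is identical in both units, yet the two CV values differ widely. Only the value computed on the ratio scale can be interpreted as variation relative to the mean.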

## Data curation

input
output
formatting
variable types
naming variables




## Numerical and graphical summaries

## Data description vs. Estimation vs. Inference


## Exercises and Solutions

@@ -232,12 +240,12 @@ naming variables

Prepare an .Rmd document starting with the following text, where you substitute the corresponding information for author name and date.

---
"---" Unquote the three dashes
title: "Lab01: Data Exploration and Summaries"
author: "YourFirstName YourLastName"
date: "today's date here"
output: html_document
---
"---" Unquote the three dashes

```{r setup, include=FALSE}

@@ -470,7 +478,7 @@ Your answer here:

Make a data frame and a nicely formatted table containing the following statistics: mean, median, range, minimum, maximum, standard deviation, coefficient of variation and sample size for each of the measurements (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) in myiris data. What variable has the most variation relative to the average? [15 points]

```{r summary.table, echo=FALSE}
```{r summary.table1, echo=FALSE}

mymeans <- sapply(X = myiris[ , 1:4], FUN = mean) # put means in a column

@@ -515,7 +523,7 @@ What variable has the most variation relative to the average? Your answer here:

Make and print a nicely formatted frequency table for the sepal length of the flowers. Calculate the number of classes (bins) using the formula in the Lab01 instructions. Make a histogram using your calculations "by-hand" and then a new histogram using the hist() function. [15 points]

```{r iris.hist, echo = TRUE, include = TRUE}
```{r iris.hist1, echo = TRUE, include = TRUE}

(sample.size <- length(myiris$Sepal.Length))

@@ -538,7 +546,7 @@ hist(myiris$Sepal.Length) # by default it uses Sturges rule for bins.

Create two box and whisker plots for the sepal length. Label at least three of the elements in the plot. Create a new box and whisker plot with a *range* argument value of 0.5. [15 points]

```{r iris.box, echo = TRUE, include = TRUE}
```{r iris.box1, echo = TRUE, include = TRUE}

boxplot(myiris$Sepal.Length) # enter the data frame name, $ and the variable name to designate the data used by boxplot

@@ -564,7 +572,7 @@ Your answer here:

What function allows you to **combine** numbers to make numeric vectors? Use the function to make a vector with the numbers 1, 3, 4, 9 and call it "sa.vector". [5 points]

```{r iris.c, echo = TRUE, include = TRUE}
```{r iris.c1, echo = TRUE, include = TRUE}



@@ -577,20 +585,20 @@ What function allows you to **combine** numbers to make numeric vectors? Your an

Knit this file into html. [10 points]

##----------------END PLANT SCIENCES LAB-----------------###
<!-- ##----------------END PLANT SCIENCES LAB-----------------### -->

### Animal Sciences Lab

Prepare an .Rmd document starting with the following text, where you substitute the corresponding information for author name and date.

---
"---" Unquote the three dashes
title: "Lab01: Data Exploration and Summaries"
author: "YourFirstName YourLastName"
date: "today's date here"
output: html_document
---
"---" Unquote the three dashes

```{r setup, include=FALSE}
```{r setup1, include=FALSE}

knitr::opts_chunk$set(echo = TRUE)

@@ -641,11 +649,12 @@ Load the Heifer data into a data frame called myheifer, and get its structure. U

```{r}

myheifer <- read.table('Lab01HeiferData.csv', header=TRUE, sep=',') # put heifer data into dataframe object or container called myheifer
myheifer <- read.table("Datasets/Lab01HeiferData.csv", header = TRUE, sep = ",") # put heifer data into dataframe object or container called myheifer

str(myheifer)

#Just reapting the data import process but instead of reading a CSV file, you can directly enter the data in text format
# Instead of reading a CSV file, you can directly enter the data in text format

myheifer <- read.table(header = TRUE, text = "
Birth_weight Wean_weight Yearling_weight
81 660 902