ims-tutorials/01-data/index.Rmd at main · OpenIntroStat/ims-tutorials · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
---
title: "Introduction to data"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

<img style="float: right; margin: 0px 0px 20px 20px" src="../logo/openintro-hex.png" alt="Tutorial illustration" width="250" height="300">

## Tutorial description

Scientists seek to answer questions using rigorous methods and careful observations. These observations - collected from the likes of field notes, surveys, and experiments - form the backbone of a statistical investigation and are called data. Statistics is the study of how best to collect, analyze, and draw conclusions from data. It is helpful to put statistics in the context of a general process of investigation:

1. Identify a question or problem.
2. Collect relevant data on the topic.
3. Analyze the data.
4. Form a conclusion.

In this tutorial, we focus on steps 1 and 2 of this process.

## Learning objectives

* Internalize the language of data.
* Load and view a dataset in R and distinguish between various variable types.
* Classify a study as observational or experimental, and determine whether the study's results can be generalized to the population and whether they suggest correlation or causation between the variables studied.
* Distinguish between various sampling strategies and recognize the benefits and drawbacks of choosing one strategy over another.
* Identify the principles of experimental design and recognize their purposes.
* Apply terminology and principles to a case study.


## Lessons

### 1 - [Language of data](https://openintro.shinyapps.io/ims-01-data-01/)

* Load data from the textbook companion package
* Introduce data frames and tidy data
* Discuss variable types connecting terminology from textbook to R
* Mutate data frames to convert data types
  * Numerical to categorical conversion
  * Filtering and then drop levels
  * Combine levels of categorical
  * Create new variable based on two existing variables (e.g. BMI)

### 2 - [Types of studies](https://openintro.shinyapps.io/ims-01-data-02/)


* Define observational studies and experiments
* Discuss scope of inference 2x2 grid with random assignment and sampling
* Define Simpson's paradox in a 2 cat var case
  * Use R to make a contingency table
  * `group_by()` third variable and make table again to demonstrate Simpson's paradox

### 3 - [Sampling strategies and experimental design](https://openintro.shinyapps.io/ims-01-data-03/)

* Define simple random sample, stratified sample, cluster sample, multistage sample
* Use R to obtain SRS and stratified sample
	* `slice_sample()`
	* `group_by()` |>  `slice_sample()`
* Discuss benefits and drawbacks of choosing one sampling scheme over another

* Identify the principles of experimental design
* Discuss the purpose of each principle
* Use R to do random assignment and random assignment after blocking
	* `slice_sample()`
	* `group_by()` |>  `slice_sample()`

### 4 - [Case study](https://openintro.shinyapps.io/ims-01-data-04/)

* Apply terminology, principles, and R code learned in this tutorial to a case study

## Instructor

<img style="float: left; margin: 0px 20px 20px 0px" src="../instructor-photos/mine.png" alt="Mine Çetinkaya-Rundel" width="150" height="150">

### Mine Çetinkaya-Rundel

#### University of Edinburgh, Duke University, RStudio

Mine Çetinkaya-Rundel is Associate Professor of the Practice position at the Department of Statistical Science at Duke University and Data Scientist and Professional Educator at RStudio.
Mine's work focuses on innovation in statistics and data science pedagogy, with an emphasis on computing, reproducible research, student-centered learning, and open-source education as well as pedagogical approaches for enhancing retention of women and under-represented minorities in STEM.
Mine works on integrating computation into the undergraduate statistics curriculum, using reproducible research methodologies and analysis of real and complex datasets.
She also organizes [ASA DataFest](https://ww2.amstat.org/education/datafest/), an annual two-day competition in which teams of undergraduate students work to reveal insights into a rich and complex dataset.
Mine has been working on the [OpenIntro](openintro.org) project since its founding and as part of this project she co-authored four open-source introductory statistics textbooks (including this one!).
She is also the creator and maintainer of [datasciencebox.org](https://datasciencebox.org/) and she teaches the popular Statistics with R MOOC on Coursera.