-
Notifications
You must be signed in to change notification settings - Fork 37
Expand file tree
/
Copy pathindex.Rmd
More file actions
88 lines (63 loc) · 3.08 KB
/
index.Rmd
File metadata and controls
88 lines (63 loc) · 3.08 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
---
title: "Exploratory data analysis"
output: html_document
---
<img style="float: right; margin: 0px 0px 20px 20px" src="../logo/openintro-hex.png" alt="Tutorial illustration" width="250" height="300">
## Tutorial description
When your dataset is represented as a table or a database, it's difficult to observe much about it beyond its size and the types of variables it contains.
In this tutorial, you'll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. Which variables suggest interesting relationships? Which observations are unusual?
By the end of the tutorial, you'll be able to answer these questions and more, while generating graphics that are both insightful and beautiful.
## Learning objectives
* Visualize categorical and numerical data using appropriate graphics.
* Create graphical representations of multiple variables.
* Describe the structure revealed by graphics in the language of distributions.
* Use statistics to summarize important aspects of data.
* Identify unusual observations
## Lessons
### 1 - [Visualizing categorical data](https://openintro.shinyapps.io/ims-02-explore-01/)
Creating graphical and numerical summaries of two categorical variables, primarily using two R packages: ggplot2 and dplyr.
* Graphical representation of two categorical variables
* Side-by-side bar charts
* Stacked bar charts
* To normalize or not to normalize
* Tabular representation of two categorical variables
* Computation of margins
* Counts vs proportions
* Law of total probability
* Graphical representation of one categorical variable
* Marginal vs conditional
* Bar chart
* ordering
* data integrity check for levels
* Pie chart
### 2 - [Visualizing numerical data](https://openintro.shinyapps.io/ims-02-explore-02/)
Learn useful statistics for describing distributions of data.
* Graphical representation of one categorical and one numerical variable
* Side-by-side boxplots
* Faceted histograms
* Colored density curves
* Graphical representation of one numerical variable
* Marginal vs conditioning
* Histogram
* binwidth
* Density plot
* bandwidth
* Boxplot
* outlier detection
### 3 - [Summarizing with statistics](https://openintro.shinyapps.io/ims-02-explore-03/)
Statistics for describing distributions of data.
* Center: mean, median, mode
* Shape: skewness, modality
* Spread: range, IQR, SD, variance
* Unusual observations
* Transformations: Logarithm and sqrt to reduce skew in graphics and ease comparisons.
### 4 - [Case study](https://openintro.shinyapps.io/ims-02-explore-04/)
Apply what you've learned to explore and summarize a real world dataset in this case study of email spam.
## Additional references
* Unwin, Anthony. *Graphical Data Analysis with R*.
* Velleman, Paul and Hoaglin, David. *Exploratory Data Analysis*.
## Instructor
<img style="float: left; margin: 0px 20px 20px 0px" src="../instructor-photos/andrew.png" alt="Andrew Bray" width="150" height="150">
### Andrew Bray
#### Reed College
Andrew Bray is an Assistant Professor of Statistics at Reed College and lover of all things statistics and R.