
Commit 543983e

Dylan Childs authored and committed
starting dplyr chapters
1 parent 2c45ac3 commit 543983e

5 files changed: +2334 -81 lines changed

2_03_tidy_data_dplyr_intro.Rmd

+85
@@ -0,0 +1,85 @@
1+
# **dplyr** and the tidy data concept
2+
3+
## Introduction
4+
5+
[Data wrangling]
6+
7+
## The value of **dplyr** {#why-dplyr}
8+
9+
The **dplyr** package has been very carefully designed to make it easy to manipulate data frames and similar objects. One reason for its ease-of-use is that **dplyr** is very consistent in the way its functions are designed. For example, the first argument of the main **dplyr** functions is always an object containing our data. This consistency makes it very easy to get to grips with each function---it's usually possible to understand how one works by seeing just one or two examples.
10+
11+
A second reason for favouring **dplyr** is that it is orientated around a few key functions, each of which is designed to do one thing well. The key **dplyr** functions are often referred to as "verbs", reflecting the fact that they "do something" to data. For example: (1) `select` is used to obtain a subset of variables; (2) `filter` is used to obtain a subset of rows; (3) `arrange` is used to reorder rows; (4) `mutate` is used to construct new variables; and (5) `summarise` is used to calculate information about groups. The other topics in this block will cover each of these verbs in detail, as well as a few additional functions such as `rename` and `group_by`.
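To give a flavour of the five verbs before they are covered properly, here is a minimal sketch that applies each one to the built-in `iris` data frame. The object names (`petals`, `setosa`, and so on) and the new variable names are our own choices for illustration. Note that `summarise` is used here without any grouping, so it returns a single overall summary rather than one row per group.

```{r}
library("dplyr")

# (1) select: keep a subset of variables (columns)
petals <- select(iris, Species, Petal.Length, Petal.Width)

# (2) filter: keep a subset of rows
setosa <- filter(iris, Species == "setosa")

# (3) arrange: reorder rows, here by increasing sepal length
by_sepal <- arrange(iris, Sepal.Length)

# (4) mutate: construct a new variable from existing ones
with_area <- mutate(iris, Petal.Area = Petal.Length * Petal.Width)

# (5) summarise: collapse the data to a summary (ungrouped, so one row)
overall <- summarise(iris, Mean.Petal.Length = mean(Petal.Length))

# look at the first few rows of one result, and at the summary
head(petals, 3)
overall
```

In every case the first argument is the object containing the data, which is the consistency described above.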
12+
13+
```{block, type="well"}
14+
The developers of RStudio have produced a [handy cheat sheet](http://www.rstudio.com/resources/cheatsheets/) that summarises data wrangling with **dplyr**. Our advice is to download this, print out a copy and refer to it as you work through the remaining topics in this block.
15+
```
16+
17+
Apart from being easy to use, **dplyr** is also fast compared to base R functions. This will not matter much for the small data sets we'll be using, but when working with very big data sets, **dplyr** is a good option. **dplyr** also allows you to work with data stored in different ways. For example, **dplyr** can interact directly with a number of database systems. This is well beyond the scope of this course, but it is worth knowing you can do this in case you find yourself faced with a real database one day.
18+
19+
## Tidy data
20+
21+
**dplyr** will work with any data frame, but it is at its most powerful when our data are organised in the [tidy](http://vita.had.co.nz/papers/tidy-data.pdf) format. The word "tidy" has a very specific meaning in this context. Tidy data sets have a specific structure that makes them easy to manipulate, model and visualise. A tidy data set is one where each variable has its own column and each observation has its own row. This might seem like the "obvious" way to organise data, but many people fail to adopt this convention.
22+
23+
We aren't going to explore the tidy data concept in great detail. However, the basic principles are not difficult to understand, especially after seeing an example. Let's return to the made-up experiment that investigated the response of field plots to fertilizer addition. This time, imagine we had only measured biomass, but that we had done this twice over the course of the experiment.
24+
25+
We'll make up some data for this experiment again and then look at two ways to organise it to help us understand the tidy data idea. The first way uses a separate column for each biomass measurement:
26+
```{r, echo=FALSE}
27+
trt <- rep(c("Control", "Fertiliser"), each = 3)
28+
bms1 <- c(284, 328, 291, 956, 954, 685)
29+
bms2 <- c(324, 400, 355, 1197, 1012, 859)
30+
experim.data <- data.frame(Treatment = trt, BiomassT1 = bms1, BiomassT2 = bms2)
31+
experim.data
32+
```
33+
This may seem like a reasonable way to store the data, especially for experienced Excel users. However, this format is not __tidy__. Why? The biomass variable has been split across two columns, which means each row corresponds to two observations.
34+
35+
We won't go into the "whys" here, but take our word for it: adopting this format makes it difficult to efficiently work with data. This is not really an R-specific problem. This untidy format is sub-optimal in many other data analysis environments.
36+
37+
A tidy version of the example dataset would still have three columns, but now these would be: "Treatment", denoting the experimental treatment applied; "Time", denoting the sampling occasion; and "Biomass", denoting the biomass measured. A data frame in the tidy format looks like this:
38+
```{r, echo=FALSE}
39+
trt <- rep(c("Control", "Fertiliser"), each = 3, times = 2)
40+
stm <- rep(c("T1","T2"), each = 6)
41+
bms <- c(bms1, bms2)
42+
experim.data <- data.frame(Treatment = trt, Time = stm, Biomass = bms)
43+
experim.data
44+
```
45+
These data are tidy: each variable is in only one column, and each observation has its own unique row. They are now ready to use with **dplyr**.
46+
47+
```{block, type="warning"}
48+
#### Always try to start with tidy data
49+
50+
The best way to make sure your data are tidy is to store them in that format __when they are first collected and recorded__. There are packages that can help convert data that are not tidy into the tidy data format (e.g. the **tidyr** package), but life is much simpler if you just make sure your data are tidy from the very beginning.
51+
```
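For data that have already been recorded in the untidy layout, the conversion mentioned above can be sketched with **tidyr**. This is only an illustration under some assumptions: it requires a version of **tidyr** that provides `pivot_longer` (1.0.0 or later), and the object names `wide_data` and `tidy_data` are ours. The values are the made-up biomass measurements from the example earlier in this section.

```{r}
library("tidyr")

# the untidy layout: one column per sampling occasion
wide_data <- data.frame(
  Treatment = rep(c("Control", "Fertiliser"), each = 3),
  BiomassT1 = c(284, 328, 291, 956, 954, 685),
  BiomassT2 = c(324, 400, 355, 1197, 1012, 859)
)

# gather the two biomass columns into a Time / Biomass pair;
# names_prefix strips "Biomass" so Time holds "T1" and "T2"
tidy_data <- pivot_longer(wide_data,
                          cols = c(BiomassT1, BiomassT2),
                          names_to = "Time",
                          names_prefix = "Biomass",
                          values_to = "Biomass")
tidy_data
```

The rows come out in a different order from the hand-built tidy data frame shown earlier, but the content is the same: one row per observation, and one column each for `Treatment`, `Time` and `Biomass`.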
52+
53+
## A few basic **dplyr** features {#more-dplyr}
54+
55+
Now that we know about the `iris` and `storms` datasets we can finish up this chapter by reviewing a couple of basic features of the **dplyr** package. **dplyr** is not part of the base R installation, so we must install it first:
56+
```{r, eval=FALSE}
57+
install.packages("dplyr")
58+
```
59+
Remember, we only have to install **dplyr** once, but we have to add `library(dplyr)` to the top of any scripts that use it:
60+
```{r}
61+
library("dplyr")
62+
```
63+
64+
### Tibble (`tbl`) objects
65+
66+
The main purpose of the **dplyr** package is to make manipulating data easier. To facilitate this, **dplyr** also implements a special kind of data object called a "tibble". These are `tbl` objects. You can think of a tibble as a special data frame with a few extra bells and whistles. We can make a tibble from a data frame by using the `tbl_df` function.
67+
68+
A tibble prints to the Console in a compact way, rather than trying to show every row. To see this, we can convert the `iris` dataset to a tibble using `tbl_df` and then print the resulting object to the Console:
69+
```{r}
70+
# make a "tbl" version of iris
71+
iris_tbl <- tbl_df(iris)
72+
# print it
73+
iris_tbl
74+
```
75+
Notice that only the first 10 rows are printed. This is much nicer than trying to wade through every row of a data frame. However, the main purpose of a tibble is to help us work with variables when we need to do grouped operations. We will learn about this in the last topic of this block.
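A caveat for readers using newer software: as far as we are aware, `tbl_df` was deprecated in **dplyr** 1.0.0 in favour of `as_tibble`. The sketch below assumes such a version is installed. The name `iris_tbl2` is ours, chosen only to avoid overwriting `iris_tbl`.

```{r}
library("dplyr")

# make a tibble version of iris with as_tibble instead of tbl_df
iris_tbl2 <- as_tibble(iris)

# a tibble is still a data frame, with some extra classes on top
class(iris_tbl2)
is.data.frame(iris_tbl2)
```

Either way, the resulting object can be used wherever a tibble is expected in the rest of this block.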
76+
77+
### The `glimpse` function
78+
79+
Sometimes we just need a quick, compact summary of a data frame or tibble. This is the job of the `glimpse` function from **dplyr**, which works in much the same way as `str`:
80+
```{r}
81+
glimpse(iris_tbl)
82+
```
83+
The function takes one argument: the name of a data frame or tibble. It then tells us how many rows it has, how many variables there are, what these variables are called, and what kind of data are associated with each variable. This function is useful when we're working with a dataset containing many variables.
84+
85+

2_04_dplyr_variables.Rmd

+154
@@ -0,0 +1,154 @@
1+
# Working with variables
2+
3+
## Introduction
4+
5+
```{r, include=FALSE}
6+
library("dplyr")
7+
library("nasaweather")
8+
```
9+
10+
This chapter will explore the **dplyr** `select` and `mutate` verbs, as well as the related `rename` and `transmute` verbs. These four verbs are considered together because they all operate on the variables (i.e. the columns) of a data frame or tibble:
11+
12+
- The `select` function selects a subset of variables to retain and (optionally) renames them in the process.
13+
14+
- The `mutate` function creates new variables from preexisting ones and retains the original variables.
15+
16+
- The `rename` function renames one or more variables while keeping the remaining variable names unchanged.
17+
18+
- The `transmute` function creates new variables from preexisting ones and drops the original variables.
19+
20+
### Getting ready
21+
22+
Any script that uses **dplyr** needs to start by loading and attaching the package:
23+
```{r, eval=FALSE}
24+
library("dplyr")
25+
```
26+
Obviously, we need to have installed the **dplyr** package first (e.g. with `install.packages`) for this to work.
27+
28+
We're going to use two data sets to illustrate the ideas in this chapter: the `iris` data set in the `datasets` package and the `storms` data set in the `nasaweather` package. The `datasets` package ships with R and is loaded and attached at start up, so there's no need to do anything to make `iris` available. The `nasaweather` package, on the other hand, has to be loaded and attached alongside **dplyr**:
29+
```{r, eval=FALSE}
30+
library("dplyr")
31+
library("nasaweather")
32+
```
33+
The `iris` data set is an ordinary data frame. Before we start, we need to convert it to a tibble so that it prints to the Console in a more compact way:
34+
```{r}
35+
iris_tbl <- tbl_df(iris)
36+
```
42+
43+
44+
## Subset variables with `select`
45+
46+
We use `select` to __select variables__ from a data frame or tibble. This is typically used when we have a data set with many variables but only need to work with a subset of these. Basic usage of `select` looks like this:
47+
```{r, eval=FALSE}
48+
select(data_set, vname1, vname2, ...)
49+
```
50+
This is not an example you can run---it's designed to show, in general terms, how we use `select`:
51+
52+
- The first argument, `data_set` ("data object"), must be the name of the object containing our data.
53+
54+
- We then include a series of one or more additional arguments, where each of these should be the name of a variable in `data_set`. We've expressed this as `vname1, vname2, ...`, where `vname1` and `vname2` are names of the first two variables, and the `...` is acting as a placeholder for the remaining variables (there could be any number of these).
55+
56+
It's easiest to understand how a function like `select` works by seeing it in action. We select the `Species`, `Petal.Length` and `Petal.Width` variables from `iris_tbl` like this:
57+
```{r}
58+
select(iris_tbl, Species, Petal.Length, Petal.Width)
59+
```
60+
Hopefully nothing about this example is surprising or confusing. There are a few things to notice about how `select` works though:
61+
62+
* The `select` function is one of those non-standard functions we briefly mentioned in the [Using functions] chapter. This means the variable names do not need to be surrounded by quotes unless they have spaces in them (which is best avoided).
63+
64+
* The `select` function is just like other R functions: it does not have "side effects", meaning it does not change the original `iris_tbl` in any way. We printed the result produced by `select` to the Console, but because we did not assign it a name (using ` <- `) we cannot access the new data set afterwards.
65+
66+
* The order of variables (i.e. the column order) in the resulting object is the same as the order in which they were supplied as arguments. This means we can easily reorder variables at the same time as selecting them if we need to.
67+
68+
* The `select` function will return the same kind of data object it is working on. It returns a data frame if our data was originally in a data frame and a tibble if it was a tibble. In this example, R prints a tibble because we had converted `iris` to the tibble `iris_tbl`.
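The points in this list can be checked directly. In the sketch below `iris_tbl` is rebuilt so that the chunk stands alone, using `as_tibble`, which we assume is available from the installed version of **dplyr**. The name `petal_tbl` is ours.

```{r}
library("dplyr")
iris_tbl <- as_tibble(iris)

# assign the result a name so that we can use it later
petal_tbl <- select(iris_tbl, Species, Petal.Length, Petal.Width)

# the new object has the three variables, in the order we asked for them
names(petal_tbl)

# the original is unchanged: it still has all five variables
names(iris_tbl)

# the kind of object returned matches the kind of object supplied
class(petal_tbl)
class(select(iris, Species))
```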
69+
70+
Sometimes it's more convenient to subset variables by specifying those we do __not__ need, rather than the ones we would like to keep. We use the `-` operator to indicate that variables should be dropped. For example, to drop the `Petal.Width` and `Petal.Length` columns, we use:
71+
```{r}
72+
select(iris_tbl, -Petal.Width, -Petal.Length)
73+
```
74+
This returns a tibble with just the remaining variables: `Sepal.Length`, `Sepal.Width` and `Species`.
75+
76+
It can sometimes be quicker to select (or drop) a set of variables positioned next to one another. We work with a series of adjacent variables using the `:` operator. This is used with two variable names, one on the left hand side and one on the right. When we use `:` like this, `select` will subset both those variables along with any others that fall in between them. For example, if we need the two `Petal` variables and `Species`, we use:
77+
```{r}
78+
select(iris_tbl, Petal.Length:Species)
79+
```
80+
The `:` operator can be combined with `-` if we need to drop a series of variables according to their position in a data frame or tibble.
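As a minimal sketch of this combination, the chunk below drops the two adjacent `Sepal` variables in one go. The range is wrapped in parentheses because in R the unary `-` binds more tightly than `:`, so without them the minus sign would apply only to the first variable. As before, `iris_tbl` is rebuilt with `as_tibble` so the chunk stands alone, and the name `no_sepals` is ours.

```{r}
library("dplyr")
iris_tbl <- as_tibble(iris)

# drop every variable from Sepal.Length to Sepal.Width, inclusive
no_sepals <- select(iris_tbl, -(Sepal.Length:Sepal.Width))
names(no_sepals)
```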
81+
82+
### Renaming variables with `select` and `rename`
83+
84+
In addition to selecting a subset of variables, the `select` function can also rename variables at the same time. To do this, we have to name the arguments with `=`, placing the new name on the left hand side. For example, to select the `Species`, `Petal.Length` and `Petal.Width` variables from `iris_tbl`, but also rename `Petal.Length` and `Petal.Width` to `PetalLength` and `PetalWidth`, we use:
85+
```{r}
86+
select(iris_tbl, Species, PetalLength = Petal.Length, PetalWidth = Petal.Width)
87+
```
88+
89+
Renaming variables is a common task when working with data frames and tibbles. In many cases the _only_ thing we would like to do is rename a variable or two. That is, we would like to rename certain variables without selecting a subset. The `select` function is not very good at this because we have to supply every variable in the focal data object as an argument to avoid dropping it. The **dplyr** package provides an additional function called `rename` for this reason. The sole purpose of `rename` is to rename certain variables while retaining all the others. For example, in order to rename `Petal.Length` and `Petal.Width` to `PetalLength` and `PetalWidth`, we use:
90+
```{r}
91+
rename(iris_tbl, PetalLength = Petal.Length, PetalWidth = Petal.Width)
92+
```
93+
Notice that `rename` also preserves the order of the variables as found in the original data.
94+
95+
## Creating variables with `mutate`
96+
97+
We use `mutate` to __add new variables__ to a data frame or tibble. This is useful when we need to construct one or more derived variables to support an analysis. Basic usage of `mutate` looks like this:
98+
```{r, eval=FALSE}
99+
mutate(data_set, <expression1>, <expression2>, ...)
100+
```
101+
Again, this is not an example we can run. It's intended to highlight in general terms how we use `mutate`. As is always the case with **dplyr** functions, the first argument, `data_set` ("data object"), should be the name of the object containing our data. We then include a series of one or more additional arguments, where each of these is a valid R expression involving one or more variables in `data_set`. We've expressed these as `<expression1>, <expression2>`, where `<expression1>` and `<expression2>` represent the first two expressions, and the `...` is acting as a placeholder for the remaining expressions. This is not valid R code; remember, it is just intended to show the general form of `mutate`.
102+
103+
To see `mutate` in action, let's construct a new version of `iris_tbl` that contains a variable summarising the approximate area of sepals:
104+
```{r}
105+
mutate(iris_tbl, Sepal.Width * Sepal.Length)
106+
```
107+
This created a copy of `iris_tbl` with a new column called `Sepal.Width * Sepal.Length` (it is mentioned at the bottom of the printed output). Most of the rules that apply to `select` also apply to `mutate`:
108+
109+
* The expression that performs the required calculation is not surrounded by quotes. This makes sense, because an expression is meant to be evaluated so that it "does something". It is not a value.
110+
111+
* Once again, we just printed the result produced by `mutate` to the Console, rather than assigning the result a name using ` <- `. The `mutate` function does not have side effects, meaning it does not change the original `iris_tbl` in any way.
112+
113+
* The `mutate` function returns the same kind of data object as the one it is working on: a data frame if our data was originally in a data frame, a tibble if it was a tibble.
114+
115+
Creating a variable called something like `Sepal.Width * Sepal.Length` is not exactly ideal. Happily, the `mutate` function can name variables at the same time as they are created. We have to name the arguments using `=`, placing the name on the left hand side, to do this. We can use this construct to name the area variable `Sepal.Area`:
116+
```{r}
117+
mutate(iris_tbl, Sepal.Area = Sepal.Width * Sepal.Length)
118+
```
119+
120+
We can create more than one variable by supplying `mutate` multiple (named) arguments:
121+
```{r}
122+
mutate(iris_tbl,
123+
Sepal.Area = Sepal.Width * Sepal.Length,
124+
Petal.Area = Petal.Width * Petal.Length,
125+
Area.Ratio = Petal.Area / Sepal.Area)
126+
```
127+
Notice that we placed each argument on a new line (remembering the comma to separate arguments). There is nothing to stop us doing this because R does not care about white space. This is useful because it makes long function calls easier to read.
128+
129+
This last example reveals a nice feature of `mutate`: we can use newly created variables in further calculations. Here we constructed approximate sepal and petal area variables, and then used these to construct a third variable containing the ratio of these two quantities, `Area.Ratio`.
130+
131+
### Transforming and dropping variables
132+
133+
Occasionally we may want to construct one or more new variables, and then drop all other variables in the original dataset. The `transmute` function is designed to do this:
134+
```{r}
135+
transmute(iris_tbl,
136+
Sepal.Area = Sepal.Width * Sepal.Length,
137+
Petal.Area = Petal.Width * Petal.Length,
138+
Area.Ratio = Petal.Area / Sepal.Area)
139+
```
140+
Here we repeated the previous example, but now only the new variables were retained in the resulting tibble. If we also want to retain one or more variables without altering them, we just have to pass them as unnamed arguments. For example, if we need to retain species identity in the output, we use:
141+
```{r}
142+
transmute(iris_tbl,
143+
Species,
144+
Sepal.Area = Sepal.Width * Sepal.Length,
145+
Petal.Area = Petal.Width * Petal.Length,
146+
Area.Ratio = Petal.Area / Sepal.Area)
147+
```
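One way to think about `transmute` is as a `mutate` followed by a `select` that keeps only the variables we named. The sketch below compares the two routes. The object names are ours, and `iris_tbl` is rebuilt with `as_tibble` (assumed to be available from the installed **dplyr**) so that the chunk stands alone.

```{r}
library("dplyr")
iris_tbl <- as_tibble(iris)

# route 1: transmute keeps Species and creates Sepal.Area in one step
via_transmute <- transmute(iris_tbl,
                           Species,
                           Sepal.Area = Sepal.Width * Sepal.Length)

# route 2: mutate adds Sepal.Area, then select keeps the same two variables
with_area  <- mutate(iris_tbl, Sepal.Area = Sepal.Width * Sepal.Length)
via_mutate <- select(with_area, Species, Sepal.Area)

# compare the two results as ordinary data frames
all.equal(as.data.frame(via_transmute), as.data.frame(via_mutate))
```

The `transmute` version is shorter, which is why it is convenient when only the derived variables are needed.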
148+
149+
150+
151+
152+
153+
154+
