2_03_tidy_data_dplyr_intro.Rmd
+11 -11
@@ -6,23 +6,23 @@
## The value of **dplyr** {#why-dplyr}
-
The **dplyr** package has been carefully designed to make it easy to manipulate data frames and other kinds of similar objects. A key reason for its ease-of-use is that **dplyr** is very consistent in the way its functions work. For example, the first argument of the main **dplyr** functions is always an object containing our data. This consistency makes it very easy to get to grips with each of the main **dplyr** functions---it's often possible to understand how one works by seeing one or two examples of its use.
+
The **dplyr** package has been carefully designed to make it easier to manipulate data frames and other kinds of similar objects. A key reason for its ease-of-use is that **dplyr** is very consistent in the way its functions work. For example, the first argument of the main **dplyr** functions is always an object containing our data. This consistency makes it very easy to get to grips with each of the main **dplyr** functions---it's often possible to understand how one works by seeing one or two examples of its use.
-
A second reason for favouring **dplyr** is that it is orientated around a few key functions, each of which is designed to do one thing well. The key **dplyr** functions are often referred to as "verbs", reflecting the fact that they "do something" to data. For example: (1) `select` is used to obtain a subset of variables; (2) `mutate` is used to construct new variables; (3) `filter` is used to obtain a subset of rows; (4) `arrange` is used to reorder rows; and (5) `summarise` is used to calculate information about groups. We'll cover each of these verbs in detail in later chapters, as well as a few additional functions such as `rename` and `group_by`.
+
A second reason for favouring **dplyr** is that it is orientated around a few core functions, each of which is designed to do one thing well. The key **dplyr** functions are often referred to as "verbs", reflecting the fact that they "do something" to data. For example: (1) `select` is used to obtain a subset of variables; (2) `mutate` is used to construct new variables; (3) `filter` is used to obtain a subset of rows; (4) `arrange` is used to reorder rows; and (5) `summarise` is used to calculate information about groups. We'll cover each of these verbs in detail in later chapters, as well as a few additional functions such as `rename` and `group_by`.
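As a rough illustration of the five verbs listed above, the sketch below applies each one to the built-in `iris` data. The `Petal.Area` variable and the threshold of 10 are made up for this example, and the later chapters cover each function properly.

```{r}
library("dplyr")

# select: keep a subset of variables (columns)
iris_petals <- select(iris, Species, Petal.Length, Petal.Width)

# mutate: construct a new variable (Petal.Area is a hypothetical name)
iris_area <- mutate(iris_petals, Petal.Area = Petal.Length * Petal.Width)

# filter: keep a subset of rows (the cut-off of 10 is arbitrary)
iris_big <- filter(iris_area, Petal.Area > 10)

# arrange: reorder rows by a variable
iris_sorted <- arrange(iris_big, Petal.Area)

# summarise, combined with group_by: one row of information per group
iris_means <- summarise(group_by(iris, Species),
                        mean_petal_length = mean(Petal.Length))
```

Notice that the first argument is a data object in every call, which is the consistency described earlier.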
-
Apart from being easy to use, **dplyr** is also fast compared to base R functions. This won't matter much for the small data sets we use in this book, but **dplyr** is a good option for very large data sets. The **dplyr** package also allows us to work with data stored in different ways, for example, by interacting directly with a number of database systems. We won't work with anything pther than data frames (and closely-related "tibbles") but it is worth knowing about this facility---learning to use **dplyr** with data frames makes it easy to work with these other kinds of data objects.
+
Apart from being easy to use, **dplyr** is also fast compared to base R functions. This won't matter much for the small data sets we use in this book, but **dplyr** is a good option for large data sets. The **dplyr** package also allows us to work with data stored in different ways, for example, by interacting directly with a number of database systems. We won't work with anything other than data frames (and the closely-related "tibble") but it is worth knowing about this facility. Learning to use **dplyr** with data frames makes it easy to work with these other kinds of data objects.
```{block, type="info"}
#### A **dplyr** cheat sheet
-
The developers of RStudio have produced a very useable [cheat sheat](http://www.rstudio.com/resources/cheatsheets/) that summarises the main data wrangling tools provided by **dplyr**. Our advice is to download this, print out a copy and refer to this often as you start working with **dply**.
+
The developers of RStudio have produced a very usable [cheat sheet](http://www.rstudio.com/resources/cheatsheets/) that summarises the main data wrangling tools provided by **dplyr**. Our advice is to download this, print out a copy and refer to it often as you start working with **dplyr**.
```
## Tidy data
**dplyr** will work with any data frame, but it's at its most powerful when our data are organised as [tidy data](http://vita.had.co.nz/papers/tidy-data.pdf). The word "tidy" has a very specific meaning in this context. Tidy data has a specific structure that makes it easy to manipulate, model and visualise. A tidy data set is one where each variable is in only one column and each row contains only one observation. This might seem like the "obvious" way to organise data, but many people fail to adopt this convention.
-
We aren't going to explore the tidy data copncept in great detail. However, the basic principles are not difficult to understand. We'll use an example to illustrate what the "one variable = one column" and "one observation = one row" idea means. Let's return to the made-up experiment investigateingthe response of communities to fertilizer addition. This time, imagine we had only measured biomass, but that we had measured it twice over the course of the experiment.
+
We aren't going to explore the tidy data concept in great detail, but the basic principles are not difficult to understand. We'll use an example to illustrate what the "one variable = one column" and "one observation = one row" idea means. Let's return to the made-up experiment investigating the response of communities to fertilizer addition. This time, imagine we had only measured biomass, but that we had measured it twice over the course of the experiment.
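To make the "one variable = one column" and "one observation = one row" idea concrete, here is a minimal sketch of the same made-up measurements stored in two layouts. Every number and column name below (`Plot`, `Treatment`, `BiomassT1`, `BiomassT2`, `Time`, `Biomass`) is invented for illustration and is not taken from the book's own data.

```{r}
# Wide layout: a separate column for each biomass measurement (not tidy)
biomass_wide <- data.frame(
  Plot      = 1:4,
  Treatment = c("Control", "Control", "Fertilizer", "Fertilizer"),
  BiomassT1 = c(284, 328, 301, 343),
  BiomassT2 = c(324, 400, 333, 390)
)

# Tidy layout: each variable has its own column, each row is one observation
biomass_tidy <- data.frame(
  Plot      = rep(biomass_wide$Plot, times = 2),
  Treatment = rep(biomass_wide$Treatment, times = 2),
  Time      = rep(c("T1", "T2"), each = nrow(biomass_wide)),
  Biomass   = c(biomass_wide$BiomassT1, biomass_wide$BiomassT2)
)
```

In the tidy version the measurement time becomes a variable in its own right, so the four plots measured twice give eight rows.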
We'll examine some artificial data for the experiment and look at two ways to organise it to help us understand the tidy data idea. The first way uses a separate column for each biomass measurement:
```{r, echo=FALSE}
@@ -54,24 +54,24 @@ The best way to make sure your data set is tidy is to store in that format __whe
## A quick look at **dplyr** {#more-dplyr}
-
We'll finish up this chapter by taking a quick look at a few features of the **dplyr** package, before really drilling down into how it works in later chapters. The package is not part of the base R installation, so we have to install it first via `install.packages("dplyr")`. Remember, we only have to install **dplyr** once, so there's no need to leave the `install.packages` line in script that uses the package. We do have to add `library` to the top of any scripts using the package to load and attach it:
+
We'll finish up this chapter by taking a quick look at a few features of the **dplyr** package, before really drilling down into how it works. The package is not part of the base R installation, so we have to install it first via `install.packages("dplyr")`. Remember, we only have to install **dplyr** once, so there's no need to leave the `install.packages` line in a script that uses the package. We do have to add `library` to the top of any scripts using the package to load and attach it:
```{r, eval=FALSE}
library("dplyr")
```
-
We need some data to work with too. We'll use two data sets to illustrate the key ideas in the next few chapters: the `iris` data set in the **datasets** package and the `storms` data set in the **nasaweather** package.
+
We need some data to work with. We'll use two data sets to illustrate the key ideas in the next few chapters: the `iris` data set in the **datasets** package and the `storms` data set in the **nasaweather** package.
The **datasets** package ships with R and is loaded and attached at start up, so there's no need to do anything to make `iris` available. The **nasaweather** package doesn't ship with R so it needs to be installed via `install.packages("nasaweather")`. Finally, we have to add `library` to the top of our script to load and attach the package:
```{r, eval=FALSE}
library("nasaweather")
```
-
The **nasaweather** package is a bare bones simple data package. It doesn't contain any new R functions, just data. We'll be using the `storms` data set from **nasaweather**---this contains information about tropical storms in North America (from 1995-2000). We're just using it as a convenient example of a medium-sized data set to illustrate the workings of the **dplyr**, and later, the **ggplot2** packages.
+
The **nasaweather** package is a bare bones data package. It doesn't contain any new R functions, just data. We'll be using the `storms` data set from **nasaweather**: this contains information about tropical storms in North America (from 1995-2000). We're just using it as a convenient example to illustrate the workings of the **dplyr**, and later, the **ggplot2** packages.
### Tibble (`tbl`) objects
-
The primary purpose of the **dplyr** package is to make it easier to manipulate data interactively. In order to facilitate this kind of work **dplyr**also implements a special kind of data object called a "tbl" objects (pronounced "tibble"). We can think of a tibble is as a special data frame with a few extra whistles and bells. We can convert an ordinary data frame to a a tibble using the `tbl_df` function.
+
The primary purpose of the **dplyr** package is to make it easier to manipulate data interactively. In order to facilitate this kind of work **dplyr** implements a special kind of data object known as a `tbl` (pronounced "tibble"). We can think of a tibble as a special data frame with a few extra bells and whistles.
-
It's a good idea (though certainly not necessary) to convert ordinary data frames to tibbles when working with data. Why? When an ordinary data frame is printed to the Console R will try to print every column and row, until it reaches a (very large) maximum number. The result is a mess of text that's virtually impossible to make sense of. In contrast, when a tibble is printed to the Console, it does so in a compact way. To see this, we can convert the `iris`dataset to a tibble using `tbl_df` and then print the resulting object to the Console:
+
We can convert an ordinary data frame to a tibble using the `tbl_df` function. It's a good idea (though not necessary) to convert ordinary data frames to tibbles. Why? When a data frame is printed to the Console, R will try to print every column and row until it reaches a (very large) maximum permitted amount of output. The result is a mess of text that's virtually impossible to make sense of. In contrast, when a tibble is printed to the Console, it does so in a compact way. To see this, we can convert the `iris` data set to a tibble using `tbl_df` and then print the resulting object to the Console:
```{r}
# make a "tibble" copy of iris
iris_tbl <- tbl_df(iris)
@@ -86,6 +86,6 @@ Sometimes we just need a quick, compact summary of a data frame or tibble. This
```{r}
glimpse(iris_tbl)
```
-
The function takes one argument: the name of a data frame or tibble. It then tells us how many rows it has, how many variables there are, what these variables are called, and what kind of data are associated with each variable. This function is useful when we're working with a dataset containing many variables.
+
The function takes one argument: the name of a data frame or tibble. It then tells us how many rows it has, how many variables there are, what these variables are called, and what kind of data are associated with each variable. This function is useful when we're working with a data set containing many variables.
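For comparison, the sketch below rebuilds the tibble and sets `glimpse` alongside base R's `str`, which gives a similar overview. One assumption to flag: it uses `as_tibble` rather than `tbl_df`, because `tbl_df` is deprecated in **dplyr** 1.0.0 and later; with an older version of the package the original `tbl_df(iris)` call does the same job.

```{r}
library("dplyr")

# as_tibble() replaces tbl_df() in recent dplyr releases (assumes dplyr >= 1.0.0)
iris_tbl <- as_tibble(iris)

# glimpse: row and column counts, then one line per variable (name, type, first values)
glimpse(iris_tbl)

# base R alternative with a similar purpose
str(iris)
```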