From 528606008e1e3d18c677b179cce9215dcd2bb735 Mon Sep 17 00:00:00 2001 From: Mine Cetinkaya-Rundel Date: Mon, 20 Nov 2023 08:10:43 -0500 Subject: [PATCH 01/43] Fix typo, closes #1601 --- iteration.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/iteration.qmd b/iteration.qmd index 9c05c19e6..e2a43ca4e 100644 --- a/iteration.qmd +++ b/iteration.qmd @@ -566,7 +566,7 @@ paths |> What if we want to pass in extra arguments to `read_excel()`? We use the same technique that we used with `across()`. -For example, it's often useful to peak at the first few rows of the data with `n_max = 1`: +For example, it's often useful to peek at the first few rows of the data with `n_max = 1`: ```{r} paths |> From 85d220958f35dbf0ab39c6ebabeab5257de0a72c Mon Sep 17 00:00:00 2001 From: "Pablo E. Garcia" Date: Sat, 25 Nov 2023 16:38:06 +0100 Subject: [PATCH 02/43] Fix typo in numbers.qmd (#1602) --- numbers.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/numbers.qmd b/numbers.qmd index 8fce8f496..c364d6504 100644 --- a/numbers.qmd +++ b/numbers.qmd @@ -583,7 +583,7 @@ The median delay is always smaller than the mean delay because flights sometimes ```{r} #| label: fig-mean-vs-median #| fig-cap: | -#| A scatterplot showing the differences of summarizing daily depature +#| A scatterplot showing the differences of summarizing daily departure #| delay with median instead of mean. #| fig-alt: | #| All points fall below a 45° line, meaning that the median delay is From bfa59fa474acc43744a585b11eec0e5e9ae1a168 Mon Sep 17 00:00:00 2001 From: Max Ranieri Date: Thu, 30 Nov 2023 01:05:30 -0500 Subject: [PATCH 03/43] Make variable names consistent in data-transform.qmd (#1604) --- data-transform.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data-transform.qmd b/data-transform.qmd index ae5d8b926..c52b3561f 100644 --- a/data-transform.qmd +++ b/data-transform.qmd @@ -594,7 +594,7 @@ We'll come back to discuss missing values in detail in @sec-missing-values, but flights |> group_by(month) |> summarize( - delay = mean(dep_delay, na.rm = TRUE) + avg_delay = mean(dep_delay, na.rm = TRUE) ) ``` @@ -605,7 +605,7 @@ You'll learn various useful summaries in the upcoming chapters, but one very use flights |> group_by(month) |> summarize( - delay = mean(dep_delay, na.rm = TRUE), + avg_delay = mean(dep_delay, na.rm = TRUE), n = n() ) ``` From b0aa40096b292fef842432d9ea009a46f41442da Mon Sep 17 00:00:00 2001 From: e-clin Date: Fri, 19 Jan 2024 20:46:00 +0000 Subject: [PATCH 04/43] Update quarto.qmd (#1621) Replaced "a ridiculous degree of accuracy" with "a ridiculous degree of precision", as that is what the number of decimal places relates to. --- quarto.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/quarto.qmd b/quarto.qmd index a101f5f4a..155fb8bd4 100644 --- a/quarto.qmd +++ b/quarto.qmd @@ -403,7 +403,7 @@ When the report is rendered, the results of these computations are inserted into > The distribution of the remainder is shown below: When inserting numbers into text, `format()` is your friend. -It allows you to set the number of `digits` so you don't print to a ridiculous degree of accuracy, and a `big.mark` to make numbers easier to read. +It allows you to set the number of `digits` so you don't print to a ridiculous degree of precision, and a `big.mark` to make numbers easier to read. 
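For example, a quick sketch of both arguments (the values here are made up for illustration):

```r
format(0.1234567, digits = 2)    # "0.12" -- two significant digits
format(1234567, big.mark = ",")  # "1,234,567" -- thousands separator
```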
You might combine these into a helper function: ```{r} From d685f8f3f87074421be9f0c47c602e33551ca315 Mon Sep 17 00:00:00 2001 From: Enki WANG <98389771+ynsec37@users.noreply.github.com> Date: Mon, 29 Jan 2024 02:17:22 +0800 Subject: [PATCH 05/43] Fix typo (#1624) --- quarto-formats.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/quarto-formats.qmd b/quarto-formats.qmd index 8c749b629..bd9ab42fd 100644 --- a/quarto-formats.qmd +++ b/quarto-formats.qmd @@ -246,7 +246,7 @@ cat(readr::read_file("quarto/example-book.yml")) We recommend that you use an RStudio project for your websites and books. Based on the `_quarto.yml` file, RStudio will recognize the type of project you're working on, and add a Build tab to the IDE that you can use to render and preview your websites and books. -Both websites and books can also be rendered using `quarto::render()`. +Both websites and books can also be rendered using `quarto::quarto_render()`. Read more at about Quarto websites and about books. From 6f512d1b3f387a365a5cffe8bca01cb1aa12cf00 Mon Sep 17 00:00:00 2001 From: Mine Cetinkaya-Rundel Date: Sun, 28 Jan 2024 14:52:53 -0500 Subject: [PATCH 06/43] Update Quarto set up action --- .github/workflows/build_book.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/build_book.yaml b/.github/workflows/build_book.yaml index cdcba86cb..43ce969b3 100644 --- a/.github/workflows/build_book.yaml +++ b/.github/workflows/build_book.yaml @@ -23,8 +23,8 @@ jobs: steps: - uses: actions/checkout@v2 - - name: Install Quarto - uses: quarto-dev/quarto-actions/install-quarto@v1 + - name: Set up Quarto + uses: quarto-dev/quarto-actions/setup@v2 with: # To install LaTeX to build PDF book tinytex: true From e89d0833334b544471cc2ea0d0631ca119fc7756 Mon Sep 17 00:00:00 2001 From: Mine Cetinkaya-Rundel Date: Sun, 28 Jan 2024 14:56:49 -0500 Subject: [PATCH 07/43] Update workflow-scripts.qmd Fix typo, closes #1618 and closes #1617 --- workflow-scripts.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/workflow-scripts.qmd b/workflow-scripts.qmd index 5027b58dc..d726e9b80 100644 --- a/workflow-scripts.qmd +++ b/workflow-scripts.qmd @@ -244,7 +244,7 @@ In this R session, the current working directory (think of it as "home") is in h This code will return a different result when you run it, because your computer has a different directory structure than Hadley's! As a beginning R user, it's OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer. -But you're seven chapters into this book, and you're no longer a beginner. +But you're more than a handful of chapters into this book, and you're no longer a beginner. Very soon now you should evolve to organizing your projects into directories and, when working on a project, set R's working directory to the associated directory. You can set the working directory from within R but **we** **do not recommend it**: From aa5eaac43eef81c651f3f44ed6577552beb0bbc2 Mon Sep 17 00:00:00 2001 From: Floris Vanderhaeghe Date: Mon, 29 Jan 2024 02:11:21 +0100 Subject: [PATCH 08/43] logicals.qmd: use better variable names in example (#1623) The variable names of this example were copied from the previous example, where logical vectors were summarized with `any()` and `all()`. However the present example the summary is with `sum()` and `mean()`, which was not updated in the variable names. 
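A minimal sketch of the distinction the message describes, using a toy logical vector: `any()` and `all()` return logical summaries, while `sum()` and `mean()` return a count and a proportion.

```r
x <- c(TRUE, FALSE, TRUE, NA)
any(x, na.rm = TRUE)   # TRUE: at least one value is TRUE
all(x, na.rm = TRUE)   # FALSE: not every value is TRUE
sum(x, na.rm = TRUE)   # 2: a count of TRUEs
mean(x, na.rm = TRUE)  # ~0.67: a proportion of TRUEs
```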
--- logicals.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/logicals.qmd b/logicals.qmd index e7dcbb2b9..254b24109 100644 --- a/logicals.qmd +++ b/logicals.qmd @@ -360,8 +360,8 @@ That, for example, allows us to see the proportion of flights that were delayed flights |> group_by(year, month, day) |> summarize( - all_delayed = mean(dep_delay <= 60, na.rm = TRUE), - any_long_delay = sum(arr_delay >= 300, na.rm = TRUE), + proportion_delayed = mean(dep_delay <= 60, na.rm = TRUE), + count_long_delay = sum(arr_delay >= 300, na.rm = TRUE), .groups = "drop" ) ``` From e56ccb6d948ff2d52e1f9ba67b6a6620f4909bb0 Mon Sep 17 00:00:00 2001 From: Floris Vanderhaeghe Date: Mon, 29 Jan 2024 02:11:47 +0100 Subject: [PATCH 09/43] logicals.qmd: use if_else() in exercises instead of ifelse() (#1622) --- logicals.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/logicals.qmd b/logicals.qmd index 254b24109..7d9fb9c40 100644 --- a/logicals.qmd +++ b/logicals.qmd @@ -564,9 +564,9 @@ We don't expect you to memorize these rules, but they should become second natur 1. A number is even if it's divisible by two, which in R you can find out with `x %% 2 == 0`. Use this fact and `if_else()` to determine whether each number between 0 and 20 is even or odd. -2. Given a vector of days like `x <- c("Monday", "Saturday", "Wednesday")`, use an `ifelse()` statement to label them as weekends or weekdays. +2. Given a vector of days like `x <- c("Monday", "Saturday", "Wednesday")`, use an `if_else()` statement to label them as weekends or weekdays. -3. Use `ifelse()` to compute the absolute value of a numeric vector called `x`. +3. Use `if_else()` to compute the absolute value of a numeric vector called `x`. 4. Write a `case_when()` statement that uses the `month` and `day` columns from `flights` to label a selection of important US holidays (e.g., New Years Day, 4th of July, Thanksgiving, and Christmas). First create a logical column that is either `TRUE` or `FALSE`, and then create a character column that either gives the name of the holiday or is `NA`. From 716cc5b5c552a8a4d1d39c4c6e8a27beba23eb89 Mon Sep 17 00:00:00 2001 From: Mine Cetinkaya-Rundel Date: Sun, 28 Jan 2024 20:16:30 -0500 Subject: [PATCH 10/43] Add both versions of insert anything shortcut, closes #1607 --- quarto.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/quarto.qmd b/quarto.qmd index 155fb8bd4..d110abd18 100644 --- a/quarto.qmd +++ b/quarto.qmd @@ -162,7 +162,7 @@ In fact, Quarto uses Pandoc markdown (a slightly extended version of Markdown th While Markdown is designed to be easy to read and write, as you will see in @sec-source-editor, it still requires learning new syntax. Therefore, if you're new to computational documents like `.qmd` files but have experience using tools like Google Docs or MS Word, the easiest way to get started with Quarto in RStudio is the visual editor. -In the visual editor you can either use the buttons on the menu bar to insert images, tables, cross-references, etc. or you can use the catch-all ⌘ / shortcut to insert just about anything. +In the visual editor you can either use the buttons on the menu bar to insert images, tables, cross-references, etc. or you can use the catch-all + / or Ctrl + / shortcut to insert just about anything. If you are at the beginning of a line (as shown in @fig-visual-editor), you can also enter just / to invoke the shortcut. 
```{r} From f40e78af083a2a8df676d92af87431db1831b8cf Mon Sep 17 00:00:00 2001 From: Floris Vanderhaeghe Date: Fri, 2 Feb 2024 01:41:06 +0100 Subject: [PATCH 11/43] Chapter 13 (numbers): minor updates (#1627) * numbers.qmd: improve variable name for IQR * numbers.qmd: textual fix --- numbers.qmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/numbers.qmd b/numbers.qmd index c364d6504..33d50f990 100644 --- a/numbers.qmd +++ b/numbers.qmd @@ -643,11 +643,11 @@ But the code below reveals a data oddity for airport [EGE](https://en.wikipedia. flights |> group_by(origin, dest) |> summarize( - distance_sd = IQR(distance), + distance_iqr = IQR(distance), n = n(), .groups = "drop" ) |> - filter(distance_sd > 0) + filter(distance_iqr > 0) ``` ### Distributions @@ -717,7 +717,7 @@ Finally, don't forget what you learned in @sec-sample-size: whenever creating nu There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position: `first(x)`, `last(x)`, and `nth(x, n)`. -For example, we can find the first and last departure for each day: +For example, we can find the first, fifth and last departure for each day: ```{r} flights |> From 33e701227a3a69e607485499c006fe6e9a284d3b Mon Sep 17 00:00:00 2001 From: kew24 <56448994+kew24@users.noreply.github.com> Date: Mon, 26 Feb 2024 15:27:18 -0500 Subject: [PATCH 12/43] small typo fix (#1635) --- data-transform.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data-transform.qmd b/data-transform.qmd index c52b3561f..76e485f87 100644 --- a/data-transform.qmd +++ b/data-transform.qmd @@ -467,7 +467,7 @@ ggplot(flights, aes(x = air_time - airtime2)) + geom_histogram() ## The pipe {#sec-the-pipe} We've shown you simple examples of the pipe above, but its real power arises when you start to combine multiple verbs. -For example, imagine that you wanted to find the fast flights to Houston's IAH airport: you need to combine `filter()`, `mutate()`, `select()`, and `arrange()`: +For example, imagine that you wanted to find the fastest flights to Houston's IAH airport: you need to combine `filter()`, `mutate()`, `select()`, and `arrange()`: ```{r} flights |> From f55997b9619b4edff8f739510821fa3ad3b59ea4 Mon Sep 17 00:00:00 2001 From: Floris Vanderhaeghe Date: Sat, 2 Mar 2024 02:12:26 +0100 Subject: [PATCH 13/43] Some fixes for chapters regexps & factors (#1636) * regexps.qmd: fix name of 'too_few' arg * regexps.qmd: fix typo * factors.qmd: update argument names to .f, .x, .y * factors.qmd: fix language --- factors.qmd | 12 ++++++------ regexps.qmd | 4 ++-- 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/factors.qmd b/factors.qmd index a98871e63..5e78ba75f 100644 --- a/factors.qmd +++ b/factors.qmd @@ -177,9 +177,9 @@ It is hard to read this plot because there's no overall pattern. We can improve it by reordering the levels of `relig` using `fct_reorder()`. `fct_reorder()` takes three arguments: -- `f`, the factor whose levels you want to modify. -- `x`, a numeric vector that you want to use to reorder the levels. -- Optionally, `fun`, a function that's used if there are multiple values of `x` for each value of `f`. The default value is `median`. +- `.f`, the factor whose levels you want to modify. +- `.x`, a numeric vector that you want to use to reorder the levels. +- Optionally, `.fun`, a function that's used if there are multiple values of `.x` for each value of `.f`. The default value is `median`. 
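A tiny sketch of how these arguments fit together, using toy vectors `f` and `x` rather than the survey data used below:

```r
library(forcats)

f <- factor(c("a", "b", "c"))
x <- c(3, 1, 2)

# with a single x value per level, the levels are simply ordered by x: b, c, a
levels(fct_reorder(f, x))
```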
```{r} #| fig-alt: | @@ -231,7 +231,7 @@ Reserve `fct_reorder()` for factors whose levels are arbitrarily ordered. However, it does make sense to pull "Not applicable" to the front with the other special levels. You can use `fct_relevel()`. -It takes a factor, `f`, and then any number of levels that you want to move to the front of the line. +It takes a factor, `.f`, and then any number of levels that you want to move to the front of the line. ```{r} #| fig-alt: | @@ -247,7 +247,7 @@ ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, "Not applicable")) Why do you think the average age for "Not applicable" is so high? Another type of reordering is useful when you are coloring the lines on a plot. -`fct_reorder2(f, x, y)` reorders the factor `f` by the `y` values associated with the largest `x` values. +`fct_reorder2(.f, .x, .y)` reorders the factor `.f` by the `.y` values associated with the largest `.x` values. This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend. ```{r} @@ -287,7 +287,7 @@ Combine it with `fct_rev()` if you want them in increasing frequency so that in ```{r} #| fig-alt: | -#| A bar char of marital status ordered in from least to most common: +#| A bar char of marital status ordered from least to most common: #| no answer (~0), separated (~1,000), widowed (~2,000), divorced #| (~3,000), never married (~5,000), married (~10,000). gss_cat |> diff --git a/regexps.qmd b/regexps.qmd index c34dc7c5f..66b7872c0 100644 --- a/regexps.qmd +++ b/regexps.qmd @@ -265,7 +265,7 @@ df |> ) ``` -If the match fails, you can use `too_short = "debug"` to figure out what went wrong, just like `separate_wider_delim()` and `separate_wider_position()`. +If the match fails, you can use `too_few = "debug"` to figure out what went wrong, just like `separate_wider_delim()` and `separate_wider_position()`. ### Exercises @@ -336,7 +336,7 @@ That lets you avoid one layer of escaping: str_view(x, r"{\\}") ``` -If you're trying to match a literal `.`, `$`, `|`, `*`, `+`, `?`, `{`, `}`, `(`, `)`, there's an alternative to using a backslash escape: you can use a character class: `[.]`, `[$]`, `[|]`, \... +If you're trying to match a literal `.`, `$`, `|`, `*`, `+`, `?`, `{`, `}`, `(`, `)`, there's an alternative to using a backslash escape: you can use a character class: `[.]`, `[$]`, `[|]`, ... all match the literal values. ```{r} From bdd847c5bb287d6a348b324f3b34e252a70f8903 Mon Sep 17 00:00:00 2001 From: Steven Primeaux Date: Wed, 13 Mar 2024 11:33:58 -0500 Subject: [PATCH 14/43] 17.2: clarify language (#1637) "POSIXct" is the thing tripping off the tongue in this sentence, not base R. --- datetimes.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datetimes.qmd b/datetimes.qmd index ca6fc6308..d40510a8d 100644 --- a/datetimes.qmd +++ b/datetimes.qmd @@ -55,7 +55,7 @@ There are three types of date/time data that refer to an instant in time: - A **date-time** is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as ``. - Base R calls these POSIXct, but doesn't exactly trip off the tongue. + Base R calls these POSIXct, but that doesn't exactly trip off the tongue. In this chapter we are going to focus on dates and date-times as R doesn't have a native class for storing times. If you need one, you can use the **hms** package. 
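A quick sketch of all three, assuming lubridate is loaded and the hms package is installed (the `hms::hms()` call is just one way to build a time):

```r
library(lubridate)

today()                             # a date, printed as <date> in a tibble
now()                               # a date-time, printed as <dttm>
hms::hms(hours = 13, minutes = 45)  # a time, via the hms package
```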
From f312f484e30d65f3dc41c7ac1e4f83ea8875a9df Mon Sep 17 00:00:00 2001 From: Mine Cetinkaya-Rundel Date: Thu, 4 Apr 2024 23:28:10 -0400 Subject: [PATCH 15/43] Fix typo --- factors.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/factors.qmd b/factors.qmd index 5e78ba75f..d0864daf0 100644 --- a/factors.qmd +++ b/factors.qmd @@ -139,7 +139,7 @@ gss_cat |> When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below. -### Exercise +### Exercises 1. Explore the distribution of `rincome` (reported income). What makes the default bar chart hard to understand? From b88683ec69010f41f8f17aaac8a9149685604465 Mon Sep 17 00:00:00 2001 From: David Kane Date: Wed, 17 Apr 2024 06:52:56 -0400 Subject: [PATCH 16/43] Update iteration.qmd (#1646) --- iteration.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/iteration.qmd b/iteration.qmd index e2a43ca4e..56fee2cae 100644 --- a/iteration.qmd +++ b/iteration.qmd @@ -822,7 +822,7 @@ This would work well here, but we don't have csv files, instead we have excel sp So we're going to have to do it "by hand". Learning to do it by hand will also help you when you have a bunch of csvs and the database that you're working with doesn't have one function that will load them all in. -We need to start by creating a table that will fill in with data. +We need to start by creating a table that we will fill in with data. The easiest way to do this is by creating a template, a dummy data frame that contains all the columns we want, but only a sampling of the data. For the gapminder data, we can make that template by reading a single file and adding the year to it: From eedfa7378569639420e48554a73f9cd02f339d87 Mon Sep 17 00:00:00 2001 From: David Kane Date: Fri, 19 Apr 2024 09:48:25 -0400 Subject: [PATCH 17/43] Update databases.qmd (#1648) typo fix --- databases.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/databases.qmd b/databases.qmd index c0c0d0d48..949194b26 100644 --- a/databases.qmd +++ b/databases.qmd @@ -336,7 +336,7 @@ Note that unlike `mutate()`, the old name is on the left and the new name is on In the examples above note that `"year"` and `"type"` are wrapped in double quotes. That's because these are **reserved words** in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators. -When working with other databases you're likely to see every variable name quotes because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe. +When working with other databases you're likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe. ``` sql SELECT "tailnum", "type", "manufacturer", "model", "year" From 0e6ee2535175262b9221e32000e88c0a8b3d99d5 Mon Sep 17 00:00:00 2001 From: David Kane Date: Sat, 20 Apr 2024 09:05:48 -0400 Subject: [PATCH 18/43] Update databases.qmd (#1650) Fixed title and removed italics, since "The Three-Valued Logic of SQL" is an article, not a book. 
--- databases.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/databases.qmd b/databases.qmd index 949194b26..8145250af 100644 --- a/databases.qmd +++ b/databases.qmd @@ -426,7 +426,7 @@ flights |> summarize(delay = mean(arr_delay)) ``` -If you want to learn more about how `NULL`s work, you might enjoy "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand. +If you want to learn more about how `NULL`s work, you might enjoy "[The Three-Valued Logic of SQL](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand. In general, you can work with `NULL`s using the functions you'd use for `NA`s in R: From e42ee44e0456ece4bc8ee8e79e6597410eb64aa2 Mon Sep 17 00:00:00 2001 From: Mitsuo Shiota <48662507+mitsuoxv@users.noreply.github.com> Date: Sat, 4 May 2024 11:58:01 +0900 Subject: [PATCH 19/43] probably a careless mistake (#1652) --- intro.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/intro.qmd b/intro.qmd index aa0a54232..204d0a723 100644 --- a/intro.qmd +++ b/intro.qmd @@ -27,7 +27,7 @@ Our model of the steps of a typical data science project looks something like @f #| fig-alt: | #| A diagram displaying the data science cycle: Import -> Tidy -> Understand #| (which has the phases Transform -> Visualize -> Model in a cycle) -> -#| Communicate. Surrounding all of these is Communicate. +#| Communicate. Surrounding all of these is Program. #| out.width: NULL knitr::include_graphics("diagrams/data-science/base.png", dpi = 270) From da7066349d3bb88c890ef5706a693b197539d3e5 Mon Sep 17 00:00:00 2001 From: Mitsuo Shiota <48662507+mitsuoxv@users.noreply.github.com> Date: Mon, 13 May 2024 22:53:46 +0900 Subject: [PATCH 20/43] Fix/data transform.qmd (#1654) * a typo, correct `gain,` to `gain`, * typos * no data n <= 100 in the plot, and a typo --- data-transform.qmd | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/data-transform.qmd b/data-transform.qmd index 76e485f87..e11d79e98 100644 --- a/data-transform.qmd +++ b/data-transform.qmd @@ -310,7 +310,7 @@ flights |> ) ``` -Note that since we haven't assigned the result of the above computation back to `flights`, the new variables `gain,` `hours`, and `gain_per_hour` will only be printed but will not be stored in a data frame. +Note that since we haven't assigned the result of the above computation back to `flights`, the new variables `gain`, `hours`, and `gain_per_hour` will only be printed but will not be stored in a data frame. And if we want them to be available in a data frame for future use, we should think carefully about whether we want the result to be assigned back to `flights`, overwriting the original data frame with many more variables, or to a new object. Often, the right answer is a new object that is named informatively to indicate its contents, e.g., `delay_gain`, but you might also have good reasons for overwriting `flights`. @@ -347,7 +347,9 @@ In this situation, the first challenge is often just focusing on the variables y select(!year:day) ``` - Historically this operation was done with `-` instead of `!`, so you're likely to see that in the wild. These two operators serve the same purpose but with subtle differences in behavior. We recommend using `!` because it reads as "not" and combines well with `&` and `|`. + Historically this operation was done with `-` instead of `!`, so you're likely to see that in the wild. + These two operators serve the same purpose but with subtle differences in behavior. 
+ We recommend using `!` because it reads as "not" and combines well with `&` and `|`. - Select all columns that are characters: @@ -766,7 +768,7 @@ You can learn more about it in the [dplyr 1.1.0 blog post](https://www.tidyverse ``` b. Write down what you think the output will look like, then check if you were correct, and describe what `arrange()` does. - Also comment on how it's different from the `group_by()` in part (a)? + Also comment on how it's different from the `group_by()` in part (a). ```{r} #| eval: false @@ -797,7 +799,7 @@ You can learn more about it in the [dplyr 1.1.0 blog post](https://www.tidyverse ``` e. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. - How is the output different from the one in part (d). + How is the output different from the one in part (d)? ```{r} #| eval: false @@ -853,9 +855,9 @@ When we plot the skill of the batter (measured by the batting average, `performa #| fig-alt: | #| A scatterplot of number of batting performance vs. batting opportunites #| overlaid with a smoothed line. Average performance increases sharply -#| from 0.2 at when n is 1 to 0.25 when n is ~1000. Average performance +#| from 0.2 at when n is ~100 to 0.25 when n is ~1000. Average performance #| continues to increase linearly at a much shallower slope reaching -#| ~0.3 when n is ~15,000. +#| 0.3 when n is ~12,000. batters |> filter(n > 100) |> @@ -880,7 +882,7 @@ You can find a good explanation of this problem and how to overcome it at Date: Mon, 13 May 2024 15:57:27 +0200 Subject: [PATCH 21/43] Add "of" between numbers and rows (#1649) --- numbers.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/numbers.qmd b/numbers.qmd index 33d50f990..37c3db49d 100644 --- a/numbers.qmd +++ b/numbers.qmd @@ -129,7 +129,7 @@ There are a couple of variants of `n()` and `count()` that you might find useful ### Exercises -1. How can you use `count()` to count the number rows with a missing value for a given variable? +1. How can you use `count()` to count the number of rows with a missing value for a given variable? 2. Expand the following calls to `count()` to instead use `group_by()`, `summarize()`, and `arrange()`: 1. `flights |> count(dest, sort = TRUE)` From f703a6c560d6ff599a4a81d114eab84b3a517090 Mon Sep 17 00:00:00 2001 From: Mitsuo Shiota <48662507+mitsuoxv@users.noreply.github.com> Date: Fri, 17 May 2024 21:32:01 +0900 Subject: [PATCH 22/43] Fix/data-tidy.qmd small typos (#1655) --- data-tidy.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data-tidy.qmd b/data-tidy.qmd index c9f962c16..d9562e16f 100644 --- a/data-tidy.qmd +++ b/data-tidy.qmd @@ -397,7 +397,7 @@ household ``` This dataset contains data about five families, with the names and dates of birth of up to two children. -The new challenge in this dataset is that the column names contain the names of two variables (`dob`, `name)` and the values of another (`child,` with values 1 or 2). +The new challenge in this dataset is that the column names contain the names of two variables (`dob`, `name)` and the values of another (`child`, with values 1 or 2). To solve this problem we again need to supply a vector to `names_to` but this time we use the special `".value"` sentinel; this isn't the name of a variable but a unique value that tells `pivot_longer()` to do something different. This overrides the usual `values_to` argument to use the first component of the pivoted column name as a variable name in the output. 
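A sketch of the kind of call this describes, assuming the columns follow a `dob_child1` / `name_child1` naming pattern (the exact arguments may differ from the chunk that follows):

```r
household |>
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )
```

Here `".value"` turns the `dob` and `name` prefixes into output columns, while `child` collects the remaining piece of each column name.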
@@ -456,7 +456,7 @@ cms_patient_experience |> Neither of these columns will make particularly great variable names: `measure_cd` doesn't hint at the meaning of the variable and `measure_title` is a long sentence containing spaces. We'll use `measure_cd` as the source for our new column names for now, but in a real analysis you might want to create your own variable names that are both short and meaningful. -`pivot_wider()` has the opposite interface to `pivot_longer()`: instead of choosing new column names, we need to provide the existing columns that define the values (`values_from`) and the column name (`names_from)`: +`pivot_wider()` has the opposite interface to `pivot_longer()`: instead of choosing new column names, we need to provide the existing columns that define the values (`values_from`) and the column name (`names_from`): ```{r} cms_patient_experience |> From fe302aca08b598879c56626c5570979a815a203b Mon Sep 17 00:00:00 2001 From: Mitsuo Shiota <48662507+mitsuoxv@users.noreply.github.com> Date: Tue, 28 May 2024 21:22:32 +0900 Subject: [PATCH 23/43] Suggest/layers.qmd shape descriptions in fig-alt, etc. (#1660) * a careless mistake * #14 triangle is point up, fig-alt describes #25 shape, and filled in "red", not in "blue" * show.legend = FALSE * correct fig-alt in 2seater highlighted scatterplot * match character string to set shape * a careless mistake * fig-alt suggests "free" * the first faceted plot in this section is faceted by cyl * a careless mistake * correct fig-alt in position adjustment --- layers.qmd | 60 ++++++++++++++++++++++++++---------------------------- 1 file changed, 29 insertions(+), 31 deletions(-) diff --git a/layers.qmd b/layers.qmd index 268bb248a..b29919317 100644 --- a/layers.qmd +++ b/layers.qmd @@ -89,7 +89,7 @@ When `class` is mapped to `shape`, we get two warnings: Since ggplot2 will only use six shapes at a time, by default, additional groups will go unplotted when you use the shape aesthetic. The second warning is related -- there are 62 SUVs in the dataset and they're not plotted. -Similarly, we can map `class` to `size` or `alpha` aesthetics as well, which control the shape and the transparency of the points, respectively. +Similarly, we can map `class` to `size` or `alpha` aesthetics as well, which control the size and the transparency of the points, respectively. ```{r} #| layout-ncol: 2 @@ -150,28 +150,28 @@ You'll need to pick a value that makes sense for that aesthetic: #| fig.asp: 0.364 #| fig-align: "center" #| fig-cap: | -#| R has 25 built-in shapes that are identified by numbers. There are some +#| R has 26 built-in shapes that are identified by numbers. There are some #| seeming duplicates: for example, 0, 15, and 22 are all squares. The #| difference comes from the interaction of the `color` and `fill` #| aesthetics. The hollow shapes (0--14) have a border determined by `color`; #| the solid shapes (15--20) are filled with `color`; the filled shapes -#| (21--24) have a border of `color` and are filled with `fill`. Shapes are +#| (21--25) have a border of `color` and are filled with `fill`. Shapes are #| arranged to keep similar shapes next to each other. 
#| fig-alt: | -#| Mapping between shapes and the numbers that represent them: 0 - square, -#| 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, -#| 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, -#| 10 - circle plus, 11 - triangles up and down, 12 - square plus, -#| 13 - circle cross, 14 - square and triangle down, 15 - filled square, -#| 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, -#| 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, -#| 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle -#| point-up blue, 25 - filled triangle point down blue. +#| Mapping between shapes and the numbers that represent them: 0 - square open, +#| 1 - circle open, 2 - triangle open, 3 - plus, 4 - cross, 5 - diamond open, +#| 6 - triangle down open, 7 - square cross, 8 - asterisk, 9 - diamond plus, +#| 10 - circle plus, 11 - star, 12 - square plus, +#| 13 - circle cross, 14 - square triangle, 15 - square, +#| 16 - circle small, 17 - triangle, 18 - diamond, +#| 19 - circle, 20 - bullet, 21 - circle filled, +#| 22 - square filled, 23 - diamond filled, 24 - triangle filled, +#| 25 - triangle down filled. shapes <- tibble( - shape = c(0, 1, 2, 5, 3, 4, 6:19, 22, 21, 24, 23, 20), - x = (0:24 %/% 5) / 2, - y = (-(0:24 %% 5)) / 4 + shape = c(0, 1, 2, 5, 3, 4, 6:19, 22, 21, 24, 23, 20, 25), + x = (0:25 %/% 5) / 2, + y = (-(0:25 %% 5)) / 4 ) ggplot(shapes, aes(x, y)) + geom_point(aes(shape = shape), size = 5, fill = "red") + @@ -319,12 +319,13 @@ It is convenient to rely on this feature because the `group` aesthetic by itself #| message: false #| fig-alt: | #| Three plots, each with highway fuel efficiency on the y-axis and engine -#| size of cars, where data are represented by a smooth curve. The first plot +#| size of cars on the x-axis, where data are represented by a smooth curve. +#| The first plot #| only has these two variables, the center plot has three separate smooth #| curves for each level of drive train, and the right plot not only has the #| same three separate smooth curves for each level of drive train but these -#| curves are plotted in different colors, with a legend explaining which -#| color maps to which level. Confidence intervals around the smooth curves +#| curves are plotted in different colors. +#| Confidence intervals around the smooth curves #| are also displayed. # Left @@ -365,10 +366,7 @@ The local data argument in `geom_point()` overrides the global data argument in #| message: false #| fig-alt: | #| Scatterplot of highway fuel efficiency versus engine size of cars, where -#| points are colored according to the car class. A smooth curve following -#| the trajectory of the relationship between highway fuel efficiency versus -#| engine size of subcompact cars is overlaid along with a confidence interval -#| around it. +#| two-seater cars are highlighted with red points and open circles. ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + @@ -503,7 +501,7 @@ In @sec-data-visualization you learned about faceting with `facet_wrap()`, which ```{r} #| fig-alt: | #| Scatterplot of highway fuel efficiency versus engine size of cars, -#| faceted by class, with facets spanning two rows. +#| faceted by number of cylinders, with facets spanning two rows. ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + @@ -528,7 +526,7 @@ ggplot(mpg, aes(x = displ, y = hwy)) + By default each of the facets share the same scale and range for x and y axes. 
This is useful when you want to compare data across facets but it can be limiting when you want to visualize the relationship within each facet better. -Setting the `scales` argument in a faceting function to `"free"` will allow for different axis scales across both rows and columns, `"free_x"` will allow for different scales across rows, and `"free_y"` will allow for different scales across columns. +Setting the `scales` argument in a faceting function to `"free_x"` will allow for different scales of x-axis across columns, `"free_y"` will allow for different scales on y-axis across rows, and `"free"` will allow both. ```{r} #| fig-alt: | @@ -541,7 +539,7 @@ Setting the `scales` argument in a faceting function to `"free"` will allow for ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + - facet_grid(drv ~ cyl, scales = "free_y") + facet_grid(drv ~ cyl, scales = "free") ``` ### Exercises @@ -581,7 +579,7 @@ ggplot(mpg, aes(x = displ, y = hwy)) + ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + - facet_wrap(~ class, nrow = 2) + facet_wrap(~ cyl, nrow = 2) ``` What are the advantages to using faceting instead of the color aesthetic? @@ -775,7 +773,7 @@ You can color a bar chart using either the `color` aesthetic, or, more usefully, #| fig-alt: | #| Two bar charts of drive types of cars. In the first plot, the bars have #| colored borders. In the second plot, they're filled with colors. Heights -#| of the bars correspond to the number of cars in each cut category. +#| of the bars correspond to the number of cars in each drv category. # Left ggplot(mpg, aes(x = drv, color = drv)) + @@ -794,7 +792,7 @@ Each colored rectangle represents a combination of `drv` and `class`. #| Segmented bar chart of drive types of cars, where each bar is filled with #| colors for the classes of cars. Heights of the bars correspond to the #| number of cars in each drive category, and heights of the colored -#| segments are proportional to the number of cars with a given class +#| segments represent the number of cars with a given class #| level within a given drive type level. ggplot(mpg, aes(x = drv, fill = class)) + @@ -813,9 +811,9 @@ If you don't want a stacked bar chart, you can use one of three other options: ` #| fig-width: 4 #| fig-alt: | #| Segmented bar chart of drive types of cars, where each bar is filled with - #| colors for the classes of cars. Heights of the bars correspond to the - #| number of cars in each drive category, and heights of the colored - #| segments are proportional to the number of cars with a given class + #| colors for the classes of cars. + #| Heights of the colored + #| segments represent the number of cars with a given class #| level within a given drive type level. However the segments overlap. In #| the first plot the bars are filled with transparent colors #| and in the second plot they are only outlined with color. 
From 43ee557a9e60eed2deb0322075b2b8513753bf8f Mon Sep 17 00:00:00 2001 From: David Date: Tue, 28 May 2024 14:25:41 +0200 Subject: [PATCH 24/43] Fixing some minor errors (#1657) * Fixing minor typo in data tidy * Fixing missleading reference in data transform chapter * Update data-tidy.qmd --------- Co-authored-by: Mine Cetinkaya-Rundel --- data-tidy.qmd | 2 +- data-transform.qmd | 2 +- logicals.qmd | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/data-tidy.qmd b/data-tidy.qmd index d9562e16f..0aa24f838 100644 --- a/data-tidy.qmd +++ b/data-tidy.qmd @@ -397,7 +397,7 @@ household ``` This dataset contains data about five families, with the names and dates of birth of up to two children. -The new challenge in this dataset is that the column names contain the names of two variables (`dob`, `name)` and the values of another (`child`, with values 1 or 2). +The new challenge in this dataset is that the column names contain the names of two variables (`dob`, `name`) and the values of another (`child`, with values 1 or 2). To solve this problem we again need to supply a vector to `names_to` but this time we use the special `".value"` sentinel; this isn't the name of a variable but a unique value that tells `pivot_longer()` to do something different. This overrides the usual `values_to` argument to use the first component of the pivoted column name as a variable name in the output. diff --git a/data-transform.qmd b/data-transform.qmd index e11d79e98..a52779194 100644 --- a/data-transform.qmd +++ b/data-transform.qmd @@ -164,7 +164,7 @@ flights |> ``` This "works", in the sense that it doesn't throw an error, but it doesn't do what you want because `|` first checks the condition `month == 1` and then checks the condition `2`, which is not a sensible condition to check. -We'll learn more about what's happening here and why in @sec-boolean-operations. +We'll learn more about what's happening here and why in @sec-order-operations-boolean. ### `arrange()` diff --git a/logicals.qmd b/logicals.qmd index 7d9fb9c40..f19654d93 100644 --- a/logicals.qmd +++ b/logicals.qmd @@ -248,7 +248,7 @@ A missing value in a logical vector means that the value could either be `TRUE` However, `NA | FALSE` is `NA` because we don't know if `NA` is `TRUE` or `FALSE`. Similar reasoning applies with `NA & FALSE`. -### Order of operations +### Order of operations {#sec-order-operations-boolean} Note that the order of operations doesn't work like English. Take the following code that finds all flights that departed in November or December: From c70b13b07560f692bbce762206c1e38cf451c4e7 Mon Sep 17 00:00:00 2001 From: Mitsuo Shiota <48662507+mitsuoxv@users.noreply.github.com> Date: Tue, 28 May 2024 21:27:06 +0900 Subject: [PATCH 25/43] Suggest/data-import.qmd (#1659) * probably a typo, and no mention about fixing `student_id` column below. * "and" is better than "so" --- data-import.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/data-import.qmd b/data-import.qmd index db74fdcef..705c6b58c 100644 --- a/data-import.qmd +++ b/data-import.qmd @@ -88,7 +88,7 @@ students In the `favourite.food` column, there are a bunch of food items, and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available". This is something we can address using the `na` argument. -By default, `read_csv()` only recognizes empty strings (`""`) in this dataset as `NA`s, we want it to also recognize the character string `"N/A"`. 
+By default, `read_csv()` only recognizes empty strings (`""`) in this dataset as `NA`s, and we want it to also recognize the character string `"N/A"`. ```{r} #| message: false @@ -131,7 +131,7 @@ students |> Note that the values in the `meal_plan` variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (``) to factor (``). You'll learn more about factors in @sec-factors. -Before you analyze these data, you'll probably want to fix the `age` and `id` columns. +Before you analyze these data, you'll probably want to fix the `age` column. Currently, `age` is a character variable because one of the observations is typed out as `five` instead of a numeric `5`. We discuss the details of fixing this issue in @sec-import-spreadsheets. From 87fb6eefb96042efd541f77292aa78fb3984bb2b Mon Sep 17 00:00:00 2001 From: Mine Cetinkaya-Rundel Date: Fri, 31 May 2024 13:57:18 -0400 Subject: [PATCH 26/43] Fix typo, closes #1644 --- rectangling.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rectangling.qmd b/rectangling.qmd index 2ba86f481..5a6bf6f7e 100644 --- a/rectangling.qmd +++ b/rectangling.qmd @@ -345,7 +345,7 @@ repos ``` This tibble contains 6 rows, one row for each child of `gh_repos`. -Each row contains a unnamed list with either 26 or 30 rows. +Each row contains an unnamed list with either 26 or 30 rows. Since these are unnamed, we'll start with `unnest_longer()` to put each child in its own row: ```{r} From 5bfcc87d9a519a33381118f919fb97a23d1e1e51 Mon Sep 17 00:00:00 2001 From: Mine Cetinkaya-Rundel Date: Fri, 31 May 2024 14:04:23 -0400 Subject: [PATCH 27/43] Fix typo in b calculation, closes #1638 --- functions.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/functions.qmd b/functions.qmd index 49865235e..2ab4e6daa 100644 --- a/functions.qmd +++ b/functions.qmd @@ -60,7 +60,7 @@ df |> mutate( a = (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)), b = (b - min(b, na.rm = TRUE)) / - (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)), + (max(b, na.rm = TRUE) - min(b, na.rm = TRUE)), c = (c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)), d = (d - min(d, na.rm = TRUE)) / From 95f1cb1bb8d487c4d42fbc5f665e3eec3861c21d Mon Sep 17 00:00:00 2001 From: Mitsuo Shiota <48662507+mitsuoxv@users.noreply.github.com> Date: Sat, 1 Jun 2024 12:34:43 +0900 Subject: [PATCH 28/43] Fix/communication.qmd, mainly fig-alt corrections (#1663) * correct fig-alt by using previous fig-alt * probably typos and careless mistakes * probably a copy and paste problem * correct fig-alt --- communication.qmd | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/communication.qmd b/communication.qmd index af2ad1a1e..2557d3399 100644 --- a/communication.qmd +++ b/communication.qmd @@ -185,9 +185,10 @@ This useful package will automatically adjust labels so that they don't overlap: ```{r} #| fig-alt: | -#| Scatterplot of highway fuel efficiency versus engine size of cars, where -#| points are colored according to the car class. Some points are labelled -#| with the car's name. The labels are box with white, transparent background +#| Scatterplot of highway mileage versus engine size where points are colored +#| by drive type. Smooth curves for each drive type are overlaid. +#| Text labels identify the curves as front-wheel, rear-wheel, and 4-wheel. +#| The labels are box with white background #| and positioned to not overlap. 
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + @@ -364,7 +365,7 @@ ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + You can use `labels` in the same way (a character vector the same length as `breaks`), but you can also set it to `NULL` to suppress the labels altogether. This can be useful for maps, or for publishing plots where you can't share the absolute numbers. You can also use `breaks` and `labels` to control the appearance of legends. -For discrete scales for categorical variables, `labels` can be a named list of the existing levels names and the desired labels for them. +For discrete scales for categorical variables, `labels` can be a named list of the existing level names and the desired labels for them. ```{r} #| fig-alt: | @@ -390,7 +391,7 @@ Note that `breaks` is in the original scale of the data. #| fig-alt: | #| Two side-by-side box plots of price versus cut of diamonds. The outliers #| are transparent. On both plots the x-axis labels are formatted as dollars. -#| The x-axis labels on the plot start at $0 and go to $15,000, increasing +#| The x-axis labels on the left plot start at $0 and go to $15,000, increasing #| by $5,000. The x-axis labels on the right plot start at $1K and go to #| $19K, increasing by $6K. @@ -461,7 +462,7 @@ The theme setting `legend.position` controls where the legend is drawn: #| fig-alt: | #| Four scatterplots of highway fuel efficiency versus engine size of cars #| where points are colored based on class of car. Clockwise, the legend -#| is placed on the right, left, top, and bottom of the plot. +#| is placed on the right, left, bottom, and top of the plot. base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = class)) @@ -575,7 +576,7 @@ This will also help ensure your plot is interpretable in black and white. ```{r} #| fig-alt: | -#| Two scatterplots of highway mileage versus engine size where both color +#| Scatterplot of highway mileage versus engine size where both color #| and shape of points are based on drive type. The color palette is not #| the default ggplot2 palette. @@ -686,8 +687,9 @@ Subsetting the data has affected the x and y scales as well as the smooth curve. #| fig-width: 4 #| message: false #| fig-alt: | -#| On the left, scatterplot of highway mileage vs. displacement, with -#| displacement. The smooth curve overlaid shows a decreasing, and then +#| On the left, scatterplot of highway mileage vs. displacement +#| where points are colored by drive type. +#| The smooth curve overlaid shows a decreasing, and then #| increasing trend, like a hockey stick. On the right, same variables #| are plotted with displacement ranging only from 5 to 6 and highway #| mileage ranging only from 10 to 25. The smooth curve overlaid shows a @@ -969,10 +971,9 @@ In the following, `|` places the `p1` and `p3` next to each other and `/` moves #| fig-alt: | #| Three plots laid out such that first and third plot are next to each other #| and the second plot stretched beneath them. The first plot is a -#| scatterplot of highway mileage versus engine size, third plot is a -#| scatterplot of highway mileage versus city mileage, and the third plot is -#| side-by-side boxplots of highway mileage versus drive train) placed next -#| to each other. +#| scatterplot of highway mileage versus engine size, the third plot is a +#| scatterplot of highway mileage versus city mileage, and the second plot is +#| side-by-side boxplots of highway mileage versus drive train). 
p3 <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point() + From 24b38c608a6d89cd607c5d66b575917ee19df6ab Mon Sep 17 00:00:00 2001 From: David Kane Date: Fri, 31 May 2024 23:35:15 -0400 Subject: [PATCH 29/43] Update iteration.qmd (#1647) Not sure what "Vs" is supposed to mean. Perhaps "Versus?" In any case, "Compare with" seems better. --- iteration.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/iteration.qmd b/iteration.qmd index 56fee2cae..62090c688 100644 --- a/iteration.qmd +++ b/iteration.qmd @@ -323,7 +323,7 @@ diamonds |> summarize_means(c(carat, x:z)) ``` -### Vs `pivot_longer()` +### Compare with `pivot_longer()` Before we go on, it's worth pointing out an interesting connection between `across()` and `pivot_longer()` (@sec-pivoting). In many cases, you perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. From bc998bec1c0fd7e71b57df7632dcedc93f88f7ca Mon Sep 17 00:00:00 2001 From: Mine Cetinkaya-Rundel Date: Fri, 31 May 2024 23:37:12 -0400 Subject: [PATCH 30/43] Fix typo, closes #1643 --- logicals.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/logicals.qmd b/logicals.qmd index f19654d93..b0b7f5400 100644 --- a/logicals.qmd +++ b/logicals.qmd @@ -210,7 +210,7 @@ For example, `df |> filter(!is.na(x))` finds all rows where `x` is not missing a #| out-width: NULL #| fig-cap: | #| The complete set of Boolean operations. `x` is the left-hand -#| circle, `y` is the right-hand circle, and the shaded region show +#| circle, `y` is the right-hand circle, and the shaded regions show #| which parts each operator selects. #| fig-alt: | #| Six Venn diagrams, each explaining a given logical operator. The From caf872cc072cd8487bcde044455bdfa570841fbf Mon Sep 17 00:00:00 2001 From: Mitsuo Shiota <48662507+mitsuoxv@users.noreply.github.com> Date: Sat, 1 Jun 2024 21:20:12 +0900 Subject: [PATCH 31/43] Fix/logicals.qmd and transform.qmd; correction of fig-alt, and typos (#1664) --- logicals.qmd | 12 ++++++------ transform.qmd | 2 +- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/logicals.qmd b/logicals.qmd index b0b7f5400..62a2e1590 100644 --- a/logicals.qmd +++ b/logicals.qmd @@ -213,11 +213,11 @@ For example, `df |> filter(!is.na(x))` finds all rows where `x` is not missing a #| circle, `y` is the right-hand circle, and the shaded regions show #| which parts each operator selects. #| fig-alt: | -#| Six Venn diagrams, each explaining a given logical operator. The -#| circles (sets) in each of the Venn diagrams represent x and y. 1. y & -#| !x is y but none of x; x & y is the intersection of x and y; x & !y is -#| x but none of y; x is all of x none of y; xor(x, y) is everything -#| except the intersection of x and y; y is all of y and none of x; and +#| Seven Venn diagrams, each explaining a given logical operator. The +#| circles (sets) in each of the Venn diagrams represent x and y. x & +#| !y is x but none of y; x & y is the intersection of x and y; !x & y is +#| y but none of x; x is all of x; xor(x, y) is everything +#| except the intersection of x and y; y is all of y; and #| x | y is everything. knitr::include_graphics("diagrams/transform.png", dpi = 270) ``` @@ -352,7 +352,7 @@ That leads us to the numeric summaries. ### Numeric summaries of logical vectors {#sec-numeric-summaries-of-logicals} When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0. 
-This makes `sum()` and `mean()` very useful with logical vectors because `sum(x)` gives the number of `TRUE`s and `mean(x)` gives the proportion of `TRUE`s (because `mean()` is just `sum()` divided by `length()`. +This makes `sum()` and `mean()` very useful with logical vectors because `sum(x)` gives the number of `TRUE`s and `mean(x)` gives the proportion of `TRUE`s (because `mean()` is just `sum()` divided by `length()`). That, for example, allows us to see the proportion of flights that were delayed on departure by at most an hour and the number of flights that were delayed on arrival by five hours or more: diff --git a/transform.qmd b/transform.qmd index a82beb038..a7800091d 100644 --- a/transform.qmd +++ b/transform.qmd @@ -26,7 +26,7 @@ You can read these chapters as you need them; they're designed to be largely sta - @sec-logicals teaches you about logical vectors. These are the simplest types of vectors, but are extremely powerful. - You'll learn how to create them with numeric comparisons, how to combine them with Boolean algebra, how to use them in summaries, and how to use them for condition transformations. + You'll learn how to create them with numeric comparisons, how to combine them with Boolean algebra, how to use them in summaries, and how to use them for conditional transformations. - @sec-numbers dives into tools for vectors of numbers, the powerhouse of data science. You'll learn more about counting and a bunch of important transformation and summary functions. From 01afcfb1b27c673329bb0709524e46e59d6b1486 Mon Sep 17 00:00:00 2001 From: Mine Cetinkaya-Rundel Date: Sat, 1 Jun 2024 15:16:17 -0400 Subject: [PATCH 32/43] Undo wrong edit --- functions.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/functions.qmd b/functions.qmd index 2ab4e6daa..5bb8f3a40 100644 --- a/functions.qmd +++ b/functions.qmd @@ -59,7 +59,7 @@ df <- tibble( df |> mutate( a = (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)), - b = (b - min(b, na.rm = TRUE)) / + b = (b - min(a, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE)), c = (c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)), From 6e077968c3e231dfc45943dfa0d2c9a2919e9772 Mon Sep 17 00:00:00 2001 From: Mitsuo Shiota <48662507+mitsuoxv@users.noreply.github.com> Date: Tue, 4 Jun 2024 13:37:48 +0900 Subject: [PATCH 33/43] probably typos in fig-alt (#1667) --- numbers.qmd | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/numbers.qmd b/numbers.qmd index 37c3db49d..cc5e1aad4 100644 --- a/numbers.qmd +++ b/numbers.qmd @@ -247,7 +247,7 @@ The results are shown in @fig-prop-cancelled. #| fig-alt: | #| A line plot showing how proportion of cancelled flights changes over #| the course of the day. The proportion starts low at around 0.5% at -#| 6am, then steadily increases over the course of the day until peaking +#| 5am, then steadily increases over the course of the day until peaking #| at 4% at 7pm. The proportion of cancelled flights then drops rapidly #| getting down to around 1% by midnight. flights |> @@ -588,9 +588,9 @@ The median delay is always smaller than the mean delay because flights sometimes #| fig-alt: | #| All points fall below a 45° line, meaning that the median delay is #| always less than the mean delay. Most points are clustered in a -#| dense region of mean [0, 20] and median [0, 5]. As the mean delay +#| dense region of mean [0, 20] and median [-5, 5]. As the mean delay #| increases, the spread of the median also increases. 
There are two -#| outlying points with mean ~60, median ~50, and mean ~85, median ~55. +#| outlying points with mean ~60, median ~30, and mean ~85, median ~55. flights |> group_by(year, month, day) |> summarize( @@ -701,7 +701,7 @@ The distributions seem to follow a common pattern, suggesting it's fine to use t #| fig-alt: | #| The distribution of `dep_delay` is highly right skewed with a strong #| peak slightly less than 0. The 365 frequency polygons are mostly -#| overlapping forming a thick black bland. +#| overlapping forming a thick black band. flights |> filter(dep_delay < 120) |> From 06f8d5c6904e4dda5edee5ddf453791181758470 Mon Sep 17 00:00:00 2001 From: "Luong Vuong (Leo)" <92429953+LeoLuongVuong@users.noreply.github.com> Date: Sun, 14 Jul 2024 00:07:37 +0700 Subject: [PATCH 34/43] Edit typo in EDA chapter, summaries -> summarizes (#1676) --- EDA.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/EDA.qmd b/EDA.qmd index d41f46863..82c01e83b 100644 --- a/EDA.qmd +++ b/EDA.qmd @@ -597,7 +597,7 @@ ggplot(smaller, aes(x = carat, y = price)) + ``` `cut_width(x, width)`, as used above, divides `x` into bins of width `width`. -By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summaries a different number of points. +By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with `varwidth = TRUE`. #### Exercises From 643ab1b4b996275411625357663b4246f3802eb1 Mon Sep 17 00:00:00 2001 From: "Luong Vuong (Leo)" <92429953+LeoLuongVuong@users.noreply.github.com> Date: Sun, 14 Jul 2024 12:15:50 +0700 Subject: [PATCH 35/43] Edit a typo in the logicals chapter (#1677) --- logicals.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/logicals.qmd b/logicals.qmd index 62a2e1590..c262d71d7 100644 --- a/logicals.qmd +++ b/logicals.qmd @@ -437,7 +437,7 @@ There's an optional fourth argument, `missing` which will be used if the input i if_else(x > 0, "+ve", "-ve", "???") ``` -You can also use vectors for the the `true` and `false` arguments. +You can also use vectors for the `true` and `false` arguments. For example, this allows us to create a minimal implementation of `abs()`: ```{r} From 9a9ec24dd64171f96535ea1ec381085ec2b0d821 Mon Sep 17 00:00:00 2001 From: Mine Cetinkaya-Rundel Date: Mon, 2 Sep 2024 09:37:12 -0400 Subject: [PATCH 36/43] Fix typo (closes #1681) + various other copy edits --- data-transform.qmd | 73 +++++++++++++++++++++++----------------------- 1 file changed, 37 insertions(+), 36 deletions(-) diff --git a/data-transform.qmd b/data-transform.qmd index a52779194..b0d0a7c45 100644 --- a/data-transform.qmd +++ b/data-transform.qmd @@ -15,12 +15,12 @@ You'll learn how to do all that (and more!) in this chapter, which will introduc The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We'll start with functions that operate on rows and then columns of a data frame, then circle back to talk more about the pipe, an important tool that you use to combine verbs. We will then introduce the ability to work with groups. 
-We will end the chapter with a case study that showcases these functions in action and we'll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g., numbers, strings, dates). +We will end the chapter with a case study that showcases these functions in action. In later chapters, we'll return to the functions in more detail as we start to dig into specific types of data (e.g., numbers, strings, dates). ### Prerequisites -In this chapter we'll focus on the dplyr package, another core member of the tidyverse. -We'll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data. +In this chapter, we'll focus on the dplyr package, another core member of the tidyverse. +We'll illustrate the key ideas using data from the nycflights13 package and use ggplot2 to help us understand the data. ```{r} #| label: setup @@ -32,14 +32,14 @@ library(tidyverse) Take careful note of the conflicts message that's printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()` and `stats::lag()`. -So far we've mostly ignored which package a function comes from because most of the time it doesn't matter. +So far, we've mostly ignored which package a function comes from because it doesn't usually matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we'll use the same syntax as R: `packagename::functionname()`. ### nycflights13 -To explore the basic dplyr verbs, we're going to use `nycflights13::flights`. +To explore the basic dplyr verbs, we will use `nycflights13::flights`. This dataset contains all `r format(nrow(nycflights13::flights), big.mark = ",")` flights that departed from New York City in 2013. -The data comes from the US [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr), and is documented in `?flights`. +The data comes from the US [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr) and is documented in `?flights`. ```{r} flights @@ -48,24 +48,24 @@ flights `flights` is a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. -If you're using RStudio, the most convenient is probably `View(flights)`, which will open an interactive scrollable and filterable view. +If you're using RStudio, the most convenient is probably `View(flights)`, which opens an interactive, scrollable, and filterable view. Otherwise you can use `print(flights, width = Inf)` to show all columns, or use `glimpse()`: ```{r} glimpse(flights) ``` -In both views, the variables names are followed by abbreviations that tell you the type of each variable: `` is short for integer, `` is short for double (aka real numbers), `` for character (aka strings), and `` for date-time. -These are important because the operations you can perform on a column depend so much on its "type". 
+In both views, the variable names are followed by abbreviations that tell you the type of each variable: `` is short for integer, `` is short for double (aka real numbers), `` for character (aka strings), and `` for date-time. +These are important because the operations you can perform on a column depend heavily on its "type." ### dplyr basics -You're about to learn the primary dplyr verbs (functions) which will allow you to solve the vast majority of your data manipulation challenges. +You're about to learn the primary dplyr verbs (functions), which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it's worth stating what they have in common: 1. The first argument is always a data frame. -2. The subsequent arguments typically describe which columns to operate on, using the variable names (without quotes). +2. The subsequent arguments typically describe which columns to operate on using the variable names (without quotes). 3. The output is always a new data frame. @@ -86,14 +86,15 @@ flights |> ``` dplyr's verbs are organized into four groups based on what they operate on: **rows**, **columns**, **groups**, or **tables**. -In the following sections you'll learn the most important verbs for rows, columns, and groups, then we'll come back to the join verbs that work on tables in @sec-joins. +In the following sections, you'll learn the most important verbs for rows, columns, and groups. Then, we'll return to the join verbs that work on tables in @sec-joins. Let's dive in! ## Rows The most important verbs that operate on rows of a dataset are `filter()`, which changes which rows are present without changing their order, and `arrange()`, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged. -We'll also discuss `distinct()` which finds rows with unique values but unlike `arrange()` and `filter()` it can also optionally modify the columns. +We'll also discuss `distinct()` which finds rows with unique values. +Unlike `arrange()` and `filter()` it can also optionally modify the columns. ### `filter()` @@ -102,7 +103,7 @@ The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that departed more than 120 minutes (two hours) late: -[^data-transform-1]: Later, you'll learn about the `slice_*()` family which allows you to choose rows based on their positions. +[^data-transform-1]: Later, you'll learn about the `slice_*()` family, which allows you to choose rows based on their positions. ```{r} flights |> @@ -170,9 +171,9 @@ We'll learn more about what's happening here and why in @sec-order-operations-bo `arrange()` changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. -If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. +If you provide more than one column name, each additional column will be used to break ties in the values of the preceding columns. For example, the following code sorts by the departure time, which is spread over four columns. -We get the earliest years first, then within a year the earliest months, etc. +We get the earliest years first, then within a year, the earliest months, etc. 
```{r} flights |> @@ -191,7 +192,7 @@ Note that the number of rows has not changed -- we're only arranging the data, w ### `distinct()` -`distinct()` finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. +`distinct()` finds all the unique rows in a dataset, so technically, it primarily operates on the rows. Most of the time, however, you'll want the distinct combination of some variables, so you can also optionally supply column names: ```{r} @@ -204,7 +205,7 @@ flights |> distinct(origin, dest) ``` -Alternatively, if you want to the keep other columns when filtering for unique rows, you can use the `.keep_all = TRUE` option. +Alternatively, if you want to keep other columns when filtering for unique rows, you can use the `.keep_all = TRUE` option. ```{r} flights |> @@ -213,7 +214,7 @@ flights |> It's not a coincidence that all of these distinct flights are on January 1: `distinct()` will find the first occurrence of a unique row in the dataset and discard the rest. -If you want to find the number of occurrences instead, you're better off swapping `distinct()` for `count()`, and with the `sort = TRUE` argument you can arrange them in descending order of number of occurrences. +If you want to find the number of occurrences instead, you're better off swapping `distinct()` for `count()`. With the `sort = TRUE` argument, you can arrange them in descending order of the number of occurrences. You'll learn more about count in @sec-counts. ```{r} @@ -229,10 +230,10 @@ flights |> - Flew to Houston (`IAH` or `HOU`) - Were operated by United, American, or Delta - Departed in summer (July, August, and September) - - Arrived more than two hours late, but didn't leave late + - Arrived more than two hours late but didn't leave late - Were delayed by at least an hour, but made up over 30 minutes in flight -2. Sort `flights` to find the flights with longest departure delays. +2. Sort `flights` to find the flights with the longest departure delays. Find the flights that left earliest in the morning. 3. Sort `flights` to find the fastest flights. @@ -265,8 +266,8 @@ flights |> ) ``` -By default, `mutate()` adds new columns on the right hand side of your dataset, which makes it difficult to see what's happening here. -We can use the `.before` argument to instead add the variables to the left hand side[^data-transform-2]: +By default, `mutate()` adds new columns on the right-hand side of your dataset, which makes it difficult to see what's happening here. +We can use the `.before` argument to instead add the variables to the left-hand side[^data-transform-2]: [^data-transform-2]: Remember that in RStudio, the easiest way to see a dataset with many columns is `View()`. @@ -279,7 +280,7 @@ flights |> ) ``` -The `.` is a sign that `.before` is an argument to the function, not the name of a third new variable we are creating. +The `.` indicates that `.before` is an argument to the function, not the name of a third new variable we are creating. You can also use `.after` to add after a variable, and in both `.before` and `.after` you can use the variable name instead of a position. For example, we could add the new variables after `day`: @@ -371,7 +372,7 @@ See `?select` for more details. Once you know regular expressions (the topic of @sec-regular-expressions) you'll also be able to use `matches()` to select variables that match a pattern. You can rename variables as you `select()` them by using `=`. 
-The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side: +The new name appears on the left-hand side of the `=`, and the old variable appears on the right-hand side: ```{r} flights |> @@ -587,10 +588,10 @@ flights |> ) ``` -Uhoh! -Something has gone wrong and all of our results are `NA`s (pronounced "N-A"), R's symbol for missing value. +Uh-oh! +Something has gone wrong, and all of our results are `NA`s (pronounced "N-A"), R's symbol for missing value. This happened because some of the observed flights had missing data in the delay column, and so when we calculated the mean including those values, we got an `NA` result. -We'll come back to discuss missing values in detail in @sec-missing-values, but for now we'll tell the `mean()` function to ignore all missing values by setting the argument `na.rm` to `TRUE`: +We'll come back to discuss missing values in detail in @sec-missing-values, but for now, we'll tell the `mean()` function to ignore all missing values by setting the argument `na.rm` to `TRUE`: ```{r} flights |> @@ -616,7 +617,7 @@ Means and counts can get you a surprisingly long way in data science! ### The `slice_` functions -There are five handy functions that allow you extract specific rows within each group: +There are five handy functions that allow you to extract specific rows within each group: - `df |> slice_head(n = 1)` takes the first row from each group. - `df |> slice_tail(n = 1)` takes the last row in each group. @@ -740,7 +741,7 @@ You can learn more about it in the [dplyr 1.1.0 blog post](https://www.tidyverse 2. Find the flights that are most delayed upon departure from each destination. -3. How do delays vary over the course of the day. +3. How do delays vary over the course of the day? Illustrate your answer with a plot. 4. What happens if you supply a negative `n` to `slice_min()` and friends? @@ -768,7 +769,7 @@ You can learn more about it in the [dplyr 1.1.0 blog post](https://www.tidyverse ``` b. Write down what you think the output will look like, then check if you were correct, and describe what `arrange()` does. - Also comment on how it's different from the `group_by()` in part (a). + Also, comment on how it's different from the `group_by()` in part (a). ```{r} #| eval: false @@ -853,10 +854,10 @@ When we plot the skill of the batter (measured by the batting average, `performa ```{r} #| warning: false #| fig-alt: | -#| A scatterplot of number of batting performance vs. batting opportunites +#| A scatterplot of the number of batting performances vs. batting opportunities #| overlaid with a smoothed line. Average performance increases sharply #| from 0.2 at when n is ~100 to 0.25 when n is ~1000. Average performance -#| continues to increase linearly at a much shallower slope reaching +#| continues to increase linearly at a much shallower slope, reaching #| 0.3 when n is ~12,000. 
batters |> @@ -882,8 +883,8 @@ You can find a good explanation of this problem and how to overcome it at Date: Fri, 27 Sep 2024 14:37:33 +0200 Subject: [PATCH 37/43] correct ordered factor definition (#1686) Co-authored-by: Hadley Wickham --- factors.qmd | 47 +++++++++++++++++++++++++---------------------- 1 file changed, 25 insertions(+), 22 deletions(-) diff --git a/factors.qmd b/factors.qmd index d0864daf0..565b39060 100644 --- a/factors.qmd +++ b/factors.qmd @@ -56,7 +56,7 @@ To create a factor you must start by creating a list of the valid **levels**: ```{r} month_levels <- c( - "Jan", "Feb", "Mar", "Apr", "May", "Jun", + "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" ) ``` @@ -169,7 +169,7 @@ relig_summary <- gss_cat |> n = n() ) -ggplot(relig_summary, aes(x = tvhours, y = relig)) + +ggplot(relig_summary, aes(x = tvhours, y = relig)) + geom_point() ``` @@ -212,7 +212,7 @@ What if we create a similar plot looking at how average age varies across report #| fig-alt: | #| A scatterplot with age on the x-axis and income on the y-axis. Income #| has been reordered in order of average age which doesn't make much -#| sense. One section of the y-axis goes from $6000-6999, then <$1000, +#| sense. One section of the y-axis goes from $6000-6999, then <$1000, #| then $8000-9999. rincome_summary <- gss_cat |> group_by(rincome) |> @@ -221,7 +221,7 @@ rincome_summary <- gss_cat |> n = n() ) -ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) + +ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) + geom_point() ``` @@ -257,15 +257,15 @@ This makes the plot easier to read because the colors of the line at the far rig #| A line plot with age on the x-axis and proportion on the y-axis. #| There is one line for each category of marital status: no answer, #| never married, separated, divorced, widowed, and married. It is -#| a little hard to read the plot because the order of the legend is -#| unrelated to the lines on the plot. Rearranging the legend makes -#| the plot easier to read because the legend colors now match the -#| order of the lines on the far right of the plot. You can see some -#| unsurprising patterns: the proportion never married decreases with -#| age, married forms an upside down U shape, and widowed starts off +#| a little hard to read the plot because the order of the legend is +#| unrelated to the lines on the plot. Rearranging the legend makes +#| the plot easier to read because the legend colors now match the +#| order of the lines on the far right of the plot. You can see some +#| unsurprising patterns: the proportion never married decreases with +#| age, married forms an upside down U shape, and widowed starts off #| low but increases steeply after age 60. by_age <- gss_cat |> - filter(!is.na(age)) |> + filter(!is.na(age)) |> count(age, marital) |> group_by(age) |> mutate( @@ -273,13 +273,13 @@ by_age <- gss_cat |> ) ggplot(by_age, aes(x = age, y = prop, color = marital)) + - geom_line(linewidth = 1) + + geom_line(linewidth = 1) + scale_color_brewer(palette = "Set1") ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) + geom_line(linewidth = 1) + - scale_color_brewer(palette = "Set1") + - labs(color = "marital") + scale_color_brewer(palette = "Set1") + + labs(color = "marital") ``` Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables. 
@@ -288,7 +288,7 @@ Combine it with `fct_rev()` if you want them in increasing frequency so that in ```{r} #| fig-alt: | #| A bar char of marital status ordered from least to most common: -#| no answer (~0), separated (~1,000), widowed (~2,000), divorced +#| no answer (~0), separated (~1,000), widowed (~2,000), divorced #| (~3,000), never married (~5,000), married (~10,000). gss_cat |> mutate(marital = marital |> fct_infreq() |> fct_rev()) |> @@ -409,21 +409,24 @@ Read the documentation to learn about `fct_lump_min()` and `fct_lump_prop()` whi ## Ordered factors {#sec-ordered-factors} -Before we go on, there's a special type of factor that needs to be mentioned briefly: ordered factors. -Ordered factors, created with `ordered()`, imply a strict ordering and equal distance between levels: the first level is "less than" the second level by the same amount that the second level is "less than" the third level, and so on. -You can recognize them when printing because they use `<` between the factor levels: +Before we continue, it's important to briefly mention a special type of factor: ordered factors. +Created with the `ordered()` function, ordered factors imply a strict ordering between levels, but don't specify anything about the magnitude of the differences between the levels. +You use ordered factors when you know there the levels are ranked, but there's no precise numerical ranking. + +You can identify an ordered factor when its printed because it uses `<` symbols between the factor levels: ```{r} ordered(c("a", "b", "c")) ``` - -In practice, `ordered()` factors behave very similarly to regular factors. +In both base R and the tidyverse, ordered factors behave very similarly to regular factors. There are only two places where you might notice different behavior: - If you map an ordered factor to color or fill in ggplot2, it will default to `scale_color_viridis()`/`scale_fill_viridis()`, a color scale that implies a ranking. -- If you use an ordered function in a linear model, it will use "polygonal contrasts". These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don't routinely interpret them. If you want to learn more, we recommend `vignette("contrasts", package = "faux")` by Lisa DeBruine. +- If you use an ordered predictor in a linear model, it will use "polynomial contrasts". These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don't routinely interpret them. If you want to learn more, we recommend `vignette("contrasts", package = "faux")` by Lisa DeBruine. -Given the arguable utility of these differences, we don't generally recommend using ordered factors. +For the purposes of this book, correctly distinguishing between regular and ordered factors is not particularly important. +More broadly, however, certain fields (particularly the social sciences) do use ordered factors extensively. +In these contexts, it's important to correctly identify them so that other analysis packages can offer the appropriate behavior. 
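A quick, hedged sketch of the ranking behaviour this new paragraph describes; the `ratings` vector below is invented purely for illustration and is not drawn from `gss_cat`:

```r
# A minimal sketch of how an ordered factor differs from a regular factor.
ratings <- ordered(
  c("low", "high", "medium"),
  levels = c("low", "medium", "high")
)
ratings
#> [1] low    high   medium
#> Levels: low < medium < high

# Comparisons respect the level ranking for ordered factors ...
ratings[1] < ratings[2]
#> [1] TRUE

# ... but are not meaningful for regular factors.
factor("low", levels = c("low", "medium", "high")) < "medium"
#> Warning: '<' not meaningful for factors
#> [1] NA
```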
## Summary From 12c6affd2fe489e435bb28fb114bc328e4dfd728 Mon Sep 17 00:00:00 2001 From: Beatriz Milz <42153618+beatrizmilz@users.noreply.github.com> Date: Sun, 17 Nov 2024 15:45:57 -0300 Subject: [PATCH 38/43] add Portuguese in the list of translations - index.qmd (#1691) --- index.qmd | 1 + 1 file changed, 1 insertion(+) diff --git a/index.qmd b/index.qmd index b42f33def..5880402ca 100644 --- a/index.qmd +++ b/index.qmd @@ -18,6 +18,7 @@ If you speak another language, you might be interested in the freely available t - [Spanish](http://es.r4ds.hadley.nz) - [Italian](https://it.r4ds.hadley.nz) - [Turkish](https://tr.r4ds.hadley.nz) +- [Portuguese](https://pt.r4ds.hadley.nz) You can find suggested answers to exercises in the book at . From f3b95c4ae02644418fbb4813403675b26bdafc6c Mon Sep 17 00:00:00 2001 From: Andrea Scalia Date: Mon, 30 Dec 2024 17:06:03 +0100 Subject: [PATCH 39/43] Added missing word joins.qmd (#1696) --- joins.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/joins.qmd b/joins.qmd index b18759bb7..8a7d21d8d 100644 --- a/joins.qmd +++ b/joins.qmd @@ -581,7 +581,7 @@ We'll come back to non-equi joins in @sec-non-equi-joins. So far we've explored what happens if a row in `x` matches zero or one row in `y`. What happens if it matches more than one row? -To understand what's going let's first narrow our focus to the `inner_join()` and then draw a picture, @fig-join-match-types. +To understand what's going on let's first narrow our focus to the `inner_join()` and then draw a picture, @fig-join-match-types. ```{r} #| label: fig-join-match-types From 6a1bb7abb988d84dee5bd4656b1b8349ab9306e8 Mon Sep 17 00:00:00 2001 From: Tom Klein Date: Sun, 12 Jan 2025 12:46:45 -0600 Subject: [PATCH 40/43] Fix some typos (#1701) * Fix some typos * Stick with the file name `bake-sale.xlsx` instead of using `bake_sale.xlsx` --- EDA.qmd | 2 +- data-import.qmd | 2 +- databases.qmd | 9 ++++----- functions.qmd | 6 +++--- iteration.qmd | 2 +- joins.qmd | 2 +- numbers.qmd | 6 +++--- program.qmd | 2 +- spreadsheets.qmd | 6 +++--- strings.qmd | 2 +- transform.qmd | 2 +- workflow-scripts.qmd | 2 +- 12 files changed, 21 insertions(+), 22 deletions(-) diff --git a/EDA.qmd b/EDA.qmd index 82c01e83b..1172dbe73 100644 --- a/EDA.qmd +++ b/EDA.qmd @@ -73,7 +73,7 @@ You can see variation easily in real life; if you measure any continuous variabl This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g., the eye colors of different people) or at different times (e.g., the energy levels of an electron at different moments). -Every variable has its own pattern of variation, which can reveal interesting information about how that it varies between measurements on the same observation as well as across observations. +Every variable has its own pattern of variation, which can reveal interesting information about how it varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable's values, which you've learned about in @sec-data-visualization. We'll start our exploration by visualizing the distribution of weights (`carat`) of \~54,000 diamonds from the `diamonds` dataset. 
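A minimal sketch of what that first look could be, assuming ggplot2 is loaded and using its built-in `diamonds` data:

```r
# A sketch: the distribution of diamond weights (carat) as a histogram.
library(ggplot2)

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)
```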
diff --git a/data-import.qmd b/data-import.qmd index 705c6b58c..5d2c617a2 100644 --- a/data-import.qmd +++ b/data-import.qmd @@ -56,7 +56,7 @@ read_csv("data/students.csv") |> We can read this file into R using `read_csv()`. The first argument is the most important: the path to the file. -You can think about the path as the address of the file: the file is called `students.csv` and that it lives in the `data` folder. +You can think about the path as the address of the file: the file is called `students.csv` and it lives in the `data` folder. ```{r} #| message: true diff --git a/databases.qmd b/databases.qmd index 8145250af..ee439d3cd 100644 --- a/databases.qmd +++ b/databases.qmd @@ -53,7 +53,7 @@ There are three high level differences between data frames and database tables: Databases are run by database management systems (**DBMS**'s for short), which come in three basic forms: -- **Client-server** DBMS's run on a powerful central server, which you connect from your computer (the client). They are great for sharing data with multiple people in an organization. Popular client-server DBMS's include PostgreSQL, MariaDB, SQL Server, and Oracle. +- **Client-server** DBMS's run on a powerful central server, which you connect to from your computer (the client). They are great for sharing data with multiple people in an organization. Popular client-server DBMS's include PostgreSQL, MariaDB, SQL Server, and Oracle. - **Cloud** DBMS's, like Snowflake, Amazon's RedShift, and Google's BigQuery, are similar to client server DBMS's, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed. - **In-process** DBMS's, like SQLite or duckdb, run entirely on your computer. They're great for working with large datasets where you're the primary user. @@ -295,7 +295,7 @@ flights |> There are two important differences between dplyr verbs and SELECT clauses: - In SQL, case doesn't matter: you can write `select`, `SELECT`, or even `SeLeCt`. In this book we'll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variables names. -- In SQL, order matters: you must always write the clauses in the order `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`. Confusingly, this order doesn't match how the clauses actually evaluated which is first `FROM`, then `WHERE`, `GROUP BY`, `SELECT`, and `ORDER BY`. +- In SQL, order matters: you must always write the clauses in the order `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`. Confusingly, this order doesn't match how the clauses are actually evaluated which is first `FROM`, then `WHERE`, `GROUP BY`, `SELECT`, and `ORDER BY`. The following sections explore each clause in more detail. @@ -385,7 +385,7 @@ diamonds_db |> show_query() ``` -We'll come back to what's happening with translation `n()` and `mean()` in @sec-sql-expressions. +We'll come back to what's happening with the translation of `n()` and `mean()` in @sec-sql-expressions. ### WHERE @@ -656,8 +656,7 @@ dbplyr's translations are certainly not perfect, and there are many R functions In this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr "backend" that allows you to write the dplyr code you're familiar with, and have it be automatically translated to SQL. 
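A rough sketch of that translation step; the in-memory SQLite table below is a toy example, not data from the chapter:

```r
# A sketch of dbplyr translating dplyr verbs to SQL (toy SQLite table).
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

tbl(con, "mtcars") |>
  filter(cyl == 4) |>
  summarize(avg_mpg = mean(mpg, na.rm = TRUE)) |>
  show_query()
# Prints the generated SQL, roughly:
#   SELECT AVG(mpg) AS avg_mpg FROM mtcars WHERE (cyl = 4.0)
```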
We used that translation to teach you a little SQL; it's important to learn some SQL because it's *the* most commonly used language for working with data and knowing some will make it easier for you to communicate with other data folks who don't use R. -If you've finished this chapter and would like to learn more about SQL. -We have two recommendations: +If you've finished this chapter and would like to learn more about SQL, we have two recommendations: - [*SQL for Data Scientists*](https://sqlfordatascientists.com) by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organizations. - [*Practical SQL*](https://www.practicalsql.com) by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS. diff --git a/functions.qmd b/functions.qmd index 5bb8f3a40..6152156c4 100644 --- a/functions.qmd +++ b/functions.qmd @@ -28,7 +28,7 @@ In this chapter, you'll learn about three useful types of functions: - Plot functions that take a data frame as input and return a plot as output. Each of these sections includes many examples to help you generalize the patterns that you see. -These examples wouldn't be possible without the help of folks of twitter, and we encourage follow the links in the comment to see original inspirations. +These examples wouldn't be possible without the help of folks of twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for [general functions](https://twitter.com/hadleywickham/status/1571603361350164486) and [plotting functions](https://twitter.com/hadleywickham/status/1574373127349575680) to see even more functions. ### Prerequisites @@ -175,7 +175,7 @@ These changes illustrate an important benefit of functions: because we've moved ### Mutate functions -Now you've got the basic idea of functions, let's take a look at a whole bunch of examples. +Now that you've got the basic idea of functions, let's take a look at a whole bunch of examples. We'll start by looking at "mutate" functions, i.e. functions that work well inside of `mutate()` and `filter()` because they return an output of the same length as the input. Let's start with a simple variation of `rescale01()`. @@ -460,7 +460,7 @@ diamonds |> summary6(carat) ``` -Furthermore, since the arguments to summarize are data-masking also means that the `var` argument to `summary6()` is data-masking. +Furthermore, since the arguments to summarize are data-masking, so is the `var` argument to `summary6()`. That means you can also summarize computed variables: ```{r} diff --git a/iteration.qmd b/iteration.qmd index 62090c688..8b2e68cef 100644 --- a/iteration.qmd +++ b/iteration.qmd @@ -220,7 +220,7 @@ df_miss |> If you look carefully, you might intuit that the columns are named using a glue specification (@sec-glue) like `{.col}_{.fn}` where `.col` is the name of the original column and `.fn` is the name of the function. That's not a coincidence! -As you'll learn in the next section, you can use `.names` argument to supply your own glue spec. +As you'll learn in the next section, you can use the `.names` argument to supply your own glue spec. 
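A small preview sketch of that naming pattern; the two-column `df` below is a made-up stand-in for the chapter's `df_miss`:

```r
# A sketch of across() column naming; `df` is a toy stand-in for df_miss.
library(dplyr)

df <- tibble(a = c(1, NA, 3), b = c(NA, 5, 6))

# Default names follow "{.col}_{.fn}": a_mean, a_n_miss, b_mean, b_n_miss.
df |>
  summarize(across(
    a:b,
    list(mean = \(x) mean(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x)))
  ))

# Supplying .names overrides that pattern: mean_of_a, mean_of_b.
df |>
  summarize(across(
    a:b,
    list(mean = \(x) mean(x, na.rm = TRUE)),
    .names = "{.fn}_of_{.col}"
  ))
```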
### Column names diff --git a/joins.qmd b/joins.qmd index 8a7d21d8d..73bb161f5 100644 --- a/joins.qmd +++ b/joins.qmd @@ -111,7 +111,7 @@ knitr::include_graphics("diagrams/relational.png", dpi = 270) You'll notice a nice feature in the design of these keys: the primary and foreign keys almost always have the same names, which, as you'll see shortly, will make your joining life much easier. It's also worth noting the opposite relationship: almost every variable name used in multiple tables has the same meaning in each place. -There's only one exception: `year` means year of departure in `flights` and year of manufacturer in `planes`. +There's only one exception: `year` means year of departure in `flights` and year manufactured in `planes`. This will become important when we start actually joining tables together. ### Checking primary keys diff --git a/numbers.qmd b/numbers.qmd index cc5e1aad4..8c10dfae3 100644 --- a/numbers.qmd +++ b/numbers.qmd @@ -449,7 +449,7 @@ df |> ### Offsets -`dplyr::lead()` and `dplyr::lag()` allow you to refer the values just before or just after the "current" value. +`dplyr::lead()` and `dplyr::lag()` allow you to refer to the values just before or just after the "current" value. They return a vector of the same length as the input, padded with `NA`s at the start or end: ```{r} @@ -475,7 +475,7 @@ You can lead or lag by more than one position by using the second argument, `n`. ### Consecutive identifiers Sometimes you want to start a new group every time some event occurs. -For example, when you're looking at website data, it's common to want to break up events into sessions, where you begin a new session after gap of more than `x` minutes since the last activity. +For example, when you're looking at website data, it's common to want to break up events into sessions, where you begin a new session after a gap of more than `x` minutes since the last activity. For example, imagine you have the times when someone visited a website: ```{r} @@ -573,7 +573,7 @@ Here is a selection that you might find useful. So far, we've mostly used `mean()` to summarize the center of a vector of values. As we've seen in @sec-sample-size, because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. -An alternative is to use the `median()`, which finds a value that lies in the "middle" of the vector, i.e. 50% of the values is above it and 50% are below it. +An alternative is to use the `median()`, which finds a value that lies in the "middle" of the vector, i.e. 50% of the values are above it and 50% are below it. Depending on the shape of the distribution of the variable you're interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median. diff --git a/program.qmd b/program.qmd index faabd2a1f..a517efc06 100644 --- a/program.qmd +++ b/program.qmd @@ -49,4 +49,4 @@ The goal of these chapters is to teach you the minimum about programming that yo Once you have mastered the material here, we strongly recommend that you continue to invest in your programming skills. We've written two books that you might find helpful. [*Hands on Programming with R*](https://rstudio-education.github.io/hopr/), by Garrett Grolemund, is an introduction to R as a programming language and is a great place to start if R is your first programming language. 
-[*Advanced R*](https://adv-r.hadley.nz/) by Hadley Wickham dives into the details of R the programming language; it's great place to start if you have existing programming experience and great next step once you've internalized the ideas in these chapters. +[*Advanced R*](https://adv-r.hadley.nz/) by Hadley Wickham dives into the details of R the programming language; it's a great place to start if you have existing programming experience and a great next step once you've internalized the ideas in these chapters. diff --git a/spreadsheets.qmd b/spreadsheets.qmd index 3d42f18c6..25db5a0ee 100644 --- a/spreadsheets.qmd +++ b/spreadsheets.qmd @@ -46,7 +46,7 @@ For the rest of the chapter we will focus on using `read_excel()`. ### Reading Excel spreadsheets {#sec-reading-spreadsheets-excel} -@fig-students-excel shows what the spreadsheet we're going to read into R looks like in Excel. This spreadsheet can be downloaded an Excel file from . +@fig-students-excel shows what the spreadsheet we're going to read into R looks like in Excel. This spreadsheet can be downloaded as an Excel file from . ```{r} #| label: fig-students-excel @@ -342,7 +342,7 @@ bake_sale <- tibble( bake_sale ``` -You can write data back to disk as an Excel file using the `write_xlsx()` from the [writexl package](https://docs.ropensci.org/writexl/): +You can write data back to disk as an Excel file using the `write_xlsx()` function from the [writexl package](https://docs.ropensci.org/writexl/): ```{r} #| eval: false @@ -359,7 +359,7 @@ These can be turned off by setting `col_names` and `format_headers` arguments to #| echo: false #| fig-width: 5 #| fig-cap: | -#| Spreadsheet called bake_sale.xlsx in Excel. +#| Spreadsheet called bake-sale.xlsx in Excel. #| fig-alt: | #| Bake sale data frame created earlier in Excel. diff --git a/strings.qmd b/strings.qmd index 0a3fc65fb..4dcde6018 100644 --- a/strings.qmd +++ b/strings.qmd @@ -622,7 +622,7 @@ If you don't already know the code for your language, [Wikipedia](https://en.wik Base R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to English rules by using the "en" locale and requires you to specify the `locale` argument to override it. -Fortunately, there are two sets of functions where the locale really matters: changing case and sorting. +Fortunately, there are only two sets of functions where the locale really matters: changing case and sorting. The rules for changing cases differ among languages. For example, Turkish has two i's: with and without a dot. diff --git a/transform.qmd b/transform.qmd index a7800091d..b56a9d5fa 100644 --- a/transform.qmd +++ b/transform.qmd @@ -13,7 +13,7 @@ In this part of the book, you'll learn about the most important types of variabl #| label: fig-ds-transform #| echo: false #| fig-cap: | -#| The options for data transformation depends heavily on the type of +#| The options for data transformation depend heavily on the type of #| data involved, the subject of this part of the book. #| fig-alt: | #| Our data science model, with transform highlighted in blue. 
diff --git a/workflow-scripts.qmd b/workflow-scripts.qmd index d726e9b80..d5cbc8f05 100644 --- a/workflow-scripts.qmd +++ b/workflow-scripts.qmd @@ -153,7 +153,7 @@ report-2022-04-02.qmd report-draft-notes.txt ``` -Numbering the key scripts make it obvious in which order to run them and a consistent naming scheme makes it easier to see what varies. +Numbering the key scripts makes it obvious in which order to run them and a consistent naming scheme makes it easier to see what varies. Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and `temp` is renamed to `report-draft-notes` to better describe its contents. If you have a lot of files in a directory, taking organization one step further and placing different types of files (scripts, figures, etc.) in different directories is recommended. From d6c3daa29f298be5ca849cf4ac8067fe00c2e07c Mon Sep 17 00:00:00 2001 From: Matthew Vine <32849887+MattTheCuber@users.noreply.github.com> Date: Thu, 23 Jan 2025 18:14:05 -0500 Subject: [PATCH 41/43] Update index.qmd (#1702) --- index.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/index.qmd b/index.qmd index 5880402ca..744757395 100644 --- a/index.qmd +++ b/index.qmd @@ -11,7 +11,7 @@ You'll also learn how to manage cognitive resources to facilitate discoveries wh This website is and will always be free, licensed under the [CC BY-NC-ND 3.0](https://creativecommons.org/licenses/by-nc-nd/3.0/us/) License. If you'd like a physical copy of the book, you can order it on [Amazon](https://www.amazon.com/dp/1492097403?&tag=hadlwick-20). -If you appreciate reading the book for free and would like to give back, please make a donation to [Kākāpō Recovery](https://www.doc.govt.nz/kakapo-donate): the [kākāpō](https://www.youtube.com/watch?v=9T1vfsHYiKY) (which appears on the cover of R4DS) is a critically endangered parrot native to New Zealand; there are only 248 left. +If you appreciate reading the book for free and would like to give back, please make a donation to [Kākāpō Recovery](https://www.doc.govt.nz/kakapo-donate): the [kākāpō](https://www.youtube.com/watch?v=9T1vfsHYiKY) (which appears on the cover of R4DS) is a critically endangered parrot native to New Zealand; there are only 244 left. 
If you speak another language, you might be interested in the freely available translations of the 1st edition: From be437871029e1244175f8ccab8f2c92833f42fa6 Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Tue, 18 Feb 2025 11:39:47 -0600 Subject: [PATCH 42/43] Deploy to GitHub pages --- .github/workflows/build_book.yaml | 40 ++++++++++++++++++------------- 1 file changed, 23 insertions(+), 17 deletions(-) diff --git a/.github/workflows/build_book.yaml b/.github/workflows/build_book.yaml index 43ce969b3..9cd2ba6f8 100644 --- a/.github/workflows/build_book.yaml +++ b/.github/workflows/build_book.yaml @@ -7,9 +7,9 @@ on: workflow_dispatch: schedule: # run every day at 11 PM - - cron: '0 23 * * *' + - cron: "0 23 * * *" -name: Render and deploy Book to Netlify +name: build_book.yaml env: isExtPR: ${{ github.event.pull_request.head.repo.fork == true }} @@ -42,19 +42,25 @@ jobs: run: | quarto render - - name: Deploy to Netlify - if: contains(env.isExtPR, 'false') - id: netlify-deploy - uses: nwtgck/actions-netlify@v1.1 + - name: Upload website artifact + if: ${{ github.ref == 'refs/heads/main' || github.ref == 'refs/heads/master' }} + uses: actions/upload-pages-artifact@v3 with: - publish-dir: './_book' - production-branch: main - github-token: ${{ secrets.GITHUB_TOKEN }} - deploy-message: - 'Deploy from GHA: ${{ github.event.pull_request.title || github.event.head_commit.message }} (${{ github.sha }})' - enable-pull-request-comment: false - enable-commit-comment: false - env: - NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }} - NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID_2E }} - timeout-minutes: 1 + path: "_book" + + deploy: + needs: build + + permissions: + pages: write # to deploy to Pages + id-token: write # to verify the deployment originates from an appropriate source + + environment: + name: github-pages + url: ${{ steps.deployment.outputs.page_url }} + + runs-on: ubuntu-latest + steps: + - name: Deploy to GitHub Pages + id: deployment + uses: actions/deploy-pages@v4 From f42667408196999d063d2c6335e24cdeaaf333fd Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Tue, 18 Feb 2025 11:41:44 -0600 Subject: [PATCH 43/43] Fix name --- .github/workflows/build_book.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/build_book.yaml b/.github/workflows/build_book.yaml index 9cd2ba6f8..dd808ca0f 100644 --- a/.github/workflows/build_book.yaml +++ b/.github/workflows/build_book.yaml @@ -16,7 +16,7 @@ env: RUST_BACKTRACE: 1 jobs: - build-deploy: + build: runs-on: ubuntu-latest env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}