|
| 1 | +--- |
| 2 | +title: "block-3-worksheet" |
| 3 | +output: html_document |
| 4 | +--- |
| 5 | + |
| 6 | +```{r setup, include=FALSE} |
| 7 | +knitr::opts_chunk$set(echo = TRUE) |
| 8 | +# packages |
| 9 | +library(tidyverse) |
| 10 | +library(cowplot) |
| 11 | +# pull out the storms data |
| 12 | +storms <- read_csv("../data_csv/STORMS.CSV") |
| 13 | +``` |
| 14 | + |
| 15 | +### Alternative to histograms | visualising 'small' data |
| 16 | + |
| 17 | +This section picks up from where we left off in the first block 3 practical. We ended that practical by looking at histograms. These are a standard tool for visualising the distribution of numeric variables. If you are still confused about what terms like 'distribution' and 'variable' mean, now is a good time to read through the short chapter on [Exploratory data analysis](https://dzchilds.github.io/eda-for-bio/exploratory-data-analysis.html). A different chapter, [Exploring one numeric variable](https://dzchilds.github.io/eda-for-bio/exploring-one-numeric-variable.html), summarises how to plot histograms and dot plots. The latter are the subject of our first exercise... |
| 18 | + |
| 19 | +#### Dot plots |
| 20 | + |
| 21 | +Histograms are good for visualising sample distributions when we have a reasonable sample size (at least dozens, and ideally, hundreds of observations). They are less effective when the sample in hand is small. In this ‘small data’ situation it’s better to use something called a dot plot. These are useful for exploring the distribution of variables when we have relatively few observations (typically < 100). Here is an example of a dot plot using the iris data set to show the distribution of sepal length for just the setosa species: |
| 22 | + |
| 23 | +```{r, fig.align='center', fig.width=4, fig.height=4} |
| 24 | +# step 1. obtain just the setosa cases |
| 25 | +setosa <- filter(iris, Species == "setosa") |
| 26 | +# setp 2. make a dot plot of sepal length |
| 27 | +ggplot(setosa, aes(x = Sepal.Length)) + |
| 28 | + geom_dotplot(binwidth = 0.1) |
| 29 | +``` |
| 30 | + |
| 31 | +Each observation in the data adds one dot. A dot plot is based on binning the data, just as with a histogram, but here the dots that fall into the same bin are stacked up on top of one another. |
| 32 | + |
| 33 | +#### Exercise: |
| 34 | + |
| 35 | +One of the oddities of dot plots produce by `ggplot2` is that the y axis scale isn't really meaningful---the positions and counts of stacked dots give us all the information we need to understand a distribution. The fact that a y scale is produced reflects a (rare) limitation of `ggplot2`. It would be better if the information about the y axis were not shown at all. |
| 36 | +See if you can make the dot plot of the sepal length variable for the _Setosa_ species. However, this time remove the y axis labels and the grid lines. You have not been shown how to do this yet so this will require a bit of detective work. Here's a hint: |
| 37 | + |
| 38 | +1. Look at the examples in the help file for `geom_dotplot` to work out what to do with `scale_y_continuous`. In particular, read the code comments (`#`) in the 'Examples' section at the end. |
| 39 | + |
| 40 | +2. Experiment with the options presented by RStudio after you type `theme_`. Basically, you need to find a them that will remove the axes. Just try the different ones out until you land on one that works. |
| 41 | + |
| 42 | +A code outline and an example of the output you're aiming for is given below. The `<????>` are place holders that show the bits to complete. |
| 43 | + |
| 44 | +```{r, eval=FALSE, fig.align="center"} |
| 45 | +setosa <- filter(iris, Species == "setosa") |
| 46 | +ggplot(setosa, aes(x = Sepal.Length)) + |
| 47 | + geom_dotplot(binwidth = 0.1) + |
| 48 | + scale_y_continuous( <????> ) + |
| 49 | + theme_<????>() |
| 50 | +``` |
| 51 | + |
| 52 | +Here is the plot you are aiming for: |
| 53 | + |
| 54 | +```{r, echo=FALSE, fig.align='center', fig.width=4, fig.height=4} |
| 55 | +setosa <- filter(iris, Species == "setosa") |
| 56 | +ggplot(setosa, aes(x = Sepal.Length)) + |
| 57 | + geom_dotplot(binwidth = 0.1) + |
| 58 | + scale_y_continuous(NULL, breaks = NULL) + # <- remove the y-axis |
| 59 | + theme_classic() # <- remove the grid lines |
| 60 | +``` |
| 61 | + |
| 62 | +What is the purpose of this exercise? Partly we want to make sure you can produce a 'nice' looking dot plot. The more important goal is to briefly demonstrate that a lot of customisation is possible with **ggplot2**. You can read more about this in the [Doing more with ggplot2 chapter](https://dzchilds.github.io/eda-for-bio/doing-more-with-ggplot2.html) |
| 63 | + |
| 64 | +### Boxplots | Relationships between categorical and continuous data |
| 65 | + |
| 66 | +Box and whisker plots (aka 'box plots') can also be used to summarise the distribution of a variable, but in a compact way that allows us to summarise its distribution in different groups. The groups in this scenario are defined by the levels of a categorical variable, meaning that this type of plot can be thought of as describing the relationship between a categorical (x axis) and numeric variable (y axis). Here is an example: |
| 67 | + |
| 68 | +```{r, fig.width=3, eval=TRUE, echo=FALSE, fig.align="center"} |
| 69 | +ggplot(iris, aes(x = Species, y = Petal.Length/Petal.Width)) + |
| 70 | + geom_boxplot() + |
| 71 | + labs(x = "Species", y = "Eccentricty") + |
| 72 | + theme_minimal(base_size = 14) |
| 73 | +``` |
| 74 | + |
| 75 | +Each box-and-whisker shows the group median (line) and the interquartile range ("boxes"). The vertical lines ("whiskers") highlight the range of the rest of the data in each group. Potential outliers are plotted individually. The details can be found in the [Relationships between two variables](https://dzchilds.github.io/eda-for-bio/relationships-between-two-variables.html#graphical-summaries-3) chapter. |
| 76 | + |
| 77 | +You can probably guess which `geom_` function we have to use to make a boxplot (yes, it's `geom_boxplot`)... Here's the R code that made the plot above: |
| 78 | + |
| 79 | +```{r, fig.width=3, eval=FALSE, fig.align="center"} |
| 80 | +ggplot(iris, aes(x = Species, y = Petal.Length/Petal.Width)) + |
| 81 | + geom_boxplot() + |
| 82 | + labs(x = "Species", y = "Eccentricty") + |
| 83 | + theme_minimal(base_size = 14) |
| 84 | +``` |
| 85 | + |
| 86 | +<BR> |
| 87 | + |
| 88 | +#### Exercise: |
| 89 | + |
| 90 | +Working with the `storms` dataset, construct a box and whiskers plot to summarise the distribution of wind speed for each type of storm. See if you can... |
| 91 | + |
| 92 | +1. customise the fill colour of the boxes, |
| 93 | + |
| 94 | +2. get rid of the grey in the plot background (hint: use a `theme_`), |
| 95 | + |
| 96 | +3. increase the size of the text on the graph. |
| 97 | + |
| 98 | +This is what you are aiming for: |
| 99 | + |
| 100 | +```{r, fig.width=5, echo=FALSE, fig.align="center"} |
| 101 | +ggplot(storms, aes(x = type, y = wind)) + |
| 102 | + geom_boxplot(fill= "lightgrey") + |
| 103 | + labs(x = "Type of storm", y = "Wind Speed (mph)") + |
| 104 | + theme_classic(base_size = 14) |
| 105 | +``` |
| 106 | + |
| 107 | +That's OK but the labels are a it of a mess. See if you can use the `coord_flip` function to make this version (much nicer): |
| 108 | + |
| 109 | +```{r, fig.width=5, fig.height=2, echo=FALSE, fig.align="center"} |
| 110 | +ggplot(storms, aes(x = type, y = wind)) + |
| 111 | + geom_boxplot(fill= "lightgrey") + |
| 112 | + labs(x = "Type of storm", y = "Wind Speed (mph)") + |
| 113 | + theme_classic(base_size = 14) + coord_flip() |
| 114 | +``` |
| 115 | + |
| 116 | + |
| 117 | +### Saving your plots |
| 118 | + |
| 119 | +There are lots of different ways to save plots in R. One is to save it after you make it using the 'Export' facility in the RStudio plot window. However, this tends not to produce very nice looking results. It's better, and more reproducible, to use the `ggsave` function. We can use this in two ways. |
| 120 | + |
| 121 | +The first 'adds' it to the ggplot code using the `+` when we're building our plot, like this: |
| 122 | + |
| 123 | +```{r, eval=FALSE} |
| 124 | +# version 1. |
| 125 | +ggplot(setosa, aes(x = Sepal.Length)) + |
| 126 | + geom_dotplot(binwidth = 0.1) + |
| 127 | + ggsave("Sepal_dotplot.pdf") # <- use ggsave as part of a ggplot construct |
| 128 | +``` |
| 129 | + |
| 130 | +This saves the plot directly to the working directory without first displaying it in RStudio. The second way of doing it uses the `ggsave` function on its own **after** we make the figure, like this: |
| 131 | + |
| 132 | +```{r, eval=FALSE} |
| 133 | +# version 2. |
| 134 | +ggplot(setosa, aes(x = Sepal.Length)) + |
| 135 | + geom_dotplot(binwidth = 0.1) |
| 136 | +
|
| 137 | +# use ggsave on its own *after* making the figure |
| 138 | +ggsave("Sepal_dotplot.pdf") |
| 139 | +``` |
| 140 | + |
| 141 | +This displaying the plot in RStudio and then saves it to the working directory without first. |
| 142 | + |
| 143 | +#### Exercise: Saving plots |
| 144 | + |
| 145 | +Use `ggsave` to save the box and whiskers plot you made in the previous exercise to a PDF file. Look at the help file for `ggsave` and then see if you can answer these questions: |
| 146 | + |
| 147 | +1. Can you work out where R has saved your plot to (i.e. which folder on your computer)? |
| 148 | + |
| 149 | +2. Can you change the dimensions of the saved plot so that these are 4 inches x 4 inches? |
| 150 | + |
| 151 | +Saving plots so that they look good in the saved version usually involves a bit of trial and error with the figure dimensions. Don't expect to get it right first time. |
| 152 | + |
| 153 | +### Bar plots | Summary statistics for groups |
| 154 | + |
| 155 | +We use bar plots to summarise a **summary statistics** in different groups. Often, this number is the mean a numeric variable but it doesn't have to be. It is tempting to use `geom_bar` to make bar plots (and seems logical). However, `geom_bar` will simply count the number of observations by default. For example, we can show the number of observations associated with each year in the storms data set using: |
| 156 | + |
| 157 | +```{r, fig.height=3.0, fig.width=3.0, fig.align="center"} |
| 158 | +ggplot(storms, aes(x = factor(year))) + # <- notice use of 'factor' function |
| 159 | + geom_bar() + |
| 160 | + labs(x = "Year", y = "Number of Storms") |
| 161 | +``` |
| 162 | + |
| 163 | +This is basically a plot showing sample size for each year --- this behaviour is mostly a historic accident and it's not hugely useful. |
| 164 | + |
| 165 | +By the way, notice that we used the `factor` function on `year` in this example. This was to convert a numeric variable to a categorical variable (a factor, in R-land). Whether the x variable is numeric or categorical affects the labelling. To understand what we mean, look at what we get if we run the code without converting `year` to a factor (look at the x axis labels). |
| 166 | + |
| 167 | +```{r, echo=FALSE, fig.height=3.0, fig.width=3.0, fig.align="center"} |
| 168 | +ggplot(storms, aes(x = year)) + # <- notice use of 'factor' function |
| 169 | + geom_bar() + |
| 170 | + labs(x = "Year", y = "Number of Storms") |
| 171 | +``` |
| 172 | + |
| 173 | + |
| 174 | +Usually we use a bar plot to compare **summary statistics** (e.g. the mean). We can do this as a two-step process with **dplyr** and **ggplot2**. First, we need to calculate the summary statistic we want to display. Do this with the `group_by` and `summarise` functions from the **dplyr** package. For example, we can calculate the mean petal length for each species in `iris` like this: |
| 175 | + |
| 176 | +```{r} |
| 177 | +# step 1 |
| 178 | +pl_stats <- |
| 179 | + iris %>% |
| 180 | + group_by(Species) %>% |
| 181 | + summarise(mean_pl = mean(Petal.Length)) |
| 182 | +# take a look at what we made |
| 183 | +pl_stats |
| 184 | +``` |
| 185 | + |
| 186 | +Once we have the information we need we are ready to make the plot. That lives in the `pl_stats` data frame in our `iris` example. To make a bar plot to compare summary statistics across groups it is easiest to use `geom_col` (we can use `geom_bar` but that requires more work). Using `geom_col` is just like using any other `geom_` function: |
| 187 | + |
| 188 | +```{r, fig.height=3.0, fig.width=2.5, fig.align='center'} |
| 189 | +# step 2 |
| 190 | +ggplot(pl_stats, aes(x = Species, y = mean_pl)) + |
| 191 | + geom_col() + |
| 192 | + labs(y = "Mean Petal Length (cm)") |
| 193 | +``` |
| 194 | + |
| 195 | +#### Exercise: Making a barplot of means |
| 196 | + |
| 197 | +Working with the `storms` dataset, construct a bar plot that summarises the mean wind speed (`wind`) in each year (`year`). Once you've made the plot see if you can change the colour of the bars to dark grey. The plot you're aiming for looks like this: |
| 198 | + |
| 199 | +```{r, echo=FALSE, fig.height=3.0, fig.width=3.5, fig.align='center'} |
| 200 | +# step 1 - use dplyr to calculate the means |
| 201 | +wind.means <- |
| 202 | + storms %>% group_by(year) %>% |
| 203 | + summarise(wind = mean(wind)) |
| 204 | +# step 2 - make the plot |
| 205 | +ggplot(wind.means, aes(x = factor(year), y = wind)) + |
| 206 | + geom_col(fill="darkgrey") + |
| 207 | + labs(x = "Year", y = "Wind speed (mph)") |
| 208 | +``` |
| 209 | + |
| 210 | +### Making more complicated plots: multiple layers |
| 211 | + |
| 212 | +Sometimes one geom isn't enough to get all the information we want to display onto a plot. We can build more complex figures by adding more than one layer via the `geom_` functions. For example, if we show the world a set of estimated means, we should also try to include an error bar to summarise how precise the estimates are: |
| 213 | + |
| 214 | +```{r, echo=FALSE, fig.height=3.0, fig.width=2.5, fig.align='center'} |
| 215 | +# step 1 |
| 216 | +pl_stats <- |
| 217 | + iris %>% |
| 218 | + group_by(Species) %>% |
| 219 | + summarise(mean_pl = mean(Petal.Length), |
| 220 | + se = sd(Petal.Length) / sqrt(n())) # <- New calculation |
| 221 | +# step 2 |
| 222 | +ggplot(pl_stats, |
| 223 | + aes(x = Species, y = mean_pl, |
| 224 | + ymin = mean_pl - se, ymax = mean_pl + se)) + |
| 225 | + geom_col(fill = "grey", width = 0.7) + |
| 226 | + geom_errorbar(width = 0.25) + |
| 227 | + labs(y = "Mean Petal Length") |
| 228 | +``` |
| 229 | + |
| 230 | +Each error bar in that bar plot shows the mean +/- one __standard error__. We will learn about standard errors next year. For now, just accept this is an acceptable way to summarise the uncertainty in the estimate. The standard error of the mean is can be calculated like this: |
| 231 | + |
| 232 | +$$ |
| 233 | +\text{Standard Error} = \frac{\text{Standard Deviation}}{\sqrt{\text{Sample Size}}} |
| 234 | +$$ |
| 235 | +That is, it is equal to something called the standard deviation divided by the square root of the sample size. We are going to need to calculate this to make the plot above. We need to adapt the **dplyr** code above to do this by including a calculation of the standard errors along with the means: |
| 236 | + |
| 237 | +```{r, eval=FALSE} |
| 238 | +# step 1 |
| 239 | +pl_stats <- |
| 240 | + iris %>% |
| 241 | + group_by(Species) %>% |
| 242 | + summarise(mean_pl = mean(Petal.Length), |
| 243 | + se = sd(Petal.Length) / sqrt(n())) # <- New calculation |
| 244 | +``` |
| 245 | + |
| 246 | +This uses two new function: `sd()` calculates the standard deviation and `n()` calculate sample size (it's a weird, one-letter function). Here's what this makes: |
| 247 | + |
| 248 | +```{r, echo=FALSE} |
| 249 | +pl_stats |
| 250 | +``` |
| 251 | + |
| 252 | +Once we have all the information we need in one place we are ready to make the new plot that includes the error bars. This time we have to use two `geom_` functions together: `geom_col` to add the bars and `geom_errorbar` to add the error bars. |
| 253 | + |
| 254 | +However that's not enough. `geom_errorbar` will also expect two new aesthetic mappings to be set up: `ymin` and `ymax`. These define the lower and upper ends of the bars. To make the example above, these need to be the 'Mean - SE' and 'Mean + S.E.'. Putting these ideas together, we get to the following bit of R code to make the plot: |
| 255 | + |
| 256 | +```{r, echo=TRUE, eval=FALSE} |
| 257 | +# step 2 |
| 258 | +ggplot(pl_stats, |
| 259 | + aes(x = Species, y = mean_pl, |
| 260 | + ymin = mean_pl - se, ymax = mean_pl + se)) + |
| 261 | + geom_col(fill = "grey", width = 0.7) + |
| 262 | + geom_errorbar(width = 0.25) + |
| 263 | + labs(y = "Mean Petal Length (cm)") + theme_minimal() |
| 264 | +``` |
| 265 | + |
| 266 | +Notice that we calculated the `ymin` and `ymax` inside the `aes` bit, i.e. `ymin = mean_pl - se` and `ymax = mean_pl + se`. |
| 267 | + |
| 268 | +#### Exercise: Adding error bars to your plot |
| 269 | + |
| 270 | +Return to the bar plot you just made using the `storms` data set and add error bars showing the standard errors of wind speed. The resulting plot should look like this: |
| 271 | + |
| 272 | +```{r, echo=FALSE, fig.height=3.0, fig.width=3.5, fig.align='center'} |
| 273 | +# step 1 - use dplyr to calculate the means |
| 274 | +wind.means <- |
| 275 | + storms %>% |
| 276 | + group_by(year) %>% |
| 277 | + summarise(mean = mean(wind), |
| 278 | + se = sd(wind) / sqrt(n())) |
| 279 | +# step 2 - make the plot |
| 280 | +ggplot(wind.means, aes(x = factor(year), y = mean, |
| 281 | + ymin = mean - se, ymax = mean + se)) + |
| 282 | + geom_col(fill="darkgrey") + |
| 283 | + geom_errorbar(width = 0.25) + |
| 284 | + labs(x = "Year", y = "Wind speed (mph)") |
| 285 | +``` |
| 286 | + |
| 287 | +#### Final exercise: Putting it all together (HARDER!) |
| 288 | + |
| 289 | +R has a built in data set in called `ChickWeight`. Take a look at this with `View`, `glimpse`, etc to make sure you understand what variables it contains. |
| 290 | + |
| 291 | +Now try to make two plots below. Think about a) what these two graphs tell you about the effectiveness of the four diets and b) what other information it might be useful to include. |
| 292 | + |
| 293 | +```{r, echo = FALSE, fig.width=6, fig.height=3, fig.align='center'} |
| 294 | +
|
| 295 | +pltdata<- group_by(ChickWeight, Time, Diet) %>% |
| 296 | + summarise(mn = mean(weight), se = sd(weight)/sqrt(n())) |
| 297 | +pltdata2 <- ungroup(pltdata) %>% |
| 298 | + filter(Time==max(Time)) |
| 299 | +
|
| 300 | +plta <- ggplot(pltdata, aes(x=Time, y = mn, colour = Diet)) + |
| 301 | + geom_point() + |
| 302 | + geom_line() + |
| 303 | + theme_classic() + |
| 304 | + labs(y = "Mean weight (g)", x = "Time (days)") |
| 305 | +
|
| 306 | +pltb <- ggplot(pltdata2, aes(x=Diet, y = mn, ymin = mn-se, ymax = mn+se)) + |
| 307 | + geom_col(fill = 'cornflowerblue', colour = "black") + |
| 308 | + geom_errorbar(width = 0.3) + |
| 309 | + labs(y = "Final weight (g)") + |
| 310 | + theme_classic() |
| 311 | +
|
| 312 | +plot_grid(plta, pltb, rel_widths = c(1, 0.9)) |
| 313 | +``` |
| 314 | + |
| 315 | +<BR><BR><BR> |
| 316 | + |
| 317 | +**You're done! Congratulations on finishing the course!** |
0 commit comments