---
knit: bookdown::preview_chapter
---
# Data Visualization
This chapter should have
- explanation of the grammar of graphics and how this fits in with tidy data and statistical thinking
- what you can learn about a data set by looking at it in different ways
- cognitive principles
- visual inference
- adding interactivity
## A grammar of graphics
### What is a data plot?
- data
- **aesthetics: mapping of variables to graphical elements**
- geom: type of plot structure to use
- transformations: log scale, ...
- layers: multiple geoms, multiple data sets, annotation
- facets: show subsets in different plots
- themes: modifying style
### Why?
With the grammar, a data plot becomes a statistic. It is a functional mapping from variable to graphical element. Then we can do statistics on data plots.
With a grammar, we don't have individual animals in the zoo, we have the genetic code that says how one plot is related to another plot.
### Elements of the grammar
```
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
```
The 7 key elements of the grammar are:
- DATA
- GEOM_FUNCTION
- MAPPINGS
- STAT
- POSITION
- COORDINATE_FUNCTION
- FACET_FUNCTION
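As a minimal sketch of how the template reads once every slot is filled in, the following uses the `mpg` data set that ships with ggplot2 (an assumption for illustration; it is not the data used in this chapter). The stat, position and coordinate system shown are simply the defaults that `geom_bar` would use anyway, written out explicitly.

```r
library(ggplot2)

# Every slot of the template filled in explicitly.
# `mpg` is a stand-in data set bundled with ggplot2.
p_template <- ggplot(data = mpg) +            # DATA
  geom_bar(                                   # GEOM_FUNCTION
    mapping = aes(x = class, fill = drv),     # MAPPINGS
    stat = "count",                           # STAT (the geom_bar default)
    position = "stack"                        # POSITION (the geom_bar default)
  ) +
  coord_cartesian() +                         # COORDINATE_FUNCTION (the default)
  facet_wrap(~ year)                          # FACET_FUNCTION
p_template
```

Changing any one slot while holding the others fixed gives a different, but related, plot, which is the idea the examples in the next section work through.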
## Example: Tuberculosis data
Data on tuberculosis incidence can be found at the [World Health Organisation](http://www.who.int/tb/country/data/download/en/). The case notifications table has been downloaded. We will look at the incidence in Australia. The steps to organising the data are:
1. Select the variables, country name and iso3 code, year and the incidence columns as measured by a sputum test. The incidence columns also encode age and gender.
2. Gather the data into long form, so that gender and age can be put in separate columns.
3. Keep only the records for Australia, for the years 1997-2012. We also remove values for under 15 because the coding of these age groups is contradictory.
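```{r}
library(tidyverse)
# Read the WHO case notifications and reshape to one row per
# country, year, gender and age group
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
  select(country, iso3, year, new_sp_m04:new_sp_fu) %>%
  gather(stuff, count, new_sp_m04:new_sp_fu) %>%
  # Column names look like new_sp_m04: keep only the last piece
  separate(stuff, c("stuff1", "stuff2", "genderage")) %>%
  select(-stuff1, -stuff2) %>%
  # First character is gender, the rest is the age group
  mutate(gender=substr(genderage, 1, 1),
         age=substr(genderage, 2, nchar(genderage))) %>%
  select(-genderage)
# Australia only, ages 15 and over, 1997-2012
tb_au <- tb %>%
  filter(country == "Australia") %>%
  filter(!(age %in% c("04", "014", "514", "u"))) %>%
  filter(year > 1996, year < 2013)
```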
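The reshaping steps can be tried on a small table before running them on the full file. The sketch below is an illustration only: the table `toy` and its counts are made up, and it uses `pivot_longer()`, the newer tidyr (1.0 or later) replacement for `gather()`, rather than the code used in this chapter.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the WHO table: one column per gender/age combination
toy <- tibble(
  country = c("A", "A"),
  year = c(2011, 2012),
  new_sp_m1524 = c(3, 4),
  new_sp_f1524 = c(5, 6)
)

toy_long <- toy %>%
  # Long form: one row per country, year and gender/age code
  pivot_longer(starts_with("new_sp_"),
               names_to = "genderage",
               names_prefix = "new_sp_",
               values_to = "count") %>%
  # First character is gender, the remaining characters are the age group
  mutate(gender = substr(genderage, 1, 1),
         age = substr(genderage, 2, nchar(genderage))) %>%
  select(-genderage)
toy_long
```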
We want to look at this data with lots of plots in different ways, with different mappings, to learn different things.
### 100% charts
```{r echo=TRUE, fig.width=8, fig.height=2}
ggplot(tb_au, aes(x = year, y = count, fill = gender)) +
geom_bar(stat = "identity", position = "fill") +
facet_grid(~ age) +
scale_fill_brewer(palette="Dark2")
```
100% charts is what Excel calls these beasts. What do we learn?
#### Code structure
- Basic plot function is `ggplot`
- First argument provided is the name of the data, `tb_au`
- Variable mapping: year is mapped to x, count is mapped to y, gender is mapped to fill colour, and age is used to subset the data and make separate plots
- The bar geom is used, `geom_bar`
- We have already counted how many TB incidences are in each combination of categories, so `stat = "identity"` says no need to compute the count
- We are mostly interested in proportions between gender, over years, separately by age. The `position = "fill"` option in `geom_bar` sets the heights of the bars to be all at 100%. It ignores counts, and emphasizes the proportion of males and females.
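What `position = "fill"` does can be sketched by hand: within each bar, the counts are rescaled so that they sum to 1. The counts in `toy` below are made up for illustration.

```r
library(dplyr)

# Made-up counts for two years and two genders
toy <- tibble(
  year = c(2011, 2011, 2012, 2012),
  gender = c("f", "m", "f", "m"),
  count = c(10, 30, 20, 20)
)

# Within each year (each bar), convert counts to proportions
toy_prop <- toy %>%
  group_by(year) %>%
  mutate(prop = count / sum(count)) %>%
  ungroup()
toy_prop
```

The total number of cases in each year no longer affects the bar height, which is why this view says nothing about counts.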
#### What do we learn?
- Focus is on **proportion** in each category.
- Across (almost) all ages, and years, the proportion of males having TB is higher than females
- These proportions tend to be higher in the older age groups, for all years.
### Bar charts
```{r echo=TRUE, fig.width=8, fig.height=2}
ggplot(tb_au, aes(x = year, y = count, fill = gender)) +
geom_bar(stat = "identity") +
facet_grid(~ age) +
scale_fill_brewer(palette="Dark2")
```
#### What is different in the code description?
`, position = "fill"` was removed
#### What do we learn?
- Focus is on **counts** in each category.
- Counts differ across ages and years, and tend to be lower in middle age (45-64)
- 1999 saw a bit of an outbreak, in most age groups, with numbers doubling or tripling other years.
- Incidence has been increasing among younger age groups in recent years.
### Side-by-side barcharts
```{r echo=TRUE, fig.width=8, fig.height=2}
ggplot(tb_au, aes(x = year, y = count, fill = gender)) +
geom_bar(stat = "identity", position="dodge") +
facet_grid(~ age) +
scale_fill_brewer(palette="Dark2")
```
#### What is different in the code description?
`, position="dodge"` is used in `geom_bar`
#### What do we learn?
- Focus is on counts by gender, predominantly male incidence.
- Incidence among males is higher relative to females from middle age on. There is similar incidence between males and females in younger age groups.
### Separate bar charts
```{r echo=TRUE, fig.width=8, fig.height=3}
ggplot(tb_au, aes(x = year, y = count, fill = gender)) +
geom_bar(stat = "identity") +
facet_grid(gender ~ age) +
scale_fill_brewer(palette="Dark2")
```
#### What is different in the code description?
`facet_grid(gender ~ age) +` faceted by gender as well as age, note `facet_grid` vs `facet_wrap`
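The difference between the two faceting functions can be seen in a small sketch, here using the `mpg` data bundled with ggplot2 as a stand-in (an assumption for illustration). `facet_grid` lays panels out as rows by columns and keeps a panel for every combination, even an empty one, while `facet_wrap` wraps only the combinations that occur in the data into a ribbon of panels.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Rows are drive type, columns are number of cylinders;
# combinations with no cars still get an (empty) panel
p_grid <- base + facet_grid(drv ~ cyl)

# Only the combinations present in the data get a panel,
# wrapped over as many rows as needed
p_wrap <- base + facet_wrap(~ drv + cyl)

p_grid
p_wrap
```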
#### What do we learn?
- It's easier to focus separately on males and females.
- The 1999 outbreak mostly affected males.
- The growing incidence in the 25-34 age group is still affecting females but seems to have stabilised for males.
### Pie charts?
```{r echo=TRUE, fig.width=8, fig.height=3}
ggplot(tb_au, aes(x = year, y = count, fill = gender)) +
geom_bar(stat = "identity") +
facet_grid(gender ~ age) +
scale_fill_brewer(palette="Dark2") +
coord_polar() +
theme(axis.text = element_blank())
```
Nope! That's a rose chart. Bar charts in polar coordinates produce rose charts.
#### What is different in the code description?
`coord_polar() +` plot is made in polar coordinates, rather than the default Cartesian coordinates
#### What do we learn?
- Emphasizes the middle years as low incidence.
### Rainbow charts?
```{r echo=TRUE, fig.width=8, fig.height=3}
ggplot(tb_au, aes(x = 1, y = count, fill = factor(year))) +
geom_bar(stat = "identity", position="fill") +
facet_grid(gender ~ age)
```
A single stacked bar, in each facet. Year is mapped to colour.
#### What is the code doing?
- Notice how the mappings are different. A single number is mapped to x, that makes a single stacked bar chart.
- year is now mapped to colour (that's what gives us the rainbow charts!)
#### What do we learn?
- Pretty chart but not easy to interpret.
### Pie charts
```{r echo=TRUE, fig.width=8, fig.height=3}
ggplot(tb_au, aes(x = 1, y = count, fill = factor(year))) +
geom_bar(stat = "identity", position="fill") +
facet_grid(gender ~ age) +
coord_polar(theta="y") +
theme(axis.text = element_blank())
```
#### What is different in the code description?
`coord_polar(theta="y")` is using the y variable to do the angles for the polar coordinates to give a pie chart.
#### What do we learn?
- Pretty chart but not easy to interpret, or make comparisons across age groups.
#### Your turn
Focus on a different country and work your way through the plots where we learned the most about the Australia data.
1. Is the incidence similar for this country?
2. Are rates increasing? Or decreasing?
3. Do some age groups experience higher rates?
4. Is it more prevalent among males?
## Variable types and mapping
```{r vartype, echo=FALSE, message=FALSE, warning=FALSE, results='asis'}
vartype <-
"| Type of variable | How to map | Common errors |
|:-----------------|:----------------------------|:--------------|
| Categorical, qualitative | Category + count/proportion displayed, often as an area plot or with a small number of categories mapped to colour or symbol | Not including 0 on the count/proportion axis. Not ordering categories. |
| Quantitative | Position along an axis | Displaying as a bar, especially when showing mean values. Mapping to colour. |
| Date/Time | Time-ordered axis, different temporal resolutions to study long term trend, or seasonal patterns. Lines typically connect measurements to indicate temporal dependence | Time order corrupted |
| Space | Conventional projections of the sphere, map aspect ratio| Wrong aspect ratio |
"
cat(vartype, fill=TRUE)
```
```{r vartype-hux, echo=FALSE, width=120, eval=FALSE}
library(huxtable)
vartype <- hux(
Type = c("Categorical, qualitative", "Quantitative, numeric"),
Mapping = c("Usually summarised by count or proportion, and category + statistic displayed, often as an area plot; or with a small number of categories mapped to colour or symbol", "Position along an axis"),
add_colnames = TRUE
) %>%
set_align(everywhere, everywhere, c("left", "left")) %>%
set_col_width(1, 0.3) %>%
set_bold(value=c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE), byrow=TRUE) %>%
set_wrap(everywhere, 2, TRUE) %>%
set_width(10)
vartype
```
## Coordinate systems
- *Cartesian, polar:* most plots are made in Cartesian coordinates. Just a few are in polar coordinates, primarily the pie chart. Polar coordinates use radius and angle to describe position in 2D space. Occasionally measurements like wind (direction and speed) make sense to be plotted in polar coordinates.
- *fixed, equal:* When variables are made on scales that should be comparable, it may be important to reflect this in the axes limits and page space that the plot takes. (This is different from `theme(aspect.ratio=1)` which sets the physical size of the plot to be the same, or in some ratio.)
- *map:* Maps come in conventional formats, most often with a specific aspect ratio of vertical to horizontal axes, that depends on latitude.
- *flip:* Useful for generating a plot with a categorical variable on the x axis and then flipping it sideways to look at.
```{r}
df <- tibble(x=runif(100), y=runif(100)*10)
ggplot(df, aes(x=x, y=y)) + geom_point() + coord_fixed()
ggplot(df, aes(x=x, y=y)) + geom_point() + coord_equal()
ggplot(df, aes(x=x, y=y)) + geom_point() + coord_fixed(ratio=0.2)
ggplot(df, aes(x=x, y=y)) + geom_point() + theme(aspect.ratio=1)
```
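The *flip* case is not covered by the plots above. A minimal sketch, again using the bundled `mpg` data as a stand-in (an assumption for illustration):

```r
library(ggplot2)

# Build the bar chart with the categorical variable on x, then flip it
# so that the category labels read horizontally
p_flip <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  coord_flip()
p_flip
```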
## Layering
- *Statistical summaries:* It is common to layer plots, particularly by adding statistical summaries, like a model fit, or means and standard deviations. The purpose is to show the **trend** in relation to the **variation**.
- *Maps:* Commonly maps provide the framework for data collected spatially. One layer for the map, and another for the data.
```{r}
df <- tibble(x=runif(100), y1=4*x + rnorm(100),
y2= -x + 10*(x-0.5)^2+ rnorm(100))
ggplot(df, aes(x=x, y=y1)) + geom_point()
ggplot(df, aes(x=x, y=y1)) + geom_point() +
geom_smooth(method="lm", se=FALSE)
ggplot(df, aes(x=x, y=y1)) + geom_point() +
geom_smooth(method="lm")
ggplot(df, aes(x=x, y=y2)) + geom_point()
ggplot(df, aes(x=x, y=y2)) + geom_point() +
geom_smooth(method="lm", se=FALSE)
ggplot(df, aes(x=x, y=y2)) + geom_point() +
geom_smooth(se=FALSE)
```
## Colour palettes
- Qualitative: categorical variables
- Sequential: low to high numeric values
- Diverging: negative to positive values
```{r echo=FALSE, fig.height=7, fig.width=12}
library(RColorBrewer)
display.brewer.all()
```
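Individual palettes of each type can be pulled out with `brewer.pal()`, and `brewer.pal.info` records which type each palette belongs to. The three palettes chosen below are just examples.

```r
library(RColorBrewer)

# One palette of each type, as vectors of hex colours
qual_pal <- brewer.pal(3, "Dark2")   # qualitative
seq_pal  <- brewer.pal(5, "Blues")   # sequential
div_pal  <- brewer.pal(5, "PRGn")    # diverging

# The category column says which type each palette is
brewer.pal.info[c("Dark2", "Blues", "PRGn"), ]
```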
### Choropleth map
```{r echo=TRUE}
# Read the tb data
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
select(country, year, new_sp_m04:new_sp_fu) %>%
gather(stuff, count, new_sp_m04:new_sp_fu) %>%
separate(stuff, c("stuff1", "stuff2", "genderage")) %>%
select(-stuff1, -stuff2) %>%
mutate(gender=substr(genderage, 1, 1),
age=substr(genderage, 2, nchar(genderage))) %>%
select(-genderage)
# Compute relative difference between 2012 and 2002
tb_inc <- tb %>%
filter(year %in% c(2002, 2012)) %>%
group_by(country, year) %>%
summarise(count = sum(count, na.rm=TRUE)) %>%
spread(year, count) %>%
mutate(reldif = ifelse(`2002` == 0, 0, (`2012`-`2002`)/(`2002`))) %>%
ungroup()
# Join with a world map
library(maps)
library(ggthemes)
world_map <- map_data("world")
# Names of countries need to be the same in both data tables
tb_inc <- tb_inc %>%
mutate(country=recode(country,
"United States of America"="USA",
"United Kingdom of Great Britain and Northern Ireland"="UK",
"Russian Federation"="Russia"))
tb_map <- left_join(world_map, tb_inc, by=c("region"="country"))
```
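The relative difference computed above can be checked on a small made-up table. In this sketch, country "A" drops from 200 to 150 cases, and country "B" has no cases in 2002, so the guard against dividing by zero applies.

```r
library(dplyr)
library(tidyr)

# Made-up totals for two countries in the two years being compared
toy <- tibble(
  country = c("A", "A", "B", "B"),
  year = c(2002, 2012, 2002, 2012),
  count = c(200, 150, 0, 10)
)

toy_inc <- toy %>%
  # One column per year, as in the chunk above
  spread(year, count) %>%
  # Change relative to 2002; set to 0 when the 2002 count is 0
  mutate(reldif = ifelse(`2002` == 0, 0, (`2012` - `2002`) / `2002`))
toy_inc
```

Setting `reldif` to 0 when the baseline is 0 hides any increase from zero, as happens for country "B" here, which is worth keeping in mind when reading the maps.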
#### Sequential
Default
```{r}
ggplot(tb_map) +
geom_polygon(aes(x=long, y=lat, group=group, fill=reldif)) +
theme_map()
```
Modified rainbow
```{r}
library(viridis)
ggplot(tb_map) +
geom_polygon(aes(x=long, y=lat, group=group, fill=reldif)) +
theme_map() +
scale_fill_viridis(na.value = "white")
```
#### Diverging
```{r}
ggplot(tb_map) +
geom_polygon(aes(x=long, y=lat, group=group, fill=reldif)) +
theme_map() +
scale_fill_distiller(palette="PRGn", na.value = "white",
limits = c(-7, 7))
```
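For a diverging palette the limits should be symmetric about zero, so that the neutral middle colour lines up with no change. One way to derive such limits from the data, rather than typing them in, is sketched below on made-up values.

```r
library(ggplot2)

# Made-up relative differences, including a missing value
toy <- data.frame(x = 1:4, y = 1, reldif = c(-0.8, 0.1, 6.5, NA))

# Largest absolute value sets both ends of the scale
lim <- max(abs(toy$reldif), na.rm = TRUE)
sym_limits <- c(-lim, lim)

p_div <- ggplot(toy, aes(x = x, y = y, fill = reldif)) +
  geom_tile() +
  scale_fill_distiller(palette = "PRGn", na.value = "white",
                       limits = sym_limits)
p_div
```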
## Colour blindness
- About 8% of men have some form of colorblindness; about 0.5% of women are affected. The most common type of color deficiency causes difficulty distinguishing between red and green.
- There are several colour blind tested palettes: `RColorBrewer` has an associated web site, [colorbrewer2.org](http://colorbrewer2.org), where the palettes are labelled, and the `colorblind` package has palettes which are safe for various types of color deficiency.
<!-- Colorbrewer's colorblind friendly sets are actually pretty awful -->
- You can test your color choices for different forms of colorblindness using the `dichromat` package. Below is the same plot using the default two-colour scheme of ggplot, and what it looks like to a person with red/green colorblindness. As there are many mutations which can lead to different severities of color deficiency, simulations are not 100% accurate.
- A simple, conservative way to check that your color palettes are distinguishable is to print them in black and white. If all colors can still be told apart in greyscale, they differ in lightness, which does not depend on color vision, so the palette should be safe for all types of colorblindness.
```{r fig.show='hold', fig.width=8, fig.height=4}
library(scales)
library(dichromat)
df <- data.frame(x=runif(100), y=runif(100), cl=sample(c(rep("A", 50), rep("B", 50))))
p <- ggplot(data=df, aes(x, y, colour=cl)) + theme_bw() +
  geom_point() + theme(legend.position = "none", aspect.ratio=1)
# Default two-colour ggplot2 palette
p
# The same two colours as simulated for red/green colour deficiency
clrs <- dichromat(hue_pal()(2))
p + scale_colour_manual("", values=clrs)
```
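The black and white check can also be approximated in code rather than on paper. The sketch below converts the default two-colour palette to greyscale using the Rec. 601 luma weights, which is one common choice of weights and an assumption here, not something prescribed by `dichromat` or ggplot2.

```r
library(scales)

# Default ggplot2 colours for two groups
clrs <- hue_pal()(2)

# Rows are red, green, blue; one column per colour
rgb_vals <- col2rgb(clrs)

# Weighted sum of the channels gives an approximate lightness (0-255)
luma <- colSums(rgb_vals * c(0.299, 0.587, 0.114))

# Greys of matching lightness, for use in a plot or swatch
grey_equiv <- grey(luma / 255)
luma
grey_equiv
```

If the `luma` values are close together, the colours will be hard to tell apart once printed in black and white.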
## Pre-attentive
Some visual features are "read" without significant cognitive effort and attention.
Can you find the odd one out?
```{r}
df <- data.frame(x=runif(100), y=runif(100), cl=sample(c(rep("A", 1), rep("B", 99))))
ggplot(data=df, aes(x, y, shape=cl)) + theme_bw() +
geom_point(size = 2) +
scale_shape_manual(values = c(4, 5)) +
theme(legend.position="none", aspect.ratio=1)
```
Is it easier now?
```{r}
ggplot(data=df, aes(x, y, colour=cl)) +
geom_point(size = 2) +
theme_bw() +
theme(legend.position="none", aspect.ratio=1) +
scale_colour_brewer(palette="Dark2")
```
## Proximity
- Basic rule: place the groups that you want to compare close to each other
```{r echo=FALSE}
library(tidyverse)
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
select(country, iso3, year, new_sp_m04:new_sp_fu) %>%
gather(stuff, count, new_sp_m04:new_sp_fu) %>%
separate(stuff, c("stuff1", "stuff2", "genderage")) %>%
select(-stuff1, -stuff2) %>%
mutate(gender=substr(genderage, 1, 1),
age=substr(genderage, 2, nchar(genderage))) %>%
select(-genderage)
tb_au <- tb %>%
filter(country == "Australia") %>%
filter(!(age %in% c("04", "014", "514", "u"))) %>%
filter(year > 1996, year < 2013)
```
Here are two different arrangements of the tb data. To answer the question "Is the incidence similar for males and females in 2012 across age groups?" the first arrangement is better. It puts males and females right beside each other, so the relative heights of the bars can be seen quickly. The answer to the question would be "No, the numbers were similar in youth, but males are more affected with increasing age."
The second arrangement puts the focus on age groups, and is better to answer the question "Is the incidence similar for age groups in 2012, across gender?" To which the answer would be "No, among females, the incidence is higher at early ages. For males, the incidence is much more uniform across age groups."
```{r echo=TRUE, fig.width=8, fig.height=2}
tb_au %>% filter(year == 2012) %>%
ggplot(aes(x = gender, y = count, fill = gender)) +
geom_bar(stat = "identity", position="dodge") +
facet_grid( ~ age) +
scale_fill_brewer(palette="Dark2")
```
```{r echo=TRUE, fig.width=8, fig.height=2}
tb_au %>% filter(year == 2012) %>%
ggplot(aes(x = age, y = count, fill = age)) +
geom_bar(stat = "identity", position="dodge") +
facet_grid( ~ gender) +
scale_fill_brewer(palette="Dark2")
```
## Hierarchy of mappings
1. Position - common scale (BEST): axis system
2. Position - nonaligned scale: boxes in a side-by-side boxplot
3. Length, direction, angle: pie charts, regression lines, wind maps
4. Area: bubble charts
5. Volume, curvature: 3D plots
6. Shading, color (WORST): maps, points coloured by numeric variable
[My crowd-sourcing experiment](http://visiphilia.org/2016/08/03/CM-hierarchy)
Nice explanation by [Peter Aldhous](http://paldhous.github.io/ucb/2016/dataviz/week2.html)
[General plotting advice and a book from Naomi Robbins](https://www.forbes.com/sites/naomirobbins/#2b1e20082a6a)
## Adding interactivity to plots
Interaction on a plot can help de-clutter it, by making labels only show on mouse over. Occasionally it can be useful to zoom into parts of the plot. Often it is useful to change the aspect ratio.
The `plotly` package makes it easy to add interaction to ggplots.
The data
```{r}
library(plotly)
p <- ggplot(tb_au, aes(x = year, y = count,
fill = gender, label = count)) +
geom_bar(stat = "identity", position = "fill") +
facet_grid(~ age) +
ylab("Proportion") +
scale_fill_brewer(palette="Dark2")
ggplotly(p)
```
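Which values appear on mouse over can be restricted with the `tooltip` argument of `ggplotly()`. A minimal sketch on made-up data, showing only the aesthetics named in `tooltip`:

```r
library(ggplot2)
library(plotly)

# Made-up counts, just enough to draw two bars
toy <- data.frame(year = c(2011, 2012),
                  count = c(10, 20),
                  gender = c("f", "m"))

p_toy <- ggplot(toy, aes(x = year, y = count, fill = gender)) +
  geom_bar(stat = "identity")

# Show only year, count and gender in the hover label
ip <- ggplotly(p_toy, tooltip = c("x", "y", "fill"))
ip
```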
```{r echo=FALSE, eval=FALSE}
library(readxl)
passengers <- read_xlsx(here::here("data", "WebAirport_FY_1986-2019.xlsx"), sheet=3, skip=6)
library(tidyverse)
passengers <- passengers %>%
filter(!is.na(AIRPORT)) %>%
select(airport = AIRPORT,
Year, IN_DOM = INBOUND, OUT_DOM = OUTBOUND,
IN_INTL = INBOUND__1,
OUT_INTL = OUTBOUND__1) %>%
filter(!airport %in% "TOTAL AUSTRALIA") %>%
gather(key = "where", value = "amount", IN_DOM:OUT_INTL) %>%
separate(where, into=c("bound", "type_of_flight"))
```
```{r echo=FALSE, eval=FALSE}
library(plotly)
p <- passengers %>%
filter(type_of_flight == "INTL") %>%
spread(key = bound, value = amount) %>%
ggplot() + geom_point(aes(x=IN, y=OUT, label=airport)) +
facet_wrap(~Year, ncol=8) +
coord_equal() +
scale_x_continuous("Incoming passengers (mil)", breaks=seq(0,8000000,2000000), labels=seq(0,8,2)) +
scale_y_continuous("Outgoing passengers (mil)", breaks=seq(0,8000000,2000000), labels=seq(0,8,2))
ggplotly(p)
```
```{r eval=FALSE, echo=FALSE}
# devtools::install_github("ropenscilabs/eechidna")
library(eechidna)
launchApp(
age = c("Age25_34", "Age35_44", "Age55_64"),
religion = c("Christianity", "Catholic", "NoReligion"),
other = c("NotOwned", "Indigenous", "Population")
)
```
## Themes
The `ggthemes` package has many different styles for the plots. Other packages provide further themes and palettes, such as `xkcd`, `skittles`, `wesanderson`, `beyonce`, `ochre`, ....
```{r}
library(xkcd)
ggplot(df, aes(x=x, y=y)) +
geom_point() +
theme_xkcd() +
xkcdaxis(c(0,1), c(0,1)) +
annotate("text", x=0.5, y=0.5, label="Help, I'm lost in here!", family="xkcd", size=5)
```