-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy path02-accessing-data.Rmd
313 lines (235 loc) · 15.5 KB
/
02-accessing-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
---
knit: bookdown::preview_chapter
---
# Accessing Data
<!-- Topics to cover here
lots of ways to read in data: flat file, proprietary, data.table??, databases, scraping, json, shape files, netcdfs, ...
missing values
-->
Data can be delivered in many ways, which makes it necessary to have a multitude of ways to harvest it.
## Flat files
Flat files, like comma or tab separated values (`csv`, `tsv`), or values specified by fixed column widths, can be accessed simply, once the format is determined. This is also a very common format is used for very large data, for example, hadoop, where the data is partitioned into many small chunks overlaid with a system to access it.
### Example: tuberculosis
The World Health Organisation (WHO) distributes data on tuberculosis case notifications through their web site at https://www.who.int/tb/country/data/download/en/. Download the latest case notifications data set, which is provided in `csv` format. A data dictionary is also provided explaining the format of the file, and information recorded. The data can be read into R using the code below:
```{r}
library(tidyverse)
library(DT)
tb <- read_csv(here::here("data", "TB_notifications_2020-01-04.csv"))
datatable(tb %>% top_n(10))
```
This data has 8286 rows and 164 columns. Its not in tidy form yet, so you will see this again in the tidying material to learn how to get a messy data set wrangled. The `read_csv` function provides a `tibble` data object in R. An alternative approach to reading is with the base function `read.csv` which will provide a `data.frame` object.
### Your turn: atmospheric carbon dioxide
Atmospheric carbon dioxide recordings are delivered by the Scripps CO$_2$ program, and available at https://scrippsco2.ucsd.edu/data/atmospheric_co2/sampling_stations.html.
Choose a station and download or directly read the csv file. You will need to skip over a number of lines at the top of the file in order to read only the data.
```{r eval=FALSE, echo=FALSE}
CO2.spo <- read_csv("https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/merged_in_situ_and_flask/daily/daily_merge_co2_spo.csv", col_names=c("date", "time", "day", "decdate", "n", "flg", "co2"), skip=69)
```
### Example: standardised test scores for reading, science and math
Every three years the OECD's Programme for International Student Assessment measures the ability of 15 year old students across the globe in reading, science and mathematics skills. This data is made available at https://www.oecd.org/pisa/. The format that the data is distributed in, is different from survey to survey, with the earliest years being text files, and later years in proprietary formats. The text files have fixed width format, where each variable occupies a fixed set of columns in the text.
Let's take a look at the 2006 student data. Download the file from the OECD web site -- you should find that it is named `INT_Stu06_Dec07.txt`, or work with the subset of the first 2000 rows that is provided here.
```{r}
# This is a previously generated data object of the column widths of variables
load(here::here("data", "PISA_student_2006_varwidths.rda"))
# To work with the full data set, use the downloaded file, INT_Sch06_Dec07.zip
# instead of INT_Stu06_Dec07_25k.txt
d <- read_fwf(file=here::here("data", "INT_Stu06_Dec07_2k.txt"),
col_positions=fwf_widths(var_widths$widths,
col_names=var_widths$names))
datatable(d %>% top_n(10))
```
The`learningtower` package available at https://github.com/ropenscilabs/learningtower contains the data for multiple years and code to process the source files.
### Your turn: land surface weather station data
The [Global Historical Climate Network](https://www.ncdc.noaa.gov/ghcn-daily-description) delivers data from land surface stations across the globe. Records for individual stations are provided using a fixed width format flat file. Find a station near you, and download the file. The list of [stations can be found here](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt). Your station file is identified by the station id. For example, the Melbourne airport station ID is ASN00086282, and thus the data file can be downloaded directly from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/ASN00086282.dly.
Documentation on the file format and how to cite the archive is available at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt. The column widths of the variables can be determined from this information.
1. Work out the column widths for each variable
2. Write the code to read in this fixed width format
```{r eval=FALSE, echo=FALSE}
# Remote address for file
# fl <- "ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/ASN00086282.dly"
# Copy available with book
fl <- "data/ASN00086282.dly"
melb <- read_fwf(fl,
col_positions=fwf_widths(c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31))))
glimpse(melb)
```
The data is far from tidy, and needs considerable work to get it into a suitable format for analysis.
A more friendly and tider `csv` formatted data have been made available recently which can be acessed through the https server on the same location.
```{r eval=FALSE, echo=FALSE}
melb <- read_csv("https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/ASN00086282.csv")
```
## Proprietary formats
Data is often delivered in a specialist formats developed by a company, or individual, that has a special encoding scheme. Examples include excel files, SAS or SPSS files. These require decoding in order to read.
### Example: Australia air traffic data
The [Australian Department of Infrastructure, Transport, Cities and Regional Development](https://www.bitre.gov.au/publications/ongoing/airport_traffic_data) provides excel format files summarising air traffic at airports. Download the most recent data, and open the excel file. There are multiple sheets. The third sheet has passenger numbers. There are 6 rows of meta-information, which need to be ignored when reading the data.
```{r}
library(readxl)
passengers <- read_xlsx(here::here("data", "WebAirport_FY_1986-2019.xlsx"), sheet=3, skip=6)
datatable(passengers)
```
### Your turn: Munging spreadsheets
The book [Spreaadsheet Munging Strategies](https://nacnudus.github.io/spreadsheet-munging-strategies/index.html) by Duncan Garmonsway is a really good source of dealing with complicated excel spreadsheets. We recommend working through examples in this book to familiarise ways to deal with messy spreadsheets, and incorporating information such as special formatting into the data.
The [case study](https://nacnudus.github.io/spreadsheet-munging-strategies/vaccinations.html#vaccinations) developed from Bob Rudis's post on CDC vaccination data is especially recommended.
<!--
### Your turn: New Zealand Census data
StatsNZ makes [tables of data from the five year censuses](http://nzdotstat.stats.govt.nz/) publicly available. Take a look at the 2018 Census data, the population and migration data. You need to expand the cells, and select the levels to use, to have the numbers broken down by age group, sex and ethnicity. (A sample file `NZ_census.xlsx` is provided as an example.) In excel format, the variables and levels of the variables are in the header names, with a twist, a row for each variable, in a multicolumn format. (The R package `tidyxl` has the capacity to deal with multiple headers like this, but requires `xlsx` format. The sample file has been opened and saved in this format, and the first two blank lines in the origial file were also manually removed.)
Luckily, choosing the `csv` format will provide the data in tidy long form.
GIVING UP ON THE XLS FORMAT - ITS JUST REALLY IRREGULAR - AND tidyxl even cannot handle it.
Similarly, TableBuilder now allows csv download which is in long format, so no need to struggle with the xlsx format.
-->
```{r eval=FALSE, echo=FALSE}
library(tidyxl)
library(unpivotr)
NZ_all_cells <-
xlsx_cells(here::here("data", "NZ_census.xlsx")) %>%
dplyr::filter(!is_blank) %>%
select(row, col, data_type, character, numeric) %>%
dplyr::filter(row > 1)
year <-
NZ_all_cells %>%
dplyr::filter(col >= 2, row == 5) %>%
select(row, col, year = character)
ethnic <-
NZ_all_cells %>%
dplyr::filter(col >= 2, row == 4) %>%
select(row, col, ethnic = character)
sex <-
NZ_all_cells %>%
dplyr::filter(col >= 2, row == 3) %>%
select(row, col, sex = character)
age <-
NZ_all_cells %>%
dplyr::filter(col >= 2, row == 2) %>%
select(row, col, age = character)
NZ <- read_xlsx(here::here("data", "NZ_census.xlsx"), skip=5, n_max=1) %>%
gather(v, count, -Area) %>%
filter(!is.na(count)) %>%
select(-v)
NZ <- NZ %>%
mutate(year = year$year) %>%
mutate(ethnic = rep(ethnic$ethnic, c(rep(3, length(ethnic$ethnic)-1), 1))) %>%
mutate(sex = rep(sex$sex, 43))
# NZ <- read_xls(here::here("data", "NZ_census.xls"), skip=2)
# NZ <- read_csv("data/NZ_census.csv")
```
### Example: SPSS binary
The 2015 and 2018 PISA surveys are available in bniary file format, which is a more complete data format in the sense of variable typing and level descriptions integrated with the data. The R package `haven` can be used to read foreign binary formats. Download the 2018 data from https://www.oecd.org/pisa/data/2018database/ (494Mb), and use the following code to read it into R. There is a data dictionary on the same site which describes fields in detail. This is really useful for deciding which variables to keep.
```{r eval=FALSE}
library(haven)
stu_qqq <- read_sav(here::here("data", "SPSS_STU_QQQ.zip"))
PISA_au_2018 <- stu_qqq %>%
filter(CNT == "AUS") %>%
select(CNT, ST004D01T, ST012Q01TA, ST013Q01TA, PV1MATH:PV10SCIE)
# Keep country, gender, num TVs, num books,
saveRDS(PISA_au_2018, file=here::here("data", "PISA_au_2018.rds"))
```
## Reading multiple files from a website
Often multiple datasets are available in a web site. It is tedious, and inefficient to manually download each file and read into R. An example is data on the rental market in Tasmania from [data.gov.au](http://data.gov.au/dataset/rental-bond-and-rental-data-tasmania-2016-to-2017). The `sawfish` package makes it easier to download multiple files automatically. The package is only available on GitHub, and needs too be installed using the `devtools` package. Here is sample code to read all of the `xlsx` files on Tasmania's rental market.
```{r}
library(readxl)
# devtools::install_github("AnthonyEbert/sawfish")
library(sawfish)
url<-"http://data.gov.au/data/dataset/rental-bond-and-rental-data-tasmania-2016-to-2017"
fls <- find_files(url, "xlsx")
# Check that one file can be read nicely
f1 <- tempfile()
download.file(fls[1], f1, mode="wb")
t1 <- read_xlsx(path=f1, sheet=1)
# Now read all files
rentals <- NULL
for (i in 1:length(fls)) {
download.file(fls[i], f1, mode="wb")
t1 <- read_xlsx(path=f1, sheet=1)
rentals <- bind_rows(rentals, t1)
}
datatable(t1 %>% top_n(10))
```
## JSON
With the advent of web communication, comes JavaScript Object Notation (JSON). It is a language-independent data format, and supplants extensible markup language (XML). It is a verbose data descriptor. Here's an example from [wikipedia](https://en.wikipedia.org/wiki/JSON) describing a person:
```
{
"firstName": "John",
"lastName": "Smith",
"isAlive": true,
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers":
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
},
{
"type": "mobile",
"number": "123 456-7890"
}
],
"children": [],
"spouse": null
}
```
### Example: Crossrates
An example we have seen is the cross rates data available at [https://openexchangerates.org/](https://openexchangerates.org/). To access this data you need to:
1. Get a free plan from https://openexchangerates.org/signup/free
2. Tell this function your API key -- `Sys.setenv("OER_KEY" = "your-key-here")`
This code can be used to get the latest 10 days of crossrates.
```{r eval=FALSE}
library(lubridate)
library(jsonlite)
library(glue)
# Function to read in rate for one day, given dates and an API key
getDay <- function(day, key) {
u <- sprintf(
"https://openexchangerates.org/api/historical/%s.json?app_id=%s",
day, key
)
d <- jsonlite::fromJSON(u)
# convert timestamp to date object
# http://stackoverflow.com/questions/13456241/convert-unix-epoch-to-date-object-in-r
rates <- as.data.frame(d$rates)
rates$date <- as.POSIXct(res$timestamp, origin = "1970-01-01")
data.frame(rates)
}
# Read in a set of dates using purrr
rates10 <- days %>%
purrr::map_dfr(~getDay(day = ., key="your-API-key"))
saveRDS(rates10, file=here::here("data","rates.rds"))
```
## Shapefiles
The basic building block of a map is a set of points and an ordering variable which can be used to connect the dots with lines, so that a geographic region can be interpreted. Maps are important because they summarise our physical world, and provide the context for data collected for many different purposes -- political, health, environment, commerce.
Map information is often recorded and shared in specialised formats, all of which efficiently keep the structural aspects of spatial data organised. The `sf` (FIXME:REF) package provides tools to access this data.
Here is an example related to bush fires in Australia. In Victoria, there are five Country Fire Authority (CFA) regions. These are organisational regions for fire management. The shapefiles associated with these regions can be downloaded from http://services.land.vic.gov.au/SpatialDatamart/, but these are also provided with this repo. The format is a folder of several binary files, which collectively describe the map geometry. A good explanation of shapefiles can be found [here](https://en.wikipedia.org/wiki/Shapefile).
```{r}
library(sf)
library(tidyverse)
boundaries <- st_read("data/vic_cfa/cfa_region.shp")
```
Shapefiles are typically very detailed, and big in size. For data analysis, these are usually too detailed. Its a good idea to reduce the size, by removing some of the boundary points. This is a tricky task, because some boundary points are more important for defining the boundary than others, and they need to be kept. The `rmapshaper` package has a thinning function that works nicely to create a small and relatively map accurate shapefile. Note that, the `geom_sf` function from the `sf` package can be used to directly plot the shapefile. An alternative, albeit a lot more work, is to convert the object into a `tibble` of points, group ids, and and ordering variable and plot using the `geom_polygon`.
```{r}
library(rmapshaper)
bound_sm <- ms_simplify(boundaries, keep_shapes = TRUE)
ggplot(data=bound_sm) + geom_sf(aes(fill=CFA_REGION))
```
Shapefiles can also be used to contain point information, for example the locations of fire stations. Locations of fire stations in Victoria can be downloaded from the same Victorian Government site, and overlaid on the polygon data.
```{r}
fire_stations <- st_read("data/vic_fire_stations/geomark_point.shp") %>%
filter(FEATSUBTYP =="fire station", STATE=="VIC")
ggplot(data=bound_sm) + geom_sf() +
geom_sf(data=fire_stations, colour="red", shape=3)
```
## APIs
### Melbourne pedestrian traffic: `rwalkr`
### Australian weather: `bomrang`
### Atlas of Living Austraila: `ala4R`
## Images as data
An example is http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/
ETC3250 assignments from Di - simplify, maybe not full set of Bob Ross paintings
maybe Susan's shoeprint?
## Databases