02-accessing-data.Rmd

---
knit: bookdown::preview_chapter
---

# Accessing Data

<!-- Topics to cover here 
lots of ways to read  in data: flat file, proprietary, data.table??, databases, scraping, json, shape files, netcdfs,  ...

missing values
-->

Data can be delivered in many ways, which makes it necessary to have a multitude of ways to harvest it. 

## Flat files

Flat files, like comma or tab separated values (`csv`, `tsv`), or values specified by fixed column widths, can be accessed simply, once the format is determined. This is also a very common format is used for very large data, for example, hadoop, where the data is partitioned into many small chunks overlaid with a system to access it. 

### Example: tuberculosis

The World Health Organisation (WHO) distributes data on tuberculosis case notifications through their web site at https://www.who.int/tb/country/data/download/en/. Download the latest case notifications data set, which is provided in `csv` format. A data dictionary is also provided explaining the format of the file, and information recorded. The data can be read into R using the code below:

```{r}
library(tidyverse)
library(DT)
tb <- read_csv(here::here("data", "TB_notifications_2020-01-04.csv"))
datatable(tb %>% top_n(10))
```

This data has 8286 rows and 164 columns. Its not in tidy form yet, so you will see this again in the tidying material to learn how to get a messy data set wrangled. The `read_csv` function provides a `tibble` data object in R. An alternative approach to reading is with the base function `read.csv` which will provide a `data.frame` object. 

### Your turn: atmospheric carbon dioxide 

Atmospheric carbon dioxide recordings are delivered by the Scripps CO$_2$ program, and available at https://scrippsco2.ucsd.edu/data/atmospheric_co2/sampling_stations.html.

Choose a station and download or directly read the csv file. You will need to skip over a number of lines at the top of the file in order to read only the data.

```{r eval=FALSE, echo=FALSE}
CO2.spo <- read_csv("https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/merged_in_situ_and_flask/daily/daily_merge_co2_spo.csv", col_names=c("date", "time", "day", "decdate", "n", "flg", "co2"), skip=69)
```


### Example: standardised test scores for reading, science and math

Every three years the OECD's Programme for International Student Assessment measures the ability of 15 year old students across the globe in reading, science and mathematics skills. This data is made available at https://www.oecd.org/pisa/. The format that the data is distributed in, is different from survey to survey, with the earliest years being text files, and later years in proprietary formats. The text files have fixed width format, where each variable occupies a fixed set of columns in the text. 

Let's take a look at the 2006 student data. Download the file from the OECD web site -- you should find that it is named `INT_Stu06_Dec07.txt`, or work with the subset of the first 2000 rows that is provided here. 

```{r}
# This is a previously generated data object of the column widths of variables 
load(here::here("data", "PISA_student_2006_varwidths.rda"))
# To work with the full data set, use the downloaded file, INT_Sch06_Dec07.zip
# instead of INT_Stu06_Dec07_25k.txt
d <- read_fwf(file=here::here("data", "INT_Stu06_Dec07_2k.txt"), 
              col_positions=fwf_widths(var_widths$widths, 
                                       col_names=var_widths$names))
datatable(d %>% top_n(10))
```

The`learningtower` package available at https://github.com/ropenscilabs/learningtower contains the data for multiple years and code to process the source files.

### Your turn: land surface weather station data

The [Global Historical Climate Network](https://www.ncdc.noaa.gov/ghcn-daily-description) delivers data from land surface stations across the globe. Records for individual stations are provided using a fixed width format flat file. Find a station near you, and download the file. The list of [stations can be found here](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt). Your station file is identified by the station id. For example, the Melbourne airport station ID is ASN00086282, and thus the data file can be downloaded directly from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/ASN00086282.dly.

Documentation on the file format and how to cite the archive is available at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt. The column widths of the variables can be determined from this information. 

1. Work out the column widths for each variable
2. Write the code to read in this fixed width format

```{r eval=FALSE, echo=FALSE}
# Remote address for file
# fl <- "ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/ASN00086282.dly"
# Copy available with book
fl <- "data/ASN00086282.dly"
melb <- read_fwf(fl,    
                 col_positions=fwf_widths(c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31))))
glimpse(melb)
```

The data is far from tidy, and needs considerable work to get it into a suitable format for analysis. 

A more friendly and tider `csv` formatted data have been made available recently which can be acessed through the https server on the same location.

```{r eval=FALSE, echo=FALSE}
melb <- read_csv("https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/ASN00086282.csv")
```


## Proprietary formats

Data is often delivered in a specialist formats developed by a company, or individual, that has a  special encoding scheme. Examples include excel files, SAS or SPSS files. These require decoding in order to read. 

### Example: Australia air traffic data

The [Australian Department of Infrastructure, Transport, Cities and Regional Development](https://www.bitre.gov.au/publications/ongoing/airport_traffic_data) provides excel format files summarising air traffic at airports. Download the most recent data, and open the excel file. There are multiple sheets. The third sheet has passenger numbers. There are 6 rows of meta-information, which need to be ignored when reading the data. 

```{r}
library(readxl)
passengers <- read_xlsx(here::here("data", "WebAirport_FY_1986-2019.xlsx"), sheet=3, skip=6)
datatable(passengers)
```

### Your turn: Munging spreadsheets

The book [Spreaadsheet Munging Strategies](https://nacnudus.github.io/spreadsheet-munging-strategies/index.html) by Duncan Garmonsway is a really good source of dealing with complicated excel spreadsheets. We recommend working through examples in this book to familiarise ways to deal with messy spreadsheets, and incorporating information such as special formatting into the data. 

The [case study](https://nacnudus.github.io/spreadsheet-munging-strategies/vaccinations.html#vaccinations) developed from Bob Rudis's post on CDC vaccination data is especially recommended.

<!--
### Your turn: New Zealand Census data

StatsNZ makes [tables of data from the five year censuses](http://nzdotstat.stats.govt.nz/) publicly available. Take a look at the 2018 Census data, the population and migration data. You need to expand the cells, and select the levels to use, to have the numbers broken down by age group, sex and ethnicity. (A sample file `NZ_census.xlsx` is provided as an example.) In excel format, the variables and levels of the variables are in the header names, with a twist, a row for each variable, in a multicolumn format. (The R package `tidyxl` has the capacity to deal with multiple headers like this, but requires `xlsx` format. The sample file has been opened and saved in this format, and the first two blank lines in the origial file were also manually removed.)

Luckily, choosing the `csv` format will provide the data in tidy long form. 

GIVING UP ON THE XLS FORMAT - ITS JUST REALLY IRREGULAR - AND tidyxl even cannot handle it.

Similarly, TableBuilder now allows csv download which is in long format, so no need to struggle with the xlsx format.
-->

```{r eval=FALSE, echo=FALSE}
library(tidyxl)
library(unpivotr)
NZ_all_cells <-
  xlsx_cells(here::here("data", "NZ_census.xlsx")) %>%
  dplyr::filter(!is_blank) %>%
  select(row, col, data_type, character, numeric) %>%
  dplyr::filter(row > 1)
year <-
  NZ_all_cells %>%
  dplyr::filter(col >= 2, row == 5) %>%
  select(row, col, year = character)
ethnic <-
  NZ_all_cells %>%
  dplyr::filter(col >= 2, row == 4) %>%
  select(row, col, ethnic = character)
sex <-
  NZ_all_cells %>%
  dplyr::filter(col >= 2, row == 3) %>%
  select(row, col, sex = character)
age <-
  NZ_all_cells %>%
  dplyr::filter(col >= 2, row == 2) %>%
  select(row, col, age = character)
NZ <- read_xlsx(here::here("data", "NZ_census.xlsx"), skip=5, n_max=1) %>%
  gather(v, count, -Area) %>%
  filter(!is.na(count)) %>%
  select(-v)
NZ <- NZ %>%
  mutate(year = year$year) %>%
  mutate(ethnic = rep(ethnic$ethnic, c(rep(3, length(ethnic$ethnic)-1), 1))) %>%
  mutate(sex = rep(sex$sex, 43))
# NZ <- read_xls(here::here("data", "NZ_census.xls"), skip=2)
# NZ <- read_csv("data/NZ_census.csv")
```

### Example: SPSS binary

The 2015 and 2018 PISA surveys are available in bniary file format, which is a more complete data format in the sense of variable typing and level descriptions integrated with the data. The R package `haven` can be used to read foreign binary formats. Download the 2018 data from https://www.oecd.org/pisa/data/2018database/ (494Mb), and use the following code to read it into R. There is a data dictionary on the same site which describes fields in detail. This is really useful for deciding which variables to keep.

```{r eval=FALSE}
library(haven)
stu_qqq <- read_sav(here::here("data", "SPSS_STU_QQQ.zip")) 
PISA_au_2018 <- stu_qqq %>% 
  filter(CNT == "AUS") %>%
  select(CNT, ST004D01T, ST012Q01TA, ST013Q01TA, PV1MATH:PV10SCIE) 
# Keep country, gender, num TVs, num books, 
saveRDS(PISA_au_2018, file=here::here("data", "PISA_au_2018.rds"))
```

## Reading multiple files from a website

Often multiple datasets are available in a web site. It is tedious, and inefficient to manually download each file and read into R.  An example is data on the rental market in Tasmania from [data.gov.au](http://data.gov.au/dataset/rental-bond-and-rental-data-tasmania-2016-to-2017). The `sawfish` package makes it easier to download multiple files automatically. The package is only available on GitHub, and needs too be installed using the `devtools` package. Here is sample code to read all of the `xlsx` files on Tasmania's rental market. 

```{r}
library(readxl)
# devtools::install_github("AnthonyEbert/sawfish")
library(sawfish) 
url<-"http://data.gov.au/data/dataset/rental-bond-and-rental-data-tasmania-2016-to-2017"
fls <- find_files(url, "xlsx")
# Check that one file can be read nicely
f1 <- tempfile()
download.file(fls[1], f1, mode="wb")
t1 <- read_xlsx(path=f1, sheet=1)
# Now read all files
rentals <- NULL
for (i in 1:length(fls)) {
  download.file(fls[i], f1, mode="wb")
  t1 <- read_xlsx(path=f1, sheet=1)
  rentals <- bind_rows(rentals, t1)
}
datatable(t1 %>% top_n(10))
```

## JSON

With the advent of web communication, comes JavaScript Object Notation (JSON). It is a language-independent data format, and supplants extensible markup language (XML). It is a verbose data descriptor. Here's an example from [wikipedia](https://en.wikipedia.org/wiki/JSON) describing a person:

```
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": 
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
}
```

### Example: Crossrates 

An example we have seen is the cross rates data available at [https://openexchangerates.org/](https://openexchangerates.org/). To access this data you need to:

1. Get a free plan from https://openexchangerates.org/signup/free
2. Tell this function your API key -- `Sys.setenv("OER_KEY" = "your-key-here")`

This code can be used to get the latest 10 days of crossrates.

```{r eval=FALSE}
library(lubridate)
library(jsonlite)
library(glue)
# Function to read in rate for one day, given dates and an API key
getDay <- function(day, key) {
  u <- sprintf(
    "https://openexchangerates.org/api/historical/%s.json?app_id=%s",
    day, key
  )
  d <- jsonlite::fromJSON(u)
  # convert timestamp to date object
  # http://stackoverflow.com/questions/13456241/convert-unix-epoch-to-date-object-in-r
  rates <- as.data.frame(d$rates)
  rates$date <- as.POSIXct(res$timestamp, origin = "1970-01-01")
  data.frame(rates)
}

# Read in a set of dates using purrr
rates10 <- days %>% 
  purrr::map_dfr(~getDay(day = ., key="your-API-key")) 
saveRDS(rates10, file=here::here("data","rates.rds"))
```


## Shapefiles

The basic building block of a map is a set of points and an ordering variable which can be used to connect the dots with lines, so that a geographic region can be interpreted. Maps are important because they summarise our physical world, and provide the context for data collected for many different purposes -- political, health, environment, commerce.

Map information is often recorded and shared in specialised formats, all of which efficiently keep the structural aspects of spatial data organised. The `sf` (FIXME:REF) package provides tools to access this data. 

Here is an example related to bush fires in Australia. In Victoria, there are five Country Fire Authority (CFA) regions. These are organisational regions for fire management. The shapefiles associated with these regions can be downloaded from http://services.land.vic.gov.au/SpatialDatamart/, but these are also provided with this repo. The format is a folder of several binary files, which collectively describe the map geometry. A good explanation of shapefiles can be found [here](https://en.wikipedia.org/wiki/Shapefile). 

```{r}
library(sf)
library(tidyverse)
boundaries <- st_read("data/vic_cfa/cfa_region.shp")
```

Shapefiles are typically very detailed, and big in size. For data analysis, these are usually too detailed. Its a good idea to reduce the size, by removing some of the boundary points. This is a tricky task, because some boundary points are more important for defining the boundary than others, and they need to be kept. The `rmapshaper` package has a thinning function that works nicely to create a small and relatively map accurate shapefile. Note that, the `geom_sf` function from the `sf` package can be used to directly plot the shapefile. An alternative, albeit a lot more work, is to convert the object into a `tibble` of points, group ids, and and ordering variable and plot using the `geom_polygon`. 

```{r}
library(rmapshaper)
bound_sm <- ms_simplify(boundaries, keep_shapes = TRUE)
ggplot(data=bound_sm) + geom_sf(aes(fill=CFA_REGION))
```

Shapefiles can also be used to contain point information, for example the locations of fire stations. Locations of fire stations in Victoria can be downloaded from the same Victorian Government site, and overlaid on the polygon data. 

```{r}
fire_stations <- st_read("data/vic_fire_stations/geomark_point.shp") %>%
  filter(FEATSUBTYP =="fire station", STATE=="VIC")
ggplot(data=bound_sm) + geom_sf() +
  geom_sf(data=fire_stations, colour="red", shape=3)
```


## APIs

### Melbourne pedestrian traffic: `rwalkr`

### Australian weather: `bomrang`

### Atlas of Living Austraila: `ala4R`

## Images as data

An example is http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/

ETC3250 assignments from Di - simplify, maybe not full set of Bob Ross paintings

maybe Susan's shoeprint?

## Databases