# Data Frames
The rectangular array of information (e.g. an Excel spreadsheet) is what many people think of when they hear the word "data." Each column contains elements of a shared data type, and these data types can vary from column to column.\index{data frame}
There is a type for this in R and Python: a data frame. It might even be the most common way that data is stored in both R and Python programs because many functions that read in data from an external source return objects of this type (e.g. `read.csv()` in R and `pd.read_csv()` in Python).
R and Python's data frames have a number of things in common:
1. each column must be the same length as all other columns,
2. each column's elements will all have the same type,
3. elements within a single row can have different types,
4. both columns and rows can be given names,
5. there are many ways to get and set different subsets of data, and
6. when reading in data, the same sorts of difficulties arise in both languages.
## Data Frames in R
Let's consider as an example Fisher's "Iris" data set [@misc_iris_53] hosted by [@uci_data]. We will read this data set in from a comma-separated file (more information on input/output can be found in chapter \@ref(input-and-output)). This file can be downloaded from this link: [https://archive.ics.uci.edu/ml/datasets/iris](https://archive.ics.uci.edu/ml/datasets/iris).
```{r, collapse=TRUE}
irisData <- read.csv("data/iris.csv", header = F)
head(irisData, 3)
typeof(irisData)
class(irisData) # we'll talk more about classes later
dim(irisData)
nrow(irisData)
ncol(irisData)
```
There are some exceptions, but most data sets can be stored as a `data.frame`.\index{data frame!data frames in R} These kinds of two-dimensional data sets are quite common. Any particular row is often an observation on one experimental unit (e.g. person, place or thing). Looking at a particular column gives you one kind of measurement stored for all observations.
::: {.rmd-caution data-latex=""}
Do not rely on the default arguments of `read.csv()` or `read.table()`! After you read in a data frame, always check to make sure that
1. the number of columns is correct because the correct column *separator* was used (c.f. `sep=`),
2. column names were parsed correctly, if there were some in the raw text file,
3. the first row of data wasn't used as a column name sequence, if there weren't column names in the text file,
4. the last few rows aren't just empty lines that got read in as data,
5. character columns are read in correctly (c.f. `stringsAsFactors=`), and
6. special characters signifying missing data were correctly identified (c.f. `na.strings=`).
:::
[A `data.frame` is a special case of a `list`](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Data-frame-objects). Every element of the list is a column. Columns can be `vector`s or `factor`s, and they can all be of different types. This is one of the biggest differences between a `data.frame` and a `matrix`: they are both two-dimensional, but a `matrix` needs all of its elements to be the same type. Unlike a general `list`, a `data.frame` requires all of its columns to have the same number of elements. In other words, a `data.frame` is not a "*ragged*" list.
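You can verify this list structure directly. Here is a small sketch using the `irisData` object read in above.
```{r, collapse=TRUE}
is.list(irisData)      # a data.frame is a list under the hood
length(irisData)       # the number of list elements (i.e. columns)
class(irisData[[1]])   # extract a single column with list-style indexing
```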
You will often need to extract pieces of information from a `data.frame`. This can be done in many ways. If the columns have names, you can use the `$` operator to access a single column; the extracted column is a `vector` (or `factor`) that you can assign to a new variable. You can also use the `[` operator to access multiple columns by name.
```{r, collapse=TRUE}
colnames(irisData) <- c("sepal.length", "sepal.width",
"petal.length","petal.width",
"species")
firstCol <- irisData$sepal.length
head(firstCol)
firstTwoCols <- irisData[c("sepal.length", "sepal.width")]
head(firstTwoCols, 3)
```
The `[` operator is also useful for selecting rows and columns by index numbers, or by some logical criteria.
```{r, collapse=TRUE}
topLeft <- irisData[1,1] # first row, first col
topLeft
firstThreeRows <- irisData[1:3,] # rows 1-3, all cols
firstThreeRows
# rows where species column is setosa
setosaOnly <- irisData[irisData$species == "Iris-setosa",]
setosaOnly[1:3,-1]
```
In the code above, `irisData$species == "Iris-setosa"` creates a logical vector (try it!) using the vectorized `==` operator. The `[` operator selects the rows for which the corresponding element of this logical vector is `TRUE`.
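If you want to inspect that logical vector yourself, here is a quick sketch (the variable name `speciesIsSetosa` is just for illustration).
```{r, collapse=TRUE}
speciesIsSetosa <- irisData$species == "Iris-setosa"
head(speciesIsSetosa)   # TRUE wherever the row is a setosa flower
sum(speciesIsSetosa)    # each TRUE counts as 1, so this is the number of matches
```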
::: {.rmd-details data-latex=""}
Be careful: depending on how you use the square brackets, you can either get a `data.frame` or a `vector`. As an example, try both `class(irisData[,1])` and `class(irisData[,c(1,2)])`.
:::
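Here is that experiment written out as a short sketch; the `drop=` argument, which is not discussed above, controls whether the result is simplified to a `vector`.
```{r, collapse=TRUE}
class(irisData[,1])               # a single column simplifies to a vector
class(irisData[,c(1,2)])          # two or more columns stay a data.frame
class(irisData[,1,drop=FALSE])    # drop = FALSE prevents the simplification
```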
In R, `data.frame`s might have row names. You can get and set this character `vector` with the `rownames()` function. You can access rows by name using the square bracket operator.
```{r, collapse = TRUE}
head(rownames(irisData))
rownames(irisData) <- as.numeric(rownames(irisData)) + 1000
head(rownames(irisData))
irisData["1002",]
```
Code that modifies data usually looks quite similar to code that extracts data. You'll notice a lot of the same symbols (e.g. `$`, `[`, etc.), but the extraction code will appear on the left-hand side of the assignment operator (`<-`) instead of the right.
```{r, collapse = TRUE}
irisData$columnOfOnes <- rep(1, nrow(irisData))
irisData[,1] <- NULL #delete first col
irisData[1:2,1] <- rnorm(n = 2, mean = 999)
irisData[,'sepal.width'] <- rnorm(n = nrow(irisData), mean = -999)
irisData[irisData$species == "Iris-setosa", 'species'] <- "SETOSA!"
head(irisData, 3)
```
## Data Frames in Python
The Pandas\index{Pandas} library in Python has data frames that are modeled after R's [@pandas_guy].\index{data frame!data frames in Python}
```{python, collapse=TRUE}
import pandas as pd
iris_data = pd.read_csv("data/iris.csv", header = None)
iris_data.head(3)
iris_data.shape
len(iris_data) # num rows
len(iris_data.columns) # num columns
list(iris_data.dtypes)[:3]
list(iris_data.dtypes)[3:]
```
The structure is very similar to that of R's data frame. It's two-dimensional\index{indexing!indexing Pandas data frames}, and you can [access columns and rows by name or number.](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) Each column is a `Series` object, and each column can have a different `dtype`, which is analogous to R's situation. Again, elements only need to share a type within each column, which is a big difference between 2-d Numpy `ndarray`s and `DataFrame`s (c.f. R's `matrix` versus R's `data.frame`).
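A toy example (with made-up column names) illustrates the point: a `DataFrame` keeps a separate `dtype` for each column, while a Numpy array coerces everything to a single type.
```{python, collapse=TRUE}
import numpy as np
toy = pd.DataFrame({'length': [5.1, 4.9], 'species': ['setosa', 'virginica']})
toy.dtypes    # one dtype per column
np.array([[5.1, 'setosa'], [4.9, 'virginica']]).dtype   # everything coerced to strings
```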
::: {.rmd-caution data-latex=""}
Again, do not rely on the default arguments of `pd.read_csv()`! After you read in a data set, always check that
1. the number of columns is correct because the correct column *separator* was used (c.f. `sep=`),
2. column names were parsed correctly, if there were some in the raw text file,
3. the first row of data wasn't used as a column name sequence, if there weren't column names in the text file (c.f. `header=`),
4. the last few rows aren't just empty lines that got read in as data,
5. character columns are read in correctly (c.f. `dtype=`), and
6. special characters signifying missing data were correctly identified (c.f. `na_values=`).
:::
Square brackets are a little different in Python than they are in R. Just like in R, you can access columns by name with square brackets, and you can also access rows. Unlike R, though, you don't have to specify both rows and columns every time you use the square brackets.
```{python, collapse=TRUE}
iris_data.columns = ["sepal.length", "sepal.width", "petal.length",
"petal.width", "species"]
first_col = iris_data['sepal.length']
first_col.head()
first_two_cols = iris_data[["sepal.length", "sepal.width"]]
first_two_cols.head(3)
```
::: {.rmd-details data-latex=""}
Notice that `iris_data['sepal.length']` returns a `Series` and `iris_data[["sepal.length", "sepal.width"]]` returns a Pandas `DataFrame`. This behavior is similar to what happened in R. For more details, click [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#indexing-selection).
:::
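You can confirm this with `type()`; here is a quick sketch.
```{python, collapse=TRUE}
type(iris_data['sepal.length'])                     # a single column is a Series
type(iris_data[["sepal.length", "sepal.width"]])    # a list of columns gives a DataFrame
```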
You can select columns and rows by number with the [`.iloc` indexer](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html). `iloc` is short for "integer location."
```{python, collapse=TRUE}
# specify rows/cols by number
top_left = iris_data.iloc[0,0]
top_left
first_three_rows_without_last_col = iris_data.iloc[:3,:-1]
first_three_rows_without_last_col
```
Selecting columns by anything besides integer position can be done with the [`.loc` indexer](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html). You should generally prefer `.loc` for accessing columns because accessing things by *name* instead of *number* is more readable. Here are some examples.
```{python, collapse=TRUE}
sepal_w_to_petal_w = iris_data.loc[:,'sepal.width':'petal.width']
sepal_w_to_petal_w.head()
setosa_only = iris_data.loc[iris_data['species'] == "Iris-setosa",]
# don't need the redundant column anymore
del setosa_only['species']
setosa_only.head(3)
```
Notice we used a `slice` (i.e. `'sepal.width':'petal.width'`) to access many columns by referring only to the left-most and the right-most. Unlike slicing\index{slicing} with numbers, [the right end is included](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#slicing-ranges). Also note that this does not work with the regular square bracket operator (i.e. `iris_data['sepal.width':'petal.width']` will not select these columns). The second example keeps only the rows where the `"species"` column elements are equal to `"Iris-setosa"`.
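Coming back to the slice: as a quick check, listing the columns of that label-based selection shows that the right endpoint really is included.
```{python, collapse=TRUE}
list(iris_data.loc[:,'sepal.width':'petal.width'].columns)
```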
Each `DataFrame` in Pandas comes with an `.index` attribute. This is analogous to a row name in R, but it's much more flexible because the index can take on a variety of types. This can help us highlight the difference between `.loc` and `.iloc`. Recall that `.loc` was *label-based selection.* Labels don't necessarily have to be strings. Consider the following example.
```{python, collapse = TRUE}
iris_data.index
# reverse the index
iris_data = iris_data.set_index(iris_data.index[::-1])
iris_data.iloc[-2:,:3] # top is now bottom
iris_data.loc[0] # last row has 0 index
iris_data.iloc[0] # first row with big index
```
`iris_data.loc[0]` selects the row whose *label* is `0`. Because the code above reversed the index, that is now the last row. If you want the first row regardless of its label, use `iris_data.iloc[0]`.
<!-- [`DataFrame`s have a huge amount many useful attributes and methods, too.](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#) -->
Modifying data inside a data frame looks quite similar to extracting data. You'll recognize a lot of the methods mentioned earlier.
```{python, collapse = TRUE}
import numpy as np
n_rows = iris_data.shape[0]
iris_data['col_ones'] = np.repeat(1.0, n_rows)
iris_data.iloc[:2,0] = np.random.normal(loc=999, size=2)
rand_nums = np.random.normal(loc=-999, size=n_rows)
iris_data.loc[:,'sepal.width'] = rand_nums
setosa_rows = iris_data['species'] == "Iris-setosa"
iris_data.loc[setosa_rows, 'species'] = "SETOSA!"
del iris_data['petal.length']
iris_data.head(3)
```
You can also use the [`.assign()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html) to create a new column. This method does not modify the data frame in place. It returns a new `DataFrame` with the additional column.
```{python, collapse = TRUE}
iris_data = iris_data.assign(new_col_name = np.arange(n_rows))
del iris_data['sepal.length']
iris_data.head(3)
```
::: {.rmd-caution data-latex=""}
Above we were assigning Numpy arrays to columns of a `DataFrame`. Be careful when you're assigning `Series` objects. You'll see in the documentation that ["Pandas aligns all AXES when setting Series and `DataFrame` from `.loc`, and `.iloc`."](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics).
:::
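Here is a small sketch of what that alignment can do. The throwaway objects `df` and `s` below are made up for illustration: because assignment through `.loc` matches on index *labels* rather than positions, the `Series` values land in the rows with matching labels, while a plain Numpy array is assigned by position.
```{python, collapse = TRUE}
df = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=[0, 1, 2])
s = pd.Series([1.0, 2.0, 3.0], index=[2, 1, 0])  # same labels, reversed order
df.loc[:, 'x'] = s                               # aligned by label, so values come out reversed
df['x'].tolist()
df.loc[:, 'x'] = np.array([1.0, 2.0, 3.0])       # a Numpy array has no labels to align
df['x'].tolist()
```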
## Exercises
### R Questions
1.
Consider the data set `"original_rt_snippets.txt"` [@SocherEtAl2013:RNTN], which is hosted by [@uci_data]. We will calculate the *term frequency-inverse document frequency statistics* [@Jones72astatistical] on this data set, which is a common data transformation technique used in text mining and natural language processing. You may use the `stringr` library for this question, if you wish.
a) Read in this data set as a `vector` and call it `corpus`.
b) Create a `vector` called `dictionary` that contains the following phrases: "charming", "fantasy", "hate", and "boring".
c) Construct a `data.frame` called `bagOfWords` with four columns that contain the number of appearances of each phrase in the dictionary. Match the exact phrases. For simplicity, don't worry about the case of letters or using regular expressions (c.f. section \@ref(an-introduction-to-regular-expressions)). Label the columns of this `data.frame` with the phrases you're searching for. Try to write code that is easy to modify if you decide to change the set of phrases in your dictionary.
d) Create a `data.frame` called `termFrequency` with four columns. Each element should correspond with the count from the previous `data.frame`. Instead of the count, each element should be $\log(1 + \text{count})$.
e) Create a `vector` of length four called `invDocFrequency`. The inverse document frequency formula for any term $t$ is $\log([\text{number of documents in corpus}])$ minus $\log([\text{number of documents that contain term } t])$. Make sure the names of this vector are the same as the words in the dictionary.
f) Create a `data.frame` called `tfidf` (short for "term frequency-inverse document frequency"). For row/document $d$, and column/term $t$, the formula is the product: $[\text{term frequency of term } t \text{ and document } d ] \times [\text{inverse doc. freq. of term } t]$.
g) Extract the elements of `corpus` that have at least one nonzero element in the corresponding row of `tfidf`. Call the `vector` `informativeDocuments`.
h) Do you see any documents that were labeled as informative but that do not actually contain the words you searched for?
2.
`mtcars` is a data set that is built into R, so you don't need to read it in. You can read more about it by typing `?datasets::mtcars`.
a) Create a new `data.frame` called `withoutDisp` that is the same as `mtcars`, but has its `disp` column removed.
b) Create a new column for `withoutDisp` called `coolnessScore`. The formula is $\frac{1}{\text{mpg}} + \text{quarter mile time}$.
c) Create a new `data.frame` called `sortedWD` that is equal to `withoutDisp`, but sorted in descending order by the coolness score.
d) Create a new `data.frame` from `sortedWD` called `specialRows` that keeps only rows where $\text{weight (1000 lbs)} + \text{rear axle ratio} < 7$.
e) Calculate the percent reduction in number of rows, going from `sortedWD` to `specialRows`. Call it `percReduction`. Make sure it is between $0$ and $100$.
3.
This question investigates the [*Zillow Home Value Index (ZHVI)*](https://www.zillow.com/research/data/) for single family homes.
a) read in `"Zip_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_month.csv"`. Call the `data.frame` `homeData`. Remember to be careful with file paths. Also, when looking at the data set with a text editor, be sure that "word wrap" is not turned on.
b) Extract the rows of `homeData` that correspond with Charlottesville, VA, and assign them as a `data.frame` to the variable `cvilleData`
c) Assign all the unique zip codes to a `character vector` called `cvilleZips`.
d) Extract the columns of `cvilleData` that correspond with home prices, and transpose them so that each row of the result corresponds with a different month. Call this new `data.frame` `timeSeriesData`. Also, make sure that the column names of this new `data.frame` are set equal to the appropriate zip codes.
e) Write a function called `getAveMonthlyChange` that takes in a numeric `vector`, and returns the average change. Your function should not return an `NA`, so be sure to deal with `NA`s appropriately.
f) Calculate the average monthly price change for each zip code. Store your results in a `vector` called `aveMonthlyChanges`. Make sure this `vector` has named elements so one can extract elements by zip code.
### Python Questions
1.
This question deals with looking at historical prices of the S\&P500 Index. This data was downloaded from [https://finance.yahoo.com](https://finance.yahoo.com) [@gspc_data]. It contains prices starting from "2007-01-03" and going until "2021-10-01".
a) Read in the data file `"gspc.csv"` as a `data.frame` and call the variable `gspc`.
b) Use [`.set_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) to change the index of `gspc` to its `"Index"` column. Store the new `DataFrame` as `gspc_good_idx`.
c) Recall the formula for *log returns* provided in exercises of chapter \@ref(r-vectors-versus-numpy-arrays-and-pandas-series). Add a column to `gspc_good_idx` called `log_returns`. Calculate them from the column `GSPC.Adjusted`. Make sure to scale them by $100$ and put an `np.nan` in the first element where you don't have a return.
d) Extract all the returns that are available for the year 2021, and store them as a `Series` called `this_year_returns`.
e) Add a column to `gspc_good_idx` containing the *drawdown* time series. Call this column `drawdown`. To calculate drawdown at a given date, subtract that date's price from the running maximum price that is current at that date. Use only adjusted close prices for this calculation.
f) Add a column to `gspc_good_idx` containing the percent drawdown time series. Call this column `perc_drawdown`. Use the previous column, but make this number a percent of the corresponding running maximum.
g) What is the maximum drawdown of this time series? Store it as a percent in the value `mdd`.
2.
In this question we'll look at [some data on radon measurements](https://www.tensorflow.org/datasets/catalog/radon) [@GelmanHill:2007]. Instead of reading in a text file, we will load the data into Python using the [`tensorflow_datasets` module](https://www.tensorflow.org/datasets) [@TFDS].
Please include the following code in your submission.
```{python, eval=FALSE}
import tensorflow_datasets as tfds
import pandas as pd
import numpy as np
d = tfds.load("radon")
d = pd.DataFrame(tfds.as_dataframe(d['train']))
```
Many of you will need to install `tensorflow` and `tensorflow_datasets` before you're able to `import` them. If that's so, please read section \@ref(installing-packages-in-python) for more information on how to install packages.
a) Assign to `worst_row` the row of `d` that is associated with the highest recorded level of radon. Make sure it is a `DataFrame`.
b) Assign to `nrows` and `ncols` the number of rows and columns of `d`, respectively.
c) Assign the most common column data type to `most_common_dtype`. Make sure the variable is of type `numpy.dtype`.
d) Are there any observations from Virginia in this data set? If so, assign `True` to `any_va`. Otherwise assign `False`.
e) Create a new column in `d` called `dist_from_cville`. Use the **Haversine formula** to calculate the distance between each observation's location and the University of Virginia, in kilometers.
* Assume the University of Virginia is at 38.0336$^\circ$ N, 78.5080$^\circ$ W.
* Assume the Earth's radius is $r = 6378.137$ kilometers.
* The formula for distance between $(\lambda_1, \phi_1)$ (signed longitude in radians, signed latitude in radians) and $(\lambda_2, \phi_2)$ is
\begin{equation}
2 \hspace{1mm} r \hspace{1mm} \text{arcsin}\left( \sqrt{ \sin^2\left( \frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2) \sin^2\left( \frac{\lambda_2 - \lambda_1}{2} \right) } \right)
\end{equation}
f) What is the average radon measurement among all measurements taken at the place that is closest to where we are now? Assign your answer as a `float` to `close_ave`.