07-categorical_data.Rmd

# Categorical Data 

While statisticians may describe data as being either categorical or numerical, this classification is different than classifying data by its *type* in a program. So, strictly speaking, if you have categorical data, you are not obligated to use any particular type to represent it in your script. 

However, there are types that are specifically designed to be used with categorical data, and so they are especially advantageous to use if you end up with the opportunity. We describe a few of them here in this chapter.

## `factor`s in R

Categorical data in R is often stored in a [`factor`](https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Factors) variable.\index{factors in R} `factor`s are more special than `vector`s of integers because

  * they have a `levels` attribute, which is comprised of all the possible values that each response could be; 
  * they may or may not be *ordered*, which will also control how they are used in mathematical functions;
  * they might have a `contrasts` attribute, which will control how they are used in statistical modeling functions.

Here is a first example. Say we asked three people what their favorite season was. The data might look something like this.

```{r, collapse = TRUE}
allSeasons <- c("spring", "summer", "autumn", "winter")
responses <- factor(c("autumn", "summer", "summer"), 
                    levels = allSeasons)
levels(responses)
is.factor(responses)
is.ordered(responses)
#contrasts(responses) 
# ^ controls how factor is used in different  functions
```


`factor`s always have levels, which is the collection of all possible unique values each observation can take. 

::: {.rmd-caution data-latex=""}
You should be careful if you are not specifying them directly. What happens when you use the default option and replace the second assignment in the above code with `responses <- factor(c("autumn", "summer", "summer"))`? The documentation of `factor()` will tell you that, by default, `factor()` will just take the unique values found in the data. In this case, nobody prefers winter or spring, and so neither will show up in `levels(responses)`. This may or may not be what you want. 
:::

`factor`s can be ordered or unordered. Ordered `factor`s are for *ordinal* data. Ordinal data is a particular type of categorical data that recognizes the categories have a natural order (e.g. low/ medium/high and not red/green/blue).

As another example, say we asked ten people how much they liked statistical computing, and they could only respond "love it", "it's okay" or "hate it". The data might look something like this.

```{r, collapse = TRUE}
ordFeelOptions <- c("hate it", "it's okay", "love it")
responses <- factor(c("love it", "it's okay", "love it", 
                      "love it", "it's okay", "love it", 
                      "love it", "love it", "it's okay", 
                      "it's okay"), 
                    levels = ordFeelOptions,
                    ordered = TRUE)
levels(responses)
is.factor(responses)
is.ordered(responses)
# contrasts(responses)
```

::: {.rmd-caution data-latex=""}
When creating ordered factors with `factor()`, be mindful that the `levels=` argument is assumed to be ordered when you plug it into `factor()`. In the above example, if you specified `levels = c("love it", "it's okay", "hate it")`, then the factor would assume `love it < it's okay < hate it`, which may or may not be what you want.
:::

Last, `factor`s may or may not have a `contrast` attribute. You can get or set this with the `contrasts()` function. This will influence some of the functions you use on your data that estimate statistical models. 

I will not discuss specifics of contrasts in this text, but the overall motivation is important. In short, the primary reason for using `factor`s is that they are designed to allow control over *how* you model categorical data. To be more specific, changing attributes of a `factor` could control the paremeterization of a model you're estimating. If you're using a particular function for modeling with categorical data, you need to know how it treats factors. On the other hand, if you're writing a function that performs modeling of categorical data, you should know how to treat factors. 

Here are two examples that you might come across in your studies.

  1. Consider using `factor`s as inputs to a function that performs linear regression. With linear regression models, if you have categorical inputs, there are many choices for how to write down a model. In each model, the collection of parameters will mean different things. In R, you might pick the model by creating the `factor` in a specific way. 
  
  2. Suppose you are interested in estimating a classification model. In this case, the *dependent* variable is categorical, not the independent variable. With these types of models, choosing whether or not your `factor` is ordered is critical. These options would estimate completely different models, so choose wisely!

The mathematical details of these examples is outside of the scope of this text. If you have not learned about dummy variables in a regression course, or if you have not considered the difference between multinomial logistic regression and ordinal logistic regression, or if you have but you're just a little rusty, that is totally fine. I only mention these as examples for how the `factor` type can trigger special behavior. 

In addition to creating one with `factor()`, there are two other common ways that you can end up with `factors`:

  1. creating factors from numerical data, and
  2. when reading in an external data file, one of the columns is coerced to a `factor`.
  
Here is an example of (1). We can take non-categorical data, and `cut()` it into something categorical. 

```{r, collapse = TRUE}
stockReturns <- rnorm(6) # not categorical here
typeOfDay <- cut(stockReturns, breaks = c(-Inf, 0, Inf)) 
typeOfDay
levels(typeOfDay)
is.factor(typeOfDay)
is.ordered(typeOfDay)
```

Finally, be mindful of how different functions read in external data sets. When reading in an external file, if a particular function comes across a column that has characters in it, it will need to decide whether to store that column as a character vector, or as a `factor`. For example, `read.csv()` and `read.table()` have a `stringsAsFactors=` argument that you should be mindful of. 


## Two Options for Categorical Data in Pandas 


Pandas\index{Pandas} provides two options for storing categorical data. They are both very similar to R's `factor`s. You may use either 

  1. a Pandas `Series` with a special `dtype`, or 
  2. a Pandas `Categorical` container.
  
Pandas' `Series` were discussed earlier in sections \@ref(overview-of-python) and \@ref(vectorization-in-python). These were containers that forced every element to share the same `dtype`. Here, we specify `dtype="category"` in `pd.Series()`.

```{python, collapse = TRUE}
import pandas as pd
szn_s = pd.Series(["autumn", "summer", "summer"], dtype = "category") 
szn_s.cat.categories
szn_s.cat.ordered
szn_s.dtype
type(szn_s)
```

The second option is to use Pandas' `Categorical` containers. They are quite similar, so the choice is subtle. Like `Series` containers, they also force all of their elements to share the same shared `dtype`.

```{python, collapse = TRUE}
szn_c = pd.Categorical(["autumn", "summer", "summer"])
szn_c.categories
szn_c.ordered
szn_c.dtype
type(szn_c)
```

You might have noticed that, with the `Categorical` container, methods and data members were not accessed through the `.cat` accessor. It is also more similar to R's `factor`s because you can specify more arguments in the constructor.

```{python, collapse = TRUE}
all_szns = ["spring","summer", "autumn", "winter"]
szn_c2 = pd.Categorical(["autumn", "summer", "summer"], 
                        categories = all_szns, 
                        ordered = False)
```


::: {.rmd-caution data-latex=""}
In Pandas, just like in R, you need to be very careful about what the `categories` (c.f `levels`) are. If you are using ordinal data, they need to be specified in the correct order. If you are using small data sets, be cognizant of whether all the categories show up in the data--otherwise they will not be correctly inferred. 
:::

With Pandas' `Series` it's more difficult to specify a nondefault `dtype`. One option is to change them after the object has been created. 

```{python, collapse = TRUE}
szn_s = szn_s.cat.set_categories(
  ["autumn", "summer","spring","winter"])
szn_s.cat.categories
szn_s = szn_s.cat.remove_categories(['spring','winter'])
szn_s.cat.categories
szn_s = szn_s.cat.add_categories(["fall", "winter"])
szn_s.cat.categories
```

Another option is to create the `dtype` before you create the `Series`, and pass it into `pd.Series()`.

```{python, collapse = TRUE}
cat_type = pd.CategoricalDtype(
  categories=["autumn", "summer", "spring", "winter"],
  ordered=True)
responses = pd.Series(
  ["autumn", "summer", "summer"], 
  dtype = cat_type)
responses
```


Just like in R, you can convert numerical data into categorical. The function even has the same name as in R: [`pd.cut()`](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#series-creation). Depending on the type of the input, [it will return either a `Series` or a `Categorical`](https://pandas.pydata.org/docs/reference/api/pandas.cut.html).

```{python, collapse = TRUE}
import numpy as np
stock_returns = np.random.normal(size=10) # not categorical 
# array input means Categorical output
type_of_day = pd.cut(stock_returns, 
                      bins = [-np.inf, 0, np.inf], 
                      labels = ['bad day', 'good day']) 
type(type_of_day)
# Series in means Series out
type_of_day2 = pd.cut(pd.Series(stock_returns), 
                      bins = [-np.inf, 0, np.inf], 
                      labels = ['bad day', 'good day']) 
type(type_of_day2)
```


Finally, when reading in data from an external source, choose carefully whether you want character data to be stored as a string type, or as a categorical type. Here we use [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to read in Fisher's Iris data set [@misc_iris_53] hosted by [@uci_data]. More information on Pandas' `DataFrames` can be found in the next chapter. 


```{python, collapse = TRUE}
import numpy as np
# make 5th col categorical
my_data = pd.read_csv("data/iris.csv", header=None, 
                        dtype = {4:"category"}) 
my_data.head(1)
my_data.dtypes
np.unique(my_data[4]).tolist()
```


## Exercises

### R Questions

1.

Read in this chess data set [@misc_chess], hosted by [@uci_data], with the following code. You will probably have to change your working directory, but if you do, make sure to comment out that code before you submit your script to me. 

```{r, collapse=TRUE, eval=FALSE}
d <- read.csv("kr-vs-kp.data", header=FALSE, stringsAsFactors = TRUE)
head(d)
```

  a) Are all of the columns `factor`s? Assign `TRUE` or `FALSE` to `allFactors`.
  b) Should any of these `factor`s be ordered? Assign `TRUE` or `FALSE` to `ideallyOrdered`. Hint: read the data set description from [https://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King-Pawn%29](https://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King-Pawn%29).
  c) Are any of these factors currently ordered? Assign `TRUE` or `FALSE` to `currentlyOrdered`.
  d) What percent (between $0$ and $100$) of the time is the first column equal to `'f'`? Assign your answer to `percentF`.
  
  
2. 

Suppose you have the following `vector`. Please make sure to include this code in your script. 

```{r, collapse = TRUE}
normSamps <- rnorm(100)
```

  a) create a `factor` from `normSamps`. Map each element to `"within 1 sd"` or `"outside 1 sd"` depending on whether the element is within $1$ theoretical standard deviation of $0$ or not. Call the `factor` `withinOrNot`.


### Python Questions

1.

Consider the following simulated letter grade data for two students:

```{python, collapse = TRUE}
import pandas as pd
import numpy as np
poss_grades = ['A+','A','A-','B+','B','B-',
               'C+','C','C-','D+','D','D-',
               'F']
grade_values = {'A+':4.0,'A':4.0,'A-':3.7,'B+':3.3,'B':3.0,'B-':2.7,
                'C+':2.3,'C':2.0,'C-':1.7,'D+':1.3,'D':1.0,'D-':.67,
                'F':0.0}
student1 = np.random.choice(poss_grades, size = 10, replace = True)
student2 = np.random.choice(poss_grades, size = 12, replace = True)
```


  a) Convert the two Numpy arrays to one of the Pandas types for categorical data that the textbook discussed. Call these two variables `s1` and `s2`. 
  b) These data are categorical. Are they ordinal? Make sure to adjust `s1` and `s2` accordingly.
  c) Calculate the two student GPAs. Assign the floating point numbers to variables named `s1_gpa` and `s2_gpa`. Use `grade_values` to convert each letter grade to a number, and then average all the numbers for each student together using equal weights. 
  d) Is each category equally-spaced? If yes, then these are said to be *interval data*. Does your answer to this question affect the legitimacy of averaging together any ordinal data? Assign a `str` response to the variable `ave_ord_data_response`. Hint: consider (any) two different data sets that happen to produce the same GPA. Is the equality of these two GPAs misleading?
  e) Compute the mode grade for each student. Assign your answers as `str`s to the variables `s1_mode` and `s2_mode`. If there are more than one modes, then assign the one that comes first alphabetically.


2. 
  
Suppose you are creating a *classifier* whose job it is to predict labels. Consider the following `DataFrame` of predicted labels next to their corresponding actual labels. Please make sure to include this code in your script. 

```{python, collapse = TRUE}
import pandas as pd
import numpy as np
d = pd.DataFrame({'predicted label' : [1,2,2,1,2,2,1,2,3,2,2,3], 
                  'actual label': [1,2,3,1,2,3,1,2,3,1,2,3]}, 
                  dtype='category')
d.dtypes[0]
d.dtypes[1]
```
  
  
  a) Assign the prediction accuracy, as a percent (between $0$ and $100$), to the variable `perc_acc`.
  b) Create a *confusion matrix* to better assess *which* labels your classifier has a difficult time with. This should be a $3 \times 3$ Numpy `ndarray` of percentages. The row will correspond to the predicted label, the column will correspond to the actual label, and number in location $(0,2)$, say, will be the percent of observations where your model predicted label `1` and the actual was a label `3`. Call the variable `confusion`.