-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
82 lines (62 loc) · 4.75 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
---
title: "PCA for the people"
author: "Karl Rohe"
date: "2024-01-03"
output: github_document
editor_options:
chunk_output_type: console
---
This package introduces a novel formula syntax for PCA. In modern applications (where data is often in "long format"), the formula syntax helps to fluidly imagine PCA as fitting a model, without thinking about matrices. In this way, it provides a layer of abstraction above matrices. Given the formula and the (long) data, the code in this package transforms your data into a proper format for fast PCA via sparse linear algebra. The package also provides code to 1) `pick_dim`: help pick the number of dimensions to compute, 2) `diagnose` whether the data is "dense enough", 3) `core` the data to look only at the dense parts, 4) `plot` the PCs to look for `streaks` (good) and localization (bad), 5) rotate the PCs with varimax, 6) visualize and interpret the dimensions uncovered, and (not yet) 7) make predictions. This package uses "PCA" as a broad term for computing the leading singular vectors of a normalized (sometimes incomplete) matrix. Some might refer to specific instances as factor analysis, correspondence analysis, latent symantic analysis, social network analysis, or low-rank matrix completion, among other possible terms. This is big-tent PCA, all included. `longpca` is in development. So, functions and syntax might change.
The current approach to PCA (principal components analysis) is *matrix first*. This note begins to explore an alternative path, one that is *model first*. The formula syntax provides an alternative way to think about PCA that makes matrices transparent; completely hidden, unless you want to see the code.
I hope this makes PCA legible for folks that have not yet learned linear algebra (just like linear models are legible without solving linear systems of equations).
I am personally inspired by this approach because (despite the fact that I love matrices and linear algebra) I find that this *model first* way of thinking is so much easier and more direct.
This document gives an illustration with a data analysis of the popular `nycflights13` data via PCA. Headline: we find two seasonal effects (annual and weekly) and also the "fly-over-zone" (midwest 4ever. ride or die <3 much love to my midwest fam). Code details follow this analysis.
(Disclaimer: this is very early in this project. So, the syntax and the code is likely to change a great deal. Input is very welcome about ways to improve it.)
#### Install
The functions for PCA for the People are contained in an R package `longpca`. If you do not already have `devtools` installed, you will first need to install that:
```{r, eval = FALSE}
install.packages("devtools")
devtools::install_github("karlrohe/longpca")
```
Thank you to Alex Hayes for helpful feedback in this process and suggesting the name `longpca`.
### PCA the nycflights.
```{r, echo=FALSE,include=FALSE}
library(tidyverse)
library(longpca)
library(nycflights13)
```
The code is fast and nimble. First you define "the model" with a formula... and some data:
```{r, cache=TRUE}
formula = 1 ~ (month & day)*(dest)
im = make_interaction_model(flights, formula)
pcs = pca(im, k = 6)
```
There are three functions to run on `im`: `diagnose`, `pick_dim`, and `pca`.
There are three key functions to run on `pcs`: `plot`, `rotate`, and `top`.
See the vignettes for further illustrations:
1) [In depth example with nycflights13 data](articles/Intro_to_longpca_with_nycflights_data.html)
2) [Inside `make_interaction_model`, you can `parse_text`](articles/parse_text.html)
#### Slightly more detail...
The hope is that *model first* PCA with the formula makes interacting with the matrix / linear algebra unnecessary. That said, it might be instructive to understand the class `interaction_model` to see how it represents a matrix "under the hood".
The function `make_interaction_model` constructs a list with the class `interaction_model`. You can think of this as an abstraction of a matrix...
```{r}
formula = 1 ~ (month & day)*(dest)
im = make_interaction_model(flights,formula)
names(im)
class(im)
```
In this "matrix like thing," the `month & day` index the rows and `dest` indexes the columns. This is because `month & day` come before the interaction `*` in the formula and `dest` comes afterwords.
```{r}
im$row_universe
im$column_universe
```
Then, "the matrix" is in sparse triplet form:
```{r}
im$interaction_tibble
```
If for any reason you actually wanted the sparse `Matrix`...
```{r}
A = get_Matrix(im, import_names = TRUE)
str(A)
```
The hope is that model first PCA with the `interaction_model` makes data analysis more direct, i.e. that you should not need to think about this matrix (too much). Instead, this path is simply a way to estimate a "low rank" statistical model via least squares.