---
knit: bookdown::preview_chapter
---
# Text data and strings
<img src="images/text_analysis.png" height=300>
## Why?
- To use realtors' text descriptions to improve the Melbourne housing price model
- To determine the extent of public discontent with train stoppages in Melbourne
- To examine the differences between the first and 6th editions of Darwin's *On the Origin of Species*
- To check whether the sentiment of posts on the Newcastle Jets' public Facebook page reflects their win/loss record
## Process
1. Read in the text
2. Pre-process: remove punctuation and numbers, remove stop words, stem words
3. Tokenise: words, sentences, ngrams, chapters
4. Summarise, model (a minimal sketch of the full pipeline follows this list)
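As a minimal sketch of steps 1-4, run on a made-up one-sentence document rather than real data:
```{r}
library(tidyverse)
library(tidytext)
# Step 1: a made-up one-document corpus
toy <- tibble(text = "The 3 cats sat on 2 mats, purring happily!")
toy %>%
  mutate(text = str_remove_all(text, "[0-9]+")) %>% # step 2: drop numbers
  unnest_tokens(word, text) %>% # step 3: tokenise (punctuation is dropped too)
  anti_join(stop_words, by = "word") %>% # step 2 cont.: remove stop words
  count(word, sort = TRUE) # step 4: summarise as word counts
```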
## Resource
[Text Mining with R by Julia Silge and David Robinson](https://www.tidytextmining.com)
## Getting some books to study
The [Gutenberg project](http://www.gutenberg.org/wiki/Main_Page) provides the text of over 57,000 books free online.
Navigate to *On the Origin of Species* by Charles Darwin. There are two versions available.
You can either download the text directly or use the `gutenbergr` package. To use the package you need to know the `id` of the book, which means looking it up online anyway. (The first edition is `1228`, and the 6th edition is `2009`.)
```{r}
# install.packages("tidytext")
library(tidyverse)
library(tidytext)
library(textdata)
# The tm package is needed because the book has numbers
# in the text that need to be removed, and the tidytext
# package doesn't have a function to do that
# install.packages("tm")
library(tm)
# install.packages("gutenbergr")
library(gutenbergr)
darwin1 <- gutenberg_download(1228)
darwin1
```
## Tokenize
```{r}
darwin1$text <- removeNumbers(darwin1$text)
darwin1_words <- darwin1 %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  mutate(len = str_length(word))
quantile(darwin1_words$n, probs = seq(0.9, 1, 0.01))
darwin1_words %>%
  filter(n > 90) %>%
  ggplot(aes(x = n, y = fct_reorder(word, n))) +
  geom_point() +
  ylab("")
```
## Your turn
Download and tokenize the 6th edition.
```{r results='hide', fig.show='hide'}
darwin6 <- gutenberg_download(2009)
darwin6$text <- removeNumbers(darwin6$text)
darwin6_words <- darwin6 %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  mutate(len = str_length(word))
quantile(darwin6_words$n, probs = seq(0.9, 1, 0.01))
darwin6_words %>%
  filter(n > 90) %>%
  ggplot(aes(x = n, y = fct_reorder(word, n))) +
  geom_point() +
  ylab("")
```
## Compare the word frequency
```{r}
library(plotly)
darwin <- full_join(darwin1_words, darwin6_words, by = "word") %>%
  rename(n_ed1 = n.x, len_ed1 = len.x, n_ed6 = n.y, len_ed6 = len.y)
p <- ggplot(darwin, aes(x = n_ed1, y = n_ed6, label = word)) +
  geom_abline(intercept = 0, slope = 1) +
  geom_point(alpha = 0.5) +
  xlab("First edition") + ylab("6th edition") +
  scale_x_log10() + scale_y_log10() +
  theme(aspect.ratio = 1)
ggplotly(p)
```
#### Your turn
- Does it look like the 6th edition was an expanded version of the first?
- What word is most frequent in both editions?
- Find some words that are not in the first edition but appear in the 6th.
- Find some words that are used in the first edition but not in the 6th.
- Using a linear regression model, find the top few words that appear more often than expected, based on their frequency in the first edition. Then find the top few words that appear less often than expected.
```{r eval=FALSE, echo=FALSE}
# This code is a guide, but you need to think about transformations,
# and perhaps relative increase, or filtering outliers
library(broom)
mod <- lm(n_ed6 ~ n_ed1, data = darwin)
tidy(mod)
fit <- augment(mod, darwin)
ggplot(fit, aes(x = n_ed1, y = .resid, label = word)) +
  geom_point(alpha = 0.5)
ggplotly()
fit %>% arrange(desc(.resid)) %>% top_n(10, .resid)
```
## Book comparison
The idea is to find the important words for the content of each document by decreasing the weight of commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents.
The statistic *term frequency, inverse document frequency*, `tf-idf`, is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
$tf_{word} = \frac{Number~of~times~the~word~appears~in~a~document}{Total~number~of~words~in~the~document}$
$idf_{word} = \log \frac{Number~of~documents}{Number~of~documents~the~word~appears~in}$
$\text{tf-idf} = tf \times idf$
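As a quick hand check of these definitions, with made-up numbers:
```{r}
# A word that makes up 1% of its document and appears in 1 of 2 documents
tf <- 0.01
idf <- log(2 / 1)
tf * idf # about 0.0069; a word in both documents has idf = log(2/2) = 0
```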
```{r}
darwin1_words <- darwin1_words %>%
  mutate(edition = "1")
darwin6_words <- darwin6_words %>%
  mutate(edition = "6")
darwin <- bind_rows(darwin1_words, darwin6_words)
darwin <- darwin %>% bind_tf_idf(word, edition, n)
darwin %>% arrange(desc(tf_idf))
darwin %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(edition) %>%
  top_n(15, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = edition)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~edition, ncol = 2, scales = "free") +
  coord_flip() +
  scale_fill_brewer(palette = "Dark2")
```
What do we learn?
- Mr Mivart appears multiple times in the 6th edition
```{r}
grep("Mivart", darwin6$text)
darwin6[5435,]
```
- The title Prof. is used more often in the 6th edition
- There is a tendency toward Latin names
- Mistletoe was misspelled in the 1st edition (checked below)
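A quick way to verify the spelling difference (a sketch; I'm assuming the variant spelling begins with "missel", so adjust the pattern if nothing matches):
```{r eval=FALSE}
# Look for both spellings in each edition's word list
darwin1_words %>% filter(str_detect(word, "^missel|^mistle"))
darwin6_words %>% filter(str_detect(word, "^missel|^mistle"))
```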
## Exercise
Text Mining with R has an example comparing historical physics textbooks: *Discourse on Floating Bodies* by Galileo Galilei, *Treatise on Light* by Christiaan Huygens, *Experiments with Alternate Currents of High Potential and High Frequency* by Nikola Tesla, and *Relativity: The Special and General Theory* by Albert Einstein. All are available on the Gutenberg project.
Work your way through the [comparison of physics books](https://www.tidytextmining.com/tfidf.html#a-corpus-of-physics-texts) in section 3.4.
## Sentiment Analysis
Sentiment analysis tags words or phrases with an emotion and summarises these, often as a positive or negative state, over a body of text.
Examples of use are:
- Examining the effect of emotional state in Twitter posts
- Determining public reaction to government policy, or to new product releases
- Trying to make money in the stock market by modelling social media posts on listed companies
- Evaluating product reviews on Amazon, restaurants on Zomato, or travel options on TripAdvisor
### Lexicons
The `tidytext` package has a lexicon of sentiments, based on four major sources: [AFINN](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010), [bing](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), [Loughran](https://sraf.nd.edu/textual-analysis/resources/#LM%20Sentiment%20Word%20Lists), and [nrc](http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm) [@mohammad13]
```{r}
sentiments %>% count(sentiment, sort=TRUE)
```
```{r}
sentiments
```
What emotion do these words elicit in you? (After the list, we check how a lexicon tags them.)
- summer
- hot chips
- hug
- lose
- stolen
- smile
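You can look these up directly; note that "hot chips" is a bigram, so it won't match a one-word-per-row lexicon:
```{r eval=FALSE}
tibble(word = c("summer", "hot chips", "hug", "lose", "stolen", "smile")) %>%
  inner_join(get_sentiments("nrc"), by = "word")
```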
#### Different sources
- The `nrc` lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
- The `bing` lexicon categorizes words in a binary fashion into positive and negative categories.
- The `AFINN` lexicon assigns words a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
```{r}
get_sentiments("afinn")
```
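The other lexicons can be retrieved the same way (the `textdata` package may prompt you to download the nrc lexicon the first time):
```{r eval=FALSE}
get_sentiments("bing")
get_sentiments("nrc") %>% count(sentiment, sort = TRUE)
```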
### Analysis
Once you have a bag of words, you need to join the sentiments dictionary to the words data. In particular, the `nrc` lexicon has multiple tags per word, so you may need to use an inner join. `inner_join()` returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
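As a toy illustration of these join rules, with a made-up three-row lexicon rather than the real nrc:
```{r}
words <- tibble(word = c("abandon", "happy", "cat"))
lex <- tibble(word = c("abandon", "abandon", "happy"),
              sentiment = c("fear", "negative", "joy"))
# "abandon" matches two lexicon rows, so it returns twice;
# "cat" has no match and is dropped
inner_join(words, lex, by = "word")
```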
The `janeaustenr` package contains the full text of Jane Austen's 6 completed novels, "Sense and Sensibility", "Pride and Prejudice", "Mansfield Park", "Emma", "Northanger Abbey", and "Persuasion", ready for text analysis.
```{r}
library(janeaustenr)
library(dplyr)
library(stringr)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
```
Let's look at the most frequent joyful words in "Emma".
```{r eval=FALSE}
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
```
"Good" is the most common joyful word, followed by "young", "friend", "hope". All make sense until you see "found". Is found a joyful word?
#### Your turn
- What are the most common "anger" words used in Emma?
- What are the most common "surprise" words used in Emma?
```{r eval=FALSE, echo=FALSE}
nrc_anger <- get_sentiments("nrc") %>%
  filter(sentiment == "anger")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_anger) %>%
  count(word, sort = TRUE)
nrc_surprise <- get_sentiments("nrc") %>%
  filter(sentiment == "surprise")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_surprise) %>%
  count(word, sort = TRUE)
```
### Comparing lexicons
All of the lexicons have a measure of positive or negative. We can tag the words in Emma by each lexicon, and see if they agree.
```{r eval=FALSE}
nrc_pn <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative"))
emma_nrc <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_pn)
emma_bing <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("bing"))
emma_afinn <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("afinn"))
emma_nrc %>% count(sentiment) %>% mutate(prop = n / sum(n))
emma_bing %>% count(sentiment) %>% mutate(prop = n / sum(n))
emma_afinn %>%
  mutate(sentiment = ifelse(value > 0, "positive", "negative")) %>%
  count(sentiment) %>%
  mutate(prop = n / sum(n))
```
#### Your turn
- Using your choice of lexicon (nrc, bing, or afinn), compute the proportion of positive words in each of Austen's books.
- Which book is the most positive? The most negative?
```{r eval=FALSE}
pos_nrc <- tidy_books %>%
  group_by(book) %>%
  inner_join(nrc_pn) %>%
  count(sentiment) %>%
  mutate(pos = n / sum(n)) %>%
  filter(sentiment == "positive")
pos_nrc %>% arrange(desc(pos))
```
### Example: Simpsons
Data from the popular animated TV series *The Simpsons* has been made available on [kaggle](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data/data). This analysis builds on Maria Prokofieva's R-Ladies workshop.
- `simpsons_script_lines.csv`: Contains the text spoken during each episode (including details about which character said it and where)
- `simpsons_characters.csv`: Contains character names and a character id
```{r}
scripts <- read_csv("data/simpsons_script_lines.csv")
chs <- read_csv("data/simpsons_characters.csv")
sc <- left_join(scripts, chs, by = c("character_id" = "id"))
sc %>% count(name, sort=TRUE)
```
#### Pre-process the text
```{r}
sc %>%
  unnest_tokens(word, normalized_text) %>%
  filter(!word %in% stop_words$word) %>%
  count(word, sort = TRUE) %>%
  filter(!is.na(word)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  head(20) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(colour = "white") +
  labs(x = "", y = "count",
       title = "Top 20 words") +
  coord_flip() +
  theme_bw()
```
#### Tag the words with sentiments
Using AFINN, words will be tagged on a negative to positive scale of -5 to 5.
```{r}
sc_word <- sc %>%
  unnest_tokens(word, normalized_text) %>%
  filter(!word %in% stop_words$word) %>%
  count(name, word) %>%
  ungroup() %>%
  filter(!is.na(word))
sc_s <- sc_word %>%
  inner_join(get_sentiments("afinn"), by = "word")
```
#### Examine characters
```{r}
sc_s %>%
  group_by(name) %>%
  summarise(m = mean(value)) %>%
  arrange(desc(m))
```
Oh, maybe we want to focus only on the main characters.
```{r}
keep <- sc %>%
  count(name, sort = TRUE) %>%
  filter(!is.na(name)) %>%
  filter(n > 999)
sc_s %>%
  filter(name %in% keep$name) %>%
  group_by(name) %>%
  summarise(m = mean(value)) %>%
  arrange(m)
```
#### Your turn
1. Bart Simpson is featured at various ages. How has the sentiment of his words changed over his life?
```{r eval=FALSE}
sc %>%
  count(name, sort = TRUE) %>%
  filter(grepl("Bart", name)) %>%
  print(n = 50)
bart_names <- c("Bart Simpson", "Baby Bart",
                "1-Year-Old Bart", "2-Year-Old Bart",
                "5-Year-Old Bart", "80-Year-Old Bart")
bart <- sc %>% filter(name %in% bart_names)
bart_word <- bart %>%
  unnest_tokens(word, normalized_text) %>%
  filter(!word %in% stop_words$word) %>%
  count(name, word) %>%
  ungroup() %>%
  filter(!is.na(word))
bart_s <- bart_word %>%
  inner_join(get_sentiments("afinn"), by = "word")
bart_s %>%
  group_by(name) %>%
  summarise(m = mean(value)) %>%
  arrange(desc(m))
```
2. Repeat the sentiment analysis with the NRC lexicon. What character is the most "angry"? "joyful"?
```{r eval=FALSE}
sc_s <- sc_word %>%
  inner_join(get_sentiments("nrc"), by = "word")
# Mode of a vector: the most frequent value
# (NA when every value is equally frequent)
statmode <- function(x) {
  ta <- table(x)
  tam <- max(ta)
  if (all(ta == tam)) {
    mod <- NA
  } else if (is.numeric(x)) {
    mod <- as.numeric(names(ta)[ta == tam])
  } else {
    mod <- names(ta)[ta == tam]
  }
  mod
}
sc_s %>%
  filter(name %in% keep$name) %>%
  group_by(name) %>%
  summarise(angry = sum(n[sentiment == "anger"]) / sum(n)) %>%
  arrange(desc(angry))
```
### Example: AFL Finals tweets
The `rtweet` package allows you to pull tweets from the archive. It gives only the last 6-9 days worth of data. You need to have a Twitter account, and you need to create an app (it's really basic) in order to pull Twitter data. The instructions that come with the package (https://rtweet.info) are pretty simple to follow.
Given that it is AFL finals week, I thought it might be interesting to look at tweets that use the hashtag "#AFLFinals". Once you have a developer account, this is as simple as
```
library(rtweet)
afl <- search_tweets(
  "#AFLFinals", n = 20000, include_rts = FALSE
)
```
I have provided the data collected over the last week. You can load it with
```{r}
load("data/afl_twitter.rda")
afl
```
#### Your turn
- What is the range of dates of this data?
- Who is the most frequent tweeter using this hashtag?
- Are there some days that have more tweets than others?
- Are there some hours of the day that are more common tweet times? (A starter sketch follows this list.)
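A starter sketch for these questions, assuming the standard `rtweet` columns `created_at` and `screen_name` that the next chunk also uses:
```{r eval=FALSE}
library(lubridate)
range(afl$created_at)                     # date range of the collection
afl %>% count(screen_name, sort = TRUE)   # most frequent tweeters
afl %>% count(day = as.Date(created_at))  # tweets per day
afl %>% count(hour = hour(created_at))    # tweets per hour of the day
```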
#### Analysis
We need to break the text of each tweet into words, tag the words with sentiments, and make a cumulative score for each tweet.
```{r}
library(lubridate)
afl_sentiment <- afl %>%
  group_by(status_id) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn")) %>%
  summarise(sentiment = mean(value, na.rm = TRUE))
afl_sentiment <- afl %>%
  select(status_id, screen_name, created_at, text) %>%
  left_join(afl_sentiment, by = "status_id")
afl_sentiment %>%
  group_by(screen_name) %>%
  summarise(s = mean(sentiment, na.rm = TRUE)) %>%
  arrange(desc(s))
afl_sentiment %>%
  mutate(day = as.Date(created_at)) %>%
  ggplot(aes(x = day, y = sentiment)) +
  geom_point() +
  geom_smooth(se = FALSE)
afl_sentiment %>%
  filter(screen_name == "aflratings") %>%
  mutate(day = as.Date(created_at)) %>%
  ggplot(aes(x = day, y = sentiment)) +
  geom_point() +
  geom_smooth()
```
- Which tweeter is the most positive? The most negative?
- Is there a day when spirits were higher in the tweets? Or when tweets were more negative?
- Does the tweeter `aflratings` have a trend in positivity or negativity?