---
knit: bookdown::preview_chapter
---
# Text data and strings
<img src="images/text_analysis.png" height=300>
## Why?
- To use realtors' text descriptions to improve the Melbourne housing price model
- To determine the extent of public discontent with train stoppages in Melbourne
- To examine the differences between the first and 6th editions of Darwin's *On the Origin of Species*
- To check whether the sentiment of posts on the Newcastle Jets' public Facebook page reflects their win/loss record
## Process
1. Read in the text
2. Pre-process: remove punctuation and numbers, remove stop words, stem words
3. Tokenise: words, sentences, ngrams, chapters
4. Summarise, model (a minimal sketch of the full pipeline follows this list)
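As a minimal sketch of steps 1-4, run on a made-up one-sentence document rather than real data:
```{r}
library(tidyverse)
library(tidytext)
# Step 1: a made-up one-document corpus
toy <- tibble(text = "The 3 cats sat on 2 mats, purring happily!")
toy %>%
  mutate(text = str_remove_all(text, "[0-9]+")) %>% # step 2: drop numbers
  unnest_tokens(word, text) %>% # step 3: tokenise (punctuation is dropped too)
  anti_join(stop_words, by = "word") %>% # step 2 cont.: remove stop words
  count(word, sort = TRUE) # step 4: summarise as word counts
```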
## Resource
[Text Mining with R by Julia Silge and David Robinson](https://www.tidytextmining.com)
## Getting some books to study
The [Gutenberg project](http://www.gutenberg.org/wiki/Main_Page) provides the text of over 57,000 books free online.
Navigate to *On the Origin of Species* by Charles Darwin. There are two versions available.
You can either download the text directly or use the `gutenbergr` package. To use the package you need to know the `id` of the book, which means looking it up online anyway. (The first edition is `1228`, and the 6th edition is `2009`.)
```{r}
# install.packages("tidytext")
library(tidyverse)
library(tidytext)
library(textdata)
# The tm package is needed because the book has numbers
# in the text that need to be removed, and the tidytext
# package doesn't have a function to do that
# install.packages("tm")
library(tm)
# install.packages("gutenbergr")
library(gutenbergr)
darwin1 <- gutenberg_download(1228)
darwin1
```
## Tokenize
```{r}
darwin1$text <- removeNumbers(darwin1$text)
darwin1_words <- darwin1 %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  mutate(len = str_length(word))
quantile(darwin1_words$n, probs = seq(0.9, 1, 0.01))
darwin1_words %>%
  filter(n > 90) %>%
  ggplot(aes(x = n, y = fct_reorder(word, n))) +
  geom_point() +
  ylab("")
```
## Your turn
Download and tokenize the 6th edition.
```{r results='hide', fig.show='hide'}
darwin6 <- gutenberg_download(2009)
darwin6$text <- removeNumbers(darwin6$text)
darwin6_words <- darwin6 %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  mutate(len = str_length(word))
quantile(darwin6_words$n, probs = seq(0.9, 1, 0.01))
darwin6_words %>%
  filter(n > 90) %>%
  ggplot(aes(x = n, y = fct_reorder(word, n))) +
  geom_point() +
  ylab("")
```
## Compare the word frequency
```{r}
library(plotly)
darwin <- full_join(darwin1_words, darwin6_words, by = "word") %>%
  rename(n_ed1 = n.x, len_ed1 = len.x, n_ed6 = n.y, len_ed6 = len.y)
p <- ggplot(darwin, aes(x = n_ed1, y = n_ed6, label = word)) +
  geom_abline(intercept = 0, slope = 1) +
  geom_point(alpha = 0.5) +
  xlab("First edition") + ylab("6th edition") +
  scale_x_log10() + scale_y_log10() +
  theme(aspect.ratio = 1)
ggplotly(p)
```
#### Your turn
- Does it look like the 6th edition was an expanded version of the first?
- What word is most frequent in both editions?
- Find some words that are not in the first edition but appear in the 6th.
- Find some words that are used in the first edition but not in the 6th.
- Using a linear regression model, find the top few words that appear more often than expected, based on their frequency in the first edition. Then find the top few words that appear less often than expected.
```{r eval=FALSE, echo=FALSE}
# This code is a guide, but you need to think about transformations,
# and perhaps relative increase, or filtering outliers
library(broom)
mod <- lm(n_ed6 ~ n_ed1, data = darwin)
tidy(mod)
fit <- augment(mod, darwin)
ggplot(fit, aes(x = n_ed1, y = .resid, label = word)) +
  geom_point(alpha = 0.5)
ggplotly()
fit %>% arrange(desc(.resid)) %>% top_n(10, .resid)
```
## Book comparison
The idea is to find the important words for the content of each document by decreasing the weight of commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents.
The statistic *term frequency, inverse document frequency*, `tf-idf`, is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
$tf_{word} = \frac{Number~of~times~the~word~appears~in~a~document}{Total~number~of~words~in~the~document}$
$idf_{word} = \log \frac{Number~of~documents}{Number~of~documents~the~word~appears~in}$
$\text{tf-idf} = tf \times idf$
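As a quick hand check of these definitions, with made-up numbers:
```{r}
# A word that makes up 1% of its document and appears in 1 of 2 documents
tf <- 0.01
idf <- log(2 / 1)
tf * idf # about 0.0069; a word in both documents has idf = log(2/2) = 0
```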
```{r}
darwin1_words <- darwin1_words %>%
  mutate(edition = "1")
darwin6_words <- darwin6_words %>%
  mutate(edition = "6")
darwin <- bind_rows(darwin1_words, darwin6_words)
darwin <- darwin %>% bind_tf_idf(word, edition, n)
darwin %>% arrange(desc(tf_idf))
darwin %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(edition) %>%
  top_n(15, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = edition)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~edition, ncol = 2, scales = "free") +
  coord_flip() +
  scale_fill_brewer(palette = "Dark2")
```
What do we learn?
- Mr Mivart appears multiple times in the 6th edition
```{r}
grep("Mivart", darwin6$text)
darwin6[5435,]
```
- The title Prof. is used more often in the 6th edition
- There is a tendency toward Latin names
- Mistletoe was misspelled in the 1st edition (checked below)
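A quick way to verify the spelling difference (a sketch; I'm assuming the variant spelling begins with "missel", so adjust the pattern if nothing matches):
```{r eval=FALSE}
# Look for both spellings in each edition's word list
darwin1_words %>% filter(str_detect(word, "^missel|^mistle"))
darwin6_words %>% filter(str_detect(word, "^missel|^mistle"))
```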
## Exercise
Text Mining with R has an example comparing historical physics textbooks: *Discourse on Floating Bodies* by Galileo Galilei, *Treatise on Light* by Christiaan Huygens, *Experiments with Alternate Currents of High Potential and High Frequency* by Nikola Tesla, and *Relativity: The Special and General Theory* by Albert Einstein. All are available on the Gutenberg project.
Work your way through the [comparison of physics books](https://www.tidytextmining.com/tfidf.html#a-corpus-of-physics-texts) in section 3.4.
## Sentiment Analysis
Sentiment analysis tags words or phrases with an emotion and summarises these, often as a positive or negative state, over a body of text.
Examples of use are:
- Examining the effect of emotional state in Twitter posts
- Determining public reaction to government policy, or to new product releases
- Trying to make money in the stock market by modelling social media posts on listed companies
- Evaluating product reviews on Amazon, restaurants on Zomato, or travel options on TripAdvisor
### Lexicons
The `tidytext` package has a lexicon of sentiments, based on four major sources: [AFINN](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010), [bing](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), [Loughran](https://sraf.nd.edu/textual-analysis/resources/#LM%20Sentiment%20Word%20Lists), and [nrc](http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm) [@mohammad13]
```{r}
sentiments %>% count(sentiment, sort=TRUE)
```
```{r}
sentiments
```
What emotion do these words elicit in you? (After the list, we check how a lexicon tags them.)
- summer
- hot chips
- hug
- lose
- stolen
- smile
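You can look these up directly; note that "hot chips" is a bigram, so it won't match a one-word-per-row lexicon:
```{r eval=FALSE}
tibble(word = c("summer", "hot chips", "hug", "lose", "stolen", "smile")) %>%
  inner_join(get_sentiments("nrc"), by = "word")
```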
#### Different sources
- The `nrc` lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
- The `bing` lexicon categorizes words in a binary fashion into positive and negative categories.
- The `AFINN` lexicon assigns words a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
```{r}
get_sentiments("afinn")
```
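The other lexicons can be retrieved the same way (the `textdata` package may prompt you to download the nrc lexicon the first time):
```{r eval=FALSE}
get_sentiments("bing")
get_sentiments("nrc") %>% count(sentiment, sort = TRUE)
```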
### Analysis
Once you have a bag of words, you need to join the sentiments dictionary to the words data. In particular, the `nrc` lexicon has multiple tags per word, so you may need to use an inner join. `inner_join()` returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
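As a toy illustration of these join rules, with a made-up three-row lexicon rather than the real nrc:
```{r}
words <- tibble(word = c("abandon", "happy", "cat"))
lex <- tibble(word = c("abandon", "abandon", "happy"),
              sentiment = c("fear", "negative", "joy"))
# "abandon" matches two lexicon rows, so it returns twice;
# "cat" has no match and is dropped
inner_join(words, lex, by = "word")
```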
The `janeaustenr` package contains the full text of Jane Austen's 6 completed novels, "Sense and Sensibility", "Pride and Prejudice", "Mansfield Park", "Emma", "Northanger Abbey", and "Persuasion", ready for text analysis.
```{r}
library(janeaustenr)
library(dplyr)
library(stringr)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
```
Let's look at the most frequent joyful words in "Emma".
```{r eval=FALSE}
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
```
"Good" is the most common joyful word, followed by "young", "friend", "hope". All make sense until you see "found". Is found a joyful word?
#### Your turn
- What are the most common "anger" words used in Emma?
- What are the most common "surprise" words used in Emma?
```{r eval=FALSE, echo=FALSE}
nrc_anger <- get_sentiments("nrc") %>%
  filter(sentiment == "anger")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_anger) %>%
  count(word, sort = TRUE)
nrc_surprise <- get_sentiments("nrc") %>%
  filter(sentiment == "surprise")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_surprise) %>%
  count(word, sort = TRUE)
```
### Comparing lexicons
All of the lexicons have a measure of positive or negative. We can tag the words in Emma by each lexicon, and see if they agree.
```{r eval=FALSE}
nrc_pn <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative"))
emma_nrc <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_pn)
emma_bing <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("bing"))
emma_afinn <- tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(get_sentiments("afinn"))
emma_nrc %>% count(sentiment) %>% mutate(prop = n / sum(n))
emma_bing %>% count(sentiment) %>% mutate(prop = n / sum(n))
emma_afinn %>%
  mutate(sentiment = ifelse(value > 0, "positive", "negative")) %>%
  count(sentiment) %>%
  mutate(prop = n / sum(n))
```
#### Your turn
- Using your choice of lexicon (nrc, bing, or afinn), compute the proportion of positive words in each of Austen's books.
- Which book is the most positive? The most negative?
```{r eval=FALSE}
pos_nrc <- tidy_books %>%
  group_by(book) %>%
  inner_join(nrc_pn) %>%
  count(sentiment) %>%
  mutate(pos = n / sum(n)) %>%
  filter(sentiment == "positive")
pos_nrc %>% arrange(desc(pos))
```
### Example: Simpsons
Data from the popular animated TV series *The Simpsons* has been made available on [kaggle](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data/data). This analysis builds on Maria Prokofieva's R-Ladies workshop.
- `simpsons_script_lines.csv`: Contains the text spoken during each episode (including details about which character said it and where)
- `simpsons_characters.csv`: Contains character names and a character id
```{r}
scripts <- read_csv("data/simpsons_script_lines.csv")
chs <- read_csv("data/simpsons_characters.csv")
sc <- left_join(scripts, chs, by = c("character_id" = "id"))
sc %>% count(name, sort=TRUE)
```
#### Pre-process the text
```{r}
sc %>%
  unnest_tokens(word, normalized_text) %>%
  filter(!word %in% stop_words$word) %>%
  count(word, sort = TRUE) %>%
  filter(!is.na(word)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  head(20) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(colour = "white") +
  labs(x = "", y = "count",
       title = "Top 20 words") +
  coord_flip() +
  theme_bw()
```
#### Tag the words with sentiments
Using AFINN, words will be tagged on a negative to positive scale of -5 to 5.
```{r}
sc_word <- sc %>%
  unnest_tokens(word, normalized_text) %>%
  filter(!word %in% stop_words$word) %>%
  count(name, word) %>%
  ungroup() %>%
  filter(!is.na(word))
sc_s <- sc_word %>%
  inner_join(get_sentiments("afinn"), by = "word")
```
#### Examine characters
```{r}
sc_s %>%
  group_by(name) %>%
  summarise(m = mean(value)) %>%
  arrange(desc(m))
```
Oh, maybe we want to focus only on the main characters.
```{r}
keep <- sc %>%
  count(name, sort = TRUE) %>%
  filter(!is.na(name)) %>%
  filter(n > 999)
sc_s %>%
  filter(name %in% keep$name) %>%
  group_by(name) %>%
  summarise(m = mean(value)) %>%
  arrange(m)
```
#### Your turn
1. Bart Simpson is featured at various ages. How has the sentiment of his words changed over his life?
```{r eval=FALSE}
sc %>%
  count(name, sort = TRUE) %>%
  filter(grepl("Bart", name)) %>%
  print(n = 50)
bart_names <- c("Bart Simpson", "Baby Bart",
                "1-Year-Old Bart", "2-Year-Old Bart",
                "5-Year-Old Bart", "80-Year-Old Bart")
bart <- sc %>% filter(name %in% bart_names)
bart_word <- bart %>%
  unnest_tokens(word, normalized_text) %>%
  filter(!word %in% stop_words$word) %>%
  count(name, word) %>%
  ungroup() %>%
  filter(!is.na(word))
bart_s <- bart_word %>%
  inner_join(get_sentiments("afinn"), by = "word")
bart_s %>%
  group_by(name) %>%
  summarise(m = mean(value)) %>%
  arrange(desc(m))
```
2. Repeat the sentiment analysis with the NRC lexicon. What character is the most "angry"? "joyful"?
```{r eval=FALSE}
sc_s <- sc_word %>%
  inner_join(get_sentiments("nrc"), by = "word")
# Mode of a vector: the most frequent value
# (NA when every value is equally frequent)
statmode <- function(x) {
  ta <- table(x)
  tam <- max(ta)
  if (all(ta == tam)) {
    mod <- NA
  } else if (is.numeric(x)) {
    mod <- as.numeric(names(ta)[ta == tam])
  } else {
    mod <- names(ta)[ta == tam]
  }
  mod
}
sc_s %>%
  filter(name %in% keep$name) %>%
  group_by(name) %>%
  summarise(angry = sum(n[sentiment == "anger"]) / sum(n)) %>%
  arrange(desc(angry))
```
### Example: AFL Finals tweets
The `rtweet` package allows you to pull tweets from the archive. It gives only the last 6-9 days worth of data. You need to have a Twitter account, and you need to create an app (it's really basic) in order to pull Twitter data. The instructions that come with the package (https://rtweet.info) are pretty simple to follow.
Given that it is AFL finals week, I thought it might be interesting to look at tweets that use the hashtag "#AFLFinals". Once you have a developer account, this is as simple as
```
library(rtweet)
afl <- search_tweets(
  "#AFLFinals", n = 20000, include_rts = FALSE
)
```
I have provided the data collected over the last week. You can load it with
```{r}
load("data/afl_twitter.rda")
afl
```
#### Your turn
- What is the range of dates of this data?
- Who is the most frequent tweeter using this hashtag?
- Are there some days that have more tweets than others?
- Are there some hours of the day that are more common tweet times? (A starter sketch follows this list.)
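A starter sketch for these questions, assuming the standard `rtweet` columns `created_at` and `screen_name` that the next chunk also uses:
```{r eval=FALSE}
library(lubridate)
range(afl$created_at)                     # date range of the collection
afl %>% count(screen_name, sort = TRUE)   # most frequent tweeters
afl %>% count(day = as.Date(created_at))  # tweets per day
afl %>% count(hour = hour(created_at))    # tweets per hour of the day
```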
#### Analysis
We need to break the text of each tweet into words, tag the words with sentiments, and make a cumulative score for each tweet.
```{r}
library(lubridate)
afl_sentiment <- afl %>%
  group_by(status_id) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn")) %>%
  summarise(sentiment = mean(value, na.rm = TRUE))
afl_sentiment <- afl %>%
  select(status_id, screen_name, created_at, text) %>%
  left_join(afl_sentiment, by = "status_id")
afl_sentiment %>%
  group_by(screen_name) %>%
  summarise(s = mean(sentiment, na.rm = TRUE)) %>%
  arrange(desc(s))
afl_sentiment %>%
  mutate(day = as.Date(created_at)) %>%
  ggplot(aes(x = day, y = sentiment)) +
  geom_point() +
  geom_smooth(se = FALSE)
afl_sentiment %>%
  filter(screen_name == "aflratings") %>%
  mutate(day = as.Date(created_at)) %>%
  ggplot(aes(x = day, y = sentiment)) +
  geom_point() +
  geom_smooth()
```
- Which tweeter is the most positive? The most negative?
- Is there a day when spirits were higher in the tweets? Or when tweets were more negative?
- Does the tweeter `aflratings` have a trend in positivity or negativity?