Commit 466ed85

Author: Klaus Strauch
Commit message: merge branch with my old one
Parents: 06b3f0e + d903ec7

File tree

38 files changed: +64651 −371 lines


.gitignore (+20 −1)

```diff
@@ -4,7 +4,26 @@ __pycache__/
 *$py.class
 
 # data folder
-data/
+data/arg-lexicon
+data/hash-sentiments
+data/50mpaths2
+data/bingliunegs.txt
+data/bingliuposs.txt
+data/hashtag-emotion-0.2.txt
+data/subj_score.txt
+data/dataset/TweeboParser/
+data/dataset/spell_checked_parsed.txt
+data/dataset/working_dir/
+data/dataset/spell_checked_parsed.txt
+data/dataset/tweet_data_complete.2tsv
+data/dataset/P3_vaccine-twitter-data.tsv
+data/dataset/tweet_for_dp.txt
+data/dataset/tweet_for_dp.txt.predict
+data/dataset/TweetsAnnotation.txt
+run_tests/
+sistematic_results/
+
+
 
 # C extensions
 *.so
```
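To check that the new patterns behave as intended, `git check-ignore -v` can be run from the repository root, e.g.:

    $ git check-ignore -v data/subj_score.txt

which should report the matching pattern and the `.gitignore` line it comes from (line 13 after this commit).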

README.md (+12)

```diff
@@ -1,6 +1,18 @@
 # HPVTweets
 
+All the resources (such as the annotation file, the parsed tweet file, the Bing Liu sentiment lexicons, ...) should be placed in a folder named `data` inside this project.
+This is because it is the default path in the parser, and it is annoying to pass all those options explicitly!
+
 ## Dependencies
 
+python >= 3.x
+
 pandas >= 0.22.0
+
 nltk >= 3.2.5
+
+hunspell >= 0.5.3
+
+sklearn >= 0.19.1
+
```
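A minimal way to satisfy these dependencies with pip (note that `sklearn` is published on PyPI as `scikit-learn`, and the Python `hunspell` binding needs the system hunspell development headers; the exact command is a suggestion, not from the repository):

    $ pip3 install "pandas>=0.22.0" "nltk>=3.2.5" "hunspell>=0.5.3" "scikit-learn>=0.19.1"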
classify_tweets.py (+305 −273)

Large diff; not rendered here.

data/README.md (+9, new file)

```diff
@@ -0,0 +1,9 @@
+## RESOURCES
+
+Here you should place all the external resources (files) used by the feature extraction methods, namely:
+
+- tweet clusters (file): http://www.cs.cmu.edu/~ark/TweetNLP/#resources
+- Bing Liu sentiment lexicon (files): https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
+- subjectivity score lexicon (file): http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
+- NRC Twitter Sentiment Lexicon, a.k.a. Sentiment140 Lexicon (folder): http://saifmohammad.com/WebPages/lexicons.html (section 4.c)
+- argument lexicon (folder): http://mpqa.cs.pitt.edu/lexicons/arg_lexicon/
```
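Cross-referencing the `.gitignore` entries above, the resulting layout of `data/` would look roughly like the sketch below (paths are taken from the ignore patterns; which resource maps to which path is an assumption):

    data/
    ├── arg-lexicon/             # argument lexicon (folder)
    ├── hash-sentiments/         # NRC/Sentiment140 lexicon (folder)
    ├── 50mpaths2                # tweet clusters
    ├── bingliunegs.txt          # Bing Liu negative words
    ├── bingliuposs.txt          # Bing Liu positive words
    ├── hashtag-emotion-0.2.txt  # hashtag emotion lexicon
    ├── subj_score.txt           # subjectivity score lexicon
    └── dataset/                 # see data/dataset/README.md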

data/dataset/README.md (+23, new file)

```diff
@@ -0,0 +1,23 @@
+## DATASET
+
+Here you should place the downloaded tweets and annotations (https://sbmi.uth.edu/ontology/files/TweetsAnnotationResults.zip).
+
+## STANDARD
+
+Use the script `generate_data_file.py`:
+
+    $ python3 generate_data_file.py -r tweet_file -o output_path
+
+to generate the file that is then fed to `TweeboParser` (https://github.com/ikekonglp/TweeboParser), which produces the annotated data set (PoS tags, dependency parses).
+This intermediate step gets rid of missing tweets (and tweets from expired accounts) that would make the parser crash.
+
+## SPELL CHECKED
+
+Once you have created the file with the dependency parses, spell checking can be applied. For full reproducibility, the files used for spell checking are provided.
+
+    $ python3 generate_data_file.py -p tweet_file_parsed -o output_path
+
+This step is performed at this stage for the following reasons (see the sketch below):
+- tokenization is needed
+- PoS tags (provided by the parse) are needed to avoid spell-checking URLs and emoticons
+- spell checking is expensive
```
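For concreteness, a plausible pair of invocations using the file names that appear in the `.gitignore` above (the exact argument mapping is an assumption, not documented in this commit):

    $ python3 generate_data_file.py -r TweetsAnnotation.txt -o tweet_for_dp.txt
    $ python3 generate_data_file.py -p tweet_for_dp.txt.predict -o spell_checked_parsed.txt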

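And a minimal sketch of the spell-checking idea itself, assuming the `pyhunspell` binding (the `hunspell >= 0.5.3` dependency from the README) and the ARK/TweeboParser PoS tagset, in which `U` marks URLs/email addresses and `E` marks emoticons; dictionary paths and the token format are assumptions:

```python
import hunspell

# System dictionary paths vary by distribution; these are typical on Linux.
checker = hunspell.HunSpell('/usr/share/hunspell/en_US.dic',
                            '/usr/share/hunspell/en_US.aff')

SKIP_TAGS = {'U', 'E'}  # URLs/emails and emoticons: never spell-check these


def spell_check(tagged_tokens):
    """tagged_tokens: list of (token, pos_tag) pairs from a parsed tweet."""
    corrected = []
    for token, pos in tagged_tokens:
        if pos in SKIP_TAGS or checker.spell(token):
            corrected.append(token)  # skipped by tag, or already correct
        else:
            suggestions = checker.suggest(token)
            corrected.append(suggestions[0] if suggestions else token)
    return corrected


print(spell_check([('vacines', 'N'), (':)', 'E'), ('http://t.co/x', 'U')]))
# -> ['vaccines', ':)', 'http://t.co/x']  (first suggestion may vary)
```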