Commit 38925d7

Author: sgarda

add spell checking
1 parent ae6f1f5 commit 38925d7

File tree

6 files changed: +62680 −14 lines


.gitignore (+2)
@@ -12,7 +12,9 @@ data/bingliuposs.txt
 data/hashtag-emotion-0.2.txt
 data/subj_score.txt
 data/dataset/TweeboParser/
+data/dataset/spell_checked_parsed.txt
 data/dataset/working_dir/
+data/dataset/spell_checked_parsed.txt
 data/dataset/tweet_data_complete.2tsv
 data/dataset/P3_vaccine-twitter-data.tsv
 data/dataset/tweet_for_dp.txt

data/dataset/README.md (+13 −2)
@@ -1,9 +1,20 @@
 ## DATASET
 
-Here you should place the downloaded tweet and annotations ( https://sbmi.uth.edu/ontology/files/TweetsAnnotationResults.zip ) .
+Here you should place the downloaded tweets and annotations ( https://sbmi.uth.edu/ontology/files/TweetsAnnotationResults.zip ).
+
+## STANDARD
+
 Use the script `generate_data_file.py`:
 
-    $ python3 generate_data_file.py -t tweet_file -o output_path
+    $ python3 generate_data_file.py -r tweet_file -o output_path
 
 to generate the file that is then fed as input to `TweeboParser` ( https://github.com/ikekonglp/TweeboParser ), used to create the annotated data set (PoS, Dependency Parse).
 The intermediate step is done in order to get rid of missing tweets (and tweets from expired accounts) that make the parser crash.
+
+## SPELL CHECKED
+
+Once you have created the file with the dependency parses, spell checking can be applied. For complete reproducibility, the files used for spell checking are provided.
+This step is performed at this stage for the following reasons:
+- need for tokenization
+- need for PoS tags (provided by the parse) to avoid spell-checking URLs and emoticons
+- spell checking is expensive
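The reasons above can be illustrated with a minimal sketch: given the (token, PoS tag) pairs produced by the parser, only spell-check tokens whose tag is not a URL or emoticon. This assumes the ARK/TweeboParser tagset, where `U` marks URLs and `E` marks emoticons; the `correct` function and its toy dictionary are stand-ins for the project's actual (expensive) spell checker, not its real API.

```python
# Hypothetical sketch of PoS-filtered spell checking.
# 'U' = URL, 'E' = emoticon in the ARK/TweeboParser tagset.

TOY_CORRECTIONS = {"vacine": "vaccine", "recieve": "receive"}

def correct(token):
    """Stand-in for an expensive spell checker."""
    return TOY_CORRECTIONS.get(token.lower(), token)

def spell_check_tokens(tagged_tokens):
    """tagged_tokens: list of (token, pos_tag) pairs from the parser output."""
    skip_tags = {"U", "E"}  # leave URLs and emoticons untouched
    return [tok if tag in skip_tags else correct(tok)
            for tok, tag in tagged_tokens]

print(spell_check_tokens([("vacine", "N"), (":)", "E"), ("http://t.co/x", "U")]))
```

Running the sketch corrects "vacine" while passing the emoticon and URL through unchanged, which is the behavior the PoS-tag filter is meant to guarantee.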
