Commit 466ed85

Author: Klaus Strauch
Commit message: merge branch with my old one
Parents: 06b3f0e + d903ec7

File tree

38 files changed: +64651 −371 lines


.gitignore (+20 −1)

```diff
@@ -4,7 +4,26 @@ __pycache__/
 *$py.class
 
 # data folder
-data/
+data/arg-lexicon
+data/hash-sentiments
+data/50mpaths2
+data/bingliunegs.txt
+data/bingliuposs.txt
+data/hashtag-emotion-0.2.txt
+data/subj_score.txt
+data/dataset/TweeboParser/
+data/dataset/spell_checked_parsed.txt
+data/dataset/working_dir/
+data/dataset/spell_checked_parsed.txt
+data/dataset/tweet_data_complete.2tsv
+data/dataset/P3_vaccine-twitter-data.tsv
+data/dataset/tweet_for_dp.txt
+data/dataset/tweet_for_dp.txt.predict
+data/dataset/TweetsAnnotation.txt
+run_tests/
+sistematic_results/
+
+
 
 # C extensions
 *.so
```
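To check that the new patterns behave as intended, `git check-ignore -v` can be run from the repository root, e.g.:

    $ git check-ignore -v data/subj_score.txt

which should report the matching pattern and the `.gitignore` line it comes from (line 13 after this commit).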

README.md (+12)

```diff
@@ -1,6 +1,18 @@
 # HPVTweets
 
+All the resources (such as the annotation file, the parsed tweet file, the Bing Liu sentiment lexicons, ...) should be placed in a folder named `data` inside this project.
+This is because it is the default path in the parser, and it is annoying to pass all those options explicitly!
+
 ## Dependencies
 
+python >= 3.x
+
 pandas >= 0.22.0
+
 nltk >= 3.2.5
+
+hunspell >= 0.5.3
+
+sklearn >= 0.19.1
+
```
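A minimal way to satisfy these dependencies with pip (note that `sklearn` is published on PyPI as `scikit-learn`, and the Python `hunspell` binding needs the system hunspell development headers; the exact command is a suggestion, not from the repository):

    $ pip3 install "pandas>=0.22.0" "nltk>=3.2.5" "hunspell>=0.5.3" "scikit-learn>=0.19.1"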
classify_tweets.py (+305 −273)

Large diff; not rendered here.

data/README.md (+9, new file)

```diff
@@ -0,0 +1,9 @@
+## RESOURCES
+
+Here you should place all the external resources (files) used by the feature extraction methods, namely:
+
+- tweet clusters (file): http://www.cs.cmu.edu/~ark/TweetNLP/#resources
+- Bing Liu sentiment lexicon (files): https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
+- subjectivity score lexicon (file): http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
+- NRC Twitter Sentiment Lexicon, a.k.a. Sentiment140 Lexicon (folder): http://saifmohammad.com/WebPages/lexicons.html (section 4.c)
+- argument lexicon (folder): http://mpqa.cs.pitt.edu/lexicons/arg_lexicon/
```
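Cross-referencing the `.gitignore` entries above, the resulting layout of `data/` would look roughly like the sketch below (paths are taken from the ignore patterns; which resource maps to which path is an assumption):

    data/
    ├── arg-lexicon/             # argument lexicon (folder)
    ├── hash-sentiments/         # NRC/Sentiment140 lexicon (folder)
    ├── 50mpaths2                # tweet clusters
    ├── bingliunegs.txt          # Bing Liu negative words
    ├── bingliuposs.txt          # Bing Liu positive words
    ├── hashtag-emotion-0.2.txt  # hashtag emotion lexicon
    ├── subj_score.txt           # subjectivity score lexicon
    └── dataset/                 # see data/dataset/README.md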

data/dataset/README.md (+23, new file)

```diff
@@ -0,0 +1,23 @@
+## DATASET
+
+Here you should place the downloaded tweets and annotations (https://sbmi.uth.edu/ontology/files/TweetsAnnotationResults.zip).
+
+## STANDARD
+
+Use the script `generate_data_file.py`:
+
+    $ python3 generate_data_file.py -r tweet_file -o output_path
+
+to generate the file that is then fed to `TweeboParser` (https://github.com/ikekonglp/TweeboParser), which produces the annotated data set (PoS tags, dependency parses).
+This intermediate step gets rid of missing tweets (and tweets from expired accounts) that would make the parser crash.
+
+## SPELL CHECKED
+
+Once you have created the file with the dependency parses, spell checking can be applied. For full reproducibility, the files used for spell checking are provided.
+
+    $ python3 generate_data_file.py -p tweet_file_parsed -o output_path
+
+This step is performed at this stage for the following reasons (see the sketch below):
+- tokenization is needed
+- PoS tags (provided by the parse) are needed to avoid spell-checking URLs and emoticons
+- spell checking is expensive
```
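For concreteness, a plausible pair of invocations using the file names that appear in the `.gitignore` above (the exact argument mapping is an assumption, not documented in this commit):

    $ python3 generate_data_file.py -r TweetsAnnotation.txt -o tweet_for_dp.txt
    $ python3 generate_data_file.py -p tweet_for_dp.txt.predict -o spell_checked_parsed.txt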

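And a minimal sketch of the spell-checking idea itself, assuming the `pyhunspell` binding (the `hunspell >= 0.5.3` dependency from the README) and the ARK/TweeboParser PoS tagset, in which `U` marks URLs/email addresses and `E` marks emoticons; dictionary paths and the token format are assumptions:

```python
import hunspell

# System dictionary paths vary by distribution; these are typical on Linux.
checker = hunspell.HunSpell('/usr/share/hunspell/en_US.dic',
                            '/usr/share/hunspell/en_US.aff')

SKIP_TAGS = {'U', 'E'}  # URLs/emails and emoticons: never spell-check these


def spell_check(tagged_tokens):
    """tagged_tokens: list of (token, pos_tag) pairs from a parsed tweet."""
    corrected = []
    for token, pos in tagged_tokens:
        if pos in SKIP_TAGS or checker.spell(token):
            corrected.append(token)  # skipped by tag, or already correct
        else:
            suggestions = checker.suggest(token)
            corrected.append(suggestions[0] if suggestions else token)
    return corrected


print(spell_check([('vacines', 'N'), (':)', 'E'), ('http://t.co/x', 'U')]))
# -> ['vaccines', ':)', 'http://t.co/x']  (first suggestion may vary)
```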