Plz hire me. Ty ty.
Tokenizer now returns the best paragraph or sentence that matches the claim. Fixed the issue in the previous version where the claim didn't seem to match the returned value very well. TODO: set up the database. (Can I make someone else do it?)
Chyan214 committed Oct 27, 2019
1 parent 6ae361d commit dea5c80
Showing 4 changed files with 140 additions and 18 deletions.
40 changes: 34 additions & 6 deletions README.md
@@ -1,13 +1,41 @@
# DeepCite
CS506 Project
<p> Google Chrome extension that finds the source of a claim using the BeautifulSoup, spaCy, and gensim libraries. Please see the documentation for implementation details. :trollface:</p>

## Table of Contents
* [Installation](#installation)
* [Testing](#testing)
* [Tasks](#tasks)
* [Authors](#authors)

## Installation
Installations and downloads required before running the application.
### Downloads
* (optional test data) Reddit World News Database: https://www.kaggle.com/rootuser/worldnews-on-reddit

<small> Currently evaluating the Google News pre-trained word2vec vectors </small>

### Library installs
* `pip install beautifulsoup4`
* `pip install spacy`
* `python -m spacy download en_core_web_sm`
* `pip install --upgrade gensim`

<small>Note: the `en_core_web_sm` model may be swapped for a larger one (e.g. `en_core_web_lg`) for higher accuracy; a quick sanity check is sketched below</small>
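
A quick sanity check for the installs above (a minimal sketch; `en_core_web_sm` ships without full word vectors, so treat the similarity score as approximate):

```python
import spacy

# load the small English model downloaded above
nlp = spacy.load("en_core_web_sm")

doc1 = nlp("The sun disappeared behind the clouds.")
doc2 = nlp("Clouds hid the sun.")

# rough semantic similarity between the two sentences
print(doc1.similarity(doc2))
```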

## Testing
### Frontend Testing
To set up the testing framework, run `npm install mocha`.
Then run `npm test`.

## Tasks
### Iteration 1
- [x] Add extension
- [x] Enter Data
- [x] View Results
- [x] Web Scraper
- [x] Word Tokenizer
- [ ] Set up the database

## Authors
Shourya Goel, Jiahe Hu, Vinay Janardhanam, Dillion O'Leary, Noah SickLick, and Catherine Yan
36 changes: 35 additions & 1 deletion extension/backend/tokenizer_files/Documentation.txt
@@ -1,6 +1,18 @@
tokenizer.py

functions:
class Paragraph:
Extends object; used specifically with the PriorityQueue, where the paragraph with the
highest similarity has the highest priority

Attributes
----------
index : int
index of the paragraph or sentence within the source text
similarity : float
similarity score between the text and the claim


Methods:

preprocessing(<type=spacy.Doc object> text):
Lemmatizes non-stop words, removes punctuation, and lowercases all non-proper nouns
Expand All @@ -13,6 +25,28 @@ functions:
-------
processed_text : str
text that has been cleaned accordingly
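
Example
-------
illustrative only; assumes stop words are dropped, and the exact
output depends on the loaded spaCy model's lemmatizer

>>> preprocessing(nlp("The Cats were running!"))
'cat run'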

def print_queue(queue):
Prints the contents of the queue.
Side Effect: queue is left empty

Parameters
----------
queue : PriorityQueue()
queue that should be printed
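
Example
-------
illustrative; relies on Paragraph's __repr__ shown above

>>> pq = q.PriorityQueue()
>>> pq.put(Paragraph(0, 0.91))
>>> print_queue(pq)
index: 0 similarity: 0.91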

def sentence_parsing(text):
Splits a document into sentences

Parameters
----------
text : List(str)
each text[x] is a paragraph

Returns
-------
sentences : List(str)
each sentences[x] is a sentence from the given text
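
Example
-------
illustrative; sentence boundaries come from spaCy's rule-based
sentencizer

>>> sentence_parsing(["First sentence. Second one.", "Third one."])
['First sentence.', 'Second one.', 'Third one.']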


predict(claim, <type=List(str)> text):
36 changes: 36 additions & 0 deletions extension/backend/tokenizer_files/test-file.txt
@@ -0,0 +1,36 @@
('Gravity moves at the Speed of Light and is not Instantaneous. If the Sun were to disappear, we would continue our elliptical orbit for an additional 8 minutes and 20 seconds, the same time it would take us to stop seeing the light (according to General Relativity).\n', 'The rate of this damping can be computed, and one finds that it depends sensitively on the speed of gravity. The fact that gravitational damping is measured at all is a strong indication that the propagation speed of gravity is not infinite. If the calculational framework of general relativity is accepted, the damping can be used to calculate the speed, and the actual measurement confirms that the speed of gravity is equal to the speed of light to within 1%. (Measurements of at least one other binary pulsar system, PSR B1534+12, confirm this result, although so far with less precision.)\n')





('Draconian laws are named after the 1st Greek legislator, Draco, who meted out severe punishment for very minor offenses. These included enforced slavery for any debtor whose status was lower than that of his creditor and the death sentence for stealing a cabbage.\n', 'The laws were particularly harsh. For example, any debtor whose status was lower than that of his creditor was forced into slavery.[9] The punishment was more lenient for those owing a debt to a member of a lower class. The death penalty was the punishment for even minor offences, such as stealing a cabbage.[10] Concerning the liberal use of the death penalty in the Draconic code, Plutarch states: "It is said that Drakon himself, when asked why he had fixed the punishment of death for most offences, answered that he considered these lesser crimes to deserve it, and he had no greater punishment for more important ones".[11]\n')





('Thomas Alva Edison did not invent the light bulb; electric light sources were being experimented since early 1800s and an English Physicist Sir Joseph Swan had a working prototype of electric incandescent light bulb more than a decade earlier than Edison.\n', 'In 1850 an English physicist named Joseph Wilson Swan created a “light bulb” by enclosing carbonized paper filaments in an evacuated glass bulb. And by 1860 he had a working prototype, but the lack of a good vacuum and an adequate supply of electricity resulted in a bulb whose lifetime was much too short to be considered an effective producer of light. However, in the 1870’s better vacuum pumps became available and Swan continued experiments on light bulbs. In 1878, Swan developed a longer lasting light bulb using a treated cotton thread that also removed the problem of early bulb blackening.\n')





('Albert Göring, brother of Hermann Göring. Unlike his brother, Albert was opposed to Nazism and helped many Jews and other persecuted minorities throughout the war. He was shunned in postwar Germany due to his name, and died without any public recognition for his humanitarian efforts.\n', 'In contrast to his brother, however, Albert was opposed to Nazism and helped Jews and others who were persecuted in Nazi Germany.[2] He was shunned in postwar Germany because of his family name, and he died without any public recognition for his humanitarian efforts.[3]')





('it took Pixar almost 3 years of research to perfect Merida’s curly hair for their 2012 film Brave. They spent two months working on a scene where Merida removes her hood and the full volume of her hair is finally revealed.\n', '"It took us almost three years to get the final look for her hair and we spent two months working on the scene where Merida removes her hood and you see the full volume of her hair," said Chung. "')





('real dead bodies were used on the set of “Apocalypse Now.” The man who supplied them turned out to be a grave robber and was arrested', 'They\'d got the stiffs from a guy who supplied bodies to medical schools for autopsies. It turned out he was a grave robber. "The police showed up on our set and took all of our passports," says Frederickson. "They didn\'t know we hadn\'t killed these people because the bodies were unidentified. I was pretty damn worried for a few days. But they got to the truth and put the guy in jail."\n')





46 changes: 35 additions & 11 deletions extension/backend/tokenizer_files/tokenizer.py
@@ -1,5 +1,6 @@
import spacy
from spacy.parts_of_speech import PUNCT, PROPN
from spacy.lang.en import English
import queue as q
import os

@@ -11,6 +12,8 @@
# en_core_web_lg
# google news pre-trained network: https://code.google.com/archive/p/word2vec/
nlp = spacy.load("en_core_web_sm")
nlps = English()
nlps.add_pipe(nlps.create_pipe('sentencizer'))
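# 'nlps' is a blank English pipeline used only for rule-based sentence
# splitting; the statistical 'nlp' model above handles similarity scoring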
# importing different vectors for similarities - word2vec
# training dataset
# nlp = spacy.load('en_core_web_sm', vectors='<directory>')
@@ -30,6 +33,10 @@ def __lt__(self, other):
    def __repr__(self):
        return "index: " + str(self.index) + " similarity: " + str(self.similarity)

# prints priority queue - mostly for testing purposes
def print_queue(queue):
    while not queue.empty():
        print(queue.get())

# any preprocessing of data if necessary
# remove punctuation - all lower case, lemmatization
@@ -46,33 +53,50 @@ def preprocessing(doc):
            clean_claim.append(add)
    return " ".join(x for x in clean_claim if x != "")


# splits each paragraph of the given text into individual sentences
def sentence_parsing(text):
    sentences = []
    for paragraph in text:
        doc3 = nlps(paragraph)
        for sent in doc3.sents:
            sentence = sent.text.strip().replace('\t', '').replace("\n", '')
            if len(sentence) > 0:
                sentences.append(sentence)
    return sentences

# claim is the claim to verify
# text is the text of the article, preferably split paragraph by paragraph
def predict(claim, text):

    queue = q.PriorityQueue()

    clean_claim = preprocessing(nlp(claim))
    doc1 = nlp(clean_claim)

    # compares the claim to each individual paragraph
    for num, paragraph in enumerate(text):
        clean_paragraph = preprocessing(nlp(paragraph))
        doc2 = nlp(clean_paragraph)
        queue.put(Paragraph(num, doc1.similarity(doc2)))

    best_paragraph = queue.get()

    # compares the claim to each individual sentence
    sentence_queue = q.PriorityQueue()
    sentences = sentence_parsing(text)

    for num, sentence in enumerate(sentences):
        clean_sent = preprocessing(nlp(sentence))
        doc3 = nlp(clean_sent)
        sentence_queue.put(Paragraph(num, doc1.similarity(doc3)))

    best_sentence = sentence_queue.get()

    # return whichever match scored higher; 'prediction' avoids shadowing predict()
    prediction = text[best_paragraph.index] if best_paragraph.similarity > best_sentence.similarity else sentences[best_sentence.index]
    # test is solely for testing purposes, checks the return values
    test.append((claim, prediction))

    return prediction
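
# Hypothetical usage (illustration only - the claim and article text are invented):
#   article = ["Gravity propagates at the speed of light.",
#              "Pulsar timing pins the speed of gravity to within 1% of c."]
#   predict("Gravity moves at the speed of light", article)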



