Plz hire me. Ty ty.
Tokenizer now returns the best paragraph or sentence that matches the claim. Fixed the issue in the previous version where the claim didn't seem to match the returned value very well. TODO: set up the database. (Can I make someone else do it?)
Chyan214 committed Oct 27, 2019
1 parent 6ae361d commit dea5c80
Showing 4 changed files with 140 additions and 18 deletions.
40 changes: 34 additions & 6 deletions README.md
@@ -1,13 +1,41 @@
# DeepCite
CS506 Project
<p> Google Chrome extension that finds the source of a claim using the BeautifulSoup, spaCy, and gensim libraries. Please see the documentation for implementation details. :trollface:</p>

## Table of Contents
* [Installation](#installation)
* [Testing](#testing)
* [Tasks](#tasks)
* [Authors](#authors)

## Installation
Installations and downloads required before running the application.
### Downloads
* (optional test data) Reddit World News Database: https://www.kaggle.com/rootuser/worldnews-on-reddit

<small> Currently evaluating the Google News pre-trained word2vec vectors </small>

### Library installs
* `pip install beautifulsoup4`
* `pip install spacy`
* `python -m spacy download en_core_web_sm`
* `pip install --upgrade gensim`

<small>Note: the `en_core_web_sm` model may be swapped for a larger one (e.g. `en_core_web_lg`) for higher accuracy; a quick sanity check is sketched below</small>
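
A quick sanity check for the installs above (a minimal sketch; `en_core_web_sm` ships without full word vectors, so treat the similarity score as approximate):

```python
import spacy

# load the small English model downloaded above
nlp = spacy.load("en_core_web_sm")

doc1 = nlp("The sun disappeared behind the clouds.")
doc2 = nlp("Clouds hid the sun.")

# rough semantic similarity between the two sentences
print(doc1.similarity(doc2))
```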

## Testing
### Frontend Testing
To set up the testing framework, run `npm install mocha`.
Then run `npm test`.

## Tasks
### Iteration 1
- [x] Add extension
- [x] Enter Data
- [x] View Results
- [x] Web Scraper
- [x] Word Tokenizer
- [ ] Set up the database

## Authors
Shourya Goel, Jiahe Hu, Vinay Janardhanam, Dillion O'Leary, Noah SickLick, and Catherine Yan
36 changes: 35 additions & 1 deletion extension/backend/tokenizer_files/Documentation.txt
@@ -1,6 +1,18 @@
tokenizer.py

functions:
class Paragraph:
Extends object; used specifically with the PriorityQueue, where the paragraph with the
highest similarity has the highest priority

Attributes
----------
index : int
index of the paragraph or sentence within the source text
similarity : float
similarity score between the text and the claim


Methods:

preprocessing(<type=spacy.Doc object> text):
Lemmatizes non-stop words, removes punctuation, and lowercases all non-proper nouns
Expand All @@ -13,6 +25,28 @@ functions:
-------
processed_text : str
text that has been cleaned accordingly
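
Example
-------
illustrative only; assumes stop words are dropped, and the exact
output depends on the loaded spaCy model's lemmatizer

>>> preprocessing(nlp("The Cats were running!"))
'cat run'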

def print_queue(queue):
Prints the contents of the queue.
Side Effect: queue is left empty

Parameters
----------
queue : PriorityQueue()
queue that should be printed
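
Example
-------
illustrative; relies on Paragraph's __repr__ shown above

>>> pq = q.PriorityQueue()
>>> pq.put(Paragraph(0, 0.91))
>>> print_queue(pq)
index: 0 similarity: 0.91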

def sentence_parsing(text):
Splits a document into sentences

Parameters
----------
text : List(str)
each text[x] is a paragraph

Returns
-------
sentences : List(str)
each sentences[x] is a sentence from the given text
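
Example
-------
illustrative; sentence boundaries come from spaCy's rule-based
sentencizer

>>> sentence_parsing(["First sentence. Second one.", "Third one."])
['First sentence.', 'Second one.', 'Third one.']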


predict(claim, <type=List(str)> text):
36 changes: 36 additions & 0 deletions extension/backend/tokenizer_files/test-file.txt
@@ -0,0 +1,36 @@
('Gravity moves at the Speed of Light and is not Instantaneous. If the Sun were to disappear, we would continue our elliptical orbit for an additional 8 minutes and 20 seconds, the same time it would take us to stop seeing the light (according to General Relativity).\n', 'The rate of this damping can be computed, and one finds that it depends sensitively on the speed of gravity. The fact that gravitational damping is measured at all is a strong indication that the propagation speed of gravity is not infinite. If the calculational framework of general relativity is accepted, the damping can be used to calculate the speed, and the actual measurement confirms that the speed of gravity is equal to the speed of light to within 1%. (Measurements of at least one other binary pulsar system, PSR B1534+12, confirm this result, although so far with less precision.)\n')





('Draconian laws are named after the 1st Greek legislator, Draco, who meted out severe punishment for very minor offenses. These included enforced slavery for any debtor whose status was lower than that of his creditor and the death sentence for stealing a cabbage.\n', 'The laws were particularly harsh. For example, any debtor whose status was lower than that of his creditor was forced into slavery.[9] The punishment was more lenient for those owing a debt to a member of a lower class. The death penalty was the punishment for even minor offences, such as stealing a cabbage.[10] Concerning the liberal use of the death penalty in the Draconic code, Plutarch states: "It is said that Drakon himself, when asked why he had fixed the punishment of death for most offences, answered that he considered these lesser crimes to deserve it, and he had no greater punishment for more important ones".[11]\n')





('Thomas Alva Edison did not invent the light bulb; electric light sources were being experimented since early 1800s and an English Physicist Sir Joseph Swan had a working prototype of electric incandescent light bulb more than a decade earlier than Edison.\n', 'In 1850 an English physicist named Joseph Wilson Swan created a “light bulb” by enclosing carbonized paper filaments in an evacuated glass bulb. And by 1860 he had a working prototype, but the lack of a good vacuum and an adequate supply of electricity resulted in a bulb whose lifetime was much too short to be considered an effective producer of light. However, in the 1870’s better vacuum pumps became available and Swan continued experiments on light bulbs. In 1878, Swan developed a longer lasting light bulb using a treated cotton thread that also removed the problem of early bulb blackening.\n')





('Albert Göring, brother of Hermann Göring. Unlike his brother, Albert was opposed to Nazism and helped many Jews and other persecuted minorities throughout the war. He was shunned in postwar Germany due to his name, and died without any public recognition for his humanitarian efforts.\n', 'In contrast to his brother, however, Albert was opposed to Nazism and helped Jews and others who were persecuted in Nazi Germany.[2] He was shunned in postwar Germany because of his family name, and he died without any public recognition for his humanitarian efforts.[3]')





('it took Pixar almost 3 years of research to perfect Merida’s curly hair for their 2012 film Brave. They spent two months working on a scene where Merida removes her hood and the full volume of her hair is finally revealed.\n', '"It took us almost three years to get the final look for her hair and we spent two months working on the scene where Merida removes her hood and you see the full volume of her hair," said Chung. "')





('real dead bodies were used on the set of “Apocalypse Now.” The man who supplied them turned out to be a grave robber and was arrested', 'They\'d got the stiffs from a guy who supplied bodies to medical schools for autopsies. It turned out he was a grave robber. "The police showed up on our set and took all of our passports," says Frederickson. "They didn\'t know we hadn\'t killed these people because the bodies were unidentified. I was pretty damn worried for a few days. But they got to the truth and put the guy in jail."\n')





46 changes: 35 additions & 11 deletions extension/backend/tokenizer_files/tokenizer.py
@@ -1,5 +1,6 @@
import spacy
from spacy.parts_of_speech import PUNCT, PROPN
from spacy.lang.en import English
import queue as q
import os

@@ -11,6 +12,8 @@
# en_core_web_lg
# google news pre-trained network: https://code.google.com/archive/p/word2vec/
nlp = spacy.load("en_core_web_sm")
nlps = English()
nlps.add_pipe(nlps.create_pipe('sentencizer'))
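# 'nlps' is a blank English pipeline used only for rule-based sentence
# splitting; the statistical 'nlp' model above handles similarity scoring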
# importing different vectors for similarities - word2vec
# training dataset
# nlp = spacy.load('en_core_web_sm', vectors='<directory>')
@@ -30,6 +33,10 @@ def __lt__(self, other):
    def __repr__(self):
        return "index: " + str(self.index) + " similarity: " + str(self.similarity)

# prints priority queue - mostly for testing purposes
def print_queue(queue):
    while not queue.empty():
        print(queue.get())

# any preprocessing of data if necessary
# remove punctuation - all lower case, lemmatization
@@ -46,33 +53,50 @@ def preprocessing(doc):
            clean_claim.append(add)
    return " ".join(x for x in clean_claim if x != "")


# splits each paragraph of the given text into individual sentences
def sentence_parsing(text):
    sentences = []
    for paragraph in text:
        doc3 = nlps(paragraph)
        for sent in doc3.sents:
            sentence = sent.text.strip().replace('\t', '').replace("\n", '')
            if len(sentence) > 0:
                sentences.append(sentence)
    return sentences

# claim is the claim to verify
# text is the text of the article, preferably split paragraph by paragraph
def predict(claim, text):

    queue = q.PriorityQueue()

    clean_claim = preprocessing(nlp(claim))
    doc1 = nlp(clean_claim)

    # compares the claim to each individual paragraph
    for num, paragraph in enumerate(text):
        clean_paragraph = preprocessing(nlp(paragraph))
        doc2 = nlp(clean_paragraph)
        queue.put(Paragraph(num, doc1.similarity(doc2)))

    best_paragraph = queue.get()

    # compares the claim to each individual sentence
    sentence_queue = q.PriorityQueue()
    sentences = sentence_parsing(text)

    for num, sentence in enumerate(sentences):
        clean_sent = preprocessing(nlp(sentence))
        doc3 = nlp(clean_sent)
        sentence_queue.put(Paragraph(num, doc1.similarity(doc3)))

    best_sentence = sentence_queue.get()

    # return whichever match scored higher; 'prediction' avoids shadowing predict()
    prediction = text[best_paragraph.index] if best_paragraph.similarity > best_sentence.similarity else sentences[best_sentence.index]
    # test is solely for testing purposes, checks the return values
    test.append((claim, prediction))

    return prediction
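
# Hypothetical usage (illustration only - the claim and article text are invented):
#   article = ["Gravity propagates at the speed of light.",
#              "Pulsar timing pins the speed of gravity to within 1% of c."]
#   predict("Gravity moves at the speed of light", article)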



