Fake news has become one of the biggest problems of our age. It has a serious impact on our online as well as offline discourse. More and more often we see conflicting claims about the same topic and wonder whether they are true or not. The task of classifying whether a piece of news is fake can be tackled with Python and Machine Learning: we can train a classifier that predicts whether a given article is fake or not.
This is the project I am working on while learning the concepts of Machine Learning and Data Science.
- Aim - The aim of the project is to build a fake news classifier using Natural Language Processing.
- We will take a dataset of labeled news articles and apply classification techniques with frequency-based vectorizers such as the TF-IDF vectorizer and the Count vectorizer.
- In NLP (Natural Language Processing) we encounter stop words, which we will remove, and we will reduce the remaining words to their root form using a stemming technique.
- We can later test different models such as the Naive Bayes Model, the Random Forest Model and K-NN (K-Nearest Neighbour) for accuracy and performance on unseen articles, using both the TF-IDF vectorizer and the Count vectorizer.
I am using a dataset from kaggle.com which contains the following fields:
- id: unique id for a news article
- title: the title of a news article
- author: author of the news article
- text: the text of the article; could be incomplete
- label: a label that marks the article as potentially unreliable
Label 1 -> the article is fake (unreliable)
Label 0 -> the article is not fake (reliable)
- Data Preprocessing - Before training the model we have to preprocess the data, i.e. inspect the structure of the dataset and find out how many values are null.
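A minimal sketch of this step with pandas, assuming the Kaggle CSV is saved locally as `train.csv` (the filename and the fill-with-empty-string choice are assumptions of this sketch):

```python
import pandas as pd

# Load the Kaggle dataset (filename assumed; adjust the path to your copy)
df = pd.read_csv("train.csv")

# Inspect the data structure: column names, dtypes, non-null counts
df.info()

# Count how many values are null in each column
print(df.isnull().sum())

# One simple option: fill missing text fields with an empty string
df = df.fillna("")
```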
- Text Cleaning - After handling the inconsistent data, we have to clean the text itself: removing numbers attached to letters, converting all uppercase letters to lowercase, replacing all \n characters with spaces and removing all non-ASCII characters.
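One possible implementation of this cleaning step with plain regular expressions (the exact rules and the helper name `clean_text` are assumptions of this sketch, not the project's final code):

```python
import re

def clean_text(text: str) -> str:
    # Replace newlines with spaces
    text = text.replace("\n", " ")
    # Remove digits that are attached to letters (e.g. "covid19" -> "covid")
    text = re.sub(r"(?<=[a-zA-Z])\d+|\d+(?=[a-zA-Z])", "", text)
    # Convert everything to lowercase
    text = text.lower()
    # Drop non-ASCII characters
    text = text.encode("ascii", errors="ignore").decode()
    return text

print(clean_text("Breaking\nNEWS covid19 café update"))
# -> "breaking news covid caf update"
```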
- Removing stop words and stemming the text - In natural language processing, words that can be removed from a sentence without changing its meaning are called stop words, for example "a", "an", "the", "in", "on" etc. The Porter Stemming Algorithm is then used to strip common morphological and inflectional endings from words. For more detail about the algorithm you can refer to the link.
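A sketch of both steps using NLTK's English stop word list and Porter stemmer (assuming NLTK is installed; the helper name is hypothetical):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Download the stop word list once (no-op if already present)
nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def remove_stopwords_and_stem(text: str) -> str:
    words = text.split()
    # Keep only non-stop words and reduce each one to its stem
    stemmed = [stemmer.stem(w) for w in words if w not in stop_words]
    return " ".join(stemmed)

print(remove_stopwords_and_stem("the markets are running fast today"))
# -> "market run fast today"
```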
- Tf-idf Vectorizer - TF-IDF stands for “term frequency-inverse document frequency”: the weight assigned to each token depends not only on its frequency in a document but also on how common that term is across the entire corpus.
- Count Vectorizer - The most straightforward one: it counts the number of times a token shows up in the document and uses this value as its weight (a scikit-learn sketch of both vectorizers follows below).
For more details: Click Here
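A sketch of both vectorizers from scikit-learn, applied to the cleaned `text` column and `label` column described above (the 80/20 split and the vectorizer parameters are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split

# Split the cleaned articles and their labels into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# TF-IDF: token weight = term frequency scaled down by document frequency
tfidf = TfidfVectorizer(stop_words="english", max_df=0.7)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Count: token weight = raw number of occurrences in the document
count = CountVectorizer(stop_words="english")
X_train_count = count.fit_transform(X_train)
X_test_count = count.transform(X_test)
```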
- We are using three models:
- Naive Bayes Model
- Random Forest Model
- K-NN
- We use both the TF-IDF Vectorizer and the Count Vectorizer to convert our text strings into numerical representations, then initialize the Naive Bayes Model, the Random Forest Model and K-Nearest Neighbour and fit them on those representations.
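One way to initialize and fit the three models on the TF-IDF features (the same pattern applies to the Count vectorizer features; the specific hyperparameters below are assumptions):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

models = {
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "K-Nearest Neighbour": KNeighborsClassifier(n_neighbors=5),
}

# Fit every model on the TF-IDF representation of the training articles
for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
```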
- At the end we compare all the different models using the confusion matrix and accuracy score from scikit-learn.
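A sketch of the comparison step with scikit-learn's metrics, continuing from the dictionary of fitted models above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Evaluate each fitted model on the held-out TF-IDF test set
for name, model in models.items():
    y_pred = model.predict(X_test_tfidf)
    print(name)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```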