Commit 203dbf3 — "comments changes 2"
unknown committed Sep 7, 2020 · 1 parent be70963
Showing 3 changed files with 3,447 additions and 4,121 deletions.
34 changes: 16 additions & 18 deletions SentimentAnalysisAmazon.ipynb
@@ -3,9 +3,7 @@
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
@@ -21,7 +19,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Our Data is in xml formate so we need to process each line to get values from inside tags\n",
"#### Our Data is in xml format so we need to process each line to get values from inside tags\n",
"- readLines in python returns lines from file inside a list so it's a useful tool"
]
},
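A minimal sketch of that line-by-line extraction — the file name and the `<review_text>` tag are assumptions for illustration, not taken from this commit:

```python
# Read the XML-like file as a list of lines, then collect the text
# between <review_text> ... </review_text> tags (tag names assumed).
reviews = []
with open('positive.review', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()  # readlines returns the file as a list of lines

inside, current = False, []
for line in lines:
    if '<review_text>' in line:
        inside, current = True, []
    elif '</review_text>' in line:
        inside = False
        reviews.append(' '.join(current))
    elif inside:
        current.append(line.strip())
```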
@@ -1268,8 +1266,8 @@
"source": [
"## NLTK natural language Tool Kit\n",
"- NLTK is a leading platform for building Python programs to work with human language data(NLP)\n",
"- with ** word_tokenize ** we can extract the words from the text\n",
"- with ** sent_tokenize ** we can extract the sentences from the text"
"- with **word_tokenize** we can extract the words from the text\n",
"- with **sent_tokenize** we can extract the sentences from the text"
]
},
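A quick illustration of both tokenizers (the sample sentence is invented for the example):

```python
import nltk
nltk.download('punkt')  # tokenizer models, downloaded once
from nltk.tokenize import word_tokenize, sent_tokenize

text = "This book is great. Best read ever!"
print(sent_tokenize(text))  # ['This book is great.', 'Best read ever!']
print(word_tokenize(text))  # ['This', 'book', 'is', 'great', '.', 'Best', 'read', 'ever', '!']
```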
{
@@ -1311,7 +1309,7 @@
"metadata": {},
"source": [
"## Stop Words : usless words that we need to eliminate\n",
"- NLTk offers English most popular StopWords with ** stop_Words **"
"- NLTk offers English most popular StopWords with **stop_Words**"
]
},
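A short sketch of dropping stop words with NLTK's English list (sample sentence invented for the example):

```python
import nltk
nltk.download('stopwords')  # English stop word list, downloaded once
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is the greatest cd rack for the job")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # words like 'is', 'the', 'for' are eliminated
```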
{
@@ -1423,7 +1421,7 @@
"- pythoning\n",
"- pythoner\n",
"- ext ..\n",
"- so with the stemming we gain computation"
"- so with the stemming we gain in computation"
]
},
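A minimal stemming sketch with NLTK's PorterStemmer, using the word family listed above:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["python", "pythoned", "pythoning", "pythoner"]:
    print(ps.stem(w))  # all four collapse to the same stem: 'python'
```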
{
@@ -1456,7 +1454,7 @@
"metadata": {},
"source": [
"## Lemmatizing is like the Stemming\n",
"- instead of returning the same words with the last charcteres removed it returns the root of the word or another word synonymous so the returns are true English words\n"
"- Instead of returning the same words with the last characters removed, it returns the root of the word or another word synonymous so the returns are true English words\n"
]
},
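A small sketch of the lemmatizer, showing that the outputs are real English words:

```python
import nltk
nltk.download('wordnet')  # WordNet data for the lemmatizer, downloaded once
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("cats"))              # 'cat'
print(lem.lemmatize("running", pos="v"))  # 'run'
print(lem.lemmatize("better", pos="a"))   # 'good' — a synonym, not a chopped stem
```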
{
@@ -1522,7 +1520,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## see the most redandent and important words"
"## Most redandent and important words"
]
},
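One simple way to surface the most repeated words is NLTK's FreqDist; a minimal sketch on toy tokens:

```python
from nltk import FreqDist

tokens = ["great", "book", "great", "read", "best", "great", "book"]
fdist = FreqDist(tokens)
print(fdist.most_common(2))  # [('great', 3), ('book', 2)]
```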
{
@@ -1743,7 +1741,7 @@
"- You can find elements in a tuple, since this doesn’t change the tuple.\n",
"- You can also use the in operator to check if an element exists in the tuple.\n",
"\n",
"- ** Tuples are faster ** than lists. If you're defining a constant set of values and all you're ever going to do with it is iterate through it, use a tuple instead of a list."
"- **Tuples are faster** than lists. If you're defining a constant set of values and all you're ever going to do with it is iterate through it, use a tuple instead of a list."
]
},
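A tiny illustration of those tuple operations:

```python
# A constant set of values we only iterate over -> use a tuple, not a list.
labels = ("positive", "negative")
print("positive" in labels)      # True — the in operator works on tuples
print(labels.index("negative"))  # 1 — finding an element doesn't change the tuple
```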
{
@@ -1921,7 +1919,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"** pickle works fine **"
"**pickle works fine**"
]
},
{
@@ -1948,7 +1946,7 @@
"metadata": {},
"source": [
"## A Quick Naive Bayes Classification Approche to see if our data preprocessing could give us good results to keep going to further more complex models\n",
"- we will try with the reviews of books only before moving to the product 25 so with the function ** model_Books **"
"- we will try with the reviews of books only before moving to the product 25 so with the function **model_Books**"
]
},
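A hedged sketch of such a quick Naive Bayes pass with NLTK — the feature sets below are invented placeholders; in the notebook they would be built from the preprocessed book reviews:

```python
import nltk

# Hypothetical bag-of-words feature sets: ({word: True, ...}, label)
featuresets = [
    ({"best": True, "ever": True}, "pos"),
    ({"great": True, "read": True}, "pos"),
    ({"boring": True, "waste": True}, "neg"),
    ({"disappointing": True}, "neg"),
]
train_set, test_set = featuresets[:3], featuresets[3:]
clf = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(clf, test_set))
clf.show_most_informative_features(5)
```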
{
@@ -2105,9 +2103,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"** let's test it with our sentences **\n",
"**let's test it with our sentences**\n",
"- our sentence needs to be passed by all functions toknize ,stop_words,lemmatize,irrelevant_words \n",
"- with ** best ** ** ever ** we see that this is a positif revie easy for the classifier "
"- with **best** **ever** we see that this is a positif revie easy for the classifier "
]
},
{
@@ -2164,7 +2162,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"- with ** readable ** it's less obvious but the classifier figure it out"
"- with **readable** it's less obvious but the classifier figure it out"
]
},
{
@@ -2200,7 +2198,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"- ** disappointing ** is a very negatif word"
"- **disappointing** is a very negatif word"
]
},
{
@@ -2285,7 +2283,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
"version": "3.7.6"
}
},
"nbformat": 4,
81 changes: 17 additions & 64 deletions SentimentAnalysisClassification.ipynb
@@ -454,8 +454,6 @@
"\n",
"where $u.v$ is the dot product (or inner product) of two vectors, $||u||_2$ is the norm (or length) of the vector $u$, and $\\theta$ is the angle between $u$ and $v$. This similarity depends on the angle between $u$ and $v$. If $u$ and $v$ are very similar, their cosine similarity will be close to 1; if they are dissimilar, the cosine similarity will take a smaller value. \n",
"\n",
"<caption><center> **Figure 1**: The cosine of the angle between two vectors is a measure of how similar they are</center></caption>\n",
"\n",
"**Exercise**: Implement the function `cosine_similarity()` to evaluate similarity between word vectors.\n",
"\n",
"**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \\sqrt{\\sum_{i=1}^{n} u_i^2}$"
@@ -504,7 +502,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### VEry good we got 0.67 which means that this sentence are close to each other as they are positives comments"
"### Very good we got 0.67 which means that this sentence are close to each other as they are positives comments"
]
},
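For reference, a direct implementation of the `cosine_similarity()` exercise, straight from the formula above (the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = u.v / (||u||_2 * ||v||_2)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Parallel vectors give ~1.0; averaged vectors of two positive reviews
# land closer to 1, like the 0.67 above.
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~1.0
```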
{
@@ -529,7 +527,7 @@
"metadata": {},
"source": [
"### Let's See with different class we take this time two negatif\n",
"- this example is tricky ['This', 'cd', 'storage', 'unit', 'isnt', 'greatest', 'cd', 'rack', 'want', 'something', 'job', 'isnt', 'costly', 'thing', 'made', 'plasti'] because we see that we a naive approche the model will catch the word greatest which is a positif word so it will consider this statment as positive were the truth is this comment containt words befor and after greatest like isnt that help the model to distingush the class this probleme will be delt with when we use the ** RNN ** because it takes into account past and future words in a sequence\n",
"- This example is tricky ['This', 'cd', 'storage', 'unit', 'isnt', 'greatest', 'cd', 'rack', 'want', 'something', 'job', 'isnt', 'costly', 'thing', 'made', 'plasti'] because we see that we a naive approche the model will catch the word greatest which is a positif word so it will consider this statment as positive were the truth is this comment containt words befor and after greatest like isnt that help the model to distingush the class this probleme will be delt with when we use the ** RNN ** because it takes into account past and future words in a sequence\n",
"- For know we focus on the similarity to see the average vectors how they handle this evalution test"
]
},
@@ -933,7 +931,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"** Random Forest Classifier **"
"**Random Forest Classifier**"
]
},
{
@@ -1461,21 +1459,9 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'clf_R' is not defined",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-8-ae9065e6b652>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'RandomForestAvgModel.pickle'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'wb'\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mpickle\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdump\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mclf_R\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mNameError\u001b[0m: name 'clf_R' is not defined"
]
}
],
"outputs": [],
"source": [
"with open('RandomForestAvgModel.pickle', 'wb') as f:\n",
" pickle.dump(clf_R, f)"
@@ -1490,21 +1476,9 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": null,
"metadata": {},
"outputs": [
{
"ename": "EOFError",
"evalue": "Ran out of input",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mEOFError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-19-69863b8f2682>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'RandomForestAvgModel.pickle'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'rb'\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mclf_R\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpickle\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mload\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mf\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mEOFError\u001b[0m: Ran out of input"
]
}
],
"outputs": [],
"source": [
"with open('RandomForestAvgModel.pickle', 'rb') as f:\n",
" clf_R = pickle.load(f)"
@@ -1537,7 +1511,7 @@
"### For our first approche with average word embeddings \n",
"- Best Model So far is the random forest with AUCROC ** 0.79 ** on positif reviews\n",
"- RAndom forest with ** 0.72 ** accuracy\n",
"For next with will try the ** LSTM ** neural network for sequence models that take in a sequence of words and remebers the order of the words we will try ** LSTM **,**GRU** for it's gates to handle the vanishing gradient problem with deep ** RNN ** ,next we will combine ** CNN + LSTM ** ,**LSTM + CNN** and see wihch one gives the better results"
"For next with will try the **LSTM** neural network for sequence models that take in a sequence of words and remebers the order of the words we will try **LSTM**,**GRU** for it's gates to handle the vanishing gradient problem with deep **RNN** ,next we will combine **CNN + LSTM** ,**LSTM + CNN** and see wihch one gives the better results"
]
},
{
@@ -1614,7 +1588,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### fix_sentence_length is a function to fix the size of sentences to a fixed size to feed them into the neural net with specific length sentences with length less than that will be extended with zeros this process don't affect the alogrithme and sentences with length more that the specified length will be truncated"
"- Fix_sentence_length is a function to fix the size of sentences to a fixed size, to feed them into the neural net with specific length.\n",
"- Sentences with a length less than 100 characters will be extended with zeros this process doesn't affect the learning of the model.\n",
"- Sentences with length more that the specified length will be truncated."
]
},
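A sketch consistent with that description — the zero-vector padding and the fallback embedding size are assumptions; the notebook's own `fix_sentence_length` may differ:

```python
import numpy as np

def fix_sentence_length(length, sent):
    # Truncate sentences longer than `length`...
    sent = list(sent[:length])
    # ...and pad shorter ones with zero vectors (embedding size 300 assumed
    # as a fallback when the sentence is empty).
    dim = len(sent[0]) if sent else 300
    while len(sent) < length:
        sent.append(np.zeros(dim))
    return sent

# Usage matching the call seen later in the diff: Sent = fix_sentence_length(100, Sent)
```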
{
@@ -1689,32 +1665,9 @@
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"D:\\Users\\ala94\\Anaconda3\\envs\\DS\\lib\\site-packages\\ipykernel_launcher.py:8: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n",
" \n"
]
},
{
"ename": "NameError",
"evalue": "name 'fix_sentence_length' is not defined",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-12-e8f343fb0ae0>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mwords\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mwv\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mvocab\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mY\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mSent_Embeding_sequence\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mwords\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mDocuments\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m<ipython-input-11-7c567f5da897>\u001b[0m in \u001b[0;36mSent_Embeding_sequence\u001b[1;34m(words, Documents)\u001b[0m\n\u001b[0;32m 13\u001b[0m \u001b[1;31m# Sent = Sent[:100]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 14\u001b[0m \u001b[1;31m#Add Sentence Vector to list\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 15\u001b[1;33m \u001b[0mSent\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mfix_sentence_length\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m100\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mSent\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 16\u001b[0m \u001b[0mX_SentsEmb\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mSent\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 17\u001b[0m \u001b[1;31m#Add label to y_SentEmb\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mNameError\u001b[0m: name 'fix_sentence_length' is not defined"
]
}
],
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"words = list(model.wv.vocab)\n",
"X,Y = Sent_Embeding_sequence(words,Documents)"
@@ -1891,7 +1844,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### For next we will continue on google colabratory for better performence on GPU we will use X and Y\n",
"### Next we will continue on google colabratory for better performence on GPU we will use X and Y\n",
"- In this file you will find the LSTM neural networks architecture :\n",
"https://github.com/alaBay94/Sentiment-analysis-amazon-Products-Reviews/blob/master/SentimentAnalysisClassificationWithLSTM_GoogleColab.ipynb\n",
"- Load LSTM model with on layer We will compare 5 RNN models:\n",
@@ -1900,7 +1853,7 @@
"- LSTM3L_Model with 3 stacked layers\n",
"- BILSTM_Model with Biderctional LSTM\n",
"- CNNLSTM_Model with one convultional layer and LSTM layer\n",
"- ** We also load our test set that these models nerver saw before **"
"- **We also load our test set that these models nerver saw before**"
]
},
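For orientation, a hedged Keras sketch of one of those architectures (CNN + LSTM). The input shape — 100 timesteps of 300-dimensional embeddings — is an assumption; the actual models live in the linked Colab notebook:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

cnn_lstm = Sequential([
    Conv1D(64, 5, activation="relu", input_shape=(100, 300)),  # local n-gram features
    MaxPooling1D(4),                                           # shorten the sequence
    LSTM(64),                                                  # remember word order
    Dense(1, activation="sigmoid"),                            # positive / negative
])
cnn_lstm.compile(optimizer="adam", loss="binary_crossentropy",
                 metrics=["accuracy"])
cnn_lstm.summary()
```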
{
@@ -2367,7 +2320,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
"version": "3.7.6"
},
"nbTranslate": {
"displayLangs": [
