Commit 203dbf3 — "comments changes 2"
unknown committed Sep 7, 2020 · 1 parent be70963
Showing 3 changed files with 3,447 additions and 4,121 deletions.
34 changes: 16 additions & 18 deletions SentimentAnalysisAmazon.ipynb
@@ -3,9 +3,7 @@
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
@@ -21,7 +19,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Our Data is in xml formate so we need to process each line to get values from inside tags\n",
"#### Our Data is in xml format so we need to process each line to get values from inside tags\n",
"- readLines in python returns lines from file inside a list so it's a useful tool"
]
},
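A minimal sketch of that line-by-line extraction — the file name and the `<review_text>` tag are assumptions for illustration, not taken from this commit:

```python
# Read the XML-like file as a list of lines, then collect the text
# between <review_text> ... </review_text> tags (tag names assumed).
reviews = []
with open('positive.review', encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()  # readlines returns the file as a list of lines

inside, current = False, []
for line in lines:
    if '<review_text>' in line:
        inside, current = True, []
    elif '</review_text>' in line:
        inside = False
        reviews.append(' '.join(current))
    elif inside:
        current.append(line.strip())
```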
@@ -1268,8 +1266,8 @@
"source": [
"## NLTK natural language Tool Kit\n",
"- NLTK is a leading platform for building Python programs to work with human language data(NLP)\n",
"- with ** word_tokenize ** we can extract the words from the text\n",
"- with ** sent_tokenize ** we can extract the sentences from the text"
"- with **word_tokenize** we can extract the words from the text\n",
"- with **sent_tokenize** we can extract the sentences from the text"
]
},
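A quick illustration of both tokenizers (the sample sentence is invented for the example):

```python
import nltk
nltk.download('punkt')  # tokenizer models, downloaded once
from nltk.tokenize import word_tokenize, sent_tokenize

text = "This book is great. Best read ever!"
print(sent_tokenize(text))  # ['This book is great.', 'Best read ever!']
print(word_tokenize(text))  # ['This', 'book', 'is', 'great', '.', 'Best', 'read', 'ever', '!']
```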
{
@@ -1311,7 +1309,7 @@
"metadata": {},
"source": [
"## Stop Words : usless words that we need to eliminate\n",
"- NLTk offers English most popular StopWords with ** stop_Words **"
"- NLTk offers English most popular StopWords with **stop_Words**"
]
},
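A short sketch of dropping stop words with NLTK's English list (sample sentence invented for the example):

```python
import nltk
nltk.download('stopwords')  # English stop word list, downloaded once
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is the greatest cd rack for the job")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # words like 'is', 'the', 'for' are eliminated
```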
{
@@ -1423,7 +1421,7 @@
"- pythoning\n",
"- pythoner\n",
"- ext ..\n",
"- so with the stemming we gain computation"
"- so with the stemming we gain in computation"
]
},
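A minimal stemming sketch with NLTK's PorterStemmer, using the word family listed above:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["python", "pythoned", "pythoning", "pythoner"]:
    print(ps.stem(w))  # all four collapse to the same stem: 'python'
```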
{
@@ -1456,7 +1454,7 @@
"metadata": {},
"source": [
"## Lemmatizing is like the Stemming\n",
"- instead of returning the same words with the last charcteres removed it returns the root of the word or another word synonymous so the returns are true English words\n"
"- Instead of returning the same words with the last characters removed, it returns the root of the word or another word synonymous so the returns are true English words\n"
]
},
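A small sketch of the lemmatizer, showing that the outputs are real English words:

```python
import nltk
nltk.download('wordnet')  # WordNet data for the lemmatizer, downloaded once
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("cats"))              # 'cat'
print(lem.lemmatize("running", pos="v"))  # 'run'
print(lem.lemmatize("better", pos="a"))   # 'good' — a synonym, not a chopped stem
```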
{
@@ -1522,7 +1520,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## see the most redandent and important words"
"## Most redandent and important words"
]
},
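One simple way to surface the most repeated words is NLTK's FreqDist; a minimal sketch on toy tokens:

```python
from nltk import FreqDist

tokens = ["great", "book", "great", "read", "best", "great", "book"]
fdist = FreqDist(tokens)
print(fdist.most_common(2))  # [('great', 3), ('book', 2)]
```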
{
@@ -1743,7 +1741,7 @@
"- You can find elements in a tuple, since this doesn’t change the tuple.\n",
"- You can also use the in operator to check if an element exists in the tuple.\n",
"\n",
"- ** Tuples are faster ** than lists. If you're defining a constant set of values and all you're ever going to do with it is iterate through it, use a tuple instead of a list."
"- **Tuples are faster** than lists. If you're defining a constant set of values and all you're ever going to do with it is iterate through it, use a tuple instead of a list."
]
},
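A tiny illustration of those tuple operations:

```python
# A constant set of values we only iterate over -> use a tuple, not a list.
labels = ("positive", "negative")
print("positive" in labels)      # True — the in operator works on tuples
print(labels.index("negative"))  # 1 — finding an element doesn't change the tuple
```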
{
@@ -1921,7 +1919,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"** pickle works fine **"
"**pickle works fine**"
]
},
{
@@ -1948,7 +1946,7 @@
"metadata": {},
"source": [
"## A Quick Naive Bayes Classification Approche to see if our data preprocessing could give us good results to keep going to further more complex models\n",
"- we will try with the reviews of books only before moving to the product 25 so with the function ** model_Books **"
"- we will try with the reviews of books only before moving to the product 25 so with the function **model_Books**"
]
},
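A hedged sketch of such a quick Naive Bayes pass with NLTK — the feature sets below are invented placeholders; in the notebook they would be built from the preprocessed book reviews:

```python
import nltk

# Hypothetical bag-of-words feature sets: ({word: True, ...}, label)
featuresets = [
    ({"best": True, "ever": True}, "pos"),
    ({"great": True, "read": True}, "pos"),
    ({"boring": True, "waste": True}, "neg"),
    ({"disappointing": True}, "neg"),
]
train_set, test_set = featuresets[:3], featuresets[3:]
clf = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(clf, test_set))
clf.show_most_informative_features(5)
```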
{
@@ -2105,9 +2103,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"** let's test it with our sentences **\n",
"**let's test it with our sentences**\n",
"- our sentence needs to be passed by all functions toknize ,stop_words,lemmatize,irrelevant_words \n",
"- with ** best ** ** ever ** we see that this is a positif revie easy for the classifier "
"- with **best** **ever** we see that this is a positif revie easy for the classifier "
]
},
{
@@ -2164,7 +2162,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"- with ** readable ** it's less obvious but the classifier figure it out"
"- with **readable** it's less obvious but the classifier figure it out"
]
},
{
@@ -2200,7 +2198,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"- ** disappointing ** is a very negatif word"
"- **disappointing** is a very negatif word"
]
},
{
@@ -2285,7 +2283,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
"version": "3.7.6"
}
},
"nbformat": 4,
81 changes: 17 additions & 64 deletions SentimentAnalysisClassification.ipynb
@@ -454,8 +454,6 @@
"\n",
"where $u.v$ is the dot product (or inner product) of two vectors, $||u||_2$ is the norm (or length) of the vector $u$, and $\\theta$ is the angle between $u$ and $v$. This similarity depends on the angle between $u$ and $v$. If $u$ and $v$ are very similar, their cosine similarity will be close to 1; if they are dissimilar, the cosine similarity will take a smaller value. \n",
"\n",
"<caption><center> **Figure 1**: The cosine of the angle between two vectors is a measure of how similar they are</center></caption>\n",
"\n",
"**Exercise**: Implement the function `cosine_similarity()` to evaluate similarity between word vectors.\n",
"\n",
"**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \\sqrt{\\sum_{i=1}^{n} u_i^2}$"
@@ -504,7 +502,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### VEry good we got 0.67 which means that this sentence are close to each other as they are positives comments"
"### Very good we got 0.67 which means that this sentence are close to each other as they are positives comments"
]
},
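For reference, a direct implementation of the `cosine_similarity()` exercise, straight from the formula above (the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = u.v / (||u||_2 * ||v||_2)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Parallel vectors give ~1.0; averaged vectors of two positive reviews
# land closer to 1, like the 0.67 above.
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~1.0
```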
{
@@ -529,7 +527,7 @@
"metadata": {},
"source": [
"### Let's See with different class we take this time two negatif\n",
"- this example is tricky ['This', 'cd', 'storage', 'unit', 'isnt', 'greatest', 'cd', 'rack', 'want', 'something', 'job', 'isnt', 'costly', 'thing', 'made', 'plasti'] because we see that we a naive approche the model will catch the word greatest which is a positif word so it will consider this statment as positive were the truth is this comment containt words befor and after greatest like isnt that help the model to distingush the class this probleme will be delt with when we use the ** RNN ** because it takes into account past and future words in a sequence\n",
"- This example is tricky ['This', 'cd', 'storage', 'unit', 'isnt', 'greatest', 'cd', 'rack', 'want', 'something', 'job', 'isnt', 'costly', 'thing', 'made', 'plasti'] because we see that we a naive approche the model will catch the word greatest which is a positif word so it will consider this statment as positive were the truth is this comment containt words befor and after greatest like isnt that help the model to distingush the class this probleme will be delt with when we use the ** RNN ** because it takes into account past and future words in a sequence\n",
"- For know we focus on the similarity to see the average vectors how they handle this evalution test"
]
},
@@ -933,7 +931,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"** Random Forest Classifier **"
"**Random Forest Classifier**"
]
},
{
@@ -1461,21 +1459,9 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'clf_R' is not defined",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-8-ae9065e6b652>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'RandomForestAvgModel.pickle'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'wb'\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mpickle\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdump\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mclf_R\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mNameError\u001b[0m: name 'clf_R' is not defined"
]
}
],
"outputs": [],
"source": [
"with open('RandomForestAvgModel.pickle', 'wb') as f:\n",
" pickle.dump(clf_R, f)"
@@ -1490,21 +1476,9 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": null,
"metadata": {},
"outputs": [
{
"ename": "EOFError",
"evalue": "Ran out of input",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mEOFError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-19-69863b8f2682>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mwith\u001b[0m \u001b[0mopen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'RandomForestAvgModel.pickle'\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m'rb'\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mclf_R\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpickle\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mload\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mf\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mEOFError\u001b[0m: Ran out of input"
]
}
],
"outputs": [],
"source": [
"with open('RandomForestAvgModel.pickle', 'rb') as f:\n",
" clf_R = pickle.load(f)"
@@ -1537,7 +1511,7 @@
"### For our first approche with average word embeddings \n",
"- Best Model So far is the random forest with AUCROC ** 0.79 ** on positif reviews\n",
"- RAndom forest with ** 0.72 ** accuracy\n",
"For next with will try the ** LSTM ** neural network for sequence models that take in a sequence of words and remebers the order of the words we will try ** LSTM **,**GRU** for it's gates to handle the vanishing gradient problem with deep ** RNN ** ,next we will combine ** CNN + LSTM ** ,**LSTM + CNN** and see wihch one gives the better results"
"For next with will try the **LSTM** neural network for sequence models that take in a sequence of words and remebers the order of the words we will try **LSTM**,**GRU** for it's gates to handle the vanishing gradient problem with deep **RNN** ,next we will combine **CNN + LSTM** ,**LSTM + CNN** and see wihch one gives the better results"
]
},
{
@@ -1614,7 +1588,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### fix_sentence_length is a function to fix the size of sentences to a fixed size to feed them into the neural net with specific length sentences with length less than that will be extended with zeros this process don't affect the alogrithme and sentences with length more that the specified length will be truncated"
"- Fix_sentence_length is a function to fix the size of sentences to a fixed size, to feed them into the neural net with specific length.\n",
"- Sentences with a length less than 100 characters will be extended with zeros this process doesn't affect the learning of the model.\n",
"- Sentences with length more that the specified length will be truncated."
]
},
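A sketch consistent with that description — the zero-vector padding and the fallback embedding size are assumptions; the notebook's own `fix_sentence_length` may differ:

```python
import numpy as np

def fix_sentence_length(length, sent):
    # Truncate sentences longer than `length`...
    sent = list(sent[:length])
    # ...and pad shorter ones with zero vectors (embedding size 300 assumed
    # as a fallback when the sentence is empty).
    dim = len(sent[0]) if sent else 300
    while len(sent) < length:
        sent.append(np.zeros(dim))
    return sent

# Usage matching the call seen later in the diff: Sent = fix_sentence_length(100, Sent)
```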
{
@@ -1689,32 +1665,9 @@
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"D:\\Users\\ala94\\Anaconda3\\envs\\DS\\lib\\site-packages\\ipykernel_launcher.py:8: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n",
" \n"
]
},
{
"ename": "NameError",
"evalue": "name 'fix_sentence_length' is not defined",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-12-e8f343fb0ae0>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mwords\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mwv\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mvocab\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mX\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mY\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mSent_Embeding_sequence\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mwords\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mDocuments\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;32m<ipython-input-11-7c567f5da897>\u001b[0m in \u001b[0;36mSent_Embeding_sequence\u001b[1;34m(words, Documents)\u001b[0m\n\u001b[0;32m 13\u001b[0m \u001b[1;31m# Sent = Sent[:100]\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 14\u001b[0m \u001b[1;31m#Add Sentence Vector to list\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 15\u001b[1;33m \u001b[0mSent\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mfix_sentence_length\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m100\u001b[0m\u001b[1;33m,\u001b[0m\u001b[0mSent\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 16\u001b[0m \u001b[0mX_SentsEmb\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mSent\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 17\u001b[0m \u001b[1;31m#Add label to y_SentEmb\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mNameError\u001b[0m: name 'fix_sentence_length' is not defined"
]
}
],
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"words = list(model.wv.vocab)\n",
"X,Y = Sent_Embeding_sequence(words,Documents)"
@@ -1891,7 +1844,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### For next we will continue on google colabratory for better performence on GPU we will use X and Y\n",
"### Next we will continue on google colabratory for better performence on GPU we will use X and Y\n",
"- In this file you will find the LSTM neural networks architecture :\n",
"https://github.com/alaBay94/Sentiment-analysis-amazon-Products-Reviews/blob/master/SentimentAnalysisClassificationWithLSTM_GoogleColab.ipynb\n",
"- Load LSTM model with on layer We will compare 5 RNN models:\n",
@@ -1900,7 +1853,7 @@
"- LSTM3L_Model with 3 stacked layers\n",
"- BILSTM_Model with Biderctional LSTM\n",
"- CNNLSTM_Model with one convultional layer and LSTM layer\n",
"- ** We also load our test set that these models nerver saw before **"
"- **We also load our test set that these models nerver saw before**"
]
},
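For orientation, a hedged Keras sketch of one of those architectures (CNN + LSTM). The input shape — 100 timesteps of 300-dimensional embeddings — is an assumption; the actual models live in the linked Colab notebook:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

cnn_lstm = Sequential([
    Conv1D(64, 5, activation="relu", input_shape=(100, 300)),  # local n-gram features
    MaxPooling1D(4),                                           # shorten the sequence
    LSTM(64),                                                  # remember word order
    Dense(1, activation="sigmoid"),                            # positive / negative
])
cnn_lstm.compile(optimizer="adam", loss="binary_crossentropy",
                 metrics=["accuracy"])
cnn_lstm.summary()
```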
{
@@ -2367,7 +2320,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.4"
"version": "3.7.6"
},
"nbTranslate": {
"displayLangs": [
