
Attempting to fix slow NaiveBayes #136

Merged: 4 commits merged into sloria:dev on Aug 16, 2017

Conversation

@jcalbert commented on Sep 1, 2016

Three changes (a rough sketch of the idea follows this list):

1. basic_extractor can accept a list of strings as well as a list of ('word', 'label') tuples.

2. BaseClassifier now has an instance variable _word_set, which is a set of the tokens seen by the classifier.

   (1 + 2) BaseClassifier.extract_features passes _word_set to the extractor rather than the training set.

3. NLTKClassifier.update adds new words to the _word_set.
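
The gist of the speedup, as a minimal sketch: the names basic_extractor, _word_set, extract_features, and update come from the description above, but everything else (including the whitespace tokenization) is simplified and is not the actual TextBlob code.

# Before: the extractor re-derived the vocabulary from the entire training
# set on every call, so training n documents cost roughly O(n^2).
def basic_extractor_old(document, train_set):
    word_features = set(w for text, _label in train_set for w in text.split())
    tokens = set(document.split())
    return {'contains({0})'.format(w): (w in tokens) for w in word_features}

# After: the classifier builds the vocabulary once and hands the cached set
# to the extractor, so each call no longer depends on the training-set size.
def basic_extractor_new(document, word_set):
    tokens = set(document.split())
    return {'contains({0})'.format(w): (w in tokens) for w in word_set}

class SketchClassifier(object):
    def __init__(self, train_set):
        self.train_set = list(train_set)
        self._word_set = set(w for text, _label in self.train_set
                             for w in text.split())

    def extract_features(self, text):
        return basic_extractor_new(text, self._word_set)

    def update(self, new_data):
        # New documents contribute only their own words; nothing is recomputed.
        new_data = list(new_data)
        self.train_set += new_data
        self._word_set.update(w for text, _label in new_data
                              for w in text.split())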

@jcalbert (Author) commented on Sep 1, 2016

As mentioned in a few issues, NaiveBayes is slow.

Test script, using the movie_reviews dataset (run in an IPython session, hence the %timeit line magics; textblob.classifiers_OLD is presumably a local copy of the unpatched module kept for comparison):

import os, sys
from random import shuffle
pos_train = [(open('../movie_reviews/pos/'+fname,'r').read(),'pos') for fname in os.listdir('../movie_reviews/pos/')]
neg_train = [(open('../movie_reviews/neg/'+fname,'r').read(),'neg') for fname in os.listdir('../movie_reviews/neg/')]
all_train = pos_train + neg_train
shuffle(all_train)

#Old version
from textblob.classifiers_OLD import NaiveBayesClassifier as NBC

counts = [2**j for j in range(1,7)]
print 'Baseline: '
baseline_res = []
for n in counts:
    t = %timeit -o -q -n 1 -r 1 classifier = NBC(all_train[:n])
    s = str(n)
    print s + ' reviews' +  ' '*(5-len(s)) + str(round(t.best,3)) + ' sec'
    baseline_res.append(t.best)

from textblob.classifiers import NaiveBayesClassifier as NBC
print 'Modified: '
modified_res = []
for n in counts:
    t = %timeit -o -q -n 1 -r 1 classifier = NBC(all_train[:n])
    s = str(n)
    print s + ' reviews' +  ' '*(5-len(s)) + str(round(t.best,3)) + ' sec'
    modified_res.append(t.best)

with output:

Baseline: 
2 reviews    0.05 sec
4 reviews    0.137 sec
8 reviews    0.569 sec
16 reviews   2.57 sec
32 reviews   10.669 sec
64 reviews   42.275 sec

Modified: 
2 reviews    0.032 sec
4 reviews    0.056 sec
8 reviews    0.137 sec
16 reviews   0.338 sec
32 reviews   0.759 sec
64 reviews   1.803 sec
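
(Reading the numbers: each doubling of the review count roughly quadruples the baseline time, e.g. 10.7 s to 42.3 s from 32 to 64 reviews, which is consistent with quadratic scaling, while the modified version grows by only about 2.4x per doubling, e.g. 0.76 s to 1.8 s, which is much closer to linear.)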

@jcalbert (Author) commented on Sep 1, 2016

Relates to #63, #77 and #123.

@jcalbert (Author) commented on Sep 1, 2016

From the Travis CI failures, it looks like the docstrings on _get_words_from_dataset() and basic_extractor() ask for more restricted training data than test_classifiers.py actually provides (many of the training sets used in the tests are not plain iterables of (text, label) tuples). Separate issue?
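
For context, TextBlob's classifiers accept either an in-memory list of (text, label) tuples or a file-like object that they parse themselves, which is roughly the mismatch described above. The data below is illustrative, and train.csv is a hypothetical file with text,label rows:

from textblob.classifiers import NaiveBayesClassifier

# In-memory training set: a list of (text, label) tuples.
train = [
    ("I love this sandwich.", "pos"),
    ("This view is horrible.", "neg"),
]
cl = NaiveBayesClassifier(train)

# File-like training set: the classifier reads and parses the stream itself.
with open("train.csv") as fp:
    cl = NaiveBayesClassifier(fp, format="csv")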

@sloria (Owner) commented on Sep 1, 2016

Thanks @jcalbert for the PR! I'll take a look at this over the weekend.

Joseph Albert added 2 commits May 6, 2017 18:09
Three changes:

1) basic_extractor can accept a list of strings as well as a list of ('word', 'label') tuples.
2) BaseClassifier now has an instance variable _word_set, which is a set of tokens seen by the classifier.
(1+2) BaseClassifier.extract_features passes _word_set to the extractor rather than the training set.
3) NLTKClassifier.update adds new words to the _word_set.

Now returns an empty dict if passed an empty training set. Also, cover some bases if train_set is consumed by .next().

Fixed bug where _word_set was based on train_set, even if train_set is file-like instead of iterable.
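
A rough sketch of the guards those commit messages describe, continuing the earlier sketch; the helper names and the CSV parsing are illustrative, not the actual TextBlob code:

import csv

def _build_word_set(train_set):
    # Empty training data yields an empty vocabulary, so feature extraction
    # can return an empty dict instead of failing.
    if not train_set:
        return set()
    return set(w for text, _label in train_set for w in text.split())

def _normalize_train_set(train_set):
    if hasattr(train_set, 'read'):
        # File-like input: parse it into (text, label) pairs up front so the
        # vocabulary is built from parsed rows rather than from a raw stream
        # that would be consumed (via next()) and then look empty.
        return [(row[0], row[1]) for row in csv.reader(train_set)]
    # Materialize plain iterators as well; repeated passes (vocabulary
    # building, feature extraction, NLTK training) would otherwise drain them.
    return list(train_set)

The classifier's __init__ would then call _normalize_train_set() first and build _word_set from the result.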
@jcalbert (Author) commented on May 7, 2017

Chased down those old bugs. Travis CI is complaining about a translation error, but I think that's Google's problem; I've opened that as a separate issue: #161.

@iepathos left a review comment
Line 83 isn't quite Python 3 compatible. It needs to use __next__() to work with Python 3. (GitHub's markdown rendered the name as just bold "next", so to be clear: it's next with double underscores on both sides, like you would see around an init function.)
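
For reference, the Python 2 vs. Python 3 difference being pointed out:

# Python 2 iterators expose a .next() method; Python 3 renamed it to .__next__().
it = iter([1, 2, 3])
# value = it.next()     # Python 2 only; AttributeError on Python 3
value = it.__next__()   # the Python 3 spelling the reviewer is asking for
value = next(it)        # the built-in next() works on both Python 2.6+ and 3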

@komuher commented on Aug 10, 2017

So guys?

@jcalbert (Author) commented
As I mentioned above, the failed CI build is because of #161, and #162 fixes that. Once it is merged, this PR (#136), #163, #167, and #170 should all pass CI after a rebase.

@sloria merged commit 57b8969 into sloria:dev on Aug 16, 2017
@sloria (Owner) commented on Aug 16, 2017

Thanks for the PR! And sorry for the delayed response.
