
Attempting to fix slow NaiveBayes #136

Merged: 4 commits merged into sloria:dev on Aug 16, 2017

Conversation

@jcalbert commented on Sep 1, 2016

Three changes (a rough sketch of the idea follows this list):

1. basic_extractor can accept a list of strings as well as a list of ('word', 'label') tuples.

2. BaseClassifier now has an instance variable _word_set, which is a set of the tokens seen by the classifier.

   (1 + 2) BaseClassifier.extract_features passes _word_set to the extractor rather than the training set.

3. NLTKClassifier.update adds new words to the _word_set.
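
The gist of the speedup, as a minimal sketch: the names basic_extractor, _word_set, extract_features, and update come from the description above, but everything else (including the whitespace tokenization) is simplified and is not the actual TextBlob code.

# Before: the extractor re-derived the vocabulary from the entire training
# set on every call, so training n documents cost roughly O(n^2).
def basic_extractor_old(document, train_set):
    word_features = set(w for text, _label in train_set for w in text.split())
    tokens = set(document.split())
    return {'contains({0})'.format(w): (w in tokens) for w in word_features}

# After: the classifier builds the vocabulary once and hands the cached set
# to the extractor, so each call no longer depends on the training-set size.
def basic_extractor_new(document, word_set):
    tokens = set(document.split())
    return {'contains({0})'.format(w): (w in tokens) for w in word_set}

class SketchClassifier(object):
    def __init__(self, train_set):
        self.train_set = list(train_set)
        self._word_set = set(w for text, _label in self.train_set
                             for w in text.split())

    def extract_features(self, text):
        return basic_extractor_new(text, self._word_set)

    def update(self, new_data):
        # New documents contribute only their own words; nothing is recomputed.
        new_data = list(new_data)
        self.train_set += new_data
        self._word_set.update(w for text, _label in new_data
                              for w in text.split())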

@jcalbert (Author) commented on Sep 1, 2016

As mentioned in a few issues, NaiveBayes is slow.

Test script, using the movie_reviews dataset (run in an IPython session, hence the %timeit line magics; textblob.classifiers_OLD is presumably a local copy of the unpatched module kept for comparison):

import os, sys
from random import shuffle
pos_train = [(open('../movie_reviews/pos/'+fname,'r').read(),'pos') for fname in os.listdir('../movie_reviews/pos/')]
neg_train = [(open('../movie_reviews/neg/'+fname,'r').read(),'neg') for fname in os.listdir('../movie_reviews/neg/')]
all_train = pos_train + neg_train
shuffle(all_train)

#Old version
from textblob.classifiers_OLD import NaiveBayesClassifier as NBC

counts = [2**j for j in range(1,7)]
print 'Baseline: '
baseline_res = []
for n in counts:
    t = %timeit -o -q -n 1 -r 1 classifier = NBC(all_train[:n])
    s = str(n)
    print s + ' reviews' +  ' '*(5-len(s)) + str(round(t.best,3)) + ' sec'
    baseline_res.append(t.best)

from textblob.classifiers import NaiveBayesClassifier as NBC
print 'Modified: '
modified_res = []
for n in counts:
    t = %timeit -o -q -n 1 -r 1 classifier = NBC(all_train[:n])
    s = str(n)
    print s + ' reviews' +  ' '*(5-len(s)) + str(round(t.best,3)) + ' sec'
    modified_res.append(t.best)

with output:

Baseline: 
2 reviews    0.05 sec
4 reviews    0.137 sec
8 reviews    0.569 sec
16 reviews   2.57 sec
32 reviews   10.669 sec
64 reviews   42.275 sec

Modified: 
2 reviews    0.032 sec
4 reviews    0.056 sec
8 reviews    0.137 sec
16 reviews   0.338 sec
32 reviews   0.759 sec
64 reviews   1.803 sec
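
(Reading the numbers: each doubling of the review count roughly quadruples the baseline time, e.g. 10.7 s to 42.3 s from 32 to 64 reviews, which is consistent with quadratic scaling, while the modified version grows by only about 2.4x per doubling, e.g. 0.76 s to 1.8 s, which is much closer to linear.)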

@jcalbert (Author) commented on Sep 1, 2016

Relates to #63, #77 and #123.

@jcalbert (Author) commented on Sep 1, 2016

From the Travis CI failures, it looks like the docstrings on _get_words_from_dataset() and basic_extractor() ask for more restricted training data than test_classifiers.py actually provides (many of the training sets used in the tests are not plain iterables of (text, label) tuples). Separate issue?
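
For context, TextBlob's classifiers accept either an in-memory list of (text, label) tuples or a file-like object that they parse themselves, which is roughly the mismatch described above. The data below is illustrative, and train.csv is a hypothetical file with text,label rows:

from textblob.classifiers import NaiveBayesClassifier

# In-memory training set: a list of (text, label) tuples.
train = [
    ("I love this sandwich.", "pos"),
    ("This view is horrible.", "neg"),
]
cl = NaiveBayesClassifier(train)

# File-like training set: the classifier reads and parses the stream itself.
with open("train.csv") as fp:
    cl = NaiveBayesClassifier(fp, format="csv")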

@sloria (Owner) commented on Sep 1, 2016

Thanks @jcalbert for the PR! I'll take a look at this over the weekend.

Joseph Albert added 2 commits May 6, 2017 18:09
Three changes:

1) basic_extractor can accept a list of strings as well as a list of ('word', 'label') tuples.
2) BaseClassifier now has an instance variable _word_set, which is a set of tokens seen by the classifier.
(1+2) BaseClassifier.extract_features passes _word_set to the extractor rather than the training set.
3) NLTKClassifier.update adds new words to the _word_set.

Now returns an empty dict if passed an empty training set. Also, cover some bases if train_set is consumed by .next().

Fixed bug where _word_set was based on train_set, even if train_set is file-like instead of iterable.
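
A rough sketch of the guards those commit messages describe, continuing the earlier sketch; the helper names and the CSV parsing are illustrative, not the actual TextBlob code:

import csv

def _build_word_set(train_set):
    # Empty training data yields an empty vocabulary, so feature extraction
    # can return an empty dict instead of failing.
    if not train_set:
        return set()
    return set(w for text, _label in train_set for w in text.split())

def _normalize_train_set(train_set):
    if hasattr(train_set, 'read'):
        # File-like input: parse it into (text, label) pairs up front so the
        # vocabulary is built from parsed rows rather than from a raw stream
        # that would be consumed (via next()) and then look empty.
        return [(row[0], row[1]) for row in csv.reader(train_set)]
    # Materialize plain iterators as well; repeated passes (vocabulary
    # building, feature extraction, NLTK training) would otherwise drain them.
    return list(train_set)

The classifier's __init__ would then call _normalize_train_set() first and build _word_set from the result.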
@jcalbert (Author) commented on May 7, 2017

Chased down those old bugs. Travis CI is complaining about a translation error, but I think that's Google's problem; I've opened that as a separate issue: #161.

@iepathos left a review comment
Line 83 isn't quite Python 3 compatible. It needs to use __next__() to work with Python 3. (GitHub's markdown rendered the name as just bold "next", so to be clear: it's next with double underscores on both sides, like you would see around an init function.)
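
For reference, the Python 2 vs. Python 3 difference being pointed out:

# Python 2 iterators expose a .next() method; Python 3 renamed it to .__next__().
it = iter([1, 2, 3])
# value = it.next()     # Python 2 only; AttributeError on Python 3
value = it.__next__()   # the Python 3 spelling the reviewer is asking for
value = next(it)        # the built-in next() works on both Python 2.6+ and 3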

@komuher commented on Aug 10, 2017

So guys?

@jcalbert (Author) commented
As I mentioned above, the failed CI build is because of #161, and #162 fixes that. Once it is merged, this PR (#136), #163, #167, and #170 should all pass CI after a rebase.

@sloria merged commit 57b8969 into sloria:dev on Aug 16, 2017
@sloria (Owner) commented on Aug 16, 2017

Thanks for the PR! And sorry for the delayed response.
