Attempting to fix slow NaiveBayes #136
As mentioned in a few issues, NaiveBayes is slow. Test script, using the movie_reviews dataset:
with output:
From the Travis CI failures, it looks like the docstrings on
Thanks @jcalbert for the PR! I'll take a look at this over the weekend.
Three changes: 1) basic_extractor can accept a list of strings as well as a list of ('word','label') tuples. 2) BaseClassifier now has an instance variable _word_set which is a set of tokens seen by the classifier. 1+2) BaseClassifier.extract_features passes _word_set to extractor rather than the training set. 3) NLTKClassifier.update adds new words to the _word_set.
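A minimal sketch of the caching idea described in this commit, assuming a "contains(word)" feature format. The names `collect_words`, `sketch_extractor`, and `SketchClassifier` are hypothetical and only illustrate the approach; this is not the library's actual code:

```python
def collect_words(train_set):
    """Return the set of tokens found in train_set.

    Assumption: train_set holds either plain string tokens or
    (words, label) tuples, where words is a string or an iterable of tokens.
    """
    vocab = set()
    for item in train_set:
        if isinstance(item, str):
            vocab.add(item)
        else:
            words, _label = item
            vocab.update(words.split() if isinstance(words, str) else words)
    return vocab


def sketch_extractor(document, train_set):
    """Map every known word to whether it occurs in document."""
    vocab = collect_words(train_set)
    tokens = set(document.split()) if isinstance(document, str) else set(document)
    return {"contains({0})".format(word): word in tokens for word in vocab}


class SketchClassifier:
    """Hypothetical stand-in for a classifier that caches its word set."""

    def __init__(self, train_set):
        self.train_set = list(train_set)
        # Built once here instead of re-tokenizing the training set
        # on every feature extraction.
        self._word_set = collect_words(self.train_set)

    def extract_features(self, text):
        # Pass the cached word set, not the full training set.
        return sketch_extractor(text, self._word_set)

    def update(self, new_data):
        new_data = list(new_data)
        self.train_set.extend(new_data)
        # Keep the cached word set in sync with the new examples.
        self._word_set.update(collect_words(new_data))
```

The point of the design is that the per-document cost no longer includes tokenizing every training example; only the cached set of words is walked.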
Now returns an empty dict if passed an empty training set. Also covers some cases where train_set is consumed by .next().
Fixed bug where _word_set was based on train_set, even if train_set is filelike instead of iterable.
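One way to handle both cases from the last two commits (an empty training set, and a one-shot iterator that would be consumed by peeking at it) is sketched below. The helpers `peek` and `features_or_empty` are hypothetical names, and the "contains(word)" feature format is an assumption:

```python
from itertools import chain


def peek(iterable):
    """Return (has_items, iterator) without losing the first element."""
    iterator = iter(iterable)
    try:
        first = next(iterator)
    except StopIteration:
        return False, iter(())
    # Put the consumed element back in front of the remaining items.
    return True, chain([first], iterator)


def features_or_empty(document, train_set):
    """Return an empty dict for an empty training set, else word features."""
    has_items, train_set = peek(train_set)
    if not has_items:
        return {}
    tokens = set(document.split())
    vocab = set()
    for words, _label in train_set:
        vocab.update(words.split())
    return {"contains({0})".format(word): word in tokens for word in vocab}
```

Because `peek` chains the first element back on, a generator or file-like training set is still seen in full after the emptiness check.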
Chased down those old bugs. Travis CI is complaining about a translation error, but I think that's Google's problem. Added that issue here: #161.
Line 83 isn't quite Python 3 compatible. It needs to use __next__() to work with Python 3. I think GitHub rendered that as bold, so in case it isn't clear: it's a double underscore on both sides of next, like you would see around an init function.
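For reference, a small sketch of a portable alternative: the built-in next() function (available since Python 2.6) dispatches to .next() on Python 2 and to the double-underscore method on Python 3, so code that calls it does not need to name either method. `first_item` is a hypothetical helper, not code from this PR:

```python
def first_item(iterable, default=None):
    """Return the first element of iterable, or default if it is empty."""
    # The built-in next() works on both Python 2.6+ and Python 3.
    return next(iter(iterable), default)
```

Note that calling this on a one-shot iterator still consumes its first element.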
So, guys?
Thanks for the PR! And sorry for the delayed response.
Three changes:
1) basic_extractor can accept a list of strings as well as a list of
('word','label') tuples.
2) BaseClassifier now has an instance variable _word_set which is a set
of tokens seen by the classifier.
1+2) BaseClassifier.extract_features passes _word_set to extractor
rather than the training set.