NaiveBayesClassifier taking too long #63

Closed
canivel opened this issue May 29, 2014 · 12 comments

Comments

@canivel

canivel commented May 29, 2014

Hi, I have a small dataset of 1000 tweets that I've classified as pos/neg for training. When I tried to use it with NaiveBayesClassifier(), it took about 10-15 minutes to return a result...
Is there a way to save the trained classifier, like a dump, and reuse it for further classifications?

Thanks

@Coaden

Coaden commented May 29, 2014

It can take a while to train the classifier.

Once the classifier is trained, it seems to return polarity fairly quickly.
You could pickle the trained classifier and just read it in later.
This is also a good way to maintain a restful state on an HTTP server.

http://pymotw.com/2/pickle/

@Coaden

Coaden commented May 29, 2014

Also, is your time issue when training, or when using it to classify a new tweet?

@canivel
Author

canivel commented May 29, 2014

When training. Look:

    import os
    import pickle

    # NaiveBayesClassifier and `train` (the labeled tweets) are defined elsewhere

    def save_classifier(classifier):
        f = open('semtiment_classifier.pickle', 'wb')
        pickle.dump(classifier, f, -1)  # -1 selects the highest pickle protocol
        f.close()

    def load_classifier():
        f = open('semtiment_classifier.pickle', 'rb')
        classifier = pickle.load(f)
        f.close()
        return classifier

    # Reuse the pickled classifier if it exists; otherwise train and save it
    if os.path.isfile('semtiment_classifier.pickle'):
        cl = load_classifier()
    else:
        cl = NaiveBayesClassifier(train)
        save_classifier(cl)

    c = cl.classify("This is a fantastic api!")
    print "Classify: {}".format(c)

It saves the classifier, but when I run it again, it returns:

    c = cl.classify("This is a fantastic api!")
    AttributeError: 'function' object has no attribute 'classify' ...

Thanks for the help
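
The thread does not say what caused this error. One common way to get exactly this message is to assign the loader function itself instead of calling it, so the name ends up bound to a function rather than a classifier. The sketch below reproduces that case in Python 3; `DummyClassifier` and this version of `load_classifier` are hypothetical stand-ins, not the poster's actual objects.

```python
class DummyClassifier(object):
    """Hypothetical stand-in for a trained classifier."""

    def classify(self, text):
        return "pos" if "fantastic" in text else "neg"


def load_classifier():
    """Hypothetical loader; a real one would unpickle from disk."""
    return DummyClassifier()


# Missing parentheses: cl is bound to the function, not to its return value.
cl = load_classifier
try:
    cl.classify("This is a fantastic api!")
    error = None
except AttributeError as exc:
    error = str(exc)

# Calling the loader returns the classifier object, and classify() works.
cl = load_classifier()
label = cl.classify("This is a fantastic api!")

print(error)
print(label)
```

If the message appears, printing `type(cl)` right before the failing call shows whether the name holds a classifier or something else.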

@Coaden

Coaden commented May 29, 2014

What you put above is virtually identical to what I have. Are you doing this as a TextBlob user, pickling the TextBlob classifier, or as a TextBlob dev, pickling an NLTK or pattern.en classifier inside TextBlob code?

I use this method with NLTK classifiers and with sklearn classifiers, and it works fine.

Note: I'm using cPickle.

You might try to train classifier C1, then pickle it and load it as C2. Then compare C1 to C2 to see whether pickle is serializing it correctly.
You could also assert isinstance(C2, TextBlob.NaiveBayesClassifier)
or assert isinstance(C2, type(C1))
or assert type(C1) is type(C2)
or alike = type(C1) is type(C2)
etc.
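
The round-trip check suggested above can be sketched as follows. This is a minimal illustration in Python 3, where the standard `pickle` module already uses the C implementation that `cPickle` provided in Python 2. A `collections.Counter` is used as an assumed stand-in for the trained classifier so that the snippet runs without any third-party library; with a real classifier, the same type checks apply to the object you trained and the object you loaded.

```python
import pickle
from collections import Counter

# Stand-in for a trained classifier C1 (assumption: any picklable object
# behaves the same way for the purposes of this check).
c1 = Counter("this is a fantastic api this is great".split())

# Serialize in memory and load the result back as C2.
blob = pickle.dumps(c1, pickle.HIGHEST_PROTOCOL)
c2 = pickle.loads(blob)

same_type = type(c1) is type(c2)   # the loaded object has the same class
same_content = c1 == c2            # and compares equal to the original
distinct_object = c1 is not c2     # but it is a separate copy

assert isinstance(c2, type(c1))
print(same_type, same_content, distinct_object)
```

Note that the equality check only works here because `Counter` defines `==`; a classifier class may not, in which case comparing predictions on a few sample inputs is a reasonable substitute.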

@canivel
Author

canivel commented May 30, 2014

Thanks, I got it working now... I had to work out a lot of Unicode characters in the dataset. Just to let you know, for 942 tweets (pos/neg) it takes 31 s to classify on a new i7 iMac:

    c = classifier.classify("This is a good api!")
    Classify: pos
    31.1534640789 seconds

Is there anything else you'd recommend to improve the execution time? Thanks again!
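
To tell whether the time goes into training or into classification, the two steps can be timed separately. The following is a small sketch in Python 3; `train_classifier` and `classify` are toy, hypothetical stand-ins for the real calls, and only the `timed` helper is the point of the example.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the real training and classification steps.
def train_classifier(dataset):
    return {word for text, _ in dataset for word in text.lower().split()}


def classify(model, text):
    return "pos" if any(w in model for w in text.lower().split()) else "neg"


dataset = [("This is a good api", "pos"), ("I hate this", "neg")]

model, train_seconds = timed(train_classifier, dataset)
label, classify_seconds = timed(classify, model, "This is a good api!")

print("train: {:.6f}s, classify: {:.6f}s".format(train_seconds, classify_seconds))
```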

@shackra

shackra commented May 8, 2015

I'm experiencing this too, but I cannot get rid of the training set because it is in French, which means every string is UTF-8 encoded.

I actually created a training set encoded as ASCII, ignoring the characters outside that encoding (which means losing a lot of data), and the training phase was still taking too much time.

What can I do?
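
The data loss described above is easy to see on a single sentence. The sketch below (Python 3, with a made-up French example sentence) compares encoding to ASCII with errors ignored against first decomposing accented letters with `unicodedata.normalize`, which keeps the base letters. This only illustrates the encoding step; it is not a fix for the training time, which a later comment in this thread attributes to feature extraction.

```python
import unicodedata

text = u"C'est très décevant, ça ne marche pas"

# Dropping every non-ASCII character removes whole letters.
ascii_ignored = text.encode("ascii", "ignore").decode("ascii")

# Decomposing first (NFKD) splits "è" into "e" plus a combining accent,
# so only the accent is dropped and the base letter survives.
decomposed = unicodedata.normalize("NFKD", text)
ascii_folded = decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_ignored)
print(ascii_folded)
```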

@cschwem2er

Hi, I also noticed that the TextBlob classifier takes very long compared to sklearn and the NLTK classifier. But I don't understand why there is a difference, especially for NLTK. Isn't TextBlob using the NLTK version?

@chandra589

How can I get the accuracy and most informative features of a naive Bayes classifier that has already been trained on the movie corpus?

Thanks

@DSA101

DSA101 commented Jul 28, 2016

Experiencing similar performance issues. I'm using a custom NaiveBayes classifier to train on 1500 article titles (pos, neg, NA). It takes about 10 minutes to train on a Core i7, and classifying the same number of titles with a pickled classifier is not much faster (loading is fast, but classification is slow). Any ideas how to speed this up? I can understand that training is slow, but why is classification slow too?

@jcalbert

jcalbert commented Sep 1, 2016

I ran into the same trouble. It seems that basic_extractor re-tokenizes the training set with every update AND for each item in the original training set (so training is O(n^2)).

I'm working on a fix that should also fix #77 and #123.
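
The quadratic pattern described above can be illustrated with a toy example. This is not TextBlob's actual source: `tokenize`, `vocabulary`, `slow_features`, and `fast_features` are hypothetical helpers written only to show the shape of the problem. The slow version rebuilds the vocabulary, re-tokenizing the whole training set, for every document; the fast version tokenizes the training set once and reuses the result.

```python
# Illustrative only: hypothetical helpers, not TextBlob's implementation.
train = [
    ("I love this sandwich", "pos"),
    ("this is an amazing place", "pos"),
    ("I do not like this restaurant", "neg"),
    ("I am tired of this stuff", "neg"),
]

stats = {"tokenize_calls": 0}


def tokenize(text):
    """Toy whitespace tokenizer that counts how often it is called."""
    stats["tokenize_calls"] += 1
    return text.lower().split()


def vocabulary(dataset):
    """Collect every distinct word in the training set."""
    return {word for text, _ in dataset for word in tokenize(text)}


def slow_features(document, dataset):
    """Rebuilds the vocabulary (re-tokenizing all n texts) on every call."""
    tokens = set(tokenize(document))
    return {"contains({})".format(w): (w in tokens) for w in vocabulary(dataset)}


def fast_features(document, vocab):
    """Uses a vocabulary that was computed once, up front."""
    tokens = set(tokenize(document))
    return {"contains({})".format(w): (w in tokens) for w in vocab}


# Quadratic pattern: n documents, each re-tokenizing all n training texts.
stats["tokenize_calls"] = 0
slow = [slow_features(text, train) for text, _ in train]
slow_calls = stats["tokenize_calls"]

# Linear pattern: tokenize the training set once, then once per document.
stats["tokenize_calls"] = 0
vocab = vocabulary(train)
fast = [fast_features(text, vocab) for text, _ in train]
fast_calls = stats["tokenize_calls"]

print(slow_calls, fast_calls, slow == fast)
```

For n training texts, the slow version makes n * (n + 1) tokenizer calls and the fast one 2 * n, while both produce identical feature dictionaries. The same reasoning would explain slow classification if each call rebuilds the vocabulary for a single new document.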

@gcmicah

gcmicah commented Aug 15, 2017

any update on this issue?

@sloria
Owner

sloria commented Aug 16, 2017

#136 is now merged and released to PyPI.

@sloria sloria closed this as completed Aug 16, 2017