Simple Indonesian text classifier using Sklearn Pipeline. Two models are currently available:
- TF-IDF vectorization with SGD Classifier (linear SVM) (default) [3]: 61% F1-score with small model size (< 800 KB)
- Word2Vec x TF-IDF vectorization with RBF SVM [4]: 77% F1-score but with huge model size (> 1 GB) due to encapsulation of FastText pretrained word vectors (> 750 MB)
- Python >= 3.x
- Sqlite3
- Install dependencies
pip install -r requirements.txt
- Put datasets in
dataset_labeled.csv
- Run
python train.py
. Example:
$ python train.py
Cross-validating..
Classifier: SGDClassifier
No test_pos_f1 test_pos_precision test_pos_recall test_neg_f1 test_neg_precision test_neg_recall
0 0.625 1.0 0.4545 0.9 0.8182 1.0
1 0.7778 1.0 0.6364 0.931 0.871 1.0
2 0.6667 0.8571 0.5455 0.8966 0.8387 0.963
3 0.5333 1.0 0.3636 0.8852 0.7941 1.0
4 0.625 1.0 0.4545 0.9 0.8182 1.0
5 0.8182 0.8182 0.8182 0.9259 0.9259 0.9259
6 0.7619 0.8889 0.6667 0.9091 0.8621 0.9615
7 0.1333 0.25 0.0909 0.7797 0.697 0.8846
Avg 0.6176 0.8518 0.5038 0.8909 0.8281 0.9669
- Download Indonesian Wikipedia word2vec model from Facebook Research wiki.id.vec and put it in the same directory as
train.py
- Put datasets in
dataset_labeled.csv
- Run
python train.py -m w2v-rbfsvm
- Run
python train.py -m w2v-rbfsvm -e dataset_labeled.vec
to use custom embedding file trained usingword2vec.py
Cross-validating..
Classifier: SVC
No test_pos_f1 test_pos_precision test_pos_recall test_neg_f1 test_neg_precision test_neg_recall
0 0.8421 1.0 0.7273 0.9474 0.9 1.0
1 0.8421 1.0 0.7273 0.9474 0.9 1.0
2 0.8 0.8889 0.7273 0.9286 0.8966 0.963
3 0.625 1.0 0.4545 0.9 0.8182 1.0
4 0.9 1.0 0.8182 0.9643 0.931 1.0
5 0.8696 0.8333 0.9091 0.9434 0.9615 0.9259
6 0.8696 0.9091 0.8333 0.9434 0.9259 0.9615
7 0.4286 1.0 0.2727 0.8667 0.7647 1.0
Avg 0.7721 0.9539 0.6837 0.9301 0.8872 0.9813
You can train your own word2vec using
python word2vec.py
by putting training data (sentences) in first column ofdataset_labeled.csv
- Do training or obtain
model.pkl
(and put it in same location astest.py
) - Install dependencies (if you have not)
pip install -r requirements.txt
- Run
python test.py "sentences_1" "sentences_2" "sentences_n"
. Example:
$ python test.py "Harga Gabah Jatuh karena Hujan Berkepanjangan" "Donatella Klaim Film Serial Pembunuhan Gianni Versace Fiktif"
Preprocessing..
100% (2 of 2) |########################################################################| Elapsed Time: 0:00:00 Time: 0:00:00
Prediction(s):
Harga Gabah Jatuh karena Hujan Berkepanjangan (1)
Donatella Klaim Film Serial Pembunuhan Gianni Versace Fiktif (0)
To automatically crawl titles (and links) from website, run classification and store positive results use crawler.py
:
- Copy
.config.example
to.config
- Run
python crawler.py
To automatically post crawled links from website, use fb.py
:
- Copy
.config.example
to.config
and replace all Facebook config with valid values - Run
python fb.py
To get token without expiry time follow suggestion from documentation :
"To get a longer-lived page access token, exchange the User access token for a long-lived one, as above, and then request the Page access token. The resulting page access token will not have any expiry time."
$ python crawl.py
- Store word2vec model in database to save RAM
- Replace stemming with lemmatizer
- Add more crawling sources
- Try reinforcement learning instead of supervised learning
- Stemmer https://github.com/har07/PySastrawi
- Stopwords list https://github.com/stopwords-iso/stopwords-id
- Text classification with Sklearn Pipeline https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html
- Text classification with Sklearn and Gensim Word2Vec http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
- Facebook FastText pretrained word vectors https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md