From 9cd0bff7ab012ce6c25f391a2100d1d744aa5f6a Mon Sep 17 00:00:00 2001 From: wannaphong Date: Tue, 19 Dec 2023 17:17:59 +0000 Subject: [PATCH] =?UTF-8?q?Deploying=20to=20gh-pages=20from=20=20@=202c2ac?= =?UTF-8?q?85ac80712f4609fabf3ddf49577cb58c557=20=F0=9F=9A=80?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- index.html | 2 +- search/search_index.json | 2 +- sitemap.xml.gz | Bin 127 -> 127 bytes tokenizer/index.html | 1 + 4 files changed, 3 insertions(+), 2 deletions(-) diff --git a/index.html b/index.html index dfb9b18..c9ba7fe 100644 --- a/index.html +++ b/index.html @@ -195,5 +195,5 @@ diff --git a/search/search_index.json b/search/search_index.json index 2e3bd2e..74b5d8f 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Model Cards These model cards contain technical details of the models developed and used in PyThaiNLP. PyThaiNLP Homepage: https://pythainlp.github.io/ . GitHub: PyThaiNLP/Model-Cards Cite Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji ID, Gebru T. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019 Jan 29 (pp. 220-229).","title":"Model Cards"},{"location":"#model-cards","text":"These model cards contain technical details of the models developed and used in PyThaiNLP. PyThaiNLP Homepage: https://pythainlp.github.io/ . GitHub: PyThaiNLP/Model-Cards Cite Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji ID, Gebru T. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019 Jan 29 (pp. 
220-229).","title":"Model Cards"},{"location":"CLS/","text":"CLS Blackboard CLS V1.0 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2022-10-14 Model version: 1.0 Used in PyThaiNLP version: 3.2 + Filename: pythainlp/corpus/blackboard-cls_v1.0.crfsuite GitHub: https://github.com/PyThaiNLP/pythainlp/issues/729 CRF Model License: CC0 Intended Use Segmenting Thai text into clauses (smaller than a sentence but bigger than a word). Not suitable for other languages or non-news domains. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. Training Data Blackboard treebank Evaluation Data Blackboard treebank Quantitative Analyses precision recall f1-score support B_CLS 1.00 1.00 1.00 91698 E_CLS 1.00 1.00 1.00 91700 I_CLS 1.00 1.00 1.00 707795 micro avg 1.00 1.00 1.00 891193 macro avg 1.00 1.00 1.00 891193 weighted avg 1.00 1.00 1.00 891193 samples avg 1.00 1.00 1.00 891193 Ethical Considerations It was trained on the Blackboard treebank, so it may inherit biases from that corpus. Caveats and Recommendations The user must perform word segmentation before using this model. Thai text only. LST20 CLS v0.2 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-10-03 Model version: 0.2 Used in PyThaiNLP version: 2.2.4 + Filename: ~/pythainlp-data/cls-v0.2.crfsuite GitHub: https://github.com/PyThaiNLP/pythainlp/pull/479 CRF Model License: CC0 Intended Use Segmenting Thai text into clauses (smaller than a sentence but bigger than a word). Not suitable for other languages or non-news domains. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. 
Training Data LST20 Corpus Train set (news domain) Evaluation Data LST20 Corpus Test set (news domain) Quantitative Analyses precision recall f1-score support B_CLS 0.90 0.94 0.92 16111 E_CLS 0.90 0.94 0.92 15947 I_CLS 0.99 0.97 0.98 169565 micro avg 0.97 0.97 0.97 201623 macro avg 0.93 0.95 0.94 201623 weighted avg 0.97 0.97 0.97 201623 samples avg 0.94 0.94 0.94 201623 Ethical Considerations It was trained on the LST20 Corpus, so it may inherit biases from that corpus. Caveats and Recommendations The user must perform word segmentation before using this model. Thai text only.","title":"CLS"},{"location":"CLS/#cls","text":"","title":"CLS"},{"location":"CLS/#blackboard-cls","text":"","title":"Blackboard CLS"},{"location":"CLS/#v10","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2022-10-14 Model version: 1.0 Used in PyThaiNLP version: 3.2 + Filename: pythainlp/corpus/blackboard-cls_v1.0.crfsuite GitHub: https://github.com/PyThaiNLP/pythainlp/issues/729 CRF Model License: CC0 Intended Use Segmenting Thai text into clauses (smaller than a sentence but bigger than a word). Not suitable for other languages or non-news domains. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. Training Data Blackboard treebank Evaluation Data Blackboard treebank Quantitative Analyses precision recall f1-score support B_CLS 1.00 1.00 1.00 91698 E_CLS 1.00 1.00 1.00 91700 I_CLS 1.00 1.00 1.00 707795 micro avg 1.00 1.00 1.00 891193 macro avg 1.00 1.00 1.00 891193 weighted avg 1.00 1.00 1.00 891193 samples avg 1.00 1.00 1.00 891193 Ethical Considerations It was trained on the Blackboard treebank, so it may inherit biases from that corpus. Caveats and Recommendations The user must perform word segmentation before using this model. 
Thai text only.","title":"V1.0"},{"location":"CLS/#lst20-cls","text":"","title":"LST20 CLS"},{"location":"CLS/#v02","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-10-03 Model version: 0.2 Used in PyThaiNLP version: 2.2.4 + Filename: ~/pythainlp-data/cls-v0.2.crfsuite GitHub: https://github.com/PyThaiNLP/pythainlp/pull/479 CRF Model License: CC0 Intended Use Segmenting Thai text into clauses (smaller than a sentence but bigger than a word). Not suitable for other languages or non-news domains. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. Training Data LST20 Corpus Train set (news domain) Evaluation Data LST20 Corpus Test set (news domain) Quantitative Analyses precision recall f1-score support B_CLS 0.90 0.94 0.92 16111 E_CLS 0.90 0.94 0.92 15947 I_CLS 0.99 0.97 0.98 169565 micro avg 0.97 0.97 0.97 201623 macro avg 0.93 0.95 0.94 201623 weighted avg 0.97 0.97 0.97 201623 samples avg 0.94 0.94 0.94 201623 Ethical Considerations It was trained on the LST20 Corpus, so it may inherit biases from that corpus. Caveats and Recommendations The user must perform word segmentation before using this model. Thai text only.","title":"v0.2"},{"location":"Chunk%20Parser/","text":"Chunk Parser CRFChunk orchidpp v0.2 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-01-21 Model version: 0.2 Used in PyThaiNLP version: 2.3 GitHub: https://github.com/PyThaiNLP/pythainlp/pull/524 License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/pull/1 Dataset: ORCHID++ from Thai Treebanks Dataset. We extract sentence subtrees from the trees to create training data. 
(from 5,000 trees to 5,935 trees) Intended Use Parses Thai sentences into phrase structure. Not suitable for other languages or domains outside the ORCHID corpus. Factors Based on known Thai chunk parsing problems. Metrics Evaluation metrics include precision, recall, and F1-score. Training Data ORCHID++ (90%) from Thai Treebanks Dataset Evaluation Data ORCHID++ (10%) from Thai Treebanks Dataset Quantitative Analyses precision recall f1-score support B-NP 0.95 0.98 0.96 518 I-NP 0.86 0.91 0.88 2128 O 0.87 0.91 0.89 280 B-PP 0.91 0.77 0.83 65 I-PP 0.66 0.52 0.59 252 B-S 0.65 0.49 0.56 90 I-S 0.67 0.49 0.56 1082 B-VP 0.86 0.89 0.88 515 I-VP 0.90 0.94 0.92 4565 micro avg 0.86 0.86 0.86 9495 macro avg 0.81 0.77 0.79 9495 weighted avg 0.86 0.86 0.86 9495 samples avg 0.86 0.86 0.86 9495 Ethical Considerations It was trained on the ORCHID++ corpus, so it may inherit biases from that corpus. Caveats and Recommendations Input is one Thai sentence as [(word, part-of-speech)] pairs (the part-of-speech model was trained on the ORCHID corpus).","title":"Chunk Parser"},{"location":"Chunk%20Parser/#chunk-parser","text":"","title":"Chunk Parser"},{"location":"Chunk%20Parser/#crfchunk-orchidpp","text":"","title":"CRFChunk orchidpp"},{"location":"Chunk%20Parser/#v02","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-01-21 Model version: 0.2 Used in PyThaiNLP version: 2.3 GitHub: https://github.com/PyThaiNLP/pythainlp/pull/524 License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/pull/1 Dataset: ORCHID++ from Thai Treebanks Dataset. We extract sentence subtrees from the trees to create training data (from 5,000 trees to 5,935 trees). Intended Use Parses Thai sentences into phrase structure. Not suitable for other languages or domains outside the ORCHID corpus. Factors Based on known Thai chunk parsing problems. Metrics Evaluation metrics include precision, recall, and F1-score. 
Training Data ORCHID++ (90%) from Thai Treebanks Dataset Evaluation Data ORCHID++ (10%) from Thai Treebanks Dataset Quantitative Analyses precision recall f1-score support B-NP 0.95 0.98 0.96 518 I-NP 0.86 0.91 0.88 2128 O 0.87 0.91 0.89 280 B-PP 0.91 0.77 0.83 65 I-PP 0.66 0.52 0.59 252 B-S 0.65 0.49 0.56 90 I-S 0.67 0.49 0.56 1082 B-VP 0.86 0.89 0.88 515 I-VP 0.90 0.94 0.92 4565 micro avg 0.86 0.86 0.86 9495 macro avg 0.81 0.77 0.79 9495 weighted avg 0.86 0.86 0.86 9495 samples avg 0.86 0.86 0.86 9495 Ethical Considerations It was trained on the ORCHID++ corpus, so it may inherit biases from that corpus. Caveats and Recommendations Input is one Thai sentence as [(word, part-of-speech)] pairs (the part-of-speech model was trained on the ORCHID corpus).","title":"v0.2"},{"location":"NER/","text":"NER models This page will collect the Model Cards for NER in PyThaiNLP. Thai NER v1.4 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-5-21 Model version: 1.4 Used in PyThaiNLP version: 2.2 + Filename: ~/pythainlp-data/thai-ner-1-4.crfsuite CRF Model License: CC0 GitHub for Thai NER 1.4 (Data and train notebook): https://github.com/wannaphong/thai-ner/tree/master/model/1.4 Intended Use Named-entity tagging for Thai. Not suitable for other languages or non-news domains. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. 
Training Data ThaiNER 1.3 Corpus Train set Evaluation Data ThaiNER 1.3 Corpus Test set Quantitative Analyses precision recall f1-score support B-DATE 0.92 0.86 0.89 375 I-DATE 0.94 0.94 0.94 747 B-EMAIL 1.00 1.00 1.00 5 I-EMAIL 1.00 1.00 1.00 28 B-LAW 0.71 0.56 0.62 43 I-LAW 0.74 0.70 0.72 154 B-LEN 0.96 0.93 0.95 29 I-LEN 0.98 0.94 0.96 69 B-LOCATION 0.88 0.77 0.82 864 I-LOCATION 0.86 0.73 0.79 852 B-MONEY 0.98 0.85 0.91 105 I-MONEY 0.96 0.95 0.95 239 B-ORGANIZATION 0.90 0.78 0.84 1166 I-ORGANIZATION 0.84 0.77 0.81 1338 B-PERCENT 1.00 0.97 0.99 34 I-PERCENT 1.00 0.96 0.98 51 B-PERSON 0.96 0.82 0.88 676 I-PERSON 0.94 0.92 0.93 2424 B-PHONE 1.00 0.72 0.84 29 I-PHONE 0.96 0.92 0.94 78 B-TIME 0.87 0.73 0.79 172 I-TIME 0.94 0.83 0.88 336 B-URL 0.89 1.00 0.94 24 I-URL 0.96 1.00 0.98 371 B-ZIP 1.00 1.00 1.00 4 micro avg 0.91 0.84 0.87 10213 macro avg 0.93 0.87 0.89 10213 weighted avg 0.91 0.84 0.87 10213 samples avg 0.17 0.17 0.17 10213 Ethical Considerations This model may carry biases from the corpus creator (Wannaphong Phatthiyaphaibun). It is built with a part-of-speech model, so it may also inherit biases from that model. Caveats and Recommendations Thai text only. v1.5 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-1-16 Model version: 1.5 Used in PyThaiNLP version: 2.3 + Filename: ~/pythainlp-data/thai-ner-1-5-newmm-lst20.crfsuite CRF Model License: CC0 GitHub for Thai NER 1.5 (Data and train notebook): thai-ner-1-5-newmm-lst20.ipynb https://github.com/wannaphong/thai-ner/tree/master/model/1.5 Intended Use Named-entity tagging for Thai. Not suitable for other languages or non-news domains. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. 
Training Data ThaiNER 1.5 Corpus Train set (5,089 sentences) Evaluation Data ThaiNER 1.5 Corpus Test set (1,274 sentences) Quantitative Analyses precision recall f1-score support B-DATE 0.93 0.82 0.87 350 I-DATE 0.95 0.94 0.95 665 B-LAW 0.85 0.54 0.66 87 I-LAW 0.85 0.64 0.73 253 B-LEN 1.00 0.75 0.86 12 I-LEN 1.00 0.69 0.82 26 B-LOCATION 0.81 0.70 0.75 620 I-LOCATION 0.74 0.72 0.73 533 B-MONEY 1.00 0.91 0.95 131 I-MONEY 0.99 0.95 0.97 321 B-ORGANIZATION 0.92 0.70 0.80 1334 I-ORGANIZATION 0.80 0.73 0.76 1198 B-PERCENT 0.94 0.88 0.91 17 I-PERCENT 0.91 0.95 0.93 22 B-PERSON 0.96 0.78 0.86 607 I-PERSON 0.94 0.88 0.91 2181 B-PHONE 1.00 0.50 0.67 2 I-PHONE 1.00 1.00 1.00 8 B-TIME 0.93 0.66 0.77 87 I-TIME 0.97 0.77 0.86 158 B-URL 0.91 0.83 0.87 12 I-URL 0.93 0.96 0.94 94 micro avg 0.89 0.79 0.84 8718 macro avg 0.92 0.79 0.84 8718 weighted avg 0.90 0.79 0.84 8718 samples avg 0.16 0.16 0.16 8718 Ethical Considerations This model may carry biases from the corpus creator (Wannaphong Phatthiyaphaibun). It is built with a part-of-speech model, so it may also inherit biases from that model. Caveats and Recommendations Thai text only. v1.5.1 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-6-21 Model version: 1.5.1 Used in PyThaiNLP version: 2.4 + Filename: pythainlp/corpus/thainer_crf_1_5_1.model CRF Model License: CC0 GitHub for Thai NER 1.5.1 (Data and train notebook): https://github.com/wannaphong/thai-ner/tree/master/model/1.5.1 Intended Use Named-entity tagging for Thai. Not suitable for other languages or non-news domains. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. 
Training Data ThaiNER 1.5 Corpus Train set (5,089 sentences) Evaluation Data ThaiNER 1.5 Corpus Test set (1,274 sentences) Quantitative Analyses precision recall f1-score support B-DATE 0.93 0.81 0.87 350 I-DATE 0.94 0.94 0.94 665 B-LAW 0.85 0.54 0.66 87 I-LAW 0.87 0.65 0.74 253 B-LEN 1.00 0.75 0.86 12 I-LEN 1.00 0.69 0.82 26 B-LOCATION 0.80 0.70 0.75 620 I-LOCATION 0.75 0.72 0.73 533 B-MONEY 1.00 0.90 0.95 131 I-MONEY 0.99 0.94 0.97 321 B-ORGANIZATION 0.91 0.70 0.79 1334 I-ORGANIZATION 0.80 0.73 0.76 1198 B-PERCENT 0.94 0.88 0.91 17 I-PERCENT 0.91 0.95 0.93 22 B-PERSON 0.96 0.78 0.86 607 I-PERSON 0.94 0.88 0.91 2181 B-PHONE 1.00 0.50 0.67 2 I-PHONE 1.00 1.00 1.00 8 B-TIME 0.93 0.66 0.77 87 I-TIME 0.97 0.77 0.86 158 B-URL 0.91 0.83 0.87 12 I-URL 0.93 0.96 0.94 94 micro avg 0.89 0.79 0.84 8718 macro avg 0.92 0.79 0.84 8718 weighted avg 0.89 0.79 0.84 8718 samples avg 0.16 0.16 0.16 8718 Ethical Considerations This model may carry biases from the corpus creator (Wannaphong Phatthiyaphaibun). It is built with a part-of-speech model, so it may also inherit biases from that model. Caveats and Recommendations Thai text only. v2.0 Host: https://huggingface.co/pythainlp/thainer-corpus-v2-base-model","title":"NER models"},{"location":"NER/#ner-models","text":"This page will collect the Model Cards for NER in PyThaiNLP.","title":"NER models"},{"location":"NER/#thai-ner","text":"","title":"Thai NER"},{"location":"NER/#v14","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-5-21 Model version: 1.4 Used in PyThaiNLP version: 2.2 + Filename: ~/pythainlp-data/thai-ner-1-4.crfsuite CRF Model License: CC0 GitHub for Thai NER 1.4 (Data and train notebook): https://github.com/wannaphong/thai-ner/tree/master/model/1.4 Intended Use Named-entity tagging for Thai. Not suitable for other languages or non-news domains. Factors Based on known problems with Thai natural language processing. 
Metrics Evaluation metrics include precision, recall, and F1-score. Training Data ThaiNER 1.3 Corpus Train set Evaluation Data ThaiNER 1.3 Corpus Test set Quantitative Analyses precision recall f1-score support B-DATE 0.92 0.86 0.89 375 I-DATE 0.94 0.94 0.94 747 B-EMAIL 1.00 1.00 1.00 5 I-EMAIL 1.00 1.00 1.00 28 B-LAW 0.71 0.56 0.62 43 I-LAW 0.74 0.70 0.72 154 B-LEN 0.96 0.93 0.95 29 I-LEN 0.98 0.94 0.96 69 B-LOCATION 0.88 0.77 0.82 864 I-LOCATION 0.86 0.73 0.79 852 B-MONEY 0.98 0.85 0.91 105 I-MONEY 0.96 0.95 0.95 239 B-ORGANIZATION 0.90 0.78 0.84 1166 I-ORGANIZATION 0.84 0.77 0.81 1338 B-PERCENT 1.00 0.97 0.99 34 I-PERCENT 1.00 0.96 0.98 51 B-PERSON 0.96 0.82 0.88 676 I-PERSON 0.94 0.92 0.93 2424 B-PHONE 1.00 0.72 0.84 29 I-PHONE 0.96 0.92 0.94 78 B-TIME 0.87 0.73 0.79 172 I-TIME 0.94 0.83 0.88 336 B-URL 0.89 1.00 0.94 24 I-URL 0.96 1.00 0.98 371 B-ZIP 1.00 1.00 1.00 4 micro avg 0.91 0.84 0.87 10213 macro avg 0.93 0.87 0.89 10213 weighted avg 0.91 0.84 0.87 10213 samples avg 0.17 0.17 0.17 10213 Ethical Considerations This model may carry biases from the corpus creator (Wannaphong Phatthiyaphaibun). It is built with a part-of-speech model, so it may also inherit biases from that model. Caveats and Recommendations Thai text only.","title":"v1.4"},{"location":"NER/#v15","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-1-16 Model version: 1.5 Used in PyThaiNLP version: 2.3 + Filename: ~/pythainlp-data/thai-ner-1-5-newmm-lst20.crfsuite CRF Model License: CC0 GitHub for Thai NER 1.5 (Data and train notebook): thai-ner-1-5-newmm-lst20.ipynb https://github.com/wannaphong/thai-ner/tree/master/model/1.5 Intended Use Named-entity tagging for Thai. Not suitable for other languages or non-news domains. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. 
Training Data ThaiNER 1.5 Corpus Train set (5,089 sentences) Evaluation Data ThaiNER 1.5 Corpus Test set (1,274 sentences) Quantitative Analyses precision recall f1-score support B-DATE 0.93 0.82 0.87 350 I-DATE 0.95 0.94 0.95 665 B-LAW 0.85 0.54 0.66 87 I-LAW 0.85 0.64 0.73 253 B-LEN 1.00 0.75 0.86 12 I-LEN 1.00 0.69 0.82 26 B-LOCATION 0.81 0.70 0.75 620 I-LOCATION 0.74 0.72 0.73 533 B-MONEY 1.00 0.91 0.95 131 I-MONEY 0.99 0.95 0.97 321 B-ORGANIZATION 0.92 0.70 0.80 1334 I-ORGANIZATION 0.80 0.73 0.76 1198 B-PERCENT 0.94 0.88 0.91 17 I-PERCENT 0.91 0.95 0.93 22 B-PERSON 0.96 0.78 0.86 607 I-PERSON 0.94 0.88 0.91 2181 B-PHONE 1.00 0.50 0.67 2 I-PHONE 1.00 1.00 1.00 8 B-TIME 0.93 0.66 0.77 87 I-TIME 0.97 0.77 0.86 158 B-URL 0.91 0.83 0.87 12 I-URL 0.93 0.96 0.94 94 micro avg 0.89 0.79 0.84 8718 macro avg 0.92 0.79 0.84 8718 weighted avg 0.90 0.79 0.84 8718 samples avg 0.16 0.16 0.16 8718 Ethical Considerations This model may carry biases from the corpus creator (Wannaphong Phatthiyaphaibun). It is built with a part-of-speech model, so it may also inherit biases from that model. Caveats and Recommendations Thai text only.","title":"v1.5"},{"location":"NER/#v151","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-6-21 Model version: 1.5.1 Used in PyThaiNLP version: 2.4 + Filename: pythainlp/corpus/thainer_crf_1_5_1.model CRF Model License: CC0 GitHub for Thai NER 1.5.1 (Data and train notebook): https://github.com/wannaphong/thai-ner/tree/master/model/1.5.1 Intended Use Named-entity tagging for Thai. Not suitable for other languages or non-news domains. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. 
Training Data ThaiNER 1.5 Corpus Train set (5,089 sentences) Evaluation Data ThaiNER 1.5 Corpus Test set (1,274 sentences) Quantitative Analyses precision recall f1-score support B-DATE 0.93 0.81 0.87 350 I-DATE 0.94 0.94 0.94 665 B-LAW 0.85 0.54 0.66 87 I-LAW 0.87 0.65 0.74 253 B-LEN 1.00 0.75 0.86 12 I-LEN 1.00 0.69 0.82 26 B-LOCATION 0.80 0.70 0.75 620 I-LOCATION 0.75 0.72 0.73 533 B-MONEY 1.00 0.90 0.95 131 I-MONEY 0.99 0.94 0.97 321 B-ORGANIZATION 0.91 0.70 0.79 1334 I-ORGANIZATION 0.80 0.73 0.76 1198 B-PERCENT 0.94 0.88 0.91 17 I-PERCENT 0.91 0.95 0.93 22 B-PERSON 0.96 0.78 0.86 607 I-PERSON 0.94 0.88 0.91 2181 B-PHONE 1.00 0.50 0.67 2 I-PHONE 1.00 1.00 1.00 8 B-TIME 0.93 0.66 0.77 87 I-TIME 0.97 0.77 0.86 158 B-URL 0.91 0.83 0.87 12 I-URL 0.93 0.96 0.94 94 micro avg 0.89 0.79 0.84 8718 macro avg 0.92 0.79 0.84 8718 weighted avg 0.89 0.79 0.84 8718 samples avg 0.16 0.16 0.16 8718 Ethical Considerations This model may carry biases from the corpus creator (Wannaphong Phatthiyaphaibun). It is built with a part-of-speech model, so it may also inherit biases from that model. Caveats and Recommendations Thai text only.","title":"v1.5.1"},{"location":"NER/#v20","text":"Host: https://huggingface.co/pythainlp/thainer-corpus-v2-base-model","title":"v2.0"},{"location":"Part%20of%20speech/","text":"Part of speech orchid perceptron Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2018-5-15 Model version: 1.0 Used in PyThaiNLP version: 1.7 + Filename: pythainlp/corpus/pos_orchid_perceptron.json perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_orchid_postag_pythainlp.ipynb Intended Use Part-of-speech tagging for Thai. Not suitable for other languages or domains outside the ORCHID corpus. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. 
Training Data Orchid Corpus Evaluation Data Orchid Corpus Quantitative Analyses No data (this corpus does not have a test set). Ethical Considerations It was trained on the ORCHID corpus, so it may inherit biases from that corpus. Caveats and Recommendations Thai word tokens only. LST20 perceptron Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-8-11 Model version: 0.2.3 Used in PyThaiNLP version: 2.2.5 + Filename: pythainlp/corpus/pos_lst20_perceptron-v0.2.3.json perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_lst20_pythainlp.ipynb Intended Use Part-of-speech tagging for Thai. Not suitable for other languages or domains outside the LST20 corpus. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. Training Data LST20 Corpus Train set Evaluation Data LST20 Corpus Test set Quantitative Analyses precision recall f1-score support AJ 0.90 0.87 0.88 4403 AV 0.88 0.79 0.83 6722 AX 0.95 0.94 0.95 7556 CC 0.94 0.97 0.95 17613 CL 0.87 0.85 0.86 3739 FX 0.99 0.99 0.99 6918 IJ 1.00 0.25 0.40 4 NG 1.00 1.00 1.00 1694 NN 0.97 0.98 0.98 58568 NU 0.98 0.98 0.98 6256 PA 0.88 0.89 0.88 194 PR 0.88 0.85 0.86 2139 PS 0.94 0.93 0.94 10886 PU 1.00 1.00 1.00 37973 VV 0.95 0.97 0.96 42586 XX 0.00 0.00 0.00 27 accuracy 0.96 207278 macro avg 0.88 0.83 0.84 207278 weighted avg 0.96 0.96 0.96 207278 Ethical Considerations It was trained on the LST20 Corpus, so it may inherit biases from that corpus. 
Caveats and Recommendations Thai word tokens only. UD_Thai-PUD Part-of-speech v0.1 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2018-4-15 Model version: 0.1 Used in PyThaiNLP version: 1.7 - 2.3 Filename: pythainlp/corpus/pos_ud_unigram.json and pythainlp/corpus/pos_ud_unigram.json unigram model & perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_lst20_pythainlp.ipynb Intended Use Part-of-speech tagging for Thai. Not suitable for other languages or domains outside the UD_Thai-PUD corpus. Factors Based on known problems with Thai natural language processing. Metrics None Training Data UD_Thai-PUD v2.2 Evaluation Data None Quantitative Analyses (This corpus does not have a test set.) Ethical Considerations None identified. Caveats and Recommendations Thai word tokens only. v0.2 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-7-31 Model version: 0.2 Used in PyThaiNLP version: 2.4 + Filename: pythainlp/corpus/pos_ud_unigram-v0.2.json and pythainlp/corpus/pos_ud_unigram-v0.2.json unigram model & perceptron model License: CC0 GitHub: https://github.com/PyThaiNLP/pythainlp/pull/603 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_ud_thai_pud_pythainlp-v0.2.ipynb Intended Use Part-of-speech tagging for Thai. Not suitable for other languages or domains outside the UD_Thai-PUD corpus. Factors Based on known problems with Thai natural language processing. 
Metrics None Training Data UD_Thai-PUD v2.8 Evaluation Data None Quantitative Analyses None Ethical Considerations None identified. Caveats and Recommendations Thai word tokens only. blackboard perceptron Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2022-10-14 Model version: 1.0 Used in PyThaiNLP version: 3.2 + Filename: blackboard_pt_tagger-v1.0_pythainlp.json perceptron model License: CC0 GitHub: https://github.com/PyThaiNLP/pythainlp/issues/731 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_blackboard_pythainlp.ipynb Intended Use Part-of-speech tagging for Thai. Not suitable for other languages or domains outside the Blackboard treebank. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. Training Data Blackboard treebank Evaluation Data Blackboard treebank Quantitative Analyses precision recall f1-score support AJ 0.90 0.90 0.90 16030 AV 0.92 0.91 0.91 38078 AX 0.97 0.96 0.97 44719 CC 0.98 0.99 0.99 127801 CL 0.93 0.87 0.90 6738 FX 1.00 1.00 1.00 28991 IJ 1.00 0.58 0.74 12 NG 1.00 1.00 1.00 12121 NN 0.99 0.99 0.99 283971 NU 0.98 0.97 0.98 19220 PA 0.98 0.88 0.93 1916 PR 0.93 0.89 0.91 12869 PS 0.96 0.96 0.96 39317 PU 1.00 1.00 1.00 1576 VV 0.98 0.98 0.98 257831 XX 1.00 0.50 0.67 4 accuracy 0.98 891194 macro avg 0.97 0.90 0.93 891194 weighted avg 0.98 0.98 0.98 891194 Ethical Considerations It was trained on the Blackboard treebank, so it may inherit biases from that corpus. 
Caveats and Recommendations Thai word tokens only.","title":"Part of speech"},{"location":"Part%20of%20speech/#part-of-speech","text":"","title":"Part of speech"},{"location":"Part%20of%20speech/#orchid-perceptron","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2018-5-15 Model version: 1.0 Used in PyThaiNLP version: 1.7 + Filename: pythainlp/corpus/pos_orchid_perceptron.json perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_orchid_postag_pythainlp.ipynb Intended Use Part-of-speech tagging for Thai. Not suitable for other languages or domains outside the ORCHID corpus. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. Training Data Orchid Corpus Evaluation Data Orchid Corpus Quantitative Analyses No data (this corpus does not have a test set). Ethical Considerations It was trained on the ORCHID corpus, so it may inherit biases from that corpus. Caveats and Recommendations Thai word tokens only.","title":"orchid perceptron"},{"location":"Part%20of%20speech/#lst20-perceptron","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-8-11 Model version: 0.2.3 Used in PyThaiNLP version: 2.2.5 + Filename: pythainlp/corpus/pos_lst20_perceptron-v0.2.3.json perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_lst20_pythainlp.ipynb Intended Use Part-of-speech tagging for Thai. Not suitable for other languages or domains outside the LST20 corpus. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. 
Training Data LST20 Corpus Train set Evaluation Data LST20 Corpus Test set Quantitative Analyses precision recall f1-score support AJ 0.90 0.87 0.88 4403 AV 0.88 0.79 0.83 6722 AX 0.95 0.94 0.95 7556 CC 0.94 0.97 0.95 17613 CL 0.87 0.85 0.86 3739 FX 0.99 0.99 0.99 6918 IJ 1.00 0.25 0.40 4 NG 1.00 1.00 1.00 1694 NN 0.97 0.98 0.98 58568 NU 0.98 0.98 0.98 6256 PA 0.88 0.89 0.88 194 PR 0.88 0.85 0.86 2139 PS 0.94 0.93 0.94 10886 PU 1.00 1.00 1.00 37973 VV 0.95 0.97 0.96 42586 XX 0.00 0.00 0.00 27 accuracy 0.96 207278 macro avg 0.88 0.83 0.84 207278 weighted avg 0.96 0.96 0.96 207278 Ethical Considerations It was trained on the LST20 Corpus, so it may inherit biases from that corpus. Caveats and Recommendations Thai word tokens only.","title":"LST20 perceptron"},{"location":"Part%20of%20speech/#ud_thai-pud-part-of-speech","text":"","title":"UD_Thai-PUD Part-of-speech"},{"location":"Part%20of%20speech/#v01","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2018-4-15 Model version: 0.1 Used in PyThaiNLP version: 1.7 - 2.3 Filename: pythainlp/corpus/pos_ud_unigram.json and pythainlp/corpus/pos_ud_unigram.json unigram model & perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_lst20_pythainlp.ipynb Intended Use Part-of-speech tagging for Thai. Not suitable for other languages or domains outside the UD_Thai-PUD corpus. Factors Based on known problems with Thai natural language processing. Metrics None Training Data UD_Thai-PUD v2.2 Evaluation Data None Quantitative Analyses (This corpus does not have a test set.) 
Ethical Considerations None identified. Caveats and Recommendations Thai word tokens only.","title":"v0.1"},{"location":"Part%20of%20speech/#v02","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-7-31 Model version: 0.2 Used in PyThaiNLP version: 2.4 + Filename: pythainlp/corpus/pos_ud_unigram-v0.2.json and pythainlp/corpus/pos_ud_unigram-v0.2.json unigram model & perceptron model License: CC0 GitHub: https://github.com/PyThaiNLP/pythainlp/pull/603 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_ud_thai_pud_pythainlp-v0.2.ipynb Intended Use Part-of-speech tagging for Thai. Not suitable for other languages or domains outside the UD_Thai-PUD corpus. Factors Based on known problems with Thai natural language processing. Metrics None Training Data UD_Thai-PUD v2.8 Evaluation Data None Quantitative Analyses None Ethical Considerations None identified. Caveats and Recommendations Thai word tokens only.","title":"v0.2"},{"location":"Part%20of%20speech/#blackboard-perceptron","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2022-10-14 Model version: 1.0 Used in PyThaiNLP version: 3.2 + Filename: blackboard_pt_tagger-v1.0_pythainlp.json perceptron model License: CC0 GitHub: https://github.com/PyThaiNLP/pythainlp/issues/731 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_blackboard_pythainlp.ipynb Intended Use Part-of-speech tagging for Thai. Not suitable for other languages or domains outside the Blackboard treebank. Factors Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall, and F1-score. 
Training Data Blackboard treebank Evaluation Data Blackboard treebank Quantitative Analyses precision recall f1-score support AJ 0.90 0.90 0.90 16030 AV 0.92 0.91 0.91 38078 AX 0.97 0.96 0.97 44719 CC 0.98 0.99 0.99 127801 CL 0.93 0.87 0.90 6738 FX 1.00 1.00 1.00 28991 IJ 1.00 0.58 0.74 12 NG 1.00 1.00 1.00 12121 NN 0.99 0.99 0.99 283971 NU 0.98 0.97 0.98 19220 PA 0.98 0.88 0.93 1916 PR 0.93 0.89 0.91 12869 PS 0.96 0.96 0.96 39317 PU 1.00 1.00 1.00 1576 VV 0.98 0.98 0.98 257831 XX 1.00 0.50 0.67 4 accuracy 0.98 891194 macro avg 0.97 0.90 0.93 891194 weighted avg 0.98 0.98 0.98 891194 Ethical Considerations It trained from Blackboard treebank. It is possible to have a bias from Blackboard treebank. Caveats and Recommendations Thai word token only ^ Back to top","title":"blackboard perceptron"},{"location":"WangChanGLM/","text":"WangChanGLM This report author: Wannaphong Phatthiyaphaibun You can read the model card of WangChanGLM at pythainlp/wangchanglm-7.5B-sft-en .","title":"WangChanGLM"},{"location":"WangChanGLM/#wangchanglm","text":"This report author: Wannaphong Phatthiyaphaibun You can read the model card of WangChanGLM at pythainlp/wangchanglm-7.5B-sft-en .","title":"WangChanGLM"},{"location":"encoder/","text":"Encoder models WangchanBERTa This report author: Wannaphong Phatthiyaphaibun You can read the model card of WangchanBERTa at huggingface.co/airesearch/wangchanberta-base-att-spm-uncased .","title":"Encoder models"},{"location":"encoder/#encoder-models","text":"","title":"Encoder models"},{"location":"encoder/#wangchanberta","text":"This report author: Wannaphong Phatthiyaphaibun You can read the model card of WangchanBERTa at huggingface.co/airesearch/wangchanberta-base-att-spm-uncased .","title":"WangchanBERTa"},{"location":"thai2fit/","text":"thai2fit v0.32 Model Details Developer: Charin Polpanumas This report author: Wannaphong Phatthiyaphaibun Model date: 2019-06-14 Model version: 0.32 Used in PyThaiNLP version: 2.0+ Filename: 
~/pythainlp-data/itos_lstm.pkl and ~/pythainlp-data/thwiki_model_lstm.pth GitHub: https://github.com/cstorm125/thai2fit Notebook for training: https://github.com/cstorm125/thai2fit/blob/96fe40d1a9f270dfe0d3a61d2a93254df4078b0d/thwiki_lm/thwiki_lm.ipynb Language Model License: MIT License Intended Use Language Modeling for Thai text classification pretrained or more. Factors Based on known problems with Thai natural Language processing. Language Modeling for many tasks of Natural Language processing. Ep. text classification, text generation, and more. Metrics Evaluation metrics include Perplexity. Training Data Thai Wikipedia Dump last updated February 17, 2019 Evaluation Data Thai Wikipedia Dump by using 40M/200k/200k tokens of train-validation-test split Quantitative Analyses perplexity is 28.71067 with 60,005 embeddings at 400 dimensions Ethical Considerations This language model is based on the Thai Wikipedia Dump (include bias from Thai Wikipedia). Caveats and Recommendations It\u2019s want to have fastai 1.9 for using it or using it from pythainlp. It supports Thai Language only.","title":"thai2fit"},{"location":"thai2fit/#thai2fit","text":"","title":"thai2fit"},{"location":"thai2fit/#v032","text":"Model Details Developer: Charin Polpanumas This report author: Wannaphong Phatthiyaphaibun Model date: 2019-06-14 Model version: 0.32 Used in PyThaiNLP version: 2.0+ Filename: ~/pythainlp-data/itos_lstm.pkl and ~/pythainlp-data/thwiki_model_lstm.pth GitHub: https://github.com/cstorm125/thai2fit Notebook for training: https://github.com/cstorm125/thai2fit/blob/96fe40d1a9f270dfe0d3a61d2a93254df4078b0d/thwiki_lm/thwiki_lm.ipynb Language Model License: MIT License Intended Use Language Modeling for Thai text classification pretrained or more. Factors Based on known problems with Thai natural Language processing. Language Modeling for many tasks of Natural Language processing. Ep. text classification, text generation, and more. 
Metrics Evaluation metrics include Perplexity. Training Data Thai Wikipedia Dump last updated February 17, 2019 Evaluation Data Thai Wikipedia Dump by using 40M/200k/200k tokens of train-validation-test split Quantitative Analyses perplexity is 28.71067 with 60,005 embeddings at 400 dimensions Ethical Considerations This language model is based on the Thai Wikipedia Dump (include bias from Thai Wikipedia). Caveats and Recommendations It\u2019s want to have fastai 1.9 for using it or using it from pythainlp. It supports Thai Language only.","title":"v0.32"},{"location":"tokenizer/","text":"Tokenizer CRFcut v1.0 Model Details Developer: Chonlapat Patanajirasit This report author: Wannaphong Phatthiyaphaibun Model date: 2020-05-09 Model version: 1.0 Used in PyThaiNLP version: 2.2 + Filename: pythainlp/corpus/sentenceseg_crfcut.model GitHub: https://github.com/vistec-AI/crfcut CRF Model License: CC0 Intended Use - Segmenting Thai text into sentences. Factors - Based on known problems with thai natural Language processing. Metrics - Evaluation metrics include precision, recall and f1-score. 
Training Data Ted + Orchid + Fake review Evaluation Data Ted + Orchid + Fake review dataset validate Quantitative Analyses The result of CRF-Cut is trained by different datasets are as follows: dataset-train dataset-validate I-precision I-recall I-fscore E-precision E-recall E-fscore space-correct Ted Ted 0.99 0.99 0.99 0.74 0.70 0.72 0.82 Ted Orchid 0.95 0.99 0.97 0.73 0.24 0.36 0.73 Ted Fake review 0.98 0.99 0.98 0.86 0.70 0.77 0.78 Orchid Ted 0.98 0.98 0.98 0.56 0.59 0.58 0.71 Orchid Orchid 0.98 0.99 0.99 0.85 0.71 0.77 0.87 Orchid Fake review 0.97 0.99 0.98 0.77 0.63 0.69 0.70 Fake review Ted 0.99 0.95 0.97 0.42 0.85 0.56 0.56 Fake review Orchid 0.97 0.96 0.96 0.48 0.59 0.53 0.67 Fake review Fake review 1 1 1 0.98 0.96 0.97 0.97 Ted + Orchid + Fake review Ted 0.99 0.98 0.99 0.66 0.77 0.71 0.78 Ted + Orchid + Fake review Orchid 0.98 0.98 0.98 0.73 0.66 0.69 0.82 Ted + Orchid + Fake review Fake review 1 1 1 0.98 0.95 0.96 0.96 Ethical Considerations no ideas Caveats and Recommendations Thai text only Han-solo \ud83e\udebf Han-solo: Thai syllable segmenter This work wants to create a Thai syllable segmenter that can work in the Thai social media domain. Model Details Developer: Wannaphong Phatthiyaphaibun Model date: 2023-07-30 Model version: 1.0 Used in PyThaiNLP version: 5.0 Filename: pythainlp/corpus/han_solo.crfsuite GitHub: https://github.com/PyThaiNLP/Han-solo CRF Model License: CC0 Intended Use Segmenting Thai text into syllables. Factors - Based on known problems with thai natural Language processing. Metrics F1-score Training Data Han-solo train set and Nutcha Dataset Evaluation Data Han-solo Testset Quantitative Analyses 1 is split, and 0 is not split. precision recall f1-score support 0 1.00 1.00 1.00 61078 1 1.00 0.99 0.99 29468 accuracy 1.00 90546 macro avg 1.00 1.00 1.00 90546 weighted avg 1.00 1.00 1.00 90546 Ethical Considerations The model trained on news and social network domain. It can has biase from human and domain. 
Caveats and Recommendations Thai text only","title":"Tokenizer"},{"location":"tokenizer/#tokenizer","text":"","title":"Tokenizer"},{"location":"tokenizer/#crfcut","text":"","title":"CRFcut"},{"location":"tokenizer/#v10","text":"Model Details Developer: Chonlapat Patanajirasit This report author: Wannaphong Phatthiyaphaibun Model date: 2020-05-09 Model version: 1.0 Used in PyThaiNLP version: 2.2 + Filename: pythainlp/corpus/sentenceseg_crfcut.model GitHub: https://github.com/vistec-AI/crfcut CRF Model License: CC0 Intended Use - Segmenting Thai text into sentences. Factors - Based on known problems with thai natural Language processing. Metrics - Evaluation metrics include precision, recall and f1-score. Training Data Ted + Orchid + Fake review Evaluation Data Ted + Orchid + Fake review dataset validate Quantitative Analyses The result of CRF-Cut is trained by different datasets are as follows: dataset-train dataset-validate I-precision I-recall I-fscore E-precision E-recall E-fscore space-correct Ted Ted 0.99 0.99 0.99 0.74 0.70 0.72 0.82 Ted Orchid 0.95 0.99 0.97 0.73 0.24 0.36 0.73 Ted Fake review 0.98 0.99 0.98 0.86 0.70 0.77 0.78 Orchid Ted 0.98 0.98 0.98 0.56 0.59 0.58 0.71 Orchid Orchid 0.98 0.99 0.99 0.85 0.71 0.77 0.87 Orchid Fake review 0.97 0.99 0.98 0.77 0.63 0.69 0.70 Fake review Ted 0.99 0.95 0.97 0.42 0.85 0.56 0.56 Fake review Orchid 0.97 0.96 0.96 0.48 0.59 0.53 0.67 Fake review Fake review 1 1 1 0.98 0.96 0.97 0.97 Ted + Orchid + Fake review Ted 0.99 0.98 0.99 0.66 0.77 0.71 0.78 Ted + Orchid + Fake review Orchid 0.98 0.98 0.98 0.73 0.66 0.69 0.82 Ted + Orchid + Fake review Fake review 1 1 1 0.98 0.95 0.96 0.96 Ethical Considerations no ideas Caveats and Recommendations Thai text only","title":"v1.0"},{"location":"tokenizer/#han-solo","text":"\ud83e\udebf Han-solo: Thai syllable segmenter This work wants to create a Thai syllable segmenter that can work in the Thai social media domain. 
Model Details Developer: Wannaphong Phatthiyaphaibun Model date: 2023-07-30 Model version: 1.0 Used in PyThaiNLP version: 5.0 Filename: pythainlp/corpus/han_solo.crfsuite GitHub: https://github.com/PyThaiNLP/Han-solo CRF Model License: CC0 Intended Use Segmenting Thai text into syllables. Factors - Based on known problems with thai natural Language processing. Metrics F1-score Training Data Han-solo train set and Nutcha Dataset Evaluation Data Han-solo Testset Quantitative Analyses 1 is split, and 0 is not split. precision recall f1-score support 0 1.00 1.00 1.00 61078 1 1.00 0.99 0.99 29468 accuracy 1.00 90546 macro avg 1.00 1.00 1.00 90546 weighted avg 1.00 1.00 1.00 90546 Ethical Considerations The model trained on news and social network domain. It can has biase from human and domain. Caveats and Recommendations Thai text only","title":"Han-solo"},{"location":"transliteration/","text":"Transliteration Thai W2P Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-12-29 Model version: 0.1 Used in PyThaiNLP version: 2.3+ Filename: ~/pythainlp-data/w2p_0.1.npy GitHub: https://github.com/PyThaiNLP/pythainlp/pull/511 License: CC0 train notebook: https://github.com/wannaphong/Thai_W2P/blob/main/train.ipynb Intended Use Converter thai word to thai phoneme Not suitable for other language. Factors Based on thai word to thai phoneme problems. Metrics Evaluation metrics include phoneme error rate (number error / number phonemes) Training Data Thai W2P (80%) Evaluation Data Thai W2P (20%) Quantitative Analyses epoch: 100 step: 100, loss: 0.03179970383644104 step: 200, loss: 0.04126007482409477 step: 300, loss: 0.01877519115805626 step: 400, loss: 0.03311225399374962 per: 0.0432 per: 0.0419 Ethical Considerations This corpus is based on the website, such as wiktionary, Royal Institute et cetera and more. It may not be the dialect that you use in everyday life. 
Caveats and Recommendations 1 Thai word only Thai2Rom Thai romanization using LSTM encoder-decoder model with attention mechanism v0.1 Model Details Developer: Chakri Lowphansirikul This report author: Wannaphong Phatthiyaphaibun Model date: 2019-08-11 Model version: 0.1 Used in PyThaiNLP version: 2.1 + Filename: ~/pythainlp-data/thai2rom-pytorch-attn-v0.1.tar GitHub: https://github.com/PyThaiNLP/pythainlp/pull/246 Train Notebook: https://github.com/lalital/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb LSTM Model Dataset: https://github.com/lalital/thai-romanization/blob/master/dataset/data.new License: CC0 Intended Use - conversion of thai text to the Roman. Factors - Based on known problems with thai natural Language processing. Metrics - Evaluation metrics include precision, recall and f1-score. Training Data Thai2Rom trainset Evaluation Data Thai2Rom testset Quantitative Analyses The model was evaluated with 3 metrics including F1-score, Exact match, Exact match at character level on the validation set (20% of the dataset or 129,642 examples). F1 (macro-average): 0.987 Exact match: 0.883 Exact match (Character-level): 0.949 Ethical Considerations no ideas Caveats and Recommendations Thai text only Thai G2P Thai Grapheme-to-Phoneme (Thai G2P) based on Deep Learning (Seq2Seq model) v0.1 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-08-20 Model version: 0.1 Used in PyThaiNLP version: 2.2+ Filename: ~/pythainlp-data/thaig2p-0.1.tar Pull request GitHub: https://github.com/PyThaiNLP/pythainlp/pull/377 GitHub: https://github.com/wannaphong/thai-g2p Train notebook: https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb Dataset: wiktionary-11-2-2020.tsv Seq2Seq model License: CC0 Intended Use Grapheme-to-Phoneme conversion tool. Factors Based on thai grapheme-to-phoneme conversion problems. Metrics f1-score. 
Training Data wiktionary trainset Evaluation Data wiktionary testset Quantitative Analyses F1 (macro-average) = 0.9415941561267093 EM = 0.71 EM (Character-level) = 0.8660247630539959 save best model em score=0.71 at epoch=1148 Save model at epoch 1148 Epoch: 1149 | Time: 2m 55s Train Loss: 0.352 | Train PPL: 1.422 Val. Loss: 0.512 | Val. PPL: 1.669 epoch=1149, teacher_forcing_ratio=0.4 Ethical Considerations This model is based on the Thai wiktionary Dump (include bias from Thai wiktionary). Caveats and Recommendations 1 Thai word only","title":"Transliteration"},{"location":"transliteration/#transliteration","text":"","title":"Transliteration"},{"location":"transliteration/#thai-w2p","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-12-29 Model version: 0.1 Used in PyThaiNLP version: 2.3+ Filename: ~/pythainlp-data/w2p_0.1.npy GitHub: https://github.com/PyThaiNLP/pythainlp/pull/511 License: CC0 train notebook: https://github.com/wannaphong/Thai_W2P/blob/main/train.ipynb Intended Use Converter thai word to thai phoneme Not suitable for other language. Factors Based on thai word to thai phoneme problems. Metrics Evaluation metrics include phoneme error rate (number error / number phonemes) Training Data Thai W2P (80%) Evaluation Data Thai W2P (20%) Quantitative Analyses epoch: 100 step: 100, loss: 0.03179970383644104 step: 200, loss: 0.04126007482409477 step: 300, loss: 0.01877519115805626 step: 400, loss: 0.03311225399374962 per: 0.0432 per: 0.0419 Ethical Considerations This corpus is based on the website, such as wiktionary, Royal Institute et cetera and more. It may not be the dialect that you use in everyday life. 
Caveats and Recommendations 1 Thai word only","title":"Thai W2P"},{"location":"transliteration/#thai2rom","text":"Thai romanization using LSTM encoder-decoder model with attention mechanism","title":"Thai2Rom"},{"location":"transliteration/#v01","text":"Model Details Developer: Chakri Lowphansirikul This report author: Wannaphong Phatthiyaphaibun Model date: 2019-08-11 Model version: 0.1 Used in PyThaiNLP version: 2.1 + Filename: ~/pythainlp-data/thai2rom-pytorch-attn-v0.1.tar GitHub: https://github.com/PyThaiNLP/pythainlp/pull/246 Train Notebook: https://github.com/lalital/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb LSTM Model Dataset: https://github.com/lalital/thai-romanization/blob/master/dataset/data.new License: CC0 Intended Use - conversion of thai text to the Roman. Factors - Based on known problems with thai natural Language processing. Metrics - Evaluation metrics include precision, recall and f1-score. Training Data Thai2Rom trainset Evaluation Data Thai2Rom testset Quantitative Analyses The model was evaluated with 3 metrics including F1-score, Exact match, Exact match at character level on the validation set (20% of the dataset or 129,642 examples). 
F1 (macro-average): 0.987 Exact match: 0.883 Exact match (Character-level): 0.949 Ethical Considerations no ideas Caveats and Recommendations Thai text only","title":"v0.1"},{"location":"transliteration/#thai-g2p","text":"Thai Grapheme-to-Phoneme (Thai G2P) based on Deep Learning (Seq2Seq model)","title":"Thai G2P"},{"location":"transliteration/#v01_1","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-08-20 Model version: 0.1 Used in PyThaiNLP version: 2.2+ Filename: ~/pythainlp-data/thaig2p-0.1.tar Pull request GitHub: https://github.com/PyThaiNLP/pythainlp/pull/377 GitHub: https://github.com/wannaphong/thai-g2p Train notebook: https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb Dataset: wiktionary-11-2-2020.tsv Seq2Seq model License: CC0 Intended Use Grapheme-to-Phoneme conversion tool. Factors Based on thai grapheme-to-phoneme conversion problems. Metrics f1-score. Training Data wiktionary trainset Evaluation Data wiktionary testset Quantitative Analyses F1 (macro-average) = 0.9415941561267093 EM = 0.71 EM (Character-level) = 0.8660247630539959 save best model em score=0.71 at epoch=1148 Save model at epoch 1148 Epoch: 1149 | Time: 2m 55s Train Loss: 0.352 | Train PPL: 1.422 Val. Loss: 0.512 | Val. PPL: 1.669 epoch=1149, teacher_forcing_ratio=0.4 Ethical Considerations This model is based on the Thai wiktionary Dump (include bias from Thai wiktionary). Caveats and Recommendations 1 Thai word only","title":"v0.1"}]} \ No newline at end of file +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Model Cards These model cards contain technical details of the models developed and used in PyThaiNLP. PyThaiNLP Homepages: https://pythainlp.github.io/ . GitHub: PyThaiNLP/Model-Cards Cite Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji ID, Gebru T. 
Model cards for model reporting. InProceedings of the conference on fairness, accountability, and transparency 2019 Jan 29 (pp. 220-229).","title":"Model Cards"},{"location":"#model-cards","text":"These model cards contain technical details of the models developed and used in PyThaiNLP. PyThaiNLP Homepages: https://pythainlp.github.io/ . GitHub: PyThaiNLP/Model-Cards Cite Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, Spitzer E, Raji ID, Gebru T. Model cards for model reporting. InProceedings of the conference on fairness, accountability, and transparency 2019 Jan 29 (pp. 220-229).","title":"Model Cards"},{"location":"CLS/","text":"CLS Blackboard CLS V1.0 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2022-10-14 Model version: 1.0 Used in PyThaiNLP version: 3.2 + Filename: pythainlp/corpus/blackboard-cls_v1.0.crfsuite GitHub: https://github.com/PyThaiNLP/pythainlp/issues/729 CRF Model License: CC0 Intended Use Segmenting Thai text into clauses (smaller than a sentence but bigger than a word) Not suitable for other language or non-news domains. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data Blackboard treebank Evaluation Data Blackboard treebank Quantitative Analyses precision recall f1-score support B_CLS 1.00 1.00 1.00 91698 E_CLS 1.00 1.00 1.00 91700 I_CLS 1.00 1.00 1.00 707795 micro avg 1.00 1.00 1.00 891193 macro avg 1.00 1.00 1.00 891193 weighted avg 1.00 1.00 1.00 891193 samples avg 1.00 1.00 1.00 891193 Ethical Considerations It trains from Blackboard treebank. It is possible to have a bias from Blackboard treebank. Caveats and Recommendations The user must perform word segmentation first before using this model. 
Thai text only LST20 CLS v0.2 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-10-03 Model version: 0.2 Used in PyThaiNLP version: 2.2.4 + Filename: ~/pythainlp-data/cls-v0.2.crfsuite GitHub: https://github.com/PyThaiNLP/pythainlp/pull/479 CRF Model License: CC0 Intended Use Segmenting Thai text into clauses (smaller than a sentence but bigger than a word) Not suitable for other language or non-news domains. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data LST20 Corpus Train set (news domain) Evaluation Data LST20 Corpus Test set (news domain) Quantitative Analyses precision recall f1-score support B_CLS 0.90 0.94 0.92 16111 E_CLS 0.90 0.94 0.92 15947 I_CLS 0.99 0.97 0.98 169565 micro avg 0.97 0.97 0.97 201623 macro avg 0.93 0.95 0.94 201623 weighted avg 0.97 0.97 0.97 201623 samples avg 0.94 0.94 0.94 201623 Ethical Considerations It trains from LST20 Corpus. It is possible to have a bias from LST20 Corpus. Caveats and Recommendations The user must perform word segmentation first before using this model. Thai text only","title":"CLS"},{"location":"CLS/#cls","text":"","title":"CLS"},{"location":"CLS/#blackboard-cls","text":"","title":"Blackboard CLS"},{"location":"CLS/#v10","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2022-10-14 Model version: 1.0 Used in PyThaiNLP version: 3.2 + Filename: pythainlp/corpus/blackboard-cls_v1.0.crfsuite GitHub: https://github.com/PyThaiNLP/pythainlp/issues/729 CRF Model License: CC0 Intended Use Segmenting Thai text into clauses (smaller than a sentence but bigger than a word) Not suitable for other language or non-news domains. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. 
Training Data Blackboard treebank Evaluation Data Blackboard treebank Quantitative Analyses precision recall f1-score support B_CLS 1.00 1.00 1.00 91698 E_CLS 1.00 1.00 1.00 91700 I_CLS 1.00 1.00 1.00 707795 micro avg 1.00 1.00 1.00 891193 macro avg 1.00 1.00 1.00 891193 weighted avg 1.00 1.00 1.00 891193 samples avg 1.00 1.00 1.00 891193 Ethical Considerations It trains from Blackboard treebank. It is possible to have a bias from Blackboard treebank. Caveats and Recommendations The user must perform word segmentation first before using this model. Thai text only","title":"V1.0"},{"location":"CLS/#lst20-cls","text":"","title":"LST20 CLS"},{"location":"CLS/#v02","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-10-03 Model version: 0.2 Used in PyThaiNLP version: 2.2.4 + Filename: ~/pythainlp-data/cls-v0.2.crfsuite GitHub: https://github.com/PyThaiNLP/pythainlp/pull/479 CRF Model License: CC0 Intended Use Segmenting Thai text into clauses (smaller than a sentence but bigger than a word) Not suitable for other language or non-news domains. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data LST20 Corpus Train set (news domain) Evaluation Data LST20 Corpus Test set (news domain) Quantitative Analyses precision recall f1-score support B_CLS 0.90 0.94 0.92 16111 E_CLS 0.90 0.94 0.92 15947 I_CLS 0.99 0.97 0.98 169565 micro avg 0.97 0.97 0.97 201623 macro avg 0.93 0.95 0.94 201623 weighted avg 0.97 0.97 0.97 201623 samples avg 0.94 0.94 0.94 201623 Ethical Considerations It trains from LST20 Corpus. It is possible to have a bias from LST20 Corpus. Caveats and Recommendations The user must perform word segmentation first before using this model. 
Thai text only","title":"v0.2"},{"location":"Chunk%20Parser/","text":"Chunk Parser CRFChunk orchidpp v0.2 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-01-21 Model version: 0.2 Used in PyThaiNLP version: 2.3 GitHub: https://github.com/PyThaiNLP/pythainlp/pull/524 License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/pull/1 Dataset: ORCHID++ from Thai Treebanks Dataset . We extract sentence subtree from tree to train data. (5,000 tree up to 5,935 tree) Intended Use Parser thai sentence to phrase structure Not suitable for other languages or other domains of orchid corpus Factors Based on thai chunk parser problems. Metrics Evaluation metrics include precision, recall and f1-score. Training Data ORCHID++ (90%) from Thai Treebanks Dataset Evaluation Data ORCHID++ (10%) from Thai Treebanks Dataset Quantitative Analyses precision recall f1-score support B-NP 0.95 0.98 0.96 518 I-NP 0.86 0.91 0.88 2128 O 0.87 0.91 0.89 280 B-PP 0.91 0.77 0.83 65 I-PP 0.66 0.52 0.59 252 B-S 0.65 0.49 0.56 90 I-S 0.67 0.49 0.56 1082 B-VP 0.86 0.89 0.88 515 I-VP 0.90 0.94 0.92 4565 micro avg 0.86 0.86 0.86 9495 macro avg 0.81 0.77 0.79 9495 weighted avg 0.86 0.86 0.86 9495 samples avg 0.86 0.86 0.86 9495 Ethical Considerations It trains from the orchid++ corpus. It is possible to have a bias from the orchid++ corpus. 
Caveats and Recommendations 1 Thai sentence with [(word,part-of-speech)] (part-of-speech model trained from orchid corpus)","title":"Chunk Parser"},{"location":"Chunk%20Parser/#chunk-parser","text":"","title":"Chunk Parser"},{"location":"Chunk%20Parser/#crfchunk-orchidpp","text":"","title":"CRFChunk orchidpp"},{"location":"Chunk%20Parser/#v02","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-01-21 Model version: 0.2 Used in PyThaiNLP version: 2.3 GitHub: https://github.com/PyThaiNLP/pythainlp/pull/524 License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/pull/1 Dataset: ORCHID++ from Thai Treebanks Dataset . We extract sentence subtree from tree to train data. (5,000 tree up to 5,935 tree) Intended Use Parser thai sentence to phrase structure Not suitable for other languages or other domains of orchid corpus Factors Based on thai chunk parser problems. Metrics Evaluation metrics include precision, recall and f1-score. Training Data ORCHID++ (90%) from Thai Treebanks Dataset Evaluation Data ORCHID++ (10%) from Thai Treebanks Dataset Quantitative Analyses precision recall f1-score support B-NP 0.95 0.98 0.96 518 I-NP 0.86 0.91 0.88 2128 O 0.87 0.91 0.89 280 B-PP 0.91 0.77 0.83 65 I-PP 0.66 0.52 0.59 252 B-S 0.65 0.49 0.56 90 I-S 0.67 0.49 0.56 1082 B-VP 0.86 0.89 0.88 515 I-VP 0.90 0.94 0.92 4565 micro avg 0.86 0.86 0.86 9495 macro avg 0.81 0.77 0.79 9495 weighted avg 0.86 0.86 0.86 9495 samples avg 0.86 0.86 0.86 9495 Ethical Considerations It trains from the orchid++ corpus. It is possible to have a bias from the orchid++ corpus. Caveats and Recommendations 1 Thai sentence with [(word,part-of-speech)] (part-of-speech model trained from orchid corpus)","title":"v0.2"},{"location":"NER/","text":"NER models This page will collect the Model Cards for NER in PyThaiNLP. 
Thai NER v1.4 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-5-21 Model version: 1.4 Used in PyThaiNLP version: 2.2 + Filename: ~/pythainlp-data/thai-ner-1-4.crfsuite CRF Model License: CC0 GitHub for Thai NER 1.4 (Data and train notebook): https://github.com/wannaphong/thai-ner/tree/master/model/1.4 Intended Use Named-Entity Tagging for Thai. Not suitable for other language or non-news domain. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data ThaiNER 1.3 Corpus Train set Evaluation Data ThaiNER 1.3 Corpus Test set Quantitative Analyses precision recall f1-score support B-DATE 0.92 0.86 0.89 375 I-DATE 0.94 0.94 0.94 747 B-EMAIL 1.00 1.00 1.00 5 I-EMAIL 1.00 1.00 1.00 28 B-LAW 0.71 0.56 0.62 43 I-LAW 0.74 0.70 0.72 154 B-LEN 0.96 0.93 0.95 29 I-LEN 0.98 0.94 0.96 69 B-LOCATION 0.88 0.77 0.82 864 I-LOCATION 0.86 0.73 0.79 852 B-MONEY 0.98 0.85 0.91 105 I-MONEY 0.96 0.95 0.95 239 B-ORGANIZATION 0.90 0.78 0.84 1166 I-ORGANIZATION 0.84 0.77 0.81 1338 B-PERCENT 1.00 0.97 0.99 34 I-PERCENT 1.00 0.96 0.98 51 B-PERSON 0.96 0.82 0.88 676 I-PERSON 0.94 0.92 0.93 2424 B-PHONE 1.00 0.72 0.84 29 I-PHONE 0.96 0.92 0.94 78 B-TIME 0.87 0.73 0.79 172 I-TIME 0.94 0.83 0.88 336 B-URL 0.89 1.00 0.94 24 I-URL 0.96 1.00 0.98 371 B-ZIP 1.00 1.00 1.00 4 micro avg 0.91 0.84 0.87 10213 macro avg 0.93 0.87 0.89 10213 weighted avg 0.91 0.84 0.87 10213 samples avg 0.17 0.17 0.17 10213 Ethical Considerations This model has bias from corpus creator. (Wannaphong Phatthiyaphaibun) This model uses the part-of-speech model to build it, so It does have a bias from the part-of-speech model. 
Caveats and Recommendations Thai text only v1.5 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-1-16 Model version: 1.5 Used in PyThaiNLP version: 2.3 + Filename: ~/pythainlp-data/thai-ner-1-5-newmm-lst20.crfsuite CRF Model License: CC0 GitHub for Thai NER 1.5 (Data and train notebook): thai-ner-1-5-newmm-lst20.ipynb https://github.com/wannaphong/thai-ner/tree/master/model/1.5 Intended Use Named-Entity Tagging for Thai. Not suitable for other language or non-news domain. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data ThaiNER 1.5 Corpus Train set (5089 sent) Evaluation Data ThaiNER 1.5 Corpus Test set (1274 sent) Quantitative Analyses precision recall f1-score support B-DATE 0.93 0.82 0.87 350 I-DATE 0.95 0.94 0.95 665 B-LAW 0.85 0.54 0.66 87 I-LAW 0.85 0.64 0.73 253 B-LEN 1.00 0.75 0.86 12 I-LEN 1.00 0.69 0.82 26 B-LOCATION 0.81 0.70 0.75 620 I-LOCATION 0.74 0.72 0.73 533 B-MONEY 1.00 0.91 0.95 131 I-MONEY 0.99 0.95 0.97 321 B-ORGANIZATION 0.92 0.70 0.80 1334 I-ORGANIZATION 0.80 0.73 0.76 1198 B-PERCENT 0.94 0.88 0.91 17 I-PERCENT 0.91 0.95 0.93 22 B-PERSON 0.96 0.78 0.86 607 I-PERSON 0.94 0.88 0.91 2181 B-PHONE 1.00 0.50 0.67 2 I-PHONE 1.00 1.00 1.00 8 B-TIME 0.93 0.66 0.77 87 I-TIME 0.97 0.77 0.86 158 B-URL 0.91 0.83 0.87 12 I-URL 0.93 0.96 0.94 94 micro avg 0.89 0.79 0.84 8718 macro avg 0.92 0.79 0.84 8718 weighted avg 0.90 0.79 0.84 8718 samples avg 0.16 0.16 0.16 8718 Ethical Considerations This model has bias from corpus creator. (Wannaphong Phatthiyaphaibun) This model uses the part-of-speech model to build it, so It does have a bias from the part-of-speech model. 
Caveats and Recommendations Thai text only v1.5.1 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-6-21 Model version: 1.5.1 Used in PyThaiNLP version: 2.4 + Filename: pythainlp/corpus/thainer_crf_1_5_1.model CRF Model License: CC0 GitHub for Thai NER 1.5.1 (Data and train notebook): https://github.com/wannaphong/thai-ner/tree/master/model/1.5.1 Intended Use Named-Entity Tagging for Thai. Not suitable for other language or non-news domain. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data ThaiNER 1.5 Corpus Train set (5089 sent) Evaluation Data ThaiNER 1.5 Corpus Test set (1274 sent) Quantitative Analyses precision recall f1-score support B-DATE 0.93 0.81 0.87 350 I-DATE 0.94 0.94 0.94 665 B-LAW 0.85 0.54 0.66 87 I-LAW 0.87 0.65 0.74 253 B-LEN 1.00 0.75 0.86 12 I-LEN 1.00 0.69 0.82 26 B-LOCATION 0.80 0.70 0.75 620 I-LOCATION 0.75 0.72 0.73 533 B-MONEY 1.00 0.90 0.95 131 I-MONEY 0.99 0.94 0.97 321 B-ORGANIZATION 0.91 0.70 0.79 1334 I-ORGANIZATION 0.80 0.73 0.76 1198 B-PERCENT 0.94 0.88 0.91 17 I-PERCENT 0.91 0.95 0.93 22 B-PERSON 0.96 0.78 0.86 607 I-PERSON 0.94 0.88 0.91 2181 B-PHONE 1.00 0.50 0.67 2 I-PHONE 1.00 1.00 1.00 8 B-TIME 0.93 0.66 0.77 87 I-TIME 0.97 0.77 0.86 158 B-URL 0.91 0.83 0.87 12 I-URL 0.93 0.96 0.94 94 micro avg 0.89 0.79 0.84 8718 macro avg 0.92 0.79 0.84 8718 weighted avg 0.89 0.79 0.84 8718 samples avg 0.16 0.16 0.16 8718 Ethical Considerations This model has bias from corpus creator. (Wannaphong Phatthiyaphaibun) This model uses the part-of-speech model to build it, so It does have a bias from the part-of-speech model. 
Caveats and Recommendations Thai text only v2.0 Host: https://huggingface.co/pythainlp/thainer-corpus-v2-base-model","title":"NER models"},{"location":"NER/#ner-models","text":"This page will collect the Model Cards for NER in PyThaiNLP.","title":"NER models"},{"location":"NER/#thai-ner","text":"","title":"Thai NER"},{"location":"NER/#v14","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-5-21 Model version: 1.4 Used in PyThaiNLP version: 2.2 + Filename: ~/pythainlp-data/thai-ner-1-4.crfsuite CRF Model License: CC0 GitHub for Thai NER 1.4 (Data and train notebook): https://github.com/wannaphong/thai-ner/tree/master/model/1.4 Intended Use Named-Entity Tagging for Thai. Not suitable for other language or non-news domain. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data ThaiNER 1.3 Corpus Train set Evaluation Data ThaiNER 1.3 Corpus Test set Quantitative Analyses precision recall f1-score support B-DATE 0.92 0.86 0.89 375 I-DATE 0.94 0.94 0.94 747 B-EMAIL 1.00 1.00 1.00 5 I-EMAIL 1.00 1.00 1.00 28 B-LAW 0.71 0.56 0.62 43 I-LAW 0.74 0.70 0.72 154 B-LEN 0.96 0.93 0.95 29 I-LEN 0.98 0.94 0.96 69 B-LOCATION 0.88 0.77 0.82 864 I-LOCATION 0.86 0.73 0.79 852 B-MONEY 0.98 0.85 0.91 105 I-MONEY 0.96 0.95 0.95 239 B-ORGANIZATION 0.90 0.78 0.84 1166 I-ORGANIZATION 0.84 0.77 0.81 1338 B-PERCENT 1.00 0.97 0.99 34 I-PERCENT 1.00 0.96 0.98 51 B-PERSON 0.96 0.82 0.88 676 I-PERSON 0.94 0.92 0.93 2424 B-PHONE 1.00 0.72 0.84 29 I-PHONE 0.96 0.92 0.94 78 B-TIME 0.87 0.73 0.79 172 I-TIME 0.94 0.83 0.88 336 B-URL 0.89 1.00 0.94 24 I-URL 0.96 1.00 0.98 371 B-ZIP 1.00 1.00 1.00 4 micro avg 0.91 0.84 0.87 10213 macro avg 0.93 0.87 0.89 10213 weighted avg 0.91 0.84 0.87 10213 samples avg 0.17 0.17 0.17 10213 Ethical Considerations This model has bias from corpus creator. 
(Wannaphong Phatthiyaphaibun) This model uses the part-of-speech model to build it, so It does have a bias from the part-of-speech model. Caveats and Recommendations Thai text only","title":"v1.4"},{"location":"NER/#v15","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-1-16 Model version: 1.5 Used in PyThaiNLP version: 2.3 + Filename: ~/pythainlp-data/thai-ner-1-5-newmm-lst20.crfsuite CRF Model License: CC0 GitHub for Thai NER 1.5 (Data and train notebook): thai-ner-1-5-newmm-lst20.ipynb https://github.com/wannaphong/thai-ner/tree/master/model/1.5 Intended Use Named-Entity Tagging for Thai. Not suitable for other language or non-news domain. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data ThaiNER 1.5 Corpus Train set (5089 sent) Evaluation Data ThaiNER 1.5 Corpus Test set (1274 sent) Quantitative Analyses precision recall f1-score support B-DATE 0.93 0.82 0.87 350 I-DATE 0.95 0.94 0.95 665 B-LAW 0.85 0.54 0.66 87 I-LAW 0.85 0.64 0.73 253 B-LEN 1.00 0.75 0.86 12 I-LEN 1.00 0.69 0.82 26 B-LOCATION 0.81 0.70 0.75 620 I-LOCATION 0.74 0.72 0.73 533 B-MONEY 1.00 0.91 0.95 131 I-MONEY 0.99 0.95 0.97 321 B-ORGANIZATION 0.92 0.70 0.80 1334 I-ORGANIZATION 0.80 0.73 0.76 1198 B-PERCENT 0.94 0.88 0.91 17 I-PERCENT 0.91 0.95 0.93 22 B-PERSON 0.96 0.78 0.86 607 I-PERSON 0.94 0.88 0.91 2181 B-PHONE 1.00 0.50 0.67 2 I-PHONE 1.00 1.00 1.00 8 B-TIME 0.93 0.66 0.77 87 I-TIME 0.97 0.77 0.86 158 B-URL 0.91 0.83 0.87 12 I-URL 0.93 0.96 0.94 94 micro avg 0.89 0.79 0.84 8718 macro avg 0.92 0.79 0.84 8718 weighted avg 0.90 0.79 0.84 8718 samples avg 0.16 0.16 0.16 8718 Ethical Considerations This model has bias from corpus creator. (Wannaphong Phatthiyaphaibun) This model uses the part-of-speech model to build it, so It does have a bias from the part-of-speech model. 
Caveats and Recommendations Thai text only","title":"v1.5"},{"location":"NER/#v151","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-6-21 Model version: 1.5.1 Used in PyThaiNLP version: 2.4 + Filename: pythainlp/corpus/thainer_crf_1_5_1.model CRF Model License: CC0 GitHub for Thai NER 1.5.1 (Data and train notebook): https://github.com/wannaphong/thai-ner/tree/master/model/1.5.1 Intended Use Named-Entity Tagging for Thai. Not suitable for other language or non-news domain. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data ThaiNER 1.5 Corpus Train set (5089 sent) Evaluation Data ThaiNER 1.5 Corpus Test set (1274 sent) Quantitative Analyses precision recall f1-score support B-DATE 0.93 0.81 0.87 350 I-DATE 0.94 0.94 0.94 665 B-LAW 0.85 0.54 0.66 87 I-LAW 0.87 0.65 0.74 253 B-LEN 1.00 0.75 0.86 12 I-LEN 1.00 0.69 0.82 26 B-LOCATION 0.80 0.70 0.75 620 I-LOCATION 0.75 0.72 0.73 533 B-MONEY 1.00 0.90 0.95 131 I-MONEY 0.99 0.94 0.97 321 B-ORGANIZATION 0.91 0.70 0.79 1334 I-ORGANIZATION 0.80 0.73 0.76 1198 B-PERCENT 0.94 0.88 0.91 17 I-PERCENT 0.91 0.95 0.93 22 B-PERSON 0.96 0.78 0.86 607 I-PERSON 0.94 0.88 0.91 2181 B-PHONE 1.00 0.50 0.67 2 I-PHONE 1.00 1.00 1.00 8 B-TIME 0.93 0.66 0.77 87 I-TIME 0.97 0.77 0.86 158 B-URL 0.91 0.83 0.87 12 I-URL 0.93 0.96 0.94 94 micro avg 0.89 0.79 0.84 8718 macro avg 0.92 0.79 0.84 8718 weighted avg 0.89 0.79 0.84 8718 samples avg 0.16 0.16 0.16 8718 Ethical Considerations This model has bias from corpus creator. (Wannaphong Phatthiyaphaibun) This model uses the part-of-speech model to build it, so It does have a bias from the part-of-speech model. 
Caveats and Recommendations Thai text only","title":"v1.5.1"},{"location":"NER/#v20","text":"Host: https://huggingface.co/pythainlp/thainer-corpus-v2-base-model","title":"v2.0"},{"location":"Part%20of%20speech/","text":"Part of speech orchid perceptron Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2018-5-15 Model version: 1.0 Used in PyThaiNLP version: 1.7 + Filename: pythainlp/corpus/pos_orchid_perceptron.json perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_orchid_postag_pythainlp.ipynb Intended Use Part of speech for Thai. Not suitable for other languages or other domains of orchid corpus. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data Orchid Corpus Evaluation Data Orchid Corpus Quantitative Analyses No data (This corpus do not have the test set.) Ethical Considerations It trains from orchid Corpus. It is possible to have a bias from orchid Corpus. Caveats and Recommendations Thai word token only LST20 perceptron Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-8-11 Model version: 0.2.3 Used in PyThaiNLP version: 2.2.5 + Filename: pythainlp/corpus/pos_lst20_perceptron-v0.2.3.json perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_lst20_pythainlp.ipynb Intended Use Part of speech for Thai. Not suitable for other languages or other domains of LST20 corpus. Factors - Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. 
Training Data LST20 Corpus Train set Evaluation Data LST20 Corpus Test set Quantitative Analyses precision recall f1-score support AJ 0.90 0.87 0.88 4403 AV 0.88 0.79 0.83 6722 AX 0.95 0.94 0.95 7556 CC 0.94 0.97 0.95 17613 CL 0.87 0.85 0.86 3739 FX 0.99 0.99 0.99 6918 IJ 1.00 0.25 0.40 4 NG 1.00 1.00 1.00 1694 NN 0.97 0.98 0.98 58568 NU 0.98 0.98 0.98 6256 PA 0.88 0.89 0.88 194 PR 0.88 0.85 0.86 2139 PS 0.94 0.93 0.94 10886 PU 1.00 1.00 1.00 37973 VV 0.95 0.97 0.96 42586 XX 0.00 0.00 0.00 27 accuracy 0.96 207278 macro avg 0.88 0.83 0.84 207278 weighted avg 0.96 0.96 0.96 207278 Ethical Considerations It trains from LST20 Corpus. It is possible to have a bias from LST20 Corpus. Caveats and Recommendations Thai word token only ^ Back to top UD_Thai-PUD Part-of-speech v0.1 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2018-4-15 Model version: 0.1 Used in PyThaiNLP version: 1.7 - 2.3 Filename: pythainlp/corpus/pos_ud_unigram.json and pythainlp/corpus/pos_ud_unigram.json unigram model & perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_lst20_pythainlp.ipynb Intended Use Part of speech for Thai. Not suitable for other languages or other domains of UD_Thai-PUD corpus. Factors - Based on known problems with thai natural Language processing. Metrics None Training Data UD_Thai-PUD v2.2 Evaluation Data None Quantitative Analyses (This corpus do not have the test set.) 
Ethical Considerations None identified. Caveats and Recommendations Thai word token only v0.2 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-7-31 Model version: 0.2 Used in PyThaiNLP version: 2.4 + Filename: pythainlp/corpus/pos_ud_unigram-v0.2.json and pythainlp/corpus/pos_ud_unigram-v0.2.json unigram model & perceptron model License: CC0 GitHub: https://github.com/PyThaiNLP/pythainlp/pull/603 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_ud_thai_pud_pythainlp-v0.2.ipynb Intended Use Part of speech for Thai. Not suitable for other languages or other domains of UD_Thai-PUD corpus. Factors - Based on known problems with Thai natural language processing. Metrics None Training Data UD_Thai-PUD v2.8 Evaluation Data None Quantitative Analyses None Ethical Considerations None identified. Caveats and Recommendations Thai word token only blackboard perceptron Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2022-10-14 Model version: 1.0 Used in PyThaiNLP version: 3.2 + Filename: blackboard_pt_tagger-v1.0_pythainlp.json perceptron model License: CC0 GitHub: https://github.com/PyThaiNLP/pythainlp/issues/731 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_blackboard_pythainlp.ipynb Intended Use Part of speech for Thai. Not suitable for other languages or other domains of Blackboard treebank. Factors - Based on known problems with Thai natural language processing. Metrics Evaluation metrics include precision, recall and f1-score. 
Training Data Blackboard treebank Evaluation Data Blackboard treebank Quantitative Analyses precision recall f1-score support AJ 0.90 0.90 0.90 16030 AV 0.92 0.91 0.91 38078 AX 0.97 0.96 0.97 44719 CC 0.98 0.99 0.99 127801 CL 0.93 0.87 0.90 6738 FX 1.00 1.00 1.00 28991 IJ 1.00 0.58 0.74 12 NG 1.00 1.00 1.00 12121 NN 0.99 0.99 0.99 283971 NU 0.98 0.97 0.98 19220 PA 0.98 0.88 0.93 1916 PR 0.93 0.89 0.91 12869 PS 0.96 0.96 0.96 39317 PU 1.00 1.00 1.00 1576 VV 0.98 0.98 0.98 257831 XX 1.00 0.50 0.67 4 accuracy 0.98 891194 macro avg 0.97 0.90 0.93 891194 weighted avg 0.98 0.98 0.98 891194 Ethical Considerations It trained from Blackboard treebank. It is possible to have a bias from Blackboard treebank. Caveats and Recommendations Thai word token only ^ Back to top","title":"Part of speech"},{"location":"Part%20of%20speech/#part-of-speech","text":"","title":"Part of speech"},{"location":"Part%20of%20speech/#orchid-perceptron","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2018-5-15 Model version: 1.0 Used in PyThaiNLP version: 1.7 + Filename: pythainlp/corpus/pos_orchid_perceptron.json perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_orchid_postag_pythainlp.ipynb Intended Use Part of speech for Thai. Not suitable for other languages or other domains of orchid corpus. Factors Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data Orchid Corpus Evaluation Data Orchid Corpus Quantitative Analyses No data (This corpus do not have the test set.) Ethical Considerations It trains from orchid Corpus. It is possible to have a bias from orchid Corpus. 
Caveats and Recommendations Thai word token only","title":"orchid perceptron"},{"location":"Part%20of%20speech/#lst20-perceptron","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-8-11 Model version: 0.2.3 Used in PyThaiNLP version: 2.2.5 + Filename: pythainlp/corpus/pos_lst20_perceptron-v0.2.3.json perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_lst20_pythainlp.ipynb Intended Use Part of speech for Thai. Not suitable for other languages or other domains of LST20 corpus. Factors - Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data LST20 Corpus Train set Evaluation Data LST20 Corpus Test set Quantitative Analyses precision recall f1-score support AJ 0.90 0.87 0.88 4403 AV 0.88 0.79 0.83 6722 AX 0.95 0.94 0.95 7556 CC 0.94 0.97 0.95 17613 CL 0.87 0.85 0.86 3739 FX 0.99 0.99 0.99 6918 IJ 1.00 0.25 0.40 4 NG 1.00 1.00 1.00 1694 NN 0.97 0.98 0.98 58568 NU 0.98 0.98 0.98 6256 PA 0.88 0.89 0.88 194 PR 0.88 0.85 0.86 2139 PS 0.94 0.93 0.94 10886 PU 1.00 1.00 1.00 37973 VV 0.95 0.97 0.96 42586 XX 0.00 0.00 0.00 27 accuracy 0.96 207278 macro avg 0.88 0.83 0.84 207278 weighted avg 0.96 0.96 0.96 207278 Ethical Considerations It trains from LST20 Corpus. It is possible to have a bias from LST20 Corpus. 
Caveats and Recommendations Thai word token only ^ Back to top","title":"LST20 perceptron"},{"location":"Part%20of%20speech/#ud_thai-pud-part-of-speech","text":"","title":"UD_Thai-PUD Part-of-speech"},{"location":"Part%20of%20speech/#v01","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2018-4-15 Model version: 0.1 Used in PyThaiNLP version: 1.7 - 2.3 Filename: pythainlp/corpus/pos_ud_unigram.json and pythainlp/corpus/pos_ud_unigram.json unigram model & perceptron model License: CC0 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_lst20_pythainlp.ipynb Intended Use Part of speech for Thai. Not suitable for other languages or other domains of UD_Thai-PUD corpus. Factors - Based on known problems with thai natural Language processing. Metrics None Training Data UD_Thai-PUD v2.2 Evaluation Data None Quantitative Analyses (This corpus do not have the test set.) Ethical Considerations no ideas Caveats and Recommendations Thai word token only","title":"v0.1"},{"location":"Part%20of%20speech/#v02","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2021-7-31 Model version: 0.2 Used in PyThaiNLP version: 2.4 + Filename: pythainlp/corpus/pos_ud_unigram-v0.2.json and pythainlp/corpus/pos_ud_unigram-v0.2.json unigram model & perceptron model License: CC0 GitHub: https://github.com/PyThaiNLP/pythainlp/pull/603 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_ud_thai_pud_pythainlp-v0.2.ipynb Intended Use Part of speech for Thai. Not suitable for other languages or other domains of UD_Thai-PUD corpus. Factors - Based on known problems with thai natural Language processing. 
Metrics None Training Data UD_Thai-PUD v2.8 Evaluation Data None Quantitative Analyses None Ethical Considerations no ideas Caveats and Recommendations Thai word token only","title":"v0.2"},{"location":"Part%20of%20speech/#blackboard-perceptron","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2022-10-14 Model version: 1.0 Used in PyThaiNLP version: 3.2 + Filename: blackboard_pt_tagger-v1.0_pythainlp.json perceptron model License: CC0 GitHub: https://github.com/PyThaiNLP/pythainlp/issues/731 train notebook: https://github.com/PyThaiNLP/pythainlp_notebook/blob/master/postag/train_blackboard_pythainlp.ipynb Intended Use Part of speech for Thai. Not suitable for other languages or other domains of Blackboard treebank. Factors - Based on known problems with thai natural Language processing. Metrics Evaluation metrics include precision, recall and f1-score. Training Data Blackboard treebank Evaluation Data Blackboard treebank Quantitative Analyses precision recall f1-score support AJ 0.90 0.90 0.90 16030 AV 0.92 0.91 0.91 38078 AX 0.97 0.96 0.97 44719 CC 0.98 0.99 0.99 127801 CL 0.93 0.87 0.90 6738 FX 1.00 1.00 1.00 28991 IJ 1.00 0.58 0.74 12 NG 1.00 1.00 1.00 12121 NN 0.99 0.99 0.99 283971 NU 0.98 0.97 0.98 19220 PA 0.98 0.88 0.93 1916 PR 0.93 0.89 0.91 12869 PS 0.96 0.96 0.96 39317 PU 1.00 1.00 1.00 1576 VV 0.98 0.98 0.98 257831 XX 1.00 0.50 0.67 4 accuracy 0.98 891194 macro avg 0.97 0.90 0.93 891194 weighted avg 0.98 0.98 0.98 891194 Ethical Considerations It trained from Blackboard treebank. It is possible to have a bias from Blackboard treebank. 
Caveats and Recommendations Thai word token only ^ Back to top","title":"blackboard perceptron"},{"location":"WangChanGLM/","text":"WangChanGLM This report author: Wannaphong Phatthiyaphaibun You can read the model card of WangChanGLM at pythainlp/wangchanglm-7.5B-sft-en .","title":"WangChanGLM"},{"location":"WangChanGLM/#wangchanglm","text":"This report author: Wannaphong Phatthiyaphaibun You can read the model card of WangChanGLM at pythainlp/wangchanglm-7.5B-sft-en .","title":"WangChanGLM"},{"location":"encoder/","text":"Encoder models WangchanBERTa This report author: Wannaphong Phatthiyaphaibun You can read the model card of WangchanBERTa at huggingface.co/airesearch/wangchanberta-base-att-spm-uncased .","title":"Encoder models"},{"location":"encoder/#encoder-models","text":"","title":"Encoder models"},{"location":"encoder/#wangchanberta","text":"This report author: Wannaphong Phatthiyaphaibun You can read the model card of WangchanBERTa at huggingface.co/airesearch/wangchanberta-base-att-spm-uncased .","title":"WangchanBERTa"},{"location":"thai2fit/","text":"thai2fit v0.32 Model Details Developer: Charin Polpanumas This report author: Wannaphong Phatthiyaphaibun Model date: 2019-06-14 Model version: 0.32 Used in PyThaiNLP version: 2.0+ Filename: ~/pythainlp-data/itos_lstm.pkl and ~/pythainlp-data/thwiki_model_lstm.pth GitHub: https://github.com/cstorm125/thai2fit Notebook for training: https://github.com/cstorm125/thai2fit/blob/96fe40d1a9f270dfe0d3a61d2a93254df4078b0d/thwiki_lm/thwiki_lm.ipynb Language Model License: MIT License Intended Use Language Modeling for Thai text classification pretrained or more. Factors Based on known problems with Thai natural Language processing. Language Modeling for many tasks of Natural Language processing. Ep. text classification, text generation, and more. Metrics Evaluation metrics include Perplexity. 
Training Data Thai Wikipedia Dump last updated February 17, 2019 Evaluation Data Thai Wikipedia Dump by using 40M/200k/200k tokens of train-validation-test split Quantitative Analyses perplexity is 28.71067 with 60,005 embeddings at 400 dimensions Ethical Considerations This language model is based on the Thai Wikipedia Dump (including bias from Thai Wikipedia). Caveats and Recommendations It requires fastai 1.9 to use it directly, or it can be used via PyThaiNLP. It supports the Thai language only.","title":"thai2fit"},{"location":"thai2fit/#thai2fit","text":"","title":"thai2fit"},{"location":"thai2fit/#v032","text":"Model Details Developer: Charin Polpanumas This report author: Wannaphong Phatthiyaphaibun Model date: 2019-06-14 Model version: 0.32 Used in PyThaiNLP version: 2.0+ Filename: ~/pythainlp-data/itos_lstm.pkl and ~/pythainlp-data/thwiki_model_lstm.pth GitHub: https://github.com/cstorm125/thai2fit Notebook for training: https://github.com/cstorm125/thai2fit/blob/96fe40d1a9f270dfe0d3a61d2a93254df4078b0d/thwiki_lm/thwiki_lm.ipynb Language Model License: MIT License Intended Use Pretrained language modeling for Thai text classification and other tasks. Factors Based on known problems with Thai natural language processing. Language modeling for many natural language processing tasks, e.g. text classification, text generation, and more. Metrics Evaluation metrics include Perplexity. Training Data Thai Wikipedia Dump last updated February 17, 2019 Evaluation Data Thai Wikipedia Dump by using 40M/200k/200k tokens of train-validation-test split Quantitative Analyses perplexity is 28.71067 with 60,005 embeddings at 400 dimensions Ethical Considerations This language model is based on the Thai Wikipedia Dump (including bias from Thai Wikipedia). Caveats and Recommendations It requires fastai 1.9 to use it directly, or it can be used via PyThaiNLP. 
It supports Thai Language only.","title":"v0.32"},{"location":"tokenizer/","text":"Tokenizer CRFcut v1.0 Model Details Developer: Chonlapat Patanajirasit This report author: Wannaphong Phatthiyaphaibun Model date: 2020-05-09 Model version: 1.0 Used in PyThaiNLP version: 2.2 + Filename: pythainlp/corpus/sentenceseg_crfcut.model GitHub: https://github.com/vistec-AI/crfcut CRF Model License: CC0 Intended Use - Segmenting Thai text into sentences. Factors - Based on known problems with thai natural Language processing. Metrics - Evaluation metrics include precision, recall and f1-score. Training Data Ted + Orchid + Fake review Evaluation Data Ted + Orchid + Fake review dataset validate Quantitative Analyses The result of CRF-Cut is trained by different datasets are as follows: dataset-train dataset-validate I-precision I-recall I-fscore E-precision E-recall E-fscore space-correct Ted Ted 0.99 0.99 0.99 0.74 0.70 0.72 0.82 Ted Orchid 0.95 0.99 0.97 0.73 0.24 0.36 0.73 Ted Fake review 0.98 0.99 0.98 0.86 0.70 0.77 0.78 Orchid Ted 0.98 0.98 0.98 0.56 0.59 0.58 0.71 Orchid Orchid 0.98 0.99 0.99 0.85 0.71 0.77 0.87 Orchid Fake review 0.97 0.99 0.98 0.77 0.63 0.69 0.70 Fake review Ted 0.99 0.95 0.97 0.42 0.85 0.56 0.56 Fake review Orchid 0.97 0.96 0.96 0.48 0.59 0.53 0.67 Fake review Fake review 1 1 1 0.98 0.96 0.97 0.97 Ted + Orchid + Fake review Ted 0.99 0.98 0.99 0.66 0.77 0.71 0.78 Ted + Orchid + Fake review Orchid 0.98 0.98 0.98 0.73 0.66 0.69 0.82 Ted + Orchid + Fake review Fake review 1 1 1 0.98 0.95 0.96 0.96 Ethical Considerations no ideas Caveats and Recommendations Thai text only Han-solo \ud83e\udebf Han-solo: Thai syllable segmenter This work wants to create a Thai syllable segmenter that can work in the Thai social media domain. 
Model Details Developer: Wannaphong Phatthiyaphaibun Model date: 2023-07-30 Model version: 1.0 Used in PyThaiNLP version: 5.0 Filename: pythainlp/corpus/han_solo.crfsuite GitHub: https://github.com/PyThaiNLP/Han-solo Pull request: https://github.com/PyThaiNLP/pythainlp/pull/830 CRF Model License: CC0 Intended Use Segmenting Thai text into syllables. Factors - Based on known problems with Thai natural language processing. Metrics F1-score Training Data Han-solo train set and Nutcha Dataset Evaluation Data Han-solo Testset Quantitative Analyses 1 is split, and 0 is not split. precision recall f1-score support 0 1.00 1.00 1.00 61078 1 1.00 0.99 0.99 29468 accuracy 1.00 90546 macro avg 1.00 1.00 1.00 90546 weighted avg 1.00 1.00 1.00 90546 Ethical Considerations The model was trained on the news and social network domains. It can have bias from the annotators and the domain. Caveats and Recommendations Thai text only","title":"Tokenizer"},{"location":"tokenizer/#tokenizer","text":"","title":"Tokenizer"},{"location":"tokenizer/#crfcut","text":"","title":"CRFcut"},{"location":"tokenizer/#v10","text":"Model Details Developer: Chonlapat Patanajirasit This report author: Wannaphong Phatthiyaphaibun Model date: 2020-05-09 Model version: 1.0 Used in PyThaiNLP version: 2.2 + Filename: pythainlp/corpus/sentenceseg_crfcut.model GitHub: https://github.com/vistec-AI/crfcut CRF Model License: CC0 Intended Use - Segmenting Thai text into sentences. Factors - Based on known problems with Thai natural language processing. Metrics - Evaluation metrics include precision, recall and f1-score. 
Training Data Ted + Orchid + Fake review Evaluation Data Ted + Orchid + Fake review dataset validate Quantitative Analyses The result of CRF-Cut is trained by different datasets are as follows: dataset-train dataset-validate I-precision I-recall I-fscore E-precision E-recall E-fscore space-correct Ted Ted 0.99 0.99 0.99 0.74 0.70 0.72 0.82 Ted Orchid 0.95 0.99 0.97 0.73 0.24 0.36 0.73 Ted Fake review 0.98 0.99 0.98 0.86 0.70 0.77 0.78 Orchid Ted 0.98 0.98 0.98 0.56 0.59 0.58 0.71 Orchid Orchid 0.98 0.99 0.99 0.85 0.71 0.77 0.87 Orchid Fake review 0.97 0.99 0.98 0.77 0.63 0.69 0.70 Fake review Ted 0.99 0.95 0.97 0.42 0.85 0.56 0.56 Fake review Orchid 0.97 0.96 0.96 0.48 0.59 0.53 0.67 Fake review Fake review 1 1 1 0.98 0.96 0.97 0.97 Ted + Orchid + Fake review Ted 0.99 0.98 0.99 0.66 0.77 0.71 0.78 Ted + Orchid + Fake review Orchid 0.98 0.98 0.98 0.73 0.66 0.69 0.82 Ted + Orchid + Fake review Fake review 1 1 1 0.98 0.95 0.96 0.96 Ethical Considerations no ideas Caveats and Recommendations Thai text only","title":"v1.0"},{"location":"tokenizer/#han-solo","text":"\ud83e\udebf Han-solo: Thai syllable segmenter This work wants to create a Thai syllable segmenter that can work in the Thai social media domain. Model Details Developer: Wannaphong Phatthiyaphaibun Model date: 2023-07-30 Model version: 1.0 Used in PyThaiNLP version: 5.0 Filename: pythainlp/corpus/han_solo.crfsuite GitHub: https://github.com/PyThaiNLP/Han-solo Pull request: https://github.com/PyThaiNLP/pythainlp/pull/830 CRF Model License: CC0 Intended Use Segmenting Thai text into syllables. Factors - Based on known problems with thai natural Language processing. Metrics F1-score Training Data Han-solo train set and Nutcha Dataset Evaluation Data Han-solo Testset Quantitative Analyses 1 is split, and 0 is not split. 
precision recall f1-score support 0 1.00 1.00 1.00 61078 1 1.00 0.99 0.99 29468 accuracy 1.00 90546 macro avg 1.00 1.00 1.00 90546 weighted avg 1.00 1.00 1.00 90546 Ethical Considerations The model was trained on the news and social network domains. It can have bias from the annotators and the domain. Caveats and Recommendations Thai text only","title":"Han-solo"},{"location":"transliteration/","text":"Transliteration Thai W2P Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-12-29 Model version: 0.1 Used in PyThaiNLP version: 2.3+ Filename: ~/pythainlp-data/w2p_0.1.npy GitHub: https://github.com/PyThaiNLP/pythainlp/pull/511 License: CC0 train notebook: https://github.com/wannaphong/Thai_W2P/blob/main/train.ipynb Intended Use Converts a Thai word to Thai phonemes. Not suitable for other languages. Factors Based on Thai word-to-phoneme conversion problems. Metrics Evaluation metrics include phoneme error rate (number of errors / number of phonemes) Training Data Thai W2P (80%) Evaluation Data Thai W2P (20%) Quantitative Analyses epoch: 100 step: 100, loss: 0.03179970383644104 step: 200, loss: 0.04126007482409477 step: 300, loss: 0.01877519115805626 step: 400, loss: 0.03311225399374962 per: 0.0432 per: 0.0419 Ethical Considerations This corpus is based on websites such as Wiktionary, the Royal Institute, and others. It may not reflect the dialect you use in everyday life. 
Caveats and Recommendations 1 Thai word only Thai2Rom Thai romanization using LSTM encoder-decoder model with attention mechanism v0.1 Model Details Developer: Chakri Lowphansirikul This report author: Wannaphong Phatthiyaphaibun Model date: 2019-08-11 Model version: 0.1 Used in PyThaiNLP version: 2.1 + Filename: ~/pythainlp-data/thai2rom-pytorch-attn-v0.1.tar GitHub: https://github.com/PyThaiNLP/pythainlp/pull/246 Train Notebook: https://github.com/lalital/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb LSTM Model Dataset: https://github.com/lalital/thai-romanization/blob/master/dataset/data.new License: CC0 Intended Use - conversion of thai text to the Roman. Factors - Based on known problems with thai natural Language processing. Metrics - Evaluation metrics include precision, recall and f1-score. Training Data Thai2Rom trainset Evaluation Data Thai2Rom testset Quantitative Analyses The model was evaluated with 3 metrics including F1-score, Exact match, Exact match at character level on the validation set (20% of the dataset or 129,642 examples). F1 (macro-average): 0.987 Exact match: 0.883 Exact match (Character-level): 0.949 Ethical Considerations no ideas Caveats and Recommendations Thai text only Thai G2P Thai Grapheme-to-Phoneme (Thai G2P) based on Deep Learning (Seq2Seq model) v0.1 Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-08-20 Model version: 0.1 Used in PyThaiNLP version: 2.2+ Filename: ~/pythainlp-data/thaig2p-0.1.tar Pull request GitHub: https://github.com/PyThaiNLP/pythainlp/pull/377 GitHub: https://github.com/wannaphong/thai-g2p Train notebook: https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb Dataset: wiktionary-11-2-2020.tsv Seq2Seq model License: CC0 Intended Use Grapheme-to-Phoneme conversion tool. Factors Based on thai grapheme-to-phoneme conversion problems. Metrics f1-score. 
Training Data wiktionary trainset Evaluation Data wiktionary testset Quantitative Analyses F1 (macro-average) = 0.9415941561267093 EM = 0.71 EM (Character-level) = 0.8660247630539959 save best model em score=0.71 at epoch=1148 Save model at epoch 1148 Epoch: 1149 | Time: 2m 55s Train Loss: 0.352 | Train PPL: 1.422 Val. Loss: 0.512 | Val. PPL: 1.669 epoch=1149, teacher_forcing_ratio=0.4 Ethical Considerations This model is based on the Thai wiktionary Dump (include bias from Thai wiktionary). Caveats and Recommendations 1 Thai word only","title":"Transliteration"},{"location":"transliteration/#transliteration","text":"","title":"Transliteration"},{"location":"transliteration/#thai-w2p","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-12-29 Model version: 0.1 Used in PyThaiNLP version: 2.3+ Filename: ~/pythainlp-data/w2p_0.1.npy GitHub: https://github.com/PyThaiNLP/pythainlp/pull/511 License: CC0 train notebook: https://github.com/wannaphong/Thai_W2P/blob/main/train.ipynb Intended Use Converter thai word to thai phoneme Not suitable for other language. Factors Based on thai word to thai phoneme problems. Metrics Evaluation metrics include phoneme error rate (number error / number phonemes) Training Data Thai W2P (80%) Evaluation Data Thai W2P (20%) Quantitative Analyses epoch: 100 step: 100, loss: 0.03179970383644104 step: 200, loss: 0.04126007482409477 step: 300, loss: 0.01877519115805626 step: 400, loss: 0.03311225399374962 per: 0.0432 per: 0.0419 Ethical Considerations This corpus is based on the website, such as wiktionary, Royal Institute et cetera and more. It may not be the dialect that you use in everyday life. 
Caveats and Recommendations 1 Thai word only","title":"Thai W2P"},{"location":"transliteration/#thai2rom","text":"Thai romanization using LSTM encoder-decoder model with attention mechanism","title":"Thai2Rom"},{"location":"transliteration/#v01","text":"Model Details Developer: Chakri Lowphansirikul This report author: Wannaphong Phatthiyaphaibun Model date: 2019-08-11 Model version: 0.1 Used in PyThaiNLP version: 2.1 + Filename: ~/pythainlp-data/thai2rom-pytorch-attn-v0.1.tar GitHub: https://github.com/PyThaiNLP/pythainlp/pull/246 Train Notebook: https://github.com/lalital/thai-romanization/blob/master/notebook/thai_romanize_pytorch_seq2seq_attention.ipynb LSTM Model Dataset: https://github.com/lalital/thai-romanization/blob/master/dataset/data.new License: CC0 Intended Use - conversion of thai text to the Roman. Factors - Based on known problems with thai natural Language processing. Metrics - Evaluation metrics include precision, recall and f1-score. Training Data Thai2Rom trainset Evaluation Data Thai2Rom testset Quantitative Analyses The model was evaluated with 3 metrics including F1-score, Exact match, Exact match at character level on the validation set (20% of the dataset or 129,642 examples). 
F1 (macro-average): 0.987 Exact match: 0.883 Exact match (Character-level): 0.949 Ethical Considerations None identified. Caveats and Recommendations Thai text only","title":"v0.1"},{"location":"transliteration/#thai-g2p","text":"Thai Grapheme-to-Phoneme (Thai G2P) based on deep learning (Seq2Seq model)","title":"Thai G2P"},{"location":"transliteration/#v01_1","text":"Model Details Developer: Wannaphong Phatthiyaphaibun This report author: Wannaphong Phatthiyaphaibun Model date: 2020-08-20 Model version: 0.1 Used in PyThaiNLP version: 2.2+ Filename: ~/pythainlp-data/thaig2p-0.1.tar Pull request GitHub: https://github.com/PyThaiNLP/pythainlp/pull/377 GitHub: https://github.com/wannaphong/thai-g2p Train notebook: https://github.com/wannaphong/thai-g2p/blob/master/train.ipynb Dataset: wiktionary-11-2-2020.tsv Seq2Seq model License: CC0 Intended Use Grapheme-to-phoneme conversion tool. Factors Based on Thai grapheme-to-phoneme conversion problems. Metrics f1-score. Training Data wiktionary trainset Evaluation Data wiktionary testset Quantitative Analyses F1 (macro-average) = 0.9415941561267093 EM = 0.71 EM (Character-level) = 0.8660247630539959 save best model em score=0.71 at epoch=1148 Save model at epoch 1148 Epoch: 1149 | Time: 2m 55s Train Loss: 0.352 | Train PPL: 1.422 Val. Loss: 0.512 | Val. PPL: 1.669 epoch=1149, teacher_forcing_ratio=0.4 Ethical Considerations This model is based on the Thai Wiktionary dump (it may include bias from Thai Wiktionary). Caveats and Recommendations One Thai word at a time only","title":"v0.1"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 9b4b5e157a3754f5932134f567c4d2d8c81506cd..11f791b380e76bb4e54c83ec20b4594a22e80ca4 100644 GIT binary patch delta 11 Scmb=gXO-{f;BcGBS_J?T4+9hc delta 11 Scmb=gXO-{f;82^$S_J?Sj{@`n diff --git a/tokenizer/index.html b/tokenizer/index.html index 9a92776..42e7d60 100644 --- a/tokenizer/index.html +++ b/tokenizer/index.html @@ -306,6 +306,7 @@

Han-solo

  • Used in PyThaiNLP version: 5.0
  • Filename: pythainlp/corpus/han_solo.crfsuite
  • GitHub: https://github.com/PyThaiNLP/Han-solo
+  • Pull request: https://github.com/PyThaiNLP/pythainlp/pull/830
  • CRF Model
  • License: CC0
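
The transliteration model cards in the search index above all report the same sequence metrics: phoneme error rate (PER, defined in the Thai W2P card as number of errors / number of phonemes), exact match (EM), and character-level EM. As a reviewer's aid, here is a minimal stdlib-only sketch of those metrics. This is illustrative code, not PyThaiNLP's actual evaluation script; in particular, the cards do not define the character-level formula precisely, so a normalized edit distance is assumed for `char_em`.

```python
def edit_distance(pred, gold):
    """Levenshtein distance between two sequences (phonemes or characters)."""
    d = list(range(len(gold) + 1))  # distances for the empty prediction prefix
    for i, p in enumerate(pred, 1):
        prev, d[0] = d[0], i  # prev holds d[i-1][j-1] as we sweep j
        for j, g in enumerate(gold, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,          # deletion
                d[j - 1] + 1,      # insertion
                prev + (p != g),   # substitution (0 cost if equal)
            )
    return d[len(gold)]

def per(pred, gold):
    # Phoneme error rate, as in the Thai W2P card: edits / gold phonemes.
    return edit_distance(pred, gold) / len(gold)

def exact_match(pairs):
    # EM: fraction of (prediction, reference) pairs that match exactly.
    return sum(p == g for p, g in pairs) / len(pairs)

def char_em(pairs):
    # One common reading of "EM (Character-level)": mean per-pair character
    # accuracy via normalized edit distance (an assumption on our part).
    return sum(max(0.0, 1 - per(list(p), list(g))) for p, g in pairs) / len(pairs)
```

For example, `per(["kh", "a", "w"], ["kh", "a", "j"])` is 1/3 (one substituted phoneme out of three), while `exact_match` over the same single pair is 0.0, which illustrates why the cards report both: EM is much stricter than PER.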