Release/530 #253

Merged
merged 29 commits into from
Mar 8, 2024
7587ff3
Added new Visual Document Classifier Annotator
gadde5300 Nov 8, 2023
a8f4f23
Added Albert_For_Question_Answering to component_universes
Dec 8, 2023
e358ce7
BartForZeroShotClassification Integration
Dec 9, 2023
e889ccd
XlmRobertaForZeroShotClassification Integration
Dec 10, 2023
0da21cb
Updated Visual Document classifier
gadde5300 Dec 23, 2023
6ac9297
Update ocr_extractors.py
gadde5300 Dec 23, 2023
5b59c7f
Update ocr_table_extraction_tests.py
gadde5300 Dec 23, 2023
28c9046
OpenAI Annotators Integration
Dec 30, 2023
a50ac81
Fixed spark start with apple silicon
sonurdogan Dec 30, 2023
444cb1b
Added Test and Colab nb for OpenAI annotators
sonurdogan Dec 31, 2023
f26e701
Added new tutorial notebook
gadde5300 Jan 2, 2024
9abff28
BGEEmbeddings Integration
sonurdogan Jan 21, 2024
92855fc
a new model called ner_protein_glove in english is added to the nlu
SKocer Feb 16, 2024
99040ec
make setup.py depend on nlu source code version directly
C-K-Loan Feb 28, 2024
c53ff83
drop modifcation time from ocr dfs because of bug in db env and add m…
C-K-Loan Feb 28, 2024
c6c735e
DeBertaForZeroShotClassification Integration
SKocer Mar 3, 2024
161b687
Bugfix improper file-handling on Databricks for visual and audio models
C-K-Loan Mar 5, 2024
e9a4835
Merge pull request #225 from JohnSnowLabs/sod_albertqa_fix
C-K-Loan Mar 5, 2024
48132cf
Merge pull request #226 from JohnSnowLabs/sod_bartzeroshot_integration
C-K-Loan Mar 5, 2024
c35d636
Merge remote-tracking branch 'origin/sod_xmlrobertazeroshot_integrati…
C-K-Loan Mar 5, 2024
dd82d0d
Merge remote-tracking branch 'origin/sod_bgee_integration' into relea…
C-K-Loan Mar 5, 2024
422355f
Merge remote-tracking branch 'origin/sod_openai_completion_integratio…
C-K-Loan Mar 5, 2024
7735712
Merge remote-tracking branch 'origin/sk-03032024-nlu_deberta_zero_sho…
C-K-Loan Mar 5, 2024
419fe37
Merge pull request #248 from JohnSnowLabs/skocer-16022024-nlu_ner_pro…
C-K-Loan Mar 5, 2024
d9c7a64
Merge remote-tracking branch 'origin/visual/classifier' into release/515
C-K-Loan Mar 5, 2024
91c026d
Fix import OCR annos only if OCR installed
C-K-Loan Mar 6, 2024
6246a2d
remove missplaced sparknlp.start()
C-K-Loan Mar 8, 2024
a1f46b5
Fix bug with validating file paths on dbfs
C-K-Loan Mar 8, 2024
10ff7d7
Merge pull request #252 from JohnSnowLabs/release/515
C-K-Loan Mar 8, 2024
@@ -0,0 +1,313 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)"
],
"metadata": {
"id": "7A9NQR0tVbWf"
}
},
{
"cell_type": "markdown",
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/tree/master/examples/colab/component_examples/classifiers/Bart_Zero_Shot_Classifiers.ipynb)"
],
"metadata": {
"id": "XCxDeiyZxNyV"
}
},
{
"cell_type": "markdown",
"source": [
"### **Zero Shot Classifiers**"
],
"metadata": {
"id": "ba7qk8Dwxc29"
}
},
{
"cell_type": "markdown",
"source": [
"### Zero Shot Text Classification\n",
"\n",
"State-of-the-art NLP models for text classification without annotated data\n",
"\n",
"Natural language processing is a very exciting field right now. In recent years, the community has begun to figure out some pretty effective methods of learning from the enormous amounts of unlabeled data available on the internet. The success of transfer learning from unsupervised models has allowed us to surpass virtually all existing benchmarks on downstream supervised learning tasks. As we continue to develop new model architectures and unsupervised learning objectives, \"state of the art\" continues to be a rapidly moving target for many tasks where large amounts of labeled data are available.\n",
"\n",
"### Zero Shot learning\n",
"\n",
"Zero-shot Learning (ZSL) is one of the most recent advancements in Machine Learning aimed to train Deep Neural Network models to have higher generalisability on unseen data. One of the most prominent methods of training such models is to use text prompts that explain the task to be solved, along with all possible outputs.\n",
"\n",
"The primary aim of using ZSL over supervised learning is to address the following limitations of training traditional supervised learning models:\n",
"\n",
"1. Training supervised NLP models require substantial amount of training data.\n",
"2. Even with recent trend of fine-tuning large language models, the supervised approach of training or fine-tuning a model is basically to learn a very specific data distribution, which results in low performance when applied to diverse and unseen data.\n",
"3. The classical annotate-train-test cycle is highly demanding in terms of temporal and human resources."
],
"metadata": {
"id": "VOktZCAgxffG"
}
},
{
"cell_type": "markdown",
"source": [
"### Bart Zero Shot Classifier\n",
"\n",
"This model is intended to be used for zero-shot text classification, especially in English. It is fine-tuned on MNLI by using large BART model.\n",
"\n",
"BartForZeroShotClassification using a ModelForSequenceClassification trained on MNLI tasks. Equivalent of BartForSequenceClassification models, but these models don’t require a hardcoded number of potential classes, they can be chosen at runtime. It usually means it’s slower but it is much more flexible.\n",
"\n",
"We used TFBartForSequenceClassification to train this model and used BartForZeroShotClassification annotator in Spark NLP 🚀 for prediction at scale"
],
"metadata": {
"id": "MvfFWlHrxmGQ"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8w2RtQGCU_Xg"
},
"outputs": [],
"source": [
"!pip install nlu\n",
"!pip install pyspark==3.4.1"
]
},
{
"cell_type": "code",
"source": [
"import nlu\n",
"import pandas as pd"
],
"metadata": {
"id": "mU_7-Y4nVZXA"
},
"execution_count": 3,
"outputs": []
},
{
"cell_type": "code",
"source": [
"text = ['I have a problem with my hotel reservation that needs to be resolved asap!!']"
],
"metadata": {
"id": "Bn1xHZfGVqJA"
},
"execution_count": 4,
"outputs": []
},
{
"cell_type": "code",
"source": [
"bart_zero_shot = nlu.load('en.bart.zero_shot_classifier')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "29Q7-Riqw74U",
"outputId": "e7ed4737-efbb-48cc-fbdd-40d01545cb27"
},
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"bart_large_zero_shot_classifier_mnli download started this may take some time.\n",
"Approximate size to download 445.4 MB\n",
"[OK!]\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"results = bart_zero_shot.predict(text, output_level = 'document')"
],
"metadata": {
"id": "34efPvSQw9cl"
},
"execution_count": 8,
"outputs": []
},
{
"cell_type": "code",
"source": [
"results"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 81
},
"id": "s75RpDk6w-5f",
"outputId": "8cd5cdf1-2982-478b-a05e-819c0bd4b2d2"
},
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" classified_sequence classified_sequence_confidence \\\n",
"0 [travel] [0.12591693] \n",
"\n",
" document \n",
"0 I have a problem with my hotel reservation tha... "
],
"text/html": [
"\n",
" <div id=\"df-69db0df6-4398-4d2e-8abe-7b957daed517\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>classified_sequence</th>\n",
" <th>classified_sequence_confidence</th>\n",
" <th>document</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>[travel]</td>\n",
" <td>[0.12591693]</td>\n",
" <td>I have a problem with my hotel reservation tha...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-69db0df6-4398-4d2e-8abe-7b957daed517')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-69db0df6-4398-4d2e-8abe-7b957daed517 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-69db0df6-4398-4d2e-8abe-7b957daed517');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 9
}
]
},
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "UgFt5oRbogZw"
},
"execution_count": null,
"outputs": []
}
]
}
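The notebook above notes that the classifier is trained on MNLI and that its classes can be chosen at runtime. A common way to get that behaviour from an NLI model is to turn each candidate label into a hypothesis sentence, score how strongly the input text entails each hypothesis, and normalise the scores across labels. The following is a minimal sketch of that idea only. It does not call Spark NLP or NLU, and the hypothesis template, the `zero_shot_classify` helper, the `TOY_RELATED_WORDS` table and the `toy_entailment_score` function are all hypothetical stand-ins for a real entailment model.

```python
import math
from typing import Callable, Dict, List

# Assumed hypothesis template; the wording is illustrative, not taken from the PR.
HYPOTHESIS_TEMPLATE = "This example is about {}."


def build_hypotheses(labels: List[str]) -> Dict[str, str]:
    """Turn each candidate label into an NLI hypothesis sentence."""
    return {label: HYPOTHESIS_TEMPLATE.format(label) for label in labels}


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(
    text: str,
    labels: List[str],
    entailment_score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score every label chosen at call time; no fixed class count is needed."""
    hypotheses = build_hypotheses(labels)
    raw = [entailment_score(text, hypotheses[label]) for label in labels]
    return dict(zip(labels, softmax(raw)))


# Hypothetical association table standing in for what a trained model has learned.
TOY_RELATED_WORDS = {
    "travel": {"hotel", "reservation", "flight"},
    "cooking": {"recipe", "oven"},
    "politics": {"election", "senate"},
}


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: counts premise words related to the hypothesis label."""
    premise_words = {w.strip(".,!?").lower() for w in premise.split()}
    score = 0.0
    for label, related in TOY_RELATED_WORDS.items():
        if label in hypothesis.lower():
            score += len(premise_words & related)
    return score


text = "I have a problem with my hotel reservation that needs to be resolved asap!!"
scores = zero_shot_classify(text, ["travel", "cooking", "politics"], toy_entailment_score)
print(max(scores, key=scores.get), scores)
```

Because the label list is just an argument, the same function can be called again with a different set of labels without retraining anything, which is the flexibility the notebook describes. The cost is one entailment pass per label, which is consistent with the notebook's remark that this approach is usually slower.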