# SynPL: a zero-shot language prompt model to process multiple-choice questions on synonyms

![Transformers](https://img.shields.io/badge/%F0%9F%A4%97%20TRANSFORMERS-4.10.1-blue)

- [SynPL: a zero-shot language prompt model to process multiple-choice questions on synonyms](#synpl-a-zero-shot-language-prompt-model-to-process-multiple-choice-questions-on-synonyms)
- [Overview](#overview)
- [Background](#background)
- [Approaches](#approaches)
- [Results](#results)
- [Findings](#findings)
- [Datasets](#datasets)
- [Question structure](#question-structure)
- [*AutoModelForMultipleChoice* model](#automodelformultiplechoice-model)
- [Three input patterns](#three-input-patterns)
- [Finetune](#finetune)
- [Results](#results-1)
- [Fill-mask prompt model](#fill-mask-prompt-model)
- [Background](#background-1)
- [Prompt idea](#prompt-idea)
- [Finetune](#finetune-1)
- [Definition of accuracy](#definition-of-accuracy)
- [Data processing](#data-processing)
- [Results](#results-2)
- [MultiOpt VS Prompt](#multiopt-vs-prompt)
- [Results](#results-3)
- [Requirements](#requirements)
- [Model files](#model-files)

## Overview

### Background

Multiple-choice questions are a classic section of exams. When taking a language test such as TOEFL®, some synonym questions require you to select the "best" choice from a set of four words or phrases, the one closest in meaning to a "keyword" in the context of a reading passage.

Here my objective is to build language models that process this kind of question automatically.

### Approaches

I developed 2 kinds of language models to solve this problem.

The first is to use the `AutoModelForMultipleChoice` pre-trained model from [🤗TRANSFORMERS](https://huggingface.co/transformers). This is a generic model with a multiple-choice head, which yields a score for each input in a given selection. We select the input with the highest score as the most plausible one.

The second is to build a prompt for the classic fill-mask model, so that this multiple-choice task can be formulated as a masked language modeling problem, which is what pre-trained models like BERT are designed for in the first place.

[Figure 1](#fig1) shows the schematic, where:

- the **MultiOpt_KO** model is a finetuned *AutoModelForMultipleChoice* pre-trained BERT model with the `Keyword [SEP] Options` pattern as input
- the **Prompt_Bi** model is the classic fill-mask model with a prompt

<i id="fig1"></i>

![Fig 1. Schematic for 2 models](img/Schematic.png)

**Fig 1.** Schematic for 2 models

### Results

[Figure 2](#fig2) shows the final results of these 2 models, where **"Raw_BERT"** is the un-finetuned *AutoModelForMultipleChoice* pre-trained BERT model.

<i id="fig2"></i>

![Fig 2. Final result of 2 models](img/Accuracy_Final_Models.png)

**Fig 2.** Final result of 2 models

### Findings

- As with humans, the models don't need the context for this kind of synonym question.
- The prompt model has the best performance in terms of accuracy, input conciseness, and the amount of training data needed (because it can be finetuned on other datasets first).
- To the best of my knowledge, prompting is a novel way of applying language models to this kind of problem.

## Datasets

### Question structure

A typical synonym multiple-choice question from the TOEFL® test consists of 4 parts: **Context**, **Question**, 4 **Options**, and the **Answer**. We can transform those strings via Python's [`re`](https://docs.python.org/3/library/re.html) library into a structured data frame:

| ID | Context | Question | Opt1 | Opt2 | Opt3 | Opt4 | Ans |
| ---: | :------------------------------------------------ | :------------------------------------------------ | :-------- | :-------- | :------------ | :---------- | :--- |
| 0 | Most of these leaders were involved in public ... | The word "representative" is closest in meanin... | typical | satisfied | supportive | distinctive | A |
| 1 | In the United States, Louis Comfort Tiffany (1... | The word "prized" is closest in meaning to whi... | valued | universal | uncommon | preserved | A |
| 2 | The Art Nouveau style was a major force in the... | The word "overtaken" is closest in meaning to ... | surpassed | inclined | expressed | applied | A |
| 3 | During most of their lives, surge glaciers beh... | The word "intervals" is closest in meaning to ... | records | speeds | distances | periods | D |
| 4 | The increasing water pressure under the glacie... | The word "freeing" is closest in meaning to wh... | pushing | releasing | strengthening | draining | B |
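
A minimal sketch of that parsing step (the plain-text layout below is hypothetical, so the patterns would need adjusting to the real question files):

```python
import re
import pandas as pd

# Hypothetical raw question block; the actual file layout may differ.
raw = '''Most of these leaders were involved in public life...
The word "representative" is closest in meaning to which of the following?
A. typical
B. satisfied
C. supportive
D. distinctive
Answer: A'''

def parse_question(block):
    lines = block.strip().splitlines()
    # The question line is the one mentioning "closest in meaning".
    q = next(i for i, line in enumerate(lines) if "closest in meaning" in line)
    opts = [re.match(r"^[A-D]\.\s*(.+)$", line).group(1)
            for line in lines[q + 1:q + 5]]
    ans = re.search(r"Answer:\s*([A-D])", block).group(1)
    return {"Context": " ".join(lines[:q]), "Question": lines[q],
            "Opt1": opts[0], "Opt2": opts[1],
            "Opt3": opts[2], "Opt4": opts[3], "Ans": ans}

df = pd.DataFrame([parse_question(raw)])
```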

## *AutoModelForMultipleChoice* model

### Three input patterns

There are 3 input patterns we can try to build and then feed into the model:

1. `Context [SEP] Keyword [SEP] Options` - **MultiOpt_CKO**
1. `Keyword [SEP] Options` - **MultiOpt_KO**
1. `Keyword sentence [SEP] Option sentences` - **MultiOpt_KsOs**

![Fig 3. Diagram of 3 input patterns](img/Three_Input_Patterns_MultiOpt.png)

**Fig 3.** Diagram of 3 input patterns. *Where **C** stands for **Context**, **K** stands for **Keyword**, **O** stands for **Options**, **Ks** stands for **Keyword sentence** and **Os** stands for **Option sentences**.*
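
To make the patterns concrete, here is a minimal sketch of scoring one question with the `Keyword [SEP] Options` pattern (the checkpoint and example values are placeholders, not the repo's training code; before finetuning, the multiple-choice head is randomly initialized, which corresponds to the "Raw BERT" baseline):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

keyword = "prized"
options = ["valued", "universal", "uncommon", "preserved"]

# Build "keyword [SEP] option" once per option.
enc = tokenizer([keyword] * len(options), options,
                padding=True, return_tensors="pt")
# The model expects tensors of shape (batch_size, num_choices, seq_len).
batch = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**batch).logits  # shape: (1, num_choices)
print(options[logits.argmax(dim=-1).item()])
```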

### Finetune

I split each dataset into 70% training data and 30% test data, using the same random seed throughout. Each model was finetuned 5 times with different random initial values.
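
A minimal sketch of such a split, assuming the structured data frame from above (the seed value here is a placeholder):

```python
from sklearn.model_selection import train_test_split

# 70% train / 30% test, reproducible via a fixed random seed.
train_df, test_df = train_test_split(df, train_size=0.7, random_state=0)
```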

### Results

[Table 1](#tab1) and [Figure 4](#fig4) show the mean accuracies for 3 models.

<i id="tab1"></i>

**Table 1.** Evaluation Metrics for 3 *AutoModelForMultipleChoice* Models

| Model | Accuracy | Precision | Recall | F1 | AP |
| ------------- | -------- | --------- | ------ | ------ | ------ |
| MultiOpt_CKO | 46.16% | 0.4503 | 0.5186 | 0.3891 | 0.3520 |
| MultiOpt_KO | 72.17% | 0.7095 | 0.7230 | 0.7264 | 0.8902 |
| MultiOpt_KsOs | 35.23% | 0.3393 | 0.3490 | 0.3468 | 0.4865 |

**MultiOpt_KO** (the *Keyword [SEP] Options* pattern) got the best result. Not only does it have the highest accuracy, it is also the most robust, achieving nearly consistent prediction performance when trained from different initializations.

<i id="fig4"></i>

![Fig 4. Accuracy(%) of 3 Models](img/Accuracy_for_3_MultiOpt.png)

**Fig 4.** Accuracy(%) of 3 Models

This actually matches our intuition: when doing TOEFL® synonym multiple-choice questions, most of the time we don't have to read the context. By examining only the keyword and the options, we can still select the best choice.

## Fill-mask prompt model

### Background

Prompts are small templates inserted into the inputs, with the aim that a task can be reformulated as the kind of problem a pre-trained language model was designed to solve.

For example, given a text classification problem of grading the movie review "The drama discloses nothing", you can insert the text "It was ____" at the end of the review. After running the model, you just need to compare the logit scores of "terrible" and "great" to determine whether the review is positive or negative.

![Fig 5. A typical example of using a prompt](img/Prompt_Example.png)

**Fig 5.** A typical example of using a prompt
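
A minimal sketch of this trick, assuming `bert-base-uncased` as the checkpoint:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

review = "The drama discloses nothing."
prompt = f"{review} It was {tokenizer.mask_token}."

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

# Compare the logit scores of the two label words at the mask position.
for word in ("great", "terrible"):
    print(word, logits[tokenizer.convert_tokens_to_ids(word)].item())
```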

You can read [Gao's post](https://thegradient.pub/prompting/) for more information.

### Prompt idea

The prompt idea can be applied to the TOEFL® dataset like this:
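
The repo's exact template is not reproduced here; below is a minimal sketch assuming a template that states the synonym relation directly and scores the four options at the mask position:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Hypothetical prompt wording; the actual SynPL template may differ.
prompt = 'The word "prized" is closest in meaning to [MASK].'
options = ["valued", "universal", "uncommon", "preserved"]

# `targets` restricts the pipeline's scoring to the four option words.
preds = fill(prompt, targets=options)
best = max(preds, key=lambda p: p["score"])
print(best["token_str"], best["score"])
```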

### Finetune

We still need to finetune the model with this prompt, because we should let the model know that it's facing a synonym problem. See [Figure 6](#fig6) and [Table 2](#tab2) for how this step can boost its performance.

<i id="fig6"></i>

![Fig 6. An example of before and after finetuning](img/After_Before_Finetune.png)

**Fig 6.** An example of before and after finetuning.
*Here you can see that, after finetuning, the output words are more like synonyms.*

### Definition of accuracy
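
Roughly speaking, a prediction counts as correct at Top *N* when the answer word appears among the model's top *N* outputs for the mask. A minimal sketch of that check (the helper name is mine, not the repo's):

```python
def top_n_correct(predicted_words, answer, n):
    """True if the answer appears among the model's top-n mask fillers."""
    return answer in predicted_words[:n]

# e.g. with the fill-mask pipeline from above:
#   preds = fill(prompt, top_k=10)
#   words = [p["token_str"] for p in preds]
#   hit_at_5 = top_n_correct(words, "valued", 5)
```
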
### Results

In conclusion, there are 3 models in this part.

<i id="tab2"></i>

**Table 2.** Accuracy of prompt models

| Model | Top 1 | Top 2 | Top 3 | Top 5 | Top 10 |
| :--------- | :----- | :----- | :----- | :----- | :----- |
| Raw_BERT | 5.51% | 9.68% | 12.24% | 15.47% | 20.48% |
| Prompt_Uni | 12.13% | 33.12% | 44.34% | 55.39% | 65.13% |
| Prompt_Bi | 13.01% | 43.24% | 53.12% | 64.93% | 74.12% |

<i id="fig7"></i>

![Fig 7. Accuracy of prompt models](img/Accuracy_prompt.png)

**Fig 7.** Accuracy of prompt models

Below are some other metrics, where *N* is the Top-*N* cutoff.
| Model | N | Recall | F1 |
| :--------- | ---: | -----: | -----: |
| Raw_BERT | 1 | 0.0275 | 0.0522 |
| Prompt_Uni | 1 | 0.0587 | 0.1051 |
| Prompt_Bi | 1 | 0.0662 | 0.117 |
| Raw_BERT | 2 | 0.0484 | 0.0883 |
| Prompt_Uni | 2 | 0.1661 | 0.2494 |
| Prompt_Bi | 2 | 0.2101 | 0.2958 |
| Raw_BERT | 3 | 0.0612 | 0.1091 |
| Prompt_Uni | 3 | 0.2179 | 0.3035 |
| Prompt_Bi | 3 | 0.2643 | 0.3458 |
| Raw_BERT | 5 | 0.0774 | 0.134 |
| Prompt_Uni | 5 | 0.2755 | 0.3552 |
| Prompt_Bi | 5 | 0.3214 | 0.3913 |
| Raw_BERT | 10 | 0.1024 | 0.17 |
| Prompt_Uni | 10 | 0.3367 | 0.4024 |
| Prompt_Bi | 10 | 0.3745 | 0.4283 |

## MultiOpt VS Prompt

These models all use the same test data.

### Results

[Table 3](#tab3) and [Figure 2](#fig2) show how these models perform on the TOEFL® dataset.

<i id="tab3"></i>

**Table 3.** Metrics of 3 final models

| Model | Accuracy | Precision | Recall | F1 | AP |
| :---------- | :------- | :-------- | :----- | :----- | :----- |
| Raw_BERT | 40.88% | 0.4034 | 0.4066 | 0.4041 | 0.5956 |
| MultiOpt_KO | 73.18% | 0.7385 | 0.7217 | 0.7341 | 0.8894 |
| Prompt_Bi | 89.69% | 0.9035 | 0.8977 | 0.9020 | 0.9683 |

## Requirements
