Model Card for Model ID
The paper "Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense" proposed a strong discourse paraphraser known as DIPPER.
DIPPER is a large model, built from google/t5-efficient-xxl and finetuned on 6.3M datapoints. I am proposing a lightweight, non-context equivalent for lower-cost usage.
This model is built from google/t5-efficient-large-nl32 and finetuned on 100,000 datapoints. Notably, the datapoints are all non-context. Refer to the original paper for further detail on this topic.
The dataset used to finetune this model is available here: Dataset
Model Details
Model Description
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- Developed by: Sam Jackson
- Model type: Sequence-to-Sequence Model
- Language(s) (NLP): English
- License: MIT
- Finetuned from model [optional]: google/t5-efficient-large-nl32
Model Sources [optional]
- Repository: Original Github
- Paper [optional]: Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense
Uses
The model is intended for controllable paraphrasing. The training dataset exposes two parameters, lexical (word choice) and order (paragraph structure), which control the strength of the paraphrase.
See the Example Usage section for a working example.
Direct Use
The model is usable as uploaded. No further finetuning is required, although it is possible.
Downstream Use [optional]
This model was finetuned from a T5 checkpoint. It is possible to finetune it further, if desired. If you plan to do transfer learning, I would recommend starting from the initial checkpoint instead: google/t5-efficient-large-nl32.
Recommendations
If you have the capacity, I would recommend using the more powerful model: DIPPER
Otherwise, this model is sufficiently strong. It outperforms the sentence-based paraphraser ChatGPT Paraphraser on perplexity scores when the outputs of both models are scored with facebook/opt-2.7b.
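The card does not say exactly how the perplexity scores were computed. As a minimal sketch, assuming the usual definition (the exponential of the mean per-token negative log-likelihood under the scoring model), a paraphrase could be scored as below. The function names are hypothetical, and score_text downloads the scorer weights when it is called.

```python
import math


def perplexity_from_nll(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)


def score_text(text: str, scorer_name: str = "facebook/opt-2.7b") -> float:
    """Score one text with a causal LM (assumed setup, not the author's script)."""
    # Heavy imports stay local so the pure helper above works without them.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(scorer_name)
    model = AutoModelForCausalLM.from_pretrained(scorer_name)
    model.eval()

    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return perplexity_from_nll(loss.item())
```

Lower perplexity means the scoring model finds the text more predictable, so comparing two paraphrasers amounts to averaging score_text over the outputs of each on the same inputs.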
How to Get Started with the Model
See the Example Usage section below for code to get started with the model.
Training Details
Training Data
As mentioned, the training data is here: kpar3-no-ctx. Pre-processing consists only of tokenisation with the google/t5-efficient-large-nl32 tokenizer.
The data consists of standard paraphrase pairs, except that the first element of each pair carries the terms "lexical = x" and "order = y". The values x and y are in the set {0, 20, 40, 60, 80, 100} and denote the strength with which the model should paraphrase.
In particular, a sentence with "lexical = 0" should change as many words as possible while maintaining the original meaning. Similarly, a sentence with "order = 0" should restructure the paragraph as much as the model can.
The dataset only contains parameter values in increments of 20.
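Because the training data only uses multiples of 20, it is probably safest to keep inference-time values on the same grid. The helpers below are a minimal sketch of that idea: snap_to_level and build_prompt are hypothetical names, and the prompt layout is taken from the example usage in this card.

```python
ALLOWED_LEVELS = (0, 20, 40, 60, 80, 100)


def snap_to_level(value: float) -> int:
    """Round a value in [0, 100] to the nearest multiple of 20."""
    if not 0 <= value <= 100:
        raise ValueError(f"control value must be in [0, 100], got {value}")
    # int(x + 0.5) rounds halves up, avoiding Python's banker's rounding.
    return 20 * int(value / 20 + 0.5)


def build_prompt(text: str, lexical: float, order: float) -> str:
    """Prefix the text with control codes in the 'lexical = x, order = y' form."""
    lex = snap_to_level(lexical)
    ord_ = snap_to_level(order)
    return f"lexical = {lex}, order = {ord_} {text}"
```

For example, build_prompt("Hello.", 33, 0) snaps the lexical value to 40 and leaves order at 0, which per the description above asks for the strongest restructuring.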
Training Hyperparameters
- Training regime:
learning_rate = 1e-4
bf16 = True
num_train_epochs = 2
auto_find_batch_size = True
generation_num_beams = 2
generation_max_length = 200
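The names above match fields of Seq2SeqTrainingArguments in transformers, so the regime can plausibly be reconstructed as follows. This is a sketch, not the author's training script: output_dir is a hypothetical path, bf16 needs hardware that supports it, and auto_find_batch_size relies on the accelerate package.

```python
# Values copied from the training regime listed above.
TRAINING_KWARGS = {
    "learning_rate": 1e-4,
    "bf16": True,
    "num_train_epochs": 2,
    "auto_find_batch_size": True,
    "generation_num_beams": 2,
    "generation_max_length": 200,
}


def build_training_args(output_dir: str = "dipper-no-ctx-finetune"):
    """Build Seq2SeqTrainingArguments; output_dir is a hypothetical path."""
    # Local import keeps the dictionary usable without transformers installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

The resulting object would be passed to a Seq2SeqTrainer together with the model, tokenizer, and tokenised dataset.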
Speeds, Sizes, Times [optional]
Finetuning on 100,000 datapoints took around 14 GPU hours on an RTX 3090.
Example Usage
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The tokenizer comes from the base checkpoint; the weights come from the finetuned model.
tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-large-nl32")
model = AutoModelForSeq2SeqLM.from_pretrained("SamSJackson/paraphrase-dipper-no-ctx")
model = model.to(device)

text = "Each Wednesday, I take my dog for a walk in Central Park."

# Control codes: multiples of 20 from 0 to 100; lower values paraphrase more strongly.
lexical = 20
order = 40

prompt = f"lexical = {lexical}, order = {order} {text}"

# Tokenise the prompt and move the resulting tensors to the model's device.
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding="longest",
    max_length=1000,
    truncation=True,
).to(device)

# Nucleus sampling, so the output varies between runs.
outputs = model.generate(
    **inputs,
    top_p=0.75,
    do_sample=True,
    max_new_tokens=300,
)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
response = " ".join(response)
print(response)
Citation [optional]
BibTeX:
@misc{krishna2023paraphrasing,
title={Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense},
author={Kalpesh Krishna and Yixiao Song and Marzena Karpinska and John Wieting and Mohit Iyyer},
year={2023},
eprint={2303.13408},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Model Card Contact
Contact me through Hugging Face if you have any questions.