This repository contains the code, data, and pre-trained models for our EMNLP 2024 paper: A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick
Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior work generates mnemonics for students, but it does not guide models toward mnemonics that students prefer and that aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to gather preferences on which mnemonics students favor. We collect 2684 preferences from 45 students across two types: expressed (inferred from ratings) and observed (inferred from student learning), yielding three key findings:
- Expressed and observed preferences disagree; what students think is helpful does not fully capture what is truly helpful.
- Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which we show resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains.
- Mnemonic experts assess SMART as matching GPT-4, at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.
Our datasets and models can all be downloaded here. More specifically, we provide the following datasets and pre-trained models:
- Mnemonic Fine-tuning Data
- Mnemonic Student Preferences Data
- Mnemonic Chosen/Rejected with Bayesian Labels
- Mnemonic Test Set
We provide the code for training SMART (or any LLM), for learning the Bayesian effectiveness labels from diverse student feedback, and for evaluating the pairwise quality of two model-generated mnemonic devices.
This project was written in Python 3.10.0 and all packages were installed with Anaconda version 23.5.0 and pip version 24.0. All necessary packages can be installed with:
```
pip install -r requirements.txt
```
We have released the fine-tuning and DPO datasets using our combined Bayesian preference labels, so you can reproduce our trained model with the following steps.
- Navigate to `/model/`
- Set the specified parameters in `config.py` (described in the file)
- Run `python SFT/create_initial_model.py`. This creates the initial LLaMA model with an extra token for padding (see the sketch after this list). Requires high CPU memory (1024 GB for LLaMA-2 70B)
- Run `python SFT/train.py`. This trains the initial SMART model with supervised fine-tuning and LoRA. Requires high GPU memory (192 GB for LLaMA-2 70B)
- Run `python SFT/merge.py`. This merges the LoRA fine-tuned model into the original model. Requires high CPU memory (1024 GB for LLaMA-2 70B)
- Run `python DPO/train.py`. This further tunes the initial SMART model with DPO. Requires high GPU memory (192 GB for LLaMA-2 70B)
- Run `python DPO/merge.py`. This merges the LoRA DPO model into the original model. Requires high CPU memory (1024 GB for LLaMA-2 70B)
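For reference, the padding step in the first script conceptually amounts to adding a pad token and resizing the embedding matrix. The sketch below is an illustration of that idea, not the verbatim contents of `SFT/create_initial_model.py`; the model name and save path are placeholders that would normally come from `config.py`.

```python
# Illustrative sketch only -- not the exact contents of SFT/create_initial_model.py.
# The model name and output directory are placeholders; set them via config.py.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-70b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LLaMA-2 ships without a pad token, so add one and grow the embedding matrix to match
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("initial_model/")
model.save_pretrained("initial_model/")
```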
All training hyperparameters are set to the ones we used to train SMART with LLaMA-2 70B. Different LLMs may need different hyperparameters: in preliminary testing with LLaMA-2 7B, our current hyperparameters did not lead to large improvements with DPO (a 2% difference in Win/Loss Ratio versus the 10% difference seen for LLaMA-2 70B), so some hyperparameter search may be necessary for optimal results.
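As a refresher on what the DPO step optimizes (and where hyperparameters like beta enter), here is a minimal, self-contained sketch of the standard DPO objective on per-sequence log-probabilities. It is a conceptual illustration, not the training code in `DPO/train.py`.

```python
# Conceptual sketch of the DPO objective, not the code in DPO/train.py.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs are summed log-probs of full sequences, shape (batch,).

    beta controls how far the policy may drift from the reference model;
    it is one of the hyperparameters that may need re-tuning for smaller LLMs.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random numbers, just to show the expected shapes
logps = torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)
print(dpo_loss(*logps).item())
```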
Once training is complete, you can run inference with the SFT model using `python SFT/inference.py` and with the DPO model using `python DPO/inference.py`. The results will be saved as a `.pkl` file in your specified `results_dir` folder.
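After inference finishes, the saved results can be inspected with the standard `pickle` module. The file name below is hypothetical; substitute whatever file the inference script wrote to your `results_dir`.

```python
import os
import pickle

results_dir = "results"            # set to the results_dir from config.py
fname = "dpo_generations.pkl"      # hypothetical name; use the file inference.py produced

with open(os.path.join(results_dir, fname), "rb") as f:
    results = pickle.load(f)

# Inspect the structure before any downstream analysis
print(type(results))
```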
You can run our Bayesian model with the command `python Bayesian/bayesian.py` (while staying in the `/model` folder). We have included the preprocessed data this model expects as input in `Bayesian/bayesian_data.pkl` (derived from the student preference data), and `bayesian.py` documents what each of its fields means. After running `bayesian.py`, we aggregate the latent effectiveness across chains and epochs from NUTS to construct a preference dataset, which will be saved under `Bayesian/chosen_rejected_data`.
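To give a feel for this kind of setup, below is a heavily simplified sketch of a latent-effectiveness model sampled with NUTS, written here with NumPyro on toy data. The actual model, priors, likelihood, and data fields in `Bayesian/bayesian.py` differ (and may use a different library), so treat this purely as an illustration.

```python
# Simplified illustration of fitting latent mnemonic effectiveness with NUTS.
# The real model in Bayesian/bayesian.py uses a different structure and inputs.
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

N_MNEMONICS = 4  # toy example size

def model(pair_a, pair_b, preferred_a=None):
    # One latent effectiveness score per mnemonic
    eff = numpyro.sample("effectiveness",
                         dist.Normal(0.0, 1.0).expand([N_MNEMONICS]))
    # Bradley-Terry-style likelihood: mnemonic a "wins" with prob sigmoid(eff_a - eff_b)
    numpyro.sample("obs", dist.Bernoulli(logits=eff[pair_a] - eff[pair_b]),
                   obs=preferred_a)

# Toy preference data: which mnemonic in each pair the student favored
pair_a = jnp.array([0, 0, 1, 2])
pair_b = jnp.array([1, 2, 2, 3])
preferred_a = jnp.array([1, 1, 0, 1])

mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000, num_chains=1)
mcmc.run(random.PRNGKey(0), pair_a, pair_b, preferred_a)
eff_samples = mcmc.get_samples()["effectiveness"]  # aggregate these across draws
print(eff_samples.mean(axis=0))
```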
We also provide the code for comparing the quality of two mnemonic devices with GPT-4, which can be run by navigating to the `evaluate` folder and running `dspy_clf.py`. The code is currently set up to compare the fine-tuned and DPO-trained LLMs for mnemonic generation, but this can be changed by altering which results are loaded on lines 87 to 90.
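For orientation, a pairwise GPT-4 comparison in DSPy typically looks something like the sketch below. The signature, field names, and model identifier here are illustrative (not taken from `dspy_clf.py`), and DSPy's LM-configuration API differs across versions, so defer to the actual script for the real setup.

```python
# Illustrative DSPy sketch; field names and LM setup are placeholders, not dspy_clf.py.
import dspy

# Configure the judge LM (requires OPENAI_API_KEY; API shown follows recent DSPy versions)
dspy.configure(lm=dspy.LM("openai/gpt-4"))

class CompareMnemonics(dspy.Signature):
    """Decide which of two mnemonic devices better helps a student learn the term."""
    term = dspy.InputField()
    mnemonic_a = dspy.InputField()
    mnemonic_b = dspy.InputField()
    better = dspy.OutputField(desc="either 'A' or 'B'")

judge = dspy.Predict(CompareMnemonics)
result = judge(term="gregarious",
               mnemonic_a="Sounds like 'Greg Garius', the friendliest guy at the party.",
               mnemonic_b="Greg plus Aquarius.")
print(result.better)
```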
If you found our code, datasets, or paper useful, please cite:
@inproceedings{balepur-etal-2024-smart,
title = "A {SMART} Mnemonic Sounds like {``}Glue Tonic{''}: Mixing {LLM}s with Student Feedback to Make Mnemonic Learning Stick",
author = "Balepur, Nishant and
Shu, Matthew and
Hoyle, Alexander and
Robey, Alison and
Feng, Shi and
Goldfarb-Tarrant, Seraphina and
Boyd-Graber, Jordan Lee",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.786",
doi = "10.18653/v1/2024.emnlp-main.786",
pages = "14202--14225",
abstract = "Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior work generates mnemonics for students, but they do not train models using mnemonics students prefer and aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to find preferences on mnemonics students favor. We gather 2684 preferences from 45 students across two types: **expressed** (inferred from ratings) and **observed** (inferred from student learning), yielding three key findings. First, expressed and observed preferences disagree; what students *think* is helpful does not always capture what is *truly* helpful. Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4 at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.",
}
If you have any questions or problems with the code, feel free to raise an issue or email me at [email protected]. Thank you!