This repository contains the code, data, and pre-trained models for our EMNLP 2024 paper: A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick
Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior work generates mnemonics for students, but it does not guide models toward mnemonics that students prefer and that aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to gather preferences on which mnemonics students favor. We collect 2684 preferences from 45 students across two types: expressed (inferred from ratings) and observed (inferred from student learning), yielding three key findings:
- Expressed and observed preferences disagree; what students think is helpful does not fully capture what is truly helpful.
- Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which we show resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains.
- Mnemonic experts assess SMART as matching GPT-4, at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.
Our datasets and models can all be downloaded here. More specifically, we provide the following datasets and pre-trained models:
- Mnemonic Fine-tuning Data
- Mnemonic Student Preferences Data
- Mnemonic Chosen/Rejected with Bayesian Labels
- Mnemonic Test Set
We provide the code for training SMART (or any LLM), for learning the Bayesian effectiveness labels from diverse student feedback, and for evaluating the pairwise quality of two model-generated mnemonic devices.
This project was written in Python 3.10.0 and all packages were installed with Anaconda version 23.5.0 and pip version 24.0. All necessary packages can be installed with:
```
pip install -r requirements.txt
```
We have released the fine-tuning and DPO datasets using our combined Bayesian preference labels, so you can reproduce our trained model with the following steps.
- Navigate to `/model/`
- Set the specified parameters in `config.py` (described in the file)
- Run `python SFT/create_initial_model.py`. This creates the initial LLaMA model with an extra token for padding (see the sketch after this list). Requires high CPU memory (1024 GB for LLaMA-2 70B)
- Run `python SFT/train.py`. This trains the initial SMART model with supervised fine-tuning and LoRA. Requires high GPU memory (192 GB for LLaMA-2 70B)
- Run `python SFT/merge.py`. This merges the LoRA fine-tuned model into the original model. Requires high CPU memory (1024 GB for LLaMA-2 70B)
- Run `python DPO/train.py`. This further tunes the initial SMART model with DPO. Requires high GPU memory (192 GB for LLaMA-2 70B)
- Run `python DPO/merge.py`. This merges the LoRA DPO model into the original model. Requires high CPU memory (1024 GB for LLaMA-2 70B)
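For reference, the padding step in the first script conceptually amounts to adding a pad token and resizing the embedding matrix. The sketch below is an illustration of that idea, not the verbatim contents of `SFT/create_initial_model.py`; the model name and save path are placeholders that would normally come from `config.py`.

```python
# Illustrative sketch only -- not the exact contents of SFT/create_initial_model.py.
# The model name and output directory are placeholders; set them via config.py.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-70b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LLaMA-2 ships without a pad token, so add one and grow the embedding matrix to match
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("initial_model/")
model.save_pretrained("initial_model/")
```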
All training hyperparameters are set to the ones we used to train SMART with LLaMA-2 70B. Different LLMs may need different hyperparameters: in preliminary testing with LLaMA-2 7B, our current hyperparameters did not lead to large improvements with DPO (a 2% difference in Win/Loss Ratio versus the 10% difference seen for LLaMA-2 70B), so some hyperparameter search may be necessary for optimal results.
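As a refresher on what the DPO step optimizes (and where hyperparameters like beta enter), here is a minimal, self-contained sketch of the standard DPO objective on per-sequence log-probabilities. It is a conceptual illustration, not the training code in `DPO/train.py`.

```python
# Conceptual sketch of the DPO objective, not the code in DPO/train.py.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs are summed log-probs of full sequences, shape (batch,).

    beta controls how far the policy may drift from the reference model;
    it is one of the hyperparameters that may need re-tuning for smaller LLMs.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random numbers, just to show the expected shapes
logps = torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)
print(dpo_loss(*logps).item())
```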
Once training is complete, you can run inference with the SFT model using `python SFT/inference.py` and with the DPO model using `python DPO/inference.py`. The results will be saved as a `.pkl` file in your specified `results_dir` folder.
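After inference finishes, the saved results can be inspected with the standard `pickle` module. The file name below is hypothetical; substitute whatever file the inference script wrote to your `results_dir`.

```python
import os
import pickle

results_dir = "results"            # set to the results_dir from config.py
fname = "dpo_generations.pkl"      # hypothetical name; use the file inference.py produced

with open(os.path.join(results_dir, fname), "rb") as f:
    results = pickle.load(f)

# Inspect the structure before any downstream analysis
print(type(results))
```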
You can run our Bayesian model with the command `python Bayesian/bayesian.py` (while staying in the `/model` folder). We have included the preprocessed data this model expects as input in `Bayesian/bayesian_data.pkl` (derived from the student preference data), and `bayesian.py` documents what each of its fields means. After running `bayesian.py`, we aggregate the latent effectiveness across chains and epochs from NUTS to construct a preference dataset, which will be saved under `Bayesian/chosen_rejected_data`.
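To give a feel for this kind of setup, below is a heavily simplified sketch of a latent-effectiveness model sampled with NUTS, written here with NumPyro on toy data. The actual model, priors, likelihood, and data fields in `Bayesian/bayesian.py` differ (and may use a different library), so treat this purely as an illustration.

```python
# Simplified illustration of fitting latent mnemonic effectiveness with NUTS.
# The real model in Bayesian/bayesian.py uses a different structure and inputs.
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

N_MNEMONICS = 4  # toy example size

def model(pair_a, pair_b, preferred_a=None):
    # One latent effectiveness score per mnemonic
    eff = numpyro.sample("effectiveness",
                         dist.Normal(0.0, 1.0).expand([N_MNEMONICS]))
    # Bradley-Terry-style likelihood: mnemonic a "wins" with prob sigmoid(eff_a - eff_b)
    numpyro.sample("obs", dist.Bernoulli(logits=eff[pair_a] - eff[pair_b]),
                   obs=preferred_a)

# Toy preference data: which mnemonic in each pair the student favored
pair_a = jnp.array([0, 0, 1, 2])
pair_b = jnp.array([1, 2, 2, 3])
preferred_a = jnp.array([1, 1, 0, 1])

mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000, num_chains=1)
mcmc.run(random.PRNGKey(0), pair_a, pair_b, preferred_a)
eff_samples = mcmc.get_samples()["effectiveness"]  # aggregate these across draws
print(eff_samples.mean(axis=0))
```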
We also provide the code for comparing the quality of two mnemonic devices with GPT-4, which can be run by navigating to the `evaluate` folder and running `dspy_clf.py`. The code is currently set up to compare the fine-tuned and DPO-trained LLMs for mnemonic generation, but this can be changed by altering which results are loaded on lines 87 to 90.
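For orientation, a pairwise GPT-4 comparison in DSPy typically looks something like the sketch below. The signature, field names, and model identifier here are illustrative (not taken from `dspy_clf.py`), and DSPy's LM-configuration API differs across versions, so defer to the actual script for the real setup.

```python
# Illustrative DSPy sketch; field names and LM setup are placeholders, not dspy_clf.py.
import dspy

# Configure the judge LM (requires OPENAI_API_KEY; API shown follows recent DSPy versions)
dspy.configure(lm=dspy.LM("openai/gpt-4"))

class CompareMnemonics(dspy.Signature):
    """Decide which of two mnemonic devices better helps a student learn the term."""
    term = dspy.InputField()
    mnemonic_a = dspy.InputField()
    mnemonic_b = dspy.InputField()
    better = dspy.OutputField(desc="either 'A' or 'B'")

judge = dspy.Predict(CompareMnemonics)
result = judge(term="gregarious",
               mnemonic_a="Sounds like 'Greg Garius', the friendliest guy at the party.",
               mnemonic_b="Greg plus Aquarius.")
print(result.better)
```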
If you found our code, datasets, or paper useful, please cite:
@inproceedings{balepur-etal-2024-smart,
title = "A {SMART} Mnemonic Sounds like {``}Glue Tonic{''}: Mixing {LLM}s with Student Feedback to Make Mnemonic Learning Stick",
author = "Balepur, Nishant and
Shu, Matthew and
Hoyle, Alexander and
Robey, Alison and
Feng, Shi and
Goldfarb-Tarrant, Seraphina and
Boyd-Graber, Jordan Lee",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.786",
doi = "10.18653/v1/2024.emnlp-main.786",
pages = "14202--14225",
abstract = "Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior work generates mnemonics for students, but they do not train models using mnemonics students prefer and aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to find preferences on mnemonics students favor. We gather 2684 preferences from 45 students across two types: **expressed** (inferred from ratings) and **observed** (inferred from student learning), yielding three key findings. First, expressed and observed preferences disagree; what students *think* is helpful does not always capture what is *truly* helpful. Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4 at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.",
}
If you have any questions or problems with the code, feel free to raise an issue or email me at [email protected]. Thank you!