Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim
We introduce CUPID 🏹, a benchmark for evaluating the capability of Large Language Models (LLMs) to infer and apply personalized, contextual preferences from multi-turn user interactions. Unlike existing approaches that assume static global preferences, CUPID tests models' ability to understand dynamic, context-dependent user preferences revealed through conversational and implicit feedback.
CUPID contains 756 human-curated interaction session histories between simulated users and LLM-based AI assistants. Each interaction session involves a specific context factor (e.g., person, artifact, organization) and presents a user expressing their preference relevant to the context through multi-turn feedback.
Key Features:
- Contextual Preferences: Tests models' ability to infer preferences that change based on context
- Multi-turn Interactions: Evaluates understanding from conversational feedback rather than explicit statements
- Preference Inference: Assesses the capability to extract relevant preferences from prior interactions
- Response Generation: Tests the application of inferred preferences to new requests
- Comprehensive Evaluation: Provides metrics to assess model performance on preference inference and response generation
 
Evaluation Tasks:
- Preference Inference: Given prior interactions, infer the user's contextual preference
- Response Generation: Given prior interactions, generate a response that satisfies the user's contextual preferences
 
We recommend using a conda environment:
conda create -n cupid python=3.9
conda activate cupid
pip install -r requirements.txt

Set up your API keys for model evaluation:
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key" 
export TOGETHER_API_KEY="your_together_key" # For models supported by Together AI
export GOOGLE_API_KEY="your_google_key"  # For Gemini models

The CUPID dataset is available on HuggingFace: kixlab/CUPID
Dataset Structure:
- 756 instances across diverse personas and contexts
- Human-curated interactions showing contextual preference expression
- Three instance types: consistent, contrastive, and changing preferences
- Rich context factors influencing user preferences (e.g., personal relationships, prior experiences)
 
Data Fields:
- `persona_id`: Unique identifier for the user persona
- `current_request`: The request to be answered by the model
- `current_context_factor`: Context influencing the user's preference
- `current_contextual_preference`: Ground-truth preference for this context
- `current_checklist`: Specific criteria for evaluating response alignment
- `prior_interactions`: List of previous interaction sessions showing user feedback
We also release kixlab/CUPID-Unverified, a non-validated version of CUPID with >3k instances.
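For a quick look at the data, instances can be loaded with the Hugging Face `datasets` library. The minimal sketch below assumes a `train` split; check the dataset card for the exact split names.

from datasets import load_dataset

# Load the human-verified CUPID benchmark from the Hugging Face Hub
dataset = load_dataset("kixlab/CUPID", split="train")  # split name assumed

example = dataset[0]
print(example["current_request"])                # request to be answered
print(example["current_contextual_preference"])  # ground-truth preference
print(len(example["prior_interactions"]))        # number of prior sessions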
Evaluate a model on the CUPID dataset:
python -m evaluation.run \
    --results_dir results \
    --model "gpt-4.1-nano-2025-04-14" \
    --evaluator gpt-4o-2024-11-20 \
    --n_workers 4

Key Parameters:
- `--model`: Model to evaluate (must have a corresponding class in `evaluation/models/`)
- `--evaluator`: Model used for evaluation functions (preference decomposing and matching, response judging)
- `--use_matcher`: Use our finetuned preference matcher (kixlab/prefmatcher-7b) for preference inference
- `--task`: Run the `inference`, `generation`, or `both` evaluation stages
- `--data_dir`: Use custom data instead of the official CUPID dataset (data synthesis is explained in the next section)
To evaluate your own model, first create a new model class in `evaluation/models/your_model.py` that inherits from the `Model` class and implements your model's inference logic. The `__call__` method should take a system prompt and a user prompt and return only the final text response; an illustrative sketch is given after the steps below.
- Create a new model class in `evaluation/models/your_model.py`:
from evaluation.models.model import Model, register_model
@register_model
class YourModel(Model):
    model_name = "your-model-name"
    
    def __call__(self, system_prompt, user_prompt):
        # Your model inference logic here
        return response

- Run evaluation:

python -m evaluation.run --model your-model-name --results_dir results
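For example, a minimal sketch of such a class wrapping the OpenAI chat API could look like the following. The class name, the `gpt-4o-mini` backend model, and the per-call client creation are illustrative assumptions; the exact base-class interface is defined in `evaluation/models/model.py`.

from openai import OpenAI
from evaluation.models.model import Model, register_model

@register_model
class MyOpenAIModel(Model):
    model_name = "my-openai-model"  # name passed to --model

    def __call__(self, system_prompt, user_prompt):
        # Illustrative sketch: query an OpenAI chat model and return only the text.
        # OpenAI() reads OPENAI_API_KEY from the environment.
        client = OpenAI()
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed backend model; replace as needed
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
        )
        return completion.choices[0].message.content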
CUPID evaluates models on two main tasks:

1. Preference Inference (Precision/Recall/F1)
- Measures how well models can infer the user's preference for the current request from prior interactions
- Compares the inferred preference to the ground-truth preference
- Optionally, you can use our finetuned preference matcher for more cost-efficient evaluation:
  - Our finetuned preference matcher is available on HuggingFace: kixlab/prefmatcher-7b
  - First, run the bash script `evaluation/serve_prefmatcher.sh` to serve the model through vLLM; it will be served at `http://localhost:8000`
  - Then, run the evaluation script with the `--use_matcher` flag
 
2. Response Generation (Average Score 1-10)
- Evaluates how well generated responses satisfy user preferences
- Scored by LLM-based judges on response-preference alignment
 
This repository also includes the synthesis pipeline for CUPID to generate additional training/evaluation data.
python -m synthesis.run \
    --output_dir synthetic_data \
    --model "anthropic.claude-3-5-sonnet-20241022-v2:0" \
    --n_personas 10 \
    --n_factors 8 \
    --n_sessions 13 \
    --max_turns 16 \
    --n_workers 4

Key Parameters:
- `--output_dir`: Directory to save the generated data
- `--model`: Model to use for data generation
- `--n_personas`: Number of personas to generate (default: 4)
- `--n_factors`: Number of context factors to generate (default: 8)
- `--n_sessions`: Number of interaction sessions to generate (default: 13)
- `--max_turns`: Maximum number of turns in an interaction session (default: 16)
- `--n_workers`: Number of workers to use for data generation (default: 1)
Synthesis Pipeline: Consists of four main steps:
- Persona Generation: Create diverse user personas with different backgrounds and traits
- Context Factors: For each persona, generate context factors that influence preferences
- Session Generation: Create interaction scenarios based on personas and contexts
- Interaction Simulation: Simulate multi-turn conversations with preference feedback
 
cupid/
├── evaluation/         # Evaluation framework
│   ├── models/         # Model implementations
│   ├── modules/        # Evaluation components
│   ├── pipeline/       # Evaluation pipeline
│   └── run.py          # Main evaluation script
├── synthesis/          # Data synthesis framework  
│   ├── modules/        # Synthesis components
│   ├── pipeline/       # Synthesis pipeline
│   └── run.py          # Main synthesis script
├── prompts/            # Prompt templates
│   ├── evaluation/     # Evaluation prompts
│   └── synthesis/      # Synthesis prompts
├── utils/              # Utility functions
├── config.py           # Configuration settings
└── requirements.txt    # Dependencies
If you find our work useful, please consider citing our paper!
@article{kim2025cupid,
  title     = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
  author    = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
  journal   = {arXiv preprint arXiv:2508.01674},
  year      = {2025},
}