Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, Juho Kim
We introduce CUPID 🏹, a benchmark for evaluating the capability of Large Language Models (LLMs) to infer and apply personalized, contextual preferences from multi-turn user interactions. Unlike existing approaches that assume static global preferences, CUPID tests models' ability to understand dynamic, context-dependent user preferences revealed through conversational and implicit feedback.
CUPID contains 756 human-curated interaction session histories between simulated users and LLM-based AI assistants. Each interaction session involves a specific context factor (e.g., person, artifact, organization) and presents a user expressing their preference relevant to the context through multi-turn feedback.
Key Features:
- Contextual Preferences: Tests models' ability to infer preferences that change based on context
- Multi-turn Interactions: Evaluates understanding from conversational feedback rather than explicit statements
- Preference Inference: Assesses capability to extract relevant preferences from prior interactions
- Response Generation: Tests application of inferred preferences to new requests
- Comprehensive Evaluation: Presents metrics to assess model performance on preference inference and response generation
Evaluation Tasks:
- Preference Inference: Given prior interactions, infer the user's contextual preference
- Response Generation: Given prior interactions, generate a response that satisfies the user's contextual preference
We recommend using a conda environment:
conda create -n cupid python=3.9
conda activate cupid
pip install -r requirements.txt
Set up your API keys for model evaluation:
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
export TOGETHER_API_KEY="your_together_key" # For models supported by Together AI
export GOOGLE_API_KEY="your_google_key" # For Gemini models
The CUPID dataset is available on HuggingFace: kixlab/CUPID
Dataset Structure:
- 756 instances across diverse personas and contexts
- Human-curated interactions showing contextual preference expression
- Three instance types: consistent, contrastive, and changing preferences
- Rich context factors influencing user preferences (e.g., personal relationships, prior experiences, etc.)
Data Fields:
- persona_id: Unique identifier for the user persona
- current_request: The request to be answered by the model
- current_context_factor: Context influencing the user's preference
- current_contextual_preference: Ground-truth preference for this context
- current_checklist: Specific criteria for evaluating response alignment
- prior_interactions: List of previous interaction sessions showing user feedback
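As a quick sanity check, the dataset can be loaded directly from the Hugging Face Hub. This is a minimal sketch assuming the datasets library is installed; the split name used here is an assumption.

# Minimal sketch: load CUPID from the Hugging Face Hub and inspect one instance.
# Assumes the `datasets` library; the split name is an assumption.
from datasets import load_dataset

dataset = load_dataset("kixlab/CUPID", split="test")
instance = dataset[0]

print(instance["current_request"])                # request to be answered
print(instance["current_contextual_preference"])  # ground-truth preference
print(len(instance["prior_interactions"]))        # number of prior sessions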
Check kixlab/CUPID-Raw for the full personas, context factors, and interaction sessions used to synthesize the benchmark.
We also release kixlab/CUPID-Unverified, a non-validated version of CUPID with >3k instances.
Evaluate a model on the CUPID dataset:
python -m evaluation.run \
--results_dir results \
--model "gpt-4.1-nano-2025-04-14" \
--evaluator gpt-4o-2024-11-20 \
--n_workers 4
Key Parameters:
- --model: Model to evaluate (must have a corresponding class in evaluation/models/)
- --evaluator: Model used for evaluation functions (preference decomposing and matching, response judging)
- --use_matcher: Use our finetuned preference matcher (kixlab/prefmatcher-7b) for preference inference
- --task: Run the inference, generation, or both evaluation stages
- --data_dir: Use custom data instead of the official CUPID dataset (data synthesis explained in the next section)
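For example, to run only the inference stage with the finetuned matcher (the flags follow the parameter list above; treat the exact combination as a sketch):

python -m evaluation.run \
--results_dir results \
--model "gpt-4.1-nano-2025-04-14" \
--evaluator gpt-4o-2024-11-20 \
--use_matcher \
--task inference

Note that --use_matcher expects the preference matcher to be served locally first (see the Preference Inference section below).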
To evaluate your own model, first create a new model class in evaluation/models/your_model.py that inherits from the Model class and implements your model's inference logic. The __call__ method should take a system prompt and a user prompt, and return only the final text response.
- Create a new model class in evaluation/models/your_model.py:
from evaluation.models.model import Model, register_model

@register_model
class YourModel(Model):
    model_name = "your-model-name"

    def __call__(self, system_prompt, user_prompt):
        # Your model inference logic here: call your model with the two
        # prompts and return only the final text response.
        response = ...
        return response
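As a concrete illustration, a hypothetical wrapper around the OpenAI chat API could look like the sketch below. The Model/register_model interface comes from this repository; the class name, model identifier, and the assumption that the base class needs no constructor arguments are ours.

# Hypothetical example, not part of the repository: an OpenAI-backed model.
# Assumes the `openai` package and OPENAI_API_KEY set in the environment.
from openai import OpenAI

from evaluation.models.model import Model, register_model

@register_model
class GPT4oMiniExample(Model):
    model_name = "gpt-4o-mini-example"  # name to pass via --model

    def __call__(self, system_prompt, user_prompt):
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        completion = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
        )
        return completion.choices[0].message.content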
- Run evaluation:
python -m evaluation.run --model your-model-name --results_dir results
CUPID evaluates models on two main tasks:
1. Preference Inference (Precision/Recall/F1)
- Measures how well models can infer the user's preference for the current request from prior interactions
- Compares inferred preference to the ground-truth preference
- Optionally, you can use our finetuned preference matcher for more cost-efficient evaluation
- Our finetuned preference matcher is available on HuggingFace: kixlab/prefmatcher-7b
- First, run the bash script evaluation/serve_prefmatcher.sh to serve the model through vLLM
- This will serve the model at http://localhost:8000
- Then, run the evaluation script with the --use_matcher flag
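For reference, once the inferred and ground-truth preferences have been decomposed into units and matched (by the evaluator model or the finetuned matcher), precision, recall, and F1 combine in the usual way. The sketch below only illustrates that arithmetic; it is not the repository's exact implementation.

# Illustrative sketch only (not the repo's exact implementation): combining
# matched preference units into precision / recall / F1.
def preference_prf(n_inferred_matched, n_inferred, n_gt_matched, n_gt):
    precision = n_inferred_matched / n_inferred if n_inferred else 0.0
    recall = n_gt_matched / n_gt if n_gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g., 3 of 4 inferred units matched, and 3 of 5 ground-truth units covered
print(preference_prf(3, 4, 3, 5))  # (0.75, 0.6, 0.666...)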
2. Response Generation (Average Score 1-10)
- Evaluates how well generated responses satisfy user preferences
- Scored by LLM-based judges on response-preference alignment
This repository also includes the synthesis pipeline for CUPID to generate additional training/evaluation data.
python -m synthesis.run \
--output_dir synthetic_data \
--model "anthropic.claude-3-5-sonnet-20241022-v2:0" \
--n_personas 10 \
--n_factors 8 \
--n_sessions 13 \
--max_turns 16 \
--n_workers 4
Key Parameters:
- --output_dir: Directory to save the generated data
- --model: Model to use for data generation
- --n_personas: Number of personas to generate (default: 4)
- --n_factors: Number of context factors to generate (default: 8)
- --n_sessions: Number of interaction sessions to generate (default: 13)
- --max_turns: Maximum number of turns in an interaction session (default: 16)
- --n_workers: Number of workers to use for data generation (default: 1)
Synthesis Pipeline: The pipeline consists of four main steps:
- Persona Generation: Create diverse user personas with different backgrounds and traits
- Context Factors: For each persona, generate context factors that influence preferences
- Session Generation: Create interaction scenarios based on personas and contexts
- Interaction Simulation: Simulate multi-turn conversations with preference feedback
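To evaluate on synthesized data, the generated directory can be passed to the evaluation script via --data_dir. Both flags are documented above; the exact directory layout the evaluator expects is not shown here, so treat this pairing as a sketch.

# Sketch: synthesize a small custom dataset, then evaluate a model on it.
python -m synthesis.run --output_dir synthetic_data --model "anthropic.claude-3-5-sonnet-20241022-v2:0"
python -m evaluation.run --data_dir synthetic_data --results_dir results --model "gpt-4.1-nano-2025-04-14" --evaluator gpt-4o-2024-11-20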
cupid/
├── evaluation/             # Evaluation framework
│   ├── models/             # Model implementations
│   ├── modules/            # Evaluation components
│   ├── pipeline/           # Evaluation pipeline
│   └── run.py              # Main evaluation script
├── synthesis/              # Data synthesis framework
│   ├── modules/            # Synthesis components
│   ├── pipeline/           # Synthesis pipeline
│   └── run.py              # Main synthesis script
├── prompts/                # Prompt templates
│   ├── evaluation/         # Evaluation prompts
│   └── synthesis/          # Synthesis prompts
├── utils/                  # Utility functions
├── config.py               # Configuration settings
└── requirements.txt        # Dependencies
If you find our work useful, please consider citing our paper!
@article{kim2025cupid,
title = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
author = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
journal = {arXiv preprint arXiv:2508.01674},
year = {2025},
}