This repository contains the dataset for studying the impact of grading scales on the alignment between human evaluators and large language models (LLMs) acting as judges. The data encompasses both human annotations and LLM-generated evaluations across six diverse natural language processing (NLP) benchmarks, featuring objective, open-ended subjective, and mixed tasks.
By providing scores collected on multiple rating scales (e.g., 0-5, 0-10, 0-100), this dataset facilitates the computation of intraclass correlation coefficients (ICC) to measure absolute agreement and inter-scale consistency. Furthermore, the human annotations are categorized by demographic profile (e.g., gender), enabling subgroup-level diagnostics of systematic differences and potential heterogeneity in human-LLM alignment.
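As a minimal sketch of the agreement analysis described above, absolute agreement between raters (e.g., a human column and an LLM-judge column per item) can be computed as a two-way random-effects ICC, here ICC(2,1). This NumPy implementation is illustrative, not code shipped with the dataset:

```python
import numpy as np

def icc2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` has shape (n_subjects, k_raters), e.g. one column of human
    scores and one column of LLM-judge scores per evaluated item.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    col_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA decomposition.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # raters
    sse = (np.sum((scores - grand) ** 2)
           - k * np.sum((row_means - grand) ** 2)
           - n * np.sum((col_means - grand) ** 2))
    mse = sse / ((n - 1) * (k - 1))                        # residual

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Two raters in perfect agreement yield ICC = 1.
x = np.array([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [2.0, 2.0]])
print(round(icc2_1(x), 3))  # 1.0
```

The same function applies unchanged to scores collected on any of the scales, which is what makes inter-scale ICC comparison straightforward.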
The data covers the following 6 benchmarks across several dimensions (e.g., coding, writing, reasoning, STEM, safety, alignment):
- MT-Bench (MT-Bench): Multi-turn conversational instruction-following evaluations.
- MoralChoice (MoralChoices): Evaluation of moral and ethical decision-making.
- STS-Benchmark (STS-B/similarity): Semantic Textual Similarity benchmark.
- SummEval (SummEval/summary): Text summarization evaluation (relevance, coherence, fluency, consistency).
- ToxiGen (ToxiGen/toxicity): Toxicity and hate speech detection.
- TruthfulQA (TruthfulQA): Evaluation of factual accuracy and truthfulness.
The dataset is divided into two primary directories, segregating the human-annotated ground truth from the LLM-generated judgments:
data/
├── human/
│ ├── Female_Subject_{ID}_{X}_{Y}/
│ └── Male_Subject_{ID}_{X}_{Y}/
└── llm/
├── MTbench_sample_25_scores_summary.csv
├── TruthfulQA_25_samples_comparison.csv
├── moralchoice_25_samples_comparison.csv
├── similarity_25_samples_all_models.csv
├── summary_data_sample_25_all_scores.csv
├── toxicity_25_samples_all_models.csv
└── temperatures/
The human directory contains the human annotations collected for the benchmark datasets.
- Files are organized into folders by annotator subject profile (e.g., Female_Subject_..., Male_Subject_...).
- Inside each subject's folder, JSON files for the respective benchmarks contain the annotation results and metadata (a format often associated with Label Studio outputs). Annotations include scores for answers on metrics such as an "overall" score or specific category features.
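The exact JSON schema may differ per benchmark; as an illustrative sketch only, assuming a Label Studio-style export where each task carries an annotations list with rating results (the field names "overall" and "rating" here are assumptions to adapt to the actual files):

```python
import json
from pathlib import Path

def extract_overall_scores(path):
    """Pull per-task 'overall' ratings from a Label Studio-style export.

    Assumes each task looks roughly like:
      {"annotations": [{"result": [{"from_name": "overall",
                                    "value": {"rating": 4}}]}]}
    Adjust the field names to the actual files in data/human/.
    """
    tasks = json.loads(Path(path).read_text())
    scores = []
    for task in tasks:
        for ann in task.get("annotations", []):
            for res in ann.get("result", []):
                if res.get("from_name") == "overall":
                    scores.append(res["value"].get("rating"))
    return scores
```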
The llm directory contains evaluation results generated by various state-of-the-art Large Language Models acting as judges.
- CSV summaries: The root of the llm directory holds consolidated CSV files containing per-sample scores for each benchmark. Models evaluated include GPT-4o, Llama, Qwen, DeepSeek, Mistral, and Gemini. Scores are often provided across multiple scales (0-5, 0-10, 0-100).
- Temperatures (data/llm/temperatures/): This subdirectory contains evaluation data further categorized by generation temperature for each of the 6 benchmarks (e.g., MTbench, MoralChoices, STS-Benchmark, SummEval, ToxiGen, TruthfulQA).
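When comparing judgments recorded on the 0-5, 0-10, and 0-100 scales, the scores must first be mapped to a common range. A minimal sketch of such a linear rescaling (the CSV column layout itself varies per benchmark, so this operates on raw values):

```python
def rescale(score: float, scale_max: float, target_max: float = 5.0) -> float:
    """Linearly map a score from [0, scale_max] onto [0, target_max]."""
    if not 0 <= score <= scale_max:
        raise ValueError(f"score {score} outside [0, {scale_max}]")
    return score * target_max / scale_max

# 80 on the 0-100 scale corresponds to 4.0 on the 0-5 scale.
print(rescale(80, 100))  # 4.0
```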
This dataset can be used to:
- Replicate findings that human-LLM alignment is highest on the 0-5 grading scale.
- Compute and analyze intraclass correlation coefficients (ICC) to measure absolute agreement between humans and LLMs.
- Compare LLM judgment consistency when the underlying grading scale is altered (0-5, 0-10, 0-100).
- Investigate systematic subgroup differences in human-LLM alignment across different demographic groups (e.g., by analyzing gender-based annotation variations).
- Evaluate the impact of benchmark heterogeneity and model generation temperatures on evaluation quality.
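For the subgroup analysis above, one simple screen is to group per-item human-LLM score gaps by the annotator profile encoded in the folder name (Female_Subject_... / Male_Subject_...). A minimal sketch with hypothetical input rows:

```python
from collections import defaultdict
from statistics import mean

def gender_from_folder(folder: str) -> str:
    """Parse the gender prefix from names like 'Female_Subject_3_...'."""
    return folder.split("_", 1)[0]

def mean_gap_by_gender(rows):
    """rows: (subject_folder, human_score, llm_score) triples.

    Returns the mean absolute human-LLM gap per gender subgroup.
    """
    gaps = defaultdict(list)
    for folder, human, llm in rows:
        gaps[gender_from_folder(folder)].append(abs(human - llm))
    return {g: mean(v) for g, v in gaps.items()}

rows = [("Female_Subject_1_a_b", 4, 5),
        ("Female_Subject_2_a_b", 3, 3),
        ("Male_Subject_1_a_b", 2, 5)]
print(mean_gap_by_gender(rows))
```

Scores should be rescaled to a common range before computing gaps if they come from different grading scales.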