Harvard-AI-and-Robotics-Lab/LLM-Judge
Grading Scale Impact on Human-LLM Alignment

This repository contains the dataset for studying the impact of grading scales on the alignment between human evaluators and large language models (LLMs) acting as judges. The data encompasses both human annotations and LLM-generated evaluations across six diverse natural language processing (NLP) benchmarks, featuring objective, open-ended subjective, and mixed tasks.

By providing scores collected on multiple rating scales (e.g., 0-5, 0-10, 0-100), this dataset facilitates the computation of intraclass correlation coefficients (ICC) to measure absolute agreement and inter-scale consistency. Furthermore, the human annotations are categorized by demographic profile (e.g., gender), enabling fine-grained diagnostics of systematic subgroup differences in human-LLM alignment.
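As a sketch of how the agreement statistic above can be computed (this helper is illustrative and not part of the repository), ICC(2,1) — two-way random effects, absolute agreement, single rater — follows directly from the two-way ANOVA decomposition of a targets-by-raters score matrix:

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_targets, k_raters) matrix, e.g. one row per
    evaluated sample and one column per judge (human or LLM).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    # Two-way ANOVA sums of squares.
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()  # between targets
    ss_cols = n * ((col_means - grand) ** 2).sum()  # between raters
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Because ICC(2,1) measures absolute agreement, a judge that is consistently offset from the human scores (e.g., always two points higher) is penalized, whereas a pure rank correlation would not detect the offset.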

Benchmarks Included

The data covers the following 6 benchmarks across several dimensions (e.g., coding, writing, reasoning, STEM, safety, alignment):

  1. MT-Bench (MTbench): Multi-turn conversational instruction-following evaluations.
  2. MoralChoice (MoralChoices): Evaluation of moral and ethical decision-making.
  3. STS-Benchmark (STS-B / similarity): Semantic Textual Similarity benchmark.
  4. SummEval (SummEval / summary): Text summarization evaluation (relevance, coherence, fluency, consistency).
  5. ToxiGen (ToxiGen / toxicity): Toxicity and hate speech detection.
  6. TruthfulQA: Evaluation of factual accuracy and truthfulness.

Repository Structure

The dataset is divided into two primary directories, segregating the human-annotated ground truth from the LLM-generated judgments:

data/
├── human/
│   ├── Female_Subject_{ID}_{X}_{Y}/
│   └── Male_Subject_{ID}_{X}_{Y}/
└── llm/
    ├── MTbench_sample_25_scores_summary.csv
    ├── TruthfulQA_25_samples_comparison.csv
    ├── moralchoice_25_samples_comparison.csv
    ├── similarity_25_samples_all_models.csv
    ├── summary_data_sample_25_all_scores.csv
    ├── toxicity_25_samples_all_models.csv
    └── temperatures/
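Given the layout above, the human-annotation folders can be discovered and grouped by subject profile with a small helper. This is an illustrative sketch: the regex only pins down the `<Gender>_Subject_` prefix, since the meaning of the trailing `{ID}_{X}_{Y}` fields is not documented here, and the folder names in the test are hypothetical examples of the pattern.

```python
import re
from collections import defaultdict

# Folder names follow the pattern <Gender>_Subject_<ID>_<X>_<Y>; the
# trailing fields are left opaque because their semantics are not
# specified in this README.
FOLDER_RE = re.compile(r"^(Female|Male)_Subject_(\w+)$")

def group_by_gender(folder_names):
    """Group human-annotation folder names by their gender prefix."""
    groups = defaultdict(list)
    for name in folder_names:
        m = FOLDER_RE.match(name)
        if m:
            groups[m.group(1)].append(name)
    return dict(groups)
```

In practice the input would come from listing `data/human/` (e.g., with `pathlib.Path.iterdir()`); non-matching entries such as stray files are skipped.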

Human Annotations (data/human/)

The human directory contains annotation data collected from human evaluators on the benchmark datasets.

  • Files are organized into folders by annotator subject profiles (e.g., Female_Subject_..., Male_Subject_...).
  • Inside each subject's folder, there are JSON files for the respective benchmarks containing the annotation results and metadata (format often associated with Label Studio outputs). Annotations include scoring for answers on metrics such as "overall" score or specific category features.
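Since the JSON files follow a Label Studio-style export, the scores can be pulled out with a small parser. The sketch below assumes the common Label Studio export shape — a list of tasks, each with an `annotations` list whose entries hold a `result` list of `{"from_name": ..., "value": {"rating": ...}}` items — and the `"overall"` control name is an assumption based on the metric names mentioned above; the exact keys in this repository's files may differ.

```python
import json

def extract_overall_scores(tasks):
    """Collect 'overall' rating values from a Label Studio-style export.

    `tasks` is the parsed JSON, e.g. tasks = json.load(open(path)).
    Keys ('annotations', 'result', 'from_name', 'value', 'rating')
    follow the typical Label Studio export format and may need to be
    adapted to the files in data/human/.
    """
    scores = []
    for task in tasks:
        for ann in task.get("annotations", []):
            for item in ann.get("result", []):
                if item.get("from_name") == "overall":
                    value = item.get("value", {})
                    if "rating" in value:
                        scores.append(value["rating"])
    return scores
```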

LLM Evaluations (data/llm/)

The llm directory contains evaluation results generated by various state-of-the-art Large Language Models acting as judges.

  • CSV Summaries: The root of the llm directory holds consolidated CSV files containing scores corresponding to samples for each benchmark. Models evaluated include GPT-4o, Llama, Qwen, DeepSeek, Mistral, and Gemini. Scores are often provided across multiple scales (0-5, 0-10, 0-100).
  • Temperatures (data/llm/temperatures/): This subdirectory contains evaluation data further broken down by generation temperature for each of the 6 benchmarks (e.g., MTbench, MoralChoices, STS-Benchmark, SummEval, ToxiGen, TruthfulQA).

Usage

This dataset can be used to:

  • Replicate findings that human-LLM alignment is highest on the 0-5 grading scale.
  • Compute and analyze intraclass correlation coefficients (ICC) to measure absolute agreement between humans and LLMs.
  • Compare LLM judgment consistency when the underlying grading scale is altered (0-5, 0-10, 0-100).
  • Investigate systematic subgroup differences in human-LLM alignment across different demographic groups (e.g., by analyzing gender-based annotation variations).
  • Evaluate the impact of benchmark heterogeneity and model generation temperatures on evaluation quality.
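For the inter-scale comparisons above, one simple approach is to min-max-normalize each score vector by its scale maximum and correlate the results. The sketch below is illustrative: the column names (`score_0_5`, etc.) are hypothetical stand-ins for whatever headers the consolidated CSV files actually use.

```python
import numpy as np

# Hypothetical column names mapped to their scale maxima; the actual
# CSV headers in data/llm/ may differ per benchmark file.
SCALE_MAX = {"score_0_5": 5, "score_0_10": 10, "score_0_100": 100}

def cross_scale_consistency(scores_by_scale):
    """Pearson correlation between min-max-normalized score vectors
    collected on different grading scales for the same samples."""
    norm = {
        name: np.asarray(vals, dtype=float) / SCALE_MAX[name]
        for name, vals in scores_by_scale.items()
    }
    names = list(norm)
    out = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = np.corrcoef(norm[names[i]], norm[names[j]])[0, 1]
            out[(names[i], names[j])] = r
    return out
```

Note that correlation only captures relative consistency between scales; for absolute agreement (whether a 4/5 really maps to an 80/100), an ICC on the normalized scores is the more appropriate statistic.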
