This repository contains the code for the EMNLP 2025 main conference paper "Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks". Our work introduces a novel approach to understanding how large language models (LLMs) process different types of reasoning tasks by analyzing the internal mechanisms that drive benchmark performance.
Benchmark Profiling is a mechanistic interpretability method that identifies and analyzes the specific neural network regions responsible for different cognitive abilities in LLMs. By selectively damaging these regions (sketched in code after the list below), we can:
- Identify critical parameters for specific reasoning abilities
- Understand cross-benchmark relationships and shared cognitive mechanisms
- Provide mechanistic insights into how LLMs solve different types of problems
- Enable targeted model analysis for specific cognitive capabilities
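To make the idea concrete, here is a minimal, illustrative PyTorch sketch of the damage step. It is not the repository's actual implementation: it assumes per-parameter importance scores have already been computed by a gradient-based extraction step, and all names are placeholders.

```python
# Illustrative sketch only; not this repository's actual API.
# Assumes `importance_scores` maps parameter names to tensors of the
# same shape, produced by a gradient-based extraction step.
import torch

def build_topk_mask(importance: torch.Tensor, k: float) -> torch.Tensor:
    """Boolean mask over the top-k fraction of entries by importance."""
    n_keep = max(1, int(k * importance.numel()))
    threshold = importance.flatten().topk(n_keep).values.min()
    return importance >= threshold

@torch.no_grad()
def damage_model(model: torch.nn.Module, importance_scores: dict, k: float):
    """Zero out the parameters flagged as critical for one ability."""
    for name, param in model.named_parameters():
        if name in importance_scores:
            mask = build_topk_mask(importance_scores[name], k)
            param[mask] = 0.0  # selectively "damage" the critical region
    return model
```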
```
├── damage_region/                      # Model damage experiments
│   └── damage_model.py                 # Apply damage to critical regions
├── data_preprocess/                    # Dataset preprocessing pipeline
│   ├── download_dataset.py             # Download datasets from HuggingFace
│   ├── preprocess.py                   # Data preprocessing utilities
│   └── transform.py                    # Data transformation functions
├── extract_region/                     # Parameter region extraction
│   └── extract_region.py               # Extract critical parameters and save selections
├── training/                           # Model fine-tuning components
│   ├── step1_supervised_finetuning/    # SFT training scripts
│   └── utils/                          # Training utilities
├── config.yml                          # Main configuration file
├── run.sh                              # Main execution script
├── requirements.txt                    # Python dependencies
├── CITATION.bib                        # Academic citation
├── CONTRIBUTING.md                     # Contribution guidelines
└── LICENSE                             # Apache License 2.0
```
Note: The repository includes only the core code. Datasets, experimental results, and generated figures are excluded and should be downloaded/generated separately.
- Python 3.8+
- CUDA-compatible GPU(s)
- PyTorch 2.0+
- Transformers library
- Additional dependencies listed in `requirements.txt`
- Clone the repository:

```bash
git clone https://github.com/junkim100/Unveiling-Regions.git
cd Unveiling-Regions
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download datasets:

```bash
# Use the provided script to download required datasets
python data_preprocess/download_dataset.py
```

- Configure your setup:

```bash
# Edit config.yml to specify your models, datasets, and hardware configuration
vim config.yml
```
Run the complete benchmark profiling pipeline:
```bash
# Run full pipeline (training, extraction, damage, evaluation)
./run.sh
# Run evaluation only (skip training and extraction)
./run.sh -e
```

```bash
# Extract regions for a specific model and dataset
cd extract_region
python extract_region.py generate_masks \
    --input_dir ./outputs/Analogical_Reasoning/llama3.1/train/checkpoint_full \
    --output_dir ./outputs/Analogical_Reasoning/llama3.1/extract/checkpoint_full \
    --k 0.01024
```

```bash
# Damage the model using the extracted regions
cd damage_region
python damage_model.py \
    ./outputs/Analogical_Reasoning/llama3.1/extract \
    ./outputs/Analogical_Reasoning/llama3.1/damage \
    meta-llama/Llama-3.1-8B-Instruct \
    0.01024
```

```bash
# Evaluate the damaged model on specific benchmarks
CUDA_VISIBLE_DEVICES=0 lm_eval --model hf \
    --model_args pretrained=./outputs/Analogical_Reasoning/llama3.1/damage/checkpoint_full/top0.01024 \
    --tasks gsm8k,arc_challenge,hellaswag \
    --batch_size 8
```
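The `lm_eval` command is the CLI from EleutherAI's lm-evaluation-harness; if it is not already pulled in by `requirements.txt`, it can typically be installed with `pip install lm-eval`.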
Our framework supports analysis across multiple cognitive reasoning domains:

- Analogical Reasoning - Pattern recognition and analogy completion
- Commonsense & Causal Reasoning - Common sense understanding and causal relationships
- Contextual Recall - Information retrieval from context
- Deductive Reasoning - Logical deduction and inference
- Inductive Reasoning - Pattern generalization and rule learning
- Long-term Knowledge - Factual knowledge retrieval
- Quantitative Reasoning - Mathematical and numerical reasoning
- Semantic Relationship - Understanding semantic connections
- Spatial Reasoning - Spatial relationship understanding
- Temporal Reasoning - Time-based logical reasoning
The main configuration is handled through `config.yml`:

```yaml
settings:
  cuda_visible_devices: 0,1,2,3,4,5,6,7
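  # k_values appear to be the top-k fractions of parameters selected as
  # critical regions (e.g., 0.01024 matches --k in the usage examples)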
  k_values: [0.00001, 0.00004, 0.00016, 0.00064, 0.00256, 0.01024]

models:
  - name: meta-llama/Llama-3.1-8B-Instruct
    tokenizer: llama3.1

evals:
  benchmarks: ["Inductive_Reasoning", "Analogical_Reasoning"]
  num_fewshot: [0, 0]
```
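For orientation, here is a minimal sketch of how such a config can be consumed, assuming PyYAML; the loop body is a placeholder, not the actual `run.sh` logic:

```python
# Illustrative config consumer; assumes PyYAML (`pip install pyyaml`).
import yaml

with open("config.yml") as f:
    cfg = yaml.safe_load(f)

gpus = cfg["settings"]["cuda_visible_devices"]  # parsed as the string "0,1,2,3,4,5,6,7"
for model in cfg["models"]:
    for k in cfg["settings"]["k_values"]:
        # Placeholder for launching train -> extract -> damage -> eval at this k
        print(f"{model['name']} (tokenizer={model['tokenizer']}) at k={k} on GPUs {gpus}")
```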
Our approach consists of four main stages (chained programmatically in the sketch after this list):

1. Fine-tuning: Adapt models to specific reasoning tasks
2. Region Extraction: Identify critical parameters using gradient-based methods
3. Selective Modification: Apply targeted damage to the identified regions
4. Evaluation: Assess performance changes across benchmarks
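Stage 1 fine-tuning is driven by `run.sh` and the `training/` scripts. As a hedged illustration, stages 2-4 can be chained from Python by reusing the exact commands shown above; the driver itself is not part of the repository:

```python
# Chains stages 2-4 by shelling out to the commands from the Usage section;
# this driver is illustrative, not a repository entry point.
import os
import subprocess

BENCH, K = "Analogical_Reasoning", "0.01024"
BASE = f"./outputs/{BENCH}/llama3.1"

# 2. Region extraction (run from extract_region/, mirroring the Usage example)
subprocess.run(
    ["python", "extract_region.py", "generate_masks",
     "--input_dir", f"{BASE}/train/checkpoint_full",
     "--output_dir", f"{BASE}/extract/checkpoint_full",
     "--k", K],
    cwd="extract_region", check=True)

# 3. Selective modification (run from damage_region/)
subprocess.run(
    ["python", "damage_model.py",
     f"{BASE}/extract", f"{BASE}/damage",
     "meta-llama/Llama-3.1-8B-Instruct", K],
    cwd="damage_region", check=True)

# 4. Evaluation on a single GPU via lm-evaluation-harness
subprocess.run(
    ["lm_eval", "--model", "hf",
     "--model_args", f"pretrained={BASE}/damage/checkpoint_full/top{K}",
     "--tasks", "gsm8k,arc_challenge,hellaswag",
     "--batch_size", "8"],
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"}, check=True)
```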
If you use this code or find our work helpful, please cite:
```bibtex
@article{kim2025benchmark,
  title={Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks},
  author={Kim, Dongjun and Shim, Gyuho and Chun, Yongchan and Kim, Minhyuk and Park, Chanjun and Lim, Heuiseok},
  journal={arXiv preprint arXiv:2510.01232},
  year={2025}
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
For questions or issues, please open a GitHub issue or contact the authors.
Note: This repository is actively maintained; please check back for the latest version and updates.