
Behavioral Fingerprinting of Large Language Models

A reproducible framework to build multi-dimensional "behavioral fingerprints" of LLMs using a diagnostic prompt suite and an automated evaluator. The pipeline collects model responses, scores them against detailed rubrics via a separate evaluator model, and generates visual summaries.

Highlights

  • Diagnostic Prompt Suite across reasoning, world model, bias, personality, and robustness
  • Automated evaluation using a strong LLM as an impartial judge (JSON outputs)
  • Visualizations: radar profiles and category comparison charts
  • Narrative reports summarizing each model's qualitative fingerprint
  • Fully file-based artifacts checked into the repo (results/, evaluations/, charts/, reports/)

Repository structure

  • src/ — scripts to run the end-to-end pipeline
    • run_experiment.py — parse prompts and collect model responses into results/
    • run_evaluation.py — construct meta-prompts with rubrics and score into evaluations/
    • visualize_results.py — aggregate scores, generate charts in charts/, and write per-model reports in reports/
    • requirements.txt — Python dependencies
  • AI-comm-records/ — LaTeX records of the prompt suite and evaluation protocol, plus cached prompts.json
    • prompt_suite.tex, evaluation_protocol.tex, idea.tex, prompts.json
  • results/ — raw model responses (per model directory, per prompt .txt)
  • evaluations/ — evaluator JSON outputs mirroring results/ prompt IDs
  • charts/ — generated figures (radar profiles and category comparison charts)
  • reports/ — generated narrative reports (one per model)

Installation

  1. Python 3.10+
  2. Create a virtual environment and install dependencies:
python -m venv .venv && source .venv/bin/activate
pip install -r src/requirements.txt
  3. Configure environment for OpenRouter (used for both target models and evaluator):
  • Create a .env file at the repo root with:
OPENROUTER_API_KEY=your_key_here

Note: If no key is present, scripts run in simulation mode and still write placeholder outputs so the pipeline can be exercised end-to-end.
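
For orientation, here is a minimal sketch of how the key lookup and simulation fallback can work, assuming python-dotenv (the actual logic lives in the src/ scripts):

import os
from dotenv import load_dotenv

load_dotenv()  # reads OPENROUTER_API_KEY from the .env file at the repo root
API_KEY = os.getenv("OPENROUTER_API_KEY")
SIMULATION_MODE = API_KEY is None  # no key: emit placeholder outputs instead of calling the API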

Usage

1) Collect model responses

Edit TARGET_MODELS in src/run_experiment.py to include the OpenRouter identifiers you wish to evaluate (an example list is shown below), then run:

python src/run_experiment.py
  • Prompts are read from AI-comm-records/prompts.json (cached) or parsed from AI-comm-records/prompt_suite.tex on first run.
  • Outputs are written per model into results/<provider>/<model>/<prompt_id>.txt (or results/<model_id>/<prompt_id>.txt if you prefer a flat layout). The current repo uses the nested form, e.g., results/openai/gpt-5/.
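
As an illustration, TARGET_MODELS is a plain Python list of OpenRouter identifiers. The entries below are examples drawn from the artifacts in this repo, not the script's defaults:

# In src/run_experiment.py -- example values only
TARGET_MODELS = [
    "openai/gpt-5",
    "google/gemini-2.5-pro",
    "anthropic/claude-opus-4.1",
]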

2) Score responses with evaluator

Set the TARGET_MODELS list in src/run_evaluation.py to match the result folders you want scored. Optionally set EVALUATOR_MODEL.

python src/run_evaluation.py
  • Produces JSON files in evaluations/<provider>/<model>/<prompt_id>.json (see the hypothetical example below).
  • Robustness pairs (e.g., 4.1.1A/B) are evaluated jointly and saved under the base prompt ID (4.1.1.json, 4.1.2.json).
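
The output schema follows the rubrics in AI-comm-records/evaluation_protocol.tex. As a purely hypothetical illustration (the field names here are assumptions, not the documented schema), a file such as evaluations/openai/gpt-5/1.1.1.json might look like:

{
  "prompt_id": "1.1.1",
  "score": 4,
  "justification": "..."
}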

3) Aggregate, visualize, and report

python src/visualize_results.py
  • Aggregates numeric scores, normalizes by category maxima (sketched after this list), and emits:
    • Radar charts per model in charts/ (e.g., gpt-5_radar.png)
    • Comparison bar charts per category in charts/large/ or charts/mid/
    • Narrative reports per model in reports/ (e.g., gpt-5_report.txt)
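
A minimal sketch of the normalization step, assuming scores are grouped per category (the dictionary shape and function name are illustrative, not visualize_results.py's actual API):

def normalize_by_category(category_scores):
    # category_scores maps a category name to its raw scores,
    # e.g. {"Reasoning": [4, 5, 3], "Robustness": [2, 4]}
    normalized = {}
    for category, values in category_scores.items():
        max_score = max(values)
        normalized[category] = [v / max_score if max_score else 0.0 for v in values]
    return normalized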

Example artifacts

  • Radar: charts/gpt-5_radar.png, charts/gemini-2.5-pro_radar.png
  • Comparisons: charts/large/Robustness_comparison.png, charts/mid/Causal_Chain_comparison.png
  • Reports: reports/gpt-5_report.txt, reports/claude-opus-4.1_report.txt

Prompt suite and evaluation protocol

  • Prompts defined in AI-comm-records/prompt_suite.tex (cached JSON in AI-comm-records/prompts.json; an assumed entry shape is sketched after this list).
  • Rubrics and procedures in AI-comm-records/evaluation_protocol.tex.
  • Research narrative and scoping in AI-comm-records/idea.tex, discussion_points.tex, and literature_review.tex.
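
For reference, each cached entry in prompts.json plausibly pairs a prompt ID with its category and text. This shape is inferred from how the pipeline names its outputs, not a documented schema:

{
  "id": "2.1.1",
  "category": "World Model",
  "prompt": "..."
}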

Notes and tips

  • Model identifiers: scripts assume OpenRouter-style IDs (e.g., openai/gpt-5). Adjust paths or names consistently if you change the layout.
  • Simulation mode: without an API key, the system writes placeholder responses/evaluations so you can test downstream steps.
  • Personality classification prompts (3.3.x) yield non-numeric scores (e.g., E/I/S/N). Visualization code treats these separately and excludes them from numeric averages.
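
A minimal sketch of that separation, assuming raw scores arrive as a mixed list (the isinstance filter is an assumed approach, not the script's exact code):

raw_scores = [4, 5, "E", 3, "INTJ"]  # example mix of rubric scores and personality labels
numeric = [s for s in raw_scores if isinstance(s, (int, float))]
average = sum(numeric) / len(numeric) if numeric else 0.0  # -> 4.0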

Cite

If you find this work useful, please consider citing:

@article{pei2025behavioral,
  title={Behavioral Fingerprinting of Large Language Models},
  author={Pei, Zehua and Zhen, Hui-Ling and Zhang, Ying and Yang, Zhiyuan and Li, Xing and Yu, Xianzhi and Yuan, Mingxuan and Yu, Bei},
  journal={arXiv preprint arXiv:2509.04504},
  year={2025}
}
