A reproducible framework to build multi-dimensional "behavioral fingerprints" of LLMs using a diagnostic prompt suite and an automated evaluator. The pipeline collects model responses, scores them against detailed rubrics via a separate evaluator model, and generates visual summaries.
- Diagnostic Prompt Suite across reasoning, world model, bias, personality, and robustness
- Automated evaluation using a strong LLM as an impartial judge (JSON outputs)
- Visualizations: radar profiles and category comparison charts
- Narrative reports summarizing each model's qualitative fingerprint
- Fully file-based artifacts checked into the repo (`results/`, `evaluations/`, `charts/`, `reports/`)
- `src/`: scripts to run the end-to-end pipeline
  - `run_experiment.py`: parses prompts and collects model responses into `results/`
  - `run_evaluation.py`: constructs meta-prompts with rubrics and writes scores into `evaluations/`
  - `visualize_results.py`: aggregates scores, generates charts in `charts/`, and writes per-model reports into `reports/`
  - `requirements.txt`: Python dependencies
- `AI-comm-records/`: LaTeX records of the prompt suite and evaluation protocol (`prompt_suite.tex`, `evaluation_protocol.tex`, `idea.tex`), plus the cached `prompts.json`
- `results/`: raw model responses (one directory per model, one `.txt` file per prompt)
- `evaluations/`: evaluator JSON outputs mirroring the prompt IDs in `results/`
- `charts/`: generated figures (radar profiles and category comparisons)
- `reports/`: generated narrative reports (one per model)
- Python 3.10+
- Create a virtual environment and install dependencies:

  ```sh
  python -m venv .venv && source .venv/bin/activate
  pip install -r src/requirements.txt
  ```

- Configure the environment for OpenRouter (used for both the target models and the evaluator): create a `.env` file at the repo root containing:

  ```
  OPENROUTER_API_KEY=your_key_here
  ```
Note: If no key is present, scripts run in simulation mode and still write placeholder outputs so the pipeline can be exercised end-to-end.
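A minimal sketch of this key check and simulation fallback (the function name, placeholder text, and use of `urllib` are illustrative assumptions, not the repo's actual implementation):

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def get_response(model: str, prompt: str) -> str:
    """Return a model response via OpenRouter, or a placeholder in simulation mode."""
    api_key = os.environ.get("OPENROUTER_API_KEY")
    if not api_key:
        # Simulation mode: no key present, so emit a placeholder response that
        # still lets evaluation and visualization be exercised end-to-end.
        return f"[SIMULATED] {model} response to: {prompt[:40]}"
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    request = urllib.request.Request(
        OPENROUTER_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```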
Edit `TARGET_MODELS` in `src/run_experiment.py` to include the OpenRouter identifiers you wish to evaluate, then run:

```sh
python src/run_experiment.py
```

- Prompts are read from `AI-comm-records/prompts.json` (cached) or parsed from `AI-comm-records/prompt_suite.tex` on first run.
- Outputs are written per model into `results/<provider>/<model>/<prompt_id>.txt` or `results/<model_id>/<prompt_id>.txt`, depending on your naming choice. The current repo uses OpenRouter-style IDs as paths, e.g. `results/openai/gpt-5/`.
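The cache-then-parse behavior described above could look roughly like this (`load_prompts` and the parser stub are hypothetical names, sketched under the assumption that `prompts.json` maps prompt IDs to prompt text):

```python
import json
from pathlib import Path

def parse_prompt_suite(tex_path: Path) -> dict:
    """Stand-in for the real LaTeX parser in run_experiment.py (not shown here)."""
    raise NotImplementedError

def load_prompts(records_dir: str = "AI-comm-records") -> dict:
    """Load the prompt suite, preferring the cached JSON over re-parsing LaTeX."""
    cache = Path(records_dir) / "prompts.json"
    if cache.exists():
        return json.loads(cache.read_text())
    # First run: parse the LaTeX suite, then cache the result for next time.
    prompts = parse_prompt_suite(Path(records_dir) / "prompt_suite.tex")
    cache.write_text(json.dumps(prompts, indent=2))
    return prompts
```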
Set the `TARGET_MODELS` list in `src/run_evaluation.py` to match the result folders you want scored. Optionally set `EVALUATOR_MODEL`.
```sh
python src/run_evaluation.py
```

- Produces JSON files in `evaluations/<provider>/<model>/<prompt_id>.json`.
- Robustness pairs (e.g., `4.1.1A`/`4.1.1B`) are evaluated jointly and saved as `4.1.1.json`, `4.1.2.json`.
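Evaluator models sometimes wrap their JSON verdict in prose or markdown fences, so a tolerant extraction step along these lines is useful (a hypothetical helper, not the repo's exact code):

```python
import json
import re

def parse_evaluator_json(raw: str) -> dict:
    """Extract the first JSON object from an evaluator reply, tolerating
    surrounding prose or markdown code fences."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in evaluator output")
    return json.loads(match.group(0))
```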
```sh
python src/visualize_results.py
```

Aggregates numeric scores, normalizes by category maxima, and emits:

- radar charts per model in `charts/` (e.g., `gpt-5_radar.png`)
- comparison bar charts per category in `charts/large/` or `charts/mid/`
- narrative reports per model in `reports/` (e.g., `gpt-5_report.txt`)
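The per-category normalization mentioned above is essentially a division by each category's rubric maximum; a sketch with hypothetical names:

```python
def normalize_by_category(mean_scores: dict, category_maxima: dict) -> dict:
    """Scale each category's mean score into [0, 1] by that category's rubric
    maximum, so categories scored on different scales share one radar axis."""
    return {cat: mean_scores[cat] / category_maxima[cat] for cat in mean_scores}
```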
- Radar: `charts/gpt-5_radar.png`, `charts/gemini-2.5-pro_radar.png`
- Comparisons: `charts/large/Robustness_comparison.png` or `charts/mid/Causal_Chain_comparison.png`
- Reports: `reports/gpt-5_report.txt`, `reports/claude-opus-4.1_report.txt`
- Prompts are defined in `AI-comm-records/prompt_suite.tex` (cached as JSON in `AI-comm-records/prompts.json`).
- Rubrics and procedures are in `AI-comm-records/evaluation_protocol.tex`.
- Research narrative and scoping are in `AI-comm-records/idea.tex`, `discussion_points.tex`, and `literature_review.tex`.
- Model identifiers: scripts assume OpenRouter-style IDs (e.g., `openai/gpt-5`). Adjust paths or names consistently if you change the layout.
- Simulation mode: without an API key, the system writes placeholder responses and evaluations so you can test downstream steps.
- Personality classification prompts (`3.3.x`) yield non-numeric scores (e.g., `E`/`I`/`S`/`N`). Visualization code treats these separately and excludes them from numeric averages.
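Excluding those non-numeric personality results from the averages can be done with a simple type filter (an illustrative sketch, not the repo's exact code):

```python
def numeric_average(scores: dict) -> float:
    """Average only numeric scores; personality entries such as 'E' or 'INTJ'
    (prompts 3.3.x) are skipped rather than coerced to numbers."""
    numeric = [v for v in scores.values()
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if not numeric:
        raise ValueError("no numeric scores to average")
    return sum(numeric) / len(numeric)
```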
If you found this work useful, please consider citing:

```bibtex
@article{pei2025behavioral,
  title={Behavioral Fingerprinting of Large Language Models},
  author={Pei, Zehua and Zhen, Hui-Ling and Zhang, Ying and Yang, Zhiyuan and Li, Xing and Yu, Xianzhi and Yuan, Mingxuan and Yu, Bei},
  journal={arXiv preprint arXiv:2509.04504},
  year={2025}
}
```