Replication package for PROBE-SWE, a dynamic benchmark for generating, validating, and analyzing data-induced cognitive biases in general-purpose AI (GPAI) on typical software-engineering dilemmas.
benchmarking reproducible-research prolog openai software-engineering llama dataset-generation reasoning cognitive-bias replication-package bias-detection ai-evaluation llm-evaluation deepseek llm-as-a-judge gpt-4o dynamic-benchmark general-purpose-ai probe-swe bias-sensitivity
Updated Nov 29, 2025 - Jupyter Notebook