ACM REP '26 Artifact Submission
We evaluate the reproducibility of code generated by three leading AI coding agents -- Claude Code, GitHub Copilot (Codex), and Gemini Code Assist -- across 300 projects in Python, Java, and JavaScript. Each project is tested in a clean Docker environment using only the dependencies the agent declares, and the container is reset between runs so no state carries over from one project to the next. Through a three-layer dependency analysis (claimed, working, runtime), we find that only 68.3% of projects execute successfully. Success rates vary dramatically by language (Python: 89.2%, JavaScript: 61.9%, Java: 44.0%) and by agent (Claude: 73%, Gemini: 72%, Copilot: 60%). In all three languages, transitive dependencies open a large gap between what agents declare and what is actually installed.
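The three-layer comparison can be illustrated with a minimal sketch: parse the packages an agent claims in a requirements file, then diff them against what the installer actually resolves. The package names and the `pip freeze`-style set below are illustrative examples, not values from the study's dataset.

```python
# Sketch of the claimed-vs-working dependency comparison.
# Package names are illustrative, not taken from the actual data.

def parse_requirements(text: str) -> set[str]:
    """Extract bare package names from requirements.txt-style text."""
    names = set()
    for line in text.splitlines():
        line = line.split("#")[0].strip()  # drop comments
        if line:
            # Strip version specifiers such as ==, >=, <=, ~=
            for sep in ("==", ">=", "<=", "~=", ">", "<"):
                line = line.split(sep)[0]
            names.add(line.strip().lower())
    return names

claimed = parse_requirements("flask==2.3.0\nrequests>=2.31\n")
# What `pip freeze` might report after install (transitives included)
working = {"flask", "requests", "werkzeug", "jinja2", "click",
           "urllib3", "charset-normalizer", "idna", "certifi"}

transitive_gap = working - claimed  # installed but never declared
print(f"claimed: {len(claimed)}, working: {len(working)}, "
      f"gap: {len(transitive_gap)}")
```

Two declared packages pull in seven undeclared transitives here, which is the kind of gap the study measures at scale.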
```
LLM-Code-Evaluation/
├── README.md                  # This file
├── LICENSE                    # MIT License
├── requirements.txt           # Python dependencies for analysis
│
├── prompts/                   # 100 standardized task prompts
│   ├── p_1 .. p_40            # Python tasks
│   ├── p_41 .. p_75           # JavaScript tasks
│   └── p_76 .. p_100          # Java tasks
│
├── generated_code/            # 300 AI-generated projects
│   ├── claude/                # Claude Code output (100 projects)
│   │   ├── python/            # p_1..p_40
│   │   ├── javascript/        # p_41..p_75
│   │   └── java/              # p_76..p_100
│   ├── codex/                 # GitHub Copilot output (100 projects)
│   │   ├── python/
│   │   ├── javascript/
│   │   └── java/
│   └── gemini/                # Gemini Code Assist output (100 projects)
│       ├── python/
│       ├── javascript/
│       └── java/
│
├── data/                      # Raw evaluation data
│   └── raw_csvs/              # 9 CSV files (300 rows total)
│
├── analysis/                  # Analysis scripts and outputs
│   ├── regenerate_all.py      # Single script to regenerate all figures + tables
│   ├── figures/               # 6 publication figures (PNG + PDF)
│   └── tables/                # 6 summary tables (CSV)
│
├── docker/                    # Docker evaluation environment
│   ├── Dockerfile.python      # python:3.10-slim
│   ├── Dockerfile.javascript  # node:18-slim
│   ├── Dockerfile.java        # maven:3.9-eclipse-temurin-17
│   └── run_evaluation.sh      # Runs all 300 projects in isolated containers
│
└── docs/                      # Study documentation
    ├── METHODOLOGY.md         # Detailed experimental methodology
    └── CODE_GENERATION_STATISTICS.md
```
```bash
# Clone the repository
git clone https://github.com/radiant-systems-lab/LLM-Code-Evaluation.git
cd LLM-Code-Evaluation

# Install dependencies
pip install -r requirements.txt

# Regenerate all figures and tables from raw CSV data
cd analysis
python regenerate_all.py
```
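`regenerate_all.py` derives every figure and table from the raw CSVs in `data/raw_csvs/`. As a sanity check on that pipeline, the aggregation step amounts to computing a success rate over per-project rows; the sketch below assumes a hypothetical `execution_success` column (the real column names in the CSVs may differ).

```python
import csv
import io

# Hypothetical CSV shape; actual columns in data/raw_csvs/ may differ.
sample = """project,agent,language,execution_success
p_1,claude,python,1
p_41,claude,javascript,0
p_76,claude,java,1
"""

rows = list(csv.DictReader(io.StringIO(sample)))
rate = sum(int(r["execution_success"]) for r in rows) / len(rows)
print(f"success rate: {rate:.1%}")
```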