LLM Code Reproducibility: An Empirical Study of 300 AI-Generated Projects

ACM REP '26 Artifact Submission

Abstract

We evaluate the reproducibility of code generated by three leading AI coding agents -- Claude Code, GitHub Copilot (Codex), and Gemini Code Assist -- across 300 projects in Python, Java, and JavaScript. Each project is tested in a clean Docker environment using only the dependencies the agent declares, and the environment is reset between runs so that no state carries over. Through three-layer dependency analysis (claimed, working, runtime), we find that only 68.3% of projects execute successfully. Success rates vary dramatically by language (Python: 89.2%, JavaScript: 61.9%, Java: 44.0%) and agent (Claude: 73%, Gemini: 72%, Copilot: 60%). In all three languages, transitive dependencies create a large gap between what agents declare and what is actually installed.
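The three-layer comparison above can be sketched as simple set differences. This is an illustrative sketch only: the helper function and the example package lists below are hypothetical and are not taken from the study's actual analysis code.

```python
# Minimal sketch of the three-layer dependency comparison described in
# the abstract (claimed, working, runtime). The function name and the
# example package sets are illustrative assumptions, not study artifacts.

def dependency_layers(claimed, working, runtime):
    """Return packages that only appear at deeper layers."""
    claimed, working, runtime = set(claimed), set(working), set(runtime)
    return {
        # installed but never declared (transitive dependencies)
        "undeclared_but_installed": working - claimed,
        # imported at run time but never declared
        "used_at_runtime_only": runtime - claimed,
    }

layers = dependency_layers(
    claimed=["requests"],                        # e.g. from requirements.txt
    working=["requests", "urllib3", "certifi"],  # e.g. from pip freeze
    runtime=["requests", "urllib3"],             # e.g. modules imported at run time
)
print(sorted(layers["undeclared_but_installed"]))  # ['certifi', 'urllib3']
```

The gap between the claimed and working layers is what the abstract refers to as the transitive-dependency explosion.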

Repository Structure

LLM-Code-Evaluation/
├── README.md                  # This file
├── LICENSE                    # MIT License
├── requirements.txt           # Python dependencies for analysis
│
├── prompts/                   # 100 standardized task prompts
│   ├── p_1 .. p_40           # Python tasks
│   ├── p_41 .. p_75          # JavaScript tasks
│   └── p_76 .. p_100         # Java tasks
│
├── generated_code/            # 300 AI-generated projects
│   ├── claude/               # Claude Code output (100 projects)
│   │   ├── python/           # p_1..p_40
│   │   ├── javascript/       # p_41..p_75
│   │   └── java/             # p_76..p_100
│   ├── codex/                # GitHub Copilot output (100 projects)
│   │   ├── python/
│   │   ├── javascript/
│   │   └── java/
│   └── gemini/               # Gemini Code Assist output (100 projects)
│       ├── python/
│       ├── javascript/
│       └── java/
│
├── data/                      # Raw evaluation data
│   └── raw_csvs/             # 9 CSV files (300 rows total)
│
├── analysis/                  # Analysis scripts and outputs
│   ├── regenerate_all.py     # Single script to regenerate all figures + tables
│   ├── figures/              # 6 publication figures (PNG + PDF)
│   └── tables/               # 6 summary tables (CSV)
│
├── docker/                    # Docker evaluation environment
│   ├── Dockerfile.python     # python:3.10-slim
│   ├── Dockerfile.javascript # node:18-slim
│   ├── Dockerfile.java       # maven:3.9-eclipse-temurin-17
│   └── run_evaluation.sh     # Runs all 300 projects in isolated containers
│
└── docs/                      # Study documentation
    ├── METHODOLOGY.md         # Detailed experimental methodology
    └── CODE_GENERATION_STATISTICS.md

Quick Start: Reproduce All Figures and Tables

# Clone the repository
git clone https://github.com/radiant-systems-lab/LLM-Code-Evaluation.git
cd LLM-Code-Evaluation

# Install dependencies
pip install -r requirements.txt

# Regenerate all figures and tables from raw CSV data
cd analysis
python regenerate_all.py
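The raw CSVs in data/raw_csvs/ can also be inspected directly. A minimal sketch of computing per-agent success rates, assuming each row carries agent, language, and success columns (these column names are assumptions, not confirmed against the repository's actual CSV schema):

```python
import csv
import io

# Compute per-group success rates from evaluation rows.
# Column names ("agent", "language", "success") are assumed here;
# check the headers in data/raw_csvs/ before using on the real data.
def success_rates(rows, key):
    totals, passes = {}, {}
    for row in rows:
        k = row[key]
        totals[k] = totals.get(k, 0) + 1
        passes[k] = passes.get(k, 0) + (row["success"] == "True")
    return {k: passes[k] / totals[k] for k in totals}

# Inline sample data standing in for one of the raw CSV files.
sample = io.StringIO(
    "agent,language,success\n"
    "claude,python,True\n"
    "claude,java,False\n"
    "codex,python,True\n"
)
rates = success_rates(list(csv.DictReader(sample)), key="agent")
print(rates)  # {'claude': 0.5, 'codex': 1.0}
```

Grouping by "language" instead of "agent" would give the per-language breakdown reported in the abstract.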
