LLM Code Reproducibility: An Empirical Study of 300 AI-Generated Projects

ACM REP '26 Artifact Submission

Abstract

We evaluate the reproducibility of code generated by three leading AI coding agents -- Claude Code, GitHub Copilot (Codex), and Gemini Code Assist -- across 300 projects in Python, Java, and JavaScript. Each project is tested in a clean Docker environment using only the dependencies the agent declares, and the environment is reset between runs so that no state carries over. Through three-layer dependency analysis (claimed, working, runtime), we find that only 68.3% of projects execute successfully. Success rates vary dramatically by language (Python: 89.2%, JavaScript: 61.9%, Java: 44.0%) and agent (Claude: 73%, Gemini: 72%, Copilot: 60%). In all three languages, transitive dependencies create a large gap between what agents declare and what is actually installed.
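The three-layer comparison above can be sketched as simple set differences. This is an illustrative sketch only: the helper function and the example package lists below are hypothetical and are not taken from the study's actual analysis code.

```python
# Minimal sketch of the three-layer dependency comparison described in
# the abstract (claimed, working, runtime). The function name and the
# example package sets are illustrative assumptions, not study artifacts.

def dependency_layers(claimed, working, runtime):
    """Return packages that only appear at deeper layers."""
    claimed, working, runtime = set(claimed), set(working), set(runtime)
    return {
        # installed but never declared (transitive dependencies)
        "undeclared_but_installed": working - claimed,
        # imported at run time but never declared
        "used_at_runtime_only": runtime - claimed,
    }

layers = dependency_layers(
    claimed=["requests"],                        # e.g. from requirements.txt
    working=["requests", "urllib3", "certifi"],  # e.g. from pip freeze
    runtime=["requests", "urllib3"],             # e.g. modules imported at run time
)
print(sorted(layers["undeclared_but_installed"]))  # ['certifi', 'urllib3']
```

The gap between the claimed and working layers is what the abstract refers to as the transitive-dependency explosion.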

Repository Structure

LLM-Code-Evaluation/
├── README.md                  # This file
├── LICENSE                    # MIT License
├── requirements.txt           # Python dependencies for analysis
│
├── prompts/                   # 100 standardized task prompts
│   ├── p_1 .. p_40           # Python tasks
│   ├── p_41 .. p_75          # JavaScript tasks
│   └── p_76 .. p_100         # Java tasks
│
├── generated_code/            # 300 AI-generated projects
│   ├── claude/               # Claude Code output (100 projects)
│   │   ├── python/           # p_1..p_40
│   │   ├── javascript/       # p_41..p_75
│   │   └── java/             # p_76..p_100
│   ├── codex/                # GitHub Copilot output (100 projects)
│   │   ├── python/
│   │   ├── javascript/
│   │   └── java/
│   └── gemini/               # Gemini Code Assist output (100 projects)
│       ├── python/
│       ├── javascript/
│       └── java/
│
├── data/                      # Raw evaluation data
│   └── raw_csvs/             # 9 CSV files (300 rows total)
│
├── analysis/                  # Analysis scripts and outputs
│   ├── regenerate_all.py     # Single script to regenerate all figures + tables
│   ├── figures/              # 6 publication figures (PNG + PDF)
│   └── tables/               # 6 summary tables (CSV)
│
├── docker/                    # Docker evaluation environment
│   ├── Dockerfile.python     # python:3.10-slim
│   ├── Dockerfile.javascript # node:18-slim
│   ├── Dockerfile.java       # maven:3.9-eclipse-temurin-17
│   └── run_evaluation.sh     # Runs all 300 projects in isolated containers
│
└── docs/                      # Study documentation
    ├── METHODOLOGY.md         # Detailed experimental methodology
    └── CODE_GENERATION_STATISTICS.md

Quick Start: Reproduce All Figures and Tables

# Clone the repository
git clone https://github.com/radiant-systems-lab/LLM-Code-Evaluation.git
cd LLM-Code-Evaluation

# Install dependencies
pip install -r requirements.txt

# Regenerate all figures and tables from raw CSV data
cd analysis
python regenerate_all.py
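The raw CSVs in data/raw_csvs/ can also be inspected directly. A minimal sketch of computing per-agent success rates, assuming each row carries agent, language, and success columns (these column names are assumptions, not confirmed against the repository's actual CSV schema):

```python
import csv
import io

# Compute per-group success rates from evaluation rows.
# Column names ("agent", "language", "success") are assumed here;
# check the headers in data/raw_csvs/ before using on the real data.
def success_rates(rows, key):
    totals, passes = {}, {}
    for row in rows:
        k = row[key]
        totals[k] = totals.get(k, 0) + 1
        passes[k] = passes.get(k, 0) + (row["success"] == "True")
    return {k: passes[k] / totals[k] for k in totals}

# Inline sample data standing in for one of the raw CSV files.
sample = io.StringIO(
    "agent,language,success\n"
    "claude,python,True\n"
    "claude,java,False\n"
    "codex,python,True\n"
)
rates = success_rates(list(csv.DictReader(sample)), key="agent")
print(rates)  # {'claude': 0.5, 'codex': 1.0}
```

Grouping by "language" instead of "agent" would give the per-language breakdown reported in the abstract.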
