RepoCraft is a benchmark for evaluating repository-level code generation, derived from the paper "RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation". It consists of 1,052 tasks across 6 real-world Python projects, assessing whether AI agents can generate repositories that are functionally complete, algorithmically correct, and at real-world scale.
RepoCraft evaluates generated repositories along four dimensions:
| Metric | Description |
|---|---|
| Functionality Coverage | Proportion of reference feature categories covered |
| Functionality Novelty | Proportion of generated features outside reference taxonomy |
| Functionality Accuracy | Pass Rate (tests passed) and Voting Rate (semantic checks passed) |
| Code-Level Statistics | File count, Lines of Code (LOC), Token count |
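
The coverage and novelty definitions in the table can be illustrated with a small sketch. It assumes each generated feature has already been assigned either to a reference category or to an outlier tag; the function name `coverage_and_novelty` and the `assignments` mapping are hypothetical and are not part of the RepoCraft API.

```python
from typing import Dict, List


def coverage_and_novelty(
    assignments: Dict[str, str],
    reference_categories: List[str],
    outlier_tag: str = "new_features",
) -> Dict[str, float]:
    """Compute coverage and novelty from feature-to-category assignments.

    `assignments` maps each generated feature to a reference category,
    or to `outlier_tag` when it falls outside the reference taxonomy.
    (Hypothetical input shape, used for illustration only.)
    """
    reference = set(reference_categories)
    # Reference categories hit by at least one generated feature.
    covered = {cat for cat in assignments.values() if cat in reference}
    # Generated features that landed outside the reference taxonomy.
    novel = [feat for feat, cat in assignments.items() if cat == outlier_tag]
    coverage = len(covered) / len(reference) if reference else 0.0
    novelty = len(novel) / len(assignments) if assignments else 0.0
    return {"coverage": coverage, "novelty": novelty}


example = coverage_and_novelty(
    {
        "ridge_regression": "linear_models",
        "kmeans": "clustering",
        "quantum_svm": "new_features",
    },
    ["linear_models", "clustering", "metrics", "preprocessing"],
)
print(example)
```

Here two of the four reference categories are covered and one of the three generated features is novel.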
| Original | Anonymized Name | Domain | Tasks |
|---|---|---|---|
| scikit-learn | MLKit-Py | Machine Learning | 236 |
| pandas | TableKit | Data Analysis | 175 |
| sympy | SymbolicMath | Symbolic Computation | 192 |
| statsmodels | StatModeler | Statistical Modeling | 234 |
| requests | HttpEasy | HTTP Client | 50 |
| django | PyWebEngine | Web Framework | 165 |
Repository names are paraphrased to prevent pretraining data leakage.
repocraft/
├── __init__.py                  # Top-level exports
├── README.md                    # This file
│
├── benchmark/                   # Benchmark construction pipeline
│   ├── __init__.py
│   ├── __main__.py              # Allows `python -m repocraft.benchmark`
│   ├── main.py                  # Unified CLI: parse → refactor → sample → generate
│   ├── main_parse_test.py       # Standalone CLI for test parsing
│   ├── parse_test.py            # Test tree parsing with LLM
│   ├── refactor_test_tree.py    # Categorize flat test tree by directory structure
│   ├── sample.py                # Hierarchical test sampling
│   ├── generate_query.py        # Task query generation with LLM
│   └── prompt.py                # System prompts for parsing
│
├── coverage/                    # Coverage evaluation
│   ├── coverage.py              # Embedding-based coverage metrics
│   ├── sys_prompt.py            # Prompts for coverage evaluation
│   └── gt_repo_tree/            # Ground-truth feature trees (JSON)
│       ├── HttpEasy.json
│       ├── MLKit-Py.json
│       ├── PyWebEngine.json
│       ├── StatModeler.json
│       ├── SymbolicMath.json
│       └── TableKit.json
│
├── framework/                   # Accuracy evaluation framework
│   ├── __init__.py
│   ├── eval_framework.py        # Main evaluation orchestrator
│   ├── writing_code.py          # Test generation agent
│   ├── sys_prompt.py            # Prompts (localization, voting, coding)
│   ├── utils.py                 # Utility functions
│   └── docker/                  # Docker-based test execution
│       ├── __init__.py
│       ├── repo_docker.py       # Docker container management
│       └── eval_docker.py       # Docker environment for evaluation
│
├── evaluation.py                # Batch result evaluation & reporting
└── run.py                       # CLI entry point for evaluation run
The benchmark construction follows a 5-stage pipeline:
Reference Repository (e.g., scikit-learn)
│
▼
[Stage 1] Parse Test Tree ──────────── parse_test.py
│ Collect test files, parse into classes/functions,
│ use LLM to group test methods by algorithm
▼
Flat Feature-Grouped Test Tree (JSON)
│
▼
[Stage 2] Refactor Test Tree ──────── refactor_test_tree.py
│ Categorize flat file structure into tree
│ by meaningful directory names
▼
Categorized Test Tree (JSON)
│
▼
[Stage 3] Sample Tests ─────────────── sample.py
│ Hierarchical 3-level sampling:
│ files → classes → features
▼
Sampled Test Subset (JSON)
│
▼
[Stage 4] Generate Task Queries ─────── generate_query.py
│ For each test group, use LLM to produce:
│ - Algorithm description
│ - Natural language task query
▼
Task Queries (JSON) ← Ready for evaluation
│
▼
[Stage 5] Evaluate ──────────────────── run.py + framework/
│ Localization → Voting → Test execution
▼
Results
You can run the entire benchmark construction pipeline (Stages 1-4) with a single command:
python -m repocraft.benchmark pipeline \
--repo_dir /path/to/scikit-learn \
--output_dir ./all_results \
--repo_name sklearn \
--max_parse_workers 4 \
--max_query_workers 6

Or run individual stages separately (see below).
Parse the reference repository's test suite into a hierarchical feature tree. This identifies what algorithms/functions each test validates.
python -m repocraft.benchmark.main_parse_test parse \
--repo_dir /path/to/scikit-learn \
--result_path ./all_results/result_tests/sklearn.json \
--max_workers 4 \
--max_iterations 10

What it does:
- Scans the repository for test files (files in `tests/` directories)
- Parses each test file into code units (classes, functions) using AST
- Uses an LLM to group test methods by the core algorithm/functionality they test
- Outputs a hierarchical JSON structure
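
The scanning and AST-parsing steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in `parse_test.py`: the helper names `find_test_files` and `parse_test_source` are hypothetical, the `test` name prefix filter is an assumption, and the LLM grouping step is left out.

```python
import ast
from pathlib import Path
from typing import Dict, List


def find_test_files(repo_dir: str) -> List[Path]:
    """Return Python files located under any directory named `tests`."""
    root = Path(repo_dir)
    return sorted(
        p for p in root.rglob("*.py")
        if "tests" in p.relative_to(root).parts[:-1]
    )


def parse_test_source(source: str) -> Dict[str, List[str]]:
    """Split one test file into classes and module-level test functions."""
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    units: Dict[str, List[str]] = {"functions": []}
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            # Keep only methods whose names look like tests (assumption).
            units[f"class {node.name}"] = [
                item.name for item in node.body
                if isinstance(item, func_types) and item.name.startswith("test")
            ]
        elif isinstance(node, func_types) and node.name.startswith("test"):
            units["functions"].append(node.name)
    return units


sample_source = '''
class TestSVC:
    def test_svc_predict(self): pass
    def helper(self): pass

def test_svr_predict(): pass
'''
units = parse_test_source(sample_source)
print(units)
```

The resulting per-file units are what a grouping step would then cluster by algorithm.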
Output format:
{
"tests/test_svm.py": {
"class TestSVC": {
"svc_prediction": ["test_svc_predict", "test_svc_predict_proba"],
"svc_parameters": ["test_svc_kernel", "test_svc_gamma"]
},
"functions": {
"svr_basic": ["test_svr_predict"]
}
}
}

How the LLM grouping works:
- The LLM receives the full test class code and is instructed to group methods by the core algorithm they validate
- Grouping is semantic-first: tests checking different aspects of the same algorithm are merged
- Each test method appears exactly once
- Names use `snake_case` and refer to the public API or canonical algorithm name
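
Two of the grouping rules above (each test method appears exactly once, group names use `snake_case`) are easy to check mechanically on LLM output. The following sketch assumes the grouping for one class is a dict from group name to method list, as in the output format shown earlier; `validate_grouping` is a hypothetical helper, not something the repository is known to provide.

```python
import re
from collections import Counter
from typing import Dict, List

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def validate_grouping(groups: Dict[str, List[str]], test_methods: List[str]) -> List[str]:
    """Return a list of human-readable problems; empty means the grouping is valid."""
    problems = []
    for name in groups:
        if not SNAKE_CASE.match(name):
            problems.append(f"group name is not snake_case: {name}")
    # Count how often each method was placed across all groups.
    counts = Counter(m for methods in groups.values() for m in methods)
    for method in test_methods:
        if counts[method] != 1:
            problems.append(f"{method} appears {counts[method]} times")
    for method in counts:
        if method not in test_methods:
            problems.append(f"unknown method: {method}")
    return problems


ok = validate_grouping(
    {
        "svc_prediction": ["test_svc_predict", "test_svc_predict_proba"],
        "svc_parameters": ["test_svc_kernel"],
    },
    ["test_svc_predict", "test_svc_predict_proba", "test_svc_kernel"],
)
bad = validate_grouping({"SVC-Params": ["test_a", "test_a"]}, ["test_a", "test_b"])
print(ok, bad)
```

A check like this could be used to decide whether to re-prompt the LLM for a class.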
Convert the flat file-based test tree (keyed by absolute paths) into a categorized tree organized by meaningful directory names (e.g., metrics, clustering, preprocessing).
python -m repocraft.benchmark.main refactor \
--parsed_test ./all_results/result_tests/sklearn.json \
--result_path ./all_results/refactored_test/sklearn.json

What it does:
- Reads the flat parsed test tree from Stage 1
- For each file path, extracts the most meaningful directory name (skipping `test`/`tests` directories)
- Groups test files under their functional category
- Outputs a JSON with both the original (`files`) and refactored (`refactor`) structures
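
As a rough illustration of the categorization step, the sketch below picks the nearest ancestor directory that is not `test` or `tests`. That heuristic is an assumption about what "most meaningful" means; the actual rule in `refactor_test_tree.py` may differ, and the names `category_for` and `refactor` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict

SKIPPED_DIRS = {"test", "tests"}


def category_for(path: str, fallback: str = "root") -> str:
    """Pick the closest parent directory that is not a test directory."""
    parents = PurePosixPath(path).parts[:-1]  # drop the file name
    for name in reversed(parents):
        if name not in SKIPPED_DIRS and name != "/":
            return name
    return fallback


def refactor(parsed: Dict[str, dict]) -> Dict[str, dict]:
    """Group parsed test files under their category, keeping the original tree."""
    refactored: Dict[str, dict] = {}
    for path, content in parsed.items():
        stem = PurePosixPath(path).stem  # e.g. "test_svm"
        refactored.setdefault(category_for(path), {})[stem] = content
    return {"files": parsed, "refactor": refactored}


parsed_tree = {
    "sklearn/svm/tests/test_svm.py": {"functions": {}},
    "sklearn/metrics/tests/test_pairwise.py": {"functions": {}},
    "sklearn/metrics/tests/test_regression.py": {"functions": {}},
}
result = refactor(parsed_tree)
print(sorted(result["refactor"]))
```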
Output format:
{
"files": { ... },
"refactor": {
"svm": {
"test_svm": { "class TestSVC": { ... }, "functions": { ... } }
},
"metrics": {
"test_pairwise": { ... },
"test_regression": { ... }
}
}
}

Apply hierarchical sampling to select a representative subset from the refactored test tree.
python -m repocraft.benchmark.main sample \
--refactored_test ./all_results/refactored_test/sklearn.json \
--result_path ./all_results/sampled_test/sample_sklearn.json \
--num_files 12 \
--num_classes_per_file 20 \
--num_modules_per_class 10

Sampling strategy:
- Level 1 (Files): Randomly select `num_files` test files, excluding base/issue files
- Level 2 (Classes/Functions): From each file, randomly select `num_classes_or_functions_per_file` test classes or function groups
- Level 3 (Features): From each class, randomly select `num_modules_per_class` algorithm features
This ensures balanced coverage across the repository's functional categories.
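
The three sampling levels can be sketched as nested random draws. This is a simplified illustration rather than the logic in `sample.py`: it assumes a tree shaped as file → class or `functions` → feature → test names, ignores the category level, omits the base/issue file filter, and adds a hypothetical `seed` argument for reproducibility.

```python
import random
from typing import Dict, Iterable, List

# file -> class or "functions" -> feature -> test names (assumed shape)
Tree = Dict[str, Dict[str, Dict[str, List[str]]]]


def pick(items: Iterable[str], k: int, rng: random.Random) -> List[str]:
    """Draw up to k distinct items; sorting first keeps runs reproducible."""
    pool = sorted(items)
    return rng.sample(pool, min(k, len(pool)))


def sample_tests(
    files: Tree,
    num_files: int,
    num_classes_per_file: int,
    num_modules_per_class: int,
    seed: int = 0,
) -> Tree:
    rng = random.Random(seed)
    sampled: Tree = {}
    for file_name in pick(files, num_files, rng):  # Level 1: files
        classes = files[file_name]
        sampled[file_name] = {}
        for class_name in pick(classes, num_classes_per_file, rng):  # Level 2
            features = classes[class_name]
            sampled[file_name][class_name] = {  # Level 3: features
                feat: features[feat]
                for feat in pick(features, num_modules_per_class, rng)
            }
    return sampled


toy_tree: Tree = {
    f"test_{i}": {
        f"class C{j}": {f"feat_{k}": [f"test_{i}_{j}_{k}"] for k in range(3)}
        for j in range(2)
    }
    for i in range(3)
}
subset = sample_tests(toy_tree, num_files=2, num_classes_per_file=1, num_modules_per_class=2)
print(subset)
```

Capping each level independently keeps any single large test file or class from dominating the sampled set.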
For each sampled test group, generate a natural language task description and algorithm description using an LLM.
python -m repocraft.benchmark.main generate \
--sampled_test ./all_results/sampled_test/sample_sklearn.json \
--parsed_test ./all_results/result_tests/sklearn.json \
--result_path ./all_results/task_results/sklearn.json \
--max_workers 6

What it does:
- Reads the sampled test JSON and the full parsed test JSON
- For each sampled test group, extracts the actual test code
- Uses an LLM to generate:
  - Algorithm Description (`alg_description`): A high-level, abstract description of what the algorithm does, without implementation details
  - Task Query (`task_query`): A natural language query like "You are testing an algorithm that..."
Output format (one task):
{
"category": "metrics",
"file": "tests/test_metrics.py",
"module": "class TestRegressionMetrics",
"cap": "mean_squared_error",
"functions": ["test_mse_basic", "test_mse_multioutput"],
"query_code": "def test_mse_basic():\n ...",
"alg_description": "Computing the mean squared error between predicted and actual values, supporting both single-output and multi-output regression scenarios.",
"task_query": "You are testing an algorithm that calculates the mean squared error (MSE) between predicted values and ground truth, with support for sample weights and multi-output averaging strategies.",
"id": "sklearn-0042"
}

Run the evaluation pipeline on a generated repository against the task set.
python -m repocraft.run \
--tasks_file ./all_results/task_results/sklearn.json \
--method_path /path/to/generated/MLKit-Py \
--cache_dir ./eval_cache \
--mnt_dir /tmp/workspace \
--model_loc_vote o3-mini \
--model_test o3-mini \
--max_loc_iters 40 \
--max_coding_iters 15 \
--max_retries 5 \
--image_name zerorepo \
--skip_existing \
--verbose

The evaluation pipeline has 3 stages:
Localization: An RPGAgent navigates the generated repository to find the functions/classes that implement the target algorithm. It uses:
- RPG-guided search (functionality-based fuzzy matching)
- Repository code view (inspect function bodies)
- Dependency exploration (trace edges for related modules)
Voting: The pipeline runs 5 rounds of LLM-based voting to verify whether the localized code actually implements the target algorithm. If voting fails, the pipeline retries localization (up to 3 times).
Test execution: The ground-truth test is adapted to match the naming/structure of the localized code, then executed in a Docker container. The test result determines functional correctness.
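
The localize-then-vote loop described above can be sketched as plain control flow. Everything here is illustrative: `localize` and `vote` stand in for the LLM-backed steps, `max_attempts` is a hypothetical parameter name, and accepting a candidate on a strict majority of votes is an assumption, since the text does not state the acceptance threshold.

```python
from typing import Callable, Optional


def localize_with_voting(
    localize: Callable[[], str],
    vote: Callable[[str], bool],
    voting_times: int = 5,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first localized candidate that wins the vote, else None."""
    for _attempt in range(max_attempts):
        candidate = localize()
        # Each round is an independent yes/no judgment on the candidate.
        approvals = sum(1 for _round in range(voting_times) if vote(candidate))
        if approvals * 2 > voting_times:  # strict majority (assumption)
            return candidate
    return None  # voting failed on every attempt


# Stand-ins for the LLM-backed steps: the first candidate is wrong.
candidates = iter(["utils.helper", "metrics.mean_squared_error"])
found = localize_with_voting(
    localize=lambda: next(candidates),
    vote=lambda code: code.startswith("metrics."),
)
print(found)
```

Only a candidate that passes voting would then move on to test adaptation and execution.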
After running evaluation, analyze the results:
python -m repocraft.evaluation \
--base-dir ./exp_results \
--models gpt-4.1 gpt-5-mini \
--exp-types docs ref \
--show-failed \
--output results.json

This produces summary tables with pass rates, voting rates, and per-repository breakdowns.
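
To make the two headline rates concrete, here is a minimal sketch of how they could be computed from per-task records. The record fields `vote_passed` and `test_passed` are hypothetical, and dividing by the total number of tasks is an assumption; the actual result schema and denominators used by `evaluation.py` may differ.

```python
from typing import Dict, Iterable, List


def summarize(results: Iterable[Dict]) -> Dict[str, float]:
    """Compute voting rate and pass rate over all tasks (assumed denominators)."""
    records: List[Dict] = list(results)
    total = len(records)
    if total == 0:
        return {"tasks": 0, "voting_rate": 0.0, "pass_rate": 0.0}
    voted = sum(1 for r in records if r.get("vote_passed"))
    passed = sum(1 for r in records if r.get("test_passed"))
    return {
        "tasks": total,
        "voting_rate": voted / total,
        "pass_rate": passed / total,
    }


summary = summarize(
    [
        {"id": "sklearn-0001", "vote_passed": True, "test_passed": True},
        {"id": "sklearn-0002", "vote_passed": True, "test_passed": False},
        {"id": "sklearn-0003", "vote_passed": True, "test_passed": True},
        {"id": "sklearn-0004", "vote_passed": False, "test_passed": False},
    ]
)
print(summary)
```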
To evaluate how well a generated repository covers the reference feature taxonomy:
from repocraft.coverage.coverage import SubtreeCoverageEvaluator
evaluator = SubtreeCoverageEvaluator(
model_id="Alibaba-NLP/gte-Qwen2-7B-instruct",
outlier_tag="new_features"
)
# predicted_trees: feature tree from generated repository
# gt_tree: ground-truth feature tree from coverage/gt_repo_tree/
results = evaluator.evaluate(
predicted_trees=predicted_trees,
gt_tree=gt_tree
)

Ground-truth feature trees for all 6 repositories are provided in `coverage/gt_repo_tree/`.
# Stages 1-4 in one command
python -m repocraft.benchmark pipeline \
--repo_dir /path/to/scikit-learn \
--output_dir ./all_results \
--repo_name sklearn \
--max_parse_workers 4 \
--max_query_workers 6
# Stage 5: Run evaluation on generated repository
python -m repocraft.run \
--tasks_file ./all_results/task_results/sklearn.json \
--method_path /path/to/generated/MLKit-Py \
--cache_dir ./eval_cache \
--model o3-mini \
--skip_existing
# Analyze results
python -m repocraft.evaluation \
--base-dir ./eval_cache \
--show-failed

# Stage 1: Parse test tree
python -m repocraft.benchmark.main parse \
--repo_dir /path/to/scikit-learn \
--result_path ./all_results/result_tests/sklearn.json \
--max_workers 4
# Stage 2: Refactor into categorized tree
python -m repocraft.benchmark.main refactor \
--parsed_test ./all_results/result_tests/sklearn.json \
--result_path ./all_results/refactored_test/sklearn.json
# Stage 3: Sample tests
python -m repocraft.benchmark.main sample \
--refactored_test ./all_results/refactored_test/sklearn.json \
--result_path ./all_results/sampled_test/sample_sklearn.json
# Stage 4: Generate task queries
python -m repocraft.benchmark.main generate \
--sampled_test ./all_results/sampled_test/sample_sklearn.json \
--parsed_test ./all_results/result_tests/sklearn.json \
--result_path ./all_results/task_results/sklearn.json \
--max_workers 6
# Stage 5: Run evaluation on generated repository
python -m repocraft.run \
--tasks_file ./all_results/task_results/sklearn.json \
--method_path /path/to/generated/MLKit-Py \
--cache_dir ./eval_cache \
--model o3-mini \
--skip_existing
# Analyze results
python -m repocraft.evaluation \
--base-dir ./eval_cache \
--show-failed

The pipeline uses an LLM for test parsing, query generation, localization, voting, and test adaptation. Configure via:
from zerorepo.rpg_gen.base.llm_client import LLMConfig
# Use a specific model
cfg = LLMConfig(model="o3-mini")
# Or load from JSON/YAML file
cfg = LLMConfig.from_source("path/to/config.json")

The evaluation framework runs tests inside Docker containers for isolation:
- Image: `zerorepo` (default)
- Mount: Test workspace mounted at `/workspace`, repo at `/repo`
- Conda: Activates the `zerorepo` conda environment inside the container
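
For orientation, the container setup above corresponds roughly to a `docker run` invocation like the one built below. This is a command-line equivalent written for illustration, not the code in `framework/docker/`; the helper name `build_docker_command` is hypothetical, and running the test through `pytest` is an assumption.

```python
from typing import List


def build_docker_command(
    workspace_dir: str,
    repo_dir: str,
    test_path: str,
    image: str = "zerorepo",
    conda_env: str = "zerorepo",
) -> List[str]:
    """Build (but do not run) a docker command mirroring the setup above."""
    return [
        "docker", "run", "--rm",
        "-v", f"{workspace_dir}:/workspace",  # test workspace mount
        "-v", f"{repo_dir}:/repo",            # generated repository mount
        "-w", "/workspace",
        image,
        # Run inside the named conda environment.
        "conda", "run", "-n", conda_env,
        "python", "-m", "pytest", test_path,  # pytest is an assumption
    ]


cmd = build_docker_command("/tmp/workspace", "/path/to/generated/MLKit-Py", "test_task.py")
print(" ".join(cmd))
```

The argument list can be passed to `subprocess.run(cmd, check=False)` on a machine where the image is available.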
| Parameter | Default | Description |
|---|---|---|
| `max_loc_iters` | 40 | Max localization iterations per attempt |
| `max_coding_iters` | 15 | Max test generation iterations |
| `max_retries` | 5 | Max localization retry attempts |
| `voting_times` | 5 | Number of voting rounds |
| `context_window` | 10-20 | LLM memory context window |
- `zerorepo`: Repository analysis, LLM client, code parsing
- `docker`: Container management for test execution
- `tiktoken`: Token counting for output truncation
- `networkx`: Graph operations (dependency graphs)
- `torch`, `transformers`: Embedding models for coverage evaluation
- `scikit-learn`: Clustering for coverage matching
- `tqdm`: Progress bars