Commit e788a35
Parent: 927fde0
Commit message: Commit in progress - todo amend

10 files changed: +517 −0 lines
_blogs/codemonkeys.md

Lines changed: 469 additions & 0 deletions
Large diffs are not rendered by default.

_pubs/codemonkeys.md

Lines changed: 48 additions & 0 deletions
---
title: 'CodeMonkeys: Scaling Test-Time Compute for Software Engineering'
authors:
  - key: ryanehrlich
    equal: true
  - key: bradleybrown
    affiliation: University of Oxford
    equal: true
  - key: jordanjuravsky
    equal: true
  - name: Ronald Clark
    affiliation: University of Oxford
  - name: Christopher Ré
    affiliation: Stanford
  - key: azaliamirhoseini
venue: preprint
year: 2025
day: 23
has_pdf: true
doi:
tags:
  - machine learning
  - generative AI
teaser: In this work, we present CodeMonkeys, a system designed to solve software engineering problems by scaling test-time compute. CodeMonkeys resolves 57.4% of issues in SWE-bench Verified. When ensembling with edits from existing top SWE-bench submissions, we obtain a score of 66.2%, outperforming the best member of the ensemble on its own.
materials:
  - name: Paper
    url: https://arxiv.org/
    type: file-pdf
  - name: CodeMonkeys Codebase
    url: https://github.com/ScalingIntelligence/codemonkeys
    type: code
  - name: Trajectories
    url:
    type: database
  - name: Codebase Content Dataset
    url: https://huggingface.co/datasets/ScalingIntelligence/swe-bench-verified-codebase-content
    type: database
---
Scaling test-time compute is a promising axis for improving LLM capabilities.
However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research.
Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset.
Our system (CodeMonkeys) allows models to iteratively edit a codebase by jointly developing and running a testing script alongside their draft edit.
We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits.
This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem.
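The two scaling axes above can be sketched as a pair of nested sampling loops. In this sketch, `draft_edit`, `draft_test`, and `run_tests` are hypothetical stand-ins for model calls and sandboxed execution, not the actual CodeMonkeys API:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    edits: list = field(default_factory=list)     # one draft edit per iteration
    tests: list = field(default_factory=list)     # testing scripts developed alongside
    feedback: list = field(default_factory=list)  # execution results fed back to the model

def solve_issue(issue, draft_edit, draft_test, run_tests,
                num_trajectories=8, num_iterations=4):
    """Return one candidate edit per trajectory.

    The outer loop is the "parallel" axis (independent trajectories);
    the inner loop is the "serial" axis (iterations within a trajectory).
    """
    candidates = []
    for _ in range(num_trajectories):             # parallel test-time compute
        traj = Trajectory()
        for _ in range(num_iterations):           # serial test-time compute
            edit = draft_edit(issue, traj)        # model revises its edit...
            test = draft_test(issue, traj)        # ...jointly with a testing script
            traj.edits.append(edit)
            traj.tests.append(test)
            traj.feedback.append(run_tests(edit, test))  # drives the next revision
        candidates.append(traj.edits[-1])         # final edit is this trajectory's candidate
    return candidates
```

Increasing `num_iterations` spends more serial compute per trajectory, while increasing `num_trajectories` spends more parallel compute per issue.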
With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file.
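This context step can be sketched as a one-time relevance scan over the codebase whose cost is amortized over every downstream trajectory. Here `score_relevance` is an illustrative placeholder for an LLM call, not a function from the CodeMonkeys codebase:

```python
def identify_context(files, score_relevance, budget_chars=100_000):
    """Rank files by model-judged relevance, then keep as many as fit the budget.

    `files` maps path -> file text; `score_relevance(path, text)` stands in
    for asking an LLM how relevant a file is to the issue being solved.
    """
    ranked = sorted(files.items(),
                    key=lambda kv: score_relevance(kv[0], kv[1]),
                    reverse=True)
    context, used = {}, 0
    for path, text in ranked:
        if used + len(text) > budget_chars:
            continue                  # skip files that would blow the context budget
        context[path] = text
        used += len(text)
    return context
```

Because the same selected context is reused by every trajectory for an issue, the cost of reading every file is paid once rather than per sample.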
To select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection.
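The voting half of this selection step can be sketched as scoring each candidate by the model-generated tests it passes. `run_test` is a hypothetical placeholder for executing a generated test against a patched codebase:

```python
def vote_with_tests(candidates, tests, run_test):
    """Score each candidate edit by how many model-generated tests it passes."""
    scores = {c: sum(bool(run_test(c, t)) for t in tests) for c in candidates}
    top = max(scores.values())
    # Ties are common, so every top-scoring candidate is returned; in the full
    # system these finalists go on to a dedicated multi-turn selection trajectory.
    return [c for c, score in scores.items() if score == top]
```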
Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD.
Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own.
imgs/blog/codemonkeys/swebench.png (182 KB)

imgs/teasers/codemonkeys.png (330 KB)

imgs/thumbs/codemonkeys.png (705 KB)

pubs/codemonkeys.pdf (1.42 MB, binary file not shown)

0 commit comments