---
title: 'CodeMonkeys: Scaling Test-Time Compute for Software Engineering'
authors:
  - key: ryanehrlich
    equal: true
  - key: bradleybrown
    affiliation: University of Oxford
    equal: true
  - key: jordanjuravsky
    equal: true
  - name: Ronald Clark
    affiliation: University of Oxford
  - name: Christopher Ré
    affiliation: Stanford
  - key: azaliamirhoseini
venue: preprint
year: 2025
day: 23
has_pdf: true
doi:
tags:
  - machine learning
  - generative AI
teaser: In this work, we present CodeMonkeys, a system designed to solve software engineering problems by scaling test-time compute. CodeMonkeys resolves 57.4% of issues in SWE-bench Verified. When ensembling with edits from existing top SWE-bench submissions, we obtain a score of 66.2%, outperforming the best member of the ensemble on its own.
materials:
  - name: Paper
    url: https://arxiv.org/
    type: file-pdf
  - name: CodeMonkeys Codebase
    url: https://github.com/ScalingIntelligence/codemonkeys
    type: code
  - name: Trajectories
    url:
    type: database
  - name: Codebase Content Dataset
    url: https://huggingface.co/datasets/ScalingIntelligence/swe-bench-verified-codebase-content
    type: database
---
Scaling test-time compute is a promising axis for improving LLM capabilities.
However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research.
Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset.
Our system (CodeMonkeys) allows models to iteratively edit a codebase by jointly developing and running a testing script alongside their draft edit.
We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits.
This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem.
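As a rough illustration, the two scaling axes can be sketched as nested sampling loops. This is a minimal sketch with stand-in types and a dummy model turn (all names here are illustrative, not the actual CodeMonkeys API; see the linked codebase for the real implementation):

```python
from dataclasses import dataclass, field

@dataclass
class TrajectoryState:
    """One multi-turn draft: a candidate edit evolving alongside its tests."""
    edit: str = ""
    history: list = field(default_factory=list)

def run_iteration(state: TrajectoryState, issue: str) -> TrajectoryState:
    # Stand-in for one model turn: revise the edit and testing script,
    # execute the tests, and append the feedback to the history.
    state.history.append(f"revision {len(state.history)} for: {issue}")
    state.edit = state.history[-1]
    return state

def generate_candidates(issue: str, n_parallel: int, n_serial: int) -> list[str]:
    """Parallel axis: more trajectories. Serial axis: more turns per trajectory."""
    candidates = []
    for _ in range(n_parallel):
        state = TrajectoryState()
        for _ in range(n_serial):
            state = run_iteration(state, issue)
        candidates.append(state.edit)
    return candidates
```
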
With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file.
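The up-front context step could look like the sketch below, where `model_says_relevant` is a hypothetical stand-in for prompting a model with the issue and a file's contents. Because the selected files are reused by every trajectory for an issue, this scan is paid for once:

```python
from pathlib import Path

def model_says_relevant(issue: str, source: str) -> bool:
    # Stand-in for a model call that classifies file relevance;
    # a keyword overlap check keeps the sketch self-contained.
    return any(token in source for token in issue.split())

def identify_context(repo_root: str, issue: str) -> list[str]:
    """Let a model read every file once; reuse the result across all samples."""
    relevant = []
    for path in sorted(Path(repo_root).rglob("*.py")):
        if model_says_relevant(issue, path.read_text(errors="ignore")):
            relevant.append(str(path))
    return relevant
```
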
In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection.
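A bare-bones version of this selection step might group candidates by their behavior on a shared set of model-generated tests and vote for the largest group. Everything here, in particular `run_test`, is a placeholder for illustration rather than the system's actual logic:

```python
from collections import Counter

def run_test(test: str, edit: str) -> bool:
    # Stand-in for applying `edit` to the codebase and executing one
    # model-generated test against the result.
    return test in edit

def select_edit(candidates: list[str], tests: list[str]) -> str:
    """Vote with model-generated tests, then pick among the finalists."""
    signatures = {c: tuple(run_test(t, c) for t in tests) for c in candidates}
    votes = Counter(signatures.values())
    top_signature, _ = votes.most_common(1)[0]
    finalists = [c for c, s in signatures.items() if s == top_signature]
    # The real system dedicates a final multi-turn trajectory to choosing
    # among the remaining candidates; taking the first is a placeholder.
    return finalists[0]
```
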
Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD.
Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own.