Commit 3a23f04

Author: fengji.zhang
Merge pull request #13 from microsoft/zfj/RepoCoder
initialize RepoCoder
2 parents 1951f72 + c1ae5d6 commit 3a23f04

15 files changed: +1118 -0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ These projects are presented by Microsoft Research Asia and Microsoft Azure AI.
 
 - [[CodeT]](./CodeT/): Code Generation with Generated Tests
 - [[DIVERSE]](./DIVERSE/): On the Advance of Making Language Models Better Reasoners
+- [[RepoCoder]](./RepoCoder/): Repository-Level Code Completion Through Iterative Retrieval and Generation

RepoCoder/LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE

RepoCoder/README.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

# Overview

In the paper, we present **RepoCoder**, a simple, generic, and effective framework for repository-level code completion, that is, continuing unfinished code using the broader context of the repository. RepoCoder incorporates a similarity-based retriever, a pre-trained code language model, and a novel iterative retrieval-generation paradigm. It streamlines the overall process and eliminates the need for the heuristic rules, static code analysis, data labeling, and model re-training required by previous studies.

![framework](./figs/framework.png)
<center>
Figure 1. An illustration of the RepoCoder framework.
</center>
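
The iterative retrieval-generation paradigm described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the code in this commit: the whitespace tokenizer, the Jaccard-based `retrieve`, and the `generate` callback are hypothetical stand-ins for the similarity-based retriever and the pre-trained code language model. The point it tries to show is that the second retrieval query includes the model's first prediction.

```python
from typing import Callable, List, Set


def tokenize(code: str) -> Set[str]:
    """Crude bag of words: the set of whitespace-separated tokens (an assumption)."""
    return set(code.split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, snippets: List[str], top_k: int) -> List[str]:
    """Return the top_k snippets most similar to the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        snippets,
        key=lambda snippet: jaccard(query_tokens, tokenize(snippet)),
        reverse=True,
    )
    return ranked[:top_k]


def iterative_completion(
    unfinished: str,
    snippets: List[str],
    generate: Callable[[str], str],  # hypothetical wrapper around a code language model
    iterations: int = 2,
    top_k: int = 2,
) -> str:
    """Alternate retrieval and generation for a fixed number of iterations."""
    query = unfinished
    completion = ""
    for _ in range(iterations):
        retrieved = retrieve(query, snippets, top_k)
        prompt = "\n".join(retrieved) + "\n" + unfinished
        completion = generate(prompt)
        # The next query is grounded in the model's own prediction.
        query = unfinished + completion
    return completion
```

With `iterations=1` the loop reduces to plain retrieval-augmented generation; the second pass is what lets the retriever find code that resembles the intended completion rather than only the unfinished context.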

We also present a new benchmark, **RepoEval**, for the repository-level code completion task. It consists of recent, high-quality real-world repositories and covers line, API invocation, and function body completion scenarios. We test the performance of RepoCoder and show that it improves the zero-shot code completion baseline by over 10% and consistently outperforms the vanilla retrieval-augmented code completion approach.

## Project

This project contains the basic components of RepoCoder. Here is an overview:

```shell
|-- make_window.py # slice the repository files and the model predictions into code snippets
|-- build_vector.py # build the vector representation for the code snippets
|-- search_code.py # search relevant code snippets with the vector representation
|-- build_prompt.py # build the prompt with the unfinished code and the retrieved code snippets
|-- run_pipeline.py # run the code completion pipeline
|-- compute_score.py # evaluate the performance of the code completion
|-- utils.py # utility functions
|-- datasets/datasets.zip # the input data for the code completion task
  |-- function_level_completion_4k_context_codex.test.jsonl
  |-- function_level_completion_2k_context_codex.test.jsonl
  |-- line_level_completion_4k_context_codex.test.jsonl
  |-- line_level_completion_2k_context_codex.test.jsonl
  |-- line_level_completion_2k_context_codegen.test.jsonl
  |-- line_level_completion_1k_context_codegen.test.jsonl
  |-- api_level_completion_4k_context_codex.test.jsonl
  |-- api_level_completion_2k_context_codex.test.jsonl
  |-- api_level_completion_2k_context_codegen.test.jsonl
  |-- api_level_completion_1k_context_codegen.test.jsonl
|-- repositories # the checkpoints of repositories used to build our benchmark
  |-- function_level.zip
    |-- CarperAI_trlx
    |-- lucidrains_imagen-pytorch
    |-- deepmind_tracr
    |-- leopard-ai_betty
    |-- google_lightweight_mmm
    |-- amazon-science_patchcore-inspection
    |-- facebookresearch_omnivore
    |-- maxhumber_redframes
  |-- line_and_api_level.zip
    |-- pytorch_rl
    |-- opendilab_ACE
    |-- google_vizier
    |-- awslabs_fortuna
    |-- huggingface_evaluate
    |-- huggingface_diffusers
    |-- nerfstudio-project_nerfstudio
    |-- alibaba_FederatedScope
```
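
The tree above describes `make_window.py` as slicing repository files into code snippets. The sketch below shows one way such sliding-window slicing could look. It is an assumption, not the contents of `make_window.py` (which is not shown here): the function name `make_windows`, the `start_line_no` key, and the choice of a stride of `window_size // slice_size` lines are all hypothetical, the last one inferred from how `build_prompt.py` shifts a window by that amount.

```python
from typing import Dict, List


def make_windows(code: str, window_size: int, slice_size: int) -> List[Dict]:
    """Cut a source file into overlapping windows of at most `window_size` lines.

    Assumption: consecutive windows start `window_size // slice_size` lines apart.
    """
    lines = code.splitlines()
    step = max(1, window_size // slice_size)
    windows = []
    for start in range(0, len(lines), step):
        end = min(start + window_size, len(lines))
        windows.append({
            "context": "\n".join(lines[start:end]),
            "start_line_no": start,  # hypothetical key, 0-based and inclusive
            "end_line_no": end,      # exclusive
        })
        if end == len(lines):
            # The last window already reaches the end of the file.
            break
    return windows
```

For example, a 10-line file with `window_size=4` and `slice_size=2` yields windows starting at lines 0, 2, 4, and 6, so every line away from the file boundaries appears in two windows.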

We use a private library to execute and evaluate function-level completions. Due to licensing restrictions, we cannot release that code. However, we provide the data for the function-level completion task in `datasets/datasets.zip` and `repositories/function_level.zip`.

# Quickstart

## Prepare Environment
First, set up a Python environment. This code base has been tested under Python 3.8.

```bash
$ conda create -n repocoder python=3.8
$ conda activate repocoder
$ pip install -r requirements.txt
```

## Run the Code Completion
1. The `run_RG1_and_oracle_method` function in `run_pipeline.py` shows the process of building the prompts for vanilla retrieval-augmented generation (RG1) and the Oracle method. The generated prompts are written to a `.jsonl` file, where each line has the following format:
    ```json
    {
        "prompt": "...the retrieved code snippets and unfinished code...",
        "metadata": {
            "task_id": "repo_name/idx",
            "ground_truth": "ground truth completion",
            "fpath_tuple": ["path", "to", "target file"],
            "context_start_lineno": 0,
            "line_no": 10
        }
    }
    ```

2. Then, call the model to generate completions and organize the results in the following format:
    ```json
    {
        "prompt": "...the retrieved code snippets and unfinished code...",
        "choices": [{"text": "...generated completion without repeating the input prompt..."}],
        "metadata": {}
    }
    ```

3. Next, evaluate the code completion performance with the `compute_score.py` script, which computes the Exact Match and Edit Similarity scores.

4. After that, use the `run_RepoCoder_method` function in `run_pipeline.py` to run a second iteration of retrieval-generation, which is our proposed RepoCoder approach, using the prediction file of RG1. Finally, evaluate the performance of RepoCoder as described in Step 3.
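
Steps 2 and 3 can be illustrated with a small scoring sketch over prediction records. It is not the contents of `compute_score.py`, which is not shown here. Two assumptions are made: each prediction line carries the task's `metadata` (including `ground_truth`) from Step 1, and Edit Similarity is approximated with `difflib.SequenceMatcher.ratio()` from the standard library, which may differ from the edit-distance measure the script actually uses. The function names are hypothetical.

```python
import difflib
import json
from typing import Dict, Iterable


def exact_match(prediction: str, ground_truth: str) -> bool:
    """True when the two strings are identical after trimming whitespace."""
    return prediction.strip() == ground_truth.strip()


def edit_similarity(prediction: str, ground_truth: str) -> float:
    """Similarity in [0, 1]; difflib's ratio is a stand-in for an edit-distance score."""
    matcher = difflib.SequenceMatcher(None, prediction.strip(), ground_truth.strip())
    return matcher.ratio()


def score_predictions(jsonl_lines: Iterable[str]) -> Dict[str, float]:
    """Average both metrics over records in the Step 2 format."""
    em_total, es_total, count = 0, 0.0, 0
    for raw in jsonl_lines:
        if not raw.strip():
            continue  # skip blank lines
        record = json.loads(raw)
        prediction = record["choices"][0]["text"]
        # Assumption: the metadata from Step 1 is copied into the prediction record.
        ground_truth = record["metadata"]["ground_truth"]
        em_total += exact_match(prediction, ground_truth)
        es_total += edit_similarity(prediction, ground_truth)
        count += 1
    if count == 0:
        return {"em": 0.0, "es": 0.0}
    return {"em": em_total / count, "es": es_total / count}
```

The same function can be applied to the RG1 prediction file and to the RepoCoder prediction file from Step 4 to compare the two iterations.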

# Citation

If our work is useful, please consider citing our paper:

```bibtex
@article{zhang2023repocoder,
  title={RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation},
  author={Zhang, Fengji and Chen, Bei and Zhang, Yue and Liu, Jin and Zan, Daoguang and Mao, Yi and Lou, Jian-Guang and Chen, Weizhu},
  journal={arXiv preprint arXiv:2303.12570},
  year={2023}
}
```

# Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

# License

Please note that this repo is under [MIT License](LICENSE).

# Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.

RepoCoder/build_prompt.py

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import functools
import os

from utils import Tools, FilePathBuilder, CodexTokenizer, CodeGenTokenizer, CONSTANTS

class PromptBuilder:
    def __init__(self, query_lines_with_retrieval_results, task_path, log_message, tokenizer):
        self.query_lines_with_retrieval_results = query_lines_with_retrieval_results
        self.log_message = log_message
        if tokenizer == CodexTokenizer:
            self.tokenizer = CodexTokenizer()
            self.max_retrieval_length = 2000 # half of the max length of the model
        elif tokenizer == CodeGenTokenizer:
            self.tokenizer = CodeGenTokenizer()
            self.max_retrieval_length = 1000
        tasks = Tools.load_jsonl(task_path)
        self.tasks_by_task_id = {task['metadata']['task_id']: task for task in tasks}
        self.seperator = '# ' + '-' * 50
        self.max_examples = 10 # maximum number of examples to be included in the prompt

    def _make_a_block(self, retrieved_context):
        content, sim_score = retrieved_context
        metadata = content['metadata']
        # put the file path in the comment
        assert metadata[0]['fpath_tuple'][0] == metadata[0]['repo']
        f_paths = ['/'.join(x['fpath_tuple'][1:]) for x in metadata]
        f_paths_str = '\n'.join([f'# {f_path}' for f_path in f_paths])
        f_path_comment = f'# the below code fragment can be found in:'
        # put code lines in the comment
        content_lines = content['context'].splitlines()
        content_lines_comment = [f'# {line}' for line in content_lines]
        # aggregate the comment and the code lines

        block_str = '\n'.join([f_path_comment, f_paths_str, self.seperator] + content_lines_comment + [self.seperator]) + '\n'
        tokenized_block = self.tokenizer.tokenize(block_str)
        token_len = len(tokenized_block)
        return block_str, token_len

    def _make_an_extended_block(self, retrieved_context):
        content, sim_score = retrieved_context
        metadata = content['metadata']
        # put the file path in the comment
        assert metadata[0]['fpath_tuple'][0] == metadata[0]['repo']
        f_paths = ['/'.join(x['fpath_tuple'][1:]) for x in metadata]
        f_paths_str = '\n'.join([f'# {f_path}' for f_path in f_paths])
        f_path_comment = f'# the below code fragment can be found in:'
        # put code lines in the comment
        original_code = Tools.read_code(os.path.join(FilePathBuilder.repo_base_dir, *metadata[0]['fpath_tuple']))
        code_lines = original_code.splitlines()
        end_line_no = metadata[0]['end_line_no']
        window_size = metadata[0]['window_size']
        slice_size = metadata[0]['slice_size']
        new_end_line_no = min(end_line_no + window_size // slice_size, len(code_lines))
        new_start_line_no = max(0, new_end_line_no - window_size)
        content_lines = code_lines[new_start_line_no:new_end_line_no]
        content_lines_comment = [f'# {line}' for line in content_lines]
        # aggregate the comment and the code lines
        block_str = '\n'.join([f_path_comment, f_paths_str, self.seperator] + content_lines_comment + [self.seperator]) + '\n'
        tokenized_block = self.tokenizer.tokenize(block_str)
        token_len = len(tokenized_block)
        return block_str, token_len

    def _build_prompt(self, mode, prompt, top_k_context):
        prepend_context = "# Here are some relevant code fragments from other files of the repo:\n"
        prepend_context += self.seperator + '\n'
        current_token_length = 20 # the length of the head_prompt, same for codex and codegen tokenizer
        prepend_blocks = []
        chosen_context = []
        make_block_func = self._make_an_extended_block if mode == CONSTANTS.rg else self._make_a_block
        for retrieved_context in top_k_context[::-1]:
            if len(chosen_context) >= self.max_examples:
                break
            block_str, token_len = make_block_func(retrieved_context)
            if current_token_length + token_len < self.max_retrieval_length:
                prepend_blocks.insert(0, block_str)
                current_token_length += token_len
                chosen_context.append(retrieved_context)
            else:
                continue
        prepend_context += ''.join(prepend_blocks) # all the blocks already have a line break at the end
        return prepend_context + '\n' + prompt, chosen_context

    def build_2nd_stage_input_file(self, mode):
        new_prompt_lines = []
        for query_line in self.query_lines_with_retrieval_results:
            task_id = query_line['metadata']['task_id']
            task = self.tasks_by_task_id[task_id]
            old_prompt = task['prompt']
            top_k_context = query_line['top_k_context']
            new_prompt, chosen_context = self._build_prompt(mode, old_prompt, top_k_context)
            new_prompt_line = {
                'prompt': new_prompt,
                'metadata': task['metadata'],
            }
            new_prompt_line['metadata']['query_window'] = {
                'context': query_line['context'],
                'metadata': query_line['metadata'],
            }
            new_prompt_line['metadata']['top_k_context'] = [
                {
                    'context': x[0]['context'],
                    'metadata': x[0]['metadata'],
                    'sim_score': x[1],
                } for x in chosen_context
            ]
            new_prompt_line['metadata']['window_size'] = query_line['metadata']['window_size']
            new_prompt_line['metadata']['slice_size'] = chosen_context[0][0]['metadata'][0]['slice_size']
            new_prompt_lines.append(new_prompt_line)
        print('done! ' + self.log_message)
        return new_prompt_lines

class BuildPromptWrapper:
    def __init__(self, vectorizer, benchmark, repos, window_size, slice_size, tokenizer):
        if vectorizer == 'one-gram':
            self.vector_path_builder = FilePathBuilder.one_gram_vector_path
        elif vectorizer == 'ada002':
            self.vector_path_builder = FilePathBuilder.ada002_vector_path
        self.max_top_k = 20
        self.repos = repos
        self.window_size = window_size
        self.slice_size = slice_size
        if benchmark == CONSTANTS.line_benchmark:
            self.task_path = FilePathBuilder.random_line_completion_benchmark
        elif benchmark == CONSTANTS.api_benchmark:
            self.task_path = FilePathBuilder.api_completion_benchmark
        elif benchmark == CONSTANTS.short_api_benchmark:
            self.task_path = FilePathBuilder.short_api_completion_benchmark
        elif benchmark == CONSTANTS.short_line_benchmark:
            self.task_path = FilePathBuilder.short_random_line_completion_benchmark
        self.benchmark = benchmark
        self.tokenizer = tokenizer

    def _run(self, mode, query_window_path_builder, output_file_path):
        workers = []
        for repo in self.repos:
            query_window_path = query_window_path_builder(repo, self.window_size)
            query_line_path = self.vector_path_builder(query_window_path)
            repo_window_path = FilePathBuilder.repo_windows_path(repo, self.window_size, self.slice_size)
            repo_embedding_path = self.vector_path_builder(repo_window_path)
            retrieval_results = FilePathBuilder.retrieval_results_path(query_line_path, repo_embedding_path, self.max_top_k)

            query_lines_with_retrieval_results = Tools.load_pickle(retrieval_results)
            log_message = f'repo: {repo}, window: {self.window_size}, slice: {self.slice_size}'
            worker = PromptBuilder(query_lines_with_retrieval_results, self.task_path, log_message, self.tokenizer)
            workers.append(worker)
        lines = []
        for worker in workers:
            lines += worker.build_2nd_stage_input_file(mode)
        Tools.dump_jsonl(lines, output_file_path)

    def build_first_search_prompt(self, mode, output_path):
        query_line_path_temp = functools.partial(FilePathBuilder.search_first_window_path, self.benchmark, mode)
        self._run(mode, query_line_path_temp, output_path)


    def build_prediction_prompt(self, mode, prediction_path, output_path):
        query_line_path_temp = functools.partial(FilePathBuilder.gen_first_window_path, self.benchmark, mode, prediction_path)
        self._run(mode, query_line_path_temp, output_path)

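
The block selection in `PromptBuilder._build_prompt` can be read as a greedy fill of a token budget. The standalone sketch below mirrors that logic so it can be run without `utils.py`. It rests on a few assumptions: `count_tokens` is a hypothetical whitespace-based stand-in for the Codex or CodeGen tokenizer, `select_blocks` and `make_block` are illustrative names, and the input list is assumed to be ordered from least to most similar, which is not confirmed by this file.

```python
from typing import List, Tuple

SEPARATOR = "# " + "-" * 50


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated tokens."""
    return len(text.split())


def make_block(path: str, code: str) -> str:
    """Render a retrieved snippet as a commented block, as the prompt builder does."""
    header = ["# the below code fragment can be found in:", f"# {path}", SEPARATOR]
    commented = [f"# {line}" for line in code.splitlines()]
    return "\n".join(header + commented + [SEPARATOR]) + "\n"


def select_blocks(
    ranked: List[Tuple[str, str]],  # (path, code); assumed least similar first
    budget: int,
    max_examples: int = 10,
    head_tokens: int = 20,
) -> List[str]:
    """Greedily keep blocks that fit the budget, visiting the last entries first."""
    used = head_tokens
    blocks: List[str] = []
    for path, code in reversed(ranked):
        if len(blocks) >= max_examples:
            break
        block = make_block(path, code)
        cost = count_tokens(block)
        if used + cost < budget:
            # Inserting at the front keeps earlier-visited blocks at the end,
            # that is, closest to the unfinished code in the final prompt.
            blocks.insert(0, block)
            used += cost
        # Blocks that do not fit are skipped; smaller ones may still be added.
    return blocks
```

One consequence of this design is that an oversized block does not end the loop: later, shorter blocks can still be included as long as the budget and `max_examples` allow it.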
