microsoft
diff --git a/README.md b/README.md
Lines changed: 1 addition & 0 deletions
diff --git a/RepoCoder/LICENSE b/RepoCoder/LICENSE
Lines changed: 21 additions & 0 deletions
diff --git a/RepoCoder/README.md b/RepoCoder/README.md
Lines changed: 136 additions & 0 deletions
diff --git a/RepoCoder/build_prompt.py b/RepoCoder/build_prompt.py
Lines changed: 162 additions & 0 deletions
@@ -7,3 +7,4 @@ These projects are presented by Microsoft Research Asia and Microsoft Azure AI.
 
 - [[CodeT]](./CodeT/): Code Generation with Generated Tests
 - [[DIVERSE]](./DIVERSE/): On the Advance of Making Language Models Better Reasoners
+- [[RepoCoder]](./RepoCoder/): Repository-Level Code Completion Through Iterative Retrieval and Generation
@@ -0,0 +1,21 @@
+    MIT License
+
+    Copyright (c) Microsoft Corporation.
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+
+    The above copyright notice and this permission notice shall be included in all
+    copies or substantial portions of the Software.
+
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+    SOFTWARE
@@ -0,0 +1,136 @@
+# RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation
+
+# Overview
+
+In the paper, we present **RepoCoder**, a simple, generic, and effective framework for repository-level code completion: the task of continuing unfinished code using the broader context of the repository. RepoCoder combines a similarity-based retriever, a pre-trained code language model, and a novel iterative retrieval-generation paradigm. It streamlines the overall process and eliminates the need for the heuristic rules, static code analysis, data labeling, and model re-training required by previous studies.
+
+![framework](./figs/framework.png)
+<center>
+Figure 1. The illustration of our RepoCoder framework.
+</center>
+
+We also present a new benchmark, **RepoEval**, for the repository-level code completion task. It consists of recent, high-quality real-world repositories and covers line, API invocation, and function body completion scenarios. We test the performance of RepoCoder and show that it significantly improves the zero-shot code completion baseline by over 10% and consistently outperforms the vanilla retrieval-augmented code completion approach.
+
+## Project
+
+This project contains the basic components of RepoCoder. Here is an overview:
+
+```shell
+|-- make_window.py # slice the repository files and the model predictions into code snippets
+|-- build_vector.py # build the vector representation for the code snippets
+|-- search_code.py # search relevant code snippets with the vector representation
+|-- build_prompt.py # build the prompt with the unfinished code and the retrieved code snippets
+|-- run_pipeline.py # run the code completion pipeline
+|-- compute_score.py # evaluate the performance of the code completion
+|-- utils.py # utility functions
+|-- datasets/datasets.zip # the input data for the code completion task
+    |-- function_level_completion_4k_context_codex.test.jsonl
+    |-- function_level_completion_2k_context_codex.test.jsonl
+    |-- line_level_completion_4k_context_codex.test.jsonl
+    |-- line_level_completion_2k_context_codex.test.jsonl
+    |-- line_level_completion_2k_context_codegen.test.jsonl
+    |-- line_level_completion_1k_context_codegen.test.jsonl
+    |-- api_level_completion_4k_context_codex.test.jsonl
+    |-- api_level_completion_2k_context_codex.test.jsonl
+    |-- api_level_completion_2k_context_codegen.test.jsonl
+    |-- api_level_completion_1k_context_codegen.test.jsonl
+|-- repositories # the checkpoints of repositories used to build our benchmark
+    |-- function_level.zip 
+      |-- CarperAI_trlx
+      |-- lucidrains_imagen-pytorch
+      |-- deepmind_tracr
+      |-- leopard-ai_betty
+      |-- google_lightweight_mmm
+      |-- amazon-science_patchcore-inspection
+      |-- facebookresearch_omnivore
+      |-- maxhumber_redframes
+    |-- line_and_api_level.zip
+      |-- pytorch_rl
+      |-- opendilab_ACE
+      |-- google_vizier
+      |-- awslabs_fortuna
+      |-- huggingface_evaluate
+      |-- huggingface_diffusers
+      |-- nerfstudio-project_nerfstudio
+      |-- alibaba_FederatedScope
+```
+
+We use a private library to handle the execution and evaluation of function-level completion. Due to licensing restrictions, we cannot release that code. However, we provide the data for the function-level completion task in `datasets/datasets.zip` and `repositories/function_level.zip`.
+
+# Quickstart
+
+## Prepare Environment
+First, set up a Python environment. This code base has been tested under Python 3.8.
+
+```bash
+$ conda create -n repocoder python=3.8
+$ conda activate repocoder
+$ pip install -r requirements.txt
+```
+
+## Run the Code Completion
+1. The `run_RG1_and_oracle_method` function in `run_pipeline.py` shows the process of building the prompts for vanilla retrieval-augmented generation (RG1) and the Oracle method. The generated prompts are written to a `.jsonl` file, where each line has the following format:
+```json
+{
+  "prompt": "...the retrieved code snippets and unfinished code...",
+  "metadata": {
+    "task_id": "repo_name/idx",
+    "ground_truth": "ground truth completion",
+    "fpath_tuple": ["path", "to", "target file"],
+    "context_start_lineno": 0,
+    "line_no": 10,
+  }
+}
+```
+
+2. Then we can call the model to generate completions and organize the results in the following format:
+```json
+{
+  "prompt": "...the retrieved code snippets and unfinished code...",
+  "choices": [{"text": "...generated completion without repeating the input prompt..."}],
+  "metadata": {}
+}
+```
+
+3. Next, we can evaluate the performance of the code completion with the `compute_score.py` script. The script will compute the Exact Match and Edit Similarity scores.
+
+4. After that, we can use the `run_RepoCoder_method` function in `run_pipeline.py` to run a second iteration of retrieval-generation, which is our proposed RepoCoder approach, using the prediction file of RG1. Finally, we can evaluate the performance of RepoCoder as described in Step 3.
+
+# Citation
+
+If our work is useful, please consider citing our paper:
+
+```bibtex
+@article{zhang2023repocoder,
+  title={RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation},
+  author={Zhang, Fengji and Chen, Bei and Zhang, Yue and Liu, Jin and Zan, Daoguang and Mao, Yi and Lou, Jian-Guang and Chen, Weizhu},
+  journal={arXiv preprint arXiv:2303.12570},
+  year={2023}
+}
+```
+
+# Contributing
+
+This project welcomes contributions and suggestions.  Most contributions require you to agree to a
+Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
+the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
+
+When you submit a pull request, a CLA bot will automatically determine whether you need to provide
+a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
+provided by the bot. You will only need to do this once across all repos using our CLA.
+
+This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
+For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
+contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
+
+# License
+
+Please note that this repo is released under the [MIT License](LICENSE).
+
+# Trademarks
+
+This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft 
+trademarks or logos is subject to and must follow 
+[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
+Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
+Any use of third-party trademarks or logos are subject to those third-party's policies.
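
The two record formats in steps 1 and 2 of the README quickstart can be sketched as follows. This is a minimal illustration, not code from the repository: `make_result_line`, `dumps_jsonl`, and `loads_jsonl` are hypothetical helper names, and it assumes the `metadata` of the prompt line is carried over unchanged into the result line (the README shows it only as `{}`).

```python
import json


def make_result_line(prompt_line, completion):
    """Wrap one model completion in the step-2 result format (hypothetical helper)."""
    return {
        "prompt": prompt_line["prompt"],
        # the completion text must not repeat the input prompt
        "choices": [{"text": completion}],
        # assumption: metadata is copied over from the prompt line as-is
        "metadata": prompt_line["metadata"],
    }


def dumps_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


def loads_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# A prompt line shaped like the step-1 example (values are made up).
prompt_line = {
    "prompt": "# retrieved snippets...\ndef add(a, b):\n    return ",
    "metadata": {
        "task_id": "repo_name/0",
        "ground_truth": "a + b",
        "fpath_tuple": ["path", "to", "target file"],
        "context_start_lineno": 0,
        "line_no": 2,
    },
}

result_line = make_result_line(prompt_line, "a + b")
round_tripped = loads_jsonl(dumps_jsonl([result_line]))
```

Because `json.dumps` escapes newlines inside strings, a multi-line prompt still occupies a single line of the `.jsonl` file.
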
@@ -0,0 +1,162 @@
+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT license.
+
+import functools
+import os
+
+from utils import Tools, FilePathBuilder, CodexTokenizer, CodeGenTokenizer, CONSTANTS
+
+class PromptBuilder:
+    def __init__(self, query_lines_with_retrieval_results, task_path, log_message, tokenizer):
+        self.query_lines_with_retrieval_results = query_lines_with_retrieval_results
+        self.log_message = log_message
+        if tokenizer == CodexTokenizer:
+            self.tokenizer = CodexTokenizer()
+            self.max_retrieval_length = 2000  # half of the max length of the model
+        elif tokenizer == CodeGenTokenizer:
+            self.tokenizer = CodeGenTokenizer()
+            self.max_retrieval_length = 1000
+        tasks = Tools.load_jsonl(task_path)
+        self.tasks_by_task_id = {task['metadata']['task_id']: task for task in tasks}
+        self.seperator = '# ' + '-' * 50
+        self.max_examples = 10  # maximum number of examples to be included in the prompt
+
+    def _make_a_block(self, retrieved_context):
+        content, sim_score = retrieved_context
+        metadata = content['metadata']
+        # put the file path in the comment
+        assert metadata[0]['fpath_tuple'][0] == metadata[0]['repo']
+        f_paths = ['/'.join(x['fpath_tuple'][1:]) for x in metadata]
+        f_paths_str = '\n'.join([f'# {f_path}' for f_path in f_paths])
+        f_path_comment = '# the below code fragment can be found in:'
+        # put code lines in the comment
+        content_lines = content['context'].splitlines()
+        content_lines_comment = [f'# {line}' for line in content_lines]
+        # aggregate the comment and the code lines
+        
+        block_str = '\n'.join([f_path_comment, f_paths_str, self.seperator] + content_lines_comment + [self.seperator]) + '\n'
+        tokenized_block = self.tokenizer.tokenize(block_str)
+        token_len = len(tokenized_block)
+        return block_str, token_len
+
+    def _make_an_extended_block(self, retrieved_context):
+        content, sim_score = retrieved_context
+        metadata = content['metadata']
+        # put the file path in the comment
+        assert metadata[0]['fpath_tuple'][0] == metadata[0]['repo']
+        f_paths = ['/'.join(x['fpath_tuple'][1:]) for x in metadata]
+        f_paths_str = '\n'.join([f'# {f_path}' for f_path in f_paths])
+        f_path_comment = '# the below code fragment can be found in:'
+        # put code lines in the comment
+        original_code = Tools.read_code(os.path.join(FilePathBuilder.repo_base_dir, *metadata[0]['fpath_tuple']))
+        code_lines = original_code.splitlines()
+        end_line_no = metadata[0]['end_line_no']
+        window_size = metadata[0]['window_size']
+        slice_size = metadata[0]['slice_size']
+        new_end_line_no = min(end_line_no + window_size // slice_size, len(code_lines))
+        new_start_line_no = max(0, new_end_line_no - window_size)
+        content_lines = code_lines[new_start_line_no:new_end_line_no]
+        content_lines_comment = [f'# {line}' for line in content_lines]
+        # aggregate the comment and the code lines
+        block_str = '\n'.join([f_path_comment, f_paths_str, self.seperator] + content_lines_comment + [self.seperator]) + '\n'
+        tokenized_block = self.tokenizer.tokenize(block_str)
+        token_len = len(tokenized_block)
+        return block_str, token_len
+
+    def _build_prompt(self, mode, prompt, top_k_context):
+        prepend_context = "# Here are some relevant code fragments from other files of the repo:\n"
+        prepend_context += self.seperator + '\n'
+        current_token_length = 20  # the length of the head_prompt, same for codex and codegen tokenizer
+        prepend_blocks = []
+        chosen_context = []
+        make_block_func = self._make_an_extended_block if mode == CONSTANTS.rg else self._make_a_block
+        for retrieved_context in top_k_context[::-1]:
+            if len(chosen_context) >= self.max_examples:
+                break
+            block_str, token_len = make_block_func(retrieved_context)
+            if current_token_length + token_len < self.max_retrieval_length:
+                prepend_blocks.insert(0, block_str) 
+                current_token_length += token_len
+                chosen_context.append(retrieved_context)
+            else:
+                continue
+        prepend_context += ''.join(prepend_blocks)  # all the blocks already have a line break at the end
+        return prepend_context + '\n' + prompt, chosen_context
+
+    def build_2nd_stage_input_file(self, mode):
+        new_prompt_lines = []
+        for query_line in self.query_lines_with_retrieval_results:
+            task_id = query_line['metadata']['task_id']
+            task = self.tasks_by_task_id[task_id]
+            old_prompt = task['prompt']
+            top_k_context = query_line['top_k_context']
+            new_prompt, chosen_context = self._build_prompt(mode, old_prompt, top_k_context)
+            new_prompt_line = {
+                'prompt': new_prompt,
+                'metadata': task['metadata'],
+            }
+            new_prompt_line['metadata']['query_window'] = {
+                'context': query_line['context'],
+                'metadata': query_line['metadata'],
+            }
+            new_prompt_line['metadata']['top_k_context'] = [
+                {
+                    'context': x[0]['context'],
+                    'metadata': x[0]['metadata'],
+                    'sim_score': x[1],
+                } for x in chosen_context
+            ]
+            new_prompt_line['metadata']['window_size'] = query_line['metadata']['window_size']
+            new_prompt_line['metadata']['slice_size'] = chosen_context[0][0]['metadata'][0]['slice_size'] if chosen_context else None
+            new_prompt_lines.append(new_prompt_line)
+        print('done! ' + self.log_message)
+        return new_prompt_lines
+
+class BuildPromptWrapper:
+    def __init__(self, vectorizer, benchmark, repos, window_size, slice_size, tokenizer):
+        if vectorizer == 'one-gram':
+            self.vector_path_builder = FilePathBuilder.one_gram_vector_path
+        elif vectorizer == 'ada002':
+            self.vector_path_builder = FilePathBuilder.ada002_vector_path
+        self.max_top_k = 20
+        self.repos = repos
+        self.window_size = window_size
+        self.slice_size = slice_size
+        if benchmark == CONSTANTS.line_benchmark:
+            self.task_path = FilePathBuilder.random_line_completion_benchmark
+        elif benchmark == CONSTANTS.api_benchmark:
+            self.task_path = FilePathBuilder.api_completion_benchmark
+        elif benchmark == CONSTANTS.short_api_benchmark:
+            self.task_path = FilePathBuilder.short_api_completion_benchmark
+        elif benchmark == CONSTANTS.short_line_benchmark:
+            self.task_path = FilePathBuilder.short_random_line_completion_benchmark
+        self.benchmark = benchmark
+        self.tokenizer = tokenizer
+    
+    def _run(self, mode, query_window_path_builder, output_file_path):
+        workers = []
+        for repo in self.repos:
+            query_window_path = query_window_path_builder(repo, self.window_size)
+            query_line_path = self.vector_path_builder(query_window_path)
+            repo_window_path = FilePathBuilder.repo_windows_path(repo, self.window_size, self.slice_size)
+            repo_embedding_path = self.vector_path_builder(repo_window_path)
+            retrieval_results = FilePathBuilder.retrieval_results_path(query_line_path, repo_embedding_path, self.max_top_k)
+            
+            query_lines_with_retrieval_results = Tools.load_pickle(retrieval_results)
+            log_message = f'repo: {repo}, window: {self.window_size}, slice: {self.slice_size}'
+            worker = PromptBuilder(query_lines_with_retrieval_results, self.task_path, log_message, self.tokenizer)
+            workers.append(worker)
+        lines = []
+        for worker in workers:
+            lines += worker.build_2nd_stage_input_file(mode)
+        Tools.dump_jsonl(lines, output_file_path)
+
+    def build_first_search_prompt(self, mode, output_path):
+        query_line_path_temp = functools.partial(FilePathBuilder.search_first_window_path, self.benchmark, mode)
+        self._run(mode, query_line_path_temp, output_path)
+
+    
+    def build_prediction_prompt(self, mode, prediction_path, output_path):
+        query_line_path_temp = functools.partial(FilePathBuilder.gen_first_window_path, self.benchmark, mode, prediction_path)
+        self._run(mode, query_line_path_temp, output_path)
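
The selection loop in `PromptBuilder._build_prompt` is a greedy packing under a token budget. It can be sketched in isolation as below. This is a simplified illustration, not the code in the diff: `pack_blocks` and `whitespace_count` are hypothetical names, and the whitespace token counter stands in for `CodexTokenizer` / `CodeGenTokenizer`. The loop walks the candidates from the end of the list and prepends each accepted one, which suggests (an assumption, since the retrieval code is not shown in this diff) that the list is sorted by ascending similarity, so the best matches land closest to the unfinished code.

```python
def pack_blocks(blocks, count_tokens, budget, max_blocks=10, used=0):
    """Greedily keep blocks, scanning from the end, while they fit the budget.

    Blocks that do not fit are skipped rather than ending the scan, so a
    smaller block earlier in the list can still be included.
    """
    chosen = []
    for block in reversed(blocks):
        if len(chosen) >= max_blocks:
            break
        cost = count_tokens(block)
        if used + cost < budget:
            chosen.insert(0, block)  # prepend to preserve the original order
            used += cost
    return chosen, used


def whitespace_count(text):
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


blocks = ["a b c", "d e f g h i", "j k"]
packed, used = pack_blocks(blocks, whitespace_count, budget=10)
# packed == ["d e f g h i", "j k"], used == 8

# An oversized block in the middle is skipped, not treated as a stop signal.
sparse, sparse_used = pack_blocks(["a b", "c d e f g h i j k", "l"], whitespace_count, budget=5)
# sparse == ["a b", "l"], sparse_used == 3
```

In the diff, the counter starts at 20 rather than 0 to account for the header line of the prompt, and the budget is 2000 tokens for Codex or 1000 for CodeGen.
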
+