- 📜 Contains 154 code snippets to test and benchmark.
- 🏷 Offers 845 type annotations across a diverse set of Python functionalities.
- 📂 Organized into 18 distinct categories targeting various Python features.
- 🚢 Seamlessly manages the execution of containerized tools.
- 🔄 Efficiently transforms inferred types into a standardized format.
- 📊 Automatically produces meaningful metrics for in-depth assessment and comparison.
- 🤖 Autogenerates code snippets and ground truth to scale the benchmark, based on the original TypeEvalPy benchmark.
- 📈 The autogen benchmark now contains:
- Python files: 7121
- Type annotations: 78373
 
| Supported ✅ | In-progress 🔧 | Planned 💡 | 
|---|---|---|
| HeaderGen | IntelliJ PSI | MonkeyType | 
| Jedi | Pyre | Pyannotate | 
| Pyright | PySonar2 | |
| HiTyper | Pytype | |
| Scalpel | TypeT5 | |
| Type4Py | | |
| GPT | | |
| Ollama | | |
| RightTyper | | |
Below is a comparison showcasing exact matches across different tools and LLMs on the Autogen benchmark.
| Rank | 🛠️ Tool | Function Return Type | Function Parameter Type | Local Variable Type | Total | 
|---|---|---|---|---|---|
| 1 | mistral-large-it-2407-123b | 16701 | 728 | 57550 | 74979 | 
| 2 | qwen2-it-72b | 16488 | 629 | 55160 | 72277 | 
| 3 | llama3.1-it-70b | 16648 | 580 | 54445 | 71673 | 
| 4 | gemma2-it-27b | 16342 | 599 | 49772 | 66713 | 
| 5 | codestral-v0.1-22b | 16456 | 706 | 49379 | 66541 | 
| 6 | codellama-it-34b | 15960 | 473 | 48957 | 65390 | 
| 7 | mistral-nemo-it-2407-12.2b | 16221 | 526 | 48439 | 65186 | 
| 8 | mistral-v0.3-it-7b | 16686 | 472 | 47935 | 65093 | 
| 9 | phi3-medium-it-14b | 16802 | 467 | 45121 | 62390 | 
| 10 | llama3.1-it-8b | 16125 | 492 | 44313 | 60930 | 
| 11 | codellama-it-13b | 16214 | 479 | 43021 | 59714 | 
| 12 | phi3-small-it-7.3b | 16155 | 422 | 38093 | 54670 | 
| 13 | qwen2-it-7b | 15684 | 313 | 38109 | 54106 | 
| 14 | HeaderGen | 14086 | 346 | 36370 | 50802 | 
| 15 | phi3-mini-it-3.8b | 15908 | 320 | 30341 | 46569 | 
| 16 | phi3.5-mini-it-3.8b | 15763 | 362 | 28694 | 44819 | 
| 17 | codellama-it-7b | 13779 | 318 | 29346 | 43443 | 
| 18 | Jedi | 13160 | 0 | 15403 | 28563 | 
| 19 | Scalpel | 15383 | 171 | 18 | 15572 | 
| 20 | gemma2-it-9b | 1611 | 66 | 5464 | 7141 | 
| 21 | Type4Py | 3143 | 38 | 2243 | 5424 | 
| 22 | tinyllama-1.1b | 1514 | 28 | 2699 | 4241 | 
| 23 | mixtral-v0.1-it-8x7b | 3235 | 33 | 377 | 3645 | 
| 24 | phi3.5-moe-it-41.9b | 3090 | 25 | 273 | 3388 | 
| 25 | gemma2-it-2b | 1497 | 41 | 1848 | 3386 | 
(Auto-generated based on the analysis run on 30 Aug 2024)
To get started, clone the repository and build the Docker image:

```bash
git clone https://github.com/secure-software-engineering/TypeEvalPy.git
docker build -t typeevalpy .
```

🕒 Takes about 30 minutes on the first run to build the Docker containers.
📂 Results will be generated in the results folder within the root directory of the repository.
Each results folder will have a timestamp, allowing you to easily track and compare different runs.
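For orientation, here is a hypothetical listing of the output layout; the folder names and exact contents are placeholders and may differ from run to run:

```bash
# Each run gets its own timestamped folder under results/ (names shown are made up)
ls results/
# e.g. results_30-08-2024_12-00/   results_01-09-2024_09-15/

# The auto-generated CSV tables described below are written per run
ls results/*/paper_table_*.csv
```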
Correlation of CSV Files Generated to Tables in ICSE Paper
Here is how the auto-generated CSV tables relate to the paper's tables:

- Table 1 in the paper is derived from three auto-generated CSV tables:
  - `paper_table_1.csv` details Exact matches by type category.
  - `paper_table_2.csv` lists Exact matches for the 18 micro-benchmark categories.
  - `paper_table_3.csv` provides Sound and Complete values for tools.
- Table 2 in the paper is based on the following CSV table:
  - `paper_table_5.csv` shows Exact matches with `top_n` values for machine learning tools.

Additionally, there are CSV tables that are not included in the paper:

- `paper_table_4.csv` contains Sound and Complete values for the 18 micro-benchmark categories.
- `paper_table_6.csv` features a sensitivity analysis.
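To skim any of these CSVs from a shell, a generic one-liner like the following works (the run-folder path is a placeholder):

```bash
column -s, -t < results/<run_folder>/paper_table_1.csv | less -S
```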
Run the analysis on all supported tools:

```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy
```

🔧 Optionally, run the analysis on specific tools:
```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen scalpel
```

📊 Run the analysis on custom benchmarks:
For example, to run HeaderGen on the autogen benchmark:
```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy \
      --runners headergen \
      --custom_benchmark_dir /app/autogen_typeevalpy_benchmark
```

🛠️ Available options: `headergen`, `pyright`, `scalpel`, `jedi`, `hityper`, `type4py`, `hityperdl`
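As a usage sketch, multiple runners can be passed to `--runners` in one invocation; the particular combination below is illustrative, not prescribed:

```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners headergen pyright jedi
```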
TypeEvalPy integrates with LLMs through Ollama, streamlining their management. Begin by setting up your environment:
- Create a configuration file: copy `config_template.yaml` from the `src` directory and rename it to `config.yaml`.

In `config.yaml`, configure the following (a sketch of a filled-in file is shown after this list):
- `openai_key`: your key for accessing OpenAI's models.
- `ollama_url`: the URL of your Ollama instance. For simplicity, we recommend deploying Ollama using their Docker container. Get started with Ollama here.
- `prompt_id`: set this to `questions_based_2` for optimal performance, based on our tests.
- `ollama_models`: a list of model tags from the Ollama library. Ensure each model is pre-downloaded with the `ollama pull` command.
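Below is a minimal sketch of this setup, assuming Ollama's official Docker image on its default port; the model tags, key values, and the exact `config.yaml` layout shown here are illustrative assumptions, not shipped defaults:

```bash
# Sketch only: all values are placeholders, adjust to your environment.

# Start Ollama via its official Docker image (default port 11434) and pull a model tag
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec ollama ollama pull llama3.1:8b

# Write src/config.yaml with the keys described above (layout assumed from the docs)
cat > src/config.yaml <<'EOF'
openai_key: "sk-..."                  # your OpenAI API key (placeholder)
ollama_url: "http://localhost:11434"  # URL of the Ollama instance started above
prompt_id: "questions_based_2"        # recommended prompt id
ollama_models:                        # model tags from the Ollama library
  - llama3.1:8b
EOF
```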
With the config.yaml configured, run the following command:
```bash
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      typeevalpy --runners ollama
```

Running From Source
- Clone the repo:

  ```bash
  git clone https://github.com/secure-software-engineering/TypeEvalPy.git
  ```

- Install dependencies and set up a virtual environment:

  ```bash
  python3 -m venv .env
  source .env/bin/activate
  pip install -r requirements.txt
  ```

- Navigate to the `src` directory:

  ```bash
  cd src
  ```

- Execute the analyzer. Run the following command to start the benchmarking process on all tools:

  ```bash
  python main_runner.py
  ```

  or run the analysis on specific tools:

  ```bash
  python main_runner.py --runners headergen scalpel
  ```
To generate an extended version of the original TypeEvalPy benchmark that includes many more Python types, run the following commands:
- Navigate to the `autogen` directory:

  ```bash
  cd autogen
  ```

- Execute the generation script. Run the following command to start the generation process:

  ```bash
  python generate_typeevalpy_dataset.py
  ```

This will generate a folder in the repo root containing the autogen benchmark, named with the current date.
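As a hypothetical follow-up, the generated folder can be fed back to the Docker runner via `--custom_benchmark_dir`; the folder name below is a placeholder (the real name includes the generation date) and the mount path inside the container is an assumption:

```bash
# Placeholder folder name: replace <date> with the generated folder's actual suffix
docker run \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v ./results:/app/results \
      -v ./autogen_typeevalpy_benchmark_<date>:/app/my_autogen_benchmark \
      typeevalpy \
      --runners headergen \
      --custom_benchmark_dir /app/my_autogen_benchmark
```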
To run LLM-based type inference on the ManyTypes4Py dataset, please follow the guide here: src/target_tools/real-world-llms/README.md
Thank you for your interest in contributing! To add support for a new tool, please use the Docker templates provided in our repository. After implementing and testing your tool, please submit a pull request (PR) with a descriptive message. Our maintainers will review your submission and merge it.
To get started with integrating your tool, please follow the guide here: docs/Tool_Integration_Guide.md
Give a ⭐️ if this project helped you!
