From eaa1047d67bfc04032864fbc2049b99b3fa03b30 Mon Sep 17 00:00:00 2001
From: Chesars
Date: Thu, 30 Oct 2025 10:37:35 -0300
Subject: [PATCH 1/3] =?UTF-8?q?Fix=20typo=20in=20README.md:=20'ro=20run'?=
 =?UTF-8?q?=20=E2=86=92=20'to=20run'?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Corrected typo in line 62 of the inspect_ai installation instructions.
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4765b42..6770498 100644
--- a/README.md
+++ b/README.md
@@ -59,7 +59,7 @@ SciCode sources challenging and realistic research-level coding problems across
 
 ## Instructions to evaluate a new model using `inspect_ai` (recommended)
 
-Scicode has been integrated with `inspect_ai` for easier and faster model evaluation. You need to run the following steps ro run:
+Scicode has been integrated with `inspect_ai` for easier and faster model evaluation. You need to run the following steps to run:
 
 1. Clone this repository `git clone git@github.com:scicode-bench/SciCode.git`
 2. Install the `scicode` package with `pip install -e .`

From 89ae0a6604558cc1c0ebc3d218bf3ad67ae25ce7 Mon Sep 17 00:00:00 2001
From: Chesars
Date: Thu, 30 Oct 2025 10:42:29 -0300
Subject: [PATCH 2/3] =?UTF-8?q?Fix=20typo=20in=20eval/inspect=5Fai/README.?=
 =?UTF-8?q?md:=20'correpsonding'=20=E2=86=92=20'corresponding'?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Corrected typo in line 5 of the API keys setup instructions.
---
 eval/inspect_ai/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/eval/inspect_ai/README.md b/eval/inspect_ai/README.md
index 388cb09..3ed4742 100644
--- a/eval/inspect_ai/README.md
+++ b/eval/inspect_ai/README.md
@@ -2,7 +2,7 @@
 
 ### 1. Set Up Your API Keys
 
-Users can follow [`inspect_ai`'s official documentation](https://inspect.ai-safety-institute.org.uk/#getting-started) to setup correpsonding API keys depending on the types of models they would like to evaluate.
+Users can follow [`inspect_ai`'s official documentation](https://inspect.ai-safety-institute.org.uk/#getting-started) to setup corresponding API keys depending on the types of models they would like to evaluate.
 
 ### 2. Setup Command Line Arguments if Needed
 

From ad69a25561d3464cae3489a76d02a1076d40d216 Mon Sep 17 00:00:00 2001
From: Chesars
Date: Tue, 4 Nov 2025 11:05:49 -0300
Subject: [PATCH 3/3] docs: fix typos in README files (set up, evaluated, Together AI, spacing)

---
 README.md                 | 6 +++---
 eval/inspect_ai/README.md | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 6770498..c056c83 100644
--- a/README.md
+++ b/README.md
@@ -29,7 +29,7 @@ SciCode is a challenging benchmark designed to evaluate the capabilities of lang
 
 ## Dataset Creation
 
-SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. Scicode mainly focuses on 1. Numerical methods 2.Simulation of systems 3. Scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test LM’s science capability.
+SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. SciCode mainly focuses on 1. Numerical methods 2. Simulation of systems 3. Scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test LM’s science capability.
 
 ## 🏆 Leaderboard
 
@@ -59,12 +59,12 @@ SciCode sources challenging and realistic research-level coding problems across
 
 ## Instructions to evaluate a new model using `inspect_ai` (recommended)
 
-Scicode has been integrated with `inspect_ai` for easier and faster model evaluation. You need to run the following steps to run:
+SciCode has been integrated with `inspect_ai` for easier and faster model evaluation. You need to run the following steps to run:
 
 1. Clone this repository `git clone git@github.com:scicode-bench/SciCode.git`
 2. Install the `scicode` package with `pip install -e .`
 3. Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save them as `./eval/data/test_data.h5`
-4. Go to the `eval/inspect_ai` directory, setup correspoinding API key, and run the following command:
+4. Go to the `eval/inspect_ai` directory, set up the corresponding API key, and run the following command:
 
 ```bash
 cd eval/inspect_ai
diff --git a/eval/inspect_ai/README.md b/eval/inspect_ai/README.md
index 3ed4742..47731b1 100644
--- a/eval/inspect_ai/README.md
+++ b/eval/inspect_ai/README.md
@@ -25,7 +25,7 @@ However, there are some additional command line arguments that could be useful a
 - `gold` mode can only be used on the validation set which loads the gold answer
 - `dummy` mode does not call any real LLMs and generates some dummy outputs
 
-For example, user can run five samples on the validation set with background as
+For example, users can run five samples on the validation set with background as
 
 ```bash
 inspect eval scicode.py \
@@ -38,7 +38,7 @@ inspect eval scicode.py \
     -T mode=normal
 ```
 
-User can run the evaluation on `Deepseek-v3` using together ai via the following command:
+Users can run the evaluation on `Deepseek-v3` using Together AI via the following command:
 
 ```bash
 export TOGETHER_API_KEY=
@@ -55,7 +55,7 @@ For more information regarding `inspect_ai`, we refer users to its [official doc
 
 ### Extra: How SciCode are Evaluated Under the Hood?
 
-During the evaluation, the sub-steps of each main problem of SciCode are passed in order to the evalauted LLM with necessary prompts and LLM responses for previous sub-steps. The generated Python code from LLM will be parsed and saved to disk, which will be used to run on test cases to determine the pass or fail for the sub-steps. The main problem will be considered as solved if the LLM can pass all sub-steps of the main problem.
+During the evaluation, the sub-steps of each main problem of SciCode are passed in order to the evaluated LLM with necessary prompts and LLM responses for previous sub-steps. The generated Python code from LLM will be parsed and saved to disk, which will be used to run on test cases to determine the pass or fail for the sub-steps. The main problem will be considered as solved if the LLM can pass all sub-steps of the main problem.
 
 ### Extra: Reproducibility of `inspect_ai` Integration
 We use the SciCode `inspect_ai` integration to evaluate OpenAI's GPT-4o, and we compare it with the original way of evaluation. Below shows the comparison of two ways of the evaluations.