README.md (6 changes: 3 additions & 3 deletions)
@@ -29,7 +29,7 @@ SciCode is a challenging benchmark designed to evaluate the capabilities of lang


## Dataset Creation
-SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. Scicode mainly focuses on 1. Numerical methods 2.Simulation of systems 3. Scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test LM’s science capability.
+SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. SciCode mainly focuses on 1. Numerical methods 2. Simulation of systems 3. Scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test LM’s science capability.

## 🏆 Leaderboard

@@ -59,12 +59,12 @@ SciCode sources challenging and realistic research-level coding problems across
## Instructions to evaluate a new model using `inspect_ai` (recommended)


-Scicode has been integrated with `inspect_ai` for easier and faster model evaluation. You need to run the following steps ro run:
+SciCode has been integrated with `inspect_ai` for easier and faster model evaluation. You need to run the following steps to run:

1. Clone this repository `git clone git@github.com:scicode-bench/SciCode.git`
2. Install the `scicode` package with `pip install -e .`
3. Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save them as `./eval/data/test_data.h5`
-4. Go to the `eval/inspect_ai` directory, setup correspoinding API key, and run the following command:
+4. Go to the `eval/inspect_ai` directory, set up the corresponding API key, and run the following command:

```bash
cd eval/inspect_ai
```
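
For orientation, a complete end-to-end run might look like the sketch below. This is not a command taken from the repository: the model name and the `mode` task argument are only illustrative, and the full set of task arguments is documented in `eval/inspect_ai/README.md`.

```bash
# Illustrative sketch only: assumes an OpenAI API key and uses the `mode`
# task argument described in eval/inspect_ai/README.md.
cd eval/inspect_ai
export OPENAI_API_KEY=<YOUR_API_KEY>
inspect eval scicode.py \
    --model openai/gpt-4o \
    -T mode=normal
```
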
eval/inspect_ai/README.md (8 changes: 4 additions & 4 deletions)
@@ -2,7 +2,7 @@

### 1. Set Up Your API Keys

-Users can follow [`inspect_ai`'s official documentation](https://inspect.ai-safety-institute.org.uk/#getting-started) to setup correpsonding API keys depending on the types of models they would like to evaluate.
+Users can follow [`inspect_ai`'s official documentation](https://inspect.ai-safety-institute.org.uk/#getting-started) to setup corresponding API keys depending on the types of models they would like to evaluate.
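
As a minimal illustration (the exact environment variable depends on the provider being evaluated), an OpenAI key can be exported before running the evaluation:

```bash
# Placeholder value; other providers use their own variables,
# for example ANTHROPIC_API_KEY or TOGETHER_API_KEY.
export OPENAI_API_KEY=<YOUR_API_KEY>
```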

### 2. Setup Command Line Arguments if Needed

@@ -25,7 +25,7 @@ However, there are some additional command line arguments that could be useful a
- `gold` mode can only be used on the validation set which loads the gold answer
- `dummy` mode does not call any real LLMs and generates some dummy outputs

-For example, user can run five samples on the validation set with background as
+For example, users can run five samples on the validation set with background as

```bash
inspect eval scicode.py \
@@ -38,7 +38,7 @@
-T mode=normal
```
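
To illustrate the `mode` argument, a gold-answer run on the validation set would only swap the mode value in the command above; the model name is again just an example, and any additional task arguments needed to select the validation split are omitted from this sketch.

```bash
# Sketch of a gold-mode run; `gold` only works on the validation set,
# and the model identifier is illustrative.
inspect eval scicode.py \
    --model openai/gpt-4o \
    -T mode=gold
```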

-User can run the evaluation on `Deepseek-v3` using together ai via the following command:
+Users can run the evaluation on `Deepseek-v3` using Together AI via the following command:

```bash
export TOGETHER_API_KEY=<YOUR_API_KEY>
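# The rest of this command is elided in the diff; a hypothetical invocation,
# assuming Together AI's DeepSeek-V3 model identifier, could look like:
inspect eval scicode.py \
    --model together/deepseek-ai/DeepSeek-V3 \
    -T mode=normal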
```
@@ -55,7 +55,7 @@ For more information regarding `inspect_ai`, we refer users to its [official doc

### Extra: How Is SciCode Evaluated Under the Hood?

-During the evaluation, the sub-steps of each main problem of SciCode are passed in order to the evalauted LLM with necessary prompts and LLM responses for previous sub-steps. The generated Python code from LLM will be parsed and saved to disk, which will be used to run on test cases to determine the pass or fail for the sub-steps. The main problem will be considered as solved if the LLM can pass all sub-steps of the main problem.
+During the evaluation, the sub-steps of each main problem of SciCode are passed in order to the evaluated LLM with necessary prompts and LLM responses for previous sub-steps. The generated Python code from LLM will be parsed and saved to disk, which will be used to run on test cases to determine the pass or fail for the sub-steps. The main problem will be considered as solved if the LLM can pass all sub-steps of the main problem.

### Extra: Reproducibility of `inspect_ai` Integration
We use the SciCode `inspect_ai` integration to evaluate OpenAI's GPT-4o and compare it against the original evaluation pipeline. The comparison between the two approaches is shown below.