README.md (6 changes: 3 additions & 3 deletions)
@@ -29,7 +29,7 @@ SciCode is a challenging benchmark designed to evaluate the capabilities of lang


## Dataset Creation
-SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. Scicode mainly focuses on 1. Numerical methods 2.Simulation of systems 3. Scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test LM’s science capability.
+SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. SciCode mainly focuses on 1. Numerical methods 2. Simulation of systems 3. Scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test LM’s science capability.

## 🏆 Leaderboard

@@ -59,12 +59,12 @@ SciCode sources challenging and realistic research-level coding problems across
## Instructions to evaluate a new model using `inspect_ai` (recommended)


-Scicode has been integrated with `inspect_ai` for easier and faster model evaluation. You need to run the following steps ro run:
+SciCode has been integrated with `inspect_ai` for easier and faster model evaluation. You need to run the following steps to run:

1. Clone this repository `git clone git@github.com:scicode-bench/SciCode.git`
2. Install the `scicode` package with `pip install -e .`
3. Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save them as `./eval/data/test_data.h5`
-4. Go to the `eval/inspect_ai` directory, setup correspoinding API key, and run the following command:
+4. Go to the `eval/inspect_ai` directory, set up the corresponding API key, and run the following command:

```bash
cd eval/inspect_ai
```
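
For orientation, a complete end-to-end run might look like the sketch below. This is not a command taken from the repository: the model name and the `mode` task argument are only illustrative, and the full set of task arguments is documented in `eval/inspect_ai/README.md`.

```bash
# Illustrative sketch only: assumes an OpenAI API key and uses the `mode`
# task argument described in eval/inspect_ai/README.md.
cd eval/inspect_ai
export OPENAI_API_KEY=<YOUR_API_KEY>
inspect eval scicode.py \
    --model openai/gpt-4o \
    -T mode=normal
```
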
eval/inspect_ai/README.md (8 changes: 4 additions & 4 deletions)
@@ -2,7 +2,7 @@

### 1. Set Up Your API Keys

-Users can follow [`inspect_ai`'s official documentation](https://inspect.ai-safety-institute.org.uk/#getting-started) to setup correpsonding API keys depending on the types of models they would like to evaluate.
+Users can follow [`inspect_ai`'s official documentation](https://inspect.ai-safety-institute.org.uk/#getting-started) to setup corresponding API keys depending on the types of models they would like to evaluate.
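
As a minimal illustration (the exact environment variable depends on the provider being evaluated), an OpenAI key can be exported before running the evaluation:

```bash
# Placeholder value; other providers use their own variables,
# for example ANTHROPIC_API_KEY or TOGETHER_API_KEY.
export OPENAI_API_KEY=<YOUR_API_KEY>
```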

### 2. Setup Command Line Arguments if Needed

@@ -25,7 +25,7 @@ However, there are some additional command line arguments that could be useful a
- `gold` mode can only be used on the validation set which loads the gold answer
- `dummy` mode does not call any real LLMs and generates some dummy outputs

-For example, user can run five samples on the validation set with background as
+For example, users can run five samples on the validation set with background as

```bash
inspect eval scicode.py \
@@ -38,7 +38,7 @@
-T mode=normal
```
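
To illustrate the `mode` argument, a gold-answer run on the validation set would only swap the mode value in the command above; the model name is again just an example, and any additional task arguments needed to select the validation split are omitted from this sketch.

```bash
# Sketch of a gold-mode run; `gold` only works on the validation set,
# and the model identifier is illustrative.
inspect eval scicode.py \
    --model openai/gpt-4o \
    -T mode=gold
```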

-User can run the evaluation on `Deepseek-v3` using together ai via the following command:
+Users can run the evaluation on `Deepseek-v3` using Together AI via the following command:

```bash
export TOGETHER_API_KEY=<YOUR_API_KEY>
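# The rest of this command is elided in the diff; a hypothetical invocation,
# assuming Together AI's DeepSeek-V3 model identifier, could look like:
inspect eval scicode.py \
    --model together/deepseek-ai/DeepSeek-V3 \
    -T mode=normal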
```
@@ -55,7 +55,7 @@ For more information regarding `inspect_ai`, we refer users to its [official doc

### Extra: How Is SciCode Evaluated Under the Hood?

-During the evaluation, the sub-steps of each main problem of SciCode are passed in order to the evalauted LLM with necessary prompts and LLM responses for previous sub-steps. The generated Python code from LLM will be parsed and saved to disk, which will be used to run on test cases to determine the pass or fail for the sub-steps. The main problem will be considered as solved if the LLM can pass all sub-steps of the main problem.
+During the evaluation, the sub-steps of each main problem of SciCode are passed in order to the evaluated LLM with necessary prompts and LLM responses for previous sub-steps. The generated Python code from LLM will be parsed and saved to disk, which will be used to run on test cases to determine the pass or fail for the sub-steps. The main problem will be considered as solved if the LLM can pass all sub-steps of the main problem.

### Extra: Reproducibility of `inspect_ai` Integration
We use the SciCode `inspect_ai` integration to evaluate OpenAI's GPT-4o and compare it against the original evaluation pipeline. The comparison between the two approaches is shown below.