<div align="center" markdown="1">

-# SciCode Leaderboard
-
-| Models | Main Problem Resolve Rate | <span style="color:grey">Subproblem</span> |
-|--------------------------|-------------------------------------|-------------------------------------|
-| 🥇 OpenAI o3-mini-low | <div align="center">**10.8**</div> | <div align="center" style="color:grey">33.3</div> |
-| 🥈 OpenAI o3-mini-high | <div align="center">**9.2**</div> | <div align="center" style="color:grey">34.4</div> |
-| 🥉 OpenAI o3-mini-medium | <div align="center">**9.2**</div> | <div align="center" style="color:grey">33.0</div> |
-| OpenAI o1-preview | <div align="center">**7.7**</div> | <div align="center" style="color:grey">28.5</div> |
-| Deepseek-R1 | <div align="center">**4.6**</div> | <div align="center" style="color:grey">28.5</div> |
-| Claude3.5-Sonnet | <div align="center">**4.6**</div> | <div align="center" style="color:grey">26.0</div> |
-| Claude3.5-Sonnet (new) | <div align="center">**4.6**</div> | <div align="center" style="color:grey">25.3</div> |
-| Deepseek-v3 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">23.7</div> |
-| Deepseek-Coder-v2 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">21.2</div> |
-| GPT-4o | <div align="center">**1.5**</div> | <div align="center" style="color:grey">25.0</div> |
-| GPT-4-Turbo | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.9</div> |
-| OpenAI o1-mini | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.2</div> |
-| Gemini 1.5 Pro | <div align="center">**1.5**</div> | <div align="center" style="color:grey">21.9</div> |
-| Claude3-Opus | <div align="center">**1.5**</div> | <div align="center" style="color:grey">21.5</div> |
-| Llama-3.1-405B-Chat | <div align="center">**1.5**</div> | <div align="center" style="color:grey">19.8</div> |
-| Claude3-Sonnet | <div align="center">**1.5**</div> | <div align="center" style="color:grey">17.0</div> |
-| Qwen2-72B-Instruct | <div align="center">**1.5**</div> | <div align="center" style="color:grey">17.0</div> |
-| Llama-3.1-70B-Chat | <div align="center">**0.0**</div> | <div align="center" style="color:grey">17.0</div> |
-| Mixtral-8x22B-Instruct | <div align="center">**0.0**</div> | <div align="center" style="color:grey">16.3</div> |
-| Llama-3-70B-Chat | <div align="center">**0.0**</div> | <div align="center" style="color:grey">14.6</div> |
+| Models | Main Problem Resolve Rate | <span style="color:grey">Subproblem</span> |
+|--------------------------|:-------------------------:|:--------------------------------------------:|
+| 🥇 OpenAI o3-mini-low | **10.8** | <span style="color:grey">33.3</span> |
+| 🥈 OpenAI o3-mini-high | **9.2** | <span style="color:grey">34.4</span> |
+| 🥉 OpenAI o3-mini-medium | **9.2** | <span style="color:grey">33.0</span> |
+| OpenAI o1-preview | **7.7** | <span style="color:grey">28.5</span> |
+| Deepseek-R1 | **4.6** | <span style="color:grey">28.5</span> |
+| Claude3.5-Sonnet | **4.6** | <span style="color:grey">26.0</span> |
+| Claude3.5-Sonnet (new) | **4.6** | <span style="color:grey">25.3</span> |
+| Deepseek-v3 | **3.1** | <span style="color:grey">23.7</span> |
+| Deepseek-Coder-v2 | **3.1** | <span style="color:grey">21.2</span> |
+| GPT-4o | **1.5** | <span style="color:grey">25.0</span> |
+| GPT-4-Turbo | **1.5** | <span style="color:grey">22.9</span> |
+| OpenAI o1-mini | **1.5** | <span style="color:grey">22.2</span> |
+| Gemini 1.5 Pro | **1.5** | <span style="color:grey">21.9</span> |
+| Claude3-Opus | **1.5** | <span style="color:grey">21.5</span> |
+| Llama-3.1-405B-Chat | **1.5** | <span style="color:grey">19.8</span> |
+| Claude3-Sonnet | **1.5** | <span style="color:grey">17.0</span> |
+| Qwen2-72B-Instruct | **1.5** | <span style="color:grey">17.0</span> |
+| Llama-3.1-70B-Chat | **0.0** | <span style="color:grey">17.0</span> |
+| Mixtral-8x22B-Instruct | **0.0** | <span style="color:grey">16.3</span> |
+| Llama-3-70B-Chat | **0.0** | <span style="color:grey">14.6</span> |

**Note: If models tie on the Main Problem resolve rate, the tie is broken by the Subproblem resolve rate.**
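For readers who want to reproduce the ordering, here is a minimal sketch of that tie-break rule: entries are sorted by Main Problem resolve rate first and by Subproblem resolve rate second, both descending. The sample rows below are taken from the table above, but the field names and data structure are illustrative assumptions, not part of the SciCode codebase.

```python
# Illustrative sketch of the ranking rule: sort by Main Problem resolve
# rate, breaking ties with the Subproblem resolve rate (both descending).
entries = [
    {"model": "OpenAI o3-mini-medium", "main": 9.2, "sub": 33.0},
    {"model": "OpenAI o3-mini-high",   "main": 9.2, "sub": 34.4},
    {"model": "Claude3.5-Sonnet",      "main": 4.6, "sub": 26.0},
    {"model": "Deepseek-R1",           "main": 4.6, "sub": 28.5},
]

ranked = sorted(entries, key=lambda e: (e["main"], e["sub"]), reverse=True)

for rank, e in enumerate(ranked, start=1):
    print(f"{rank}. {e['model']}: main {e['main']}, sub {e['sub']}")
```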