Commit 80bc845

Merge pull request #4 from scicode-bench/codex/fix-visualization-issue-on-leaderboard-z4catj

Fix leaderboard table layout

2 parents 48b7b1b + 49cdf01

1 file changed: 22 additions, 24 deletions


docs/leaderboard.md (22 additions, 24 deletions)
Original file line numberDiff line numberDiff line change
@@ -2,30 +2,28 @@
 
 <div align="center" markdown="1">
 
-# SciCode Leaderboard
-
-| Models | Main Problem Resolve Rate | <span style="color:grey">Subproblem</span> |
-|--------------------------|-------------------------------------|-------------------------------------|
-| 🥇 OpenAI o3-mini-low | <div align="center">**10.8**</div> | <div align="center" style="color:grey">33.3</div> |
-| 🥈 OpenAI o3-mini-high | <div align="center">**9.2**</div> | <div align="center" style="color:grey">34.4</div> |
-| 🥉 OpenAI o3-mini-medium | <div align="center">**9.2**</div> | <div align="center" style="color:grey">33.0</div> |
-| OpenAI o1-preview | <div align="center">**7.7**</div> | <div align="center" style="color:grey">28.5</div> |
-| Deepseek-R1 | <div align="center">**4.6**</div> | <div align="center" style="color:grey">28.5</div> |
-| Claude3.5-Sonnet | <div align="center">**4.6**</div> | <div align="center" style="color:grey">26.0</div> |
-| Claude3.5-Sonnet (new) | <div align="center">**4.6**</div> | <div align="center" style="color:grey">25.3</div> |
-| Deepseek-v3 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">23.7</div> |
-| Deepseek-Coder-v2 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">21.2</div> |
-| GPT-4o | <div align="center">**1.5**</div> | <div align="center" style="color:grey">25.0</div> |
-| GPT-4-Turbo | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.9</div> |
-| OpenAI o1-mini | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.2</div> |
-| Gemini 1.5 Pro | <div align="center">**1.5**</div> | <div align="center" style="color:grey">21.9</div> |
-| Claude3-Opus | <div align="center">**1.5**</div> | <div align="center" style="color:grey">21.5</div> |
-| Llama-3.1-405B-Chat | <div align="center">**1.5**</div> | <div align="center" style="color:grey">19.8</div> |
-| Claude3-Sonnet | <div align="center">**1.5**</div> | <div align="center" style="color:grey">17.0</div> |
-| Qwen2-72B-Instruct | <div align="center">**1.5**</div> | <div align="center" style="color:grey">17.0</div> |
-| Llama-3.1-70B-Chat | <div align="center">**0.0**</div> | <div align="center" style="color:grey">17.0</div> |
-| Mixtral-8x22B-Instruct | <div align="center">**0.0**</div> | <div align="center" style="color:grey">16.3</div> |
-| Llama-3-70B-Chat | <div align="center">**0.0**</div> | <div align="center" style="color:grey">14.6</div> |
+| Models | Main Problem Resolve Rate | <span style="color:grey">Subproblem</span> |
+|--------------------------|:-------------------------:|:--------------------------------------------:|
+| 🥇 OpenAI o3-mini-low | **10.8** | <span style="color:grey">33.3</span> |
+| 🥈 OpenAI o3-mini-high | **9.2** | <span style="color:grey">34.4</span> |
+| 🥉 OpenAI o3-mini-medium | **9.2** | <span style="color:grey">33.0</span> |
+| OpenAI o1-preview | **7.7** | <span style="color:grey">28.5</span> |
+| Deepseek-R1 | **4.6** | <span style="color:grey">28.5</span> |
+| Claude3.5-Sonnet | **4.6** | <span style="color:grey">26.0</span> |
+| Claude3.5-Sonnet (new) | **4.6** | <span style="color:grey">25.3</span> |
+| Deepseek-v3 | **3.1** | <span style="color:grey">23.7</span> |
+| Deepseek-Coder-v2 | **3.1** | <span style="color:grey">21.2</span> |
+| GPT-4o | **1.5** | <span style="color:grey">25.0</span> |
+| GPT-4-Turbo | **1.5** | <span style="color:grey">22.9</span> |
+| OpenAI o1-mini | **1.5** | <span style="color:grey">22.2</span> |
+| Gemini 1.5 Pro | **1.5** | <span style="color:grey">21.9</span> |
+| Claude3-Opus | **1.5** | <span style="color:grey">21.5</span> |
+| Llama-3.1-405B-Chat | **1.5** | <span style="color:grey">19.8</span> |
+| Claude3-Sonnet | **1.5** | <span style="color:grey">17.0</span> |
+| Qwen2-72B-Instruct | **1.5** | <span style="color:grey">17.0</span> |
+| Llama-3.1-70B-Chat | **0.0** | <span style="color:grey">17.0</span> |
+| Mixtral-8x22B-Instruct | **0.0** | <span style="color:grey">16.3</span> |
+| Llama-3-70B-Chat | **0.0** | <span style="color:grey">14.6</span> |
 
 **Note: If the models tie in the Main Problem resolve rate, we will then compare the Subproblems.**
 
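The diff replaces block-level `<div align="center">` wrappers inside table cells with Markdown column alignment, which is the likely layout fix: block-level HTML inside cells can break table rendering in some Markdown pipelines (plausibly the docs site here, given the `markdown="1"` attribute on the surrounding `<div>`). In GitHub-flavored Markdown, colons in the delimiter row align the whole column, so per-cell wrappers become unnecessary. A minimal sketch with a hypothetical model name:

```markdown
| Models      | Main Problem Resolve Rate | <span style="color:grey">Subproblem</span> |
|-------------|:-------------------------:|:------------------------------------------:|
| Example-LLM |         **10.8**          |    <span style="color:grey">33.3</span>    |
```

`:---:` centers a column, `:---` left-aligns, and `---:` right-aligns; the inline `<span>` keeps the grey color without introducing block-level HTML into the cells.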
