WordleBench is a benchmark for evaluating LLMs on their ability to solve Wordle puzzles.
Wordle combines strategy with logical deduction: to solve a puzzle, the model must keep track of its previous guesses and the feedback for each one, then combine that information into an informed next guess. This makes it a useful benchmark for evaluating the reasoning capabilities of LLMs.
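For context, the feedback the model reasons over is the standard Wordle coloring of each letter. Below is a minimal sketch of that scoring, assuming a three-symbol encoding (`G` = right letter in the right spot, `Y` = right letter in the wrong spot, `_` = letter absent); the function name and symbols are illustrative, not necessarily the exact format WordleBench sends to the model:

```python
from collections import Counter

def score_guess(guess: str, answer: str) -> str:
    """Wordle-style feedback with duplicate-letter handling."""
    feedback = ["_"] * len(guess)
    # Answer letters that are not exact matches remain available
    # to satisfy "yellow" (right letter, wrong spot) marks.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        elif remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)

print(score_guess("crane", "cache"))  # -> "G_Y_G"
```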
- All models are run with their default settings through OpenRouter.
- A set of 100 randomly selected past Wordle answers is used as the test set for all models.
- Each game is a multi-turn interaction: the model makes a guess, receives feedback, and guesses again based on everything it has seen so far, until it either solves the puzzle or exhausts the maximum number of allowed guesses (see the sketch below).
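A rough sketch of that game loop, reusing the `score_guess` helper above; `ask_model` stands in for the OpenRouter call and `MAX_GUESSES` for the guess limit, both of which are assumptions here rather than the repo's actual names:

```python
MAX_GUESSES = 6  # standard Wordle limit; assumed, not read from the repo

def play_game(answer: str, ask_model) -> tuple[bool, int]:
    """Play one game. `ask_model` maps the (guess, feedback) history so far
    to the model's next guess. Returns (solved, guesses used)."""
    history: list[tuple[str, str]] = []
    for turn in range(1, MAX_GUESSES + 1):
        guess = ask_model(history)
        if guess == answer:
            return True, turn            # puzzle solved
        history.append((guess, score_guess(guess, answer)))
    return False, MAX_GUESSES            # guesses exhausted
```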
```bash
# Install dependencies
uv sync

# Run the benchmark
uv run python main.py

# Analyze results
uv run python analyze.py
```