Evaluate AI models (or humans) on New York Times Connections puzzles.
This project provides a comprehensive evaluation framework for testing linguistic reasoning capabilities using Connections puzzles. It supports 200+ AI models through a unified OpenRouter integration and includes both batch evaluation and interactive modes.
- Multi-model support: Access 200+ AI models through OpenRouter (OpenAI, Anthropic, xAI, Google Gemini, and more)
- Reasoning model support: Full support for reasoning models (GPT-5, o3, Grok-4, etc.) with proper parameter handling
- Interactive mode: Human players can test their skills
- Cost tracking: Separate tracking of OpenRouter and upstream provider costs
- Detailed token metrics: Breakdown of prompt vs completion tokens
- Verbose logging: Real-time exchange logging with the `--verbose` flag
- Comprehensive metrics: Track guesses, errors, time, and token usage
- Reproducible: Controlled randomization with optional seeds
- Detailed logging: JSONL format for analysis
- Improved logging docs to reflect the current JSONL file naming and controllog outputs
- Clarified CLI examples for reasoning models and verbose logging
- Version bump; no breaking API changes
Requires Python ≥3.12 and `uv`.
```bash
git clone <repository>
cd eval-connections
uv sync
```

```bash
# List available models
uv run connections_eval list-models
```

```bash
# Set OpenRouter API key
export OPENROUTER_API_KEY="your-key-here"

# Run evaluation with verbose logging
uv run connections_eval run --model gpt5 --puzzles 5 --verbose

# Run with reasoning model (automatically handled)
uv run connections_eval run --model grok4 --puzzles 3

# Interactive mode
uv run connections_eval run --interactive

# Custom inputs, logs, and seed
uv run connections_eval run \
  --model gemini \
  --puzzles 3 \
  --seed 42 \
  --inputs-path ./custom-inputs \
  --log-path ./custom-logs
```

All models are accessed through OpenRouter using a single API key. Set your OpenRouter API key:

```bash
export OPENROUTER_API_KEY="your-openrouter-key"
```

Models are configured in `inputs/model_mappings.yml`. Here are some popular options:
| CLI Name | OpenRouter Model ID | Type | Description |
|---|---|---|---|
| `gpt5` | `openai/gpt-5` | Reasoning | OpenAI GPT-5 (latest) |
| `gpt5-mini` | `openai/gpt-5-mini` | Reasoning | OpenAI GPT-5 Mini |
| `gpt5-nano` | `openai/gpt-5-nano` | Reasoning | OpenAI GPT-5 Nano |
| `gpt-oss-120b` | `openai/gpt-oss-120b` | Reasoning | OpenAI GPT OSS 120B |
| `gpt-oss-20b` | `openai/gpt-oss-20b` | Reasoning | OpenAI GPT OSS 20B |
| `o3` | `openai/o3` | Reasoning | OpenAI o3 |
| `o3-mini` | `openai/o3-mini` | Reasoning | OpenAI o3-mini |
| `grok4` | `x-ai/grok-4` | Reasoning | xAI Grok-4 |
| `grok3` | `x-ai/grok-3` | Standard | xAI Grok-3 |
| `grok3-mini` | `x-ai/grok-3-mini` | Reasoning | xAI Grok-3 Mini |
| `opus-4.1` | `anthropic/claude-opus-4.1` | Standard | Anthropic Claude Opus 4.1 |
| `sonnet` | `anthropic/claude-3.5-sonnet` | Standard | Anthropic Claude 3.5 Sonnet |
| `gemini` | `google/gemini-2.5-pro` | Reasoning | Google Gemini 2.5 Pro |
Reasoning Models: Automatically handled with special parameter configurations (no `max_tokens`, `temperature`, etc.)
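As a rough sketch of how this can work under the hood (not the project's actual adapter code), the example below builds an OpenRouter chat request and drops `max_tokens`/`temperature` when the model's mapping entry is marked as reasoning; the mapping schema, the `httpx` client, and the specific parameter values are all assumptions.

```python
import os

import httpx  # assumed HTTP client; the project may use a different one
import yaml


def build_request(cli_name: str, messages: list[dict]) -> dict:
    """Build an OpenRouter chat request, omitting sampling params for reasoning models."""
    with open("inputs/model_mappings.yml") as f:
        mappings = yaml.safe_load(f)

    # Assumed mapping shape: {"gpt5": {"id": "openai/gpt-5", "type": "reasoning"}, ...}
    entry = mappings[cli_name]
    body = {"model": entry["id"], "messages": messages}

    if entry.get("type", "standard").lower() != "reasoning":
        # Standard models accept the usual sampling parameters.
        body["max_tokens"] = 2048
        body["temperature"] = 0.0
    # Reasoning models (GPT-5, o3, Grok-4, ...) reject these, so they are left out.
    return body


def call_openrouter(body: dict) -> dict:
    resp = httpx.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json=body,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()
```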
Each puzzle contains 16 words that form 4 groups of 4 related words:
- Objective: Find all 4 groups by guessing 4 related words at a time
- Attempts: Maximum 6 total guesses (4 mistakes allowed)
- Response Format: Exactly 4 words, ALL CAPS, comma-separated
- Feedback: Only "CORRECT" or "INCORRECT" (no hints)
- Invalid Responses: Wrong word count, duplicates, or words not in puzzle
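To make the response rules concrete, here is a minimal, hypothetical validator for a single guess; it illustrates the checks listed above (word count, duplicates, membership in the puzzle) and is not the project's actual parsing code.

```python
def validate_guess(response: str, puzzle_words: set[str]) -> tuple[bool, str]:
    """Return (is_valid, reason) for a raw response like 'BLANKET, SHAM, SHEET, THROW'."""
    words = [w.strip().upper() for w in response.split(",") if w.strip()]

    if len(words) != 4:
        return False, f"expected 4 words, got {len(words)}"
    if len(set(words)) != 4:
        return False, "duplicate words in guess"
    unknown = [w for w in words if w not in puzzle_words]
    if unknown:
        return False, f"words not in puzzle: {unknown}"
    return True, "ok"
```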
Puzzles are defined in `inputs/connections_puzzles.yml`:

```yaml
puzzles:
  - id: 476
    date: 2024-09-29
    difficulty: 4.5
    words: [BLANKET, SHAM, SHEET, THROW, ...]
    groups:
      - name: Bedding
        color: green
        words: [BLANKET, SHAM, SHEET, THROW]
      # ... 3 more groups
```

The prompt template (`inputs/prompt_template.xml`):

```xml
<system>
You are an expert puzzle-solver. Follow the rules strictly.
</system>
<user>
HOW TO PLAY
1. Guess 4 related words.
2. You'll be told only "Correct" or "Incorrect".
3. You have at most 6 total guesses (2 correct + 4 mistakes).
Respond with EXACTLY four words, ALL CAPS, comma-separated.
</user>
<puzzle>{{WORDS}}</puzzle>
<id>{{PUZZLE_ID}}</id>
<difficulty>{{DIFFICULTY}}</difficulty>
```

Results are displayed in a formatted table showing:
- Solve rate and guess accuracy
- Performance metrics (time, tokens)
- Detailed token breakdown (prompt vs completion tokens)
- Cost tracking (OpenRouter cost vs upstream provider cost)
- Comprehensive evaluation statistics
Example output:

```text
Evaluation Results
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Metric            ┃ Value     ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Puzzles Solved    │ 4         │
│ Solve Rate        │ 80.0%     │
│ Total Tokens      │ 12847     │
│ Prompt Tokens     │ 1203      │
│ Completion Tokens │ 11644     │
│ OpenRouter Cost   │ $0.001354 │
│ OpenAI Cost       │ $0.027087 │
└───────────────────┴───────────┘
```
Detailed logs are saved to `logs/connections_eval_<timestamp>.jsonl`:

```json
{"timestamp": "2025-07-31T17:23:08Z", "run_id": "...", "model": "o3", ...}
```

Each line contains either:
- Exchange log: Individual guess with request/response/timing
- Summary log: Final run statistics
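For quick ad-hoc analysis you can stream the file with the standard library. The sketch below is illustrative: the way it separates exchange records from the summary record (a hypothetical `summary` key) is an assumption about the log schema, not a documented contract.

```python
import json
from pathlib import Path


def split_log(path) -> tuple[list[dict], list[dict]]:
    """Separate per-guess exchange records from end-of-run summary records."""
    exchanges, summaries = [], []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Hypothetical discriminator: adjust to the actual log schema.
        (summaries if "summary" in record else exchanges).append(record)
    return exchanges, summaries


latest = max(Path("logs").glob("connections_eval_*.jsonl"))  # most recent run by name
exchanges, summaries = split_log(latest)
print(f"{latest.name}: {len(exchanges)} exchanges, {len(summaries)} summaries")
```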
This project also emits accounting-style telemetry as structured, balanced events:
- Files: `logs/controllog/YYYY-MM-DD/events.jsonl` and `logs/controllog/YYYY-MM-DD/postings.jsonl`
- IDs: UUIDv7 for `event_id` and `posting_id` (sortable by time)
- Run/Task identifiers:
  - `run_id`: e.g., `2025-10-01T12-30-00_grok3`
  - `task_id`: `T{puzzle_id}:{run_id}` (one task per puzzle attempt)
  - `agent_id`: `agent:connections_eval`
- Accounts used (balanced per event):
  - `resource.tokens` (provider → project)
  - `resource.time_ms` (agent → project)
  - `resource.money` (vendor → project) for OpenRouter and optional upstream
  - `truth.state` (task WIP → DONE/FAILED)
  - `value.utility` (optional reward; task → project)
The SDK initializes automatically at run start; raw payloads are preserved.
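To show what "balanced per event" means in practice, the hypothetical check below sums signed posting amounts per `event_id` and expects them to net to zero; the field names (`event_id`, `amount`) are assumptions for illustration, not the SDK's documented schema.

```python
import json
from collections import defaultdict
from pathlib import Path


def trial_balance(postings_path: str, tolerance: float = 1e-9) -> bool:
    """Check the double-entry invariant: postings net to zero within each event."""
    totals: dict[str, float] = defaultdict(float)
    for line in Path(postings_path).read_text().splitlines():
        if not line.strip():
            continue
        posting = json.loads(line)
        # Assumed field names; if events mix units, group by (event_id, unit) instead.
        totals[posting["event_id"]] += float(posting["amount"])

    unbalanced = [eid for eid, amount in totals.items() if abs(amount) > tolerance]
    print("PASS" if not unbalanced else f"FAIL: {len(unbalanced)} unbalanced events")
    return not unbalanced


trial_balance("logs/controllog/2025-10-01/postings.jsonl")
```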
The evaluation automatically uploads controllog files to MotherDuck after each run (if configured). This includes validation and trial balance checks.
Setup:
- Set your MotherDuck token in your environment (e.g., in a `.env` file):

  ```bash
  export MOTHERDUCK_TOKEN="your-token-here"
  export MOTHERDUCK_DB="md:"  # Optional: "md:" for default database, or "md:database_name" for a specific database
  ```

  Note: If `MOTHERDUCK_DB` is not set, the upload step will be skipped. Use `"md:"` to connect to your default MotherDuck database, or `"md:database_name"` if you've created a specific database.

- Run an evaluation - upload happens automatically:

  ```bash
  uv run connections_eval run --model grok3 --puzzles 2
  ```
The upload process:
- Uploads controllog events and postings to MotherDuck
- Validates that the run's data exists in the database
- Runs a trial balance check to ensure data integrity
- Optionally deletes local controllog files (use `--keep-local-files` to retain them)
Manual Upload (if needed):
If you need to manually upload controllog files:
```bash
# Set your token and database
export MOTHERDUCK_DB="md:"  # or "md:database_name" for a specific database
export CTRL_LOG_DIR="logs"
uv run python scripts/load_controllog_to_motherduck.py

# Alternatively, load into a local DuckDB file
export MOTHERDUCK_DB="controllog.duckdb"
uv run python scripts/load_controllog_to_motherduck.py
```

Run a fast trial balance check (double-entry invariants) and example reports:

```bash
export MOTHERDUCK_DB="md:"  # or "md:database_name" or a local .duckdb path
uv run python scripts/reports_controllog.py
```

Outputs include:
- Trial balance PASS/FAIL
- Cost and utility flows per project
- Average wall latency by model
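If you want to query the loaded data directly, DuckDB's Python client can open the same MotherDuck target; the `postings` table and column names in this sketch are assumed, so adapt the query to whatever schema the loader actually creates.

```python
import os

import duckdb

# Attach MotherDuck using the token from the environment (a local .duckdb path also works).
con = duckdb.connect(f"md:?motherduck_token={os.environ['MOTHERDUCK_TOKEN']}")

# Hypothetical table/column names: adjust to the real schema.
rows = con.execute(
    """
    SELECT account, SUM(amount) AS total
    FROM postings
    GROUP BY account
    ORDER BY account
    """
).fetchall()

for account, total in rows:
    print(f"{account:<20} {total}")
```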
CLI reference:

```bash
uv run connections_eval run [OPTIONS]
```

| Option | Type | Default | Description |
|---|---|---|---|
| `--model` | str | None | Model to evaluate (required unless `--interactive`) |
| `--interactive` | flag | False | Run in interactive mode |
| `--puzzles` | int | All | Maximum puzzles to run |
| `--verbose` | flag | False | Enable real-time exchange logging |
| `--seed` | int | Random | Random seed for reproducibility |
| `--inputs-path` | path | `inputs/` | Input files directory |
| `--log-path` | path | `logs/` | Log output directory |
| `--prompt-file` | str | `prompt_template.xml` | Prompt template filename |
| `--keep-local-files` | flag | False | Keep local controllog files after uploading to MotherDuck |
Examples:

```bash
# Basic evaluation
uv run connections_eval run --model gpt5

# Limited run with verbose logging
uv run connections_eval run --model gemini --puzzles 3 --verbose

# Reasoning model evaluation
uv run connections_eval run --model grok4 --puzzles 5 --seed 42

# Interactive play
uv run connections_eval run --interactive

# Custom paths with verbose mode
uv run connections_eval run --model sonnet --inputs-path ./my-puzzles --verbose

# Keep local controllog files after upload (for inspection)
uv run connections_eval run --model grok4 --keep-local-files
```

Run the test suite:

```bash
uv run pytest
```

Project structure:

```text
src/connections_eval/
├── cli.py                    # Typer CLI interface
├── core.py                   # Game logic & metrics
├── adapters/                 # AI model adapters
│   └── openrouter_adapter.py # Unified OpenRouter adapter (200+ models)
└── utils/                    # Utilities
    ├── timing.py             # Timer utilities
    ├── tokens.py             # Token counting & cost extraction
    ├── logging.py            # JSON logging
    ├── retry.py              # Retry with backoff
    └── motherduck.py         # MotherDuck upload and validation utilities

inputs/
├── connections_puzzles.yml   # Puzzle database
├── model_mappings.yml        # Model configuration
└── prompt_template.xml       # Prompt template
```

Error handling:
- API Failures: Automatic retry with exponential backoff (3 attempts)
- Missing API Keys: Fail fast with clear error message for `OPENROUTER_API_KEY`
- Invalid Responses: Track and limit (max 3 per puzzle)
- Network Issues: Graceful degradation with detailed logging
- Reasoning Models: Special parameter handling for reasoning models (o1, o3, o4) that don't support `max_tokens` or `temperature`
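As a sketch of the retry behavior described above (not the project's `retry.py`), an exponential-backoff wrapper might look like this; the delay schedule and the broad exception handling are illustrative assumptions.

```python
import time


def with_retry(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying failures with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # in practice, narrow this to network/API errors
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            delay = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```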
The evaluation tracks comprehensive metrics:
- Success Rate: Puzzles solved / attempted
- Guess Accuracy: Correct guesses / total guesses
- Response Validity: Invalid responses per puzzle
- Performance: Average time per puzzle
- Token Usage:
- Total tokens consumed
- Prompt tokens (input to model)
- Completion tokens (generated by model)
- Token counting method (API vs approximate)
- Cost Tracking:
- OpenRouter cost (what you pay)
- Upstream cost (what OpenRouter pays the provider)
- Per-exchange cost breakdown in logs
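For reference, the headline rates are simple ratios; this standalone sketch computes them from plain counters and is independent of the project's metrics code.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    puzzles_attempted: int
    puzzles_solved: int
    total_guesses: int
    correct_guesses: int


def summarize(c: RunCounters) -> dict[str, float]:
    return {
        "solve_rate": c.puzzles_solved / c.puzzles_attempted if c.puzzles_attempted else 0.0,
        "guess_accuracy": c.correct_guesses / c.total_guesses if c.total_guesses else 0.0,
    }


# Example: 4 of 5 puzzles solved gives the 80.0% solve rate shown in the sample output above.
print(summarize(RunCounters(puzzles_attempted=5, puzzles_solved=4, total_guesses=22, correct_guesses=16)))
```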
You can run evaluations automatically via GitHub Actions workflow dispatch. This is useful for testing new models and adding them to the result set.
- Configure GitHub Secrets:
  - `OPENROUTER_API_KEY`: Your OpenRouter API key
  - `MOTHERDUCK_TOKEN`: Your MotherDuck authentication token
  - `MOTHERDUCK_DB`: (Optional) MotherDuck database connection string (defaults to `md:` if not set)
- Run the Workflow:
  - Go to the "Actions" tab in your GitHub repository
  - Select "Run Model Evaluation" workflow
  - Click "Run workflow"
  - Enter the model name (e.g., `gpt5`, `grok4`)
  - Optionally specify the number of puzzles to run
  - Click "Run workflow"
The workflow will:
- Run the evaluation with the specified model
- Upload results to MotherDuck
- Update the docs folder (run summaries and log views)
- Commit and push the updated docs back to the repository
```text
# Workflow dispatch with:
#   model: "gpt5"
#   puzzles: "10"
```

This will run 10 puzzles with the GPT-5 model and automatically update the documentation.

View Interactive Results Table - Sports-style box score showing latest model performance
Table includes solve rates, costs, token usage, and timing metrics formatted like sports statistics.
MIT License - see LICENSE file.
- Ensure Python ≥3.12 and `uv` are installed
- Run tests: `uv run pytest`
- Update documentation as needed
For questions or issues, please check the logs in the `logs/` directory for detailed debugging information.