5 changes: 3 additions & 2 deletions .gitignore
@@ -155,9 +155,10 @@ cython_debug/
 llm_rules.md
 .python-version
 
-benchmarks/results/*
+benchmarks/**/results
+benchmarks/**/plots
 docs/api/_build/*
 docs/api/reference/*
-examples/**/results/*
+examples/**/results
 docs/general/**/data_*
 docs/site/*
88 changes: 88 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,88 @@
# Benchmarks

Performance benchmarks compare Mesa Frames backends ("frames") with classic Mesa ("mesa")
implementations for a small set of representative models. They help track runtime scaling
and regressions.

Currently included models:

- **boltzmann**: Simple wealth exchange ("Boltzmann wealth") model.
- **sugarscape**: Sugarscape Immediate Growback variant (square grid sized relative to agent count).

## Quick start

```bash
uv run benchmarks/cli.py
```

That command (with defaults) will:

- Benchmark both models (`boltzmann`, `sugarscape`).
- Use agent counts 1000, 2000, 3000, 4000, 5000.
- Run 100 steps per simulation.
- Repeat each configuration once.
- Save CSV results and generate plots.
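
For reference, the same defaults can be spelled out explicitly; the invocation below should be equivalent and uses only the options documented in the table that follows.

```bash
# Explicit form of the default run (same models, agent range, steps, repeats, seed, outputs)
uv run benchmarks/cli.py \
  --models all \
  --agents 1000:5000:1000 \
  --steps 100 \
  --repeats 1 \
  --seed 42 \
  --save --plot
```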

## CLI options

Invoke `uv run benchmarks/cli.py --help` to see full help. Key options:

| Option | Default | Description |
| ------ | ------- | ----------- |
| `--models` | `all` | Comma-separated list or `all`; accepted values: `boltzmann`, `sugarscape`. |
| `--agents` | `1000:5000:1000` | Single integer or a range `start:stop:step`. |
| `--steps` | `100` | Steps per simulation run. |
| `--repeats` | `1` | Number of repeats per (model, backend, agents) configuration. The seed increments with each repeat. |
| `--seed` | `42` | Base RNG seed, incremented by the repeat index. |
| `--save / --no-save` | `--save` | Persist per‑model CSVs. |
| `--plot / --no-plot` | `--plot` | Generate scaling plots (PNG, plus other themed variants if enabled). |
| `--results-dir` | `benchmarks/results` | Root directory that will receive a timestamped subdirectory. |

Range parsing: `A:B:S` yields `A, A+S, A+2S, ...` up to and including `B`; any final value greater than `B` is dropped.
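
For example, when the stop value is not reachable in whole steps, the last in-range value is kept and the overshoot is discarded:

```bash
# Runs with 1000, 2000, 3000 and 4000 agents; 5000 exceeds the stop of 4500 and is dropped
uv run benchmarks/cli.py --models boltzmann --agents 1000:4500:1000
```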

## Output layout

Each invocation uses a single UTC timestamp, e.g. `20251016_173702`:

```text
benchmarks/
results/
20251016_173702/
boltzmann_perf_20251016_173702.csv
sugarscape_perf_20251016_173702.csv
plots/
boltzmann_runtime_20251016_173702_dark.png
sugarscape_runtime_20251016_173702_dark.png
... (other themed variants if enabled)
```

CSV schema (one row per completed run):

| Column | Meaning |
| ------ | ------- |
| `model` | Model key (`boltzmann`, `sugarscape`). |
| `backend` | `mesa` or `frames`. |
| `agents` | Agent count for that run. |
| `steps` | Steps simulated. |
| `seed` | Seed used (base seed + repeat index). |
| `repeat_idx` | Repeat counter starting at 0. |
| `runtime_seconds` | Wall-clock runtime for that run. |
| `timestamp` | Shared timestamp identifier for the benchmark batch. |
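
As a minimal sketch of post-processing, the snippet below loads a per-model CSV with Polars (the same library the CLI uses) and averages runtime over repeats; the path is taken from the example layout above, so substitute your own timestamp.

```python
import polars as pl

# Example path from the layout above; replace the timestamp with your own run.
csv_path = "benchmarks/results/20251016_173702/boltzmann_perf_20251016_173702.csv"

df = pl.read_csv(csv_path)

# Mean runtime per backend and agent count, averaged over repeats.
summary = (
    df.group_by(["backend", "agents"])
    .agg(pl.col("runtime_seconds").mean().alias("mean_runtime_s"))
    .sort(["backend", "agents"])
)
print(summary)
```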

## Performance tips

- Ensure the environment variable `MESA_FRAMES_RUNTIME_TYPECHECKING` is **unset** or set to `0` / `false` when collecting performance numbers; enabling it adds runtime type-validation overhead, and the CLI will warn you if it is on (see the example after this list).
- Run multiple repeats (e.g. `--repeats 5`) to smooth out run-to-run variance.
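
A minimal timing invocation combining both tips (assuming a POSIX shell for the inline environment variable):

```bash
# Disable runtime typechecking for this run and average over 5 repeats
MESA_FRAMES_RUNTIME_TYPECHECKING=0 uv run benchmarks/cli.py --repeats 5
```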

## Extending benchmarks

To benchmark an additional model:

1. Add or import both a Mesa implementation and a Frames implementation, each exposing a `simulate(agents: int, steps: int, seed: int | None, ...)` function.
2. Register it in `benchmarks/cli.py` inside the `MODELS` dict with two backends (names must be `mesa` and `frames`); a sketch follows this list.
3. Ensure any extra spatial parameters are derived from `agents` inside the runner lambda (see the sugarscape example).
4. Run the CLI and verify the resulting CSV still matches the schema above.
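
The sketch below illustrates steps 1-2, assuming a hypothetical `examples.mymodel` package whose `backend_mesa.simulate` and `backend_frames.simulate` already accept `(agents, steps, seed)`:

```python
# In benchmarks/cli.py, alongside the existing boltzmann/sugarscape entries.
# `examples.mymodel` is hypothetical; point these imports at your actual module.
from examples.mymodel import backend_frames as mymodel_frames
from examples.mymodel import backend_mesa as mymodel_mesa

MODELS["mymodel"] = ModelConfig(
    name="mymodel",
    backends=[
        Backend(name="mesa", runner=mymodel_mesa.simulate),
        Backend(name="frames", runner=mymodel_frames.simulate),
    ],
)
```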

## Related documentation

See `docs/user-guide/5_benchmarks.md` (user-facing narrative) and the main project `README.md` for overall context.
266 changes: 266 additions & 0 deletions benchmarks/cli.py
@@ -0,0 +1,266 @@
"""Typer CLI for running mesa vs mesa-frames performance benchmarks."""

from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
import os
from pathlib import Path
from time import perf_counter
from typing import Literal, Annotated, Protocol, Optional

import math
import polars as pl
import typer

from examples.boltzmann_wealth import backend_frames as boltzmann_frames
from examples.boltzmann_wealth import backend_mesa as boltzmann_mesa
from examples.sugarscape_ig.backend_frames import model as sugarscape_frames
from examples.sugarscape_ig.backend_mesa import model as sugarscape_mesa
from examples.plotting import (
plot_performance as _examples_plot_performance,
)

app = typer.Typer(add_completion=False)


class RunnerP(Protocol):
def __call__(self, agents: int, steps: int, seed: int | None = None) -> None: ...


@dataclass(slots=True)
class Backend:
name: Literal["mesa", "frames"]
runner: RunnerP


@dataclass(slots=True)
class ModelConfig:
name: str
backends: list[Backend]


MODELS: dict[str, ModelConfig] = {
"boltzmann": ModelConfig(
name="boltzmann",
backends=[
Backend(name="mesa", runner=boltzmann_mesa.simulate),
Backend(name="frames", runner=boltzmann_frames.simulate),
],
),
"sugarscape": ModelConfig(
name="sugarscape",
backends=[
Backend(
name="mesa",
runner=lambda agents, steps, seed=None: sugarscape_mesa.simulate(
agents=agents,
steps=steps,
width=int(max(20, math.ceil((agents) ** 0.5) * 2)),
height=int(max(20, math.ceil((agents) ** 0.5) * 2)),
seed=seed,
),
),
Backend(
name="frames",
# Benchmarks expect a runner signature (agents:int, steps:int, seed:int|None)
# Sugarscape frames simulate requires width/height; choose square close to agent count.
runner=lambda agents, steps, seed=None: sugarscape_frames.simulate(
agents=agents,
steps=steps,
width=int(max(20, math.ceil((agents) ** 0.5) * 2)),
height=int(max(20, math.ceil((agents) ** 0.5) * 2)),
seed=seed,
),
),
],
),
}


def _parse_agents(value: str) -> list[int]:
value = value.strip()
if ":" in value:
parts = value.split(":")
if len(parts) != 3:
raise typer.BadParameter("Ranges must use start:stop:step format")
try:
start, stop, step = (int(part) for part in parts)
except ValueError as exc:
raise typer.BadParameter("Range values must be integers") from exc
if step <= 0:
raise typer.BadParameter("Step must be positive")
if start < 0 or stop <= 0:
raise typer.BadParameter("Range endpoints must be positive")
if start > stop:
raise typer.BadParameter("Range start must be <= stop")
counts = list(range(start, stop + step, step))
if counts[-1] > stop:
counts.pop()
return counts
try:
agents = int(value)
except ValueError as exc: # pragma: no cover - defensive
raise typer.BadParameter("Agent count must be an integer") from exc
if agents <= 0:
raise typer.BadParameter("Agent count must be positive")
return [agents]
Comment on lines +81 to +107

⚠️ Potential issue | 🟡 Minor

Off-by-one edge case in range parsing.

When `start == 0` is provided, the validation `if start < 0 or stop <= 0` passes, but the subsequent logic may produce unexpected results. Consider whether `start == 0` should be allowed (0 agents makes no sense).

-        if start < 0 or stop <= 0:
-            raise typer.BadParameter("Range endpoints must be positive")
+        if start <= 0 or stop <= 0:
+            raise typer.BadParameter("Range endpoints must be positive (> 0)")
🧰 Tools
🪛 Ruff (0.14.7)

86-86, 90-90, 92-92, 94-94, 96-96, 104-104, 106-106: Avoid specifying long messages outside the exception class (TRY003)

🤖 Prompt for AI Agents
In benchmarks/cli.py around lines 81 to 107, the range parsing allows start == 0
which is invalid for agent counts; change the validation to reject non-positive
endpoints by replacing the check `if start < 0 or stop <= 0` with `if start <= 0
or stop <= 0` and update the BadParameter message if needed to say "Range
endpoints must be positive (>=1)"; keep the rest of the parsing logic (including
step validation and range construction) intact so ranges start at 1 or higher.



def _parse_models(value: str) -> list[str]:
"""Parse models option into a list of model keys.

Accepts:
- "all" -> returns all available model keys
- a single model name -> returns [name]
- a comma-separated list of model names -> returns list

Validates that each selected model exists in MODELS.
"""
value = value.strip()
if value == "all":
return list(MODELS.keys())
# support comma-separated lists
parts = [part.strip() for part in value.split(",") if part.strip()]
if not parts:
raise typer.BadParameter("Model selection must not be empty")
unknown = [p for p in parts if p not in MODELS]
if unknown:
raise typer.BadParameter(f"Unknown model selection: {', '.join(unknown)}")
# preserve order and uniqueness
seen = set()
result: list[str] = []
for p in parts:
if p not in seen:
seen.add(p)
result.append(p)
return result


def _plot_performance(
df: pl.DataFrame, model_name: str, output_dir: Path, timestamp: str
) -> None:
"""Wrap examples.plotting.plot_performance to ensure consistent theming.

The original benchmark implementation used simple seaborn styles (whitegrid / darkgrid).
Our example plotting utilities define a much darker, high-contrast *true* dark theme
(custom rc params overriding bg/fg colors). Reuse that logic here so the
benchmark dark plots match the example dark plots users see elsewhere.
"""
if df.is_empty():
return
stem = f"{model_name}_runtime_{timestamp}"
_examples_plot_performance(
df.select(["agents", "runtime_seconds", "backend"]),
output_dir=output_dir,
stem=stem,
# Prefer more concise, publication-style wording
title=f"{model_name.title()} runtime scaling",
)


@app.command()
def run(
models: Annotated[
str,
typer.Option(
help="Models to benchmark: boltzmann, sugarscape, or all",
callback=_parse_models,
),
] = "all",
agents: Annotated[
str,
typer.Option(
help="Agent count or range (start:stop:step)", callback=_parse_agents
),
] = "1000:5000:1000",
steps: Annotated[
int,
typer.Option(
min=0,
help="Number of steps per run.",
),
] = 100,
repeats: Annotated[int, typer.Option(help="Repeats per configuration.", min=1)] = 1,
seed: Annotated[int, typer.Option(help="Optional RNG seed.")] = 42,
save: Annotated[bool, typer.Option(help="Persist benchmark CSV results.")] = True,
plot: Annotated[bool, typer.Option(help="Render performance plots.")] = True,
results_dir: Annotated[
Path,
typer.Option(
help=(
"Base directory for benchmark outputs. A timestamped subdirectory "
"(e.g. results/20250101_120000) is created with CSV files at the root "
"and a 'plots/' subfolder for images."
),
),
] = Path(__file__).resolve().parent / "results",
Comment on lines +188 to +197

⚠️ Potential issue | 🟡 Minor

Avoid function call in default argument.

`Path(__file__).resolve().parent` is evaluated once at module load time. If the module is imported from different working directories, this could produce unexpected paths. Move the resolution inside the function body.

     results_dir: Annotated[
-        Path,
+        Path | None,
         typer.Option(
             help=(
                 "Base directory for benchmark outputs. A timestamped subdirectory "
                 "(e.g. results/20250101_120000) is created with CSV files at the root "
                 "and a 'plots/' subfolder for images."
             ),
         ),
-    ] = Path(__file__).resolve().parent / "results",
+    ] = None,
 ) -> None:
     """Run performance benchmarks for the models models."""
+    if results_dir is None:
+        results_dir = Path(__file__).resolve().parent / "results"
     runtime_typechecking = os.environ.get("MESA_FRAMES_RUNTIME_TYPECHECKING", "")
🧰 Tools
🪛 Ruff (0.14.7)

197-197: Do not perform function call in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable (B008)

🤖 Prompt for AI Agents
In benchmarks/cli.py around lines 188 to 197, the default argument
Path(__file__).resolve().parent / "results" is computed at module import time;
change the parameter to accept an optional Path (e.g., results_dir:
Optional[Path] = None) and move the Path(__file__).resolve().parent / "results"
computation into the function body so it runs at call time (ensure you import
Optional from typing). If results_dir is None inside the function, set
results_dir = Path(__file__).resolve().parent / "results" (or
Path(__file__).parent / "results" if you prefer to avoid resolve) before using
it.

) -> None:
"""Run performance benchmarks for the models models."""

⚠️ Potential issue | 🟡 Minor

Typo in docstring: "models models" should be "selected models".

-    """Run performance benchmarks for the models models."""
+    """Run performance benchmarks for the selected models."""
📝 Committable suggestion

Suggested change
"""Run performance benchmarks for the models models."""
"""Run performance benchmarks for the selected models."""
🤖 Prompt for AI Agents
In benchmarks/cli.py around line 199, the docstring contains a typo "models
models"; update it to read "selected models" (e.g., change the docstring to
"""Run performance benchmarks for the selected models."""), preserving
surrounding formatting and punctuation.

runtime_typechecking = os.environ.get("MESA_FRAMES_RUNTIME_TYPECHECKING", "")
if runtime_typechecking and runtime_typechecking.lower() not in {"0", "false"}:
typer.secho(
"Warning: MESA_FRAMES_RUNTIME_TYPECHECKING is enabled; benchmarks may run significantly slower.",
fg=typer.colors.YELLOW,
)
rows: list[dict[str, object]] = []
# Single timestamp per CLI invocation so all model results are co-located.
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
# Create unified output layout: <results_dir>/<timestamp>/{CSV files, plots/}
base_results_dir = results_dir
timestamp_dir = (base_results_dir / timestamp).resolve()
plots_subdir: Path = timestamp_dir / "plots"
for model in models:
config = MODELS[model]
typer.echo(f"Benchmarking {model} with agents {agents}")
for agents_count in agents:
for repeat_idx in range(repeats):
run_seed = seed + repeat_idx
for backend in config.backends:
start = perf_counter()
backend.runner(agents_count, steps, run_seed)
runtime = perf_counter() - start
rows.append(
{
"model": model,
"backend": backend.name,
"agents": agents_count,
"steps": steps,
"seed": run_seed,
"repeat_idx": repeat_idx,
"runtime_seconds": runtime,
"timestamp": timestamp,
}
)
# Report completion of this run to the CLI
typer.echo(
f"Completed {backend.name} for model={model} agents={agents_count} steps={steps} seed={run_seed} repeat={repeat_idx} in {runtime:.3f}s"
)
# Finished all runs for this model
typer.echo(f"Finished benchmarking model {model}")

if not rows:
typer.echo("No benchmark data collected.")
return
df = pl.DataFrame(rows)
if save:
timestamp_dir.mkdir(parents=True, exist_ok=True)
for model in models:
model_df = df.filter(pl.col("model") == model)
csv_path = timestamp_dir / f"{model}_perf_{timestamp}.csv"
model_df.write_csv(csv_path)
typer.echo(f"Saved {model} results to {csv_path}")
if plot:
plots_subdir.mkdir(parents=True, exist_ok=True)
for model in models:
model_df = df.filter(pl.col("model") == model)
_plot_performance(model_df, model, plots_subdir, timestamp)
typer.echo(f"Saved {model} plots under {plots_subdir}")

typer.echo(
f"Unified benchmark outputs written under {timestamp_dir} (CSV files) and {plots_subdir} (plots)"
)


if __name__ == "__main__":
app()