
Add calibration analysis: ECE, reliability diagrams, Brier decomposition#161

Open
orpheuslummis wants to merge 1 commit into forecastingresearch:main from orpheuslummis:calibration-analysis

Conversation

@orpheuslummis

Summary

Adds calibration as a first-class analysis feature: ECE, reliability diagrams, Brier decomposition (Murphy 1973), and sharpness — computed nightly alongside the existing leaderboard and displayed on a new /calibration/ page.

ForecastBench reports Brier score, BSS, Peer Score, and oracle equivalence but no calibration metrics. Calibration — does P=0.7 mean 70%? — is arguably the most safety-relevant property of a forecasting system. Recent work (KalshiBench, Lu 2025) finds systematic LLM overconfidence, but ForecastBench doesn't surface this.

What it adds:

  • compute_calibration_metrics() — per-model ECE, Brier decomposition (reliability, resolution, uncertainty), sharpness
  • compute_calibration_curve_data() — per-(model, bin) data for reliability diagrams
  • write_calibration_data() — CSV + JSON to public release bucket (same pattern as write_sota_graph_csv())
  • Wired into make_leaderboard() after oracle removal, written per leaderboard type
  • entrypoint.sh.template copies calibration files from bucket to assets/data/ before Jekyll build
  • /calibration/ page with D3 reliability diagram (diagonal = perfect calibration, circle size ∝ √n), model checkboxes, baseline/tournament toggle, metrics table sorted by ECE
  • Nav entry between Explore and Datasets
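
The core of what `compute_calibration_metrics()` computes can be sketched as follows. This is a minimal illustration of the standard definitions (equal-width binning for ECE, Murphy 1973 for the Brier decomposition), not the PR's actual implementation; the function name and return keys here are assumptions.

```python
import numpy as np

def calibration_metrics(probs, outcomes, n_bins=10):
    """Illustrative per-model calibration metrics (names hypothetical).

    ECE: weighted mean |mean forecast - observed frequency| over bins.
    Murphy (1973): mean Brier = reliability - resolution + uncertainty.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(probs)
    base_rate = outcomes.mean()
    # Assign each forecast to an equal-width bin [0, 0.1), ..., [0.9, 1.0].
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        n_b = mask.sum()
        if n_b == 0:
            continue
        conf = probs[mask].mean()     # mean forecast within the bin
        acc = outcomes[mask].mean()   # observed frequency within the bin
        ece += (n_b / n) * abs(conf - acc)
        reliability += (n_b / n) * (conf - acc) ** 2
        resolution += (n_b / n) * (acc - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    # Sharpness here is the variance of the forecasts themselves.
    sharpness = float(np.mean((probs - probs.mean()) ** 2))
    return {"ece": ece, "reliability": reliability,
            "resolution": resolution, "uncertainty": uncertainty,
            "sharpness": sharpness}
```

A perfectly calibrated forecaster (e.g. P=0.7 resolving YES 70% of the time) gets ECE ≈ 0 and reliability ≈ 0 under this sketch.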

What it does not change: existing scoring functions, bootstrap simulation, leaderboard columns, GCP infrastructure, any existing code paths.

Note: Houtan added Plotly-based calibration plots in March 2024 (7729a9d) but they were server-side only and removed during the codebase restructure. This surfaces calibration on the website with proper metrics and interactive visualization.

Verification

  • 6 unit tests pass: Brier decomposition identity (reliability − resolution + uncertainty ≈ mean Brier), ECE bounds, curve shape
  • tests/generate_test_calibration_data.py generates synthetic data with known calibration properties for local testing
  • Jekyll build succeeds, /calibration/ page renders correctly
  • Linters (black, isort, flake8, pydocstyle) pass on changed files
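
The decomposition-identity test mentioned above can be sketched like this. Note the identity is exact only when forecasts are constant within each bin (otherwise within-bin variance terms appear, hence the "≈"); this sketch snaps forecasts to bin midpoints so an exact assertion is valid. Names and tolerances here are illustrative, not the PR's actual test code.

```python
import numpy as np

def check_brier_decomposition_identity():
    """Verify REL - RES + UNC == mean Brier on synthetic data where
    forecasts sit exactly at bin midpoints (identity is exact there)."""
    rng = np.random.default_rng(0)
    n_bins, n = 10, 1000
    # Forecasts at bin midpoints 0.05, 0.15, ..., 0.95.
    probs = (rng.integers(0, n_bins, size=n) + 0.5) / n_bins
    # Outcomes drawn with the stated probabilities (calibrated generator).
    outcomes = (rng.random(n) < probs).astype(float)

    base = outcomes.mean()
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    rel = res = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.sum() / n
        rel += w * (probs[mask].mean() - outcomes[mask].mean()) ** 2
        res += w * (outcomes[mask].mean() - base) ** 2
    unc = base * (1.0 - base)
    brier = np.mean((probs - outcomes) ** 2)
    assert abs(rel - res + unc - brier) < 1e-10

check_brier_decomposition_identity()
```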

ForecastBench reports Brier score, BSS, Peer Score, and oracle equivalence
but no calibration metrics. Calibration — does P=0.7 mean 70%? — is arguably
the most safety-relevant property of a forecasting system. This adds it as a
parallel analysis step computed nightly alongside the existing leaderboard.

Pipeline (src/leaderboard/main.py):
- compute_calibration_metrics(): per-model ECE, Brier decomposition
  (reliability, resolution, uncertainty per Murphy 1973), sharpness
- compute_calibration_curve_data(): per-(model, bin) data for reliability
  diagrams
- write_calibration_data(): CSV + JSON to public release bucket
- Wired into make_leaderboard() after oracle removal
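
The per-(model, bin) rows that feed the reliability diagram might look like the sketch below. Function name, column names, and the choice to omit empty bins are assumptions for illustration, not the PR's actual schema.

```python
import numpy as np

def calibration_curve_rows(model, probs, outcomes, n_bins=10):
    """Illustrative per-(model, bin) rows for a reliability diagram."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue  # omit empty bins rather than emitting NaN rows
        rows.append({
            "model": model,
            "bin_midpoint": (b + 0.5) / n_bins,
            "mean_forecast": float(probs[mask].mean()),   # x on the diagram
            "observed_freq": float(outcomes[mask].mean()),  # y on the diagram
            "n": int(mask.sum()),  # circle size on the diagram scales with sqrt(n)
        })
    return rows
```

Plotting observed_freq against mean_forecast puts a perfectly calibrated model on the diagonal, matching the diagram described above.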

Website:
- /calibration/ page with D3 reliability diagram and metrics table
- Baseline/tournament toggle, model checkboxes, tooltips
- entrypoint.sh copies calibration files from bucket to assets/data/
- Nav entry between Explore and Datasets

Note: Houtan added Plotly-based calibration plots in March 2024 (7729a9d)
but they were server-side only and removed during the codebase restructure.
This surfaces calibration as a first-class website feature with proper
metrics (ECE, Brier decomposition) and interactive visualization.

Zero changes to existing scoring functions, bootstrap, or leaderboard output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@orpheuslummis force-pushed the calibration-analysis branch from 9840d0c to 63e16ea on March 2, 2026 at 19:30
