
Add calibration analysis: ECE, reliability diagrams, Brier decomposition#161

Open
orpheuslummis wants to merge 1 commit into forecastingresearch:main from orpheuslummis:calibration-analysis

Conversation

@orpheuslummis

Summary

Adds calibration as a first-class analysis feature: ECE, reliability diagrams, Brier decomposition (Murphy 1973), and sharpness — computed nightly alongside the existing leaderboard and displayed on a new /calibration/ page.

ForecastBench reports Brier score, BSS, Peer Score, and oracle equivalence but no calibration metrics. Calibration — does P=0.7 mean 70%? — is arguably the most safety-relevant property of a forecasting system. Recent work (KalshiBench, Lu 2025) finds systematic LLM overconfidence, but ForecastBench doesn't surface this.

What it adds:

  • compute_calibration_metrics() — per-model ECE, Brier decomposition (reliability, resolution, uncertainty), sharpness
  • compute_calibration_curve_data() — per-(model, bin) data for reliability diagrams
  • write_calibration_data() — CSV + JSON to public release bucket (same pattern as write_sota_graph_csv())
  • Wired into make_leaderboard() after oracle removal, written per leaderboard type
  • entrypoint.sh.template copies calibration files from bucket to assets/data/ before Jekyll build
  • /calibration/ page with D3 reliability diagram (diagonal = perfect calibration, circle size ∝ √n), model checkboxes, baseline/tournament toggle, metrics table sorted by ECE
  • Nav entry between Explore and Datasets
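
The core of what `compute_calibration_metrics()` computes can be sketched as follows. This is a minimal illustration of the standard definitions (equal-width binning for ECE, Murphy 1973 for the Brier decomposition), not the PR's actual implementation; the function name and return keys here are assumptions.

```python
import numpy as np

def calibration_metrics(probs, outcomes, n_bins=10):
    """Illustrative per-model calibration metrics (names hypothetical).

    ECE: weighted mean |mean forecast - observed frequency| over bins.
    Murphy (1973): mean Brier = reliability - resolution + uncertainty.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(probs)
    base_rate = outcomes.mean()
    # Assign each forecast to an equal-width bin [0, 0.1), ..., [0.9, 1.0].
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        n_b = mask.sum()
        if n_b == 0:
            continue
        conf = probs[mask].mean()     # mean forecast within the bin
        acc = outcomes[mask].mean()   # observed frequency within the bin
        ece += (n_b / n) * abs(conf - acc)
        reliability += (n_b / n) * (conf - acc) ** 2
        resolution += (n_b / n) * (acc - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    # Sharpness here is the variance of the forecasts themselves.
    sharpness = float(np.mean((probs - probs.mean()) ** 2))
    return {"ece": ece, "reliability": reliability,
            "resolution": resolution, "uncertainty": uncertainty,
            "sharpness": sharpness}
```

A perfectly calibrated forecaster (e.g. P=0.7 resolving YES 70% of the time) gets ECE ≈ 0 and reliability ≈ 0 under this sketch.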

What it does not change: existing scoring functions, bootstrap simulation, leaderboard columns, GCP infrastructure, any existing code paths.

Note: Houtan added Plotly-based calibration plots in March 2024 (7729a9d) but they were server-side only and removed during the codebase restructure. This surfaces calibration on the website with proper metrics and interactive visualization.

Verification

  • 6 unit tests pass: Brier decomposition identity (reliability − resolution + uncertainty ≈ mean Brier), ECE bounds, curve shape
  • tests/generate_test_calibration_data.py generates synthetic data with known calibration properties for local testing
  • Jekyll build succeeds, /calibration/ page renders correctly
  • Linters (black, isort, flake8, pydocstyle) pass on changed files
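
The decomposition-identity test mentioned above can be sketched like this. Note the identity is exact only when forecasts are constant within each bin (otherwise within-bin variance terms appear, hence the "≈"); this sketch snaps forecasts to bin midpoints so an exact assertion is valid. Names and tolerances here are illustrative, not the PR's actual test code.

```python
import numpy as np

def check_brier_decomposition_identity():
    """Verify REL - RES + UNC == mean Brier on synthetic data where
    forecasts sit exactly at bin midpoints (identity is exact there)."""
    rng = np.random.default_rng(0)
    n_bins, n = 10, 1000
    # Forecasts at bin midpoints 0.05, 0.15, ..., 0.95.
    probs = (rng.integers(0, n_bins, size=n) + 0.5) / n_bins
    # Outcomes drawn with the stated probabilities (calibrated generator).
    outcomes = (rng.random(n) < probs).astype(float)

    base = outcomes.mean()
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    rel = res = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.sum() / n
        rel += w * (probs[mask].mean() - outcomes[mask].mean()) ** 2
        res += w * (outcomes[mask].mean() - base) ** 2
    unc = base * (1.0 - base)
    brier = np.mean((probs - outcomes) ** 2)
    assert abs(rel - res + unc - brier) < 1e-10

check_brier_decomposition_identity()
```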

ForecastBench reports Brier score, BSS, Peer Score, and oracle equivalence
but no calibration metrics. Calibration — does P=0.7 mean 70%? — is arguably
the most safety-relevant property of a forecasting system. This adds it as a
parallel analysis step computed nightly alongside the existing leaderboard.

Pipeline (src/leaderboard/main.py):
- compute_calibration_metrics(): per-model ECE, Brier decomposition
  (reliability, resolution, uncertainty per Murphy 1973), sharpness
- compute_calibration_curve_data(): per-(model, bin) data for reliability
  diagrams
- write_calibration_data(): CSV + JSON to public release bucket
- Wired into make_leaderboard() after oracle removal
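
The per-(model, bin) rows that feed the reliability diagram might look like the sketch below. Function name, column names, and the choice to omit empty bins are assumptions for illustration, not the PR's actual schema.

```python
import numpy as np

def calibration_curve_rows(model, probs, outcomes, n_bins=10):
    """Illustrative per-(model, bin) rows for a reliability diagram."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue  # omit empty bins rather than emitting NaN rows
        rows.append({
            "model": model,
            "bin_midpoint": (b + 0.5) / n_bins,
            "mean_forecast": float(probs[mask].mean()),   # x on the diagram
            "observed_freq": float(outcomes[mask].mean()),  # y on the diagram
            "n": int(mask.sum()),  # circle size on the diagram scales with sqrt(n)
        })
    return rows
```

Plotting observed_freq against mean_forecast puts a perfectly calibrated model on the diagonal, matching the diagram described above.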

Website:
- /calibration/ page with D3 reliability diagram and metrics table
- Baseline/tournament toggle, model checkboxes, tooltips
- entrypoint.sh copies calibration files from bucket to assets/data/
- Nav entry between Explore and Datasets

Note: Houtan added Plotly-based calibration plots in March 2024 (7729a9d)
but they were server-side only and removed during the codebase restructure.
This surfaces calibration as a first-class website feature with proper
metrics (ECE, Brier decomposition) and interactive visualization.

Zero changes to existing scoring functions, bootstrap, or leaderboard output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@orpheuslummis force-pushed the calibration-analysis branch from 9840d0c to 63e16ea on March 2, 2026 at 19:30
