Add calibration analysis: ECE, reliability diagrams, Brier decomposition #161
Open
orpheuslummis wants to merge 1 commit into forecastingresearch:main from
Conversation
ForecastBench reports Brier score, BSS, Peer Score, and oracle equivalence, but no calibration metrics. Calibration (does P=0.7 actually mean 70%?) is the most safety-relevant property of a forecasting system. This adds it as a parallel analysis step computed nightly alongside the existing leaderboard.

Pipeline (src/leaderboard/main.py):
- compute_calibration_metrics(): per-model ECE, Brier decomposition (reliability, resolution, uncertainty, per Murphy 1973), and sharpness
- compute_calibration_curve_data(): per-(model, bin) data for reliability diagrams
- write_calibration_data(): CSV + JSON to the public release bucket
- Wired into make_leaderboard() after oracle removal

Website:
- /calibration/ page with a D3 reliability diagram and metrics table
- Baseline/tournament toggle, model checkboxes, tooltips
- entrypoint.sh copies calibration files from the bucket to assets/data/
- Nav entry between Explore and Datasets

Note: Houtan added Plotly-based calibration plots in March 2024 (7729a9d), but they were server-side only and were removed during the codebase restructure. This surfaces calibration as a first-class website feature with proper metrics (ECE, Brier decomposition) and interactive visualization.

Zero changes to existing scoring functions, bootstrap, or leaderboard output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
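The per-model ECE computation can be sketched in a few stdlib lines. This is an illustrative reimplementation, not the PR's actual compute_calibration_metrics() code; the equal-width 10-bin scheme is an assumption.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average over bins of |mean forecast - observed frequency|.

    Equal-width bins on [0, 1]; the bin count and scheme are assumptions.
    """
    binned = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Clamp p == 1.0 into the top bin.
        binned[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in binned:
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_forecast - observed_freq)
    return ece

# A forecaster whose stated probabilities match observed frequencies has ECE 0.
print(expected_calibration_error([1.0, 1.0, 0.0], [1, 1, 0]))  # → 0.0
```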
Force-pushed from 9840d0c to 63e16ea.
Summary
Adds calibration as a first-class analysis feature: ECE, reliability diagrams, Brier decomposition (Murphy 1973), and sharpness, computed nightly alongside the existing leaderboard and displayed on a new /calibration/ page.

ForecastBench reports Brier score, BSS, Peer Score, and oracle equivalence, but no calibration metrics. Calibration (does P=0.7 actually mean 70%?) is arguably the most safety-relevant property of a forecasting system. Recent work (KalshiBench, Lu 2025) finds systematic LLM overconfidence, but ForecastBench doesn't surface this.
What it adds:
- compute_calibration_metrics(): per-model ECE, Brier decomposition (reliability, resolution, uncertainty), sharpness
- compute_calibration_curve_data(): per-(model, bin) data for reliability diagrams
- write_calibration_data(): CSV + JSON to the public release bucket (same pattern as write_sota_graph_csv())
- Wired into make_leaderboard() after oracle removal, written per leaderboard type
- entrypoint.sh.template copies calibration files from the bucket to assets/data/ before the Jekyll build
- /calibration/ page with a D3 reliability diagram (diagonal = perfect calibration, circle size ∝ √n), model checkboxes, baseline/tournament toggle, and a metrics table sorted by ECE

What it does not change: existing scoring functions, bootstrap simulation, leaderboard columns, GCP infrastructure, any existing code paths.
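For reference, the Murphy (1973) decomposition the pipeline computes splits the mean Brier score into reliability − resolution + uncertainty over a binning of the forecasts. A stdlib-only sketch, not the PR's actual implementation (the binning scheme here is an assumption):

```python
def brier_decomposition(probs, outcomes, n_bins=10):
    """Murphy (1973): mean Brier ≈ reliability - resolution + uncertainty.

    The identity is exact when forecasts are constant within each bin;
    otherwise the binned terms approximate it.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    uncertainty = base_rate * (1 - base_rate)
    binned = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        binned[min(int(p * n_bins), n_bins - 1)].append((p, y))
    reliability = resolution = 0.0
    for members in binned:
        if not members:
            continue
        weight = len(members) / n
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_freq = sum(y for _, y in members) / len(members)
        reliability += weight * (mean_forecast - observed_freq) ** 2
        resolution += weight * (observed_freq - base_rate) ** 2
    return reliability, resolution, uncertainty
```

The exactness of the identity for within-bin-constant forecasts is what makes the `reliability - resolution + uncertainty ≈ mean(brier)` check under Verification a useful sanity test.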
Note: Houtan added Plotly-based calibration plots in March 2024 (7729a9d) but they were server-side only and removed during the codebase restructure. This surfaces calibration on the website with proper metrics and interactive visualization.
Verification
- Sanity checks on the computed metrics (reliability − resolution + uncertainty ≈ mean(brier), ECE bounds, curve shape)
- tests/generate_test_calibration_data.py generates synthetic data with known calibration properties for local testing
- Verified that the /calibration/ page renders correctly
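The synthetic-data idea can be illustrated as follows. This sketch is not the actual tests/generate_test_calibration_data.py; distorting forecasts in log-odds space is an assumed way to inject a known amount of miscalibration.

```python
import math
import random

def synthetic_forecasts(n=10_000, overconfidence=1.5, seed=0):
    """Draw true probabilities, sample outcomes from them, and report
    forecasts pushed toward the extremes in log-odds space.

    overconfidence > 1 produces overconfident (miscalibrated) forecasts;
    overconfidence == 1 produces perfectly calibrated ones.
    """
    rng = random.Random(seed)
    reported, outcomes = [], []
    for _ in range(n):
        p_true = rng.uniform(0.01, 0.99)
        outcomes.append(1 if rng.random() < p_true else 0)
        logit = math.log(p_true / (1 - p_true))
        reported.append(1 / (1 + math.exp(-overconfidence * logit)))
    return reported, outcomes
```

Running the calibration metrics on this output with overconfidence set above and below 1 gives curves with a known shape to check the pipeline against.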