Add rel_depth and metric_depth evaluation modes #170
## Add relative + metric depth + global scale evaluation (`rel_depth`, `metric_depth`) to DA3 benchmark

### What / Why

The benchmark already covers pose and reconstruction (`pose`, `recon_unposed`, `recon_posed`). This PR adds depth-quality evaluation so we can measure how good the predicted depth maps are on datasets that provide GT depth, and do it consistently across any-view setups (varying #views, posed/unposed inference, etc.). It relates to #29 (evaluating depth estimation quality / global scale, as MapAnything provides).
The goal here is simple. We tried to follow the common evaluation protocol used by MoGe-2 (scale/affine alignment variants) and MapAnything (AbsRel + inlier ratio at τ = 1.03, with multi-view sweeps). We did not run an exhaustive sweep across all scenes, but we sanity-checked outputs on 4 datasets (7Scenes / HiRoom / ScanNet++ / ETH3D) and the numbers look to be in the expected range.
### What’s included

**New evaluation modes**

- `rel_depth`: reports scale-invariant and affine-invariant depth metrics
- `metric_depth`: reports raw metric depth metrics + scale diagnostics

Both modes write JSON to:

- `workspace/evaluation/metric_results/<dataset>_rel_depth.json`
- `workspace/evaluation/metric_results/<dataset>_metric_depth.json`

**Printer update**
The benchmark printer now groups depth metrics by interpretation (see the console output in the worked example below).
**ETH3D GT loading fixes (already in your branch)**

ETH3D depth/mask paths are derived from `dataset.data_root`, and masks are searched in:

- `masks_for_images/dslr_images/<image>.png` (preferred)
- `ground_truth_masks/dslr_images/<image>` (fallback)

This matches the actual dataset layout and avoids crashing when optional masks are missing.
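For illustration, a minimal sketch of that lookup order assuming a plain `pathlib` search (the helper name `find_eth3d_mask` is hypothetical; the actual logic in `evaluator.py` may differ):

```python
from pathlib import Path
from typing import Optional


def find_eth3d_mask(data_root: Path, image_name: str) -> Optional[Path]:
    """Return the ETH3D mask path for `image_name`, or None if no mask exists.

    Follows the search order described above, so evaluation can fall back to
    the GT validity mask alone when the optional mask files are missing.
    """
    candidates = [
        data_root / "masks_for_images" / "dslr_images" / f"{image_name}.png",  # preferred
        data_root / "ground_truth_masks" / "dslr_images" / image_name,         # fallback
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None
```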
### How to run (repro)

Example reproduction (single ScanNet++ scene): see the example commands documented in `docs/BENCHMARK.md`.

The benchmark still supports the existing modes (`pose`, `recon_unposed`, `recon_posed`), unchanged.
### Interpreting the metrics (quick cheatsheet)

All metrics are computed per-image, then averaged per-scene.

**“Scale-inv AbsRel (med)” / “Scale-inv δ@…”**

We find a single scalar `s_med` such that `median(s_med * pred) == median(gt)` on valid pixels, and evaluate on `pred * s_med`. If the model's absolute scale is already accurate, `s_med` will be close to 1.0 and the aligned numbers will be similar to the raw metric numbers.

**“Affine-inv … (depth)”**

Fits `a * pred + b` to match GT depth (weighted toward relative errors). This removes both scale and shift.

**“Affine-inv … (disp)”**

Fits an affine mapping in disparity (inverse depth), then converts back to depth.

**“Metric …”**

Raw metrics on the predicted depth without any alignment (i.e., “are you in meters?”). A sketch of these alignment variants follows below.
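To make the alignment variants concrete, here is a minimal NumPy sketch over 1-D arrays of valid depth pixels. The function names are illustrative rather than the actual API in `depth_metrics.py`, and the affine fits use plain least squares instead of the relative-error-weighted fit used in the benchmark:

```python
import numpy as np


def align_median(pred: np.ndarray, gt: np.ndarray) -> tuple[np.ndarray, float]:
    """Scale-invariant alignment: choose s_med so that
    median(s_med * pred) == median(gt), then evaluate on s_med * pred."""
    s_med = float(np.median(gt) / np.median(pred))
    return s_med * pred, s_med


def align_affine_depth(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Affine-invariant alignment in depth: fit a * pred + b to GT depth,
    removing both scale and shift (unweighted least squares in this sketch)."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return a * pred + b


def align_affine_disparity(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Affine-invariant alignment in disparity (inverse depth):
    fit a * (1/pred) + b to 1/gt, then convert back to depth."""
    pred_disp, gt_disp = 1.0 / pred, 1.0 / gt
    A = np.stack([pred_disp, np.ones_like(pred_disp)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, gt_disp, rcond=None)
    aligned_disp = np.clip(a * pred_disp + b, 1e-6, None)  # keep disparity positive
    return 1.0 / aligned_disp


def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """AbsRel: mean of |pred - gt| / gt over valid pixels."""
    return float(np.mean(np.abs(pred - gt) / gt))


def delta_inlier(pred: np.ndarray, gt: np.ndarray, tau: float = 1.03) -> float:
    """Inlier ratio at threshold tau: fraction of pixels where
    max(pred/gt, gt/pred) < tau."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < tau))
```

In this sketch, the `rel_depth` numbers would correspond to `abs_rel` / `delta_inlier` evaluated on the output of one of the alignments, while the “Metric …” rows use the raw prediction with no alignment at all.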
**Metric-depth error terms (in `*_metric_depth.json`)**

- `rmse`: root-mean-square error in depth units (usually meters). Sensitive to large outliers.
- `rmse_log`: RMSE in log-depth space (uses `log(depth)`), so it behaves more like a multiplicative / relative error.
- `si_log`: scale-invariant log RMSE, computed as `sqrt(mean(d^2) - mean(d)^2)` with `d = log(pred) - log(gt)`. This reduces the impact of a global scale shift and focuses more on shape.
- `valid_pixels_pct`: percentage of pixels that actually participated in evaluation after applying the GT validity mask (and any dataset-provided masks). This is computed per-image, then averaged per-scene.

**“Median scale (s)” + “Metric scale rel”**

- `Median scale (s)` is the median-alignment factor `s_med`.
- `Metric scale rel = |s_med - 1|` summarizes how far the model’s absolute scale is from perfect metric.
- `Metric scale log = |log(s_med)|` is the same idea in log space (symmetric for over-/under-scaling).
- `Valid pixels (%)` tells you how much of the image actually participated in evaluation after GT validity + basic checks.

A per-image sketch of these terms is given after the notes below.

**Notes:**
Are `rmse`/`rmse_log` and `metric_scale_rel`/`metric_scale_log` redundant? They measure different things:

- `rmse` and `rmse_log`: per-pixel depth error (shape + scale + local noise).
- `metric_scale_rel` and `metric_scale_log`: global scale error only (derived from `scale_med`).

They may correlate in practice, but the scale diagnostics are still useful for quickly answering: “Is the model wrong because the whole scene is scaled incorrectly, or because local depth is noisy?”
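And a hedged per-image sketch of the metric-depth terms above, assuming `pred` and `gt` are metric depth maps and `valid` is a boolean mask. The dictionary keys follow the JSON names, but the exact implementation in `depth_metrics.py` may differ:

```python
import numpy as np


def metric_depth_terms(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray) -> dict:
    """Per-image metric-depth terms, computed on valid pixels only."""
    p, g = pred[valid], gt[valid]
    d = np.log(p) - np.log(g)                   # log-depth residual
    s_med = float(np.median(g) / np.median(p))  # median-alignment factor
    return {
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),                # depth units (usually meters)
        "rmse_log": float(np.sqrt(np.mean(d ** 2))),                  # multiplicative / relative error
        "si_log": float(np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2)),  # scale-invariant log RMSE
        "scale_med": s_med,
        "metric_scale_rel": abs(s_med - 1.0),                         # |s_med - 1|
        "metric_scale_log": abs(float(np.log(s_med))),                # |log(s_med)|
        "valid_pixels_pct": float(100.0 * valid.mean()),
    }
```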
### Worked example (ScanNet++ `1ada7a0617`)

**Console output**

**What this says (plain English)**

- `s_med ≈ 1.015`, meaning the median scale is ~+1.5% vs GT.
- `Metric scale rel ≈ 0.017` is consistent with that: roughly 1.7% scale error overall.

**JSON outputs**

- `scannetpp_metric_depth.json` contains raw metric-depth metrics + scale diagnostics
- `scannetpp_rel_depth.json` contains scale/affine-invariant metrics

For details, please see the following attached JSON files.
scannetpp_metric_depth.json
scannetpp_rel_depth.json
### Files changed

- `src/depth_anything_3/bench/depth_metrics.py` (new): depth metrics + alignment helpers
- `src/depth_anything_3/bench/evaluator.py`: add modes + GT depth loading + JSON outputs
- `src/depth_anything_3/bench/print_metrics.py`: grouped depth printer
- `src/depth_anything_3/bench/configs/eval_bench.yaml`: expose new modes
- `docs/BENCHMARK.md`: document new modes & example commands

### Notes / Limitations
### Checklist