
Add relative + metric depth + global scale evaluation (rel_depth, metric_depth) to DA3 benchmark

What / Why

The benchmark already covers pose and reconstruction (pose, recon_unposed, recon_posed). This PR adds depth-quality evaluation so we can measure how good the predicted depth maps are on datasets that provide GT depth, consistently across any-view setups (varying #views, posed/unposed inference, etc.).

This PR is related to evaluating depth-estimation quality / global scale as MapAnything provides (#29).

The goal here is simple:

  • Relative depth quality (shape) → “does the depth map look right up to an unknown scale / affine?”
  • Metric depth quality (absolute) → “is the depth map correct in meters, without alignment?”
  • Metric scale accuracy → “if the model is almost metric, how far off is its scale?”

We tried to follow the common evaluation protocol used by MoGe-2 (scale/affine alignment variants) and MapAnything (AbsRel + inlier ratio at τ = 1.03, with multi-view sweeps). We did not run an exhaustive sweep across all scenes, but we sanity-checked outputs on 4 datasets (7Scenes / HiRoom / ScanNet++ / ETH3D) and the numbers are in the expected range.

What’s included

New evaluation modes

  • rel_depth: reports scale-invariant and affine-invariant depth metrics
  • metric_depth: reports raw metric depth metrics + scale diagnostics

Both modes write JSON to:

  • workspace/evaluation/metric_results/<dataset>_rel_depth.json
  • workspace/evaluation/metric_results/<dataset>_metric_depth.json

Printer update

The benchmark printer now groups depth metrics by interpretation:

  • Scale-invariant (median aligned)
  • Affine-invariant (depth and disparity)
  • Metric (no alignment)
  • Scale diagnostics (median scale factor, metric-scale error, valid pixels %)

ETH3D GT loading fixes (already in your branch)

ETH3D depth/mask paths are derived from dataset.data_root, and masks are searched in:

  1. masks_for_images/dslr_images/<image>.png (preferred)
  2. ground_truth_masks/dslr_images/<image> (fallback)

This matches the actual dataset layout and avoids crashing when optional masks are missing.
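As a rough sketch of the lookup logic (the function name and exact path handling here are illustrative, not the actual code in the evaluator), the fallback order could look like this:

import os

def find_eth3d_mask(data_root, image_name):
    # Try the preferred location first, then the fallback; return None
    # when no mask exists so missing optional masks do not crash evaluation.
    candidates = [
        os.path.join(data_root, "masks_for_images", "dslr_images", image_name + ".png"),
        os.path.join(data_root, "ground_truth_masks", "dslr_images", image_name),
    ]
    for path in candidates:
        if os.path.exists(path):
            return path
    return None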

How to run (repro)

Example reproduction (single ScanNet++ scene):

python -m depth_anything_3.bench.evaluator \
model.path="depth-anything/DA3NESTED-GIANT-LARGE-1.1" \
eval.datasets=[scannetpp] \
eval.scenes="1ada7a0617" \
eval.ref_view_strategy="saddle_balanced" \
eval.modes=[rel_depth,metric_depth]

The benchmark still supports the existing modes unchanged:

  • pose-only
  • recon-only
  • print-only
  • eval-only

Interpreting the metrics (quick cheatsheet)

All metrics are computed per-image, then averaged per-scene.

“Scale-inv AbsRel (med)” / “Scale-inv δ@…”

We find a single scalar s_med such that median(s_med * pred) == median(gt) on valid pixels, and evaluate on pred * s_med.

  • This removes global scale bias but keeps “shape” errors.
  • For models that are already close to metric, s_med will be close to 1.0 and the aligned numbers will be similar to raw metric numbers.
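As a minimal NumPy sketch of this alignment and the resulting AbsRel / δ metrics (names are illustrative, not the actual helpers in depth_metrics.py; δ@τ is assumed to be the usual max(pred/gt, gt/pred) < τ inlier ratio):

import numpy as np

def scale_inv_metrics(pred, gt, mask, taus=(1.03, 1.25)):
    # Median alignment: one scalar s_med with median(s_med * pred) == median(gt).
    p, g = pred[mask], gt[mask]
    s_med = np.median(g) / np.median(p)
    p_aligned = s_med * p
    # AbsRel after alignment: mean(|pred - gt| / gt).
    absrel = float(np.mean(np.abs(p_aligned - g) / g))
    # δ@τ: percentage of pixels with max(pred/gt, gt/pred) < τ.
    ratio = np.maximum(p_aligned / g, g / p_aligned)
    deltas = {f"delta@{t}": 100.0 * float(np.mean(ratio < t)) for t in taus}
    return {"scale_med": float(s_med), "absrel": absrel, **deltas}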

“Affine-inv … (depth)”

Fits a * pred + b to match GT depth (weighted toward relative errors). This removes both scale and shift.

  • Useful if a method’s output has an additive bias (common in some representations).
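A simple weighted least-squares version of this fit could look like the sketch below (weighting by 1/gt is one way to emphasize relative errors; the actual weighting used in the PR may differ):

import numpy as np

def affine_align_depth(pred, gt, mask):
    # Fit a * pred + b to GT depth on valid pixels (weighted least squares).
    p, g = pred[mask], gt[mask]
    w = 1.0 / np.clip(g, 1e-6, None)
    A = np.stack([p, np.ones_like(p)], axis=1) * w[:, None]
    (a, b), *_ = np.linalg.lstsq(A, g * w, rcond=None)
    # Metrics are then computed on the aligned prediction, as above.
    return a * pred + b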

“Affine-inv … (disp)”

Fits an affine mapping in disparity (inverse depth), then converts back to depth.

  • This is commonly used when models are trained/predicted in disparity space (or behave closer to a disparity-affine transform).
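The disparity-space variant is the same idea applied to inverse depth, roughly:

import numpy as np

def affine_align_disparity(pred, gt, mask, eps=1e-6):
    # Fit a * (1/pred) + b to GT disparity (1/gt), then invert back to depth.
    p_disp = 1.0 / np.clip(pred[mask], eps, None)
    g_disp = 1.0 / np.clip(gt[mask], eps, None)
    A = np.stack([p_disp, np.ones_like(p_disp)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, g_disp, rcond=None)
    aligned_disp = a / np.clip(pred, eps, None) + b
    return 1.0 / np.clip(aligned_disp, eps, None)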

“Metric …”

Raw metrics on the predicted depth without any alignment (i.e., “are you in meters?”).

Metric-depth error terms (in *_metric_depth.json)

  • rmse: root-mean-square error in depth units (usually meters). Sensitive to large outliers.
  • rmse_log: RMSE in log-depth space (uses log(depth)), so it behaves more like a multiplicative / relative error.
  • si_log: scale-invariant log RMSE, computed as sqrt(mean(d^2) - mean(d)^2) with d = log(pred) - log(gt). This reduces the impact of a global scale shift and focuses more on shape.
  • valid_pixels_pct: percentage of pixels that actually participated in evaluation after applying the GT validity mask (and any dataset-provided masks). This is computed per-image, then averaged per-scene.
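In NumPy terms, these error terms boil down to roughly the following (illustrative only; the real implementation lives in depth_metrics.py):

import numpy as np

def metric_depth_errors(pred, gt, mask):
    p, g = pred[mask], gt[mask]
    d = np.log(p) - np.log(g)
    return {
        # Per-pixel error in depth units (usually meters), no alignment.
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        # RMSE on log-depth: behaves like a multiplicative / relative error.
        "rmse_log": float(np.sqrt(np.mean(d ** 2))),
        # Scale-invariant log RMSE: sqrt(mean(d^2) - mean(d)^2).
        "si_log": float(np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2)),
        # Share of pixels that survived the GT validity / dataset masks.
        "valid_pixels_pct": 100.0 * float(mask.mean()),
    }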

“Median scale (s)” + “Metric scale rel”

  • Median scale (s) is the median-alignment factor s_med.
  • Metric scale rel = |s_med - 1| summarizes how far the model’s absolute scale is from perfect metric.
  • Metric scale log = |log(s_med)| is the same idea in log space (symmetric for over-/under-scaling).
  • Valid pixels (%) tells you how much of the image actually participated in evaluation after GT validity + basic checks.
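These diagnostics are simple functions of the same s_med used above, e.g.:

import numpy as np

def scale_diagnostics(pred, gt, mask):
    # s_med is the same median-alignment factor as in the scale-invariant metrics.
    s_med = float(np.median(gt[mask]) / np.median(pred[mask]))
    return {
        "scale_med": s_med,
        "metric_scale_rel": abs(s_med - 1.0),           # |s_med - 1|
        "metric_scale_log": abs(float(np.log(s_med))),  # |log(s_med)|
    }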

Note: aren't rmse/rmse_log and metric_scale_rel/metric_scale_log redundant?

They measure different things:

  • rmse and rmse_log: per-pixel depth error (shape + scale + local noise).
  • metric_scale_rel and metric_scale_log: global scale error only (derived from scale_med).

They may correlate in practice, but scale diagnostics are still useful for quickly answering:
“Is the model wrong because the whole scene is scaled incorrectly, or because local depth is noisy?”

Worked example (ScanNet++ 1ada7a0617)

Console output

📏 DEPTH ESTIMATION
---------------------------------------------------------------------------------------
Metric                     Avg         HiRoom      ETH3D       7Scenes     ScanNet++
---------------------------------------------------------------------------------------
Scale-inv AbsRel (med)     0.0309      N/A         N/A         N/A         0.0309
Scale-inv δ@1.03 (med)     76.7523     N/A         N/A         N/A         76.7523
Scale-inv δ@1.25 (med)     97.8438     N/A         N/A         N/A         97.8438
---------------------------------------------------------------------------------------
Affine-inv AbsRel (depth)  0.0285      N/A         N/A         N/A         0.0285
Affine-inv δ@1.03 (depth)  81.0307     N/A         N/A         N/A         81.0307
Affine-inv δ@1.25 (depth)  97.8475     N/A         N/A         N/A         97.8475
Affine-inv AbsRel (disp)   0.0285      N/A         N/A         N/A         0.0285
Affine-inv δ@1.03 (disp)   81.7547     N/A         N/A         N/A         81.7547
Affine-inv δ@1.25 (disp)   97.8475     N/A         N/A         N/A         97.8475
---------------------------------------------------------------------------------------
Metric AbsRel              0.0309      N/A         N/A         N/A         0.0309
Metric δ@1.03              78.5885     N/A         N/A         N/A         78.5885
Metric δ@1.25              97.8394     N/A         N/A         N/A         97.8394
---------------------------------------------------------------------------------------
Median scale (s)           1.0153      N/A         N/A         N/A         1.0153
Metric scale rel           0.0170      N/A         N/A         N/A         0.0170
Valid pixels (%)           98.3680     N/A         N/A         N/A         98.3680

What this says (plain English)

  • The model is already very close to metric here: s_med ≈ 1.015, meaning the median scale is ~+1.5% vs GT.
  • Metric scale rel ≈ 0.017 matches that: about 1.7% scale error overall.
  • Scale-invariant AbsRel is ~0.031, so after removing global scale, the “shape” error is still around 3%.
  • Allowing a full affine correction improves AbsRel a bit (~0.0285), suggesting there’s a small systematic bias that a +b term can absorb.

JSON outputs

  • scannetpp_metric_depth.json contains raw metric-depth metrics + scale diagnostics
  • scannetpp_rel_depth.json contains scale/affine-invariant metrics

For details, see the attached JSON files:
scannetpp_metric_depth.json
scannetpp_rel_depth.json

Files changed

  • src/depth_anything_3/bench/depth_metrics.py (new): depth metrics + alignment helpers
  • src/depth_anything_3/bench/evaluator.py: add modes + GT depth loading + JSON outputs
  • src/depth_anything_3/bench/print_metrics.py: grouped depth printer
  • src/depth_anything_3/bench/configs/eval_bench.yaml: expose new modes
  • docs/BENCHMARK.md: document new modes & example commands

Notes / Limitations

  • This PR evaluates per-view depth maps against GT depth. It does not attempt to score “cross-view consistency” directly.
  • Masking uses dataset GT validity + dataset-provided masks where available. We don’t currently gate evaluation by model confidence (unless the dataset’s GT mask already removes regions like sky).

Checklist

  • Ran at least one scene for each dataset (ETH3D / 7Scenes / ScanNet++ / HiRoom)
  • Compare results with MoGe-2/MapAnything
  • Make depth evaluation code faster
