
Add relative + metric depth + global scale evaluation (rel_depth, metric_depth) to DA3 benchmark

What / Why

The benchmark already covers pose and reconstruction (pose, recon_unposed, recon_posed). This PR adds depth-quality evaluation so we can measure how good the predicted depth maps are on datasets that provide GT depth, consistently across any-view setups (varying #views, posed/unposed inference, etc.).

This PR is related to evaluating depth-estimation quality / global scale as MapAnything provides (#29).

The goal here is simple:

  • Relative depth quality (shape) → “does the depth map look right up to an unknown scale / affine?”
  • Metric depth quality (absolute) → “is the depth map correct in meters, without alignment?”
  • Metric scale accuracy → “if the model is almost metric, how far off is its scale?”

We tried to follow the common evaluation protocol used by MoGe-2 (scale/affine alignment variants) and MapAnything (AbsRel + inlier ratio at τ = 1.03, with multi-view sweeps). We did not run an exhaustive sweep across all scenes, but we sanity-checked outputs on 4 datasets (7Scenes / HiRoom / ScanNet++ / ETH3D) and the numbers are in the expected range.

What’s included

New evaluation modes

  • rel_depth: reports scale-invariant and affine-invariant depth metrics
  • metric_depth: reports raw metric depth metrics + scale diagnostics

Both modes write JSON to:

  • workspace/evaluation/metric_results/<dataset>_rel_depth.json
  • workspace/evaluation/metric_results/<dataset>_metric_depth.json

Printer update

The benchmark printer now groups depth metrics by interpretation:

  • Scale-invariant (median aligned)
  • Affine-invariant (depth and disparity)
  • Metric (no alignment)
  • Scale diagnostics (median scale factor, metric-scale error, valid pixels %)

ETH3D GT loading fixes (already in your branch)

ETH3D depth/mask paths are derived from dataset.data_root, and masks are searched in:

  1. masks_for_images/dslr_images/<image>.png (preferred)
  2. ground_truth_masks/dslr_images/<image> (fallback)

This matches the actual dataset layout and avoids crashing when optional masks are missing.
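As a rough sketch of the lookup logic (the function name and exact path handling here are illustrative, not the actual code in the evaluator), the fallback order could look like this:

import os

def find_eth3d_mask(data_root, image_name):
    # Try the preferred location first, then the fallback; return None
    # when no mask exists so missing optional masks do not crash evaluation.
    candidates = [
        os.path.join(data_root, "masks_for_images", "dslr_images", image_name + ".png"),
        os.path.join(data_root, "ground_truth_masks", "dslr_images", image_name),
    ]
    for path in candidates:
        if os.path.exists(path):
            return path
    return None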

How to run (repro)

Example reproduction (single ScanNet++ scene):

python -m depth_anything_3.bench.evaluator \
model.path="depth-anything/DA3NESTED-GIANT-LARGE-1.1" \
eval.datasets=[scannetpp] \
eval.scenes="1ada7a0617" \
eval.ref_view_strategy="saddle_balanced" \
eval.modes=[rel_depth,metric_depth]

The benchmark still supports the existing modes unchanged:

  • pose-only
  • recon-only
  • print-only
  • eval-only

Interpreting the metrics (quick cheatsheet)

All metrics are computed per-image, then averaged per-scene.

“Scale-inv AbsRel (med)” / “Scale-inv δ@…”

We find a single scalar s_med such that median(s_med * pred) == median(gt) on valid pixels, and evaluate on pred * s_med.

  • This removes global scale bias but keeps “shape” errors.
  • For models that are already close to metric, s_med will be close to 1.0 and the aligned numbers will be similar to raw metric numbers.
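As a minimal NumPy sketch of this alignment and the resulting AbsRel / δ metrics (names are illustrative, not the actual helpers in depth_metrics.py; δ@τ is assumed to be the usual max(pred/gt, gt/pred) < τ inlier ratio):

import numpy as np

def scale_inv_metrics(pred, gt, mask, taus=(1.03, 1.25)):
    # Median alignment: one scalar s_med with median(s_med * pred) == median(gt).
    p, g = pred[mask], gt[mask]
    s_med = np.median(g) / np.median(p)
    p_aligned = s_med * p
    # AbsRel after alignment: mean(|pred - gt| / gt).
    absrel = float(np.mean(np.abs(p_aligned - g) / g))
    # δ@τ: percentage of pixels with max(pred/gt, gt/pred) < τ.
    ratio = np.maximum(p_aligned / g, g / p_aligned)
    deltas = {f"delta@{t}": 100.0 * float(np.mean(ratio < t)) for t in taus}
    return {"scale_med": float(s_med), "absrel": absrel, **deltas}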

“Affine-inv … (depth)”

Fits a * pred + b to match GT depth (weighted toward relative errors). This removes both scale and shift.

  • Useful if a method’s output has an additive bias (common in some representations).
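A simple weighted least-squares version of this fit could look like the sketch below (weighting by 1/gt is one way to emphasize relative errors; the actual weighting used in the PR may differ):

import numpy as np

def affine_align_depth(pred, gt, mask):
    # Fit a * pred + b to GT depth on valid pixels (weighted least squares).
    p, g = pred[mask], gt[mask]
    w = 1.0 / np.clip(g, 1e-6, None)
    A = np.stack([p, np.ones_like(p)], axis=1) * w[:, None]
    (a, b), *_ = np.linalg.lstsq(A, g * w, rcond=None)
    # Metrics are then computed on the aligned prediction, as above.
    return a * pred + b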

“Affine-inv … (disp)”

Fits an affine mapping in disparity (inverse depth), then converts back to depth.

  • This is commonly used when models are trained/predicted in disparity space (or behave closer to a disparity-affine transform).
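The disparity-space variant is the same idea applied to inverse depth, roughly:

import numpy as np

def affine_align_disparity(pred, gt, mask, eps=1e-6):
    # Fit a * (1/pred) + b to GT disparity (1/gt), then invert back to depth.
    p_disp = 1.0 / np.clip(pred[mask], eps, None)
    g_disp = 1.0 / np.clip(gt[mask], eps, None)
    A = np.stack([p_disp, np.ones_like(p_disp)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, g_disp, rcond=None)
    aligned_disp = a / np.clip(pred, eps, None) + b
    return 1.0 / np.clip(aligned_disp, eps, None)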

“Metric …”

Raw metrics on the predicted depth without any alignment (i.e., “are you in meters?”).

Metric-depth error terms (in *_metric_depth.json)

  • rmse: root-mean-square error in depth units (usually meters). Sensitive to large outliers.
  • rmse_log: RMSE in log-depth space (uses log(depth)), so it behaves more like a multiplicative / relative error.
  • si_log: scale-invariant log RMSE, computed as sqrt(mean(d^2) - mean(d)^2) with d = log(pred) - log(gt). This reduces the impact of a global scale shift and focuses more on shape.
  • valid_pixels_pct: percentage of pixels that actually participated in evaluation after applying the GT validity mask (and any dataset-provided masks). This is computed per-image, then averaged per-scene.
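In NumPy terms, these error terms boil down to roughly the following (illustrative only; the real implementation lives in depth_metrics.py):

import numpy as np

def metric_depth_errors(pred, gt, mask):
    p, g = pred[mask], gt[mask]
    d = np.log(p) - np.log(g)
    return {
        # Per-pixel error in depth units (usually meters), no alignment.
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        # RMSE on log-depth: behaves like a multiplicative / relative error.
        "rmse_log": float(np.sqrt(np.mean(d ** 2))),
        # Scale-invariant log RMSE: sqrt(mean(d^2) - mean(d)^2).
        "si_log": float(np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2)),
        # Share of pixels that survived the GT validity / dataset masks.
        "valid_pixels_pct": 100.0 * float(mask.mean()),
    }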

“Median scale (s)” + “Metric scale rel”

  • Median scale (s) is the median-alignment factor s_med.
  • Metric scale rel = |s_med - 1| summarizes how far the model’s absolute scale is from perfect metric.
  • Metric scale log = |log(s_med)| is the same idea in log space (symmetric for over-/under-scaling).
  • Valid pixels (%) tells you how much of the image actually participated in evaluation after GT validity + basic checks.
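These diagnostics are simple functions of the same s_med used above, e.g.:

import numpy as np

def scale_diagnostics(pred, gt, mask):
    # s_med is the same median-alignment factor as in the scale-invariant metrics.
    s_med = float(np.median(gt[mask]) / np.median(pred[mask]))
    return {
        "scale_med": s_med,
        "metric_scale_rel": abs(s_med - 1.0),           # |s_med - 1|
        "metric_scale_log": abs(float(np.log(s_med))),  # |log(s_med)|
    }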

Note: aren't rmse/rmse_log and metric_scale_rel/metric_scale_log redundant?

They measure different things:

  • rmse and rmse_log: per-pixel depth error (shape + scale + local noise).
  • metric_scale_rel and metric_scale_log: global scale error only (derived from scale_med).

They may correlate in practice, but scale diagnostics are still useful for quickly answering:
“Is the model wrong because the whole scene is scaled incorrectly, or because local depth is noisy?”

Worked example (ScanNet++ 1ada7a0617)

Console output

📏 DEPTH ESTIMATION
---------------------------------------------------------------------------------------
Metric                     Avg         HiRoom      ETH3D       7Scenes     ScanNet++
---------------------------------------------------------------------------------------
Scale-inv AbsRel (med)     0.0309      N/A         N/A         N/A         0.0309
Scale-inv δ@1.03 (med)     76.7523     N/A         N/A         N/A         76.7523
Scale-inv δ@1.25 (med)     97.8438     N/A         N/A         N/A         97.8438
---------------------------------------------------------------------------------------
Affine-inv AbsRel (depth)  0.0285      N/A         N/A         N/A         0.0285
Affine-inv δ@1.03 (depth)  81.0307     N/A         N/A         N/A         81.0307
Affine-inv δ@1.25 (depth)  97.8475     N/A         N/A         N/A         97.8475
Affine-inv AbsRel (disp)   0.0285      N/A         N/A         N/A         0.0285
Affine-inv δ@1.03 (disp)   81.7547     N/A         N/A         N/A         81.7547
Affine-inv δ@1.25 (disp)   97.8475     N/A         N/A         N/A         97.8475
---------------------------------------------------------------------------------------
Metric AbsRel              0.0309      N/A         N/A         N/A         0.0309
Metric δ@1.03              78.5885     N/A         N/A         N/A         78.5885
Metric δ@1.25              97.8394     N/A         N/A         N/A         97.8394
---------------------------------------------------------------------------------------
Median scale (s)           1.0153      N/A         N/A         N/A         1.0153
Metric scale rel           0.0170      N/A         N/A         N/A         0.0170
Valid pixels (%)           98.3680     N/A         N/A         N/A         98.3680

What this says (plain English)

  • The model is already very close to metric here: s_med ≈ 1.015, meaning the median scale is ~+1.5% vs GT.
  • Metric scale rel ≈ 0.017 matches that: about 1.7% scale error overall.
  • Scale-invariant AbsRel is ~0.031, so after removing global scale, the “shape” error is still around 3%.
  • Allowing a full affine correction improves AbsRel a bit (~0.0285), suggesting there’s a small systematic bias that a +b term can absorb.

JSON outputs

  • scannetpp_metric_depth.json contains raw metric-depth metrics + scale diagnostics
  • scannetpp_rel_depth.json contains scale/affine-invariant metrics

For details, see the attached JSON files:
scannetpp_metric_depth.json
scannetpp_rel_depth.json

Files changed

  • src/depth_anything_3/bench/depth_metrics.py (new): depth metrics + alignment helpers
  • src/depth_anything_3/bench/evaluator.py: add modes + GT depth loading + JSON outputs
  • src/depth_anything_3/bench/print_metrics.py: grouped depth printer
  • src/depth_anything_3/bench/configs/eval_bench.yaml: expose new modes
  • docs/BENCHMARK.md: document new modes & example commands

Notes / Limitations

  • This PR evaluates per-view depth maps against GT depth. It does not attempt to score “cross-view consistency” directly.
  • Masking uses dataset GT validity + dataset-provided masks where available. We don’t currently gate evaluation by model confidence (unless the dataset’s GT mask already removes regions like sky).

Checklist

  • Ran at least one scene for each dataset (ETH3D / 7Scenes / ScanNet++ / HiRoom)
  • Compare results with MoGe-2/MapAnything
  • Make depth evaluation code faster
