Optimize cluster-robust vcov aggregation by yangzichao · Pull Request #117 · igerber/diff-diff

yangzichao · 2026-01-28T03:40:50Z

Benchmark (200k x 10, 2k clusters): baseline 0.0071s mean, new 0.0068s mean (~4-6% faster).

Methodology references (required if estimator / math changes)

Method name(s): Cluster-robust sandwich variance estimator (CRSE), HC1 small-sample adjustment
Paper / source link(s): N/A (implementation detail; estimator unchanged)
Any intentional deviations from the source (and why): None; only aggregation strategy changed (pandas groupby → NumPy add.at) in the NumPy fallback.

Validation

Tests added/updated: None
Backtest / simulation / notebook evidence (if applicable): Micro-benchmark run locally (200k rows, 10 regressors, 2k clusters).

Security / privacy

Confirm no secrets/PII in this PR:

Benchmark (200k x 10, 2k clusters): baseline 0.0071s mean, new 0.0068s mean (~4-6% faster).

igerber · 2026-01-31T20:09:43Z

/ai-review

github-actions · 2026-01-31T20:15:07Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 16ba52dc166ca606dee59dfefb022caa68da5b83

Overall Assessment

⚠️ Needs changes

Executive Summary

Cluster-robust vcov math/adjustment matches the Methodology Registry, but missing-cluster handling now changes results and must be defined explicitly.
NumPy path drops NaN cluster IDs and recomputes G; Rust path still treats NaNs as a cluster, so SEs can diverge across backends.
Docstrings still reference pandas groupby after the NumPy aggregation change.
Add tests for NaN cluster IDs and backend parity under missing clusters.

Methodology

P1 Impact: Cluster-robust sandwich variance estimator (CRSE/HC1 adjustment) now behaves differently when cluster_ids contain NaNs; NumPy drops NaNs and recomputes G, while Rust still includes them (factorized to -1), and n/XtX still reflect the full sample. This makes SEs backend-dependent and can misstate variance. Fix: define an explicit missing-cluster policy (error or drop) and implement it consistently in both Python and Rust; if dropping, filter X/residuals and use n_eff in the adjustment, and document the policy in the Methodology Registry. Locations: diff_diff/linalg.py:L831-L856, diff_diff/linalg.py:L760-L763, rust/src/linalg.rs:L227-L268.

Code Quality

No issues found.

Performance

No issues found.

Maintainability

No issues found.

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

P3 Impact: Module and function docstrings still claim pandas groupby aggregation, which is now inaccurate and can mislead maintainers. Fix: update the module header and compute_robust_vcov/_compute_robust_vcov_numpy docs to reflect NumPy aggregation. Locations: diff_diff/linalg.py:L7-L10, diff_diff/linalg.py:L742-L753, diff_diff/linalg.py:L816-L819.
P2 Impact: Missing-cluster/NaN behavior and backend parity aren’t tested; regressions will be silent. Fix: extend tests/test_rust_backend.py:L638-L655 to include NaN/missing cluster_ids and assert the chosen policy, and add a NumPy-only test in tests/test_linalg.py:L468-L521 covering the missing-cluster edge case.

igerber · 2026-01-31T20:21:40Z

Performance: np.add.at is slower, not faster (Critical)

np.add.at uses unbuffered in-place scatter operations and is well-documented as slow in the NumPy community. At the PR's claimed scale (200k
rows, 2k clusters), pd.groupby().sum() delegates to optimized Cython internals and is faster. The reported 0.0003s difference (0.0071s vs
0.0068s) is within typical benchmark noise for a 7ms operation.

If the author wants to pursue optimization, better alternatives are:

Sparse indicator matrix: scipy.sparse.csr_matrix((ones, (codes, arange(n))), shape=(G, n)) @ scores
Column-wise np.bincount with weights

Optimize cluster-robust vcov aggregation

16ba52d

Benchmark (200k x 10, 2k clusters): baseline 0.0071s mean, new 0.0068s mean (~4-6% faster).

igerber closed this Jan 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize cluster-robust vcov aggregation#117

Optimize cluster-robust vcov aggregation#117
yangzichao wants to merge 1 commit intoigerber:mainfrom
yangzichao:main

yangzichao commented Jan 28, 2026

Uh oh!

igerber commented Jan 31, 2026

Uh oh!

github-actions bot commented Jan 31, 2026

Uh oh!

igerber commented Jan 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yangzichao commented Jan 28, 2026

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

igerber commented Jan 31, 2026

Uh oh!

github-actions bot commented Jan 31, 2026

Uh oh!

igerber commented Jan 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants