Skip to content

Optimize cluster-robust vcov aggregation#117

Closed
yangzichao wants to merge 1 commit intoigerber:mainfrom
yangzichao:main
Closed

Optimize cluster-robust vcov aggregation#117
yangzichao wants to merge 1 commit intoigerber:mainfrom
yangzichao:main

Conversation

@yangzichao
Copy link
Copy Markdown

Benchmark (200k x 10, 2k clusters): baseline 0.0071s mean, new 0.0068s mean (~4-6% faster).

Methodology references (required if estimator / math changes)

  • Method name(s): Cluster-robust sandwich variance estimator (CRSE), HC1 small-sample adjustment
  • Paper / source link(s): N/A (implementation detail; estimator unchanged)
  • Any intentional deviations from the source (and why): None; only aggregation strategy changed (pandas groupby → NumPy add.at) in the NumPy fallback.

Validation

  • Tests added/updated: None
  • Backtest / simulation / notebook evidence (if applicable): Micro-benchmark run locally (200k rows, 10 regressors, 2k clusters).

Security / privacy

  • Confirm no secrets/PII in this PR:

Benchmark (200k x 10, 2k clusters): baseline 0.0071s mean, new 0.0068s mean (~4-6% faster).
@igerber
Copy link
Copy Markdown
Owner

igerber commented Jan 31, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 16ba52dc166ca606dee59dfefb022caa68da5b83


Overall Assessment

  • ⚠️ Needs changes

Executive Summary

  • Cluster-robust vcov math/adjustment matches the Methodology Registry, but missing-cluster handling now changes results and must be defined explicitly.
  • NumPy path drops NaN cluster IDs and recomputes G; Rust path still treats NaNs as a cluster, so SEs can diverge across backends.
  • Docstrings still reference pandas groupby after the NumPy aggregation change.
  • Add tests for NaN cluster IDs and backend parity under missing clusters.

Methodology

  • P1 Impact: Cluster-robust sandwich variance estimator (CRSE/HC1 adjustment) now behaves differently when cluster_ids contain NaNs; NumPy drops NaNs and recomputes G, while Rust still includes them (factorized to -1), and n/XtX still reflect the full sample. This makes SEs backend-dependent and can misstate variance. Fix: define an explicit missing-cluster policy (error or drop) and implement it consistently in both Python and Rust; if dropping, filter X/residuals and use n_eff in the adjustment, and document the policy in the Methodology Registry. Locations: diff_diff/linalg.py:L831-L856, diff_diff/linalg.py:L760-L763, rust/src/linalg.rs:L227-L268.

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P3 Impact: Module and function docstrings still claim pandas groupby aggregation, which is now inaccurate and can mislead maintainers. Fix: update the module header and compute_robust_vcov/_compute_robust_vcov_numpy docs to reflect NumPy aggregation. Locations: diff_diff/linalg.py:L7-L10, diff_diff/linalg.py:L742-L753, diff_diff/linalg.py:L816-L819.
  • P2 Impact: Missing-cluster/NaN behavior and backend parity aren’t tested; regressions will be silent. Fix: extend tests/test_rust_backend.py:L638-L655 to include NaN/missing cluster_ids and assert the chosen policy, and add a NumPy-only test in tests/test_linalg.py:L468-L521 covering the missing-cluster edge case.

@igerber
Copy link
Copy Markdown
Owner

igerber commented Jan 31, 2026

Performance: np.add.at is slower, not faster (Critical)

np.add.at uses unbuffered in-place scatter operations and is well-documented as slow in the NumPy community. At the PR's claimed scale (200k
rows, 2k clusters), pd.groupby().sum() delegates to optimized Cython internals and is faster. The reported 0.0003s difference (0.0071s vs
0.0068s) is within typical benchmark noise for a 7ms operation.

If the author wants to pursue optimization, better alternatives are:

  • Sparse indicator matrix: scipy.sparse.csr_matrix((ones, (codes, arange(n))), shape=(G, n)) @ scores
  • Column-wise np.bincount with weights

@igerber igerber closed this Jan 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants