Merged
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Changed
- Remove Rust outer-loop variance estimation for SyntheticDiD (placebo and bootstrap)
- Fixes SE mismatch between pure Python and Rust backends (different RNG sequences)
- Fixes Rust performance regression at 1k+ scale (memory bandwidth saturation from rayon parallelism)
- Inner Frank-Wolfe weight computation still uses Rust when available

### Documentation
- Re-run SyntheticDiD benchmarks against R after Frank-Wolfe methodology rewrite
- Updated `docs/benchmarks.rst` SDID validation results, performance tables, and known differences
- ATT now matches R to < 1e-10 (previously 0.3% diff) since both use the Frank-Wolfe optimizer

## [2.3.0] - 2026-02-09

### Added
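
The inner Frank-Wolfe step kept in Rust (the `sc_weight_fw` binding referenced later in this PR) solves a simplex-constrained least-squares problem for the synthetic weights. A minimal pure-Python sketch of classic Frank-Wolfe — illustrative only, not the library's implementation, which adds regularization (`zeta`), an intercept, and two-pass sparsification:

```python
import numpy as np

def frank_wolfe_simplex(Y, y, max_iter=5000, tol=1e-12):
    """Toy Frank-Wolfe for min_w ||Y @ w - y||^2 with w on the simplex.

    Sketch only: the real solver also handles zeta regularization,
    an intercept term, and a sparsify-then-resolve second pass.
    """
    n = Y.shape[1]
    w = np.full(n, 1.0 / n)                 # start at the simplex center
    for t in range(max_iter):
        grad = 2.0 * Y.T @ (Y @ w - y)      # least-squares gradient
        s = np.zeros(n)
        s[np.argmin(grad)] = 1.0            # linear minimizer is a vertex
        step = 2.0 / (t + 2.0)              # classic diminishing step size
        w_next = w + step * (s - w)
        if np.linalg.norm(w_next - w) < tol:
            return w_next
        w = w_next
    return w
```

Each iteration moves toward a single simplex vertex, which is why Frank-Wolfe iterates stay sparse — the property the two-pass sparsification exploits.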
3 changes: 0 additions & 3 deletions CLAUDE.md
@@ -167,9 +167,6 @@ cross-platform compilation - no OpenBLAS or Intel MKL installation required.
- `compute_unit_distance_matrix()` - Parallel pairwise RMSE distance computation (4-8x speedup)
- `loocv_grid_search()` - Parallel LOOCV across tuning parameters (10-50x speedup)
- `bootstrap_trop_variance()` - Parallel bootstrap variance estimation (5-15x speedup)
- **`rust/src/sdid_variance.rs`** - SDID variance estimation acceleration:
- `placebo_variance_sdid()` - Parallel placebo SE computation (~8x speedup)
- `bootstrap_variance_sdid()` - Parallel bootstrap SE computation (~6x speedup)
- Uses pure-Rust `faer` library for linear algebra (no external BLAS/LAPACK dependencies)
- Cross-platform: builds on Linux, macOS, and Windows without additional setup
- Provides 4-8x speedup for SyntheticDiD, 5-20x speedup for TROP
12 changes: 0 additions & 12 deletions diff_diff/_backend.py
@@ -35,9 +35,6 @@
compute_time_weights as _rust_compute_time_weights,
compute_noise_level as _rust_compute_noise_level,
sc_weight_fw as _rust_sc_weight_fw,
# SDID variance estimation (parallel placebo and bootstrap)
placebo_variance_sdid as _rust_placebo_variance_sdid,
bootstrap_variance_sdid as _rust_bootstrap_variance_sdid,
)
_rust_available = True
except ImportError:
@@ -59,9 +56,6 @@
_rust_compute_time_weights = None
_rust_compute_noise_level = None
_rust_sc_weight_fw = None
# SDID variance estimation (parallel placebo and bootstrap)
_rust_placebo_variance_sdid = None
_rust_bootstrap_variance_sdid = None

# Determine final backend based on environment variable and availability
if _backend_env == 'python':
@@ -84,9 +78,6 @@
_rust_compute_time_weights = None
_rust_compute_noise_level = None
_rust_sc_weight_fw = None
# SDID variance estimation (parallel placebo and bootstrap)
_rust_placebo_variance_sdid = None
_rust_bootstrap_variance_sdid = None
elif _backend_env == 'rust':
# Force Rust mode - fail if not available
if not _rust_available:
@@ -118,7 +109,4 @@
'_rust_compute_time_weights',
'_rust_compute_noise_level',
'_rust_sc_weight_fw',
# SDID variance estimation (parallel placebo and bootstrap)
'_rust_placebo_variance_sdid',
'_rust_bootstrap_variance_sdid',
]
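
The selection logic in `_backend.py` shown above boils down to a three-way policy: force pure Python, force Rust (failing loudly if the extension is missing), or auto-detect. A self-contained sketch — the environment-variable name here is an assumption for illustration:

```python
import os

def resolve_backend(rust_available: bool,
                    env_var: str = "DIFF_DIFF_BACKEND") -> str:
    """Pick the compute backend: 'python' forces pure Python, 'rust'
    raises if the compiled extension is absent, anything else auto-detects."""
    requested = os.environ.get(env_var, "auto").lower()
    if requested == "python":
        return "python"
    if requested == "rust":
        if not rust_available:
            raise ImportError("Rust backend requested but the compiled "
                              "extension is not installed")
        return "rust"
    return "rust" if rust_available else "python"
```

The forced-`rust` branch mirrors the module's behavior of failing fast rather than silently degrading to Python.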
103 changes: 0 additions & 103 deletions diff_diff/synthetic_did.py
@@ -583,62 +583,6 @@ def _bootstrap_se(
])
n_pre = Y_pre_control.shape[0]

# Try Rust parallel implementation for ~6x speedup
from diff_diff._backend import HAS_RUST_BACKEND, _rust_bootstrap_variance_sdid

if HAS_RUST_BACKEND and _rust_bootstrap_variance_sdid is not None:
# Generate random seed when self.seed is None
rust_seed = self.seed if self.seed is not None else int(
np.random.default_rng(None).integers(0, 2**63)
)

se, boot_arr, n_failed = _rust_bootstrap_variance_sdid(
np.ascontiguousarray(Y_pre_control, dtype=np.float64),
np.ascontiguousarray(Y_post_control, dtype=np.float64),
np.ascontiguousarray(Y_pre_treated, dtype=np.float64),
np.ascontiguousarray(Y_post_treated, dtype=np.float64),
np.ascontiguousarray(unit_weights, dtype=np.float64),
np.ascontiguousarray(time_weights, dtype=np.float64),
self.n_bootstrap, rust_seed,
)
bootstrap_estimates = np.asarray(boot_arr)
n_successful = len(bootstrap_estimates)
failure_rate = 1 - (n_successful / self.n_bootstrap)

# Apply same warning/error logic as Python path
if n_successful == 0:
raise ValueError(
f"All {self.n_bootstrap} bootstrap iterations failed. "
f"This typically occurs when:\n"
f" - Sample size is too small for reliable resampling\n"
f" - Weight matrices are singular or near-singular\n"
f" - Insufficient pre-treatment periods for weight estimation\n"
f" - Too few control units relative to treated units\n"
f"Consider using variance_method='placebo' or increasing "
f"the regularization parameters (zeta_omega, zeta_lambda)."
)
elif n_successful == 1:
warnings.warn(
f"Only 1/{self.n_bootstrap} bootstrap iteration succeeded. "
f"Standard error cannot be computed reliably (requires at least 2). "
f"Returning SE=0.0. Consider using variance_method='placebo' or "
f"increasing the regularization (zeta_omega, zeta_lambda).",
UserWarning,
stacklevel=2,
)
return 0.0, bootstrap_estimates
elif failure_rate > 0.05:
warnings.warn(
f"Only {n_successful}/{self.n_bootstrap} bootstrap iterations succeeded "
f"({failure_rate:.1%} failure rate). Standard errors may be unreliable. "
f"This can occur with small samples or insufficient pre-treatment periods.",
UserWarning,
stacklevel=2,
)

return se, bootstrap_estimates

# Python fallback
bootstrap_estimates = []

for _ in range(self.n_bootstrap):
@@ -796,53 +740,6 @@ def _placebo_variance_se(
)
return 0.0, np.array([])

# Try Rust parallel implementation for ~8x speedup
from diff_diff._backend import HAS_RUST_BACKEND, _rust_placebo_variance_sdid

if HAS_RUST_BACKEND and _rust_placebo_variance_sdid is not None:
# Generate random seed when self.seed is None (matching Python's non-reproducible behavior)
rust_seed = self.seed if self.seed is not None else int(
np.random.default_rng(None).integers(0, 2**63)
)

se, placebo_arr = _rust_placebo_variance_sdid(
np.ascontiguousarray(Y_pre_control, dtype=np.float64),
np.ascontiguousarray(Y_post_control, dtype=np.float64),
np.ascontiguousarray(Y_pre_treated_mean, dtype=np.float64),
np.ascontiguousarray(Y_post_treated_mean, dtype=np.float64),
n_treated, zeta_omega, zeta_lambda, min_decrease,
True, # intercept
100, # max_iter_pre_sparsify
10000, # max_iter
replications, rust_seed,
)
placebo_estimates = np.asarray(placebo_arr)
n_successful = len(placebo_estimates)

# Apply same warning/error logic as Python path
if n_successful < 2:
warnings.warn(
f"Only {n_successful} placebo replications completed successfully. "
f"Standard error cannot be estimated reliably. "
f"Consider using variance_method='bootstrap' or increasing "
f"the number of control units.",
UserWarning,
stacklevel=3,
)
return 0.0, placebo_estimates

failure_rate = 1 - (n_successful / replications)
if failure_rate > 0.05:
warnings.warn(
f"Only {n_successful}/{replications} placebo replications succeeded "
f"({failure_rate:.1%} failure rate). Standard errors may be unreliable.",
UserWarning,
stacklevel=3,
)

return se, placebo_estimates

# Python fallback
placebo_estimates = []

for _ in range(replications):
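
The failure-handling policy the deleted Rust branch mirrored (and which the Python fallback keeps) can be distilled into a small helper — a sketch with simplified messages: error when every draw fails, SE=0.0 with a warning when only one succeeds, warn above a 5% failure rate:

```python
import warnings
import numpy as np

def summarize_bootstrap(estimates, n_requested):
    """Turn successful bootstrap draws into an SE, following the
    policy in the diff above (simplified messages)."""
    estimates = np.asarray(estimates, dtype=float)
    n_ok = estimates.size
    if n_ok == 0:
        raise ValueError(f"All {n_requested} bootstrap iterations failed.")
    if n_ok == 1:
        warnings.warn("Only 1 bootstrap iteration succeeded; "
                      "returning SE=0.0.", UserWarning, stacklevel=2)
        return 0.0, estimates
    if 1 - n_ok / n_requested > 0.05:
        warnings.warn(f"Only {n_ok}/{n_requested} bootstrap iterations "
                      "succeeded; SEs may be unreliable.",
                      UserWarning, stacklevel=2)
    return float(np.std(estimates)), estimates
```

Keeping this policy identical in both backends is what made the two code paths interchangeable apart from their RNG streams.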
141 changes: 69 additions & 72 deletions docs/benchmarks.rst
@@ -88,8 +88,8 @@ Summary Table
- Yes
- **PASS**
* - SyntheticDiD
- 0.011
- 3.1%
- < 1e-10
- 0.3%
- Yes
- **PASS**

@@ -173,32 +173,36 @@ Synthetic DiD Results
:header-rows: 1

* - Metric
- diff-diff
- diff-diff (Pure)
- diff-diff (Rust)
- R synthdid
- Difference
* - ATT
- 3.851
- 3.840
- 0.011 (0.3%)
- 3.840
- 3.840
- < 1e-10
* - SE
- 0.106
- 0.103
- 3.1%
- 0.105
- 0.099
- 0.105
- 0.3% (pure)
* - Time (s)
- 0.017
- 7.49
- **433x faster**

**Validation**: PASS - Both ATT and SE estimates match closely. Both implementations
use placebo-based variance estimation (R's Algorithm 4 from Arkhangelsky et al. 2021).

The small SE difference (3.1%) is due to different unit/time weight optimization
algorithms:

- diff-diff uses projected gradient descent
- R synthdid uses Frank-Wolfe optimization with adaptive regularization

This leads to slightly different weights, which propagate to the placebo estimates.
- 3.41
- 1.65
- 8.19
- **2.4x faster** (pure)

**Validation**: PASS - ATT estimates are numerically identical across all
implementations. Both diff-diff and R's synthdid use Frank-Wolfe optimization
with two-pass sparsification and auto-computed regularization (``zeta_omega``,
``zeta_lambda``), producing identical unit and time weights. Both use
placebo-based variance estimation (Algorithm 4 from Arkhangelsky et al. 2021).

The small SE difference (0.3% at small scale, up to ~7% at larger scales) is
due to Monte Carlo variance in the placebo procedure, which randomly permutes
control units to construct pseudo-treated groups. Different random seeds across
implementations produce slightly different placebo samples.
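
Each placebo replication re-evaluates the SDID point estimate on a pseudo-treated draw. For reference, a sketch of that estimate — the weighted double difference from Arkhangelsky et al. (2021); the argument shapes are assumptions for illustration:

```python
import numpy as np

def sdid_att(Y_pre_co, Y_post_co, y_pre_tr, y_post_tr, omega, lam):
    """Weighted DiD: unit weights omega over controls, time weights
    lam over pre-periods.

    Assumed shapes: Y_pre_co (T_pre, N_co), Y_post_co (T_post, N_co),
    y_pre_tr (T_pre,), y_post_tr (T_post,).
    """
    treated_diff = y_post_tr.mean() - lam @ y_pre_tr            # treated change
    control_diff = omega @ Y_post_co.mean(axis=0) - lam @ (Y_pre_co @ omega)
    return float(treated_diff - control_diff)
```

Under a shared seed, permuting control units into pseudo-treated groups and re-running this estimate yields the placebo distribution whose spread becomes the SE, which is why seed differences alone move the SE slightly.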

Callaway-Sant'Anna Results
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -362,49 +366,38 @@ Three-Way Performance Summary
- R (s)
- Python Pure (s)
- Python Rust (s)
- Rust/R
- Pure/R
- Rust/Pure
* - small
- 7.73
- 0.016
- 0.004
- **2147x**
- **4.4x**
- 8.19
- 3.41
- 1.65
- **2.4x**
- **2.1x**
* - 1k
- 108.1
- 0.068
- 0.101
- **1070x**
- 0.7x
- 111.7
- 24.0
- 76.1
- **4.7x**
- 0.3x
* - 5k
- 504.5
- 2.97
- 0.69
- **734x**
- **4.3x**
* - 10k
- 1107.2
- 19.54
- 2.58
- **429x**
- **7.6x**
* - 20k
- 2451.0
- 137.3
- 10.9
- **225x**
- **12.6x**
- 524.2
- 31.7
- 307.5
- **16.5x**
- 0.1x

.. note::

**SyntheticDiD Performance**: diff-diff achieves **225x to 2147x speedup** over
R's synthdid package. At 10k scale, R takes ~18 minutes while Python Rust
completes in 2.6 seconds. At 20k scale, R takes ~41 minutes while Python Rust
completes in 11 seconds. The Rust backend provides **4-13x additional speedup**
over pure Python for SyntheticDiD due to optimized simplex projection and
synthetic weight computation. ATT estimates differ slightly due to different
weight optimization algorithms (projected gradient descent vs Frank-Wolfe),
but confidence intervals overlap.
**SyntheticDiD Performance**: diff-diff's pure Python backend achieves
**2.4x to 16.5x speedup** over R's synthdid package using the same
Frank-Wolfe optimization algorithm. At 5k scale, R takes ~9 minutes while
pure Python completes in 32 seconds. ATT estimates are numerically identical
(< 1e-10 difference) since both implementations use the same Frank-Wolfe
optimizer with two-pass sparsification. The Rust backend provides a speedup
at small scale (2.1x over pure Python) but is slower at larger scales due to
overhead in the placebo variance estimation loop; this is a known area for
future optimization.

Dataset Sizes
~~~~~~~~~~~~~
@@ -457,29 +450,30 @@ Key Observations

- **BasicDiD/TWFE**: 2-17x faster than R at all scales
- **CallawaySantAnna**: 4-11x faster than R at all scales (vectorized WIF computation)
- **SyntheticDiD**: 225-2147x faster than R (R takes 41 minutes at 20k scale!)
- **SyntheticDiD**: 2.4-16.5x faster than R (pure Python), with both
implementations using the same Frank-Wolfe algorithm

2. **Rust backend benefit depends on the estimator**:

- **SyntheticDiD**: Rust provides **4-13x speedup** over pure Python due to
optimized simplex projection and synthetic weight computation
- **SyntheticDiD**: Rust provides speedup at small scale (2.1x) but is
slower at larger scales due to placebo variance loop overhead
- **BasicDiD/CallawaySantAnna**: Rust provides minimal benefit (~1x) since
these estimators use OLS/variance computations already optimized in NumPy/SciPy

3. **When to use Rust backend**:

- **SyntheticDiD**: Recommended - provides significant speedup (4-13x)
- **SyntheticDiD at small scale**: Rust is ~2x faster than pure Python
- **Bootstrap inference**: May help with parallelized iterations
- **BasicDiD/CallawaySantAnna**: Optional - pure Python is equally fast

4. **Scaling behavior**: Python implementations show excellent scaling behavior
across all estimators. SyntheticDiD is 225x faster than R at 20k scale.
CallawaySantAnna achieves **exact SE accuracy** (0.0% difference) while
being 4-11x faster than R through vectorized NumPy operations.
across all estimators. SyntheticDiD pure Python is 16.5x faster than R at
5k scale. CallawaySantAnna achieves **exact SE accuracy** (0.0% difference)
while being 4-11x faster than R through vectorized NumPy operations.

5. **No Rust required for most use cases**: Users without Rust/maturin can
install diff-diff and get full functionality with excellent performance.
Only SyntheticDiD benefits significantly from the Rust backend.
Pure Python is the fastest option for SyntheticDiD at 1k+ scales.

6. **CallawaySantAnna accuracy and speed**: As of v2.0.3, CallawaySantAnna
achieves both exact numerical accuracy (0.0% SE difference from R) AND
@@ -650,9 +644,11 @@ When to Trust Results
``treated * time_f | unit`` interaction syntax (unit FE absorbed). Both average
ATT and all period-level effects match to machine precision. Use with confidence.

- **SyntheticDiD**: Both point estimates (0.3% diff) and standard errors (3.1% diff)
match R closely. Use ``variance_method="placebo"`` (default) to match R's
inference. Results are fully validated.
- **SyntheticDiD**: Point estimates are numerically identical (< 1e-10 diff) and
standard errors match closely (0.3% diff at small scale). Both implementations
use Frank-Wolfe optimization with identical weights. Use
``variance_method="placebo"`` (default) to match R's inference. Results are
fully validated.

- **CallawaySantAnna**: Both group-time effects (ATT(g,t)) and overall ATT
aggregation match R exactly. Standard errors are numerically equivalent
@@ -668,9 +664,10 @@ Known Differences
2. **Aggregation Weights**: Overall ATT is a weighted average of ATT(g,t).
Weighting schemes may differ between implementations.

3. **Weight Optimization**: SyntheticDiD uses different optimization algorithms
for unit/time weights (projected gradient descent vs Frank-Wolfe). This leads
to slightly different weights but equivalent ATT estimates.
3. **Placebo Variance**: SyntheticDiD SE estimates differ slightly (0.3-7%)
across implementations due to Monte Carlo variance in the placebo procedure.
Point estimates and unit/time weights are numerically identical since both
implementations use the same Frank-Wolfe optimizer.
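
For the Aggregation Weights point above, one common scheme is a size-weighted average of the ATT(g,t) cells — a sketch only, since the exact weighting varies by implementation:

```python
import numpy as np

def overall_att(att_gt, weights):
    """Weighted average of group-time effects ATT(g,t); `weights`
    might be group sizes or any implementation-specific scheme."""
    w = np.asarray(weights, dtype=float)
    att = np.asarray(att_gt, dtype=float)
    return float((w / w.sum()) @ att)   # normalize, then dot
```

Two implementations can agree on every ATT(g,t) cell yet report different overall ATTs if their weight vectors differ.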

References
----------