Merged
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Changed
- Remove Rust outer-loop variance estimation for SyntheticDiD (placebo and bootstrap)
- Fixes SE mismatch between pure Python and Rust backends (different RNG sequences)
- Fixes Rust performance regression at 1k+ scale (memory bandwidth saturation from rayon parallelism)
- Inner Frank-Wolfe weight computation still uses Rust when available

### Documentation
- Re-run SyntheticDiD benchmarks against R after Frank-Wolfe methodology rewrite
- Updated `docs/benchmarks.rst` SDID validation results, performance tables, and known differences
- ATT now matches R to < 1e-10 (previously 0.3% diff) since both use the Frank-Wolfe optimizer

## [2.3.0] - 2026-02-09

### Added
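
The inner Frank-Wolfe step kept in Rust (the `sc_weight_fw` binding referenced later in this PR) solves a simplex-constrained least-squares problem for the synthetic weights. A minimal pure-Python sketch of classic Frank-Wolfe — illustrative only, not the library's implementation, which adds regularization (`zeta`), an intercept, and two-pass sparsification:

```python
import numpy as np

def frank_wolfe_simplex(Y, y, max_iter=5000, tol=1e-12):
    """Toy Frank-Wolfe for min_w ||Y @ w - y||^2 with w on the simplex.

    Sketch only: the real solver also handles zeta regularization,
    an intercept term, and a sparsify-then-resolve second pass.
    """
    n = Y.shape[1]
    w = np.full(n, 1.0 / n)                 # start at the simplex center
    for t in range(max_iter):
        grad = 2.0 * Y.T @ (Y @ w - y)      # least-squares gradient
        s = np.zeros(n)
        s[np.argmin(grad)] = 1.0            # linear minimizer is a vertex
        step = 2.0 / (t + 2.0)              # classic diminishing step size
        w_next = w + step * (s - w)
        if np.linalg.norm(w_next - w) < tol:
            return w_next
        w = w_next
    return w
```

Each iteration moves toward a single simplex vertex, which is why Frank-Wolfe iterates stay sparse — the property the two-pass sparsification exploits.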
3 changes: 0 additions & 3 deletions CLAUDE.md
@@ -167,9 +167,6 @@ cross-platform compilation - no OpenBLAS or Intel MKL installation required.
- `compute_unit_distance_matrix()` - Parallel pairwise RMSE distance computation (4-8x speedup)
- `loocv_grid_search()` - Parallel LOOCV across tuning parameters (10-50x speedup)
- `bootstrap_trop_variance()` - Parallel bootstrap variance estimation (5-15x speedup)
- **`rust/src/sdid_variance.rs`** - SDID variance estimation acceleration:
- `placebo_variance_sdid()` - Parallel placebo SE computation (~8x speedup)
- `bootstrap_variance_sdid()` - Parallel bootstrap SE computation (~6x speedup)
- Uses pure-Rust `faer` library for linear algebra (no external BLAS/LAPACK dependencies)
- Cross-platform: builds on Linux, macOS, and Windows without additional setup
- Provides 4-8x speedup for SyntheticDiD, 5-20x speedup for TROP
12 changes: 0 additions & 12 deletions diff_diff/_backend.py
@@ -35,9 +35,6 @@
compute_time_weights as _rust_compute_time_weights,
compute_noise_level as _rust_compute_noise_level,
sc_weight_fw as _rust_sc_weight_fw,
# SDID variance estimation (parallel placebo and bootstrap)
placebo_variance_sdid as _rust_placebo_variance_sdid,
bootstrap_variance_sdid as _rust_bootstrap_variance_sdid,
)
_rust_available = True
except ImportError:
@@ -59,9 +56,6 @@
_rust_compute_time_weights = None
_rust_compute_noise_level = None
_rust_sc_weight_fw = None
# SDID variance estimation (parallel placebo and bootstrap)
_rust_placebo_variance_sdid = None
_rust_bootstrap_variance_sdid = None

# Determine final backend based on environment variable and availability
if _backend_env == 'python':
@@ -84,9 +78,6 @@
_rust_compute_time_weights = None
_rust_compute_noise_level = None
_rust_sc_weight_fw = None
# SDID variance estimation (parallel placebo and bootstrap)
_rust_placebo_variance_sdid = None
_rust_bootstrap_variance_sdid = None
elif _backend_env == 'rust':
# Force Rust mode - fail if not available
if not _rust_available:
@@ -118,7 +109,4 @@
'_rust_compute_time_weights',
'_rust_compute_noise_level',
'_rust_sc_weight_fw',
# SDID variance estimation (parallel placebo and bootstrap)
'_rust_placebo_variance_sdid',
'_rust_bootstrap_variance_sdid',
]
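
The selection logic in `_backend.py` shown above boils down to a three-way policy: force pure Python, force Rust (failing loudly if the extension is missing), or auto-detect. A self-contained sketch — the environment-variable name here is an assumption for illustration:

```python
import os

def resolve_backend(rust_available: bool,
                    env_var: str = "DIFF_DIFF_BACKEND") -> str:
    """Pick the compute backend: 'python' forces pure Python, 'rust'
    raises if the compiled extension is absent, anything else auto-detects."""
    requested = os.environ.get(env_var, "auto").lower()
    if requested == "python":
        return "python"
    if requested == "rust":
        if not rust_available:
            raise ImportError("Rust backend requested but the compiled "
                              "extension is not installed")
        return "rust"
    return "rust" if rust_available else "python"
```

The forced-`rust` branch mirrors the module's behavior of failing fast rather than silently degrading to Python.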
103 changes: 0 additions & 103 deletions diff_diff/synthetic_did.py
@@ -583,62 +583,6 @@ def _bootstrap_se(
])
n_pre = Y_pre_control.shape[0]

# Try Rust parallel implementation for ~6x speedup
from diff_diff._backend import HAS_RUST_BACKEND, _rust_bootstrap_variance_sdid

if HAS_RUST_BACKEND and _rust_bootstrap_variance_sdid is not None:
# Generate random seed when self.seed is None
rust_seed = self.seed if self.seed is not None else int(
np.random.default_rng(None).integers(0, 2**63)
)

se, boot_arr, n_failed = _rust_bootstrap_variance_sdid(
np.ascontiguousarray(Y_pre_control, dtype=np.float64),
np.ascontiguousarray(Y_post_control, dtype=np.float64),
np.ascontiguousarray(Y_pre_treated, dtype=np.float64),
np.ascontiguousarray(Y_post_treated, dtype=np.float64),
np.ascontiguousarray(unit_weights, dtype=np.float64),
np.ascontiguousarray(time_weights, dtype=np.float64),
self.n_bootstrap, rust_seed,
)
bootstrap_estimates = np.asarray(boot_arr)
n_successful = len(bootstrap_estimates)
failure_rate = 1 - (n_successful / self.n_bootstrap)

# Apply same warning/error logic as Python path
if n_successful == 0:
raise ValueError(
f"All {self.n_bootstrap} bootstrap iterations failed. "
f"This typically occurs when:\n"
f" - Sample size is too small for reliable resampling\n"
f" - Weight matrices are singular or near-singular\n"
f" - Insufficient pre-treatment periods for weight estimation\n"
f" - Too few control units relative to treated units\n"
f"Consider using variance_method='placebo' or increasing "
f"the regularization parameters (zeta_omega, zeta_lambda)."
)
elif n_successful == 1:
warnings.warn(
f"Only 1/{self.n_bootstrap} bootstrap iteration succeeded. "
f"Standard error cannot be computed reliably (requires at least 2). "
f"Returning SE=0.0. Consider using variance_method='placebo' or "
f"increasing the regularization (zeta_omega, zeta_lambda).",
UserWarning,
stacklevel=2,
)
return 0.0, bootstrap_estimates
elif failure_rate > 0.05:
warnings.warn(
f"Only {n_successful}/{self.n_bootstrap} bootstrap iterations succeeded "
f"({failure_rate:.1%} failure rate). Standard errors may be unreliable. "
f"This can occur with small samples or insufficient pre-treatment periods.",
UserWarning,
stacklevel=2,
)

return se, bootstrap_estimates

# Python fallback
bootstrap_estimates = []

for _ in range(self.n_bootstrap):
@@ -796,53 +740,6 @@ def _placebo_variance_se(
)
return 0.0, np.array([])

# Try Rust parallel implementation for ~8x speedup
from diff_diff._backend import HAS_RUST_BACKEND, _rust_placebo_variance_sdid

if HAS_RUST_BACKEND and _rust_placebo_variance_sdid is not None:
# Generate random seed when self.seed is None (matching Python's non-reproducible behavior)
rust_seed = self.seed if self.seed is not None else int(
np.random.default_rng(None).integers(0, 2**63)
)

se, placebo_arr = _rust_placebo_variance_sdid(
np.ascontiguousarray(Y_pre_control, dtype=np.float64),
np.ascontiguousarray(Y_post_control, dtype=np.float64),
np.ascontiguousarray(Y_pre_treated_mean, dtype=np.float64),
np.ascontiguousarray(Y_post_treated_mean, dtype=np.float64),
n_treated, zeta_omega, zeta_lambda, min_decrease,
True, # intercept
100, # max_iter_pre_sparsify
10000, # max_iter
replications, rust_seed,
)
placebo_estimates = np.asarray(placebo_arr)
n_successful = len(placebo_estimates)

# Apply same warning/error logic as Python path
if n_successful < 2:
warnings.warn(
f"Only {n_successful} placebo replications completed successfully. "
f"Standard error cannot be estimated reliably. "
f"Consider using variance_method='bootstrap' or increasing "
f"the number of control units.",
UserWarning,
stacklevel=3,
)
return 0.0, placebo_estimates

failure_rate = 1 - (n_successful / replications)
if failure_rate > 0.05:
warnings.warn(
f"Only {n_successful}/{replications} placebo replications succeeded "
f"({failure_rate:.1%} failure rate). Standard errors may be unreliable.",
UserWarning,
stacklevel=3,
)

return se, placebo_estimates

# Python fallback
placebo_estimates = []

for _ in range(replications):
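
The failure-handling policy the deleted Rust branch mirrored (and which the Python fallback keeps) can be distilled into a small helper — a sketch with simplified messages: error when every draw fails, SE=0.0 with a warning when only one succeeds, warn above a 5% failure rate:

```python
import warnings
import numpy as np

def summarize_bootstrap(estimates, n_requested):
    """Turn successful bootstrap draws into an SE, following the
    policy in the diff above (simplified messages)."""
    estimates = np.asarray(estimates, dtype=float)
    n_ok = estimates.size
    if n_ok == 0:
        raise ValueError(f"All {n_requested} bootstrap iterations failed.")
    if n_ok == 1:
        warnings.warn("Only 1 bootstrap iteration succeeded; "
                      "returning SE=0.0.", UserWarning, stacklevel=2)
        return 0.0, estimates
    if 1 - n_ok / n_requested > 0.05:
        warnings.warn(f"Only {n_ok}/{n_requested} bootstrap iterations "
                      "succeeded; SEs may be unreliable.",
                      UserWarning, stacklevel=2)
    return float(np.std(estimates)), estimates
```

Keeping this policy identical in both backends is what made the two code paths interchangeable apart from their RNG streams.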
141 changes: 69 additions & 72 deletions docs/benchmarks.rst
@@ -88,8 +88,8 @@ Summary Table
- Yes
- **PASS**
* - SyntheticDiD
- 0.011
- 3.1%
- < 1e-10
- 0.3%
- Yes
- **PASS**

@@ -173,32 +173,36 @@ Synthetic DiD Results
:header-rows: 1

* - Metric
- diff-diff
- diff-diff (Pure)
- diff-diff (Rust)
- R synthdid
- Difference
* - ATT
- 3.851
- 3.840
- 0.011 (0.3%)
- 3.840
- 3.840
- < 1e-10
* - SE
- 0.106
- 0.103
- 3.1%
- 0.105
- 0.099
- 0.105
- 0.3% (pure)
* - Time (s)
- 0.017
- 7.49
- **433x faster**

**Validation**: PASS - Both ATT and SE estimates match closely. Both implementations
use placebo-based variance estimation (R's Algorithm 4 from Arkhangelsky et al. 2021).

The small SE difference (3.1%) is due to different unit/time weight optimization
algorithms:

- diff-diff uses projected gradient descent
- R synthdid uses Frank-Wolfe optimization with adaptive regularization

This leads to slightly different weights, which propagate to the placebo estimates.
- 3.41
- 1.65
- 8.19
- **2.4x faster** (pure)

**Validation**: PASS - ATT estimates are numerically identical across all
implementations. Both diff-diff and R's synthdid use Frank-Wolfe optimization
with two-pass sparsification and auto-computed regularization (``zeta_omega``,
``zeta_lambda``), producing identical unit and time weights. Both use
placebo-based variance estimation (Algorithm 4 from Arkhangelsky et al. 2021).

The small SE difference (0.3% at small scale, up to ~7% at larger scales) is
due to Monte Carlo variance in the placebo procedure, which randomly permutes
control units to construct pseudo-treated groups. Different random seeds across
implementations produce slightly different placebo samples.
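
Each placebo replication re-evaluates the SDID point estimate on a pseudo-treated draw. For reference, a sketch of that estimate — the weighted double difference from Arkhangelsky et al. (2021); the argument shapes are assumptions for illustration:

```python
import numpy as np

def sdid_att(Y_pre_co, Y_post_co, y_pre_tr, y_post_tr, omega, lam):
    """Weighted DiD: unit weights omega over controls, time weights
    lam over pre-periods.

    Assumed shapes: Y_pre_co (T_pre, N_co), Y_post_co (T_post, N_co),
    y_pre_tr (T_pre,), y_post_tr (T_post,).
    """
    treated_diff = y_post_tr.mean() - lam @ y_pre_tr            # treated change
    control_diff = omega @ Y_post_co.mean(axis=0) - lam @ (Y_pre_co @ omega)
    return float(treated_diff - control_diff)
```

Under a shared seed, permuting control units into pseudo-treated groups and re-running this estimate yields the placebo distribution whose spread becomes the SE, which is why seed differences alone move the SE slightly.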

Callaway-Sant'Anna Results
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -362,49 +366,38 @@ Three-Way Performance Summary
- R (s)
- Python Pure (s)
- Python Rust (s)
- Rust/R
- Pure/R
- Rust/Pure
* - small
- 7.73
- 0.016
- 0.004
- **2147x**
- **4.4x**
- 8.19
- 3.41
- 1.65
- **2.4x**
- **2.1x**
* - 1k
- 108.1
- 0.068
- 0.101
- **1070x**
- 0.7x
- 111.7
- 24.0
- 76.1
- **4.7x**
- 0.3x
* - 5k
- 504.5
- 2.97
- 0.69
- **734x**
- **4.3x**
* - 10k
- 1107.2
- 19.54
- 2.58
- **429x**
- **7.6x**
* - 20k
- 2451.0
- 137.3
- 10.9
- **225x**
- **12.6x**
- 524.2
- 31.7
- 307.5
- **16.5x**
- 0.1x

.. note::

**SyntheticDiD Performance**: diff-diff achieves **225x to 2147x speedup** over
R's synthdid package. At 10k scale, R takes ~18 minutes while Python Rust
completes in 2.6 seconds. At 20k scale, R takes ~41 minutes while Python Rust
completes in 11 seconds. The Rust backend provides **4-13x additional speedup**
over pure Python for SyntheticDiD due to optimized simplex projection and
synthetic weight computation. ATT estimates differ slightly due to different
weight optimization algorithms (projected gradient descent vs Frank-Wolfe),
but confidence intervals overlap.
**SyntheticDiD Performance**: diff-diff's pure Python backend achieves
**2.4x to 16.5x speedup** over R's synthdid package using the same
Frank-Wolfe optimization algorithm. At 5k scale, R takes ~9 minutes while
pure Python completes in 32 seconds. ATT estimates are numerically identical
(< 1e-10 difference) since both implementations use the same Frank-Wolfe
optimizer with two-pass sparsification. The Rust backend provides a speedup
at small scale (2.1x over pure Python) but is slower at larger scales due to
overhead in the placebo variance estimation loop; this is a known area for
future optimization.

Dataset Sizes
~~~~~~~~~~~~~
@@ -457,29 +450,30 @@ Key Observations

- **BasicDiD/TWFE**: 2-17x faster than R at all scales
- **CallawaySantAnna**: 4-11x faster than R at all scales (vectorized WIF computation)
- **SyntheticDiD**: 225-2147x faster than R (R takes 41 minutes at 20k scale!)
- **SyntheticDiD**: 2.4-16.5x faster than R (pure Python), with both
implementations using the same Frank-Wolfe algorithm

2. **Rust backend benefit depends on the estimator**:

- **SyntheticDiD**: Rust provides **4-13x speedup** over pure Python due to
optimized simplex projection and synthetic weight computation
- **SyntheticDiD**: Rust provides speedup at small scale (2.1x) but is
slower at larger scales due to placebo variance loop overhead
- **BasicDiD/CallawaySantAnna**: Rust provides minimal benefit (~1x) since
these estimators use OLS/variance computations already optimized in NumPy/SciPy

3. **When to use Rust backend**:

- **SyntheticDiD**: Recommended - provides significant speedup (4-13x)
- **SyntheticDiD at small scale**: Rust is ~2x faster than pure Python
- **Bootstrap inference**: May help with parallelized iterations
- **BasicDiD/CallawaySantAnna**: Optional - pure Python is equally fast

4. **Scaling behavior**: Python implementations show excellent scaling behavior
across all estimators. SyntheticDiD is 225x faster than R at 20k scale.
CallawaySantAnna achieves **exact SE accuracy** (0.0% difference) while
being 4-11x faster than R through vectorized NumPy operations.
across all estimators. SyntheticDiD pure Python is 16.5x faster than R at
5k scale. CallawaySantAnna achieves **exact SE accuracy** (0.0% difference)
while being 4-11x faster than R through vectorized NumPy operations.

5. **No Rust required for most use cases**: Users without Rust/maturin can
install diff-diff and get full functionality with excellent performance.
Only SyntheticDiD benefits significantly from the Rust backend.
Pure Python is the fastest option for SyntheticDiD at 1k+ scales.

6. **CallawaySantAnna accuracy and speed**: As of v2.0.3, CallawaySantAnna
achieves both exact numerical accuracy (0.0% SE difference from R) AND
@@ -650,9 +644,11 @@ When to Trust Results
``treated * time_f | unit`` interaction syntax (unit FE absorbed). Both average
ATT and all period-level effects match to machine precision. Use with confidence.

- **SyntheticDiD**: Both point estimates (0.3% diff) and standard errors (3.1% diff)
match R closely. Use ``variance_method="placebo"`` (default) to match R's
inference. Results are fully validated.
- **SyntheticDiD**: Point estimates are numerically identical (< 1e-10 diff) and
standard errors match closely (0.3% diff at small scale). Both implementations
use Frank-Wolfe optimization with identical weights. Use
``variance_method="placebo"`` (default) to match R's inference. Results are
fully validated.

- **CallawaySantAnna**: Both group-time effects (ATT(g,t)) and overall ATT
aggregation match R exactly. Standard errors are numerically equivalent
@@ -668,9 +664,10 @@ Known Differences
2. **Aggregation Weights**: Overall ATT is a weighted average of ATT(g,t).
Weighting schemes may differ between implementations.

3. **Weight Optimization**: SyntheticDiD uses different optimization algorithms
for unit/time weights (projected gradient descent vs Frank-Wolfe). This leads
to slightly different weights but equivalent ATT estimates.
3. **Placebo Variance**: SyntheticDiD SE estimates differ slightly (0.3-7%)
across implementations due to Monte Carlo variance in the placebo procedure.
Point estimates and unit/time weights are numerically identical since both
implementations use the same Frank-Wolfe optimizer.
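
For the Aggregation Weights point above, one common scheme is a size-weighted average of the ATT(g,t) cells — a sketch only, since the exact weighting varies by implementation:

```python
import numpy as np

def overall_att(att_gt, weights):
    """Weighted average of group-time effects ATT(g,t); `weights`
    might be group sizes or any implementation-specific scheme."""
    w = np.asarray(weights, dtype=float)
    att = np.asarray(att_gt, dtype=float)
    return float((w / w.sum()) @ att)   # normalize, then dot
```

Two implementations can agree on every ATT(g,t) cell yet report different overall ATTs if their weight vectors differ.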

References
----------