igerber · Feb 16, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -360,6 +360,8 @@ See `docs/performance-plan.md` for full optimization details and `docs/benchmark
   - `08_triple_diff.ipynb` - Triple Difference (DDD) estimation with proper covariate handling
   - `09_real_world_examples.ipynb` - Real-world data examples (Card-Krueger, Castle Doctrine, Divorce Laws)
   - `10_trop.ipynb` - Triply Robust Panel (TROP) estimation with factor model adjustment
+  - `11_imputation_did.ipynb` - Imputation DiD (Borusyak et al. 2024), pre-trend test, efficiency comparison
+  - `12_two_stage_did.ipynb` - Two-Stage DiD (Gardner 2022), GMM sandwich variance, per-observation effects
 
 ### Benchmarks
 

diff --git a/docs/tutorials/12_two_stage_did.ipynb b/docs/tutorials/12_two_stage_did.ipynb
@@ -0,0 +1,250 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Two-Stage DiD (Gardner 2022)\n",
+    "\n",
+    "This tutorial demonstrates the `TwoStageDiD` estimator, which implements the two-stage difference-in-differences method from Gardner (2022), \"Two-stage differences in differences\", with inference from Butts & Gardner (2022), \"did2s: Two-Stage Difference-in-Differences\".\n",
+    "\n",
+    "**When to use TwoStageDiD:**\n",
+    "- In staggered adoption settings, when you want a **GMM sandwich variance** that accounts for first-stage estimation uncertainty\n",
+    "- When you want **per-observation treatment effects** (the `treatment_effects` DataFrame) for granular analysis\n",
+    "- As a **robustness check** alongside ImputationDiD: the point estimates are identical by construction, so agreement between the two sets of standard errors indicates that conclusions do not hinge on the choice of variance estimator"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import warnings\n",
+    "warnings.filterwarnings('ignore')\n",
+    "\n",
+    "from diff_diff import (\n",
+    "    TwoStageDiD, ImputationDiD, CallawaySantAnna,\n",
+    "    generate_staggered_data, plot_event_study\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Basic Usage\n",
+    "\n",
+    "The two-stage estimator follows a simple algorithm:\n",
+    "1. Estimate unit and time fixed effects using only **untreated observations** (all periods of never-treated units plus the pre-treatment periods of not-yet-treated units)\n",
+    "2. Residualize **all** outcomes using those estimated FEs\n",
+    "3. Regress residualized outcomes on treatment indicators to obtain the ATT\n",
+    "\n",
+    "This avoids the bias of conventional TWFE under staggered adoption with heterogeneous effects: the fixed-effects model is estimated only on clean (untreated) data, so treated outcomes cannot contaminate the estimated counterfactual."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Generate staggered adoption data with known treatment effect\n",
+    "data = generate_staggered_data(n_units=300, n_periods=10, treatment_effect=2.0, seed=42)\n",
+    "\n",
+    "# Fit the two-stage estimator\n",
+    "est = TwoStageDiD()\n",
+    "results = est.fit(data, outcome='outcome', unit='unit', time='period', first_treat='first_treat')\n",
+    "results.print_summary()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Event Study\n",
+    "\n",
+    "Event study aggregation estimates treatment effects at each relative time horizon, enabling visualization of dynamic effects and informal pre-trend assessment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Fit with event study aggregation\n",
+    "est = TwoStageDiD()\n",
+    "results_es = est.fit(data, outcome='outcome', unit='unit', time='period',\n",
+    "                     first_treat='first_treat', aggregate='event_study')\n",
+    "\n",
+    "# Plot event study\n",
+    "plot_event_study(results_es, title='Two-Stage DiD Event Study')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# View event study effects as a table\n",
+    "results_es.to_dataframe(level='event_study')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Per-Observation Treatment Effects\n\nBoth `TwoStageDiD` and `ImputationDiD` provide a `treatment_effects` DataFrame containing one row per treated observation with:\n- `tau_hat`: the residualized outcome (actual outcome minus estimated counterfactual)\n- The unit and time columns (using the original column names from the input data, e.g., `unit` and `period`)\n- `rel_time`: relative time since treatment\n- `weight`: aggregation weight — `1/n_valid` for observations with finite `tau_hat`, `0` for NaN rows (e.g., rank-deficient cases)\n\nThis enables granular analysis: examining which units or periods drive the aggregate effect, detecting outliers, or constructing custom aggregation schemes."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Per-observation treatment effects (available from the basic fit)\n",
+    "te = results.treatment_effects\n",
+    "print(f\"Shape: {te.shape}\")\n",
+    "print(f\"Columns: {list(te.columns)}\")\n",
+    "print()\n",
+    "te.head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Comparison with Other Estimators\n\nTwoStageDiD and ImputationDiD produce **identical point estimates** because both estimate fixed effects on untreated observations and use them to residualize outcomes. The key difference is the variance estimator: TwoStageDiD uses the GMM sandwich from Butts & Gardner (2022), while ImputationDiD uses the conservative variance from Borusyak et al. (2024, Theorem 3).\n\nCallawaySantAnna uses a fundamentally different estimation approach — computing group-time ATT(g,t) effects via outcome regression, IPW, or doubly robust methods, then aggregating — so point estimates may differ, especially under heterogeneous effects. It uses analytical influence-function standard errors by default, with optional multiplier bootstrap when `n_bootstrap > 0`.\n\n*Note: Tutorial 11 compared ImputationDiD against CallawaySantAnna and SunAbraham. Here we focus on the TwoStageDiD vs ImputationDiD point-estimate identity, with CallawaySantAnna as a widely used reference point. For SunAbraham comparisons, see Tutorial 11.*"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Fit all three estimators on the same data\n",
+    "ts = TwoStageDiD().fit(data, outcome='outcome', unit='unit',\n",
+    "                       time='period', first_treat='first_treat')\n",
+    "imp = ImputationDiD().fit(data, outcome='outcome', unit='unit',\n",
+    "                          time='period', first_treat='first_treat')\n",
+    "cs = CallawaySantAnna().fit(data, outcome='outcome', unit='unit',\n",
+    "                            time='period', first_treat='first_treat')\n",
+    "\n",
+    "print(\"Estimator Comparison (True effect = 2.0)\")\n",
+    "print(\"=\" * 55)\n",
+    "print(f\"{'Estimator':<25} {'ATT':>8} {'SE':>8} {'CI Width':>10}\")\n",
+    "print(\"-\" * 55)\n",
+    "\n",
+    "for name, r in [(\"TwoStageDiD\", ts), (\"ImputationDiD\", imp), (\"CallawaySantAnna\", cs)]:\n",
+    "    ci_width = r.overall_conf_int[1] - r.overall_conf_int[0]\n",
+    "    print(f\"{name:<25} {r.overall_att:>8.3f} {r.overall_se:>8.3f} {ci_width:>10.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Group Aggregation\n",
+    "\n",
+    "Group aggregation estimates average treatment effects by treatment cohort (groups defined by first treatment period)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Fit with group aggregation\n",
+    "results_grp = TwoStageDiD().fit(data, outcome='outcome', unit='unit',\n",
+    "                                 time='period', first_treat='first_treat',\n",
+    "                                 aggregate='group')\n",
+    "results_grp.to_dataframe(level='group')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Advanced Features\n",
+    "\n",
+    "### Anticipation\n",
+    "\n",
+    "If treatment effects begin before the official treatment date (e.g., firms change behavior in anticipation of a policy), use the `anticipation` parameter to shift the effective treatment onset earlier by the specified number of periods."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Compare ATT with and without anticipation\n",
+    "est_antic = TwoStageDiD(anticipation=1)\n",
+    "results_antic = est_antic.fit(data, outcome='outcome', unit='unit',\n",
+    "                               time='period', first_treat='first_treat')\n",
+    "print(f\"ATT (no anticipation):       {results.overall_att:.3f}\")\n",
+    "print(f\"ATT (1-period anticipation): {results_antic.overall_att:.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### GMM Sandwich vs Conservative Variance\n",
+    "\n",
+    "The key methodological distinction between TwoStageDiD and ImputationDiD is the variance estimator:\n",
+    "\n",
+    "- **ImputationDiD's conservative variance** (Theorem 3) is valid under heterogeneous treatment effects but may produce wider confidence intervals than necessary\n",
+    "- **TwoStageDiD's GMM sandwich** accounts for first-stage estimation uncertainty via an influence function correction term\n",
+    "- In practice the two standard errors usually agree closely; a large divergence is worth investigating as a possible sign of misspecification\n",
+    "- Bootstrap inference is also available via `n_bootstrap=199`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Horizon-by-horizon SE comparison\n",
+    "ts_es = TwoStageDiD().fit(data, outcome='outcome', unit='unit',\n",
+    "                           time='period', first_treat='first_treat',\n",
+    "                           aggregate='event_study')\n",
+    "imp_es = ImputationDiD().fit(data, outcome='outcome', unit='unit',\n",
+    "                              time='period', first_treat='first_treat',\n",
+    "                              aggregate='event_study')\n",
+    "\n",
+    "print(\"Horizon-by-Horizon Comparison: GMM Sandwich vs Conservative Variance\")\n",
+    "print(\"=\" * 70)\n",
+    "print(f\"{'Horizon':>8} {'Effect':>10} {'GMM SE':>10} {'Cons. SE':>10} {'Ratio':>8}\")\n",
+    "print(\"-\" * 70)\n",
+    "\n",
+    "for h in sorted(ts_es.event_study_effects.keys()):\n",
+    "    ts_eff = ts_es.event_study_effects[h]\n",
+    "    imp_eff = imp_es.event_study_effects[h]\n",
+    "    if ts_eff.get('n_obs', 0) == 0:\n",
+    "        print(f\"{h:>8} {'[ref]':>10} {'---':>10} {'---':>10} {'---':>8}\")\n",
+    "        continue\n",
+    "    effect = ts_eff['effect']\n",
+    "    gmm_se = ts_eff['se']\n",
+    "    cons_se = imp_eff['se']\n",
+    "    ratio = gmm_se / cons_se if cons_se > 0 else np.nan\n",
+    "    print(f\"{h:>8} {effect:>10.4f} {gmm_se:>10.4f} {cons_se:>10.4f} {ratio:>8.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## Summary\n\n| Feature | TwoStageDiD | ImputationDiD | CallawaySantAnna |\n|---------|-------------|---------------|------------------|\n| **Approach** | Residualize via FE, regress on treatment | Impute Y(0) via FE model | Group-time ATT(g,t) |\n| **Point estimates** | Identical to ImputationDiD | Identical to TwoStageDiD | Different weighting |\n| **Variance** | GMM sandwich (influence function) | Conservative (Theorem 3) | Analytical influence function (optional bootstrap) |\n| **Per-obs effects** | Yes (`treatment_effects`) | Yes (`treatment_effects`) | No |\n| **Pre-trend test** | Via event study pre-periods | Yes (built-in F-test) | Via event study pre-periods |\n| **Best for** | Robustness check, granular effects | Maximum efficiency under homogeneity | Heterogeneous effects |\n\n**References:**\n- Gardner, J. (2022). Two-stage differences in differences. *arXiv:2207.05943*.\n- Butts, K. & Gardner, J. (2022). did2s: Two-Stage Difference-in-Differences. *R Journal*, 14(3), 162-173."
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}