Commit e6e5bec

Merge pull request #151 from PolicyEngine/maria/documentation_update
Some documentation nits
2 parents: a77ba47 + 1e562b3

10 files changed: 552 additions & 90 deletions

.github/workflows/main.yml

Lines changed: 2 additions & 32 deletions
```diff
@@ -5,33 +5,7 @@ on:
     branches: [ main ]
 
 jobs:
-  Check-MDN-Changes:
-    runs-on: ubuntu-latest
-    outputs:
-      mdn_changed: ${{ steps.check.outputs.mdn_changed }}
-    steps:
-      - uses: actions/checkout@v3
-        with:
-          fetch-depth: 2
-      - name: Check for MDN-related file changes
-        id: check
-        run: |
-          # Get list of changed files in this push
-          CHANGED_FILES=$(git diff --name-only HEAD~1 HEAD)
-          echo "Changed files:"
-          echo "$CHANGED_FILES"
-
-          # Check if any MDN-related files were changed
-          if echo "$CHANGED_FILES" | grep -qE "(mdn|MDN)"; then
-            echo "mdn_changed=true" >> $GITHUB_OUTPUT
-            echo "MDN-related files were changed"
-          else
-            echo "mdn_changed=false" >> $GITHUB_OUTPUT
-            echo "No MDN-related files were changed"
-          fi
-
   Test:
-    needs: Check-MDN-Changes
     runs-on: ubuntu-latest
     strategy:
       matrix:
@@ -56,12 +30,8 @@ jobs:
         run: |
           sudo Rscript -e 'install.packages("StatMatch", repos="https://cloud.r-project.org")'
           sudo Rscript -e 'install.packages("clue", repos="https://cloud.r-project.org")'
-      - name: Install full dependencies without MDN (Python 3.13)
-        if: matrix.python-version == '3.13' && needs.Check-MDN-Changes.outputs.mdn_changed != 'true'
-        run: |
-          uv pip install -e ".[dev,docs,matching,images]" --system
-      - name: Install full dependencies with MDN (Python 3.13)
-        if: matrix.python-version == '3.13' && needs.Check-MDN-Changes.outputs.mdn_changed == 'true'
+      - name: Install full dependencies (Python 3.13)
+        if: matrix.python-version == '3.13'
         run: |
           uv pip install -e ".[dev,docs,matching,mdn,images]" --system
       - name: Install minimal dependencies (Python 3.12)
```

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -9,6 +9,7 @@ Microimpute enables variable imputation through a variety of statistical methods
 - **Ordinary Least Squares (OLS)**: Linear regression-based imputation
 - **Quantile Regression**: Distribution-aware regression imputation
 - **Quantile Random Forests (QRF)**: Non-parametric forest-based approach
+- **Mixture Density Networks (MDN)**: Neural network with Gaussian mixture approximation head
 
 ### Automated method selection
 - **AutoImpute**: Automatically compares and selects the best imputation method for your data
```

changelog_entry.yaml

Lines changed: 4 additions & 0 deletions
```diff
@@ -0,0 +1,4 @@
+- bump: minor
+  changes:
+    added:
+      - Updates to documentation and Myst deployment.
```

docs/imputation-benchmarking/metrics.md

Lines changed: 44 additions & 8 deletions
````diff
@@ -116,6 +116,8 @@ $$W_p(P, Q) = \left(\inf_{\gamma \in \Pi(P, Q)} \int_{X \times Y} d(x, y)^p d\ga
 
 where $\Pi(P, Q)$ denotes the set of all joint distributions whose marginals are $P$ and $Q$ respectively. The Wasserstein distance measures the minimum "work" required to transform one distribution into another, where work is the amount of distribution mass moved times the distance moved. Lower values indicate better preservation of the original distribution's shape.
 
+When sample weights are provided, the weighted Wasserstein distance accounts for varying observation importance, which is essential when comparing survey data with different sampling designs. We use scipy's `wasserstein_distance` implementation, which supports sample weights via the `u_weights` and `v_weights` parameters.
+
 ### Kullback-Leibler divergence
 
 For discrete distributions (categorical and boolean variables), KL divergence quantifies how one probability distribution diverges from a reference:
@@ -124,24 +126,56 @@ $$D_{KL}(P||Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right
 
 where $P$ is the reference distribution (original data), $Q$ is the approximation (imputed data), and $\mathcal{X}$ is the set of all possible categorical values. KL divergence measures how much information is lost when using the imputed distribution to approximate the true distribution. Lower values indicate better preservation of the original categorical distribution.
 
+When sample weights are provided, the probability distributions are computed as weighted proportions rather than simple counts, ensuring proper comparison of weighted survey data.
+
+### kl_divergence
+
+Computes the Kullback-Leibler divergence between two categorical distributions, with optional sample weights.
+
+```python
+def kl_divergence(
+    donor_values: np.ndarray,
+    receiver_values: np.ndarray,
+    donor_weights: Optional[np.ndarray] = None,
+    receiver_weights: Optional[np.ndarray] = None,
+) -> float
+```
+
+| Parameter | Type | Default used | Description |
+|-----------|------|---------|-------------|
+| donor_values | np.ndarray | - | Categorical values from donor data (reference distribution) |
+| receiver_values | np.ndarray | - | Categorical values from receiver data (approximation) |
+| donor_weights | np.ndarray | None | Optional sample weights for donor values |
+| receiver_weights | np.ndarray | None | Optional sample weights for receiver values |
+
+Returns KL divergence value (float >= 0), where 0 indicates identical distributions.
+
 ### compare_distributions
 
+Compares distributions between donor and receiver data, automatically selecting the appropriate metric based on variable type and supporting sample weights for survey data.
+
 ```python
 def compare_distributions(
     donor_data: pd.DataFrame,
     receiver_data: pd.DataFrame,
     imputed_variables: List[str],
+    donor_weights: Optional[Union[pd.Series, np.ndarray]] = None,
+    receiver_weights: Optional[Union[pd.Series, np.ndarray]] = None,
 ) -> pd.DataFrame
 ```
 
-| Parameter | Type | Description |
-|-----------|------|-------------|
-| donor_data | pd.DataFrame | Original donor data |
-| receiver_data | pd.DataFrame | Receiver data with imputations |
-| imputed_variables | List[str] | Variables to compare |
+| Parameter | Type | Default used | Description |
+|-----------|------|---------|-------------|
+| donor_data | pd.DataFrame | - | Original donor data |
+| receiver_data | pd.DataFrame | - | Receiver data with imputations |
+| imputed_variables | List[str] | - | Variables to compare |
+| donor_weights | pd.Series or np.ndarray | None | Sample weights for donor data (must match donor_data length) |
+| receiver_weights | pd.Series or np.ndarray | None | Sample weights for receiver data (must match receiver_data length) |
 
 Returns a DataFrame with columns `Variable`, `Metric`, and `Distance`. The function automatically selects Wasserstein distance for numerical variables and KL divergence for categorical variables.
 
+Note that data must not contain null or infinite values. If your data contains such values, filter them before calling this function.
+
 ## Predictor analysis
 
 Understanding which predictors contribute most to imputation quality helps with feature selection and model interpretation. These tools analyze predictor-target relationships and evaluate sensitivity to predictor selection.
@@ -251,11 +285,13 @@ metrics_df = compare_metrics(
     imputed_variables=imputed_variables
 )
 
-# Evaluate distributional match
-dist_df = compare_distributions(
+# Evaluate distributional match with survey weights
+dist_df_weighted = compare_distributions(
     donor_data=donor,
     receiver_data=receiver_with_imputations,
-    imputed_variables=imputed_variables
+    imputed_variables=imputed_variables,
+    donor_weights=donor["sample_weight"],
+    receiver_weights=receiver["sample_weight"],
 )
 
 # Analyze predictor importance
````
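The weighted-Wasserstein behavior documented in the diff above can be sanity-checked with a small scipy sketch. The array values here are illustrative, not taken from the repository:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two samples over the same support {0, 1}.
donor = np.array([0.0, 1.0])
receiver = np.array([0.0, 1.0])

# Unweighted: identical samples, zero distance.
d_unweighted = wasserstein_distance(donor, receiver)

# Weighted: the donor puts 3x the mass on 0, so the distributions differ
# even though the raw values match. Mass shift is |0.75 - 0.50| = 0.25
# over a unit distance, giving W1 = 0.25.
d_weighted = wasserstein_distance(
    donor,
    receiver,
    u_weights=np.array([3.0, 1.0]),
    v_weights=np.array([1.0, 1.0]),
)

print(d_unweighted)  # 0.0
print(d_weighted)    # 0.25
```

This is why the docs stress passing survey weights: two samples with identical values can still represent different populations once sampling weights are applied.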

docs/myst.yml

Lines changed: 10 additions & 4 deletions
```diff
@@ -28,22 +28,28 @@ project:
       - file: models/quantreg/index
         children:
           - file: models/quantreg/quantreg-imputation
+      - file: models/mdn/index
+        children:
+          - file: models/mdn/mdn-imputation
   - title: Imputation and benchmarking
     children:
       - file: imputation-benchmarking/index
         children:
+          - file: imputation-benchmarking/preprocessing
+          - file: imputation-benchmarking/cross-validation
+          - file: imputation-benchmarking/metrics
+          - file: imputation-benchmarking/visualizations
           - file: imputation-benchmarking/benchmarking-methods
-          - file: imputation-benchmarking/imputing-across-surveys
   - title: AutoImpute
     children:
       - file: autoimpute/index
         children:
           - file: autoimpute/autoimpute
-  - title: SCF to CPS example
+  - title: Use cases
     children:
-      - file: examples/scf_to_cps/index
+      - file: use_cases/index
        children:
-          - file: examples/scf_to_cps/imputing-from-scf-to-cps
+          - file: use_cases/scf_to_cps/imputing-from-scf-to-cps
 site:
   options:
     logo: logo.png
```

microimpute/comparisons/autoimpute_helpers.py

Lines changed: 14 additions & 9 deletions
```diff
@@ -163,15 +163,20 @@ def prepare_data_for_imputation(
     predictor_log = [c for c in log_cols if c in predictors]
     predictor_asinh = [c for c in asinh_cols if c in predictors]
 
-    transformed_imputing, _ = preprocess_data(
-        imputing_data[predictors],
-        full_data=True,
-        train_size=train_size,
-        test_size=test_size,
-        normalize=predictor_normalize if predictor_normalize else False,
-        log_transform=predictor_log if predictor_log else False,
-        asinh_transform=predictor_asinh if predictor_asinh else False,
-    )
+    if predictor_normalize or predictor_log or predictor_asinh:
+        transformed_imputing, _ = preprocess_data(
+            imputing_data[predictors],
+            full_data=True,
+            train_size=train_size,
+            test_size=test_size,
+            normalize=(
+                predictor_normalize if predictor_normalize else False
+            ),
+            log_transform=predictor_log if predictor_log else False,
+            asinh_transform=predictor_asinh if predictor_asinh else False,
+        )
+    else:
+        transformed_imputing = imputing_data[predictors].copy()
 
     training_data = transformed_training
     if weight_col:
```
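The change above short-circuits preprocessing when no transform applies to the predictor set. The guard pattern can be sketched standalone; `apply_transforms` below is a hypothetical stand-in for microimpute's `preprocess_data` (which takes more arguments and returns a tuple):

```python
import numpy as np
import pandas as pd

def apply_transforms(df, log_cols, asinh_cols):
    # Hypothetical stand-in for microimpute's preprocess_data.
    out = df.copy()
    for c in log_cols:
        out[c] = np.log1p(out[c])
    for c in asinh_cols:
        out[c] = np.arcsinh(out[c])
    return out

def prepare_predictors(data, predictors, log_cols, asinh_cols):
    # Intersect the requested transforms with the predictor set.
    predictor_log = [c for c in log_cols if c in predictors]
    predictor_asinh = [c for c in asinh_cols if c in predictors]
    # Only invoke preprocessing when some transform actually applies;
    # otherwise pass the predictor columns through unchanged.
    if predictor_log or predictor_asinh:
        return apply_transforms(
            data[predictors], predictor_log, predictor_asinh
        )
    return data[predictors].copy()
```

The pass-through branch avoids running the preprocessing pipeline (and its train/test handling) on data that would come back unchanged anyway.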

microimpute/comparisons/metrics.py

Lines changed: 85 additions & 10 deletions
```diff
@@ -8,7 +8,7 @@
 """
 
 import logging
-from typing import Dict, List, Literal, Optional, Tuple
+from typing import Dict, List, Literal, Optional, Tuple, Union
 
 import numpy as np
 import pandas as pd
@@ -497,7 +497,10 @@ def compare_metrics(
 
 
 def kl_divergence(
-    donor_values: np.ndarray, receiver_values: np.ndarray
+    donor_values: np.ndarray,
+    receiver_values: np.ndarray,
+    donor_weights: Optional[np.ndarray] = None,
+    receiver_weights: Optional[np.ndarray] = None,
 ) -> float:
     """Calculate Kullback-Leibler (KL) Divergence between two categorical distributions.
 
@@ -512,6 +515,10 @@ def kl_divergence(
     Args:
         donor_values: Array of categorical values from donor data (reference distribution P).
         receiver_values: Array of categorical values from receiver data (approximation Q).
+        donor_weights: Optional weights for donor values. If provided, computes
+            weighted probability distribution.
+        receiver_weights: Optional weights for receiver values. If provided,
+            computes weighted probability distribution.
 
     Returns:
         KL divergence value >= 0, where 0 indicates identical distributions
@@ -536,9 +543,30 @@ def kl_divergence(
         np.unique(donor_values), np.unique(receiver_values)
     )
 
-    # Calculate probability distributions
-    donor_counts = pd.Series(donor_values).value_counts(normalize=True)
-    receiver_counts = pd.Series(receiver_values).value_counts(normalize=True)
+    # Calculate probability distributions (weighted if weights provided)
+    if donor_weights is not None:
+        # Compute weighted probabilities
+        donor_df = pd.DataFrame(
+            {"value": donor_values, "weight": donor_weights}
+        )
+        donor_grouped = donor_df.groupby("value")["weight"].sum()
+        donor_total = donor_grouped.sum()
+        donor_counts = donor_grouped / donor_total
+    else:
+        donor_counts = pd.Series(donor_values).value_counts(normalize=True)
+
+    if receiver_weights is not None:
+        # Compute weighted probabilities
+        receiver_df = pd.DataFrame(
+            {"value": receiver_values, "weight": receiver_weights}
+        )
+        receiver_grouped = receiver_df.groupby("value")["weight"].sum()
+        receiver_total = receiver_grouped.sum()
+        receiver_counts = receiver_grouped / receiver_total
+    else:
+        receiver_counts = pd.Series(receiver_values).value_counts(
+            normalize=True
+        )
 
     # Create probability arrays for all categories
     p_donor = np.array([donor_counts.get(cat, 0.0) for cat in all_categories])
@@ -563,6 +591,8 @@ def compare_distributions(
     donor_data: pd.DataFrame,
     receiver_data: pd.DataFrame,
     imputed_variables: List[str],
+    donor_weights: Optional[Union[pd.Series, np.ndarray]] = None,
+    receiver_weights: Optional[Union[pd.Series, np.ndarray]] = None,
 ) -> pd.DataFrame:
     """Compare distributions between donor and receiver data for imputed variables.
 
@@ -574,6 +604,10 @@ def compare_distributions(
         donor_data: DataFrame containing original donor data.
         receiver_data: DataFrame containing receiver data with imputations.
         imputed_variables: List of variable names to compare.
+        donor_weights: Optional array or Series of sample weights for donor data.
+            Must have same length as donor_data.
+        receiver_weights: Optional array or Series of sample weights for receiver
+            data. Must have same length as receiver_data.
 
     Returns:
         DataFrame with columns 'Variable', 'Metric', and 'Distance' containing
@@ -608,14 +642,45 @@ def compare_distributions(
         receiver_data, imputed_variables, "receiver_data"
     )
 
+    # Convert weights to numpy arrays if provided
+    donor_weights_arr = None
+    receiver_weights_arr = None
+    if donor_weights is not None:
+        donor_weights_arr = np.asarray(donor_weights)
+        if len(donor_weights_arr) != len(donor_data):
+            raise ValueError(
+                f"donor_weights length ({len(donor_weights_arr)}) must match "
+                f"donor_data length ({len(donor_data)})"
+            )
+    if receiver_weights is not None:
+        receiver_weights_arr = np.asarray(receiver_weights)
+        if len(receiver_weights_arr) != len(receiver_data):
+            raise ValueError(
+                f"receiver_weights length ({len(receiver_weights_arr)}) must "
+                f"match receiver_data length ({len(receiver_data)})"
+            )
+
     results = []
 
     # Detect metric type and compute distance for each variable
     detector = VariableTypeDetector()
     for var in imputed_variables:
-        # Get values from both datasets
-        donor_values = donor_data[var].dropna().values
-        receiver_values = receiver_data[var].dropna().values
+        donor_values = donor_data[var].values
+        receiver_values = receiver_data[var].values
+
+        # Check for null values - these are not allowed when comparing
+        if np.any(pd.isna(donor_values)):
+            raise ValueError(
+                f"Variable '{var}' in donor_data contains null values. "
+                "Please remove or impute null values before comparing "
+                "distributions."
+            )
+        if np.any(pd.isna(receiver_values)):
+            raise ValueError(
+                f"Variable '{var}' in receiver_data contains null values. "
+                "Please remove or impute null values before comparing "
+                "distributions."
+            )
 
         if len(donor_values) == 0 or len(receiver_values) == 0:
             log.warning(
@@ -633,14 +698,24 @@ def compare_distributions(
         if var_type in ["bool", "categorical", "numeric_categorical"]:
             # Use KL Divergence for categorical
             metric_name = "kl_divergence"
-            distance = kl_divergence(donor_values, receiver_values)
+            distance = kl_divergence(
+                donor_values,
+                receiver_values,
+                donor_weights=donor_weights_arr,
+                receiver_weights=receiver_weights_arr,
+            )
             log.debug(
                 f"KL divergence for categorical variable '{var}': {distance:.6f}"
             )
         else:
             # Use Wasserstein Distance for numerical
             metric_name = "wasserstein_distance"
-            distance = wasserstein_distance(donor_values, receiver_values)
+            distance = wasserstein_distance(
+                donor_values,
+                receiver_values,
+                u_weights=donor_weights_arr,
+                v_weights=receiver_weights_arr,
+            )
             log.debug(
                 f"Wasserstein distance for numerical variable '{var}': {distance:.6f}"
            )
```
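The weighted branch added to `kl_divergence` above amounts to replacing `value_counts(normalize=True)` with weighted group sums. A minimal standalone sketch of that idea (simplified: no input validation, and zero probabilities are handled with a small epsilon, which the library may do differently):

```python
import numpy as np
import pandas as pd

def weighted_kl(donor_values, receiver_values,
                donor_weights=None, receiver_weights=None,
                eps=1e-10):
    """KL(P||Q) over the union of observed categories, optionally weighted."""
    cats = np.union1d(np.unique(donor_values), np.unique(receiver_values))

    def probs(values, weights):
        if weights is None:
            counts = pd.Series(values).value_counts(normalize=True)
        else:
            # Weighted proportions: sum of weights per category / total weight.
            grouped = pd.DataFrame(
                {"value": values, "weight": weights}
            ).groupby("value")["weight"].sum()
            counts = grouped / grouped.sum()
        return np.array([counts.get(c, 0.0) for c in cats])

    # Clip to eps so log(p/q) is defined when a category is absent.
    p = np.clip(probs(donor_values, donor_weights), eps, None)
    q = np.clip(probs(receiver_values, receiver_weights), eps, None)
    return float(np.sum(p * np.log(p / q)))
```

Identical samples give a divergence of (approximately) zero, while upweighting one donor category pulls the weighted donor distribution away from the receiver's and yields a positive divergence.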
