Commit e6e5bec

Merge pull request #151 from PolicyEngine/maria/documentation_update
Some documentation nits
2 parents: a77ba47 + 1e562b3

10 files changed: 552 additions & 90 deletions

.github/workflows/main.yml

Lines changed: 2 additions & 32 deletions
```diff
@@ -5,33 +5,7 @@ on:
     branches: [ main ]
 
 jobs:
-  Check-MDN-Changes:
-    runs-on: ubuntu-latest
-    outputs:
-      mdn_changed: ${{ steps.check.outputs.mdn_changed }}
-    steps:
-      - uses: actions/checkout@v3
-        with:
-          fetch-depth: 2
-      - name: Check for MDN-related file changes
-        id: check
-        run: |
-          # Get list of changed files in this push
-          CHANGED_FILES=$(git diff --name-only HEAD~1 HEAD)
-          echo "Changed files:"
-          echo "$CHANGED_FILES"
-
-          # Check if any MDN-related files were changed
-          if echo "$CHANGED_FILES" | grep -qE "(mdn|MDN)"; then
-            echo "mdn_changed=true" >> $GITHUB_OUTPUT
-            echo "MDN-related files were changed"
-          else
-            echo "mdn_changed=false" >> $GITHUB_OUTPUT
-            echo "No MDN-related files were changed"
-          fi
-
   Test:
-    needs: Check-MDN-Changes
     runs-on: ubuntu-latest
     strategy:
       matrix:
@@ -56,12 +30,8 @@ jobs:
         run: |
           sudo Rscript -e 'install.packages("StatMatch", repos="https://cloud.r-project.org")'
           sudo Rscript -e 'install.packages("clue", repos="https://cloud.r-project.org")'
-      - name: Install full dependencies without MDN (Python 3.13)
-        if: matrix.python-version == '3.13' && needs.Check-MDN-Changes.outputs.mdn_changed != 'true'
-        run: |
-          uv pip install -e ".[dev,docs,matching,images]" --system
-      - name: Install full dependencies with MDN (Python 3.13)
-        if: matrix.python-version == '3.13' && needs.Check-MDN-Changes.outputs.mdn_changed == 'true'
+      - name: Install full dependencies (Python 3.13)
+        if: matrix.python-version == '3.13'
         run: |
           uv pip install -e ".[dev,docs,matching,mdn,images]" --system
       - name: Install minimal dependencies (Python 3.12)
```

README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -9,6 +9,7 @@ Microimpute enables variable imputation through a variety of statistical methods
 - **Ordinary Least Squares (OLS)**: Linear regression-based imputation
 - **Quantile Regression**: Distribution-aware regression imputation
 - **Quantile Random Forests (QRF)**: Non-parametric forest-based approach
+- **Mixture Density Networks (MDN)**: Neural network with Gaussian mixture approximation head
 
 ### Automated method selection
 - **AutoImpute**: Automatically compares and selects the best imputation method for your data
```

changelog_entry.yaml

Lines changed: 4 additions & 0 deletions
```diff
@@ -0,0 +1,4 @@
+- bump: minor
+  changes:
+    added:
+      - Updates to documentation and Myst deployment.
```

docs/imputation-benchmarking/metrics.md

Lines changed: 44 additions & 8 deletions
````diff
@@ -116,6 +116,8 @@ $$W_p(P, Q) = \left(\inf_{\gamma \in \Pi(P, Q)} \int_{X \times Y} d(x, y)^p d\ga
 
 where $\Pi(P, Q)$ denotes the set of all joint distributions whose marginals are $P$ and $Q$ respectively. The Wasserstein distance measures the minimum "work" required to transform one distribution into another, where work is the amount of distribution mass moved times the distance moved. Lower values indicate better preservation of the original distribution's shape.
 
+When sample weights are provided, the weighted Wasserstein distance accounts for varying observation importance, which is essential when comparing survey data with different sampling designs. We use scipy's `wasserstein_distance` implementation, which supports sample weights via the `u_weights` and `v_weights` parameters.
+
 ### Kullback-Leibler divergence
 
 For discrete distributions (categorical and boolean variables), KL divergence quantifies how one probability distribution diverges from a reference:
@@ -124,24 +126,56 @@ $$D_{KL}(P||Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right
 
 where $P$ is the reference distribution (original data), $Q$ is the approximation (imputed data), and $\mathcal{X}$ is the set of all possible categorical values. KL divergence measures how much information is lost when using the imputed distribution to approximate the true distribution. Lower values indicate better preservation of the original categorical distribution.
 
+When sample weights are provided, the probability distributions are computed as weighted proportions rather than simple counts, ensuring proper comparison of weighted survey data.
+
+### kl_divergence
+
+Computes the Kullback-Leibler divergence between two categorical distributions, with optional sample weights.
+
+```python
+def kl_divergence(
+    donor_values: np.ndarray,
+    receiver_values: np.ndarray,
+    donor_weights: Optional[np.ndarray] = None,
+    receiver_weights: Optional[np.ndarray] = None,
+) -> float
+```
+
+| Parameter | Type | Default used | Description |
+|-----------|------|---------|-------------|
+| donor_values | np.ndarray | - | Categorical values from donor data (reference distribution) |
+| receiver_values | np.ndarray | - | Categorical values from receiver data (approximation) |
+| donor_weights | np.ndarray | None | Optional sample weights for donor values |
+| receiver_weights | np.ndarray | None | Optional sample weights for receiver values |
+
+Returns KL divergence value (float >= 0), where 0 indicates identical distributions.
+
 ### compare_distributions
 
+Compares distributions between donor and receiver data, automatically selecting the appropriate metric based on variable type and supporting sample weights for survey data.
+
 ```python
 def compare_distributions(
     donor_data: pd.DataFrame,
     receiver_data: pd.DataFrame,
     imputed_variables: List[str],
+    donor_weights: Optional[Union[pd.Series, np.ndarray]] = None,
+    receiver_weights: Optional[Union[pd.Series, np.ndarray]] = None,
 ) -> pd.DataFrame
 ```
 
-| Parameter | Type | Description |
-|-----------|------|-------------|
-| donor_data | pd.DataFrame | Original donor data |
-| receiver_data | pd.DataFrame | Receiver data with imputations |
-| imputed_variables | List[str] | Variables to compare |
+| Parameter | Type | Default used | Description |
+|-----------|------|---------|-------------|
+| donor_data | pd.DataFrame | - | Original donor data |
+| receiver_data | pd.DataFrame | - | Receiver data with imputations |
+| imputed_variables | List[str] | - | Variables to compare |
+| donor_weights | pd.Series or np.ndarray | None | Sample weights for donor data (must match donor_data length) |
+| receiver_weights | pd.Series or np.ndarray | None | Sample weights for receiver data (must match receiver_data length) |
 
 Returns a DataFrame with columns `Variable`, `Metric`, and `Distance`. The function automatically selects Wasserstein distance for numerical variables and KL divergence for categorical variables.
 
+Note that data must not contain null or infinite values. If your data contains such values, filter them before calling this function.
+
 ## Predictor analysis
 
 Understanding which predictors contribute most to imputation quality helps with feature selection and model interpretation. These tools analyze predictor-target relationships and evaluate sensitivity to predictor selection.
@@ -251,11 +285,13 @@ metrics_df = compare_metrics(
     imputed_variables=imputed_variables
 )
 
-# Evaluate distributional match
-dist_df = compare_distributions(
+# Evaluate distributional match with survey weights
+dist_df_weighted = compare_distributions(
     donor_data=donor,
     receiver_data=receiver_with_imputations,
-    imputed_variables=imputed_variables
+    imputed_variables=imputed_variables,
+    donor_weights=donor["sample_weight"],
+    receiver_weights=receiver["sample_weight"],
 )
 
 # Analyze predictor importance
````
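The weighted-Wasserstein behavior documented in the diff above can be sanity-checked with a small scipy sketch. The array values here are illustrative, not taken from the repository:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two samples over the same support {0, 1}.
donor = np.array([0.0, 1.0])
receiver = np.array([0.0, 1.0])

# Unweighted: identical samples, zero distance.
d_unweighted = wasserstein_distance(donor, receiver)

# Weighted: the donor puts 3x the mass on 0, so the distributions differ
# even though the raw values match. Mass shift is |0.75 - 0.50| = 0.25
# over a unit distance, giving W1 = 0.25.
d_weighted = wasserstein_distance(
    donor,
    receiver,
    u_weights=np.array([3.0, 1.0]),
    v_weights=np.array([1.0, 1.0]),
)

print(d_unweighted)  # 0.0
print(d_weighted)    # 0.25
```

This is why the docs stress passing survey weights: two samples with identical values can still represent different populations once sampling weights are applied.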

docs/myst.yml

Lines changed: 10 additions & 4 deletions
```diff
@@ -28,22 +28,28 @@ project:
       - file: models/quantreg/index
         children:
           - file: models/quantreg/quantreg-imputation
+      - file: models/mdn/index
+        children:
+          - file: models/mdn/mdn-imputation
   - title: Imputation and benchmarking
     children:
       - file: imputation-benchmarking/index
         children:
+          - file: imputation-benchmarking/preprocessing
+          - file: imputation-benchmarking/cross-validation
+          - file: imputation-benchmarking/metrics
+          - file: imputation-benchmarking/visualizations
           - file: imputation-benchmarking/benchmarking-methods
-          - file: imputation-benchmarking/imputing-across-surveys
   - title: AutoImpute
     children:
       - file: autoimpute/index
         children:
           - file: autoimpute/autoimpute
-  - title: SCF to CPS example
+  - title: Use cases
     children:
-      - file: examples/scf_to_cps/index
+      - file: use_cases/index
        children:
-          - file: examples/scf_to_cps/imputing-from-scf-to-cps
+          - file: use_cases/scf_to_cps/imputing-from-scf-to-cps
 site:
   options:
     logo: logo.png
```

microimpute/comparisons/autoimpute_helpers.py

Lines changed: 14 additions & 9 deletions
```diff
@@ -163,15 +163,20 @@ def prepare_data_for_imputation(
     predictor_log = [c for c in log_cols if c in predictors]
     predictor_asinh = [c for c in asinh_cols if c in predictors]
 
-    transformed_imputing, _ = preprocess_data(
-        imputing_data[predictors],
-        full_data=True,
-        train_size=train_size,
-        test_size=test_size,
-        normalize=predictor_normalize if predictor_normalize else False,
-        log_transform=predictor_log if predictor_log else False,
-        asinh_transform=predictor_asinh if predictor_asinh else False,
-    )
+    if predictor_normalize or predictor_log or predictor_asinh:
+        transformed_imputing, _ = preprocess_data(
+            imputing_data[predictors],
+            full_data=True,
+            train_size=train_size,
+            test_size=test_size,
+            normalize=(
+                predictor_normalize if predictor_normalize else False
+            ),
+            log_transform=predictor_log if predictor_log else False,
+            asinh_transform=predictor_asinh if predictor_asinh else False,
+        )
+    else:
+        transformed_imputing = imputing_data[predictors].copy()
 
     training_data = transformed_training
     if weight_col:
```
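The change above short-circuits preprocessing when no transform applies to the predictor set. The guard pattern can be sketched standalone; `apply_transforms` below is a hypothetical stand-in for microimpute's `preprocess_data` (which takes more arguments and returns a tuple):

```python
import numpy as np
import pandas as pd

def apply_transforms(df, log_cols, asinh_cols):
    # Hypothetical stand-in for microimpute's preprocess_data.
    out = df.copy()
    for c in log_cols:
        out[c] = np.log1p(out[c])
    for c in asinh_cols:
        out[c] = np.arcsinh(out[c])
    return out

def prepare_predictors(data, predictors, log_cols, asinh_cols):
    # Intersect the requested transforms with the predictor set.
    predictor_log = [c for c in log_cols if c in predictors]
    predictor_asinh = [c for c in asinh_cols if c in predictors]
    # Only invoke preprocessing when some transform actually applies;
    # otherwise pass the predictor columns through unchanged.
    if predictor_log or predictor_asinh:
        return apply_transforms(
            data[predictors], predictor_log, predictor_asinh
        )
    return data[predictors].copy()
```

The pass-through branch avoids running the preprocessing pipeline (and its train/test handling) on data that would come back unchanged anyway.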

microimpute/comparisons/metrics.py

Lines changed: 85 additions & 10 deletions
```diff
@@ -8,7 +8,7 @@
 """
 
 import logging
-from typing import Dict, List, Literal, Optional, Tuple
+from typing import Dict, List, Literal, Optional, Tuple, Union
 
 import numpy as np
 import pandas as pd
@@ -497,7 +497,10 @@ def compare_metrics(
 
 
 def kl_divergence(
-    donor_values: np.ndarray, receiver_values: np.ndarray
+    donor_values: np.ndarray,
+    receiver_values: np.ndarray,
+    donor_weights: Optional[np.ndarray] = None,
+    receiver_weights: Optional[np.ndarray] = None,
 ) -> float:
     """Calculate Kullback-Leibler (KL) Divergence between two categorical distributions.
 
@@ -512,6 +515,10 @@ def kl_divergence(
     Args:
         donor_values: Array of categorical values from donor data (reference distribution P).
         receiver_values: Array of categorical values from receiver data (approximation Q).
+        donor_weights: Optional weights for donor values. If provided, computes
+            weighted probability distribution.
+        receiver_weights: Optional weights for receiver values. If provided,
+            computes weighted probability distribution.
 
     Returns:
         KL divergence value >= 0, where 0 indicates identical distributions
@@ -536,9 +543,30 @@ def kl_divergence(
         np.unique(donor_values), np.unique(receiver_values)
     )
 
-    # Calculate probability distributions
-    donor_counts = pd.Series(donor_values).value_counts(normalize=True)
-    receiver_counts = pd.Series(receiver_values).value_counts(normalize=True)
+    # Calculate probability distributions (weighted if weights provided)
+    if donor_weights is not None:
+        # Compute weighted probabilities
+        donor_df = pd.DataFrame(
+            {"value": donor_values, "weight": donor_weights}
+        )
+        donor_grouped = donor_df.groupby("value")["weight"].sum()
+        donor_total = donor_grouped.sum()
+        donor_counts = donor_grouped / donor_total
+    else:
+        donor_counts = pd.Series(donor_values).value_counts(normalize=True)
+
+    if receiver_weights is not None:
+        # Compute weighted probabilities
+        receiver_df = pd.DataFrame(
+            {"value": receiver_values, "weight": receiver_weights}
+        )
+        receiver_grouped = receiver_df.groupby("value")["weight"].sum()
+        receiver_total = receiver_grouped.sum()
+        receiver_counts = receiver_grouped / receiver_total
+    else:
+        receiver_counts = pd.Series(receiver_values).value_counts(
+            normalize=True
+        )
 
     # Create probability arrays for all categories
     p_donor = np.array([donor_counts.get(cat, 0.0) for cat in all_categories])
@@ -563,6 +591,8 @@ def compare_distributions(
     donor_data: pd.DataFrame,
     receiver_data: pd.DataFrame,
     imputed_variables: List[str],
+    donor_weights: Optional[Union[pd.Series, np.ndarray]] = None,
+    receiver_weights: Optional[Union[pd.Series, np.ndarray]] = None,
 ) -> pd.DataFrame:
     """Compare distributions between donor and receiver data for imputed variables.
 
@@ -574,6 +604,10 @@ def compare_distributions(
         donor_data: DataFrame containing original donor data.
         receiver_data: DataFrame containing receiver data with imputations.
         imputed_variables: List of variable names to compare.
+        donor_weights: Optional array or Series of sample weights for donor data.
+            Must have same length as donor_data.
+        receiver_weights: Optional array or Series of sample weights for receiver
+            data. Must have same length as receiver_data.
 
     Returns:
         DataFrame with columns 'Variable', 'Metric', and 'Distance' containing
@@ -608,14 +642,45 @@ def compare_distributions(
         receiver_data, imputed_variables, "receiver_data"
     )
 
+    # Convert weights to numpy arrays if provided
+    donor_weights_arr = None
+    receiver_weights_arr = None
+    if donor_weights is not None:
+        donor_weights_arr = np.asarray(donor_weights)
+        if len(donor_weights_arr) != len(donor_data):
+            raise ValueError(
+                f"donor_weights length ({len(donor_weights_arr)}) must match "
+                f"donor_data length ({len(donor_data)})"
+            )
+    if receiver_weights is not None:
+        receiver_weights_arr = np.asarray(receiver_weights)
+        if len(receiver_weights_arr) != len(receiver_data):
+            raise ValueError(
+                f"receiver_weights length ({len(receiver_weights_arr)}) must "
+                f"match receiver_data length ({len(receiver_data)})"
+            )
+
     results = []
 
     # Detect metric type and compute distance for each variable
     detector = VariableTypeDetector()
     for var in imputed_variables:
-        # Get values from both datasets
-        donor_values = donor_data[var].dropna().values
-        receiver_values = receiver_data[var].dropna().values
+        donor_values = donor_data[var].values
+        receiver_values = receiver_data[var].values
+
+        # Check for null values - these are not allowed when comparing
+        if np.any(pd.isna(donor_values)):
+            raise ValueError(
+                f"Variable '{var}' in donor_data contains null values. "
+                "Please remove or impute null values before comparing "
+                "distributions."
+            )
+        if np.any(pd.isna(receiver_values)):
+            raise ValueError(
+                f"Variable '{var}' in receiver_data contains null values. "
+                "Please remove or impute null values before comparing "
+                "distributions."
+            )
 
         if len(donor_values) == 0 or len(receiver_values) == 0:
             log.warning(
@@ -633,14 +698,24 @@ def compare_distributions(
         if var_type in ["bool", "categorical", "numeric_categorical"]:
             # Use KL Divergence for categorical
             metric_name = "kl_divergence"
-            distance = kl_divergence(donor_values, receiver_values)
+            distance = kl_divergence(
+                donor_values,
+                receiver_values,
+                donor_weights=donor_weights_arr,
+                receiver_weights=receiver_weights_arr,
+            )
             log.debug(
                 f"KL divergence for categorical variable '{var}': {distance:.6f}"
             )
         else:
             # Use Wasserstein Distance for numerical
             metric_name = "wasserstein_distance"
-            distance = wasserstein_distance(donor_values, receiver_values)
+            distance = wasserstein_distance(
+                donor_values,
+                receiver_values,
+                u_weights=donor_weights_arr,
+                v_weights=receiver_weights_arr,
+            )
             log.debug(
                 f"Wasserstein distance for numerical variable '{var}': {distance:.6f}"
            )
```
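The weighted branch added to `kl_divergence` above amounts to replacing `value_counts(normalize=True)` with weighted group sums. A minimal standalone sketch of that idea (simplified: no input validation, and zero probabilities are handled with a small epsilon, which the library may do differently):

```python
import numpy as np
import pandas as pd

def weighted_kl(donor_values, receiver_values,
                donor_weights=None, receiver_weights=None,
                eps=1e-10):
    """KL(P||Q) over the union of observed categories, optionally weighted."""
    cats = np.union1d(np.unique(donor_values), np.unique(receiver_values))

    def probs(values, weights):
        if weights is None:
            counts = pd.Series(values).value_counts(normalize=True)
        else:
            # Weighted proportions: sum of weights per category / total weight.
            grouped = pd.DataFrame(
                {"value": values, "weight": weights}
            ).groupby("value")["weight"].sum()
            counts = grouped / grouped.sum()
        return np.array([counts.get(c, 0.0) for c in cats])

    # Clip to eps so log(p/q) is defined when a category is absent.
    p = np.clip(probs(donor_values, donor_weights), eps, None)
    q = np.clip(probs(receiver_values, receiver_weights), eps, None)
    return float(np.sum(p * np.log(p / q)))
```

Identical samples give a divergence of (approximately) zero, while upweighting one donor category pulls the weighted donor distribution away from the receiver's and yields a positive divergence.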
