$$W_1(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}\big[\,|x - y|\,\big]$$

where $\Pi(P, Q)$ denotes the set of all joint distributions whose marginals are $P$ and $Q$, respectively. The Wasserstein distance measures the minimum "work" required to transform one distribution into another, where work is the amount of distribution mass moved times the distance moved. Lower values indicate better preservation of the original distribution's shape.
When sample weights are provided, the weighted Wasserstein distance accounts for varying observation importance, which is essential when comparing survey data with different sampling designs. We use scipy's `wasserstein_distance` implementation, which supports sample weights via the `u_weights` and `v_weights` parameters.
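As a sketch of how weighted and unweighted comparisons differ, the following calls `scipy.stats.wasserstein_distance` directly; the sample values and weights are illustrative, not taken from any real survey:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Illustrative samples: original (donor) values and imputed (receiver) values.
donor = np.array([10.0, 20.0, 30.0, 40.0])
receiver = np.array([12.0, 18.0, 33.0, 41.0])

# Survey weights: up-weight some observations, as in a complex sampling design.
donor_w = np.array([1.0, 2.0, 1.0, 1.0])
receiver_w = np.array([1.0, 1.0, 2.0, 1.0])

unweighted = wasserstein_distance(donor, receiver)
weighted = wasserstein_distance(
    donor, receiver, u_weights=donor_w, v_weights=receiver_w
)
```

Because the weights shift mass between the two empirical CDFs, the weighted and unweighted distances generally differ even for the same values.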
### Kullback-Leibler divergence
For discrete distributions (categorical and boolean variables), KL divergence quantifies how one probability distribution diverges from a reference:

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$

where $P$ is the reference distribution (original data), $Q$ is the approximation (imputed data), and $\mathcal{X}$ is the set of all possible categorical values. KL divergence measures how much information is lost when using the imputed distribution to approximate the true distribution. Lower values indicate better preservation of the original categorical distribution.
When sample weights are provided, the probability distributions are computed as weighted proportions rather than simple counts, ensuring proper comparison of weighted survey data.
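To see the difference, here is a small illustrative example (the category values and weights are hypothetical) contrasting unweighted counts with weighted proportions:

```python
import numpy as np

values = np.array(["yes", "no", "yes"])
weights = np.array([1.0, 3.0, 1.0])

# Unweighted proportions: category counts divided by the number of observations.
unweighted = {c: float((values == c).mean()) for c in ("yes", "no")}

# Weighted proportions: category weight mass divided by total weight.
weighted = {c: float(weights[values == c].sum() / weights.sum()) for c in ("yes", "no")}
```

Here the unweighted proportions are 2/3 "yes" and 1/3 "no", but the heavily weighted "no" observation flips the weighted proportions to 0.4 "yes" and 0.6 "no".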
### kl_divergence
Computes the Kullback-Leibler divergence between two categorical distributions, with optional sample weights.
```python
def kl_divergence(
    donor_values: np.ndarray,
    receiver_values: np.ndarray,
    donor_weights: Optional[np.ndarray] = None,
    receiver_weights: Optional[np.ndarray] = None,
) -> float
```
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| donor_values | np.ndarray | - | Categorical values from donor data (reference distribution) |
| receiver_values | np.ndarray | - | Categorical values from receiver data (approximation) |
| donor_weights | Optional[np.ndarray] | None | Sample weights for donor data |
| receiver_weights | Optional[np.ndarray] | None | Sample weights for receiver data |

Returns the KL divergence as a non-negative float, where 0 indicates identical distributions.
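A minimal sketch of how such a function could work, assuming weighted category proportions as described above; the epsilon smoothing for categories absent from one sample is an assumption of this sketch, not necessarily what the library does:

```python
import numpy as np
from typing import Optional

def kl_divergence(
    donor_values: np.ndarray,
    receiver_values: np.ndarray,
    donor_weights: Optional[np.ndarray] = None,
    receiver_weights: Optional[np.ndarray] = None,
) -> float:
    """Sketch: KL divergence between two samples of categorical values."""
    # Shared category support so p and q are aligned element-wise.
    categories = np.union1d(donor_values, receiver_values)

    def weighted_probs(values: np.ndarray, weights: Optional[np.ndarray]) -> np.ndarray:
        if weights is None:
            weights = np.ones(len(values))
        mass = np.array([weights[values == c].sum() for c in categories], dtype=float)
        return mass / mass.sum()

    p = weighted_probs(donor_values, donor_weights)        # reference distribution
    q = weighted_probs(receiver_values, receiver_weights)  # approximation

    # Smooth zero probabilities to keep log(p / q) finite (assumed behavior).
    eps = 1e-10
    p = np.clip(p, eps, None); p = p / p.sum()
    q = np.clip(q, eps, None); q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

Identical weighted distributions give a divergence of (numerically) zero, and any mismatch in proportions gives a positive value.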
### compare_distributions
Compares distributions between donor and receiver data, automatically selecting the appropriate metric based on variable type and supporting sample weights for survey data.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| donor_data | pd.DataFrame | - | Original donor data |
| receiver_data | pd.DataFrame | - | Receiver data with imputations |
| imputed_variables | List[str] | - | Variables to compare |
| donor_weights | pd.Series or np.ndarray | None | Sample weights for donor data (must match donor_data length) |
| receiver_weights | pd.Series or np.ndarray | None | Sample weights for receiver data (must match receiver_data length) |
Returns a DataFrame with columns `Variable`, `Metric`, and `Distance`. The function automatically selects Wasserstein distance for numerical variables and KL divergence for categorical variables.
Note that data must not contain null or infinite values. If your data contains such values, filter them before calling this function.
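The dispatch logic can be sketched as follows. This is an illustrative, self-contained re-implementation under the assumptions above (numeric columns to Wasserstein, categorical and boolean columns to KL with epsilon smoothing), not the library's actual code:

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance
from typing import List, Optional

def compare_distributions(
    donor_data: pd.DataFrame,
    receiver_data: pd.DataFrame,
    imputed_variables: List[str],
    donor_weights: Optional[np.ndarray] = None,
    receiver_weights: Optional[np.ndarray] = None,
) -> pd.DataFrame:
    """Sketch: one row per variable, metric chosen by dtype."""
    rows = []
    for var in imputed_variables:
        d = donor_data[var].to_numpy()
        r = receiver_data[var].to_numpy()
        if pd.api.types.is_numeric_dtype(d) and d.dtype != bool:
            metric = "Wasserstein"
            dist = wasserstein_distance(
                d, r, u_weights=donor_weights, v_weights=receiver_weights
            )
        else:
            metric = "KL divergence"
            cats = np.union1d(d, r)
            dw = donor_weights if donor_weights is not None else np.ones(len(d))
            rw = receiver_weights if receiver_weights is not None else np.ones(len(r))
            # Weighted category proportions on the shared support.
            p = np.array([dw[d == c].sum() for c in cats], dtype=float)
            q = np.array([rw[r == c].sum() for c in cats], dtype=float)
            p, q = p / p.sum(), q / q.sum()
            eps = 1e-10  # assumed smoothing for empty categories
            p = np.clip(p, eps, None); p = p / p.sum()
            q = np.clip(q, eps, None); q = q / q.sum()
            dist = float(np.sum(p * np.log(p / q)))
        rows.append({"Variable": var, "Metric": metric, "Distance": dist})
    return pd.DataFrame(rows)
```

Usage with a mixed-type frame returns one row per variable, each with the type-appropriate metric.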
## Predictor analysis
Understanding which predictors contribute most to imputation quality helps with feature selection and model interpretation. These tools analyze predictor-target relationships and evaluate sensitivity to predictor selection.