AppliedStatistics_notes/missing_data.Rmd at master · norcalbiostat/AppliedStatistics_notes · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
# Missing Data {#mda}

Missing Data happens. Not always

* General: Item non-response. Individual pieces of data are missing.
* Unit non-response: Records have some background data on all units, but some units don’t respond to any question.
* Monotonone missing data: Variables can be ordered such that one block of variables more observed than the next.

> This is a very brief, and very rough overview of identification and treatment of missing data. For more details (enough for an entire class) see Flexible Imputation of Missing Data, 2nd Ed, by Stef van Buuren: https://stefvanbuuren.name/fimd/

```{block2,  type='rmdnote'}
This section uses functions from the following additional packages: `mice`,`MASS`, `VIM`, and `forestplot`.
```

Some examples use a modified version of the Parental HIV data set  [(Codebook)](https://www.norcalbiostat.com/data/ParhivCodebook.txt) that has had some missing data created for demonstration purposes.

```{r}
library(VIM); library(mice)
load("data/mi_example.Rdata") #not available to public
```

## Identifying missing data

* Missing data in `R` is denoted as `NA`
* Arithmetic functions on missing data will return missing
```{r}
survey <- MASS::survey # to avoid loading the MASS library which will conflict with dplyr
head(survey$Pulse)
mean(survey$Pulse)
```

The `summary()` function will always show missing.
```{r}
summary(survey$Pulse)
```

The `is.na()` function is helpful to identify rows with missing data
```{r}
table(is.na(survey$Pulse))
```

The function `table()` will not show NA by default.
```{r}
table(survey$M.I)
table(survey$M.I, useNA="always")
```

* What percent of the data set is missing?
```{r}
round(prop.table(table(is.na(survey)))*100,1)
```

4% of the data points are missing.

* How much missing is there per variable?
```{r}
prop.miss <- apply(survey, 2, function(x) round(sum(is.na(x))/NROW(x),4))
prop.miss
```

The amount of missing data per variable varies from 0 to 19%.


### Visualize missing patterns

Using `ggplot2`
```{r}
pmpv <- data.frame(variable = names(survey), pct.miss =prop.miss)

ggplot(pmpv, aes(x=variable, y=pct.miss)) +
  geom_bar(stat="identity") + ylab("Percent") + scale_y_continuous(labels=scales::percent, limits=c(0,1)) +
  geom_text(data=pmpv, aes(label=paste0(round(pct.miss*100,1),"%"), y=pct.miss+.025), size=4)
```

Using `mice`
```{r}
library(mice)
md.pattern(survey)
```

This somewhat ugly output tells us that 168 records have no missing data, 38 records are missing only `Pulse` and 20 are missing both `Height` and `M.I`.

Using `VIM`
```{r, fig.width=8, fig.height=4, results='hide'}
library(VIM)
aggr(survey, col=c('chartreuse3','mediumvioletred'),
              numbers=TRUE, sortVars=TRUE,
              labels=names(survey), cex.axis=.7,
              gap=3, ylab=c("Missing data","Pattern"))
```

The plot on the left is a simplified, and ordered version of the ggplot from above, except the bars appear to be inflated because the y-axis goes up to 15% instead of 100%.

The plot on the right shows the missing data patterns, and indicate that 71% of the records has complete cases, and that everyone who is missing `M.I.` is also missing Height.

Another plot that can be helpful to identify patterns of missing data is a `marginplot` (also from `VIM`).

* Two continuous variables are plotted against each other.
* Blue bivariate scatterplot and univariate boxplots are for the observations where values on both variables are observed.
* Red univariate dotplots and boxplots are drawn for the data that is only observed on one of the two variables in question.
* The darkred text indicates how many records are missing on both.

```{r}
marginplot(survey[,c(6,10)])
```

This shows us that the observations missing pulse have the same median height, but those missing height have a higher median pulse rate.

### Example: Parental HIV

#### Identify missing

Entire data set
```{r}
table(is.na(hiv)) |> prop.table()
```

Only 3.7% of all values in the data set are missing.


#### Examine missing data patterns of scale variables.

The parental bonding and BSI scale variables are aggregated variables, meaning they are sums or means of a handful of component variables. That means if any one component variable is missing, the entire scale is missing. _E.g. if y = x1+x2+x3, then y is missing if any of x1, x2 or x3 are missing. _

```{r}
scale.vars <- hiv %>% select(parent_care:bsi_psycho, gender, siblings, age)
aggr(scale.vars, sortVars=TRUE, combined=TRUE, numbers=TRUE, cex.axis=.7)
```

34.7% of records are missing both `bsi_overall` and `bsi_depress` This makes sense since `bsi_depress` is a subscale containing 9 component variables and the `bsi_overall` is an average of all 52.

Another 15.5% of records are missing `parental_overprotection`.


Is there a bivariate pattern between missing and observed values of `bsi_depress` and `parent_overprotection`?
```{r}
marginplot(hiv[,c('bsi_depress', 'parent_overprotection')])
```

When someone is missing `parent_overprotection`, they have a lower `bsi_depress` score. Those missing `bsi_depress` have a slightly lower `parental_overprotection` score. Only 4 individuals are missing both values.


## Effects of Nonresponse

Textbook example: Example reported in W.G. Cochran, Sampling Techniques, 3rd edition, 1977, ch. 13

> Consider data that come form an experimental sampling of fruit orcharts in North Carolina in 1946.
> Three successive mailings of the same questionnaire were sent to growers. For one of the questions
> the number of fruit trees, complete data were available for the population...

<br>

|      Ave. # trees         | # of growers |  % of pop’n  | Ave # trees/grower  |
|---------------------------|--------------|--------------|---------------------|
| 1st mailing responders	  |   300	       |     10		    |   456               |
| 2nd mailing responders    |   543	       |     17		    |   382               |
| 3rd mailing responders	  |   434	       |     14		    |   340               |
| Nonresponders 	          |  1839	       |     59		    |   290               |
|                           |  --------    |  --------    |    --------         |
|    Total population       |   3116       |    100       |    329              |


* The overall response rate was very low.
* The rate of non response is clearly related to the average number of trees per grower.
* The estimate of the average trees per grower can be calculated as a weighted average from responders $\bar{Y_{1}}$ and non responders $\bar{Y_{2}}$.

**Bias**: The difference between the observed estimate $\bar{y}_{1}$ and the true parameter $\mu$.

$$
  \begin{aligned}
  E(\bar{y}_{1}) - \mu & = \bar{Y_{1}} - \bar{Y} \\
  & = \bar{Y}_{1} - \left[(1-w)\bar{Y}_{1} - w\bar{Y}_{2}\right] \\
  & = w(\bar{Y}_{1} - \bar{Y}_{2})
  \end{aligned}
$$

Where $w$ is the proportion of non-response.

* The amount of bias is the product of the proportion of non-response and the difference in the means between the responders and the non-responders.
* The sample provides no information about $\bar{Y_{2}}$, the size of the bias is generally unknown without information gained from external data.


## Missing Data Mechanisms

Process by which some units observed, some units not observed

* **Missing Completely at Random (MCAR)**: The probability that a data point is missing is completely unrelated (independent) of any observed and unobserved data or parameters.
    - P(Y missing| X, Y) = P(Y missing)
    - Ex: Miscoding or forgetting to log in answer
* **Missing at Random (MAR)**: The probability that a data point is missing is independent can be explained or modeled by other observed variables.
    - P(Y missing|x, Y) = P(Y missing | X)
    - Ex: Y = age, X = sex
		    - Pr (Y miss| X = male) = 0.2
		    - Pr (Y miss| X = female) = 0.3
		    - Males people are less likely to fill out an income survey
		    - The missing data on income is related to gender
		    - After accounting for gender the missing data is unrelated to income.
* **Not missing at Random (NMAR)**: The probability that a data point is missing depends on the value of the variable in question.
    - P(Y missing | X, Y) = P (Y missing|X, Y)
    - Ex: Y = income, X = immigration status
        - Richer person may be less willing to disclose income
        - Undocumented immigrant may be less willing to disclose income

![](images/q.png) Write down an example of each.

Does it matter to inferences?  <span style ="color:red">**Yes!**</span>

### Demonstration via Simulation
What follows is just _one_ method of approaching this problem via code. Simulation is a frequently used technique to understand the behavior of a process over time or over repeated samples.

#### MCAR
1. Draw a random sample of size 100 from a standard Normal distribution (Z) and calculate the mean.
```{r}
set.seed(456) # setting a seed ensures the same numbers will be drawn each time
z <- rnorm(100)
mean.z <- mean(z)
mean.z
```

2. Delete data at a rate of $p$ and calculate the complete case (available) mean.
    - Sample 100 random Bernoulli (0/1) variables with probability $p$.
    ```{r}
    x <- rbinom(100, 1, p=.5)
    ```
    - Find out which elements are are 1's
    ```{r}
    delete.these <- which(x==1)
    ```
    - Set those elements in `z` to `NA`.
    ```{r}
    z[delete.these] <- NA
    ```
    - Calculate the complete case mean
    ```{r}
    mean(z, na.rm=TRUE)
    ```

3. Calculate the bias as the sample mean minus the true mean ($E(\hat\theta) - \theta$).
```{r}
mean(z, na.rm=TRUE) - mean.z
```

How does the bias change as a function of the proportion of missing? Let $p$ range from 0% to 99% and plot the bias as a function of $p$.

```{r, echo=-1}
par(oma=c(0,0,0,0))
calc.bias <- function(p){ # create a function to handle the repeated calculations
  mean(ifelse(rbinom(100, 1, p)==1, NA, z), na.rm=TRUE) - mean.z
}

p <- seq(0,.99,by=.01)

plot(c(0,1), c(-1, 1), type="n", ylab="Bias", xlab="Proportion of missing")
  points(p, sapply(p, calc.bias), pch=16)
  abline(h=0, lty=2, col="blue")
```


![](images/q.png) What is the behavior of the bias as $p$ increases? Look specifically at the position/location of the bias, and the variance/variability of the bias.

#### NMAR: Missing related to data
What if the rate of missing is related to the value of the outcome? Again, let's setup a simulation to see how this works.

1. Randomly draw 100 random data points from a Standard Normal distribution to serve as our population, and 100 uniform random values between 0 and 1 to serve as probabilities of the data being missing ($p=P(miss)$)
```{r}
Z <- rnorm(100)
p <- runif(100, 0, 1)
```

2. Sort both in ascending order, shove into a data frame and confirm that $p(miss)$ increases along with $z$.
```{r, fig.height=4, fig.width=4}
dta <- data.frame(Z=sort(Z), p=sort(p))
head(dta)
ggplot(dta, aes(x=p, y=Z)) + geom_point() + xlab("P(missing)") + ylab("Z~Normal(0,1)")
```

3. Set $Z$ missing with probability equal to the $p$ for that row. Create a new vector `dta$z.miss` that is either 0, or the value of `dta$Z` with probability `1-dta$p`. Then change all the 0's to `NA`.

```{r}
dta$Z.miss <- dta$Z * (1-rbinom(NROW(dta), 1, dta$p))
head(dta) # see structure of data to understand what is going on
dta$Z.miss[dta$Z.miss==0] <- NA
```

5. Calculate the complete case mean and the bias
```{r}
mean(dta$Z.miss, na.rm=TRUE)
mean(dta$Z) - mean(dta$Z.miss, na.rm=TRUE)
```

[](images/q.png) Did the complete case estimate over- or under-estimate the true mean? Is the bias positive or negative?


#### NMAR: Pure Censoring
Consider a hypothetical blood test to measure a hormone that is normally distributed in the blood with mean 10$\mu g$ and variance 1. However the test to detect the compound only can detect levels above 10.
```{r}
z <- rnorm(100, 10, 1)
y <- z
y[y<10] <- NA
mean(z) - mean(y, na.rm=TRUE)
```

[](images/q.png) Did the complete case estimate over- or under-estimate the true mean?


```{block2, type="rmdnote"}
When the data is not missing at random, the bias can be much greater.
```

```{block2, type="rmdcaution"}
Usually you don't know the missing data mechanism.
```

**Degrees of difficulty**

* MCAR: is easiest to deal with.
* MAR: we can live with it.
* NMAR: most difficult to handle.

**Evidence?**

What can we learn from evidence in the data set at hand?

* May be evidence in the data rule out MCAR - test responders vs. nonresponders.
    - Example: Responders tend to have higher/lower average education than nonresponders by t-test
    - Example: Response more likely in one geographic area than another by chi-square test
* No evidence in data set to rule out MAR (although there may be evidence from an external data source)

**What is plausible?**

* Cochran example: when human behavior is involved, MCAR must be viewed as an extremely special case that would often be violated in practice
* Missing data may be introduced by design (e.g., measure some variables, don’t measure others for reasons of cost, response burden), in which case MCAR would apply
* MAR is much more common than MCAR
* We cannot be too cavalier about assuming MAR, but anecdotal evidence shows that it often is plausible when conditioning on enough information

**Ignorable Missing**

* If missing-data mechanism is MCAR or MAR then nonresponse is said to be "ignorable".
* Origin of name: in likelihood-based inference, both the data model and missing-data mechanism are important but with MCAR or MAR, inference can be based solely on the data model, thus making inference much simpler
* "_Ignorability_" is a relative assumption:  missingness on income may be NMAR given only gender, but may be MAR given gender, age, occupation, region of the country

## General strategies

Strategies for handling missing data include:

* Complete-case/available-case analysis: drop cases that make analysis inconvenient.
* If variables are known to contribute to the missing values, then appropriate modeling can often account for the missingness.
* Imputation procedures: fill in missing values, then analyze completed data sets using complete-date methods
* Weighting procedures: modify "design weights" (i.e., inverse probabilities of selection from sampling plan) to account for probability of response
* Model-based approaches: develop model for partially missing data, base inferences on likelihood under that model


### Complete cases analysis
If not all variables observed, delete case from analysis

* Advantages:
    - Simplicity
    - Common sample for all estimates
* Disadvantages:
    - Loss of valid information
    - Bias due to violation of MCAR


### Available-case analysis
* Use all cases where the variable of interest is present
    - Potentially different sets of cases for means of X and Y
    - and complete pairs for $r_{XY}$
* Tempting to think that available-case analysis will be superior to complete-case analysis
* But it can distort relationships between variables by not using a common base of observations for all quantities being estimated.

### Imputation
Fill in missing values, analyze completed data set

* Advantage:
    * Rectangular data set easier to analyze
    * Analysis data set $n$ matches summary table $n$
* Disadvantage:
    * "Both seductive and dangerous" (Little and Rubin)
    * Can understate uncertainty due to missing values.
    * Can induce bias if imputing under the wrong model.

## Imputation Methods

This section demonstrates each imputation method on the `bsi_depress` scale variable from the parental HIV example. To recap, 37% of the data on this variable is missing.

Create an index of row numbers containing missing values. This will be used to fill in those missing values with a data value.
```{r}
miss.dep.idx<- which(is.na(hiv$bsi_depress))
head(miss.dep.idx)
```

For demonstration purposes I will also create a copy of the `bsi_depress` variable so that the original is not overwritten for each example.

### Unconditional mean substitution.
  - Impute all missing data using the mean of observed cases
  - <span style ="color:red">Artificially decreases the variance</span>

```{r}
bsi_depress.ums <- hiv$bsi_depress # copy
complete.case.mean <- mean(hiv$bsi_depress, na.rm=TRUE)
bsi_depress.ums[miss.dep.idx] <- complete.case.mean
```

```{r, echo= FALSE, fig.height=4, fig.width=4}
imp1.data <- data.frame(value = bsi_depress.ums)
imp1.data$imputed <- "FALSE"
imp1.data$imputed[miss.dep.idx] <- "TRUE"

ggboxplot(imp1.data, y="value", fill = "imputed", x = "imputed")
```

Only a single value was used to impute missing data.

### Hot deck imputation
    - Impute values by randomly sampling values from observed data.
    - Good for categorical data
    - Reasonable for MCAR and MAR
    - `hotdeck` function in `VIM` available

```{r}
bsi_depress.hotdeck<- hiv$bsi_depress # copy
hot.deck <- sample(na.omit(hiv$bsi_depress), size = length(miss.dep.idx))
bsi_depress.hotdeck[miss.dep.idx] <- hot.deck
```

```{r, echo= FALSE}
imp2.data <- data.frame(value = bsi_depress.hotdeck)
imp2.data$imputed <- "FALSE"
imp2.data$imputed[miss.dep.idx] <- "TRUE"

ggviolin(imp2.data,
         y="value", x="imputed",
         fill = "imputed",
         add = c("boxplot", "jitter"),
         alpha = .2)
```

The distribution of imputed values better matches the distribution of observed data, but the distribution (Q1, Q3) is shifted lower a little bit.


### Model based imputation

* Conditional Mean imputation: Use regression on observed variables to estimate missing values
    * Predictions only available for cases with no missing covariates
    * Imputed value is the model predicted mean $\hat{\mu}_{Y|X}$
    * Could use `VIM::regressionImp()` function
* Predictive Mean Matching: Fills in a value randomly by sampling observed values whose regression-predicted values are closest to the regression-predicted value for the missing point.
    * Cross between hot-deck and conditional mean
    * Categorical data can be imputed using classification models
    * Less biased than mean substitution
    * but SE's could be inflated
    * Typically used in multivariate imputation (so not shown here)


Model `bsi_depress` using gender, siblings and age as predictors using linear regression.

```{r}
reg.model <- lm(bsi_depress ~ gender + siblings + age, hiv)
need.imp  <- hiv[miss.dep.idx, c("gender", "siblings", "age")]
reg.imp.vals <- predict(reg.model, newdata = need.imp)
bsi_depress.lm <- hiv$bsi_depress # copy
bsi_depress.lm[miss.dep.idx] <- reg.imp.vals
```

```{r, echo= FALSE}
imp3.data <- data.frame(value = bsi_depress.lm)
imp3.data$imputed <- FALSE
imp3.data$imputed[miss.dep.idx] <- TRUE
imp3.data$method <- "lm"

ggviolin(imp3.data,
         y="value", x="imputed",
         fill = "imputed",
         add = c("boxplot", "jitter"),
         alpha = .2)
```

It seems like only values around 0.5 and 0.8 were imputed values for `bsi_depress`. The imputed values don't quite match the distribution of observed values. Regression imputation and PMM seem to perform extremely similarily.


### Adding a residual

* Impute regression value $\pm$ a randomly selected residual based on estimated residual variance
* Over the long-term, we can reduce bias, on the average

```{r}
set.seed(1337)
rmse <- sqrt(summary(reg.model)$sigma)
eps <- rnorm(length(miss.dep.idx), mean=0, sd=rmse)
bsi_depress.lm.resid <- hiv$bsi_depress # copy
bsi_depress.lm.resid[miss.dep.idx] <- reg.imp.vals + eps
```

```{r, echo= FALSE}
impe.data <- data.frame(value = bsi_depress.lm.resid)
impe.data$imputed <- "FALSE"
impe.data$imputed[miss.dep.idx] <- "TRUE"

ggviolin(impe.data,
         y="value", x="imputed",
         fill = "imputed",
         add = c("boxplot", "jitter"),
         alpha = .2)
```

Well, the distribution of imputed values is spread out a bit more, but the imputations do not respect the truncation at 0 this `bsi_depress` value has.

### Comparison of Estimates

Create a table and plot that compares the point estimates and intervals for the average bsi depression scale.

```{r}
single.imp <- bind_rows(
data.frame(value = na.omit(hiv$bsi_depress),  method = "Observed"),
  data.frame(value = bsi_depress.ums, method = "Mean Sub"),
  data.frame(value = bsi_depress.hotdeck, method = "Hot Deck"),
  data.frame(value = bsi_depress.lm, method = "Regression"),
  data.frame(value = bsi_depress.lm.resid, method = "Reg + eps"))

single.imp$method <- forcats::fct_relevel(single.imp$method ,
      c("Observed", "Mean Sub", "Hot Deck", "Regression", "Reg + eps"))

si.ss <- single.imp %>%
  group_by(method) %>%
  summarize(mean = mean(value),
            sd = sd(value),
            se = sd/sqrt(n()),
            cil = mean-1.96*se,
            ciu = mean+1.96*se)
si.ss
```

```{r}
ggviolin(single.imp, y = "value",
          fill = "method", x = "method",
          add = "boxplot",
          alpha = .2)

ggplot(si.ss, aes(x=mean, y = method, col=method)) +
  geom_point() + geom_errorbar(aes(xmin=cil, xmax=ciu), width=0.2) +
  scale_x_continuous(limits=c(.5, 1)) +
  theme_bw() + xlab("Average BSI Depression score") + ylab("")
```


…but we can do better.


## Multiple Imputation (MI)

### Goals
* Accurately reflect available information
* Avoid bias in estimates of quantities of interest
* Estimation could involve explicit or implicit model
* Accurately reflect uncertainty due to missingness

### Technique
1. For each missing value, impute $m$ estimates (usually $m$ = 5)
    - Imputation method must include a random component
2. Create $m$ complete data sets
3. Perform desired analysis on each of the $m$ complete data sets
4. **Pool** final estimates in a manner that accounts for the between, and within imputation variance.


![Diagram of Multiple Imputation process. Credit: https://stefvanbuuren.name/fimd/sec-nutshell.html](https://stefvanbuuren.name/fimd/fig/ch01-miflow-1.png)


### MI as a paradigm
* Logic: "Average over" uncertainty, don’t assume most likely scenario (single imputation) covers all plausible scenarios
* Principle: Want nominal 95% intervals to cover targets of estimation 95% of the time
* Simulation studies show that, when MAR assumption holds:
    - Proper imputations will yield close to nominal coverage (Rubin 87)
    - Improvement over single imputation is meaningful
    - Number of imputations can be modest - even 2 adequate for many purposes, so 5 is plenty

_Rubin 87: Multiple Imputation for Nonresponse in Surveys, Wiley, 1987)._

### Inference on MI (Pooling estimates)

Consider $m$ imputed data sets. For some quantity of interest $Q$ with squared $SE = U$, calculate $Q_{1}, Q_{2}, \ldots, Q_{m}$ and $U_{1}, U_{2}, \ldots, U_{m}$ (e.g., carry out $m$ regression analyses, obtain point estimates and SE from each).

Then calculate the average estimate $\bar{Q}$, the average variance $\bar{U}$, and the variance of the averages $B$.

$$
  \begin{aligned}
  \bar{Q} & = \sum^{m}_{i=1}Q_{i}/m \\
  \bar{U} & = \sum^{m}_{i=1}U_{i}/m \\
  B & = \frac{1}{m-1}\sum^{m}_{i=1}(Q_{i}-\bar{Q})^2
  \end{aligned}
$$

Then $T = \bar{U} + \frac{m+1}{m}B$ is the estimated total variance of $\bar{Q}$.

Significance tests and interval estimates can be based on

$$\frac{\bar{Q}-Q}{\sqrt{T}} \sim t_{df}, \mbox{ where } df = (m-1)(1+\frac{1}{m+1}\frac{\bar{U}}{B})^2$$


* df are similar to those for comparison of normal means with unequal variances, i.e., using Satterthwaite approximation.
* Ratio of (B = between-imputation variance) to (T = between + within-imputation variance) is known as the fraction of missing information (FMI).
    - The FMI has been proposed as a way to monitor ongoing data collection and estimate the potential bias resulting from survey non-responders [Wagner, 2018](https://academic.oup.com/poq/article-abstract/74/2/223/1936466?redirectedFrom=fulltext)


### Example

1. Create $m$ imputed datasets using linear regression plus a small amount of random noise so all the imputed values are not identical

```{r}
set.seed(1061)
dep.imp1 <- dep.imp2 <- dep.imp3 <- regressionImp(bsi_depress ~ gender + siblings + age, hiv)
dep.imp1$bsi_depress[miss.dep.idx] <- dep.imp1$bsi_depress[miss.dep.idx] +
  rnorm(length(miss.dep.idx), mean=0, sd=rmse/2)

dep.imp2$bsi_depress[miss.dep.idx] <- dep.imp2$bsi_depress[miss.dep.idx] +
  rnorm(length(miss.dep.idx), mean=0, sd=rmse/2)

dep.imp3$bsi_depress[miss.dep.idx] <- dep.imp3$bsi_depress[miss.dep.idx] +
  rnorm(length(miss.dep.idx), mean=0, sd=rmse/2)

```

Visualize the distributions of observed and imputed
```{r}
dep.mi <- bind_rows(
  data.frame(value = dep.imp1$bsi_depress, imputed = dep.imp1$bsi_depress_imp,
             imp = "dep.imp1"),
  data.frame(value = dep.imp2$bsi_depress, imputed = dep.imp2$bsi_depress_imp,
             imp ="dep.imp2"),
  data.frame(value = dep.imp3$bsi_depress, imputed = dep.imp3$bsi_depress_imp,
             imp ="dep.imp3"))

ggdensity(dep.mi, x = "value", color = "imputed", fill = "imputed",
          add = "mean", rug=TRUE, palette = "jco") +
  facet_wrap(~imp, ncol=1)
```

2. Calculate the point estimate $Q$ and the variance $U$ from each imputation.

```{r}
(Q <- c(mean(dep.imp1$bsi_depress),
        mean(dep.imp2$bsi_depress),
        mean(dep.imp3$bsi_depress)))

n.d <- length(dep.imp1$bsi_depress)
(U <- c(sd(dep.imp1$bsi_depress)/sqrt(n.d),
        sd(dep.imp2$bsi_depress)/sqrt(n.d),
        sd(dep.imp3$bsi_depress)/sqrt(n.d)))
```

3. Pool estimates and calculate a 95% CI

```{r}
Q.bar <- mean(Q)          # average estimate
U.bar <- mean(U)           # average variance
B <- sd(Q)                 # variance of averages
Tv <- U.bar + ((3+1)/3)*B  # Total variance of estimate

df <- 2*(1+(U.bar/(4*B))^2) # degress of freedom
t95 <- qt(.975, df) # critical value for 95% CI

mi.ss <- data.frame(
  method = "MI Reg",
  mean = Q.bar,
  se = sqrt(Tv),
  cil = Q.bar - t95*sqrt(Tv),
  ciu = Q.bar + t95*sqrt(Tv))

(imp.ss <- bind_rows(si.ss, mi.ss))
```

```{r}
ggplot(imp.ss, aes(x=mean, y = method, col=method)) +
  geom_point() + geom_errorbar(aes(xmin=cil, xmax=ciu), width=0.2) +
  scale_x_continuous(limits=c(-.3, 2)) +
  theme_bw() + xlab("Average BSI Depression score") + ylab("")
```


## Multiple Imputation using Chained Equations (MICE)

![](images/mice.jpg)

### Overview

* Generates multiple imputations for incomplete multivariate data by Gibbs sampling.
* Missing data can occur anywhere in the data.
* Impute an incomplete column by generating 'plausible' synthetic values given other columns in the data.
* For predictors that are incomplete themselves, the most recently generated imputations are used to complete the predictors prior to imputation of the target column.
* A separate univariate imputation model can be specified for each column.
* The default imputation method depends on the measurement level of the target column.

```{block2, type="rmdtip"}
Your best reference guide to this section of the notes is the bookdown version of Flexible Imputation of Missing Data, by Stef van Buuren:

https://stefvanbuuren.name/fimd/ch-multivariate.html

For a more technical details about how the `mice` function works in R, see:
https://www.jstatsoft.org/article/view/v045i03
```

### Process / Algorithm

Consider a data matrix with 3 variables $y_{1}$, $y_{2}$, $y_{3}$, each with missing values. At iteration $(\ell)$:

1. Fit a model on $y_{1}^{(\ell-1)}$ using current values of $y_{2}^{(\ell-1)}, y_{3}^{(\ell-1)}$
2. Impute missing $y_{1}$, generating $y_{1}^{(\ell)}$
3. Fit a model on $y_{2}^{(\ell-1)}$ using current versions of $y_{1}^{(\ell)}, y_{3}^{(\ell-1)}$
4. Impute missing  $y_{2}$, generating $y_{2}^{(\ell)}$
5. Fit a model on $y_{3}$ using current versions of $y_{1}^{(\ell)}, y_{2}^{(\ell)}$
6. Impute missing  $y_{3}$, generating $y_{3}^{(\ell)}$
7. Start next cycle using updated values $y_{1}^{(\ell)}, y_{2}^{(\ell)}, y_{3}^{(\ell)}$

where $(\ell)$ cycles from 1 to $L$, before an imputed value is drawn.

### Convergence

How many imputations ($m$) should we create and how many iterations ($L$) should I run between imputations?

* Original research from Rubin states that small amount of imputations ($m=5$) would be sufficient.
* Advances in computation have resulted in very efficient programs such as `mice` - so generating a larger number of imputations (say $m=40$) are more common [Pan, 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4934387/)
* You want the number of iterations between draws to be long enough that the Gibbs sampler has converged.
* There is no test or direct method for determing convergence.
    - Plot parameter against iteration number, one line per chain.
    - These lines should be intertwined together, without showing trends.
    - Convergence can be identified when the variance between lines is smaller (or at least no larger) than the variance within the lines.

```{block2, type="rmdimportant"}
**Mandatory Reading**

Read 6.5.2: Convergence https://stefvanbuuren.name/fimd/sec-algoptions.html
```


### Imputation Methods

Some built-in imputation methods in the `mice` package are:

* _pmm_: Predictive mean matching (any) **DEFAULT FOR NUMERIC**
* _norm.predict_: Linear regression, predicted values (numeric)
* _mean_: Unconditional mean imputation (numeric)
* _logreg_: Logistic regression (factor, 2 levels) **DEFAULT**
* _logreg.boot_: Logistic regression with bootstrap
* _polyreg_: Polytomous logistic regression (factor, >= 2 levels) **DEFAULT**
* _lda_: Linear discriminant analysis (factor, >= 2 categories)
* _cart_: Classification and regression trees (any)
* _rf_: Random forest imputations (any)

## Diagnostics

Q: How do I know if the imputed values are plausible?
A: Create diagnostic graphics that plot the observed and imputed values together.

https://stefvanbuuren.name/fimd/sec-diagnostics.html


## Example: Prescribed amount of missing.

We will demonstrate using the Palmer Penguins dataset where we can artificially create a prespecified percent of the data missing, (after dropping the 11 rows missing sex) This allows us to be able to  estimate the bias incurred by using these imputation methods.

For the `penguin` data ) out we set a seed and use the `prodNA()` function from the `missForest` package to create 10% missing values in this data set.
```{r}
library(missForest)
set.seed(12345) # Raspberry, I HATE raspberry!
pen.nomiss <- na.omit(pen)
pen.miss <- prodNA(pen.nomiss, noNA=0.1)
prop.table(table(is.na(pen.miss)))
```

Visualize missing data pattern.
```{r}
aggr(pen.miss, col=c('darkolivegreen3','salmon'),
              numbers=TRUE, sortVars=TRUE,
              labels=names(pen.miss), cex.axis=.7,
              gap=3, ylab=c("Missing data","Pattern"))
```

Here's another example of where only 10% of the data overall is missing, but it results in only 58% complete cases.


### Multiply impute the missing data using `mice()`
```{r}
imp_pen <- mice(pen.miss, m=10, maxit=25, meth="pmm", seed=500, printFlag=FALSE)
summary(imp_pen)
```

```{block2, type="rmdnote"}
The Stack Exchange post listed below has a great explanation/description of what each of these arguments control. It is a **very** good idea to understand these controls.

https://stats.stackexchange.com/questions/219013/how-do-the-number-of-imputations-the-maximum-iterations-affect-accuracy-in-mul/219049#219049
```

### Check the imputation method used on each variable.
```{r}
imp_pen$meth
```

Predictive mean matching was used for all variables, even `species` and `island`. This is reasonable because PMM is a hot deck method of imputation.

### Check Convergence
```{r}
plot(imp_pen, c("bill_length_mm", "body_mass_g", "bill_depth_mm"))
```

The variance across chains is no larger than the variance within chains.

### Look at the values generated for imputation
```{r}
imp_pen$imp$body_mass_g |> head()
```

This is just for us to see what this imputed data look like. Each column is an imputed value, each row is a row where an imputation for `body_mass_g` was needed. Notice only imputations are shown, no observed data is showing here.

### Create a complete data set by filling in the missing data using the imputations
```{r, eval=FALSE}
pen_1 <- complete(imp_pen, action=1)
```
Action=1 returns the first completed data set, action=2 returns the second completed data set, and so on.

#### Alternative - Stack the imputed data sets in _long_ format.
```{r}
pen_long <- complete(imp_pen, 'long')
```

By looking at the `names` of this new object we can confirm that there are indeed 10 complete data sets with $n=333$ in each.

```{r}
names(pen_long)
table(pen_long$.imp)
```


### Visualize Imputations
Let's compare the imputed values to the observed values to see if they are indeed "plausible" We want to see that the distribution of of the magenta points (imputed) matches the distribution of the blue ones (observed).

**Univariately**
```{r, fig.width=8}
densityplot(imp_pen)
```

**Multivariately**
```{r, fig.width=8, fig.height=8}
xyplot(imp_pen, bill_length_mm ~ bill_depth_mm + flipper_length_mm  | species + island, cex=.8, pch=16)
```

**Analyze and pool**
All of this imputation was done so we could actually perform an analysis!

Let's run a simple linear regression on `body_mass_g` as a function of `bill_length_mm`, `flipper_length_mm` and `species`.

```{r}
model <- with(imp_pen, lm(body_mass_g ~ bill_length_mm + flipper_length_mm + species))
summary(pool(model))
```

Pooled parameter estimates $\bar{Q}$ and their standard errors $\sqrt{T}$ are provided, along with a significance test (against $\beta_p=0$). Note with this output that a 95% interval must be calculated manually.

We can leverage the `gtsummary` package to tidy and print the results of a `mids` object, but the mice object has to be passed to `tbl_regression` BEFORE you pool. [ref SO post](https://stackoverflow.com/questions/65314702/using-tbl-regression-with-imputed-data-pooled-regression-models). This function needs to access features of the original model first, then will do the appropriate pooling and tidying.

```{r}
gtsummary::tbl_regression(model)
```


Additionally digging deeper into the object created by `pool(model)`, specifically the `pooled` list, we can pull out additional information including the number of missing values, the _fraction of missing information_ (`fmi`) as defined by Rubin (1987), and `lambda`, the proportion of total variance that is attributable to the missing data ($\lambda = (B + B/m)/T)$.

```{r}
kable(pool(model)$pooled[,c(1:4, 8:9)], digits=3)
```


### Calculating bias
The penguins data set used here had no missing data to begin with. So we can calculate the "true" parameter estimates...
```{r}
true.model <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm + species, data = pen.nomiss)
```
and find the difference in coefficients.

The variance of the multiply imputed estimates is larger because of the between-imputation variance.

```{r, fig.width=10}

tm.est <- true.model |> coef() |> broom::tidy() |> mutate(model = "True Model") |>
  rename(est = x)
tm.est$cil <- confint(true.model)[,1]
tm.est$ciu <- confint(true.model)[,2]
tm.est <- tm.est[-1,] # drop intercept

mi <- tbl_regression(model)$table_body |>
  select(names = label, est = estimate, cil=conf.low, ciu=conf.high) |>
  mutate(model = "MI") |> filter(!is.na(est))

pen.mi.compare <- bind_rows(tm.est, mi)
pen.mi.compare$names <- gsub("species", "", pen.mi.compare$names)

ggplot(pen.mi.compare, aes(x=est, y = names, col=model)) +
  geom_point() + geom_errorbar(aes(xmin=cil, xmax=ciu), width=0.2) +
  theme_bw()
```

```{r, echo=FALSE}
pen.mi.compare %>%
  tidyr::pivot_wider(id_cols = names, names_from = model, values_from = est) %>%
  mutate(bias = MI - `True Model`) %>% kable()
```

MI over estimates the difference in body mass between Chinstrap and Adelie, but underestiamtes that difference for Gentoo. There is also an underestimation of the relationship between bill length and body mass.


## Post MICE data management

Sometimes you'll have a need to do additional data management after imputation has been completed. Creating binary indicators of an event, re-creating scale variables etc. The general approach is to transform the imputed data into long format using `complete` **with the argument `include=TRUE`** , do the necessary data management, and then convert it back to a `mids` object type.

Continuing with the penguin example, let's create a new variable that is the ratio of bill length to depth.

Recapping prior steps of imputing, and then creating the completed long data set.

```{r, eval=2}
imp_pen <- mice(pen.miss, m=10, maxit=25, meth="pmm", seed=500, printFlag=FALSE)
pen_long <- complete(imp_pen, 'long', include=TRUE)
```

We create the new ratio variable on the long data:
```{r}
pen_long$ratio <- pen_long$bill_length_mm / pen_long$bill_depth_mm
```

Let's visualize this to see how different the distributions are across imputation. Notice imputation "0" still has missing data - this is a result of using `include = TRUE` and keeping the original data as part of the `pen_long` data.
```{r}
ggboxplot(pen_long, y="ratio", x="species", facet.by = ".imp")
```

Then convert the data back to `mids` object, specifying the variable name that identifies the imputation number.

```{r}
imp_pen1 <- as.mids(pen_long, .imp = ".imp")
```

Now we can conduct analyses such as an ANOVA (in linear model form) to see if this ratio differs significantly across the species.

```{r}
nova.ratio <- with(imp_pen1, lm(ratio ~ species))
pool(nova.ratio) |> summary()
```


## Final thoughts


> "In our experience with real and artificial data..., the practical conclusion appears to be that multiple imputation, when carefully done, can be safely used with real problems even when the ultimate user may be applying models or analyses not contemplated by the imputer." - Little & Rubin (Book, p. 218)


* Don't ignore missing data.
* Impute sensibly and multiple times.
* It's typically desirable to include many predictors in an imputation model, both to
    - improve precision of imputed values
    - make MAR assumption more plausible
* But the number of covariance parameters goes up as the square of the number of variables in the model,
  - implying practical limits on the number of variables for which parameters can be estimated well
* MI applies to subjects who have a general missingness pattern, i.e., they have measurements on some variables, but not on others.
* But, subjects can be lost to follow up due to death or other reasons (i.e., attrition).
* Here we have only baseline data, but not the outcome or other follow up data.
* If attrition subjects are eliminated from the sample, they can produce non-response or attrition bias.
* Use attrition weights.
    - Given a baseline profile, predict the probability that subject will
      stay and use the inverse probability as weight.
    - e.g., if for a given profile all subjects stay, then the predicted probability
      is 1 and the attrition weight is 1. Such a subject "counts once".
    - For another profile, the probability may be 0.5, attrition weight is
      1/.5 = 2 and that person "counts twice".
* For differential drop-out, or self-selected treatment, you can consider using Propensity Scores.


## Additional References

* Little, R. and Rubin, D. Statistical Analysis with Missing Data, 2nd Ed., Wiley, 2002
    - Standard reference
    - Requires some math
* Allison, P. Missing Data, Sage, 2001
    - Small and cheap
    - Requires very little math
* Multiple Imputation.com http://www.stefvanbuuren.nl/mi/

* Applied Missing Data Analysis with SPSS and (R) Studio https://bookdown.org/mwheymans/Book_MI/
* http://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
* http://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/


Imputation methods for complex survey data and data not missing at random is an open research topic. Read more about this here: https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_mi_sect032.htm

![](https://shiring.github.io/netlify_images/mice_sketchnote_gxjsgc.jpg)
https://shirinsplayground.netlify.com/2017/11/mice_sketchnotes/