# Test Data Fixtures

## hrs_edid_validation.csv

**Source:** Dobkin, C., Finkelstein, A., Kluender, R., & Notowidigdo, M. J. (2018).
"The Economic Consequences of Hospital Admissions." *American Economic Review*, 108(2), 308-352.
Replication kit: https://www.openicpsr.org/openicpsr/project/116186/version/V1/view

**Sample selection:** Follows Sun & Abraham (2021), as used in Section 6 of
Chen, Sant'Anna & Xie (2025):

1. Read `HRS_long.dta` from the Dobkin et al. replication kit
2. Keep waves 7-11 and retain only individuals present in all 5 waves
3. Filter to ever-hospitalized individuals with `first_hosp >= 8`
4. Filter to ages 50-59 at hospitalization (`age_hosp`)
5. Drop wave 11 (every remaining individual has been hospitalized by then, so no valid comparison group is left)
6. Recode `first_hosp == 11` as never-treated (`inf`), since that cohort is untreated in all retained waves

**Expected counts:** `G` is the treatment cohort (the `first_treat` value). Cohort rows count individuals and sum to the total of 656; the row count is 656 individuals x 4 waves.

| Quantity | Expected value |
|----------|----------------|
| Total individuals | 656 |
| Waves | 7, 8, 9, 10 |
| Rows | 2,624 |
| G=8 | 252 |
| G=9 | 176 |
| G=10 | 163 |
| G=inf | 65 |

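The expected counts above can be checked mechanically. The following is a minimal sketch, not part of the existing test suite: `summarize_fixture` and `matches_expected` are hypothetical helper names, and the sketch assumes the CSV has exactly the four columns `unit`, `time`, `outcome`, `first_treat`.

```python
import numpy as np
import pandas as pd

# Expected values copied from the table above.
EXPECTED_COHORTS = {8.0: 252, 9.0: 176, 10.0: 163, np.inf: 65}
EXPECTED_WAVES = [7, 8, 9, 10]


def summarize_fixture(df: pd.DataFrame) -> dict:
    """Summarize a long-format panel with columns unit, time, outcome, first_treat."""
    # first_treat is constant within a unit, so take one value per individual;
    # cohort sizes then count individuals rather than rows.
    cohort_of_unit = df.groupby("unit")["first_treat"].first()
    cohorts = cohort_of_unit.value_counts().to_dict()
    return {
        "n_units": int(df["unit"].nunique()),
        "n_rows": int(len(df)),
        "waves": sorted(int(w) for w in df["time"].unique()),
        "cohorts": {float(g): int(n) for g, n in cohorts.items()},
    }


def matches_expected(summary: dict) -> bool:
    """Return True when the summary agrees with the documented counts."""
    return (
        summary["n_units"] == 656
        and summary["n_rows"] == 2624
        and summary["waves"] == EXPECTED_WAVES
        and summary["cohorts"] == EXPECTED_COHORTS
    )
```

A possible use is `matches_expected(summarize_fixture(pd.read_csv("tests/data/hrs_edid_validation.csv")))`, which should be true for a correctly regenerated file.
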
**Columns:** `unit` (hhidpn), `time` (wave), `outcome` (oop_spend, 2005 dollars), `first_treat` (first_hosp)

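Because never-treated units are stored as `inf`, it can help to confirm how `first_treat` reads back from CSV. This is a minimal sketch, assuming pandas' default CSV parser, which reads the literal `inf` as floating-point infinity. The inline rows are made-up illustrative values, not real fixture data, and the `treated` column is a hypothetical derived indicator that is not in the file.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows in the fixture's column layout (not taken from the real CSV).
sample_csv = io.StringIO(
    "unit,time,outcome,first_treat\n"
    "1,7,100.0,8.0\n"
    "1,8,250.0,8.0\n"
    "2,7,80.0,inf\n"
    "2,8,90.0,inf\n"
)
df = pd.read_csv(sample_csv)

# "inf" is parsed as a float, so first_treat comes back as float64, not object.
never_treated = np.isinf(df["first_treat"])

# Hypothetical indicator: a unit-period is treated once time reaches the unit's
# first treatment wave. The comparison is always False for never-treated units.
df["treated"] = df["time"] >= df["first_treat"]
```

Storing the never-treated cohort as infinity keeps the column numeric, so comparisons against `time` need no special case.
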
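**Regeneration:** Requires the Dobkin et al. replication kit, unpacked under `replication_data/` (which is `.gitignore`d). The script uses relative paths, so run it from the directory that contains `replication_data/` and `tests/`.
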
```python
import numpy as np
import pandas as pd

df = pd.read_stata("replication_data/116186-V1/Replication-Kit/HRS/Data/HRS_long.dta")

# Step 2: keep waves 7-11 and individuals observed in all 5 waves
sub = df[df["wave"].isin([7, 8, 9, 10, 11])]
balanced = sub.groupby("hhidpn")["wave"].nunique()
sub = sub[sub["hhidpn"].isin(balanced[balanced == 5].index)]

# Step 3: ever-hospitalized individuals first hospitalized in wave 8 or later
sub = sub[sub["hhidpn"].isin(sub[sub["first_hosp"].notna()]["hhidpn"].unique())]
fh = sub.groupby("hhidpn")["first_hosp"].first()
sub = sub[sub["hhidpn"].isin(fh[fh >= 8].index)]

# Step 4: ages 50-59 at hospitalization
ages = sub.groupby("hhidpn")["age_hosp"].first()
sub = sub[sub["hhidpn"].isin(ages[(ages >= 50) & (ages <= 59)].index)]

# Step 5: drop wave 11; .copy() avoids a SettingWithCopyWarning on the assignment below
sub = sub[sub["wave"] <= 10].copy()

# Step 6: recode the wave-11 cohort as never-treated
sub["first_treat"] = sub["first_hosp"].apply(lambda x: np.inf if x == 11 else int(x))

# Select and rename the output columns, then write a sorted CSV
out = sub[["hhidpn", "wave", "oop_spend", "first_treat"]].copy()
out.columns = ["unit", "time", "outcome", "first_treat"]
out["unit"] = out["unit"].astype(int)
out["time"] = out["time"].astype(int)
out.sort_values(["unit", "time"]).reset_index(drop=True).to_csv(
    "tests/data/hrs_edid_validation.csv", index=False
)
```