Add DGP functions to prep.py for all supported DiD designs #80

Merged: igerber merged 2 commits into main from feat/consolidate-dgp-functions on Jan 19, 2026

igerber (Owner) commented Jan 19, 2026

Consolidate Data Generating Process functions from tutorials and tests into diff_diff/prep.py as reusable library utilities:

  • generate_staggered_data(): Staggered adoption DiD (CallawaySantAnna, SunAbraham)
  • generate_factor_data(): Factor model data (TROP, SyntheticDiD)
  • generate_ddd_data(): Triple Difference (DDD) designs
  • generate_panel_data(): Panel data with optional parallel trends violations
  • generate_event_study_data(): Event study with simultaneous treatment
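
As a rough illustration of the triple-difference design listed above, the sketch below builds noise-free cell means for a 2x2x2 layout and recovers the treatment effect with the DDD contrast. It is a standalone toy and not the library's generate_ddd_data(); the names cell_mean and ddd_estimate and every coefficient value are assumptions chosen for the example.

```python
from itertools import product


def cell_mean(g, p, t, effect=2.0):
    """Noise-free mean outcome for one cell of a 2x2x2 design (illustrative coefficients)."""
    return (
        10.0                                        # baseline level
        + 1.0 * g + 0.5 * p + 0.8 * t               # main effects
        + 0.3 * g * p + 0.2 * g * t + 0.4 * p * t   # two-way interactions
        + effect * g * p * t                        # treatment only in the G=1, P=1, T=1 cell
    )


def ddd_estimate(means):
    """Triple difference: the DiD within P=1 minus the DiD within P=0."""
    def did(p):
        treated_change = means[(1, p, 1)] - means[(1, p, 0)]
        control_change = means[(0, p, 1)] - means[(0, p, 0)]
        return treated_change - control_change
    return did(1) - did(0)


# All 8 cells (group x partition x time)
means = {(g, p, t): cell_mean(g, p, t) for g, p, t in product((0, 1), repeat=3)}
print(round(ddd_estimate(means), 10))
```

Because the main effects and two-way interactions difference out, the contrast returns the effect placed in the G=1, P=1, T=1 cell, up to floating-point rounding.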

Changes:

  • Add 5 new DGP functions to diff_diff/prep.py with full documentation
  • Export new functions from diff_diff/__init__.py
  • Add 33 tests covering all new functions in tests/test_prep.py
  • Update test files to use library functions where compatible
  • Update tutorials 02, 04, 07, 08, 10 to import from library
  • Fix pre-existing API bug in tutorial 07 (show_mdv -> mdv parameter)

Users can now generate synthetic data via:
  from diff_diff import generate_staggered_data, generate_factor_data, ...

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

igerber (Owner, Author) commented Jan 19, 2026

Code Review: PR #80 - Add DGP functions to prep.py for all supported DiD designs

Author: igerber
Branch: feat/consolidate-dgp-functions -> main
Files Changed: 11


Executive Summary

This PR consolidates Data Generating Process (DGP) functions from tutorials and tests into the main library (diff_diff/prep.py). It adds 5 new well-documented DGP functions, exports them properly, includes comprehensive test coverage (33 new tests), and updates tutorials/tests to use the new library functions. The PR also includes a legitimate bug fix in tutorial 07.


Part 1: Methodology Review

Statistical Correctness

All 5 new DGP functions implement correct data generating processes:

  1. generate_staggered_data() - Correctly implements staggered adoption with:

    • Cohort assignment to treatment periods
    • Dynamic treatment effects that grow with time since treatment: effect * (1 + growth * t)
    • Unit fixed effects and time trends
    • Proper handling of never-treated units (first_treat=0)
  2. generate_factor_data() - Correctly implements interactive fixed effects model:

    • Y_it = mu + alpha_i + beta_t + Lambda_i'F_t + tau*D_it + eps_it
    • Factor loadings are systematically shifted for treated units (creates confounding)
    • Appropriate for TROP/SyntheticDiD testing
  3. generate_ddd_data() - Correctly implements 2x2x2 triple difference structure:

    • All 8 cells (group x partition x time) populated
    • Second-order interactions included (group*partition, group*time, partition*time)
    • Treatment effect only applied to G=1, P=1, T=1 cell
  4. generate_panel_data() - Correctly implements panel with optional parallel trends violation:

    • When parallel_trends=False, treated units get steeper pre-treatment trend
    • Useful for testing diagnostics
  5. generate_event_study_data() - Correctly implements simultaneous treatment event study:

    • Event time column relative to treatment
    • Suitable for MultiPeriodDiD and HonestDiD testing
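
To make the dynamic-effect formula in item 1 concrete, here is a minimal standard-library sketch of a staggered-adoption panel. It is not the code in prep.py: the function name staggered_panel, its parameters, the column names other than first_treat, and the choice that t counts periods elapsed since the first treated period (so t = 0 at adoption) are all assumptions made for illustration.

```python
import random


def staggered_panel(cohorts, n_periods, effect=2.0, growth=0.1, seed=0):
    """Toy staggered panel; `cohorts` maps unit id -> first treated period (0 = never treated)."""
    rng = random.Random(seed)  # local generator, no global state
    rows = []
    for unit, first_treat in cohorts.items():
        unit_fe = rng.gauss(0.0, 1.0)  # unit fixed effect
        for period in range(1, n_periods + 1):
            treated = first_treat > 0 and period >= first_treat
            # Assumed convention: t is 0 in the first treated period
            t = period - first_treat if treated else 0
            tau = effect * (1 + growth * t) if treated else 0.0
            # Outcome = fixed effect + linear time trend + treatment effect + noise
            y = unit_fe + 0.5 * period + tau + rng.gauss(0.0, 0.1)
            rows.append({
                "unit": unit,
                "period": period,
                "first_treat": first_treat,
                "treated": int(treated),
                "tau": tau,
                "y": y,
            })
    return rows


rows = staggered_panel({1: 3, 2: 5, 3: 0}, n_periods=6)
```

With effect=2.0 and growth=0.1, a unit first treated in period 3 has a true effect of 2.0 in period 3 and 2.0 * (1 + 0.1 * 2) = 2.4 in period 5, while the never-treated unit keeps an effect of zero throughout.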

Bug Fix Validation

The PR fixes a legitimate API bug in tutorial 07 where show_mdv=True was used with plot_pretrends_power(). The correct parameter is mdv=<float>. This was verified by checking both the old tutorial code and the function signature in visualization.py:1400-1415.


Part 2: Issues Found

Critical Issues

None.

Medium Issues

None.

Minor Issues

  1. Newline at EOF removed in notebook files - Several notebook files have the trailing newline removed (e.g., 02_staggered_did.ipynb). This is cosmetic but can cause git diff noise.

  2. RuntimeWarnings during tests - The test suite produces divide by zero and overflow warnings in linalg.py and triple_diff.py. These appear to be pre-existing issues unrelated to this PR, but worth noting.
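
For the first minor issue, a small check along the following lines could flag files that lost their trailing newline. This is only a sketch using the standard library; the helper name missing_final_newline and the throwaway file names are hypothetical and are not part of the repository.

```python
import tempfile
from pathlib import Path


def missing_final_newline(paths):
    """Return the paths whose contents are non-empty and do not end with a newline byte."""
    flagged = []
    for path in paths:
        data = Path(path).read_bytes()
        if data and not data.endswith(b"\n"):
            flagged.append(Path(path))
    return flagged


# Demonstration on two throwaway files
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.ipynb"
    bad = Path(tmp) / "bad.ipynb"
    good.write_bytes(b"{}\n")
    bad.write_bytes(b"{}")
    flagged_names = [p.name for p in missing_final_newline([good, bad])]

print(flagged_names)
```

In a real checkout the same helper could be pointed at the notebook files touched by the PR, for example by globbing for *.ipynb under the tutorials directory.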


Part 3: Security Assessment

No security issues identified. The new functions:

  • Only generate synthetic data in-memory using numpy random generators
  • Do not perform I/O operations
  • Do not accept file paths or execute commands
  • Use proper parameter validation

Part 4: Documentation Assessment

Excellent documentation quality:

  • All 5 new functions have comprehensive docstrings with:

    • Full parameter descriptions with types and defaults
    • Returns section documenting output columns
    • Working Examples section showing usage
    • Notes section where applicable (e.g., factor model confounding)
  • Functions are properly exported in __init__.py with organized grouping

  • __all__ list updated to include new exports

  • Docstrings include usage examples with actual estimator integration

Minor gap: The CLAUDE.md file was not updated to document the new functions in the Module Structure section. However, since the functions follow the existing pattern in prep.py, this is a minor omission.


Part 5: Performance Assessment

No performance concerns:

  • All functions use vectorized numpy operations
  • np.random.default_rng() used for modern random number generation (not the legacy global np.random.seed())
  • No unnecessary memory allocation or loops
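
The reproducibility benefit of a local generator can be sketched as follows. This assumes NumPy is installed; the helper draw is hypothetical and only shows the seeding pattern, not any function in prep.py.

```python
import numpy as np


def draw(seed, n=5):
    """Draw n standard normals from a generator that is local to this call."""
    rng = np.random.default_rng(seed)  # independent of NumPy's global random state
    return rng.normal(size=n)


a = draw(42)
b = draw(42)
c = draw(43)

print(np.array_equal(a, b))  # same seed, same draws
print(np.array_equal(a, c))  # different seed, different draws
```

Because each call builds its own Generator, two callers with the same seed get identical data regardless of what other code has done to the global random state.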

Part 6: Maintainability Assessment

Positive:

  • Test files refactored to use library DGP functions via thin wrapper functions
  • Wrappers maintain backward compatibility (e.g., column naming like 'period' vs 'time')
  • DRY principle applied - DGP code consolidated from 5+ locations to 1
  • Tutorials simplified significantly (e.g., 48 lines -> 2 lines for data generation)
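
The thin-wrapper pattern described above might look roughly like this. Both functions are hypothetical stand-ins: generate_rows plays the role of a library DGP, and the assumption that the library emits 'period' while older tests expect 'time' is chosen only to illustrate the renaming; the actual direction in the PR may differ.

```python
def generate_rows(n_units=2, n_periods=3):
    """Hypothetical stand-in for a library DGP that names its time column 'period'."""
    return [
        {"unit": u, "period": t, "y": float(u + t)}
        for u in range(n_units)
        for t in range(n_periods)
    ]


def generate_test_rows(**kwargs):
    """Thin wrapper that keeps the older 'time' column name existing tests rely on."""
    return [
        {("time" if key == "period" else key): value for key, value in row.items()}
        for row in generate_rows(**kwargs)
    ]


rows = generate_test_rows(n_units=2, n_periods=2)
print(rows[0])
```

Keeping the wrapper this small means the data-generating logic lives in one place while each test module preserves the column names it was written against.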

Code organization:

  • All new code added to existing prep.py module (appropriate location)
  • Test classes follow existing naming conventions (TestGenerateStaggeredData, etc.)
  • Imports organized correctly

Recommendations

Must Fix (before merge)

None.

Should Fix

  1. Add the new functions to CLAUDE.md's Module Structure section for diff_diff/prep.py

Nice to Have

  1. Consider restoring trailing newlines in notebook files for cleaner git history
  2. Investigate the RuntimeWarnings in linalg.py:162 and triple_diff.py:301-323 (pre-existing, not blocking)

Final Assessment

| Category | Notes |
|---|---|
| Methodology | All DGPs correctly implemented |
| Code Quality | Clean, well-tested, follows patterns |
| Security | No concerns |
| Documentation | Excellent docstrings and examples |
| Performance | Vectorized, efficient |
| Maintainability | Reduces duplication, good organization |

Overall Verdict: Approved

The PR achieves its goal of consolidating DGP functions into a reusable library API. The implementation is statistically correct, well-documented, and thoroughly tested. The tutorial updates reduce code duplication and the bug fix in tutorial 07 is legitimate. All 236 tests pass (94 in test_prep.py + 142 in related test files).


Review generated by Claude Code

- Add new DGP functions to CLAUDE.md Module Structure section
- Restore trailing newlines to modified notebook files
- Add RuntimeWarnings investigation items to TODO.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

igerber merged commit 5d7dd8f into main on Jan 19, 2026
4 checks passed
igerber deleted the feat/consolidate-dgp-functions branch on January 19, 2026 14:47