
Evaluate pandas vs polars usage and consider migration strategy #3

@jalengg

Current State

PyHealth currently depends on both pandas and polars:

# requirements.txt
pandas>=1.3.2
polars

Questions to Investigate

  1. Where is each library used?

    • Which modules/functions use pandas?
    • Which modules/functions use polars?
    • Is there overlap or duplication?
  2. Can we consolidate?

    • Is pandas only used for legacy compatibility?
    • Can existing pandas usage be migrated to polars?
    • What would be the breaking changes?
  3. Performance implications

    • Polars is generally faster than pandas on large datasets (multithreaded execution, lazy query optimization)
    • What performance gains could we expect?
    • Are there any cases where pandas is still preferable?
  4. Maintenance burden
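
For question 1, a rough first pass could walk the source tree and record which files import each library. The sketch below is only a starting point and uses nothing beyond the standard library. The `pyhealth` directory name and the helper names `imported_targets` and `audit` are assumptions for illustration, not existing PyHealth code.

```python
import ast
from collections import defaultdict
from pathlib import Path

TARGETS = {"pandas", "polars"}


def imported_targets(source):
    """Return the subset of TARGETS imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        # Compare on the top-level package so `pandas.api.types` counts as pandas.
        found.update(n.split(".")[0] for n in names if n.split(".")[0] in TARGETS)
    return found


def audit(root):
    """Map each target library to the files under `root` that import it."""
    usage = defaultdict(list)
    for path in sorted(Path(root).rglob("*.py")):
        try:
            libs = imported_targets(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be read or parsed
        for lib in libs:
            usage[lib].append(str(path))
    return dict(usage)


if __name__ == "__main__":
    usage = audit("pyhealth")  # assumed package directory; adjust to the real layout
    for lib in sorted(TARGETS):
        print(f"{lib}: {len(usage.get(lib, []))} files")
    overlap = set(usage.get("pandas", [])) & set(usage.get("polars", []))
    print(f"files importing both: {len(overlap)}")
```

This only catches static `import` statements. Dynamic imports and DataFrames passed in by callers would still need a manual look.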

Potential Actions

  • Audit codebase for pandas usage
  • Audit codebase for polars usage
  • Benchmark performance differences on typical PyHealth workloads
  • Create migration plan if consolidation makes sense
  • Document decision and rationale
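
For the benchmarking item, one possible harness times the same operation in each library that happens to be installed. Everything in this sketch is an assumption: the synthetic `patient_id`/`value` table stands in for a real PyHealth workload, the helper names are made up, and the group-by mean is just one representative operation. It also assumes a polars version recent enough to provide `DataFrame.group_by`.

```python
import importlib.util
import timeit


def best_time(fn, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))


def make_rows(n):
    """Build a synthetic table (hypothetical stand-in for a real event table)."""
    return {
        "patient_id": [i % 1000 for i in range(n)],
        "value": [float(i) for i in range(n)],
    }


def build_workloads(rows):
    """Return {library name: zero-argument callable} for each installed library."""
    workloads = {}
    if importlib.util.find_spec("pandas"):
        import pandas as pd

        pdf = pd.DataFrame(rows)
        workloads["pandas"] = lambda: pdf.groupby("patient_id")["value"].mean()
    if importlib.util.find_spec("polars"):
        import polars as pl

        pldf = pl.DataFrame(rows)
        # `group_by` is the spelling in recent polars releases (assumed here).
        workloads["polars"] = lambda: pldf.group_by("patient_id").agg(
            pl.col("value").mean()
        )
    return workloads


if __name__ == "__main__":
    for name, fn in build_workloads(make_rows(200_000)).items():
        print(f"{name}: {best_time(fn):.4f}s")
```

Numbers from a synthetic table like this are only indicative. A real comparison should reuse the data-loading and processing paths PyHealth actually runs.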

Context

This issue arose while fixing #2, where pandas version constraints (<2) caused installation failures on Python 3.12. Depending on both libraries suggests an incomplete migration or an unclear strategy.

Metadata

Labels

enhancement (New feature or request)
