Missing YAML config files in installed package #21

@jalengg

Problem

When PyHealth is installed via pip (from git or PyPI), the YAML configuration files in pyhealth/datasets/configs/ are not included in the installed package. As a result, loading any dataset that reads one of these files raises FileNotFoundError.

Error Example

from pyhealth.datasets import HALO_MIMIC3Dataset

dataset = HALO_MIMIC3Dataset(mimic3_dir="data")

Results in:

FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.12/dist-packages/pyhealth/datasets/configs/hcup_ccs_2015_definitions_benchmark.yaml'

Affected Datasets

This issue affects multiple datasets that rely on YAML configuration files:

  • halo_mimic3.py → hcup_ccs_2015_definitions_benchmark.yaml
  • mimic3.py → mimic3.yaml
  • mimic4.py → mimic4_cxr.yaml, mimic4_ehr.yaml, mimic4_note.yaml
  • ehrshot.py → ehrshot.yaml
  • covid19_cxr.py → covid19_cxr.yaml
  • medical_transcriptions.py → medical_transcriptions.yaml
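To check whether a given environment is affected, the file names listed above can be compared against what is actually on disk. The following is a minimal sketch, not part of PyHealth: the helper names `missing_configs` and `installed_package_dir` are hypothetical, and the directory layout (`datasets/configs/` under the package root) is taken from the path in the error message.

```python
from importlib.util import find_spec
from pathlib import Path

# File names taken from the "Affected Datasets" list in this issue.
EXPECTED_CONFIGS = [
    "hcup_ccs_2015_definitions_benchmark.yaml",
    "mimic3.yaml",
    "mimic4_cxr.yaml",
    "mimic4_ehr.yaml",
    "mimic4_note.yaml",
    "ehrshot.yaml",
    "covid19_cxr.yaml",
    "medical_transcriptions.yaml",
]


def missing_configs(package_dir, expected=EXPECTED_CONFIGS):
    """Return the expected config names absent from <package_dir>/datasets/configs/."""
    configs_dir = Path(package_dir) / "datasets" / "configs"
    return [name for name in expected if not (configs_dir / name).is_file()]


def installed_package_dir(name="pyhealth"):
    """Locate an installed top-level package directory without importing it.

    Returns None when the package is not installed.
    """
    spec = find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


if __name__ == "__main__":
    pkg = installed_package_dir()
    if pkg is None:
        print("pyhealth is not installed in this environment")
    else:
        print("missing config files:", missing_configs(pkg))
```

An empty list means every config file named in this issue is present; on an affected installation the list should contain all eight names.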

Root Cause

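The setup.py sets include_package_data=True, but there is no MANIFEST.in file to specify which non-Python files belong in the package. That flag only picks up data files that are listed in MANIFEST.in (or tracked through a version-control plugin such as setuptools-scm); with neither in place, setuptools packages only the .py files.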

Solution

Create a MANIFEST.in file in the repository root:

include README.rst
include requirements.txt
include LICENSE
recursive-include pyhealth/datasets/configs *.yaml *.yml

This tells setuptools to include all YAML files in the configs directory when building the package.
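To preview which files the `recursive-include` line would select before building anything, its matching rule can be approximated in a few lines. This is only a rough sketch of the documented behavior (every file under the directory whose name matches one of the patterns), not setuptools' actual implementation; the function name `recursive_include` is hypothetical.

```python
from fnmatch import fnmatch
from pathlib import Path


def recursive_include(root, directory, patterns):
    """Approximate MANIFEST.in's `recursive-include <directory> <patterns...>`.

    Returns the sorted, root-relative POSIX paths of every file anywhere
    under <root>/<directory> whose name matches at least one glob pattern.
    """
    root = Path(root)
    base = root / directory
    if not base.is_dir():
        return []
    matches = [
        path.relative_to(root).as_posix()
        for path in base.rglob("*")
        if path.is_file() and any(fnmatch(path.name, p) for p in patterns)
    ]
    return sorted(matches)
```

Run from the repository root, `recursive_include(".", "pyhealth/datasets/configs", ["*.yaml", "*.yml"])` should list the same YAML files that are expected to appear in the built package.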

Verification

After the fix, verify with:

python setup.py sdist
tar -tzf dist/pyhealth-*.tar.gz | grep "\.yaml"

The output should list:

pyhealth-1.1.4/pyhealth/datasets/configs/covid19_cxr.yaml
pyhealth-1.1.4/pyhealth/datasets/configs/ehrshot.yaml
pyhealth-1.1.4/pyhealth/datasets/configs/hcup_ccs_2015_definitions_benchmark.yaml
pyhealth-1.1.4/pyhealth/datasets/configs/medical_transcriptions.yaml
pyhealth-1.1.4/pyhealth/datasets/configs/mimic3.yaml
pyhealth-1.1.4/pyhealth/datasets/configs/mimic4_cxr.yaml
pyhealth-1.1.4/pyhealth/datasets/configs/mimic4_ehr.yaml
pyhealth-1.1.4/pyhealth/datasets/configs/mimic4_note.yaml
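The same check can be scripted so that it also covers wheels, which is what pip usually installs from. The sketch below uses only the standard-library `tarfile` and `zipfile` modules; the function name `yaml_members` is hypothetical, and any archive file names passed to it depend on the local build.

```python
import tarfile
import zipfile


def yaml_members(archive_path):
    """List the .yaml/.yml entries in an sdist (.tar.gz) or a wheel (.whl)."""
    archive_path = str(archive_path)
    if archive_path.endswith((".whl", ".zip")):
        # Wheels are ordinary zip archives.
        with zipfile.ZipFile(archive_path) as zf:
            names = zf.namelist()
    else:
        # "r:*" lets tarfile detect the compression automatically.
        with tarfile.open(archive_path, "r:*") as tf:
            names = tf.getnames()
    return sorted(n for n in names if n.endswith((".yaml", ".yml")))
```

For the sdist built above, the result should match the eight paths listed in this section; for a wheel built from the same tree, the entries should be the same files without the `pyhealth-1.1.4/` prefix.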

Context

Discovered while fixing the HALO Colab notebook in PR sunlabuiuc#528. The notebook installs PyHealth from git and users encountered this error when trying to load the HALO_MIMIC3Dataset.

This is a project-wide packaging issue that affects any user installing PyHealth via pip rather than running from source.
