Add subdirectory support for hf:// URLs and gs:// scheme for GCS #406

Merged
anth-volk merged 1 commit into master from feature/hf-subdirectory-support
Nov 27, 2025

@baogorek
Collaborator

Summary

  • Add subdirectory support for hf:// URLs (e.g., hf://owner/repo/path/to/file.h5)
  • Add gs:// URL scheme for Google Cloud Storage (e.g., gs://bucket/path/to/file.h5)

Changes

hf:// improvements

  • Add parse_hf_url() helper function to centralize URL parsing
  • Support subdirectory paths: hf://owner/repo/path/to/file.h5[@version]
  • Fix inconsistent parsing between dataset.py and simulation.py
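As a rough illustration of the URL shape described above, here is a minimal sketch of what a centralized parser could look like. This is not the code from the PR: the `HFLocation` return type, the error messages, and the choice to split the version at the last `@` are all assumptions made for the example.

```python
from typing import NamedTuple, Optional


class HFLocation(NamedTuple):
    """Hypothetical container for the parts of an hf:// URL."""

    owner: str
    repo: str
    path: str  # may contain subdirectories, e.g. "states/RI.h5"
    version: Optional[str]


def parse_hf_url(url: str) -> HFLocation:
    """Sketch: split hf://owner/repo/path/to/file.h5[@version] into parts."""
    prefix = "hf://"
    if not url.startswith(prefix):
        raise ValueError(f"Not an hf:// URL: {url!r}")
    rest = url[len(prefix):]

    # An optional "@version" suffix is taken from the last "@" (assumption).
    version: Optional[str] = None
    if "@" in rest:
        rest, version = rest.rsplit("@", 1)
        if not version:
            raise ValueError(f"Empty version in URL: {url!r}")

    # The first two segments are owner and repo; everything after is the
    # file path, so subdirectories are preserved.
    parts = rest.split("/")
    if len(parts) < 3 or any(not p for p in parts):
        raise ValueError(f"Expected hf://owner/repo/path[@version]: {url!r}")
    owner, repo = parts[0], parts[1]
    return HFLocation(owner, repo, "/".join(parts[2:]), version)
```

Keeping this logic in one helper means dataset.py and simulation.py can call the same function instead of each splitting the URL their own way, which is the inconsistency the PR describes.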

gs:// new functionality

  • Add google_cloud.py module with parse_gs_url(), download_gcs_file(), upload_gcs_file()
  • Support URLs: gs://bucket/path/to/file.h5[@version]
  • google-cloud-storage is optional - raises helpful ImportError if not installed
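The gs:// handling and the optional dependency could be sketched along the following lines. The function names come from the PR description, but the signatures, return shape, error text, and the `_require_gcs` helper are assumptions for illustration. How the PR maps `@version` onto GCS is not stated, so the sketch only parses it.

```python
from typing import Optional, Tuple


def parse_gs_url(url: str) -> Tuple[str, str, Optional[str]]:
    """Sketch: split gs://bucket/path/to/file.h5[@version] into parts."""
    prefix = "gs://"
    if not url.startswith(prefix):
        raise ValueError(f"Not a gs:// URL: {url!r}")
    rest = url[len(prefix):]

    # An optional "@version" suffix is taken from the last "@" (assumption).
    version: Optional[str] = None
    if "@" in rest:
        rest, version = rest.rsplit("@", 1)
        if not version:
            raise ValueError(f"Empty version in URL: {url!r}")

    # The bucket is the first segment; the rest is the object path.
    bucket, _, path = rest.partition("/")
    if not bucket or not path:
        raise ValueError(f"Expected gs://bucket/path[@version]: {url!r}")
    return bucket, path, version


def _require_gcs():
    """Hypothetical helper: import google-cloud-storage only when needed."""
    try:
        from google.cloud import storage
    except ImportError as exc:
        raise ImportError(
            "gs:// URLs require the optional google-cloud-storage package. "
            "Install it with: pip install google-cloud-storage"
        ) from exc
    return storage


def download_gcs_file(bucket: str, file_path: str, local_path: str) -> str:
    """Sketch: download one object using default credentials (assumed signature)."""
    storage = _require_gcs()
    client = storage.Client()
    client.bucket(bucket).blob(file_path).download_to_filename(local_path)
    return local_path
```

Deferring the import to call time keeps the package importable for users who never touch gs:// URLs, and the error message tells everyone else what to install.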

Test plan

  • 6 new tests for parse_hf_url()
  • 10 new tests for parse_gs_url()
  • All 449 existing tests pass

Closes #405

🤖 Generated with Claude Code

@baogorek baogorek force-pushed the feature/hf-subdirectory-support branch 2 times, most recently from 17d4673 to 8920e59 Compare November 26, 2025 18:53
- Add parse_hf_url() helper function to centralize hf:// URL parsing
- Support subdirectory paths: hf://owner/repo/path/to/file.h5[@version]
- Add google_cloud.py module with parse_gs_url(), download_gcs_file(), upload_gcs_file()
- Support gs:// URLs: gs://bucket/path/to/file.h5[@version]
- Update dataset.py download/upload methods for both schemes
- Update simulation.py dataset initialization for both schemes
- google-cloud-storage is optional - raises helpful ImportError if not installed

Closes #405

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@baogorek baogorek force-pushed the feature/hf-subdirectory-support branch from 8920e59 to bf95770 Compare November 26, 2025 18:58
@baogorek
Collaborator Author

I just tested this manually as well:

In [1]: from policyengine_us import Microsimulation

In [2]: sim = Microsimulation(dataset="hf://policyengine/test/states/RI.h5")

In [3]: sim.calculate('household_id').weights.sum()
Out[3]: np.float64(393240.457935619)

In [4]: sim = Microsimulation(dataset="gs://policyengine-us-data/states/RI.h5")

In [5]: sim.calculate('household_id').weights.sum()
Out[5]: np.float64(393240.457935619)

Collaborator

@anth-volk anth-volk left a comment

Thanks for this @baogorek! Tested locally and found no issues.

@anth-volk anth-volk merged commit 8698c06 into master Nov 27, 2025
14 checks passed
@anth-volk anth-volk deleted the feature/hf-subdirectory-support branch November 27, 2025 16:18