feat: Add Hugging Face Datasets plugin #629
andreahlert wants to merge 10 commits into flyteorg:main
@pingsutw @cosmicBboy @kumare3 could you take a look? This ports the huggingface plugin from flytekit to v2, same approach as the polars plugin.
Add a new plugin that provides native support for HuggingFace datasets.Dataset as a Flyte DataFrame type, enabling seamless serialization/deserialization through Parquet format.

Features:
- DataFrameEncoder/Decoder for datasets.Dataset <-> Parquet
- Cloud storage support (S3, GCS, Azure) via fsspec storage options
- Anonymous S3 fallback for public datasets
- Column filtering on both encode and decode
- Auto-registration via flyte.plugins.types entry point

Signed-off-by: André Ahlert <andre@aex.partners>
… infra

- Use storage.get_configured_fsspec_kwargs() instead of get_storage() (fix review)
- Add [tool.uv.sources] flyte editable for dev (match Anthropic/OpenAI)
- Conftest: use LocalDB._get_db_path and reset _conn (match Polars after main)
- Tests: patch flyte.storage._storage.get_storage; run.outputs()[0]; skip empty dataset to avoid CI flakiness

Signed-off-by: André Ahlert <andre@aex.partners>
…, DataFrame Signed-off-by: André Ahlert <andre@aex.partners>
Hi Ketan! With 2.0 out I’ve rebased and addressed several of your comments: using the public storage API instead of get_storage, and public flyte.io imports where possible.
- Renamed get_hf_storage_options to _get_storage_options (no public API exposure).
- Removed column filtering from encode to match the Polars plugin pattern.
- Removed misleading backwards-compatibility comment on module-level registration.
- Synced conftest cache isolation to use LocalDB._get_db_path.

Signed-off-by: André Ahlert <andre@aex.partners>
Replace datasets.Dataset.from_parquet with pq.read_table + datasets.Dataset(table).

from_parquet routes through HuggingFace's DownloadManager, which caches files locally under ~/.cache/huggingface/datasets/ before reading. For Flyte-managed storage this is wasteful (double I/O) and bypasses the fsspec filesystem we already have configured.

Using pq.read_table with the Flyte filesystem reads directly from storage with no intermediate cache, removes the NoCredentialsError anonymous fallback, and avoids relying on storage_options flowing through **kwargs in from_parquet.

Signed-off-by: André Ahlert <andre@aex.partners>
…lesystem

Both encode and decode now use pq.write_table/pq.read_table with the Flyte filesystem directly, removing the asymmetry where encode went through HuggingFace's to_parquet (fsspec.open) and decode used pyarrow.

Removes _get_storage_options entirely along with its 8 unit tests. Enables the empty dataset test that was previously skipped due to from_parquet not handling empty parquet files.

Signed-off-by: André Ahlert <andre@aex.partners>
Addressed all comments. Switched both encode and decode to use pq.write_table/pq.read_table with the Flyte filesystem directly. This avoids HuggingFace's DownloadManager cache entirely and keeps credentials handled by Flyte's storage layer. As a side effect, _get_storage_options is gone and the empty dataset test is no longer skipped since that was a from_parquet limitation. Also dropped column filtering from encode to match the Polars plugin pattern.
Summary
Port of the flytekit-huggingface plugin to flyte-sdk v2, enabling native support for datasets.Dataset as a Flyte DataFrame type.
- DataFrameEncoder/Decoder for datasets.Dataset with Parquet serialization
- Auto-registration via the flyte.plugins.types entry point

Follows the same patterns as the existing Polars plugin.
Usage Example
Test plan
flyte.TaskEnvironment