What happened?
When using the Polars backend, calling read_parquet with multiple file paths (a list of specific Parquet files, not a partitioned directory) unexpectedly falls back to using pyarrow.dataset instead of the native Polars scan_parquet.
This behavior is misleading and unnecessary for non-partitioned use cases.
Modern Polars versions support reading multiple Parquet files directly via scan_parquet, so the current fallback causes degraded performance and confusing behavior.
⸻
🔍 Reproduction
import ibis

con = ibis.polars.connect()
t = con.read_parquet([
    "s3://bucket/path/file1.parquet",
    "s3://bucket/path/file2.parquet"
])
Expected:
Use Polars’ native scan_parquet (which supports lists and globs).
Actual:
Automatically switches to pyarrow.dataset → uses pl.scan_pyarrow_dataset, which performs worse and behaves differently.
⸻
🧠 Root Cause & Background
Looking at the Ibis implementation of read_parquet for the Polars engine:
if not isinstance(path, (str, Path)) and len(path) > 1:
    self._import_pyarrow()
    import pyarrow.dataset as ds

    path = [normalize_filename(p) for p in path]
    obj = pl.scan_pyarrow_dataset(
        source=ds.dataset(path, format="parquet"), **kwargs
    )
    self._add_table(table_name, obj)
else:
    path = normalize_filename(path)
    self._add_table(table_name, pl.scan_parquet(path, **kwargs))
This fallback seems to have originated from Polars’ own recommendation in their documentation:
“Partitioned files: If you have a directory-nested (hive-style) partitioned dataset, you should use the scan_pyarrow_dataset() method to read that data instead.”
— Polars Docs: scan_parquet
However, in my case, I’m not using a partitioned directory — just providing a list of specific S3 file paths such as:
["s3://bucket/file1.parquet", "s3://bucket/file2.parquet"]
Polars handles this natively and efficiently using scan_parquet, without requiring PyArrow.
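For reference, the equivalent pure-Polars call already works on a list of files (the paths here are illustrative):

import polars as pl

# scan_parquet accepts a list of file paths directly; no PyArrow required
lf = pl.scan_parquet([
    "s3://bucket/file1.parquet",
    "s3://bucket/file2.parquet",
])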
Therefore, this fallback penalizes valid, non-partitioned use cases by:
• Introducing an unnecessary PyArrow dependency and performance cost
• Misleading users who expect Ibis + Polars to behave like Polars itself
• Making it harder to apply Polars-native configuration or optimizations
⸻
💡 Proposed Solution
Allow users to opt out of the pyarrow.dataset fallback (or remove it entirely), for example by adding a flag like:
t = con.read_parquet([...], use_pyarrow_dataset=False)
or by detecting whether the provided paths represent a true partitioned dataset.
If the input is just a list of standalone file paths (not a directory or hive-style layout), scan_parquet should be used.
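A minimal sketch of what the revised logic could look like, assuming a new opt-in flag (use_pyarrow_dataset is hypothetical, not an existing Ibis parameter; normalize_filename, _import_pyarrow, and _add_table are the existing helpers from the snippet above):

def read_parquet(self, path, table_name=None, *, use_pyarrow_dataset=False, **kwargs):
    if use_pyarrow_dataset:
        # Hypothetical explicit opt-in, e.g. for hive-style partitioned directories
        self._import_pyarrow()
        import pyarrow.dataset as ds

        obj = pl.scan_pyarrow_dataset(
            source=ds.dataset(path, format="parquet"), **kwargs
        )
    elif isinstance(path, (str, Path)):
        obj = pl.scan_parquet(normalize_filename(path), **kwargs)
    else:
        # A plain list of files: modern scan_parquet reads it natively
        obj = pl.scan_parquet([normalize_filename(p) for p in path], **kwargs)
    self._add_table(table_name, obj)

Defaulting the flag to False keeps the native scan_parquet path as the common case while preserving the pyarrow.dataset escape hatch for genuinely partitioned data.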
⸻
✅ Expected Behavior
read_parquet([...]) should use Polars’ native scan_parquet implementation for any list of paths that isn’t a partitioned directory.
⸻
🙋‍♂️ Personal Note
I’d be more than happy to open a Pull Request with a proposed solution — either adding an optional use_pyarrow_dataset parameter or improving the path-handling logic to better align with current Polars capabilities.
⸻
📎 Additional Context
Polars docs explicitly mention scan_pyarrow_dataset() only for partitioned datasets, not for general multi-file reads.
Modern Polars versions support reading multiple paths natively and efficiently with scan_parquet.
Removing or making the fallback configurable would make the Ibis–Polars integration more intuitive and performant.
What version of ibis are you using?
11.0.0 (or any)
What backend(s) are you using, if any?
Polars