bug: read_parquet with Polars engine downgrades to pyarrow.dataset when given multiple paths #11705

@AlmogAl

What happened?

Description
When using the Polars backend, calling read_parquet with multiple file paths (a list of specific Parquet files, not a partitioned directory) unexpectedly falls back to using pyarrow.dataset instead of the native Polars scan_parquet.

This behavior is misleading and unnecessary for non-partitioned use cases.
Modern Polars versions support reading multiple Parquet files directly via scan_parquet, so the current fallback causes degraded performance and confusing behavior.

🔍 Reproduction

```python
import ibis

con = ibis.polars.connect()
t = con.read_parquet([
    "s3://bucket/path/file1.parquet",
    "s3://bucket/path/file2.parquet",
])
```

Expected:
Use Polars’ native scan_parquet (which supports lists and globs).

Actual:
Ibis silently switches to pyarrow.dataset and calls pl.scan_pyarrow_dataset, which performs worse and behaves differently.

🧠 Root Cause & Background

Looking at the Ibis implementation of read_parquet for the Polars engine:

```python
if not isinstance(path, (str, Path)) and len(path) > 1:
    self._import_pyarrow()
    import pyarrow.dataset as ds

    path = [normalize_filename(p) for p in path]
    obj = pl.scan_pyarrow_dataset(
        source=ds.dataset(path, format="parquet"), **kwargs
    )
    self._add_table(table_name, obj)
else:
    path = normalize_filename(path)
    self._add_table(table_name, pl.scan_parquet(path, **kwargs))
```

This fallback seems to have originated from Polars’ own recommendation in their documentation:

> “Partitioned files: If you have a directory-nested (hive-style) partitioned dataset, you should use the scan_pyarrow_dataset() method to read that data instead.”
>
> (Polars docs: scan_parquet)

However, in my case, I’m not using a partitioned directory — just providing a list of specific S3 file paths such as:

["s3://bucket/file1.parquet", "s3://bucket/file2.parquet"]

Polars handles this natively and efficiently using scan_parquet, without requiring PyArrow.
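
For comparison, here is a minimal sketch of the equivalent direct Polars call for the same two files. This assumes a reasonably recent Polars version; the paths are placeholders from the reproduction above, and real S3 reads may additionally need credentials via scan_parquet's storage_options parameter:

```python
import polars as pl

# Recent Polars accepts a list of files (or a glob) directly,
# so no PyArrow dataset is involved for this non-partitioned case.
lf = pl.scan_parquet([
    "s3://bucket/path/file1.parquet",
    "s3://bucket/path/file2.parquet",
])
df = lf.collect()  # lazily scans both files, then materializes the result
```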

Therefore, this fallback penalizes valid, non-partitioned use cases by:
• Introducing an unnecessary PyArrow dependency and performance cost
• Misleading users who expect Ibis + Polars to behave like Polars itself
• Making it harder to apply Polars-native configuration or optimizations

💡 Proposed Solution

Allow users to opt out of the pyarrow.dataset fallback (or remove it entirely), for example by adding a flag like:

```python
t = con.read_parquet([...], use_pyarrow_dataset=False)
```

or by detecting whether the provided paths represent a true partitioned dataset.

If the input is just a list of standalone file paths (not a directory or hive-style layout), scan_parquet should be used.
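
A minimal sketch of how that gate could look inside the backend, assuming the proposed use_pyarrow_dataset flag and reusing self._import_pyarrow, normalize_filename, and self._add_table from the snippet above (this is an illustration of the idea, not a finished patch):

```python
if (
    use_pyarrow_dataset  # proposed opt-in flag, defaulting to False
    and not isinstance(path, (str, Path))
    and len(path) > 1
):
    # Explicit opt-in: keep today's pyarrow.dataset behavior
    self._import_pyarrow()
    import pyarrow.dataset as ds

    paths = [normalize_filename(p) for p in path]
    obj = pl.scan_pyarrow_dataset(
        source=ds.dataset(paths, format="parquet"), **kwargs
    )
else:
    # Default: hand single paths, globs, or lists of plain files to Polars directly
    if isinstance(path, (str, Path)):
        path = normalize_filename(path)
    else:
        path = [normalize_filename(p) for p in path]
    obj = pl.scan_parquet(path, **kwargs)

self._add_table(table_name, obj)
```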

✅ Expected Behavior

read_parquet([...]) should use Polars’ native scan_parquet implementation for any list of paths that isn’t a partitioned directory.

🙋‍♂️ Personal Note

I’d be more than happy to open a Pull Request with a proposed solution — either adding an optional use_pyarrow_dataset parameter or improving the path-handling logic to better align with current Polars capabilities.


📎 Additional Context

• Polars docs explicitly mention scan_pyarrow_dataset() only for partitioned datasets, not for general multi-file reads.
• Modern Polars versions support reading multiple paths natively and efficiently with scan_parquet.

Removing or making the fallback configurable would make the Ibis–Polars integration more intuitive and performant.

What version of ibis are you using?

11.0.0 (or any)

What backend(s) are you using, if any?

Polars

Relevant log output

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels: bug (Incorrect behavior inside of ibis)
Status: backlog