Description of Bug:
It would be great if we could stream to s3:// through the data.source.coop endpoint using DuckDB, similar to what we can do with other S3-compatible interfaces (MinIO, Ceph, etc.). It looks like data.source.coop requires the configuration:

s3 =
  multipart_threshold = 44MB

which is understood by the AWS CLI client. Unfortunately, this option does not appear to be understood by other tools such as DuckDB (or, I think, GDAL) that otherwise implement much, but not all, of the S3 interface. This causes writes from these common geospatial utilities to source.coop to fail. (Note: I understand this is not the same thing as multipart_chunksize, which is configurable at least for GDAL.)
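For context, a sketch of where that setting lives for the AWS CLI (assuming the default profile; the `s3` nested settings are part of the standard AWS CLI config format):

```ini
# ~/.aws/config — AWS CLI per-profile S3 transfer settings
[default]
s3 =
  multipart_threshold = 44MB
```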
Steps to Reproduce:
import ibis

con = ibis.duckdb.connect()

# fill in creds
query = f'''
CREATE OR REPLACE SECRET source_coop (
    TYPE S3,
    KEY_ID '{key}',
    SECRET '{secret}',
    ENDPOINT 'data.source.coop',
    URL_STYLE 'path'
);
'''
con.raw_sql(query)

# Try to write a > 44 MB file:
(con
 .read_parquet("s3://cboettig/gbif/app/redlined_cities_gbif.parquet")
 .to_parquet("s3://cboettig/gbif/app/redlined_cities_gbif2.parquet")
)
Expected Behavior:
Writes a copy of the parquet file to the bucket.
Actual Behavior:
Error:
---------------------------------------------------------------------------
HTTPException Traceback (most recent call last)
Cell In[9], line 3
1 (con
2 .read_parquet("s3://cboettig/gbif/app/redlined_cities_gbif.parquet")
----> 3 .to_parquet( "s3://cboettig/gbif/app/redlined_cities_gbif2.parquet")
4 )
File /opt/conda/lib/python3.12/site-packages/ibis/expr/types/core.py:608, in Expr.to_parquet(self, path, params, **kwargs)
563 @experimental
564 def to_parquet(
565 self,
(...)
569 **kwargs: Any,
570 ) -> None:
571 """Write the results of executing the given expression to a parquet file.
572
573 This method is eager and will execute the associated expression
(...)
606 :::
607 """
--> 608 self._find_backend(use_default=True).to_parquet(self, path, **kwargs)
File /opt/conda/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py:1550, in Backend.to_parquet(self, expr, path, params, **kwargs)
1548 args = ["FORMAT 'parquet'", *(f"{k.upper()} {v!r}" for k, v in kwargs.items())]
1549 copy_cmd = f"COPY ({query}) TO {str(path)!r} ({', '.join(args)})"
-> 1550 with self._safe_raw_sql(copy_cmd):
1551 pass
File /opt/conda/lib/python3.12/contextlib.py:137, in _GeneratorContextManager.__enter__(self)
135 del self.args, self.kwds, self.func
136 try:
--> 137 return next(self.gen)
138 except StopIteration:
139 raise RuntimeError("generator didn't yield") from None
File /opt/conda/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py:323, in Backend._safe_raw_sql(self, *args, **kwargs)
321 @contextlib.contextmanager
322 def _safe_raw_sql(self, *args, **kwargs):
--> 323 yield self.raw_sql(*args, **kwargs)
File /opt/conda/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py:97, in Backend.raw_sql(self, query, **kwargs)
95 with contextlib.suppress(AttributeError):
96 query = query.sql(dialect=self.name)
---> 97 return self.con.execute(query, **kwargs)
HTTPException: HTTP Error: Unable to connect to URL https://data.source.coop/cboettig/gbif/app/redlined_cities_gbif2.parquet?partNumber=1&uploadId=mUB0k9fxk6YvYYJbN7SgpWfrCU2lzfLZ7FjULA2l_IzcigHjY15G06DTuYfoI70tBE01h5A9o.WOf3gnX8ranlinYQNvfJ7N5EZhmAkJ_6bnC2mO3deAIZZCPVfFe8pRuHRGYRgE5I2xY_wWiDC_tZ3WdDRyvL7QAqHuv7j5GXs- Payload Too Large (HTTP code 413)