
Payload Too Large (HTTP code 413) -- Revisit Multipart threshold behavior? #40

@cboettig

Description of Bug:

It would be great if we could stream to s3:// URIs through the data.source.coop endpoint using duckdb, as we can with other S3-compatible services (MinIO, Ceph, etc.). It looks like data.source.coop requires the configuration:

s3 =
  multipart_threshold = 44MB

which is understood by the aws cli client. Unfortunately, this option does not appear to be understood by other tools such as duckdb (or, I think, GDAL) that otherwise implement much, but not all, of the S3 interface. As a result, writes to source.coop from these common geospatial utilities fail. (Note, I understand this is not the same thing as multipart_chunksize, which is configurable at least in GDAL.)
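For context, that snippet is an aws cli setting that lives in ~/.aws/config; a minimal example of where it sits (the profile section name here is illustrative, not something source.coop prescribes):

```ini
# ~/.aws/config -- illustrative placement of the setting quoted above
[default]
s3 =
  multipart_threshold = 44MB
```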

Steps to Reproduce:

import streamlit as st
import ibis

con = ibis.duckdb.connect()

# fill in creds: key and secret are placeholders
query = f'''
    CREATE OR REPLACE SECRET source_coop (
        TYPE S3,
        KEY_ID '{key}',
        SECRET '{secret}',
        ENDPOINT 'data.source.coop',
        URL_STYLE 'path'
    );
'''
con.raw_sql(query)


# Try to write a > 44 MB file:

(con
.read_parquet("s3://cboettig/gbif/app/redlined_cities_gbif.parquet")
.to_parquet( "s3://cboettig/gbif/app/redlined_cities_gbif2.parquet")
)
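The 413 below is consistent with the client picking a request size above the server's per-request payload limit. A rough sketch of how the threshold and chunk size interact with such a limit (all numbers here are illustrative assumptions, not source.coop's actual limits):

```python
# Sketch: how a client's multipart_threshold and chunk size interact
# with a hypothetical server-side per-request payload limit.
MB = 1024 * 1024

def plan_upload(file_size, multipart_threshold, chunk_size):
    """Return the list of request body sizes a client would send."""
    if file_size < multipart_threshold:
        return [file_size]  # single PUT: the whole payload in one request
    parts = []
    remaining = file_size
    while remaining > 0:
        part = min(chunk_size, remaining)
        parts.append(part)
        remaining -= part
    return parts

# Hypothetical server limit on a single request body:
SERVER_LIMIT = 50 * MB

# A 60 MB file with a very high threshold goes out as one 60 MB request,
# which exceeds the limit -> 413:
single = plan_upload(60 * MB, multipart_threshold=5 * 1024 * MB, chunk_size=8 * MB)
print(all(p <= SERVER_LIMIT for p in single))  # False

# Lowering the threshold (as the aws cli config above does) splits the
# upload into parts that each fit under the limit:
parts = plan_upload(60 * MB, multipart_threshold=44 * MB, chunk_size=8 * MB)
print(all(p <= SERVER_LIMIT for p in parts))   # True
```

The point of the issue is that duckdb's S3 writer gives no way to lower that threshold, so the oversized request cannot be avoided from the client side.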

Expected Behavior:

Writes a copy of the parquet file to the bucket.

Actual Behavior:

Error:

---------------------------------------------------------------------------
HTTPException                             Traceback (most recent call last)
Cell In[9], line 3
      1 (con
      2 .read_parquet("s3://cboettig/gbif/app/redlined_cities_gbif.parquet")
----> 3 .to_parquet( "s3://cboettig/gbif/app/redlined_cities_gbif2.parquet")
      4 )

File /opt/conda/lib/python3.12/site-packages/ibis/expr/types/core.py:608, in Expr.to_parquet(self, path, params, **kwargs)
    563 @experimental
    564 def to_parquet(
    565     self,
   (...)
    569     **kwargs: Any,
    570 ) -> None:
    571     """Write the results of executing the given expression to a parquet file.
    572 
    573     This method is eager and will execute the associated expression
   (...)
    606     :::
    607     """
--> 608     self._find_backend(use_default=True).to_parquet(self, path, **kwargs)

File /opt/conda/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py:1550, in Backend.to_parquet(self, expr, path, params, **kwargs)
   1548 args = ["FORMAT 'parquet'", *(f"{k.upper()} {v!r}" for k, v in kwargs.items())]
   1549 copy_cmd = f"COPY ({query}) TO {str(path)!r} ({', '.join(args)})"
-> 1550 with self._safe_raw_sql(copy_cmd):
   1551     pass

File /opt/conda/lib/python3.12/contextlib.py:137, in _GeneratorContextManager.__enter__(self)
    135 del self.args, self.kwds, self.func
    136 try:
--> 137     return next(self.gen)
    138 except StopIteration:
    139     raise RuntimeError("generator didn't yield") from None

File /opt/conda/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py:323, in Backend._safe_raw_sql(self, *args, **kwargs)
    321 @contextlib.contextmanager
    322 def _safe_raw_sql(self, *args, **kwargs):
--> 323     yield self.raw_sql(*args, **kwargs)

File /opt/conda/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py:97, in Backend.raw_sql(self, query, **kwargs)
     95 with contextlib.suppress(AttributeError):
     96     query = query.sql(dialect=self.name)
---> 97 return self.con.execute(query, **kwargs)

HTTPException: HTTP Error: Unable to connect to URL https://data.source.coop/cboettig/gbif/app/redlined_cities_gbif2.parquet?partNumber=1&uploadId=mUB0k9fxk6YvYYJbN7SgpWfrCU2lzfLZ7FjULA2l_IzcigHjY15G06DTuYfoI70tBE01h5A9o.WOf3gnX8ranlinYQNvfJ7N5EZhmAkJ_6bnC2mO3deAIZZCPVfFe8pRuHRGYRgE5I2xY_wWiDC_tZ3WdDRyvL7QAqHuv7j5GXs- Payload Too Large (HTTP code 413)

Metadata

Assignees: none
Labels: bug (Something isn't working)
Milestone: none