146 changes: 146 additions & 0 deletions plugins/duckdb/README.md
@@ -0,0 +1,146 @@
# DuckDB Plugin for Flyte

Run DuckDB SQL queries as Flyte tasks with parameterized inputs, extension support, and DataFrame output.

DuckDB is an embedded analytical database (like SQLite for OLAP). Queries execute locally and synchronously, so no remote credentials or connection setup is required.

## Installation

```bash
pip install flyteplugins-duckdb
```

## Quick start

```python
import pandas as pd
from flyteplugins.duckdb import DuckDB, DuckDBConfig

import flyte

config = DuckDBConfig()

query = DuckDB(
    name="count_rows",
    query_template="SELECT COUNT(*) AS total FROM 'data.parquet'",
    plugin_config=config,
    output_dataframe_type=pd.DataFrame,
)
```

## In-memory queries

By default, DuckDB runs in-memory. This is ideal for ad-hoc analytics and querying files directly:

```python
config = DuckDBConfig() # defaults to database_path=":memory:"

task = DuckDB(
    name="analyze",
    query_template="SELECT * FROM 'sales.parquet' WHERE amount > 100",
    plugin_config=config,
    output_dataframe_type=pd.DataFrame,
)
```

## File-based databases

To query a persistent DuckDB database file:

```python
config = DuckDBConfig(database_path="/data/analytics.duckdb")

task = DuckDB(
    name="query_db",
    query_template="SELECT * FROM customers LIMIT 10",
    plugin_config=config,
    output_dataframe_type=pd.DataFrame,
)
```

## Parameterized queries

Use `%(name)s` placeholders in `query_template` and declare each parameter's type in `inputs`:

```python
lookup = DuckDB(
    name="lookup_user",
    query_template="SELECT * FROM 'users.parquet' WHERE id = %(user_id)s",
    plugin_config=config,
    inputs={"user_id": int},
    output_dataframe_type=pd.DataFrame,
)
```
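
The README does not show how the plugin binds these placeholders internally. As a minimal sketch of the general idea, assuming values are passed separately from the SQL text instead of being formatted into it, the snippet below rewrites `%(name)s` placeholders into the `?` style that DuckDB's Python client accepts and collects the values in order. `to_positional` is a hypothetical helper written for illustration; it is not part of `flyteplugins-duckdb`.

```python
import re

# Matches pyformat-style placeholders such as %(user_id)s.
_PLACEHOLDER = re.compile(r"%\((\w+)\)s")


def to_positional(query_template: str, inputs: dict) -> tuple[str, list]:
    """Rewrite %(name)s placeholders as ? and collect their values in order.

    Hypothetical helper for illustration only; the plugin may bind
    parameters differently.
    """
    values = []

    def _bind(match: re.Match) -> str:
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"missing value for placeholder {name!r}")
        values.append(inputs[name])
        return "?"

    sql = _PLACEHOLDER.sub(_bind, query_template)
    return sql, values


sql, params = to_positional(
    "SELECT * FROM 'users.parquet' WHERE id = %(user_id)s",
    {"user_id": 42},
)
print(sql)     # SELECT * FROM 'users.parquet' WHERE id = ?
print(params)  # [42]
```

Keeping values out of the SQL string means the database driver handles quoting, which avoids SQL injection through task inputs.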

## Extensions

DuckDB supports extensions for additional functionality. Install and load them via `DuckDBConfig.extensions`:

```python
config = DuckDBConfig(extensions=["httpfs"])

task = DuckDB(
    name="query_s3",
    query_template="SELECT * FROM 's3://bucket/data.parquet' LIMIT 100",
    plugin_config=config,
    output_dataframe_type=pd.DataFrame,
)
```

Common extensions:
- `httpfs` - Read files from HTTP/S3
- `spatial` - Geospatial functions
- `json` - JSON processing
- `excel` - Read Excel files
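
In plain DuckDB, an extension is enabled with an `INSTALL` statement followed by a `LOAD` statement. Assuming the plugin does something equivalent for each entry in `DuckDBConfig.extensions` (the source does not show its implementation), the statements could be generated roughly as follows. `extension_statements` and the name check are hypothetical and only illustrate the idea.

```python
import re

# Conservative pattern for extension names (an assumption of this sketch).
_VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def extension_statements(extensions: list[str]) -> list[str]:
    """Return the INSTALL and LOAD statements for each extension name.

    Hypothetical helper for illustration; not part of flyteplugins-duckdb.
    """
    statements = []
    for name in extensions:
        # Reject anything that is not a plain identifier before it reaches SQL.
        if not _VALID_NAME.match(name):
            raise ValueError(f"invalid extension name: {name!r}")
        statements.append(f"INSTALL {name};")
        statements.append(f"LOAD {name};")
    return statements


for statement in extension_statements(["httpfs", "spatial"]):
    print(statement)
```

Validating the names first matters because extension names are interpolated into SQL text and cannot be passed as bound parameters.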

## Reading results as DataFrames

Set `output_dataframe_type` to `pd.DataFrame` to receive query results as a pandas DataFrame:

```python
import pandas as pd

select_task = DuckDB(
    name="get_data",
    query_template="SELECT * FROM 'data.parquet'",
    plugin_config=config,
    output_dataframe_type=pd.DataFrame,
)
```

## Full example

```python
import pandas as pd
from flyteplugins.duckdb import DuckDB, DuckDBConfig

import flyte

config = DuckDBConfig(extensions=["httpfs"])

analyze_task = DuckDB(
    name="analyze_sales",
    query_template="SELECT region, SUM(amount) AS total FROM 'sales.parquet' GROUP BY region",
    plugin_config=config,
    output_dataframe_type=pd.DataFrame,
)

duckdb_env = flyte.TaskEnvironment.from_task("duckdb_env", analyze_task)

env = flyte.TaskEnvironment(
    name="example_env",
    image=flyte.Image.from_debian_base().with_pip_packages("flyteplugins-duckdb"),
    depends_on=[duckdb_env],
)


@env.task
async def main() -> float:
    # Run the DuckDB query task, then sum the per-region totals.
    df = await analyze_task()
    return df["total"].sum().item()


if __name__ == "__main__":
    flyte.init_from_config()
    run = flyte.with_runcontext(mode="remote").run(main)
    print(run.url)
```
73 changes: 73 additions & 0 deletions plugins/duckdb/pyproject.toml
@@ -0,0 +1,73 @@
[project]
name = "flyteplugins-duckdb"
dynamic = ["version"]
description = "DuckDB plugin for Flyte"
readme = "README.md"
authors = [{ name = "Andre Ahlert", email = "andreahlert@users.noreply.github.com" }]
requires-python = ">=3.10"
dependencies = [
    "flyte[connector]",
    "duckdb",
]

[project.entry-points."flyte.connectors"]
duckdb = "flyteplugins.duckdb.connector:DuckDBConnector"

[build-system]
requires = ["setuptools", "setuptools_scm"]
build-backend = "setuptools.build_meta"

[tool.setuptools]
include-package-data = true
license-files = ["licenses/*.txt", "LICENSE"]

[tool.setuptools.packages.find]
where = ["src"]
include = ["flyteplugins*"]

[tool.setuptools_scm]
root = "../../"

[tool.pytest.ini_options]
norecursedirs = []
log_cli = true
log_cli_level = 20
markers = []
asyncio_default_fixture_loop_scope = "function"

[tool.coverage.run]
branch = true

[tool.ruff]
line-length = 120

[tool.ruff.lint]
select = [
    "E",
    "W",
    "F",
    "I",
    "PLW",
    "YTT",
    "ASYNC",
    "C4",
    "T10",
    "EXE",
    "ISC",
    "LOG",
    "PIE",
    "Q",
    "RSE",
    "FLY",
    "PGH",
    "PLC",
    "PLE",
    "FURB",
    "RUF",
]
ignore = ["PGH003", "PLC0415"]

[tool.ruff.lint.per-file-ignores]
"examples/*" = ["E402"]
"tests/*" = ["ASYNC230", "ASYNC240"]
52 changes: 52 additions & 0 deletions plugins/duckdb/src/flyteplugins/duckdb/__init__.py
@@ -0,0 +1,52 @@
"""
Key features:

- Run SQL queries against DuckDB (in-memory or file-based)
- Parameterized SQL queries with typed inputs
- Query Parquet, CSV, and JSON files directly
- Load DuckDB extensions (httpfs, spatial, etc.)
- Returns query results as DataFrames

Basic usage example:
```python
import flyte
from flyte.io import DataFrame
from flyteplugins.duckdb import DuckDB, DuckDBConfig

config = DuckDBConfig()

count_rows = DuckDB(
    name="count_rows",
    query_template="SELECT COUNT(*) AS total FROM 'data.parquet'",

Review comment (Contributor):

Where is `data.parquet` coming from? Is it an input of type `flyte.io.DataFrame`?

If so, I would love to support:

    count_rows = DuckDB(
        name="count_rows",
        query_template="SELECT COUNT(*) AS total FROM '{input}'",
        plugin_config=config,
        input=DataFrame,   # This can be implicit
        output_dataframe_type=DataFrame,
    )

Then you could pass a Parquet file, a pandas DataFrame, a Spark DataFrame, or anything else to it.

    plugin_config=config,
    output_dataframe_type=DataFrame,
)

flyte.TaskEnvironment.from_task("duckdb_env", count_rows)

if __name__ == "__main__":
    flyte.init_from_config()

    # Run locally (connector runs in-process)
    run = flyte.with_runcontext(mode="local").run(count_rows)

    # Run remotely (connector runs on the control plane)
    run = flyte.with_runcontext(mode="remote").run(count_rows)

    print(run.url)
```
"""

from flyte.io._dataframe.dataframe import DataFrameTransformerEngine

from flyteplugins.duckdb.connector import DuckDBConnector
from flyteplugins.duckdb.dataframe import (
    DuckDBToPandasDecodingHandler,
    PandasToDuckDBEncodingHandler,
)
from flyteplugins.duckdb.task import DuckDB, DuckDBConfig

DataFrameTransformerEngine.register(PandasToDuckDBEncodingHandler())
DataFrameTransformerEngine.register(DuckDBToPandasDecodingHandler())

__all__ = ["DuckDB", "DuckDBConfig", "DuckDBConnector"]