
polardb/duckdb-paimon


DuckDB Paimon Extension 🦆

This extension enables DuckDB to read and query Apache Paimon format data directly, with no ETL pipelines and no Flink/Spark clusters required. Just open a DuckDB shell and run SQL against your Paimon tables.

Like other DuckDB extensions, duckdb-paimon brings DuckDB's powerful local analytics to the Paimon data lake ecosystem.

About Apache Paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations. It innovatively combines lake format and LSM structure, bringing realtime streaming updates into the lake architecture.

Implementation

This extension is built on top of paimon-cpp, an open-source C++ library that provides native access to Paimon format data. It is the first library that brings native Paimon read/write capabilities to the C++ ecosystem.

Technical Highlights

  • Zero JVM dependency: No Java runtime required. Pure C++ implementation means minimal memory footprint and instant startup.
  • Apache Arrow data exchange: Data flows between paimon-cpp and DuckDB via Apache Arrow, the industry standard for columnar in-memory data, enabling zero-copy transfers with no serialization overhead.
  • Parallel scan architecture: Paimon tables are split into independent Splits, and DuckDB's multi-threaded execution engine reads them in parallel to fully utilize multi-core CPUs.
  • Secure credential management: OSS credentials are managed through DuckDB's native Secret Manager with scope isolation and automatic key redaction.
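As a rough illustration of the parallel scan point, DuckDB's worker thread count can be adjusted with its standard `threads` setting before scanning. This is only a sketch: the path is the sample table used later in this README, and how many Splits a given table yields (and therefore how much parallelism is actually available) depends on the table and is not specified here.

```sql
-- Set the number of DuckDB worker threads (standard DuckDB setting).
SET threads = 4;

-- Aggregate over the sample table; independent Splits may be read
-- by different threads.
SELECT f1, count(*) AS row_count, avg(f3) AS avg_f3
FROM paimon_scan('./data/testdb.db/testtbl')
GROUP BY f1
ORDER BY f1;
```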

Features

  • Read Paimon table data (local and remote OSS)
  • Projection pushdown optimization
  • Multiple file format support (Parquet data files, ORC manifest files)
  • Catalog ATTACH support
  • DuckDB Secret-based OSS credential management

Use Cases

Lightweight Ad-hoc Queries on Realtime Lakehouses

Data is written into Paimon by Flink in real time. Analysts can query it directly on OSS using DuckDB + duckdb-paimon with no compute cluster needed, reducing query latency from minutes to seconds.

Data Validation & Quality Checks

Use DuckDB in CI/CD pipelines to run data quality assertions on Paimon tables, verifying that Flink job outputs meet expectations. Lightweight, fast, and dependency-free.
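A data quality check of this kind might look like the following sketch, written against the sample table used later in this README. The rules themselves (no NULL values in f0, and at most one row per (f1, f2) pair) are assumptions made for illustration, not constraints stated by the project. A CI job could run these statements and fail the build if the first returns a non-zero count or the second returns any rows.

```sql
-- Check 1: f0 should never be NULL. A passing run returns 0.
SELECT count(*) AS null_f0
FROM paimon_scan('./data/testdb.db/testtbl')
WHERE f0 IS NULL;

-- Check 2: (f1, f2) is assumed to be unique.
-- A passing run returns no rows; any row is a duplicated key.
SELECT f1, f2, count(*) AS n
FROM paimon_scan('./data/testdb.db/testtbl')
GROUP BY f1, f2
HAVING count(*) > 1;
```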

Data Exploration & Debugging

Data engineers developing Flink jobs can instantly inspect the current state of Paimon tables using DuckDB Shell and quickly locate data issues, which is far more efficient than launching a Flink SQL Client.
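For a quick first look at a table, DuckDB's built-in `DESCRIBE` and `SUMMARIZE` statements can be applied to a scan. The sketch below uses the sample table path from later in this README and assumes the two statements work over `paimon_scan` as they do over other DuckDB table functions.

```sql
-- Show column names and types of the scan result.
DESCRIBE SELECT * FROM paimon_scan('./data/testdb.db/testtbl');

-- Compute per-column statistics (min, max, null percentage, and so on).
SUMMARIZE SELECT * FROM paimon_scan('./data/testdb.db/testtbl');

-- Peek at a few rows.
SELECT * FROM paimon_scan('./data/testdb.db/testtbl') LIMIT 5;
```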

Cross-format Federated Queries

DuckDB natively supports Parquet, CSV, JSON, Iceberg, and more. Combined with duckdb-paimon, you can JOIN Paimon tables with other data sources without any data movement:

-- Join a Paimon orders table with a local CSV dimension table
SELECT o.order_id, o.amount, c.customer_name
FROM paimon_scan('oss://...', 'db', 'orders') o
JOIN read_csv('customers.csv') c ON o.customer_id = c.id;

Getting Started

Clone the repository:

git clone --recurse-submodules https://github.com/polardb/duckdb-paimon.git
cd duckdb-paimon

The --recurse-submodules flag pulls in DuckDB and paimon-cpp, both of which are required to build the extension.

Building

GEN=ninja make

Running the Extension

To run the extension code, simply start the shell with ./build/release/duckdb. This shell will have the extension pre-loaded.

Now we can use the features from the extension directly in DuckDB:

Query Local Paimon Tables

SELECT * FROM paimon_scan('./data/testdb.db/testtbl');
┌─────────┬───────┬───────┬────────┐
│   f0    │  f1   │  f2   │   f3   │
│ varchar │ int32 │ int32 │ double │
├─────────┼───────┼───────┼────────┤
│ Alice   │     1 │     0 │   11.0 │
│ Bob     │     1 │     1 │   12.1 │
│ Cathy   │     1 │     2 │   13.2 │
│ David   │     2 │     0 │   21.0 │
│ Eve     │     2 │     1 │   22.1 │
│ Frank   │     2 │     2 │   23.2 │
│ Grace   │     3 │     0 │   31.0 │
│ Henry   │     3 │     1 │   32.1 │
│ Iris    │     3 │     2 │   33.2 │
└─────────┴───────┴───────┴────────┘

Query Remote OSS Paimon Tables

-- Configure OSS credentials
CREATE SECRET my_oss (
    TYPE paimon,
    key_id 'your-access-key-id',
    secret 'your-access-key-secret',
    endpoint 'oss-cn-hangzhou.aliyuncs.com'
);

-- Query Paimon tables on OSS
SELECT * FROM paimon_scan('oss://your-bucket/warehouse', 'your_db', 'your_table');
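The scope isolation and key redaction mentioned under Technical Highlights can be explored with standard DuckDB secret tooling. The sketch below is hedged: `SCOPE` is DuckDB's generic secret parameter and `duckdb_secrets()` is its built-in listing function, but whether the paimon secret type treats the scope exactly this way is an assumption, and the secret name `my_scoped_oss` is hypothetical.

```sql
-- Hypothetical secret restricted to one bucket via DuckDB's generic SCOPE
-- parameter (assumed to be honored by the paimon secret type).
CREATE SECRET my_scoped_oss (
    TYPE paimon,
    key_id 'your-access-key-id',
    secret 'your-access-key-secret',
    endpoint 'oss-cn-hangzhou.aliyuncs.com',
    SCOPE 'oss://your-bucket'
);

-- List registered secrets; sensitive values appear redacted.
SELECT name, type, scope, secret_string
FROM duckdb_secrets();
```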

Attach as Catalog

ATTACH 'oss://my-bucket/warehouse' AS paimon_lake (TYPE paimon);

SHOW ALL TABLES;
DESCRIBE paimon_lake.sales_db.orders;
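Once attached, tables should be addressable with DuckDB's usual catalog.schema.table naming. The sketch below reuses the `paimon_lake.sales_db.orders` name from the example above; that database and table are placeholders, not objects guaranteed to exist in your warehouse.

```sql
-- Read a few rows through the attached catalog.
SELECT *
FROM paimon_lake.sales_db.orders
LIMIT 10;

-- Detach the catalog when finished (standard DuckDB statement).
DETACH paimon_lake;
```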

Running the Tests

make test

Related Projects

  • Apache Paimon: Realtime lakehouse format
  • paimon-cpp: Native C++ library for Paimon (underlying dependency)
  • DuckDB: Embeddable OLAP database
  • duckdb-iceberg: DuckDB's official Iceberg extension

Join the Community

We welcome contributions and discussions! If you have questions, ideas, or want to connect with other users and developers, join our community by clicking here or scanning the QR code below:

DingTalk Group QR Code

About

DuckDB extension for accessing Apache Paimon. 🦆
