
Poor man's data lake: a simple API to efficiently query your Parquet datasets using DuckDB or Polars


legout/pydala2


PyDala2


Overview 📖

PyDala2 is a high-performance Python library for managing Parquet datasets with advanced metadata capabilities. Built on Apache Arrow, its features include:

  • Smart dataset management with metadata optimization
  • Multi-format support (Parquet, CSV, JSON)
  • Multi-backend integration (Polars, PyArrow, DuckDB, Pandas)
  • Advanced querying with predicate pushdown
  • Schema management with automatic validation
  • Performance optimization with caching and partitioning
  • Catalog system for centralized dataset management

✨ Key Features

  • 🚀 High Performance: Built on Apache Arrow with optimized memory usage and processing speed
  • 📊 Smart Dataset Management: Efficient Parquet handling with metadata optimization and caching
  • 🔄 Multi-backend Support: Seamlessly switch between Polars, PyArrow, DuckDB, and Pandas
  • 🔍 Advanced Querying: SQL-like filtering with predicate pushdown for maximum efficiency
  • 📋 Schema Management: Automatic validation, evolution, and tracking of data schemas
  • ⚡ Performance Optimization: Built-in caching, compression, and intelligent partitioning
  • 🛡️ Type Safety: Comprehensive validation and error handling throughout the library
  • 🏗️ Catalog System: Centralized dataset management across namespaces

🚀 Quick Start

Installation

# Install PyDala2
pip install pydala2

# Install with all optional dependencies
pip install pydala2[all]

# Install with specific backends
pip install pydala2[polars,duckdb]

Basic Usage

from pydala import ParquetDataset
import pandas as pd

# Create a dataset
dataset = ParquetDataset("data/my_dataset")

# Write data
data = pd.DataFrame({
    'id': range(100),
    'category': ['A', 'B', 'C'] * 33 + ['A'],
    'value': [i * 2 for i in range(100)]
})
dataset.write_to_dataset(
    data=data,
    partition_cols=['category']
)

# Read with filtering - automatic backend selection
result = dataset.filter("category IN ('A', 'B') AND value > 50")

# Export to different formats
df_polars = result.table.to_polars()  # or use shortcut: result.t.pl
df_pandas = result.table.df           # or result.t.df
duckdb_rel = result.table.ddb         # or result.t.ddb
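
To show why filtering on a partition column can be cheap, here is a minimal sketch of partition pruning in plain Python. It assumes a Hive-style `column=value` directory layout, which is a common convention for partitioned Parquet datasets; PyDala2's actual on-disk layout and pruning logic may differ. The helper names `partition_path` and `prune_partitions` are hypothetical and not part of PyDala2.

```python
from pathlib import PurePosixPath


def partition_path(root: str, column: str, value: str) -> PurePosixPath:
    """Build a Hive-style partition directory, e.g. root/category=A (assumed layout)."""
    return PurePosixPath(root) / f"{column}={value}"


def prune_partitions(paths, column, wanted):
    """Keep only directories whose `column=value` segment matches a wanted value."""
    kept = []
    for path in paths:
        for part in path.parts:
            key, sep, value = part.partition("=")
            if sep and key == column and value in wanted:
                kept.append(path)
                break
    return kept


# One directory per category value written in the example above.
all_dirs = [partition_path("data/my_dataset", "category", c) for c in ("A", "B", "C")]

# A filter such as category IN ('A', 'B') only needs two of the three directories.
selected = prune_partitions(all_dirs, "category", {"A", "B"})
print([str(p) for p in selected])
```

Under these assumptions, files in the `category=C` directory never need to be opened for that filter.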

Using Different Backends

import polars as pl

# PyDala2 provides automatic backend selection
# Just access data in your preferred format:

# Polars LazyFrame (recommended for performance)
lazy_df = dataset.table.pl  # or dataset.t.pl
result = (
    lazy_df
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.mean("value"))
    .collect()
)

# DuckDB (for SQL queries)
result = dataset.ddb_con.sql("""
    SELECT category, AVG(value) as avg_value
    FROM dataset
    GROUP BY category
""").to_arrow_table()

# PyArrow Table (for columnar operations)
table = dataset.table.arrow  # or dataset.t.arrow

# Pandas DataFrame (for compatibility)
df_pandas = dataset.table.df  # or dataset.t.df

# Direct export methods
df_polars = dataset.table.to_polars(lazy=False)
table = dataset.table.to_arrow()
df_pandas = dataset.table.to_pandas()
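
As a backend-independent sanity check, the Polars pipeline above (filter `value > 100`, then mean of `value` per `category`) can be reproduced in plain Python. This sketch assumes the dataset holds exactly the 100 rows written in Basic Usage; the helper `mean_value_by_category` is hypothetical and not part of PyDala2.

```python
from collections import defaultdict

# Rebuild the sample data from Basic Usage as plain Python rows.
categories = ["A", "B", "C"] * 33 + ["A"]
rows = [{"id": i, "category": categories[i], "value": i * 2} for i in range(100)]


def mean_value_by_category(rows, min_value):
    """Mean of `value` per category, over rows with value > min_value."""
    groups = defaultdict(list)
    for row in rows:
        if row["value"] > min_value:
            groups[row["category"]].append(row["value"])
    return {cat: sum(vals) / len(vals) for cat, vals in sorted(groups.items())}


expected = mean_value_by_category(rows, min_value=100)
print(expected)  # {'A': 150.0, 'B': 149.0, 'C': 151.0}
```

Any backend should return the same three means for that filter, which makes this a convenient reference when comparing results across backends.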

Catalog Management

from pydala import Catalog

# Create catalog from YAML configuration
catalog = Catalog("catalog.yaml")

# YAML configuration example:
# tables:
#   sales_2023:
#     path: "/data/sales/2023"
#     filesystem: "local"
#   customers:
#     path: "/data/customers"
#     filesystem: "local"

# Query across datasets using automatic table loading
result = catalog.query("""
    SELECT
        s.*,
        c.customer_name,
        c.segment
    FROM sales_2023 s
    JOIN customers c ON s.customer_id = c.id
    WHERE s.date >= '2023-01-01'
""")

# Or access datasets directly
sales_dataset = catalog.get_dataset("sales_2023")
filtered_sales = sales_dataset.filter("amount > 1000")
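
For trying the catalog example locally, the configuration file has to exist first. The sketch below writes the YAML from the comments above to `catalog.yaml` using only the standard library. The table paths are the placeholders from the example, and `write_catalog_config` is a hypothetical helper, not part of PyDala2.

```python
import tempfile
from pathlib import Path

# Contents mirror the commented YAML example above; the paths are placeholders.
CATALOG_YAML = """\
tables:
  sales_2023:
    path: "/data/sales/2023"
    filesystem: "local"
  customers:
    path: "/data/customers"
    filesystem: "local"
"""


def write_catalog_config(directory: Path) -> Path:
    """Write the example catalog configuration and return the file path."""
    config_path = directory / "catalog.yaml"
    config_path.write_text(CATALOG_YAML, encoding="utf-8")
    return config_path


# Write into a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_catalog_config(Path(tmp))
    written = config_path.read_text(encoding="utf-8")
    print(config_path.name, len(written.splitlines()), "lines")
```

The resulting path could then be passed to `Catalog(...)` as in the example above, after replacing the placeholder table paths with real dataset locations.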

📚 Documentation

Comprehensive documentation is available at pydala2.readthedocs.io:

  • Getting Started
  • User Guide
  • API Reference
  • Advanced Topics

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

📝 License

MIT License
