PyDala2 is a high-performance Python library for managing Parquet datasets with advanced metadata capabilities. Built on Apache Arrow, it also works with CSV and JSON data and provides:
- 🚀 High Performance: Built on Apache Arrow with optimized memory usage and processing speed
- 📊 Smart Dataset Management: Efficient Parquet handling with metadata optimization and caching
- 🔄 Multi-backend Support: Seamlessly switch between Polars, PyArrow, DuckDB, and Pandas
- 🔍 Advanced Querying: SQL-like filtering with predicate pushdown for maximum efficiency
- 📋 Schema Management: Automatic validation, evolution, and tracking of data schemas
- ⚡ Performance Optimization: Built-in caching, compression, and intelligent partitioning
- 🛡️ Type Safety: Comprehensive validation and error handling throughout the library
- 🏗️ Catalog System: Centralized dataset management across namespaces
```bash
# Install PyDala2
pip install pydala2
# Install with all optional dependencies
pip install pydala2[all]
# Install with specific backends
pip install pydala2[polars,duckdb]
```
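To confirm the install, one quick check is to import the package and print the installed version via the standard library (a minimal sanity check; `pydala` is the import name and `pydala2` the distribution name, as used above and in the quick start below):

```python
# Sanity check: import PyDala2 and print the installed version
from importlib.metadata import version

import pydala  # noqa: F401 - verifies the package imports cleanly

print(version("pydala2"))
```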
```python
from pydala import ParquetDataset
import pandas as pd
# Create a dataset
dataset = ParquetDataset("data/my_dataset")
# Write data
data = pd.DataFrame({
    'id': range(100),
    'category': ['A', 'B', 'C'] * 33 + ['A'],
    'value': [i * 2 for i in range(100)]
})
dataset.write_to_dataset(
    data=data,
    partition_cols=['category']
)
# Read with filtering - automatic backend selection
result = dataset.filter("category IN ('A', 'B') AND value > 50")
# Export to different formats
df_polars = result.table.to_polars() # or use shortcut: result.t.pl
df_pandas = result.table.df # or result.t.df
duckdb_rel = result.table.ddb  # or result.t.ddb
```
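Because the example partitions by `category`, filters on that column can skip entire partition directories (predicate pushdown). To see the layout that `write_to_dataset` produced, you can walk the dataset directory; a minimal sketch using only the standard library, assuming the hive-style `category=A/` layout that Arrow-based writers typically produce (exact file names will differ):

```python
import os

# List the files under the dataset root; partitioned datasets are laid
# out as one directory per partition value (e.g. category=A/...).
for root, _dirs, files in os.walk("data/my_dataset"):
    for name in files:
        print(os.path.join(root, name))
```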
```python
import polars as pl  # needed for the expressions below

# PyDala2 provides automatic backend selection -
# just access the data in your preferred format.

# Polars LazyFrame (recommended for performance)
lazy_df = dataset.table.pl  # or dataset.t.pl
result = (
    lazy_df
    .filter(pl.col("value") > 100)
    .group_by("category")
    .agg(pl.mean("value"))
    .collect()
)
# DuckDB (for SQL queries)
result = dataset.ddb_con.sql("""
    SELECT category, AVG(value) AS avg_value
    FROM dataset
    GROUP BY category
""").to_arrow()
# PyArrow Table (for columnar operations)
table = dataset.table.arrow # or dataset.t.arrow
# Pandas DataFrame (for compatibility)
df_pandas = dataset.table.df # or dataset.t.df
# Direct export methods
df_polars = dataset.table.to_polars(lazy=False)
table = dataset.table.to_arrow()
df_pandas = dataset.table.to_pandas()
```
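Because every table is Arrow-backed, you can also inspect a dataset's schema with plain PyArrow calls through the `dataset.table.arrow` accessor shown above (a minimal sketch; PyDala2's dedicated schema-management helpers are covered in the documentation below):

```python
# Inspect the dataset's schema through its Arrow table
schema = dataset.table.arrow.schema  # a pyarrow.Schema

for field in schema:
    print(f"{field.name}: {field.type}")
```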
An example `catalog.yaml`:

```yaml
tables:
  sales_2023:
    path: "/data/sales/2023"
    filesystem: "local"
  customers:
    path: "/data/customers"
    filesystem: "local"
```

```python
from pydala import Catalog

# Create a catalog from the YAML configuration
catalog = Catalog("catalog.yaml")

# Query across datasets using automatic table loading
result = catalog.query("""
    SELECT
        s.*,
        c.customer_name,
        c.segment
    FROM sales_2023 s
    JOIN customers c ON s.customer_id = c.id
    WHERE s.date >= '2023-01-01'
""")
# Or access datasets directly
sales_dataset = catalog.get_dataset("sales_2023")
filtered_sales = sales_dataset.filter("amount > 1000")
```

Comprehensive documentation is available at [pydala2.readthedocs.io](https://pydala2.readthedocs.io):
- Core Classes
- Dataset Classes
- Table Operations
- Metadata Management
- Catalog System
- Filesystem
- Utilities
Contributions are welcome! Please see our Contributing Guide for details.
