Spark Kindling Framework is a comprehensive solution for building robust data pipelines on Apache Spark, specifically designed for notebook-based environments like Microsoft Fabric and Azure Databricks (better support for local, Python-first development is coming soon). It provides a declarative, dependency-injection-driven approach to defining and executing data transformations while maintaining strict governance and observability.
- Introduction - Framework overview and architecture
- Data Entities - Data entity management system
- Data Pipes - Transformation pipeline system
- Entity Providers - Storage abstraction system
- Logging & Tracing - Observability foundation
- Utilities - Common utilities and helper functions
- Watermarking - Change tracking and incremental processing
- File Ingestion - Built-in file ingestion capabilities
- Stage Processing - Pipeline stage orchestration
- Setup Guide - Installation and configuration
The framework consists of several modular components:
- Dependency Injection Engine - Provides an IoC container for loose coupling (see the sketch after this list)
- Data Entities - Entity registry and storage abstraction
- Data Pipes - Transformation pipeline definition and execution
- Watermarking - Change tracking for incremental processing
- File Ingestion - File pattern discovery and loading
- Stage Processing - Orchestration of multi-stage pipelines
- Common Transforms - Reusable data transformation utilities
- Logging & Tracing - Comprehensive observability features
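The dependency-injection engine builds on the inject library (credited under acknowledgments), and the framework's `GlobalInjector` resolves services by type, as the quickstart's `GlobalInjector.get(DataPipesExecution)` call shows. The sketch below illustrates the underlying pattern using inject directly; `StorageProvider` and `DeltaStorageProvider` are hypothetical names for illustration, not Kindling APIs.

```python
# Sketch of the IoC pattern using the `inject` library that Kindling builds on.
# StorageProvider/DeltaStorageProvider are hypothetical illustration names.
import inject

class StorageProvider:
    """Abstraction that consumers depend on."""
    def write(self, df, path): ...

class DeltaStorageProvider(StorageProvider):
    """Concrete implementation bound at configuration time."""
    def write(self, df, path):
        df.write.format("delta").mode("append").save(path)

def configure(binder):
    # Bind the abstraction to a concrete implementation once, at startup.
    binder.bind(StorageProvider, DeltaStorageProvider())

inject.configure(configure)

# Consumers resolve dependencies by type instead of constructing them,
# so tests can rebind StorageProvider to an in-memory fake.
storage = inject.instance(StorageProvider)
```

The payoff is loose coupling and testability: swapping a provider is a one-line change in the binder configuration rather than an edit to every consumer.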
The quickstart below registers an entity, defines a transformation pipe, and runs it. The decorated `CustomersRaw` placeholder class and the inline `customer_schema` definition are illustrative additions to make the snippet self-contained; adapt them to your project.

```python
# Import the Kindling framework
from kindling.data_entities import *
from kindling.data_pipes import *
from kindling.injection import *

from pyspark.sql.types import StringType, StructField, StructType

# Example schema for the raw customer entity (define yours as needed)
customer_schema = StructType([
    StructField("customer_id", StringType(), False),
    StructField("country", StringType(), True),
])

# Register a data entity
@DataEntities.entity(
    entityid="customers.raw",
    name="Raw Customer Data",
    partition_columns=["country"],
    merge_columns=["customer_id"],
    tags={"domain": "customer", "layer": "bronze"},
    schema=customer_schema
)
class CustomersRaw:  # placeholder decorator target, shown for illustration
    pass

# Create a data transformation pipe
@DataPipes.pipe(
    pipeid="customers.transform",
    name="Transform Customers",
    tags={"layer": "silver"},
    input_entity_ids=["customers.raw"],
    output_entity_id="customers.silver",
    output_type="table"
)
def transform_customers(customers_raw):
    # Transformation logic here
    return customers_raw.filter(...)

# Execute the pipeline
executor = GlobalInjector.get(DataPipesExecution)
executor.run_datapipes(["customers.transform"])
```

Key features:

- Declarative definitions for data entities and transformations
- Dependency injection for loose coupling and testability
- Delta Lake integration for reliable storage and time travel
- Watermarking for change tracking and incremental loads (sketched after this list)
- Pluggable providers for storage, paths, and execution strategies
- Observability through logging and tracing
- Flexible configuration for different environments
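As a rough sketch of what the watermarking component manages (not Kindling's actual API): an incremental load remembers the newest change it processed and, on the next run, reads only rows beyond that mark. The `_watermarks` path, `_modified_at` column, and table locations below are hypothetical.

```python
# Generic PySpark/Delta incremental-load pattern; all paths and column
# names here are hypothetical, not part of the Kindling API.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the last processed high-water mark for this source.
last_mark = (
    spark.read.format("delta").load("/lake/_watermarks")
    .filter(F.col("source") == "customers.raw")
    .agg(F.max("watermark").alias("wm"))
    .collect()[0]["wm"]
) or "1970-01-01T00:00:00"

# Pull only rows that changed since the last run.
changes = (
    spark.read.format("delta").load("/lake/bronze/customers")
    .filter(F.col("_modified_at") > F.lit(last_mark))
)

# ... transform and merge `changes` downstream ...

# Advance the watermark to the newest change just consumed.
new_mark = changes.agg(F.max("_modified_at")).collect()[0][0]
```

The watermarking component automates this read-filter-advance bookkeeping so pipes only declare their inputs and stay free of change-tracking logic.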
Requirements:
- Python 3.8+
- Apache Spark 3.x
- Azure Synapse Artifacts SDK (for use with Synapse)
This framework builds upon several excellent open source projects:
- Apache Spark - Unified analytics engine for large-scale data processing (Apache 2.0)
- inject - Python dependency injection framework (Apache 2.0)
- blinker - Python signal pubsub framework (MIT)
- dynaconf - Configuration management for Python (MIT)
- pytest - Testing framework (MIT)
- azure.synapse.artifacts - Azure Synapse SDK (MIT)
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This is open source software provided without warranty or guaranteed support.
Commercial Support: Professional support, training, and consulting services are available from Software Engineering Professionals, Inc. Contact us at contact information.
This project is licensed under the MIT License - see the LICENSE file for details.
Software Engineering Professionals, Inc. 16080 Westfield Blvd. Carmel, IN 46033 www.sep.com
This framework was developed to solve real-world data processing challenges encountered across multiple enterprise engagements. We're grateful to our clients who have helped shape the requirements and validate the approach.
Note: This framework is maintained by SEP and used across multiple projects. If you're using this framework and encounter issues or have suggestions, please open an issue or submit a pull request.