This project demonstrates how to build a real-time data pipeline that streams data from PostgreSQL to Snowflake using Apache Kafka as the messaging backbone. The pipeline enables near real-time data replication and analytics capabilities.
PostgreSQL -> Kafka Connect (Source) -> Kafka -> Kafka Connect (Sink) -> Snowflake
- Docker and Docker Compose
- Python 3.7+
- PostgreSQL database
- Snowflake account
- Kafka Connect with required connectors
- PostgreSQL: Source database where the data originates
- Apache Kafka: Distributed streaming platform
- Kafka Connect: Framework for streaming data between Kafka and other systems
- Snowflake: Cloud data warehouse destination
- Clone this repository
- Set up your environment variables
- Start the services using Docker Compose:
```bash
docker compose up -d
```
- Configure Kafka Connect:
```bash
cd create_connector
python push_kafka_connect.py
```
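The push script itself is not reproduced here; as a rough idea of what a script like push_kafka_connect.py typically does, the following hedged sketch reads a connector definition from config.json and registers it through Kafka Connect's REST API. The localhost:8083 address and the file layout are assumptions, not details taken from this repository.

```python
# Sketch only: register a connector definition with Kafka Connect's REST API.
# Assumes Kafka Connect listens on localhost:8083 and that config.json sits
# next to this script; adjust both to match your deployment.
import json

import requests

CONNECT_URL = "http://localhost:8083/connectors"  # assumed Connect REST endpoint


def push_connector(config_path: str = "config.json") -> None:
    with open(config_path) as fh:
        connector = json.load(fh)  # expects {"name": ..., "config": {...}}

    resp = requests.post(CONNECT_URL, json=connector, timeout=30)
    resp.raise_for_status()
    print(f"Registered connector {connector['name']}: HTTP {resp.status_code}")


if __name__ == "__main__":
    push_connector()
```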
The project includes several important configuration files:
- docker-compose.yml: Defines the required services
- create_connector/config.json: Kafka Connect connector configuration (an illustrative example is sketched below)
- prometheus.yml: Monitoring configuration
- grafana_dashboard.yml: Dashboard configuration
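The exact contents of create_connector/config.json are specific to this repository. Purely as an illustration, a PostgreSQL source connector definition of the kind this pipeline relies on, assuming a Debezium-based connector, might look like the following. It is shown as a Python dict so it can be passed straight to the push script above; every hostname, credential, and table name is a placeholder.

```python
# Illustrative only: a Debezium PostgreSQL source connector definition.
# All hostnames, credentials, topic prefixes, and table names are placeholders.
source_connector = {
    "name": "postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "replicator",
        "database.password": "change-me",
        "database.dbname": "appdb",
        "topic.prefix": "appdb",           # topics become appdb.<schema>.<table>
        "table.include.list": "public.orders",
        "plugin.name": "pgoutput",         # logical decoding plugin
    },
}
```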
The pipeline includes monitoring capabilities using:
- Prometheus for metrics collection
- Grafana for visualization
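As a quick way to confirm that metrics are actually flowing, the Prometheus instant-query HTTP API can be polled. The snippet below is a minimal sketch that assumes Prometheus is reachable on localhost:9090; the job label is a placeholder and depends on how the scrape targets are named in prometheus.yml.

```python
# Sketch only: check that Prometheus is scraping the pipeline components.
# Assumes Prometheus on localhost:9090; the job label is a placeholder.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"


def query_metric(expr: str = "up") -> None:
    resp = requests.get(PROM_URL, params={"query": expr}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        print(result["metric"], "=", result["value"][1])


if __name__ == "__main__":
    query_metric('up{job="kafka-connect"}')  # placeholder job label
```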
Security features implemented:
- RSA key authentication for Snowflake
- FIPS compliance with custom security modules
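The RSA key authentication mentioned above follows Snowflake's key-pair flow: the public key is attached to the Snowflake user and the private key is handed to the sink connector. The sketch below generates such a pair with the cryptography package; the file names are assumptions, and the exact connector property that receives the private key should be taken from the Snowflake connector documentation rather than from this sketch.

```python
# Sketch only: generate an RSA key pair for Snowflake key-pair authentication.
# File names are placeholders; an encrypted private key can also be used.
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# PKCS#8 private key: the PEM body (without the header/footer lines) is what
# the Snowflake sink connector's private-key property expects.
private_pem = key.private_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)

# Public key: register this with the Snowflake user, e.g.
# ALTER USER <user> SET RSA_PUBLIC_KEY='<PEM body>';
public_pem = key.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)

with open("rsa_key.p8", "wb") as fh:
    fh.write(private_pem)
with open("rsa_key.pub", "wb") as fh:
    fh.write(public_pem)
```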
- connect-plugins/: Required Kafka connector JARs
- create_connector/: Connector configuration and setup scripts
- seed_data/: Test data and seeding scripts
- user/: Security keys and certificates
Feel free to submit issues and enhancement requests.
This project is licensed under the MIT License - see the LICENSE file for details.