Airflow, PySpark, Iceberg, and DuckDB: build a local environment with k8s or Docker Compose, with AI (almost)
This is an attempt to create, with AI, a working PySpark project with Airflow, Iceberg, DuckDB, and dbt.
To skip the introduction, go to the detailed explanation.
Initially, I asked Cursor.ai to create a sample project with different options using the latest Airflow 3.x and Spark 4.x, with the option to run with MinIO instead of S3 storage.
You can see cursor_init_chat.md. The structure looked nice, and the initial code was at an average level, but nothing worked, despite the "tests" Cursor claimed had passed. After a few attempts, I tried talking to Gemini (gemini.md) - it improved a few things, but I still failed to get anything running. Then I tried getting help from ChatGPT (chatGpt_airflow3.md).
Then, by reading the documentation, refactoring by myself, and asking very specific/concrete questions to both ChatGPT and Gemini, I finally succeeded in running the app and submitting it to the Spark master.
Funny things:
- all 3 AIs suggested a mix of configuration for Airflow 3.x (the latest) and the previous 2.x, so Airflow failed to start.
- all 3 AIs tried to configure Spark 4 with Iceberg, but only when I asked Gemini directly (after finding it in the documentation) whether Iceberg works with Spark 4 did it answer that support was still in progress.
So I downgraded Spark to 3.5.6.
Finally, despite a few wrong suggestions from Gemini and ChatGPT about the right Spark configuration, I succeeded in running Airflow, showing data, and writing it to MinIO both with and without Iceberg.
As a bonus, I added a simple DAG using DuckDB without any Spark.
UPDATE:
- Shortly after finishing this, Apache Iceberg released libraries to support Spark 4.x, so I updated all dependencies.
- Again, shortly after publishing the blog, Airflow released version 3.1.0, so I updated it, too.
- I decided it would be nice to make all DAGs work with k8s, so I added the relevant DAGs for k8s and a script for creating a local k8s cluster.
So, you can read my blog article about the journey :) Airflow, PySpark, Iceberg: build local environment with AI (almost) or just proceed to the project description here.

Disclaimer:
This structure is only for learning and/or development purposes.
Don't use it as is in production - adjust it.
Some of the components use an insecure approach.
I added support for different profiles distinguished by environment variables.
However, I have only tested the Hadoop profile with MinIO; there shouldn't be any problems with Glue or Hive either (unless I've missed some env variable :) ).
Please be smart, update/change it according to your needs.
The purpose of this project is to create a local/dev playground for running Airflow, Spark, and DBT.
- Workflow orchestration with Airflow
- Distributed data processing with Spark (with and without Iceberg)
- Data transformation and modeling with dbt
- Lightweight analytics with DuckDB
- Storing streaming data in Iceberg tables with Iceberg Kafka Connect
- Multi-Environment Support: Run locally with Docker Compose or deploy to Kubernetes using Helm
- Multiple Catalog Backends: Support for Iceberg catalogs (Hadoop, AWS Glue, Hive Metastore - I didn't test the last two, but they should work, since a similar project of mine worked with them)
- Flexible Storage Options: Works with MinIO (local S3-compatible storage) or AWS S3
- Apache Airflow 3.x - Workflow orchestration platform
- Web UI (API Server) - User interface and REST API
- Scheduler - Orchestration brain
- DAG Processor - DAG parsing service
- Triggerer - Async task management
- PostgreSQL - Metadata database
- LocalExecutor - Job execution (development mode)
- Apache Spark 4.x - Distributed data processing
- Spark Master - Resource management (for Docker Compose only)
- Spark Workers - Distributed computation
- Spark History Server - Job monitoring and debugging (for Docker Compose only)
- DuckDB on Airflow Workers
- Apache Iceberg - Table format for data lakes
- MinIO - S3-compatible object storage
- Multiple Catalog Options: Hadoop, AWS Glue, Hive Metastore
- Kafka Cluster - a single-node Kafka cluster for streaming data
  - Includes Kafka-UI to view and manage the Kafka cluster
- Kafka Connect Iceberg Sink - a Kafka Connect connector that streams data from a Kafka topic to an Iceberg table
- Fake Data Generator - to test Kafka Connect, I created a small Python web service with a single REST API that receives the number of messages, then generates and sends that many fake messages to a Kafka topic (see the sketch after this list).
- dbt (data build tool) - SQL-based data transformation and modeling
- Staging models - Data cleaning and standardization
- Marts models - Business logic and aggregations
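To give an idea of what such a generator can look like, here is a minimal sketch (the FastAPI endpoint path, topic name, broker address, and message fields are made up for illustration; the real service may differ):

```python
# Illustrative sketch of a fake-data web service: one REST endpoint that
# generates N fake messages and pushes them to a Kafka topic.
# Endpoint path, topic, broker address, and fields are placeholders.
import json

from faker import Faker
from fastapi import FastAPI
from kafka import KafkaProducer

app = FastAPI()
fake = Faker()

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


@app.post("/messages/{count}")
def generate_messages(count: int) -> dict:
    """Generate `count` fake records and send them to the topic."""
    for _ in range(count):
        producer.send("fake-data", {"name": fake.name(), "email": fake.email()})
    producer.flush()
    return {"sent": count}
```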
This project can be built for two types of local environments: Docker Compose or Kubernetes.
For a production-like setup, deploy to Kubernetes using the provided Helm charts and management script:
cd helm/
./my_helm -t=v1 --build

See helm/README.md for detailed Kubernetes deployment instructions, but here is a little summary: it creates a Postgres DB and a MinIO pod, uploads the example file, creates the Airflow containers using the official Airflow Helm chart, installs the Kubeflow Spark Operator, and creates RBAC roles, role bindings, and other stuff. I decided to skip the Spark History Server in k8s :).
Note: If I remember correctly, to add the Spark History Server, you just need to change spark-defaults.conf to point to the S3 bucket. Something like that:
spark.eventLog.dir s3://bucket/spark_events # spark to write events to
spark.history.fs.logDirectory s3://bucket/spark_events # history server to read events from
and, of course, create a Spark History Server deployment.
Important note about Spark:
- In k8s, all Spark jobs use the Kubernetes controller as the Spark master.
- There are two use cases for Spark on k8s:
  - Using the built-in Apache Spark Kubernetes integration in dbt_k8s_dag.py, where the Spark driver on the dbt container asks the k8s controller for executors.
  - Using the Kubeflow Spark Operator and SparkKubernetesOperator in Airflow (see the sketch below).
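A minimal sketch of the second option, just to show the shape of it (the manifest path, namespace, and connection id are placeholders, not necessarily what this project uses):

```python
# Sketch: trigger a SparkApplication on k8s from Airflow via the Kubeflow Spark Operator.
# Manifest path, namespace, and connection id are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)

with DAG(
    dag_id="spark_on_k8s_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    submit_spark_app = SparkKubernetesOperator(
        task_id="submit_spark_app",
        namespace="py-spark",
        application_file="templates/spark_app.yaml",  # SparkApplication manifest (placeholder)
        kubernetes_conn_id="kubernetes_default",
    )
```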
The quickest way to get started is to run the entire stack locally using Docker Compose:

./docker_compose.sh up -d --build

This creates:
- Airflow Web UI (known in the new version as api-server) - responsible for showing you a nice UI and for letting other systems interact via REST.
- Airflow Scheduler - the actual orchestration, the brain of the system. It decides when and what to run according to the schedules defined by the DAG developers.
- DAG Processor - a new part in Airflow 3.x. This functionality used to be part of the Scheduler in Airflow 2.x. Now a separate service is responsible for parsing DAGs, which decouples DAG parsing and lets you update DAG code without interrupting the other Airflow services.
- Triggerer - the service behind deferrable (async) tasks. If you have had hundreds of jobs in earlier Airflow versions, you are familiar with the pain of triggering/waiting/crashing DAGs. This service works like an asynchronous queue processor, managing thousands of deferred tasks without occupying worker slots.
- Postgres - the place where Airflow stores all its data, statuses, etc.
- Airflow Executors - the place where your job is done. In this project, I am using LocalExecutor, which runs on the Scheduler instance. However, in production, you'll use something more flexible, such as CeleryExecutor or KubernetesExecutor. Two examples (see the sketch after this list):
  1. In the case of the dbt DAG, the Airflow executor (worker) calls the dbt container via SSH.
  2. In the case of Spark, the Airflow executor (worker) is the service where the spark-submit operation is executed (meaning it is the Spark driver); the job itself runs on the Spark cluster (spark-master, spark-workers).
- airflow-init - container that runs initial scripts, creates users, etc. It dies immediately after it has finished.
- Spark Master - The Master performs tasks like scheduling, monitoring, and resource allocation.
- 3 Spark Workers - each worker runs an executor process that handles the individual tasks assigned by the Spark driver.
- Spark History Server - the server to view and analyze Spark jobs that have already finished.
- DBT container
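To make those two executor patterns concrete, here is a hedged sketch (connection ids, commands, and file paths are placeholders, not the exact ones used in this project):

```python
# Sketch of the two executor patterns described above.
# Connection ids, commands, and file paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="executor_patterns_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # 1. Call dbt running in another container over SSH.
    run_dbt = SSHOperator(
        task_id="run_dbt",
        ssh_conn_id="dbt_ssh",                 # placeholder connection id
        command="cd /usr/app/dbt && dbt run",  # placeholder path/command
    )

    # 2. spark-submit from the Airflow worker: the worker acts as the Spark driver,
    #    while the actual work runs on the Spark cluster.
    run_spark = SparkSubmitOperator(
        task_id="run_spark",
        conn_id="spark_default",               # e.g. pointing at spark://spark-master:7077
        application="/opt/airflow/dags/jobs/example_job.py",  # placeholder path
    )

    run_dbt >> run_spark
```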
Access Points:
- Airflow UI: http://localhost:8080 (admin/admin)
- Spark Master UI: http://localhost:8081
- Spark History Server: http://localhost:18080
- MinIO Console: http://localhost:9001 (minioadmin/minioadmin)
- Fake Data Generator: Swagger http://localhost:8090/docs
- Kafka Connect REST API: http://localhost:8083/connectors/iceberg-kafka-connect/
- Kafka UI: http://localhost:8084
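Once everything is up, a quick way to verify the Iceberg sink is healthy is to hit the Kafka Connect REST API, for example (assuming the connector name shown above):

```python
# Check the state of the Iceberg sink connector via the Kafka Connect REST API.
import requests

resp = requests.get(
    "http://localhost:8083/connectors/iceberg-kafka-connect/status",
    timeout=10,
)
resp.raise_for_status()
status = resp.json()
print(status["connector"]["state"])            # expected: RUNNING
print([t["state"] for t in status["tasks"]])   # per-task states
```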
I wanted to create the Dockerfiles in such a way that I can use them for both Docker Compose and Kubernetes. It means that inside the Dockerfiles you can find parts that are relevant only for running via Docker Compose or only for Kubernetes.
For example, for k8s, all containers mount the templates folder. The idea is to use those templates during Apache Spark k8s operations. It's possible to use those templates from S3; I just tried to simplify the process. However, I have uploaded the templates to MinIO, so you can try that approach if you'd like.
- Dockerfile.airflow
- The part that deals with the private/public key is relevant only for the Docker Compose scenario (and don't use it in production either).
- Dockerfile.dbt
- The part that deals with the private/public key and SSH server permissions is relevant only for the Docker Compose scenario, too (and don't use it in production either).
- The part that adds sparkuser to the container is there because in k8s the dbt container becomes the Spark driver; that is also why I add SPARK_DRIVER_HOST to the dbt profiles and dbt_template, since the executors need to communicate with the Spark driver.
- Dockerfile.spark
- Dockerfile.fake - a small Python app for generating fake data and sending it to Kafka.
- Dockerfile.kconnect - used to create the Kafka Connect image with the iceberg-kafka-connect sink.
Note: To reduce image build time, and since Dockerfile.airflow and Dockerfile.dbt both use Dockerfile.spark, those two images use a multi-stage build based on the already existing Spark image.
It means you should always ensure that the py-spark-spark image exists before building the other two. If you don't want to deal with that, just use the scripts I have created:
- For Docker Compose docker_compose.sh
- For k8s and Helm: helm/my_helm.sh
The project includes several example DAGs demonstrating different patterns:
Simple logging example (runs with both Docker Compose and Kubernetes)
Lightweight analytics without Spark (runs with both Docker Compose and Kubernetes). It reads a file from Object Storage, performs a simple group-by query, and stores the result back in Object Storage (a minimal sketch follows).
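Roughly, the DuckDB part of such a task can look like this (bucket, paths, credentials, and column names are placeholders; the real DAG may differ):

```python
# Sketch: DuckDB reads a file from MinIO (S3 API), runs a group-by,
# and writes the result back. Bucket, paths, and columns are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_endpoint='minio:9000';")
con.execute("SET s3_access_key_id='minioadmin';")
con.execute("SET s3_secret_access_key='minioadmin';")
con.execute("SET s3_use_ssl=false;")
con.execute("SET s3_url_style='path';")  # required for MinIO-style URLs

con.execute("""
    COPY (
        SELECT category, count(*) AS cnt
        FROM read_csv_auto('s3://warehouse/input/sample.csv')
        GROUP BY category
    )
    TO 's3://warehouse/output/sample_counts.parquet' (FORMAT PARQUET);
""")
```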
Simple job that creates a dataframe, runs some SQL operation, and writes the data to Object Storage (see the sketch after the list below).
- Docker Compose - uses remote Spark cluster (part of docker-compose-airflow.yml)
- k8s - uses Airflow's SparkKubernetesOperator to trigger the creation (via the Kubernetes controller and the Kubeflow Spark Operator) of a Spark cluster: driver and executors.
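The PySpark side of this DAG is, in spirit, something like the following (endpoint, credentials, and bucket names are placeholders for a MinIO setup):

```python
# Sketch of a simple PySpark job: build a DataFrame, run SQL, write to object storage.
# Endpoint, credentials, and bucket are placeholders for a MinIO setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("simple_job")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

df = spark.createDataFrame(
    [("books", 10), ("games", 7), ("books", 3)],
    ["category", "amount"],
)
df.createOrReplaceTempView("sales")

result = spark.sql(
    "SELECT category, SUM(amount) AS total FROM sales GROUP BY category"
)
result.write.mode("overwrite").parquet("s3a://warehouse/output/sales_totals")
```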
The Spark job reads data from Object Storage, performs a group-by operation, and stores the data in an Iceberg table (backed by Object Storage) using the configured catalog implementation (Hadoop, Glue, or Hive); see the sketch after the list below.
- Docker Compose - uses remote Spark cluster (part of docker-compose-airflow.yml)
- k8s - uses Airflow's SparkKubernetesOperator to trigger the creation (via the Kubernetes controller and the Kubeflow Spark Operator) of a Spark cluster: driver and executors.
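The Iceberg-specific part boils down to pointing Spark at a catalog and using the DataFrameWriterV2 API. A hedged sketch with a Hadoop catalog (catalog, namespace, and table names are placeholders):

```python
# Sketch of writing a grouped result to an Iceberg table via a Hadoop catalog.
# Catalog name, warehouse path, namespace, and table name are placeholders;
# the s3a endpoint/credential settings from the previous sketch are omitted here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg_write")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://warehouse/iceberg")
    .getOrCreate()
)

df = spark.read.csv("s3a://warehouse/input/sample.csv", header=True, inferSchema=True)
grouped = df.groupBy("category").count()

# Create the table once, then append the grouped data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.category_counts (
        category STRING,
        count BIGINT
    ) USING iceberg
""")
grouped.writeTo("local.db.category_counts").append()
```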
Same as the previous one, but using dbt. A really overengineered use case, but it's for learning purposes - pardon me. Here is the example again; it includes too much code due to my desire to reuse the same stuff for both Docker Compose and k8s. Inside dbt/profiles.yml, you can see all the Spark configuration relevant for k8s. If you use Docker Compose only (a remote Spark cluster), you can delete everything similar to this if statement:

"{{ env_var('SPARK_DRIVER_HOST') if env_var('SPARK_MASTER_URL', 'local[*]').startswith('k8s://') else 'dbt' }}"

Otherwise, you can just remove everything starting from the if, as follows:

"{{ env_var('SPARK_DRIVER_HOST') }}"

- Docker Compose - but don't use it in such a way; it is very insecure.
- k8s
For more information, refer to dbt here.
This DAG is for debugging purposes. It executes Spark SQL over Iceberg tables. It should be triggered manually, and it has 2 parameters:
- warehouse name - the data warehouse name
- spark-sql - the SQL to execute
The first 5 rows of the result are printed in the logs.
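Something along these lines, as a sketch (the param names, defaults, and Spark plumbing are illustrative, not the exact DAG code):

```python
# Sketch of a manually triggered DAG with two params that runs Spark SQL
# over Iceberg tables and logs the first rows. Names and defaults are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="spark_sql_debug_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,   # manual trigger only
    catchup=False,
    params={"warehouse": "warehouse", "spark_sql": "SELECT 1"},
) as dag:

    @task
    def run_sql(**context):
        from pyspark.sql import SparkSession

        warehouse = context["params"]["warehouse"]
        sql = context["params"]["spark_sql"]
        spark = SparkSession.builder.appName("debug_sql").getOrCreate()
        # ... Iceberg catalog configuration for the given warehouse goes here ...
        for row in spark.sql(sql).take(5):
            print(row)

    run_sql()
```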
py_spark/
├── README.md # Detailed technical documentation
├── dbt/ # dbt data transformation project
│ └── README.md # dbt-specific documentation
├── helm/ # Kubernetes/Helm deployment
│ └── README.md # Helm deployment guide
├── dags/ # Airflow DAG definitions
├── kafka/ # the python applications related to kafka
│ └── data_generator # python code for "fake" service to generate fake data
├── docker-compose-airflow.yml # Docker Compose configuration
├── Dockerfile.airflow # Airflow container image
├── Dockerfile.spark # Spark container image
├── Dockerfile.dbt # dbt container image
├── Dockerfile.fake # Fake data generator container image
├── scripts/ # Helper scripts
├── data/ # Sample data files
└── requirements.txt # Python dependencies
- Docker and Docker Compose (for local development)
- Kubernetes cluster (based on Docker Desktop) and Helm (for Kubernetes deployment)
- Python 3.11+ (if running components outside containers)
1. Clone the repository
   cd /path/to/py_spark
2. Build and start all services
   ./docker_compose.sh up -d --build
3. Access Airflow UI
   - Navigate to http://localhost:8080
   - Login with admin/admin
   - Enable and trigger sample DAGs
4. Monitor Spark jobs
   - Spark Master UI: http://localhost:8081
   - Spark History Server: http://localhost:18080
5. Check MinIO storage
   - MinIO Console: http://localhost:9001
   - Login with minioadmin/minioadmin
Access Airflow UI
- Port Forward airflow-api-server:
  kubectl port-forward svc/airflow-api-server 8080:8080 --namespace py-spark
- Navigate to http://localhost:8080
- Login with admin/admin
- Enable and trigger sample DAGs
For Kubernetes deployment instructions, see helm/README.md.
The project uses environment variables for configuration. Example files are provided:
- env.minio.example - Configuration for MinIO storage (tested)
- env.glue.example - Configuration for AWS Glue Catalog (not tested)
The platform supports multiple Iceberg catalog implementations:
- Hadoop Catalog - File-based catalog (default for local development)
- AWS Glue Catalog - AWS managed catalog service (not tested, but should work)
- Hive Metastore - Traditional Hive catalog (not tested)
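For orientation, here is roughly how the three catalog flavours map to Spark configuration (the catalog name, warehouse paths, and metastore URI are placeholders; the project's env-var-driven setup may differ):

```python
# Sketch of the three Iceberg catalog options as Spark configs.
# Only one block would be active at a time; names, paths, and URIs are placeholders.
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("catalog_examples").config(
    "spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
)

# 1. Hadoop catalog (file/object-store based, the default here)
builder = (
    builder.config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://warehouse/iceberg")
)

# 2. AWS Glue catalog (untested in this project)
# builder = (
#     builder.config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
#     .config("spark.sql.catalog.demo.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
#     .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/iceberg")
# )

# 3. Hive Metastore catalog (untested in this project)
# builder = (
#     builder.config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
#     .config("spark.sql.catalog.demo.type", "hive")
#     .config("spark.sql.catalog.demo.uri", "thrift://hive-metastore:9083")
# )

spark = builder.getOrCreate()
```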
- Main README - Comprehensive technical documentation covering:
  - Detailed architecture and component descriptions
  - Dockerfile explanations for each service
  - DAG implementation details
  - Configuration and troubleshooting
- Helm Deployment Guide - Kubernetes deployment documentation:
  - Helm chart management with the my_helm.sh script
  - Service-by-service deployment options
  - Upgrade and maintenance procedures
  - Kubernetes-specific troubleshooting
- dbt Documentation - Data transformation layer guide:
  - dbt project structure and models
  - Multi-catalog configuration (Hadoop, Glue, Hive)
  - Environment variable reference
  - Spark configuration for dbt
  - Integration with Airflow and Kubernetes
This project is designed for learning and development purposes.
For production use, you should:
- Replace insecure SSH-based communication patterns
- Implement proper secrets management (not plain environment variables)
- Use production-grade executors (CeleryExecutor or KubernetesExecutor)
- Configure proper resource limits and monitoring
- Use external object storage (S3) instead of MinIO
- Implement proper authentication and authorization
- Set up proper logging and alerting
- Airflow: 3.1.0
- Spark: 4.0.0
- Iceberg: 1.10.0
- Python: 3.11
- dbt-core: Latest with dbt-spark adapter
- PostgreSQL: Latest (Airflow metadata)
- Kafka: 4.1.0
This project was developed with assistance from AI tools (Cursor.ai, IntelliJ Junie, IntelliJ Gemini plug-in, ChatGPT, Gemini, Claude). While AI provided a good starting structure, significant manual refinement was required to create a working system. The project serves as a realistic example of AI-assisted development, including both its potential and limitations.