By pure chance, I stumbled upon Ansh Lamba’s post on my LinkedIn feed and was instantly inspired by his clear, hands-on approach to data engineering, showcased through a project using an aviation dataset. Aviation is a subject I absolutely love and have previously explored in depth during my Ironhack bootcamp capstone project.
The project aligns perfectly with my recent Databricks Data Analyst Associate and Databricks Data Engineer Associate certifications, and aims to provide a comprehensive understanding of modern data engineering practices within the Databricks ecosystem.


♡ Ansh Lamba – for sharing incredible hands-on tutorials and making complex Data Engineering concepts feel simple (and fun!)
♡ Databricks Free Edition – huge thanks to Databricks for giving everyone a free Lakehouse sandbox to learn and level up in Data & AI
♡ dbdiagram.io – you guys rock, not only with your incredible tools but also your dedicated documentation
♡ GitHub – for providing the ultimate platform for collaboration, version control, and open-source learning
- SQL
- Python
- Apache Spark
- Spark Structured Streaming
- Databricks
- Auto Loader
- Lakeflow Declarative Pipelines
- Unity Catalog Volumes
- Designing a dimensional data model (star schema)
- Implementing Medallion Architecture
- Incremental data ingestion with Spark Streaming and Databricks Auto Loader
- Parameter parsing and control flow in Databricks jobs
- Building declarative pipelines with Lakeflow
- Creating automated Slowly Changing Dimensions (SCD) builders and fact builders
- Medallion Architecture Implementation
- The Dataset
- The Setup
- 1. From Raw to Bronze Layer
- 2. From Bronze to Silver Layer
- 3. From Silver to Gold Layer
- 4. Stakeholder's Flight KPI Dashboard
The project is structured around the Medallion Architecture, which consists of three layers:
- Data Ingestion: Incremental data loading using Spark Structured Streaming and Databricks Auto Loader.
- Source Data: Primarily focuses on file-based data sources, reflecting common big data scenarios.
- Dynamic Solution: Emphasizes building dynamic, deployable notebooks for the bronze layer using parameter parsing and control flow in Databricks jobs, rather than static solutions.
- Data Transformation: Utilizes Databricks Lakeflow Declarative Pipelines (formerly Delta Live Tables - DLT) for data transformations.
- Advanced Techniques: Incorporates PySpark, function calling, and other newly introduced features within Databricks Lakeflow for an enhanced development experience.
- Dynamic Approach: Similar to the bronze layer, the silver layer will also be built using a dynamic approach for deployable notebooks.
- Core of the Project: This layer is highlighted as the "heart" of the project due to its innovative approach.
- Dimensional Data Model: Building a star schema.
- Automated Builders: A key innovation is the creation of a Slowly Changing Dimensions (SCD) builder and an automatic fact builder. These builders automate the creation and deployment of SCDs and fact tables, making the solution reusable across multiple data warehouses. This reflects an industry-level approach to data warehouse development.
The source data for the project is made available by Ansh Lamba in this repository on GitHub.
Based on the defined business problem, the data model design aims to represent the data efficiently for reusability, flexibility and scalability.
The dataset includes multiple versions and categories of these dimensions to facilitate mastering concepts like Slowly Changing Dimensions (SCD) and incremental data loading.
The dataset is structured around a central fact table and three dimension tables, which are foundational to building a star schema.
- Create your Databricks Free Edition account
- Create a volume to store CSV files, via the UI under Catalog or via code (we go for the code option)
- Create a folder flights under Workspaces to store the notebooks
- Start a new notebook to create our setup code
Code example from the Setup notebook:
%sql
-- Create Volumes for Layered Data
CREATE VOLUME flights.bronze.bronze_volume;
CREATE VOLUME flights.silver.silver_volume;
CREATE VOLUME flights.gold.gold_volume;
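These volumes assume the flights catalog and its bronze, silver, and gold schemas (plus the raw schema and rawvolume used later for the source CSVs) already exist. A minimal sketch of that prerequisite step, assuming your workspace allows creating catalogs (otherwise reuse an existing catalog):

# Sketch (assumption): create the catalog, schemas, and raw volume referenced throughout this project
spark.sql("CREATE CATALOG IF NOT EXISTS flights")
for schema in ["raw", "bronze", "silver", "gold"]:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS flights.{schema}")
spark.sql("CREATE VOLUME IF NOT EXISTS flights.raw.rawvolume")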
Databricks Auto Loader "incrementally and efficiently processes new data files as they arrive in cloud storage. It provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory. Auto Loader has support for both Python and SQL in Lakeflow Declarative Pipelines." Source: Databricks
Code example from the Bronze Incremental Data Ingestion notebook:
- Load Data with Spark
  - The cloudFiles format automatically detects new files and infers schema, storing checkpoints for incremental loading
df = spark.readStream.format("cloudFiles")\
    .option("cloudFiles.format", "csv")\
    .option("cloudFiles.schemaLocation", f"/Volumes/flights/bronze/bronze_volume/{src_value}/checkpoint")\
    .option("cloudFiles.schemaEvolutionMode", "rescue")\
    .load(f"/Volumes/flights/raw/rawvolume/rawdata/{src_value}/")
- Write Data to Bronze Layer
  - Writes the ingested raw data into the Bronze layer as a Delta table
  - The writeStream function ensures data is appended and checkpointed for streaming pipelines
df.writeStream.format("delta")\
    .outputMode("append")\
    .trigger(once=True)\
    .option("checkpointLocation", f"/Volumes/flights/bronze/bronze_volume/{src_value}/checkpoint")\
    .option("path", f"/Volumes/flights/bronze/bronze_volume/{src_value}/data")\
    .start()
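The src_value variable that parameterizes these paths is what makes the notebook dynamic and deployable. The exact parameter wiring is not shown here; a minimal sketch, assuming the notebook receives a widget/job parameter named src:

# Sketch (assumption): read the source name passed in by the Databricks job
# (e.g. "airports", "bookings", "customers", "flights")
dbutils.widgets.text("src", "")
src_value = dbutils.widgets.get("src")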
- Bronze Layer Volumes: airports, flights, passengers, and bookings
Another crucial aspect of Databricks Autoloader "is its ability to provide exactly-once processing semantics for ingested files. As new files are discovered in the monitored cloud storage location, their metadata is durably persisted in a cloud-based checkpoint location using a scalable key-value store (Apache RocksDB). (...) This fault tolerance and exactly-once processing semantics are achieved natively by Autoloader, without requiring any manual state management or custom code, simplifying the development and maintenance of reliable and idempotent data ingestion pipelines." Source: Chaos Genius
Here is where you can find the RocksDB folder in this project:
Using the Bronze Incremental Data Ingestion notebook, we dynamically ingest the raw source files dim_airports, dim_flights, dim_passengers, and fact_bookings from the /Volumes/flights/raw/rawvolume/rawdata/ hierarchy into the Bronze layer as Delta tables.
Example:
In /Volumes/flights/raw/rawvolume/rawdata/customers/ the dim_passengers.csv file is stored and contains 200 passenger records.
Because the pipeline uses Auto Loader with checkpointing, rerunning the notebook is idempotent and processes only newly arrived files.
When you add an incremental file (e.g., dim_passengers_increment.csv with 20 new rows) to the same folder and rerun:
- those rows are appended to the dim_passengers bronze table

Repeat the pattern for the other datasets by adding:
- dim_airports → upload dim_airports_increment.csv
- dim_flights → upload dim_flights_increment.csv
- fact_bookings → upload fact_bookings_increment.csv
The full code is in the Source Parameters notebook.
- Define Source Parameters
  - Create an array of dictionaries called src_array, which lists the different data sources (airports, bookings, customers, flights)
  - This array is used to parameterize data ingestion workflows across multiple source datasets
src_array = [
    {"src": "airports"},
    {"src": "bookings"},
    {"src": "customers"},
    {"src": "flights"}
]
- Store Task Values in Databricks Jobs
  - Store the src_array as a key-value pair (output_key)
  - These values can be accessed by downstream tasks in a Databricks Job to dynamically control their execution
dbutils.jobs.taskValues.set(key = "output_key", value = src_array)
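On the consuming side, a downstream task can read this value back and drive a For Each loop over the sources. A minimal sketch, where the upstream task name set_parameters is illustrative:

# Sketch (assumption): read the task value set by the upstream task and loop over the sources
src_array = dbutils.jobs.taskValues.get(
    taskKey = "set_parameters",          # hypothetical upstream task name
    key = "output_key",
    debugValue = [{"src": "airports"}]   # fallback when running the notebook interactively
)
for src in src_array:
    print(src["src"])  # e.g. pass src["src"] to the Bronze ingestion notebook as its "src" parameter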
- Navigate to the Jobs and Pipelines menu to create a new Job:
  - The Python job configuration file is available at this link.

The successful job execution now allows seamless ingestion of new data from the raw volumes into the Bronze layer, with dynamic and incremental processing.
- Silver Layer Pipeline
- Define Bronze-to-Silver Table stage_bookings
  - This Delta Live Table (DLT) loads streaming data from the bookings folder in the Bronze layer
  - It uses spark.readStream to continuously read Delta files and registers the output as a managed DLT table
import dlt

@dlt.table(
    name = "stage_bookings"
)
def stage_bookings():
    df = spark.readStream.format("delta")\
        .load("/Volumes/flights/bronze/bronze_volume/bookings/data/")
    return df
- Apply Transformations to Bookings
  - This DLT view applies transformations to the stage_bookings data, including type casting, adding timestamps, and dropping the _rescued_data column
from pyspark.sql.functions import col, current_timestamp, to_date
from pyspark.sql.types import DoubleType

@dlt.view(
    name = "transformations_bookings"
)
def transformations_bookings():
    df = spark.readStream.table("stage_bookings")
    df = df.withColumn("amount", col("amount").cast(DoubleType()))\
        .withColumn("modified_date", current_timestamp())\
        .withColumn("booking_date", to_date(col("booking_date")))\
        .drop("_rescued_data")
    return df
- Next, navigate to the Jobs and Pipelines menu to create a new Pipeline.
- The code was first developed in the Silver Layer notebook.
- It was then transferred to the Bookings Silver Transformation Python script for production use.
- And here we have our Bookings Silver tables: silver_bookings and stage_bookings
- This pipeline reads flight data from a Bronze Delta Lake volume, applies basic transformations, and loads it into a Silver table using Databricks AUTO CDC. Click here to learn more about Databricks AUTO CDC.
- transformations_flights (DLT View)
- Reads streaming data from the bronze path.
- Adds modified_date with the current timestamp (used for CDC sequencing).
- Casts flight_date to DATE type.
- Drops the _rescued_data column (used for error handling in schema inference).
@dlt.view(
    name = "transformations_flights"
)
def transformations_flights():
    df = spark.readStream.format("delta")\
        .load("/Volumes/flights/bronze/bronze_volume/flights/data/")\
        .withColumn("modified_date", current_timestamp())\
        .withColumn("flight_date", to_date(col("flight_date")))\
        .drop("_rescued_data")
    return df
- silver_flights (Streaming Table)
- Created using dlt.create_streaming_table.
- Populated by create_auto_cdc_flow from the transformations_flights view.
- Uses flight_id as the unique key and modified_date as the sequencing column.
- Stores records as a Type 1 Slowly Changing Dimension (SCD).
dlt.create_streaming_table("silver_flights")

dlt.create_auto_cdc_flow(
    target = "silver_flights",
    source = "transformations_flights",
    keys = ["flight_id"],
    sequence_by = col("modified_date"),
    stored_as_scd_type = 1
)
- Navigate to the Jobs and Pipelines menu to create a new Pipeline.
- In the Flights Project Pipeline file you can find the full Python script for production use.
- Let’s explore how a DLT Pipeline dry run looks:
- Created a unified Silver business view by joining the tables (bookings, pax, flights, airports) using Delta Live Tables.
- Used dlt.read_stream() to stream data from silver-layer tables, enabling near real-time insights for business consumption.
- Simplified downstream analytics by consolidating key entities into a single, ready-to-query DLT table (silver_business).
@dlt.table(
    name = "silver_business"
)
def silver_business():
    df = dlt.read_stream("silver_bookings")\
        .join(dlt.read_stream("silver_pax"), ["passenger_id"])\
        .join(dlt.read_stream("silver_flights"), ["flight_id"])\
        .join(dlt.read_stream("silver_airports"), ["airport_id"])\
        .drop("modified_date")
    return df
- Monitoring the DLT Pipeline – successfully completed
- Materialized View
- Although not strictly required in the pipeline, the following table was added to demonstrate how to create a materialized view using Delta Live Tables (DLT).
- This uses dlt.read() instead of dlt.read_stream(), which means it’s treated like a batch (non-streaming) table, similar to a materialized view.
- It mirrors the silver_business table for demonstration purposes.
- Helpful when showing how DLT handles static (non-streaming) tables vs. streaming inputs.
@dlt.table(
    name = "silver_business_mv"
)
def silver_business_mv():
    df = dlt.read("silver_business")
    return df
- New run: monitoring the DLT Pipeline – nearing completion: last step running, others completed
- After re-running the pipeline, no data was reloaded, thanks to Auto Loader’s checkpointing and file tracking.
- Since the files were already processed, Databricks automatically skipped them, ensuring idempotent and efficient execution.
This Gold Layer Notebook (full code here) transforms Silver Layer tables into cleaned and structured Gold Layer dimension tables for analytics.
It handles multiple source tables from flights.silver:
- silver_pax → dim_passengers
- silver_flights → dim_flights
- silver_airports → dim_airports
Techniques used:
- Parameterized configuration (via code cells)
- Surrogate key generation
- Support for backdated refresh and CDC
These parameters define how the silver_flights table will be transformed into the dim_flights dimension in the gold layer:
# Catalog Name
catalog_name = "flights"
# Key Columns
key_cols = "['flight_id']"
key_cols_list = eval(key_cols)
# CDC Column
cdc_col = "modified_date"
# Back-dated Refresh
# Back-dated refresh is an extension of incremental refresh, allowing you to specify a date range
# within the historical data to be refreshed.
backdated_refresh = ""
# Source Object
source_object = "silver_flights"
# Source Schema
source_schema = "silver"
# Target Schema
target_schema = "gold"
# Target Object
target_object = "dim_flights"
# Surrogate Key
surrogate_key = "DimFlightsKey"
This block determines the last_load timestamp used to filter records during incremental ingestion.
It supports both:
- Standard incremental loads (based on the CDC column)
- Backdated refreshes (e.g. for reprocessing historical ranges)
# Determine the last load timestamp for incremental processing

# Case 1: No backdated refresh specified
if len(backdated_refresh) == 0:
    # If the target table exists, get the max value from the CDC column
    if spark.catalog.tableExists(f"{catalog_name}.{target_schema}.{target_object}"):
        last_load = spark.sql(
            f"SELECT max({cdc_col}) FROM {catalog_name}.{target_schema}.{target_object}"
        ).collect()[0][0]
    # If the target table doesn't exist, fall back to a default starting point
    else:
        last_load = "1900-01-01 00:00:00"
# Case 2: Backdated refresh is provided
else:
    last_load = backdated_refresh

# Output the resolved timestamp (for debug/logging)
last_load
- Using the previously calculated last_load timestamp, this query filters the source table (silver layer) to load only new or updated records based on the CDC (Change Data Capture) column, modified_date. This is the core of the incremental ingestion logic.
- The logic dynamically adapts to the presence (or absence) of the target table.
- It uses the maximum CDC value from the target to define the “last load”, enabling seamless dynamic incremental loading.
# Read only the new or updated records from the Silver layer since the last load
df_src = spark.sql(
    f"SELECT * FROM flights.{source_schema}.{source_object} WHERE {cdc_col} >= '{last_load}'"
)
- This section prepares the target DataFrame (df_target) to compare and merge against the incoming incremental data (df_src).
- It handles both cases: when the target table already exists, and when it doesn’t (e.g., during the initial load).
if spark.catalog.tableExists(f"{catalog_name}.{target_schema}.{target_object}"):
    # Prepare key columns to fetch existing records from the target table
    key_cols_str_incremental = ", ".join(key_cols_list)

    df_target = spark.sql(f"""
        SELECT {key_cols_str_incremental}, {surrogate_key}, create_date, update_date
        FROM {catalog_name}.{target_schema}.{target_object}
    """)
else:
    # Prepare a dummy empty structure to allow merge logic even if the table doesn't exist yet
    key_cols_str_initial = [f"'' AS {i}" for i in key_cols_list]
    key_cols_str_initial = ", ".join(key_cols_str_initial)

    df_target = spark.sql(f"""
        SELECT {key_cols_str_initial},
               CAST('0' AS INT) AS {surrogate_key},
               CAST('1900-01-01 00:00:00' AS timestamp) AS create_date,
               CAST('1900-01-01 00:00:00' AS timestamp) AS update_date
        WHERE 1=0
    """)
- Adds audit metadata to support data lineage and traceability.
- Ensures surrogate keys are unique and sequential.
- Prepares both old and new datasets for the upcoming merge/upsert step.
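The split of the incoming increment (df_src) into existing records (df_old) and brand-new records (df_new) is not shown in this write-up; a minimal sketch of one way to do it, assuming the df_src and df_target DataFrames defined above:

# Sketch (assumption): split the increment by joining on the business key columns
df_old = df_src.join(df_target, key_cols_list, "inner")      # records already present in the target
df_new = df_src.join(df_target, key_cols_list, "left_anti")  # records not yet present in the target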
# Adds a current timestamp to the update_date column for all records that already exist in the target. This signals that the record was modified.
df_old_enrich = df_old.withColumn("update_date", current_timestamp())
- This adds three fields to new records:
  - surrogate_key: a unique identifier for each new row.
  - create_date: timestamp of initial creation.
  - update_date: timestamp of initial creation (same as create_date on insert).
- monotonically_increasing_id() is used to ensure uniqueness across partitions when generating surrogate keys.
# Check if the target table already exists in the catalog
if spark.catalog.tableExists(f"{catalog_name}.{target_schema}.{target_object}"):
    # Get the current maximum surrogate key from the target table
    max_surrogate_key = spark.sql(f"""
        SELECT MAX({surrogate_key})
        FROM {catalog_name}.{target_schema}.{target_object}
    """).collect()[0][0]

    # Enrich the new records with:
    # - A new surrogate key (offset from max + unique ID)
    # - Current timestamps for create_date and update_date
    df_new_enrich = df_new.withColumn(f"{surrogate_key}",
            lit(max_surrogate_key) + lit(1) + monotonically_increasing_id())\
        .withColumn("create_date", current_timestamp())\
        .withColumn("update_date", current_timestamp())
else:
    # If the table does not exist, start surrogate keys from 1
    max_surrogate_key = 0

    # Enrich new records with surrogate key starting at 1,
    # and set current timestamps for create_date and update_date
    df_new_enrich = df_new.withColumn(f"{surrogate_key}",
            lit(max_surrogate_key) + lit(1) + monotonically_increasing_id())\
        .withColumn("create_date", current_timestamp())\
        .withColumn("update_date", current_timestamp())
- Checks if the target Delta table exists before writing.
- Performs merge (UPSERT) if the table exists, using a surrogate key.
- Inserts all data if the table doesn’t exist yet.
from delta.tables import DeltaTable

# Check if the target Delta table already exists in the metastore
if spark.catalog.tableExists(f"{catalog_name}.{target_schema}.{target_object}"):
    # Load the existing Delta table as a DeltaTable object to perform merge operations (UPSERT)
    dlt_obj = DeltaTable.forName(spark, f"{catalog_name}.{target_schema}.{target_object}")

    # Perform the UPSERT using the surrogate key as the matching condition
    # - If a match is found and the new change data capture column (cdc_col) is more recent or equal, update all fields.
    # - If no match is found, insert the new record.
    dlt_obj.alias("trg").merge(df_union.alias("src"), f"trg.{surrogate_key} = src.{surrogate_key}")\
        .whenMatchedUpdateAll(condition = f"src.{cdc_col} >= trg.{cdc_col}")\
        .whenNotMatchedInsertAll()\
        .execute()
else:
    # If the table does not exist, create it and load the data using 'append' mode
    # This initializes the Delta table with the unioned dataset
    df_union.write.format("delta")\
        .mode("append")\
        .saveAsTable(f"{catalog_name}.{target_schema}.{target_object}")
- Type 1 handles changes by overwriting existing records without keeping historical data.
- Useful when historical accuracy isn't required and only the latest values matter.
- Implemented using upsert logic based on surrogate keys and modified_date.
- In this project we have examples for these tables:
  - dim_passengers → dim_passengers_scd.csv file
  - dim_flights → dim_flights_scd.csv file
  - dim_airports → dim_airports_scd.csv file
- Identify Updated Data for SCD Testing
  - Locate updated test data files for SCDs (e.g. passengers_scd).
  - Confirm that certain records (like P0049) have modified attributes (e.g., name changed to “Justin Thomas”).
- Query the Existing Record
  - Use a query such as SELECT * FROM passengers WHERE passenger_id = 'P0049' to compare the current value before running the pipeline.
| Name | Gender | Nationality |
|-----------------|--------|-------------|
| Douglas Montoya | Female | Pakistan |
- Upload the Modified Data
  - Upload the updated dim_passengers_scd.csv data file into the raw volume.
- Run the Autoloader Pipeline
  - Trigger the Autoloader (Bronze) pipeline to incrementally load the new data.
  - Confirm new rows have been added (e.g., record count increases from 220 to 235).
- Run the DLT (Silver) Pipeline
  - Go to Workflows → Pipelines and run the Dimension Transformation (DT) pipeline.
  - This pipeline performs the upsert based on modified_date.
- Confirm New and Updated Records
  - Query the silver-level passengers table: SELECT * FROM flights.silver.passengers
  - Confirm that records are either updated or newly inserted according to the business logic.
- Run the Dimension Builder
  - Run the notebook responsible for building dimensions from the silver table.
  - This will further upsert the data into the gold-level dimensional models.
- Final Validation
  - Query the dimension table and confirm that the record (e.g., P0049) reflects the updated value (Justin Thomas).
  - SELECT * FROM flights.gold.dim_passengers WHERE passenger_id = 'P0049'
| Name | Gender | Nationality |
|---------------|--------|-------------|
| Justin Thomas | Female | Tokelau |
This Gold Layer Notebook (full code here) transforms Silver Layer tables into a cleaned and structured Gold Layer fact table ready for analytics.
Key techniques used:
- Parameterized configuration (via code cells)
- Surrogate key generation
- Support for backdated refresh and Change Data Capture (CDC)
- Dynamically constructs a SQL query that joins the fact table to its dimension tables.
- Adds surrogate key columns: dim_passengers → DimPassengersKey.
- Applies incremental filtering using the previously calculated last_load timestamp.
def generate_fact_query_incremental(fact_table, dimensions, fact_columns, cdc_col, processing_date):
    fact_alias = "f"

    # Base columns to select
    select_columns = [f"{fact_alias}.{col}" for col in fact_columns]

    # Build joins dynamically
    join_clauses = []
    for dim in dimensions:
        table_full = dim["table"]
        alias = dim["alias"]
        table_name = table_full.split('.')[-1]

        # Convert table name to PascalCase
        pascal_name = ''.join(word.capitalize() for word in table_name.split('_'))
        surrogate_key = f"{alias}.{pascal_name}Key"
        select_columns.append(surrogate_key)

        # Build ON clause
        on_conditions = [
            f"{fact_alias}.{fk} = {alias}.{dk}" for fk, dk in dim["join_keys"]
        ]
        join_clause = f"LEFT JOIN {table_full} {alias} ON " + " AND ".join(on_conditions)
        join_clauses.append(join_clause)

    # Final SELECT and JOIN clauses
    select_clause = ",\n    ".join(select_columns)
    joins = "\n".join(join_clauses)

    # WHERE clause for incremental filtering (last_load is resolved in an earlier notebook cell)
    where_clause = f"{fact_alias}.{cdc_col} >= DATE('{last_load}')"

    # Final query
    query = f"""
    SELECT
        {select_clause}
    FROM
        {fact_table} {fact_alias}
    {joins}
    WHERE
        {where_clause}
    """.strip()

    return query
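The dimensions argument expected by this builder is a list of dictionaries describing each join (table, alias, and join key pairs), as the loop above shows. The exact configuration used in the project is not reproduced here, so the following call is a hypothetical illustration; table names, aliases, column lists, and join keys are assumptions based on the data model:

# Hypothetical usage sketch of the fact builder
dimensions = [
    {"table": "flights.gold.dim_passengers", "alias": "dp",  "join_keys": [("passenger_id", "passenger_id")]},
    {"table": "flights.gold.dim_flights",    "alias": "dfl", "join_keys": [("flight_id", "flight_id")]},
]
fact_columns = ["booking_id", "amount", "booking_date", "modified_date"]

query = generate_fact_query_incremental(
    fact_table = "flights.silver.silver_bookings",
    dimensions = dimensions,
    fact_columns = fact_columns,
    cdc_col = "modified_date",
    processing_date = None   # not used by the function as written; last_load drives the filter
)
df_fact = spark.sql(query)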
- Dynamically builds the merge condition using the list of fact key columns.
- Checks if the target Delta table exists before writing.
- Performs merge (UPSERT) using the fact keys and a CDC (change data capture) column to resolve updates.
- Inserts all data if the table doesn’t exist yet.
# Fact Key Columns Merge Condition
fact_key_cols_str = " AND ".join([f"src.{col} = trg.{col}" for col in fact_key_cols])

# Check if the target Delta table already exists in the metastore
if spark.catalog.tableExists(f"{catalog_name}.{target_schema}.{target_object}"):
    # Load the existing Delta table as a DeltaTable object to perform merge operations (UPSERT)
    dlt_obj = DeltaTable.forName(spark, f"{catalog_name}.{target_schema}.{target_object}")

    # Perform the UPSERT using the fact keys as the matching condition
    # If a match is found and the new change data capture column (cdc_col) is more recent or equal,
    # update all fields.
    # If no match is found, insert the new record.
    dlt_obj.alias("trg").merge(df_fact.alias("src"), fact_key_cols_str)\
        .whenMatchedUpdateAll(condition = f"src.{cdc_col} >= trg.{cdc_col}")\
        .whenNotMatchedInsertAll()\
        .execute()
else:
    # If the table does not exist, create it and load the data using 'append' mode
    # This initializes the Delta table with the fact dataset
    df_fact.write.format("delta")\
        .mode("append")\
        .saveAsTable(f"{catalog_name}.{target_schema}.{target_object}")
As an additional step beyond the video course, this dashboard was developed to showcase the insights derived from the curated Gold layer.
This dashboard provides a consolidated view of key performance indicators (KPIs) for the hypothetical Travel Agency’s sales and operations during the selected date range.
It supports decision-making by offering insights into sales trends, airline performance, passenger distribution, and top destinations.
- Functionality: Enables dynamic filtering of all metrics in the dashboard based on a custom time window.
- Current Range Displayed: July 1 to July 21, 2025
- Decorative section for branding and visual identification.
- Highlights the Travel Agency domain.
- Purpose: Displays total daily sales across different city locations.
- Insight:
- Berryport is the top-performing location, followed by Bennettcide and Cooperview.
- Sales figures decrease progressively across the ranked locations.
- Metric Unit: Approximate revenue in units (currency or ticket value, TBD).
- Purpose: Shows number of flight bookings per airline.
- Additional Insight:
- A “Profit Line” is visually represented to indicate performance targets or thresholds.
- Purpose: Interactive slider to filter airlines by a specified sales range.
- Current Range: 309.36 to 1448.27 (assumed unit: currency or average ticket value)
- Purpose: Visual map highlighting the origin countries of passengers.
- Regions Highlighted: Includes countries from Africa, Latin America, Asia, and Oceania.
- Tool: Mapbox-powered interactive layer for regional insights.
- Purpose: Lists most visited cities based on booking data.
- Interaction: Scrollable for exploring all destinations.
Thank you!












