- Problem Statement
- Solution
- Project Details
- Running the Project
- Dashboard
- Contact
- Acknowledgments
This project demonstrates deploying an end-to-end data pipeline on the cloud that transforms raw grocery sales data into actionable business insights. Using GCP, Spark, Airflow, DBT, and Looker Studio, the project automates data ingestion, transformation, and visualization to answer critical business questions such as:
- Who are our top-performing salespeople and customers?
- Which regions generate the most revenue?
- How do product categories contribute to revenue?
- What are our hot selling products?
By focusing on these metrics, the project emulates the day-to-day work of a data engineer supporting data analysts and decision-makers in sales and marketing.
Transaction records come in every day, and retailers must make data-driven decisions to effectively manage their business. Without tools that automatically process this raw data, these records would be nothing more than meaningless numbers. Retailers could struggle with fragmented data and a lack of automated processing. Therefore, a scalable, scheduled data processing pipeline is essential.
Once the records are cleaned and organized, retailers can extract key business insights such as:
- Customer Behavior: Who are our repeat customers? What is their spending pattern?
- Sales Team Performance: Which employees generate the most revenue?
- Geographic Analysis: Which cities are leading in sales?
To address these problems, this project implements a scalable ELT pipeline that:
- Ingests and processes over 6.7 million transaction records automatically using Airflow
- Generates product recommendations for customers using Spark
- Uses DBT to model and transform data into fact and dimension tables
- Visualizes data on Looker Studio
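The four steps above can be sketched as a single Airflow DAG. This is a minimal illustration only: the DAG id, task names, schedule, and `echo` commands below are placeholders, not the project's actual code.

```python
# Hypothetical step names mirroring the pipeline stages described above.
PIPELINE_STEPS = [
    "extract_from_kaggle",   # download raw CSVs via the Kaggle API
    "load_to_gcs",           # stage raw files in Cloud Storage
    "load_to_bigquery",      # load staged files into BigQuery
    "spark_jobs",            # customer segmentation + recommendations
    "dbt_run",               # build dimension, fact, and mart models
]

def build_dag():
    # Requires apache-airflow; sketch only.
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    import pendulum

    with DAG(
        dag_id="grocery_sales_elt",
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        schedule="@daily",
        catchup=False,
    ) as dag:
        prev = None
        for step in PIPELINE_STEPS:
            task = BashOperator(task_id=step, bash_command=f"echo {step}")
            if prev is not None:
                prev >> task  # chain the stages sequentially
            prev = task
    return dag
```

In practice each placeholder `BashOperator` would be replaced by the real Kaggle, GCS, BigQuery, Livy, and DBT tasks.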
At the end of this ELT pipeline, a dashboard is built to provide business insights such as:
- Top 10 customers by order value
- Top 10 employees by sales
- Top 10 cities by revenue
- Hot selling products
- Revenue trends over time
- Revenue by product category
- Customer segmentation by average order value
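Each of these KPIs boils down to an aggregation over the sales fact table. As an illustration only (not the project's actual DBT SQL, and with a hypothetical row format), the "top customers by order value" metric could be computed like this:

```python
from collections import defaultdict

def top_customers_by_order_value(sales, n=10):
    """Rank customers by total order value.

    `sales` is assumed to be an iterable of dicts with hypothetical
    'CustomerID' and 'TotalPrice' keys.
    """
    totals = defaultdict(float)
    for row in sales:
        totals[row["CustomerID"]] += row["TotalPrice"]
    # Sort descending by total spend and keep the top n customers.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

sales = [
    {"CustomerID": 1, "TotalPrice": 120.0},
    {"CustomerID": 2, "TotalPrice": 75.5},
    {"CustomerID": 1, "TotalPrice": 30.0},
]
print(top_customers_by_order_value(sales, n=2))  # [(1, 150.0), (2, 75.5)]
```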
The data pipeline automates the data processing workflow using the following tech stack:
- **Apache Airflow** for data pipeline orchestration
- **Google Cloud Platform (GCP)** services for the data lake, data warehouse, and virtual machines
- **Apache Spark** for data processing
- **Apache Livy** for triggering Spark jobs via REST API
- **DBT** for data modeling and transformation
- **GitHub Actions** for CI/CD
- **Terraform** for cloud resource provisioning
- **Looker Studio** for building the dashboard
This pipeline consists of the following data processing stages, and all tasks are built and run on the cloud:
- **Data Extraction**: An Airflow task called `get` downloads the data files from the Kaggle Grocery Sales Database via the Kaggle API.
  - This dataset consists of sales transactions, customer demographics, product details, employee records, and geographical information across multiple cities.
  - The Grocery Sales Dataset, sourced from Kaggle, comprises seven interconnected tables covering transactional, customer, product, and geographic data.
  - The dataset includes detailed records of product categories, customers' personal and location information, product attributes, employee details, and a comprehensive sales log.
  - The data spans a four-month period from January 1, 2018 to May 10, 2018, with 6,758,125 rows of transaction data.
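The extraction step could look roughly like the sketch below. The dataset slug and destination path are placeholders (the real ones are not stated here), and running it requires the `kaggle` package plus API credentials.

```python
def raw_file_names(tables):
    """Expected CSV file names for a list of table names."""
    return [f"{t}.csv" for t in tables]

def download_grocery_dataset(dest="/tmp/raw"):
    # Sketch only: "<owner>/<dataset-slug>" is a placeholder, not the
    # actual Kaggle identifier; authentication reads ~/.kaggle/kaggle.json.
    from kaggle.api.kaggle_api_extended import KaggleApi
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files("<owner>/<dataset-slug>", path=dest, unzip=True)
```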
- **Data Loading**: In the second stage, Airflow tasks store the raw data in Google Cloud Storage for long-term storage, then load those files into BigQuery for further transformation and processing.
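A minimal sketch of the GCS-to-BigQuery load, assuming CSV inputs with header rows; the bucket, blob, and table names are placeholders, and running it requires `google-cloud-bigquery` plus GCP credentials.

```python
def gcs_uri(bucket, blob_name):
    """Build the gs:// URI that BigQuery reads from."""
    return f"gs://{bucket}/{blob_name}"

def load_csv_to_bigquery(bucket, blob_name, table_id):
    # Sketch only: names are placeholders, not the project's actual ones.
    from google.cloud import bigquery
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.load_table_from_uri(
        gcs_uri(bucket, blob_name), table_id, job_config=job_config
    )
    job.result()  # block until the load job finishes
```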
- **Data Processing & Transformation**: In this stage, Airflow automatically runs DBT for data transformation and PySpark for customer segmentation and product recommendations.
  - In the DBT models, seed data is first materialized as tables, which are then joined separately with the raw data to create the customer, employee, and product dimension models.
  - After building the dimension tables, DBT builds a fact table called `fct_sales`, partitioned by `SalesDay` and clustered by `CustomerID`. This design is chosen because downstream models frequently group by `CustomerID`, and the dashboard applies a date-range filter to select the target data.
  - Spark is used to segment customers based on the RFM model and to generate recommendations for each customer with the ALS model. In this stage, two fact tables are created.
  - In the gold layer, four models are created to build the KPI dashboard: `mart_sales_summary`, `mart_customer_behavior`, `mart_employee_performance`, and `mart_recommend`.
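To make the RFM idea concrete, here is a toy pure-Python sketch of recency/frequency/monetary scoring (the actual project does this in Spark; the thresholds and flag names below are illustrative, not the project's):

```python
from datetime import date

def rfm_scores(orders, today, r_days=30, f_min=3, m_min=100.0):
    """Toy RFM flags per customer.

    `orders` is an iterable of (customer_id, order_date, amount) tuples;
    the thresholds are illustrative assumptions, not the project's values.
    """
    stats = {}
    for cid, d, amount in orders:
        last, freq, money = stats.get(cid, (date.min, 0, 0.0))
        stats[cid] = (max(last, d), freq + 1, money + amount)
    return {
        cid: {
            "recent": (today - last).days <= r_days,      # R: bought lately
            "frequent": freq >= f_min,                    # F: buys often
            "big_spender": money >= m_min,                # M: spends a lot
        }
        for cid, (last, freq, money) in stats.items()
    }

orders = [
    (1, date(2018, 5, 1), 60.0),
    (1, date(2018, 5, 8), 50.0),
    (1, date(2018, 4, 20), 10.0),
    (2, date(2018, 1, 5), 20.0),
]
scores = rfm_scores(orders, today=date(2018, 5, 10))
# customer 1 hits all three flags; customer 2 hits none
```

A real implementation would bin each dimension into quantile scores rather than boolean flags, and the ALS recommendations would come from Spark MLlib.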
- **CI/CD**: To achieve continuous deployment without manual intervention, GitHub Actions is used to detect commits that change the DAGs or cloud resources. When commits are pushed to the main branch, the Compute Engine instances running on the cloud automatically pick up the latest version of the pipeline.
The diagram below shows an overview of the data pipeline architecture used in this project.
If you would like to learn more about this project, please follow the instructions in reproduce.md.
You can access this dashboard from here.
Feel free to reach out!
📧 Email: r08521524@ntu.edu.tw
🔗 LinkedIn: HAOYU YANG
- A final project for Data Engineering Zoomcamp by DataTalks.Club
- Dataset from Kaggle Grocery Sales Database

