- Problem Statement
- Solution
- Project Details
- Running the Project
- Dashboard
- Contact
- Acknowledgments
This project demonstrates deploying an end-to-end data pipeline on the cloud that transforms raw grocery sales data into actionable business insights. Using GCP, Spark, Airflow, DBT, and Looker Studio, the project automates data ingestion, transformation, and visualization to answer critical business questions such as:
- Who are our top-performing salespeople and customers?
- Which regions generate the most revenue?
- How do product categories contribute to revenue?
- What are our hot selling products?
By focusing on these metrics, the project emulates the day-to-day work of a data engineer supporting data analysts and decision-makers in sales and marketing.
Transaction records come in every day, and retailers must make data-driven decisions to effectively manage their business. Without tools that automatically process this raw data, these records would be nothing more than meaningless numbers. Retailers could struggle with fragmented data and a lack of automated processing. Therefore, a scalable, scheduled data processing pipeline is essential.
Once the records are cleaned and organized, retailers can extract key business insights such as:
- Customer Behavior: Who are our repeat customers? What is their spending pattern?
- Sales Team Performance: Which employees generate the most revenue?
- Geographic Analysis: Which cities are leading in sales?
To address these problems, this project implements a scalable ELT pipeline that:
- Ingests and processes over 6.7 million transaction records automatically using Airflow
- Generates product recommendations for customers using Spark
- Uses DBT to model and transform data into fact and dimension tables
- Visualizes data on Looker Studio
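The four steps above can be sketched as a single Airflow DAG. This is a minimal illustration only: the DAG id, task names, schedule, and `echo` commands below are placeholders, not the project's actual code.

```python
# Hypothetical step names mirroring the pipeline stages described above.
PIPELINE_STEPS = [
    "extract_from_kaggle",   # download raw CSVs via the Kaggle API
    "load_to_gcs",           # stage raw files in Cloud Storage
    "load_to_bigquery",      # load staged files into BigQuery
    "spark_jobs",            # customer segmentation + recommendations
    "dbt_run",               # build dimension, fact, and mart models
]

def build_dag():
    # Requires apache-airflow; sketch only.
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    import pendulum

    with DAG(
        dag_id="grocery_sales_elt",
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        schedule="@daily",
        catchup=False,
    ) as dag:
        prev = None
        for step in PIPELINE_STEPS:
            task = BashOperator(task_id=step, bash_command=f"echo {step}")
            if prev is not None:
                prev >> task  # chain the stages sequentially
            prev = task
    return dag
```

In practice each placeholder `BashOperator` would be replaced by the real Kaggle, GCS, BigQuery, Livy, and DBT tasks.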
At the end of this ELT pipeline, a dashboard is built to provide business insights such as:
- Top 10 customers by order value
- Top 10 employees by sales
- Top 10 cities by revenue
- Hot selling products
- Revenue trends over time
- Revenue by product category
- Customer segmentation by average order value
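Each of these KPIs boils down to an aggregation over the sales fact table. As an illustration only (not the project's actual DBT SQL, and with a hypothetical row format), the "top customers by order value" metric could be computed like this:

```python
from collections import defaultdict

def top_customers_by_order_value(sales, n=10):
    """Rank customers by total order value.

    `sales` is assumed to be an iterable of dicts with hypothetical
    'CustomerID' and 'TotalPrice' keys.
    """
    totals = defaultdict(float)
    for row in sales:
        totals[row["CustomerID"]] += row["TotalPrice"]
    # Sort descending by total spend and keep the top n customers.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

sales = [
    {"CustomerID": 1, "TotalPrice": 120.0},
    {"CustomerID": 2, "TotalPrice": 75.5},
    {"CustomerID": 1, "TotalPrice": 30.0},
]
print(top_customers_by_order_value(sales, n=2))  # [(1, 150.0), (2, 75.5)]
```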
The data pipeline automates the data processing workflow using the following tech stack:
- **Apache Airflow** for data pipeline orchestration
- **Google Cloud Platform (GCP)** services for the data lake, data warehouse, and virtual machines
- **Apache Spark** for data processing
- **Apache Livy** for triggering Spark jobs via REST API
- **DBT** for data modeling and transformation
- **GitHub Actions** for CI/CD
- **Terraform** for cloud resource provisioning
- **Looker Studio** for building the dashboard
This pipeline consists of the following data processing stages, and all tasks are built and run on the cloud:
- **Data Extraction**: An Airflow task called `get` downloads the data files from the Kaggle Grocery Sales Database via the Kaggle API.
  - This dataset consists of sales transactions, customer demographics, product details, employee records, and geographical information across multiple cities.
  - The Grocery Sales Dataset, sourced from Kaggle, comprises seven interconnected tables covering transactional, customer, product, and geographic data.
  - The dataset includes detailed records of product categories, customers' personal and location information, product attributes, employee details, and a comprehensive sales log.
  - The data spans a four-month period from January 1, 2018 to May 10, 2018, with 6,758,125 rows of transaction data.
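The extraction step could look roughly like the sketch below. The dataset slug and destination path are placeholders (the real ones are not stated here), and running it requires the `kaggle` package plus API credentials.

```python
def raw_file_names(tables):
    """Expected CSV file names for a list of table names."""
    return [f"{t}.csv" for t in tables]

def download_grocery_dataset(dest="/tmp/raw"):
    # Sketch only: "<owner>/<dataset-slug>" is a placeholder, not the
    # actual Kaggle identifier; authentication reads ~/.kaggle/kaggle.json.
    from kaggle.api.kaggle_api_extended import KaggleApi
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files("<owner>/<dataset-slug>", path=dest, unzip=True)
```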
- **Data Loading**: In the second stage, Airflow tasks store the raw data in Google Cloud Storage for long-term storage, then load those files into BigQuery for further transformation and processing.
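A minimal sketch of the GCS-to-BigQuery load, assuming CSV inputs with header rows; the bucket, blob, and table names are placeholders, and running it requires `google-cloud-bigquery` plus GCP credentials.

```python
def gcs_uri(bucket, blob_name):
    """Build the gs:// URI that BigQuery reads from."""
    return f"gs://{bucket}/{blob_name}"

def load_csv_to_bigquery(bucket, blob_name, table_id):
    # Sketch only: names are placeholders, not the project's actual ones.
    from google.cloud import bigquery
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # infer the schema from the file
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.load_table_from_uri(
        gcs_uri(bucket, blob_name), table_id, job_config=job_config
    )
    job.result()  # block until the load job finishes
```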
- **Data Processing & Transformation**: In this stage, Airflow automatically runs DBT for data transformation and PySpark for customer segmentation and product recommendations.
  - In the DBT models, seed data is first materialized as tables, which are then joined separately with the raw data to create the customer, employee, and product dimension models.
  - After building the dimension tables, DBT builds a fact table called `fct_sales`, partitioned by `SalesDay` and clustered by `CustomerID`. This design is chosen because downstream models frequently group by `CustomerID`, and the dashboard applies a date-range filter to select the target data.
  - Spark is used to segment customers based on the RFM model and to generate recommendations for each customer with the ALS model. In this stage, two fact tables are created.
  - In the gold layer, four models are created to build the KPI dashboard: `mart_sales_summary`, `mart_customer_behavior`, `mart_employee_performance`, and `mart_recommend`.
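To make the RFM idea concrete, here is a toy pure-Python sketch of recency/frequency/monetary scoring (the actual project does this in Spark; the thresholds and flag names below are illustrative, not the project's):

```python
from datetime import date

def rfm_scores(orders, today, r_days=30, f_min=3, m_min=100.0):
    """Toy RFM flags per customer.

    `orders` is an iterable of (customer_id, order_date, amount) tuples;
    the thresholds are illustrative assumptions, not the project's values.
    """
    stats = {}
    for cid, d, amount in orders:
        last, freq, money = stats.get(cid, (date.min, 0, 0.0))
        stats[cid] = (max(last, d), freq + 1, money + amount)
    return {
        cid: {
            "recent": (today - last).days <= r_days,      # R: bought lately
            "frequent": freq >= f_min,                    # F: buys often
            "big_spender": money >= m_min,                # M: spends a lot
        }
        for cid, (last, freq, money) in stats.items()
    }

orders = [
    (1, date(2018, 5, 1), 60.0),
    (1, date(2018, 5, 8), 50.0),
    (1, date(2018, 4, 20), 10.0),
    (2, date(2018, 1, 5), 20.0),
]
scores = rfm_scores(orders, today=date(2018, 5, 10))
# customer 1 hits all three flags; customer 2 hits none
```

A real implementation would bin each dimension into quantile scores rather than boolean flags, and the ALS recommendations would come from Spark MLlib.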
- **CI/CD**: To achieve continuous deployment without manual intervention, GitHub Actions is used to detect commits that change the DAGs or cloud resources. When commits are pushed to the main branch, the Compute Engine instances running on the cloud automatically pick up the latest version of the pipeline.
The diagram below shows an overview of the data pipeline architecture used in this project.
If you would like to learn more about this project, please follow the instructions in reproduce.md.
You can access this dashboard from here.
Feel free to reach out!
📧 Email: r08521524@ntu.edu.tw
🔗 LinkedIn: HAOYU YANG
- A final project for Data Engineering Zoomcamp by DataTalks.Club
- Dataset from Kaggle Grocery Sales Database

