From CSV to Dashboard — Building a Mini Data Pipeline in Pure Python
This workshop shows how to build a complete data engineering workflow using open-source tools:
pandas • pyarrow • uv • streamlit.
You’ll ingest, clean, model, and visualize the classic Northwind dataset — all locally, with no cloud or database setup.
- A hands-on sandbox for learning how data engineers think and structure pipelines.
- A mini data warehouse built entirely with open-source tools.
- A practical example of the modern data-engineering mindset:
  Extract → Load → Transform → Build → Visualize.

Designed for:
- Beginners exploring data pipelines and analytics engineering.
- Educators or mentors leading workshops and bootcamps.
- Teams wanting a lightweight demo of ETL + Data Quality in Python.
- ❌ Not a production-grade data platform or big-data tool.
- ❌ Not a replacement for Airflow, dbt, or Spark.
- ❌ Not built for parallel or distributed execution.
- ❌ Not a full data-engineering curriculum — this is the first step.
A teachable microcosm of modern data engineering — small enough for your laptop, structured enough to mirror real-world pipelines.
py-dataengineering-workshop/
├─ data/
│ ├─ 00-raw/ # raw source CSVs (auto-downloaded)
│ ├─ 01-clean/ # cleaned + validated Parquet
│ │ └─ _dq/ # data quality logs
│ ├─ 02-model/ # modeled & aggregated tables
│ └─ 03-sandbox/ # scratch area
├─ etl/
│ ├─ extract.py # download + read CSVs
│ ├─ load.py # clean + validate
│ ├─ transform.py # create fact/dim tables
│ ├─ build.py # aggregate to gold layer
│ ├─ dq.py # data-quality rules + logs
│ ├─ paths.py # central folder definitions
│ └─ run.py # orchestrator (CLI)
├─ view_data.py # inspect Parquet data files
└─ app.py # Streamlit dashboard
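The folders above are referenced from a single place, `etl/paths.py`. A minimal sketch of what such a module might contain, assuming the variable names (only the folder names come from the tree above):

```python
# etl/paths.py -- central folder definitions (sketch; variable names are assumptions)
from pathlib import Path

ROOT = Path(__file__).resolve().parents[1]   # repository root
DATA = ROOT / "data"

RAW = DATA / "00-raw"          # raw source CSVs (auto-downloaded)
CLEAN = DATA / "01-clean"      # cleaned + validated Parquet
DQ = CLEAN / "_dq"             # data-quality logs
MODEL = DATA / "02-model"      # modeled & aggregated tables
SANDBOX = DATA / "03-sandbox"  # scratch area

def ensure_dirs() -> None:
    """Create every layer folder so stages can write without checking first."""
    for folder in (RAW, CLEAN, DQ, MODEL, SANDBOX):
        folder.mkdir(parents=True, exist_ok=True)
```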
git clone https://github.com/ogbinar/py-dataengineering-workshop.git
cd py-dataengineering-workshop
uv venv && uv sync
# Run the full pipeline
uv run python -m etl.run
# Or run by stage
uv run python -m etl.run --stage extract
uv run python -m etl.run --stage load
uv run python -m etl.run --stage transform
uv run python -m etl.run --stage build
# Launch the dashboard
uv run streamlit run app.py

On first run, `extract.py` downloads the Northwind CSVs into `data/00-raw/`.
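The `--stage` flag suggests that `etl/run.py` is a thin dispatcher over the four stage modules. A hedged sketch of such an orchestrator, assuming each stage module exposes a `run()` entry point (the real internals may differ):

```python
# etl/run.py -- minimal orchestrator sketch (stage names from the CLI above; run() entry points assumed)
import argparse

from etl import extract, load, transform, build

STAGES = {
    "extract": extract.run,
    "load": load.run,
    "transform": transform.run,
    "build": build.run,
}

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the workshop ETL pipeline")
    parser.add_argument("--stage", choices=list(STAGES), help="run a single stage; omit to run all")
    args = parser.parse_args()

    for name in [args.stage] if args.stage else list(STAGES):
        print(f"Running stage: {name}")
        STAGES[name]()

if __name__ == "__main__":
    main()
```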
You may also run the code inside a dev container using Docker and VS Code. This can simplify setup by providing a consistent environment. Quick steps:
- Install the "Dev Containers" extension in VS Code.

- Open this project folder in VS Code (File → Open Folder...).
- Open the command palette (F1 or Ctrl+Shift+P or Cmd+Shift+P) and choose: "Dev Containers: Reopen in Container".

- The container should start building. Wait for it to finish.

Notes:
- On the first run VS Code will build the container image using devcontainer.json. This may take several minutes depending on your network and machine.
- After the container starts you can run the pipeline and the Streamlit app from the container terminal (the same commands shown in the Quickstart work inside the container).
- If you don't have Docker or prefer a local virtualenv, the earlier Quickstart steps (using uv) will work without the dev container.
| Stage | Script | Purpose |
|---|---|---|
| Extract | `etl/extract.py` | Download + load raw CSVs |
| Load | `etl/load.py` | Clean, standardize, and log data-quality issues |
| Transform | `etl/transform.py` | Create fact and dimension tables |
| Build | `etl/build.py` | Aggregate gold-layer outputs (customer, country, product) |
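To make the Transform stage concrete, here is a hedged sketch of how a fact and a dimension table could be derived with pandas. The input file names and Northwind column names are assumptions; the actual `etl/transform.py` may differ:

```python
# Illustrative star-schema transform (file and column names assumed from the public Northwind data)
import pandas as pd

orders = pd.read_parquet("data/01-clean/orders.parquet")
details = pd.read_parquet("data/01-clean/order_details.parquet")
customers = pd.read_parquet("data/01-clean/customers.parquet")

# Dimension: one row per customer
dim_customer = customers[["customerID", "companyName", "country"]].drop_duplicates()

# Fact: one row per order line, with a computed revenue measure
fact_sales = (
    details.merge(orders[["orderID", "customerID", "orderDate"]], on="orderID")
    .assign(revenue=lambda d: d["unitPrice"] * d["quantity"] * (1 - d["discount"]))
)

dim_customer.to_parquet("data/02-model/dim_customer.parquet", index=False)
fact_sales.to_parquet("data/02-model/fact_sales.parquet", index=False)
```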
Data-quality results are saved under:
data/01-clean/_dq/
├─ dq_runs.parquet
└─ dq_issues.parquet
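The two files suggest an append-only pattern: one summary row per check run in `dq_runs.parquet`, plus one row per offending record in `dq_issues.parquet`. A minimal sketch of a not-null check in that style (the log schema and helper are assumptions, not the repo's actual `dq.py`):

```python
# Illustrative data-quality check that appends to the _dq Parquet logs (log schema assumed)
from datetime import datetime, timezone
import pandas as pd

DQ_DIR = "data/01-clean/_dq"

def check_not_null(df: pd.DataFrame, table: str, column: str) -> None:
    bad = df[df[column].isna()]
    run = pd.DataFrame([{
        "run_at": datetime.now(timezone.utc),
        "table": table,
        "rule": f"not_null:{column}",
        "rows_checked": len(df),
        "rows_failed": len(bad),
    }])
    _append(run, f"{DQ_DIR}/dq_runs.parquet")
    if not bad.empty:
        issues = pd.DataFrame({"table": table, "rule": f"not_null:{column}", "row_index": bad.index})
        _append(issues, f"{DQ_DIR}/dq_issues.parquet")

def _append(df: pd.DataFrame, path: str) -> None:
    # Append by read-concat-write; fine for workshop-scale data
    try:
        df = pd.concat([pd.read_parquet(path), df], ignore_index=True)
    except FileNotFoundError:
        pass
    df.to_parquet(path, index=False)
```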
You can quickly inspect any Parquet file generated by the ETL pipeline using pandas + pyarrow.
uv run python view_data.py

This utility prints:

- 📂 The file being read
- 👀 A preview of the first 5 rows (`head()`)
- 🧾 Schema and data types (`info()`)
- 📊 Summary statistics (`describe()`)

To view another file, edit the target path inside `view_data.py`
(e.g., switch from `sales_by_customer.parquet` to `fact_sales.parquet`).
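Under the hood, nothing more than pandas (with pyarrow as the Parquet engine) is needed for that kind of inspection. A sketch of what `view_data.py` plausibly looks like, with the target path assumed:

```python
# view_data.py -- quick Parquet inspection sketch (target path is an assumption)
import pandas as pd

TARGET = "data/02-model/sales_by_customer.parquet"  # edit this to inspect another file

df = pd.read_parquet(TARGET)

print(f"📂 Reading: {TARGET}")
print("\n👀 First 5 rows:")
print(df.head())
print("\n🧾 Schema and data types:")
df.info()
print("\n📊 Summary statistics:")
print(df.describe())
```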
uv run streamlit run app.py

Tabs:
- 📊 Sales (Customers) – Top customers & products
- 🌍 Sales by Country – Regional aggregates
- 🧪 Data Quality – Run summaries & issue details
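The tab layout maps directly onto `st.tabs`. A hedged sketch of how `app.py` could wire the gold-layer Parquet files into those three tabs (file and column names are assumptions):

```python
# Illustrative Streamlit layout mirroring the three tabs (file and column names assumed)
import pandas as pd
import streamlit as st

st.title("Northwind Mini Warehouse")

tab_customers, tab_country, tab_dq = st.tabs(
    ["📊 Sales (Customers)", "🌍 Sales by Country", "🧪 Data Quality"]
)

with tab_customers:
    sales = pd.read_parquet("data/02-model/sales_by_customer.parquet")
    st.bar_chart(sales.set_index("companyName")["revenue"])

with tab_country:
    st.dataframe(pd.read_parquet("data/02-model/sales_by_country.parquet"))

with tab_dq:
    st.dataframe(pd.read_parquet("data/01-clean/_dq/dq_runs.parquet"))
```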
- Study the Northwind dataset.
- Modify `extract.py` to ingest all available Northwind CSVs (Employees, Shippers, Suppliers, Categories, etc.).
- Add new data-quality checks in `dq.py` (e.g., missing employee names, invalid postal codes).
- Extend `transform.py` to include `dim_product`, `dim_supplier`, and `fact_orders`.
- Enhance `build.py` with new aggregates (e.g., sales by category, supplier, year).
- Visualize additional metrics in Streamlit (e.g., monthly trends, top-selling categories).
- Use another dataset (like Chinook) and run it through the pipeline end to end.
- Grow beyond Streamlit and try another web framework like Air.
This app reimagines the original Streamlit Northwind Dashboard using Air, a FastAPI-based web framework. Air expands on Streamlit by enabling fully web-native dashboards—mixing HTML, APIs, and charts in one lightweight Python app.
Run locally with:
uv run uvicorn air_app:app --reload

| Stage | What it does | Recommended tools & patterns |
|---|---|---|
| Extract | Pull data from CSV/API/DB/stream | dlt (Python) → sources (CSV, REST, DB). For streams: Kafka (later). |
| Load | Land raw/staging into warehouse | dlt → Postgres (Dockerized). For dev-only: optionally DuckDB parquet. |
| Transform | Create clean staging + marts | dbt (SQL or Python models) over Postgres; incremental models for scale. |
| Store | Persist analytical outputs | Postgres (marts schemas), optional Parquet in /02-model for ad hoc. |
| Serve | BI / apps / ad-hoc queries | Streamlit app; Metabase (Docker) for dashboards; programmatic Ibis/DuckDB. |
| Orchestrate | Schedule & chain runs | Prefect (flows locally or Cloud). For simple prod cron: crontab + shell. |
| Validate / DQ | Schemas, ranges, FKs | Pandera (Python), selective Great Expectations; dbt tests (not_null, unique). |
| Observe | Runs, metrics, alerts | dlt run metrics + Prefect run states; ship DQ summaries to Grafana/Metabase/Slack. |
| Version & Lineage | Repro, history, docs | git for code, optional dvc for large artifacts; dbt docs + lineage. |
| Semantic Layer | Uniform query interface | Ibis to query Postgres/Parquet with one API; (optional) dbt metrics. |
| ML / Features | Gold → features | Notebook lab or feature store later; start with pandas/Polars over marts. |
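As one concrete example from the Validate / DQ row, Pandera expresses the same kind of rules as `dq.py` declaratively. A minimal sketch, with column names assumed from the Northwind fact table:

```python
# Pandera schema sketch for the sales fact table (column names assumed)
import pandas as pd
import pandera as pa

fact_sales_schema = pa.DataFrameSchema({
    "orderID": pa.Column(int, nullable=False),
    "customerID": pa.Column(str, nullable=False),
    "quantity": pa.Column(int, pa.Check.gt(0)),
    "unitPrice": pa.Column(float, pa.Check.ge(0)),
    "discount": pa.Column(float, pa.Check.in_range(0, 1)),
})

df = pd.read_parquet("data/02-model/fact_sales.parquet")
validated = fact_sales_schema.validate(df)  # raises a SchemaError on violations
```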
Explore these tools to build a deeper appreciation for how Data Engineering scales—experiment with various data sources and implement your own end-to-end projects.
MIT © 2025 Myk Ogbinar / Data Engineering Pilipinas
- Neo4j Northwind Dataset — Sample dataset used for this project
- pandas • pyarrow — Core data processing and Parquet handling
- streamlit — Interactive data app framework
- uv — Fast Python environment and dependency manager
- dltHub • dbt • DuckDB — Modern ELT and analytical data stack components
- Ibis • ClickHouse — Unified analytical querying and columnar storage engines
- DurianPy • PyCon Davao 2025 — Community and conference initiatives supporting open-source learning
- Data Engineering Pilipinas — Open-source community promoting data literacy and collaboration in the Philippines