From CSV to Dashboard — Building a Mini Data Pipeline in Pure Python
This workshop shows how to build a complete data engineering workflow using open-source tools:
pandas • pyarrow • uv • streamlit.
You’ll ingest, clean, model, and visualize the classic Northwind dataset — all locally, with no cloud or database setup.
- A hands-on sandbox for learning how data engineers think and structure pipelines.
- A mini data warehouse built entirely with open-source tools.
- A practical example of the modern data-engineering mindset:
  Extract → Load → Transform → Build → Visualize.

Designed for:
- Beginners exploring data pipelines and analytics engineering.
- Educators or mentors leading workshops and bootcamps.
- Teams wanting a lightweight demo of ETL + Data Quality in Python.
- ❌ Not a production-grade data platform or big-data tool.
- ❌ Not a replacement for Airflow, dbt, or Spark.
- ❌ Not built for parallel or distributed execution.
- ❌ Not a full data-engineering curriculum — this is the first step.
A teachable microcosm of modern data engineering — small enough for your laptop, structured enough to mirror real-world pipelines.
py-dataengineering-workshop/
├─ data/
│ ├─ 00-raw/ # raw source CSVs (auto-downloaded)
│ ├─ 01-clean/ # cleaned + validated Parquet
│ │ └─ _dq/ # data quality logs
│ ├─ 02-model/ # modeled & aggregated tables
│ └─ 03-sandbox/ # scratch area
├─ etl/
│ ├─ extract.py # download + read CSVs
│ ├─ load.py # clean + validate
│ ├─ transform.py # create fact/dim tables
│ ├─ build.py # aggregate to gold layer
│ ├─ dq.py # data-quality rules + logs
│ ├─ paths.py # central folder definitions
│ └─ run.py # orchestrator (CLI)
├─ view_data.py # inspect Parquet data files
└─ app.py # Streamlit dashboard
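The folders above are referenced from a single place, `etl/paths.py`. A minimal sketch of what such a module might contain, assuming the variable names (only the folder names come from the tree above):

```python
# etl/paths.py -- central folder definitions (sketch; variable names are assumptions)
from pathlib import Path

ROOT = Path(__file__).resolve().parents[1]   # repository root
DATA = ROOT / "data"

RAW = DATA / "00-raw"          # raw source CSVs (auto-downloaded)
CLEAN = DATA / "01-clean"      # cleaned + validated Parquet
DQ = CLEAN / "_dq"             # data-quality logs
MODEL = DATA / "02-model"      # modeled & aggregated tables
SANDBOX = DATA / "03-sandbox"  # scratch area

def ensure_dirs() -> None:
    """Create every layer folder so stages can write without checking first."""
    for folder in (RAW, CLEAN, DQ, MODEL, SANDBOX):
        folder.mkdir(parents=True, exist_ok=True)
```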
git clone https://github.com/ogbinar/py-dataengineering-workshop.git
cd py-dataengineering-workshop
uv venv && uv sync
# Run the full pipeline
uv run python -m etl.run
# Or run by stage
uv run python -m etl.run --stage extract
uv run python -m etl.run --stage load
uv run python -m etl.run --stage transform
uv run python -m etl.run --stage build
# Launch the dashboard
uv run streamlit run app.py

On first run, `extract.py` downloads the Northwind CSVs into `data/00-raw/`.
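The `--stage` flag suggests that `etl/run.py` is a thin dispatcher over the four stage modules. A hedged sketch of such an orchestrator, assuming each stage module exposes a `run()` entry point (the real internals may differ):

```python
# etl/run.py -- minimal orchestrator sketch (stage names from the CLI above; run() entry points assumed)
import argparse

from etl import extract, load, transform, build

STAGES = {
    "extract": extract.run,
    "load": load.run,
    "transform": transform.run,
    "build": build.run,
}

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the workshop ETL pipeline")
    parser.add_argument("--stage", choices=list(STAGES), help="run a single stage; omit to run all")
    args = parser.parse_args()

    for name in [args.stage] if args.stage else list(STAGES):
        print(f"Running stage: {name}")
        STAGES[name]()

if __name__ == "__main__":
    main()
```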
You may also run the code inside a dev container using Docker and VS Code. This can simplify setup by providing a consistent environment. Quick steps:
- Install the "Dev Containers" extension in VS Code.

- Open this project folder in VS Code (File → Open Folder...).
- Open the command palette (F1 or Ctrl+Shift+P or Cmd+Shift+P) and choose: "Dev Containers: Reopen in Container".

- The container should start building. Wait for it to finish.

Notes:
- On the first run VS Code will build the container image using devcontainer.json. This may take several minutes depending on your network and machine.
- After the container starts you can run the pipeline and the Streamlit app from the container terminal (the same commands shown in the Quickstart work inside the container).
- If you don't have Docker or prefer a local virtualenv, the earlier Quickstart steps (using uv) will work without the dev container.
| Stage | Script | Purpose |
|---|---|---|
| Extract | `etl/extract.py` | Download + load raw CSVs |
| Load | `etl/load.py` | Clean, standardize, and log data-quality issues |
| Transform | `etl/transform.py` | Create fact and dimension tables |
| Build | `etl/build.py` | Aggregate gold-layer outputs (customer, country, product) |
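To make the Transform stage concrete, here is a hedged sketch of how a fact and a dimension table could be derived with pandas. The input file names and Northwind column names are assumptions; the actual `etl/transform.py` may differ:

```python
# Illustrative star-schema transform (file and column names assumed from the public Northwind data)
import pandas as pd

orders = pd.read_parquet("data/01-clean/orders.parquet")
details = pd.read_parquet("data/01-clean/order_details.parquet")
customers = pd.read_parquet("data/01-clean/customers.parquet")

# Dimension: one row per customer
dim_customer = customers[["customerID", "companyName", "country"]].drop_duplicates()

# Fact: one row per order line, with a computed revenue measure
fact_sales = (
    details.merge(orders[["orderID", "customerID", "orderDate"]], on="orderID")
    .assign(revenue=lambda d: d["unitPrice"] * d["quantity"] * (1 - d["discount"]))
)

dim_customer.to_parquet("data/02-model/dim_customer.parquet", index=False)
fact_sales.to_parquet("data/02-model/fact_sales.parquet", index=False)
```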
Data-quality results are saved under:
data/01-clean/_dq/
├─ dq_runs.parquet
└─ dq_issues.parquet
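The two files suggest an append-only pattern: one summary row per check run in `dq_runs.parquet`, plus one row per offending record in `dq_issues.parquet`. A minimal sketch of a not-null check in that style (the log schema and helper are assumptions, not the repo's actual `dq.py`):

```python
# Illustrative data-quality check that appends to the _dq Parquet logs (log schema assumed)
from datetime import datetime, timezone
import pandas as pd

DQ_DIR = "data/01-clean/_dq"

def check_not_null(df: pd.DataFrame, table: str, column: str) -> None:
    bad = df[df[column].isna()]
    run = pd.DataFrame([{
        "run_at": datetime.now(timezone.utc),
        "table": table,
        "rule": f"not_null:{column}",
        "rows_checked": len(df),
        "rows_failed": len(bad),
    }])
    _append(run, f"{DQ_DIR}/dq_runs.parquet")
    if not bad.empty:
        issues = pd.DataFrame({"table": table, "rule": f"not_null:{column}", "row_index": bad.index})
        _append(issues, f"{DQ_DIR}/dq_issues.parquet")

def _append(df: pd.DataFrame, path: str) -> None:
    # Append by read-concat-write; fine for workshop-scale data
    try:
        df = pd.concat([pd.read_parquet(path), df], ignore_index=True)
    except FileNotFoundError:
        pass
    df.to_parquet(path, index=False)
```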
You can quickly inspect any Parquet file generated by the ETL pipeline using pandas + pyarrow.
uv run python view_data.py

This utility prints:

- 📂 The file being read
- 👀 A preview of the first 5 rows (`head()`)
- 🧾 Schema and data types (`info()`)
- 📊 Summary statistics (`describe()`)

To view another file, edit the target path inside `view_data.py`
(e.g., switch from `sales_by_customer.parquet` to `fact_sales.parquet`).
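Under the hood, nothing more than pandas (with pyarrow as the Parquet engine) is needed for that kind of inspection. A sketch of what `view_data.py` plausibly looks like, with the target path assumed:

```python
# view_data.py -- quick Parquet inspection sketch (target path is an assumption)
import pandas as pd

TARGET = "data/02-model/sales_by_customer.parquet"  # edit this to inspect another file

df = pd.read_parquet(TARGET)

print(f"📂 Reading: {TARGET}")
print("\n👀 First 5 rows:")
print(df.head())
print("\n🧾 Schema and data types:")
df.info()
print("\n📊 Summary statistics:")
print(df.describe())
```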
uv run streamlit run app.py

Tabs:
- 📊 Sales (Customers) – Top customers & products
- 🌍 Sales by Country – Regional aggregates
- 🧪 Data Quality – Run summaries & issue details
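The tab layout maps directly onto `st.tabs`. A hedged sketch of how `app.py` could wire the gold-layer Parquet files into those three tabs (file and column names are assumptions):

```python
# Illustrative Streamlit layout mirroring the three tabs (file and column names assumed)
import pandas as pd
import streamlit as st

st.title("Northwind Mini Warehouse")

tab_customers, tab_country, tab_dq = st.tabs(
    ["📊 Sales (Customers)", "🌍 Sales by Country", "🧪 Data Quality"]
)

with tab_customers:
    sales = pd.read_parquet("data/02-model/sales_by_customer.parquet")
    st.bar_chart(sales.set_index("companyName")["revenue"])

with tab_country:
    st.dataframe(pd.read_parquet("data/02-model/sales_by_country.parquet"))

with tab_dq:
    st.dataframe(pd.read_parquet("data/01-clean/_dq/dq_runs.parquet"))
```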
- Study the Northwind dataset.
- Modify `extract.py` to ingest all available Northwind CSVs (Employees, Shippers, Suppliers, Categories, etc.).
- Add new data-quality checks in `dq.py` (e.g., missing employee names, invalid postal codes).
- Extend `transform.py` to include `dim_product`, `dim_supplier`, and `fact_orders`.
- Enhance `build.py` with new aggregates (e.g., sales by category, supplier, year).
- Visualize additional metrics in Streamlit (e.g., monthly trends, top-selling categories).
- Use another dataset (like Chinook) and run it through the pipeline end to end.
- Grow beyond Streamlit and try another web framework like Air.
This app reimagines the original Streamlit Northwind Dashboard using Air, a FastAPI-based web framework. Air expands on Streamlit by enabling fully web-native dashboards—mixing HTML, APIs, and charts in one lightweight Python app.
Run locally with:
uv run uvicorn air_app:app --reload

| Stage | What it does | Recommended tools & patterns |
|---|---|---|
| Extract | Pull data from CSV/API/DB/stream | dlt (Python) → sources (CSV, REST, DB). For streams: Kafka (later). |
| Load | Land raw/staging into warehouse | dlt → Postgres (Dockerized). For dev-only: optionally DuckDB parquet. |
| Transform | Create clean staging + marts | dbt (SQL or Python models) over Postgres; incremental models for scale. |
| Store | Persist analytical outputs | Postgres (marts schemas), optional Parquet in /02-model for ad hoc. |
| Serve | BI / apps / ad-hoc queries | Streamlit app; Metabase (Docker) for dashboards; programmatic Ibis/DuckDB. |
| Orchestrate | Schedule & chain runs | Prefect (flows locally or Cloud). For simple prod cron: crontab + shell. |
| Validate / DQ | Schemas, ranges, FKs | Pandera (Python), selective Great Expectations; dbt tests (not_null, unique). |
| Observe | Runs, metrics, alerts | dlt run metrics + Prefect run states; ship DQ summaries to Grafana/Metabase/Slack. |
| Version & Lineage | Repro, history, docs | git for code, optional dvc for large artifacts; dbt docs + lineage. |
| Semantic Layer | Uniform query interface | Ibis to query Postgres/Parquet with one API; (optional) dbt metrics. |
| ML / Features | Gold → features | Notebook lab or feature store later; start with pandas/Polars over marts. |
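As one concrete example from the Validate / DQ row, Pandera expresses the same kind of rules as `dq.py` declaratively. A minimal sketch, with column names assumed from the Northwind fact table:

```python
# Pandera schema sketch for the sales fact table (column names assumed)
import pandas as pd
import pandera as pa

fact_sales_schema = pa.DataFrameSchema({
    "orderID": pa.Column(int, nullable=False),
    "customerID": pa.Column(str, nullable=False),
    "quantity": pa.Column(int, pa.Check.gt(0)),
    "unitPrice": pa.Column(float, pa.Check.ge(0)),
    "discount": pa.Column(float, pa.Check.in_range(0, 1)),
})

df = pd.read_parquet("data/02-model/fact_sales.parquet")
validated = fact_sales_schema.validate(df)  # raises a SchemaError on violations
```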
Explore these tools to build a deeper appreciation for how Data Engineering scales—experiment with various data sources and implement your own end-to-end projects.
MIT © 2025 Myk Ogbinar / Data Engineering Pilipinas
- Neo4j Northwind Dataset — Sample dataset used for this project
- pandas • pyarrow — Core data processing and Parquet handling
- streamlit — Interactive data app framework
- uv — Fast Python environment and dependency manager
- dltHub • dbt • DuckDB — Modern ELT and analytical data stack components
- Ibis • ClickHouse — Unified analytical querying and columnar storage engines
- DurianPy • PyCon Davao 2025 — Community and conference initiatives supporting open-source learning
- Data Engineering Pilipinas — Open-source community promoting data literacy and collaboration in the Philippines