This repo demonstrates building a data pipeline with Bauplan and using Recce to compare pipeline changes across branches.
The demo walks through two stages of building an Instagram user engagement segmentation pipeline, each on its own branch:
| Branch | Description |
|---|---|
| `main` | Setup: data ingestion script and lineage tooling |
| `stage-1` | Naive 4-segment engagement pipeline |
| `stage-2` | Adds bot detection as a 5th segment |

Prerequisites:

- Bauplan CLI installed and configured
- Python 3.11+
- A Bauplan account with access to the shared lakehouse
The ingestion script imports Instagram engagement data from S3 into the Bauplan lakehouse using the Write-Audit-Publish (WAP) pattern:
`python ingest_instagram_data.py`

This creates the `bauplan.instagram_engagement_data` table (~1.5M rows, 58 columns) on an isolated branch, validates that it is non-empty, and leaves the branch open for inspection. Follow the printed instructions to merge to `main`.
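To make the Write-Audit-Publish flow concrete, here is a minimal sketch over a toy in-memory catalog. It does not use the Bauplan SDK: `ToyCatalog`, `write_and_audit`, the branch name `ingest-branch`, and the sample rows are all hypothetical stand-ins, and the real `ingest_instagram_data.py` may be structured differently.

```python
"""Toy Write-Audit-Publish (WAP) flow over an in-memory catalog.

Everything here is hypothetical and stands in for a real lakehouse client.
"""
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    # Maps branch name -> {table name -> list of rows}.
    branches: dict = field(default_factory=lambda: {"main": {}})

    def create_branch(self, name: str, from_ref: str = "main") -> None:
        # Copy the table map so writes on the new branch never touch from_ref.
        self.branches[name] = {
            table: list(rows) for table, rows in self.branches[from_ref].items()
        }

    def write_table(self, branch: str, table: str, rows: list) -> None:
        self.branches[branch][table] = list(rows)

    def row_count(self, branch: str, table: str) -> int:
        return len(self.branches[branch].get(table, []))

    def merge(self, source: str, into: str = "main") -> None:
        self.branches[into].update(self.branches[source])


def write_and_audit(catalog: ToyCatalog, branch: str, table: str, rows: list) -> bool:
    """Write rows to an isolated branch and report whether the audit passed."""
    catalog.create_branch(branch)                # Write: isolate the change
    catalog.write_table(branch, table, rows)
    return catalog.row_count(branch, table) > 0  # Audit: table must be non-empty


catalog = ToyCatalog()
rows = [{"user_id": 1, "likes": 10}, {"user_id": 2, "likes": 3}]
passed = write_and_audit(catalog, "ingest-branch", "instagram_engagement_data", rows)

# The branch stays open for inspection; main is untouched until an explicit merge.
main_before = catalog.row_count("main", "instagram_engagement_data")
if passed:
    catalog.merge("ingest-branch")               # Publish
main_after = catalog.row_count("main", "instagram_engagement_data")
print(passed, main_before, main_after)
```

The point of the pattern is that a failed audit leaves `main` unchanged, so consumers never read a half-loaded or empty table.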
Use the prompt below (or check out stage-1) to build the first version of the pipeline:
> Let's build a pipeline that calculates different user segments by engagement level. We want to segment users into tiers based on their engagement metrics from the `instagram_engagement_data` table. Materialize all models.
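For orientation, a naive 4-tier segmentation could look roughly like the sketch below. The score formula, the column names (`likes`, `comments`, `posts`), the tier names, and the quartile cut-offs are all assumptions made for illustration; the pipeline on `stage-1` may choose different metrics and thresholds.

```python
# Hypothetical tier floors on percentile rank, highest first.
TIERS = [(0.75, "power"), (0.50, "active"), (0.25, "casual"), (0.0, "dormant")]


def engagement_score(user: dict) -> float:
    """Interactions per post; a stand-in metric, not the dataset's definition."""
    posts = max(user["posts"], 1)  # avoid division by zero for users with no posts
    return (user["likes"] + user["comments"]) / posts


def assign_tiers(users: list[dict]) -> dict:
    """Map user_id -> tier using each user's percentile rank by score."""
    ranked = sorted(users, key=engagement_score)
    n = len(ranked)
    tiers = {}
    for position, user in enumerate(ranked):
        percentile = position / n  # in [0, 1)
        # Pick the first (highest) tier whose floor the percentile reaches.
        tiers[user["user_id"]] = next(
            name for floor, name in TIERS if percentile >= floor
        )
    return tiers


# Made-up sample rows, not taken from the real table.
users = [
    {"user_id": "a", "likes": 400, "comments": 80, "posts": 10},
    {"user_id": "b", "likes": 50, "comments": 5, "posts": 10},
    {"user_id": "c", "likes": 0, "comments": 0, "posts": 0},
    {"user_id": "d", "likes": 120, "comments": 30, "posts": 10},
]
tiers = assign_tiers(users)
print(tiers)
```

Ranking by percentile rather than fixed score thresholds keeps the four tiers roughly equal in size, which is one reason this kind of first pass is "naive": it says nothing about whether high engagement is genuine.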
The scripts/ directory contains tools for generating and validating column-level lineage metadata:
- `scripts/generate_lineage.py`: Reads pipeline source and calls Claude to extract lineage JSON
- `scripts/validate_lineage.py`: Validates lineage JSON structure and checks against live Bauplan schemas
- `prompts/lineage_prompt.md`: The LLM prompt template used for lineage extraction
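As a rough idea of what structural and schema validation of lineage JSON involves, consider the sketch below. The JSON layout (`model`, `columns`, `sources`), the function `validate_lineage`, and the `known_schemas` dictionary are hypothetical; the actual format produced by `scripts/generate_lineage.py` is not shown here, and the real validator checks against live Bauplan schemas rather than a hard-coded dictionary.

```python
import json


def validate_lineage(lineage: dict, known_schemas: dict[str, set[str]]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if not isinstance(lineage.get("model"), str):
        errors.append("missing or non-string 'model'")
    columns = lineage.get("columns")
    if not isinstance(columns, list):
        return errors + ["'columns' must be a list"]
    for col in columns:
        if not isinstance(col, dict):
            errors.append("each entry in 'columns' must be an object")
            continue
        name = col.get("name", "<unnamed>")
        for src in col.get("sources", []):
            table, column = src.get("table"), src.get("column")
            # Every upstream reference must resolve to a known table and column.
            if table not in known_schemas:
                errors.append(f"{name}: unknown source table {table!r}")
            elif column not in known_schemas[table]:
                errors.append(f"{name}: {table} has no column {column!r}")
    return errors


# Hypothetical lineage document and schema snapshot, for illustration only.
doc = json.loads("""
{
  "model": "user_segments",
  "columns": [
    {"name": "tier",
     "sources": [{"table": "instagram_engagement_data", "column": "likes"}]},
    {"name": "bot_flag",
     "sources": [{"table": "instagram_engagement_data", "column": "missing_col"}]}
  ]
}
""")
schemas = {"instagram_engagement_data": {"user_id", "likes"}}
errors = validate_lineage(doc, schemas)
print(errors)
```

Checking references against a schema matters here because the lineage is extracted by an LLM, which can name columns that do not exist upstream.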