Extract and transform the Yelp Academic Dataset into an Apache Iceberg data lake, with compute provided by Spark and object storage provided by MinIO.
You'll need Docker Desktop for macOS installed and running, with at least 4 cores and 16 GB of RAM available.
Initialize the data lake buckets & prefixes in MinIO.
mkdir -p data/warehouse/lake
mkdir -p data/yelp/json
mkdir -p data/etl/spark/yelp-etl
mkdir -p data/log/spark/spark-logs

Download the data from Yelp (see the instructions).
open https://www.yelp.com/dataset/download
# Accept the conditions & download the JSON dataset (yelp_dataset.tar) to ~/Downloads
tar -xvf ~/Downloads/yelp_dataset.tar -C ./data/yelp/json

Create the Python app deployment files.
cp app.py data/etl/spark/yelp-etl/
zip -vr data/etl/spark/yelp-etl/yelp_etl.zip yelp_etl/*

Run Spark Master + Worker and MinIO (S3) locally.
docker-compose up --build -d

Observe the ETL in the data lake with the UIs.
open http://localhost:9001/ # minio console
open http://localhost:8080/ # spark master
open http://localhost:8081/ # spark worker
open http://localhost:4040/ # spark jobs

Run all pipelines with spark-submit.
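Presumably, run-all-pipelines.sh loops spark-submit over each (pipeline, entity) pair. As a hypothetical sketch (not the actual script), building those commands in Python might look like this; the paths and flag names mirror the examples further down:

```python
# Hypothetical sketch of what run-all-pipelines.sh might do: build one
# spark-submit command per (pipeline, entity) pair. The command shape
# mirrors the examples in this README; the real script may differ.
SPARK_SUBMIT = "/opt/spark/bin/spark-submit"
PY_FILES = "s3a://etl/spark/yelp-etl/yelp_etl.zip"
APP = "s3a://etl/spark/yelp-etl/app.py"

def build_submit_cmd(pipeline: str, entity: str, input_path: str,
                     output_table: str, bucket_column: str, buckets: int = 8):
    """Return the argv list for one spark-submit invocation."""
    return [
        SPARK_SUBMIT,
        "--py-files", PY_FILES,
        APP,
        "--input", input_path,
        "--output", output_table,
        "--entity_type", entity,
        "--pipeline", pipeline,
        "--bucket_column", bucket_column,
        "--buckets", str(buckets),
    ]

cmd = build_submit_cmd(
    "extract", "user",
    "s3a://yelp/json/yelp_academic_dataset_user.json",
    "lake.bronze.yelp.user", "user_id",
)
```

Each command list could then be handed to subprocess.run; the Spark resource flags (driver/executor memory, shuffle partitions) would be appended the same way.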
# Get an interactive bash shell
docker exec -it spark-master /bin/bash
sh run-all-pipelines.sh

Cleanup.
exit
docker-compose down -v

Get help by executing:
python app.py --help

Extraction pipelines: --pipeline extract
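The spark-submit invocations in this section all drive app.py through the same set of flags. A hypothetical argparse sketch of that CLI surface (flag names are taken from the examples here; the real app.py may define them differently):

```python
import argparse

# Hypothetical sketch of the CLI that `python app.py --help` documents.
# Flags and entity/pipeline values come from the examples in this README.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Yelp ETL pipelines")
    parser.add_argument("--input", required=True, help="Source path or table")
    parser.add_argument("--output", required=True, help="Target Iceberg table")
    parser.add_argument("--entity_type", required=True,
                        choices=["user", "business", "tip"])
    parser.add_argument("--pipeline", required=True,
                        choices=["extract", "clean", "enrich"])
    parser.add_argument("--partition_column", default=None)
    parser.add_argument("--bucket_column", default=None)
    parser.add_argument("--buckets", type=int, default=8)
    parser.add_argument("--dimension_inputs", default=None,
                        help="Comma-separated dimension tables (enrich only)")
    parser.add_argument("--dimension_entity_types", default=None)
    return parser

# Parse the same arguments as the extract example below.
args = build_parser().parse_args([
    "--input", "s3a://yelp/json/yelp_academic_dataset_user.json",
    "--output", "lake.bronze.yelp.user",
    "--entity_type", "user",
    "--pipeline", "extract",
    "--bucket_column", "user_id",
    "--buckets", "8",
])
```

The --bucket_column/--buckets pair maps to Iceberg's bucket partition transform, which hashes the column value into a fixed number of buckets (8 here).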
/opt/spark/bin/spark-submit \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=4g \
--conf spark.driver.maxResultSize=4g \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=4g \
--conf spark.sql.shuffle.partitions=8 \
--num-executors=2 \
--py-files s3a://etl/spark/yelp-etl/yelp_etl.zip \
s3a://etl/spark/yelp-etl/app.py \
--input s3a://yelp/json/yelp_academic_dataset_user.json \
--output lake.bronze.yelp.user \
--entity_type user \
--pipeline extract \
--bucket_column user_id \
--buckets 8

Cleaning pipelines: --pipeline clean
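The cleaning logic itself lives in app.py and runs on Spark DataFrames. As a rough, hypothetical illustration of what a clean step typically does (trim strings, drop rows missing the key, dedupe on business_id):

```python
def clean_businesses(rows):
    """Hypothetical cleaning pass: trim string fields, drop rows without
    a business_id, and keep only the first occurrence of each id.
    Illustrative only; the real pipeline does this with Spark."""
    seen = set()
    out = []
    for row in rows:
        bid = (row.get("business_id") or "").strip()
        if not bid or bid in seen:
            continue  # drop orphans and duplicates
        seen.add(bid)
        out.append({k: v.strip() if isinstance(v, str) else v
                    for k, v in row.items()})
    return out

raw = [
    {"business_id": " b1 ", "name": "  Cafe  "},
    {"business_id": "b1", "name": "Cafe dup"},   # duplicate id, dropped
    {"business_id": None, "name": "orphan"},     # missing id, dropped
]
cleaned = clean_businesses(raw)
```

The real pipeline reads lake.bronze.yelp.business and writes lake.silver.yelp.business, as in the spark-submit invocation below.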
/opt/spark/bin/spark-submit \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=4g \
--conf spark.driver.maxResultSize=4g \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=4g \
--conf spark.sql.shuffle.partitions=8 \
--num-executors=2 \
--py-files s3a://etl/spark/yelp-etl/yelp_etl.zip \
s3a://etl/spark/yelp-etl/app.py \
--input lake.bronze.yelp.business \
--output lake.silver.yelp.business \
--entity_type business \
--pipeline clean \
--bucket_column business_id \
--buckets 8

Enrichment pipelines: --pipeline enrich
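The enrich step joins the tip fact table against the business and user dimensions (the --dimension_inputs below) and derives date_year, the partition column. A hypothetical plain-Python sketch of that shape (the real pipeline uses Spark joins):

```python
def enrich_tips(tips, businesses, users):
    """Hypothetical sketch of the enrich step: join each tip to its
    business and user dimension rows and derive the date_year partition
    column. Illustrative only; the real pipeline runs on Spark."""
    biz = {b["business_id"]: b for b in businesses}
    usr = {u["user_id"]: u for u in users}
    enriched = []
    for tip in tips:
        b = biz.get(tip["business_id"])
        u = usr.get(tip["user_id"])
        if b is None or u is None:
            continue  # drop tips with no matching dimension row
        enriched.append({
            **tip,
            "business_name": b["name"],
            "user_name": u["name"],
            "date_year": int(tip["date"][:4]),  # "2019-05-01" -> 2019
        })
    return enriched

rows = enrich_tips(
    tips=[{"business_id": "b1", "user_id": "u1",
           "date": "2019-05-01", "text": "Great coffee"}],
    businesses=[{"business_id": "b1", "name": "Cafe"}],
    users=[{"user_id": "u1", "name": "Ann"}],
)
```

Partitioning the output on date_year plus bucketing on business_id matches the --partition_column/--bucket_column flags in the invocation below.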
/opt/spark/bin/spark-submit \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=4g \
--conf spark.driver.maxResultSize=4g \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=4g \
--conf spark.sql.shuffle.partitions=8 \
--num-executors=2 \
--py-files s3a://etl/spark/yelp-etl/yelp_etl.zip \
s3a://etl/spark/yelp-etl/app.py \
--input lake.silver.yelp.tip \
--output lake.silver.yelp.user_business_tip \
--entity_type tip \
--pipeline enrich \
--partition_column date_year \
--bucket_column business_id \
--buckets 8 \
--dimension_inputs lake.silver.yelp.business,lake.silver.yelp.user \
--dimension_entity_types business,user

TODO: Add gold pipeline examples.
You will need Python ~3.8.1 with Poetry ~1.6.0 installed.
Install Python app & dependencies.
poetry install --with dev

Run tests.
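The standard unittest runner discovers TestCase subclasses in the repo's test modules. A minimal, hypothetical example of the shape it picks up (the helper and test here are illustrative, not the repo's actual code):

```python
import unittest

def split_csv(value):
    """Tiny hypothetical helper: split a comma-separated CLI value,
    like --dimension_inputs, into a list of table names."""
    return [part for part in value.split(",") if part]

class TestSplitCsv(unittest.TestCase):
    # `python -m unittest` discovers TestCase subclasses like this one.
    def test_two_tables(self):
        self.assertEqual(
            split_csv("lake.silver.yelp.business,lake.silver.yelp.user"),
            ["lake.silver.yelp.business", "lake.silver.yelp.user"],
        )
```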
poetry run python -m unittest

Format & lint with pre-commit.
poetry run pre-commit install
# Format & lint happen automatically on commit.

An opinionated guide to set up your local environment.
Install brew.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Python 3.8 with pyenv, then install Poetry.
brew update
brew install pyenv
echo 'eval "$(pyenv init -)"' >> ~/.zshrc
exec zsh -l
pyenv install 3.8
pyenv local 3.8
curl -sSL https://install.python-poetry.org | python3 -
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
mkdir -p ~/.zfunc
poetry completions zsh > ~/.zfunc/_poetry
exec zsh -l
poetry config virtualenvs.prefer-active-python true
poetry config virtualenvs.in-project true