Extract and transform the Yelp Academic Dataset into an Apache Iceberg data lake, with compute provided by Spark and object storage provided by MinIO.
You'll need Docker Desktop for macOS installed and running, with at least 4 cores and 16 GB of RAM available.
Initialize the data lake buckets & prefixes in MinIO.
mkdir -p data/warehouse/lake
mkdir -p data/yelp/json
mkdir -p data/etl/spark/yelp-etl
mkdir -p data/log/spark/spark-logs

Download the data from Yelp (see the instructions).
open https://www.yelp.com/dataset/download
# Accept the conditions & download the JSON dataset (yelp_dataset.tar) to ~/Downloads
tar -xvf ~/Downloads/yelp_dataset.tar -C ./data/yelp/json

Create the Python app deployment files.
cp app.py data/etl/spark/yelp-etl/
zip -vr data/etl/spark/yelp-etl/yelp_etl.zip yelp_etl/*

Run Spark Master + Worker and MinIO (S3) locally.
docker-compose up --build -d

Observe the ETL in the data lake with the UIs.
open http://localhost:9001/ # minio console
open http://localhost:8080/ # spark master
open http://localhost:8081/ # spark worker
open http://localhost:4040/ # spark jobs

Run all pipelines with spark-submit.
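Presumably, run-all-pipelines.sh loops spark-submit over each (pipeline, entity) pair. As a hypothetical sketch (not the actual script), building those commands in Python might look like this; the paths and flag names mirror the examples further down:

```python
# Hypothetical sketch of what run-all-pipelines.sh might do: build one
# spark-submit command per (pipeline, entity) pair. The command shape
# mirrors the examples in this README; the real script may differ.
SPARK_SUBMIT = "/opt/spark/bin/spark-submit"
PY_FILES = "s3a://etl/spark/yelp-etl/yelp_etl.zip"
APP = "s3a://etl/spark/yelp-etl/app.py"

def build_submit_cmd(pipeline: str, entity: str, input_path: str,
                     output_table: str, bucket_column: str, buckets: int = 8):
    """Return the argv list for one spark-submit invocation."""
    return [
        SPARK_SUBMIT,
        "--py-files", PY_FILES,
        APP,
        "--input", input_path,
        "--output", output_table,
        "--entity_type", entity,
        "--pipeline", pipeline,
        "--bucket_column", bucket_column,
        "--buckets", str(buckets),
    ]

cmd = build_submit_cmd(
    "extract", "user",
    "s3a://yelp/json/yelp_academic_dataset_user.json",
    "lake.bronze.yelp.user", "user_id",
)
```

Each command list could then be handed to subprocess.run; the Spark resource flags (driver/executor memory, shuffle partitions) would be appended the same way.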
# Get an interactive bash shell
docker exec -it spark-master /bin/bash
sh run-all-pipelines.sh

Cleanup.
exit
docker-compose down -v

Get help by executing:
python app.py --help

Extraction pipelines: --pipeline extract
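The spark-submit invocations in this section all drive app.py through the same set of flags. A hypothetical argparse sketch of that CLI surface (flag names are taken from the examples here; the real app.py may define them differently):

```python
import argparse

# Hypothetical sketch of the CLI that `python app.py --help` documents.
# Flags and entity/pipeline values come from the examples in this README.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Yelp ETL pipelines")
    parser.add_argument("--input", required=True, help="Source path or table")
    parser.add_argument("--output", required=True, help="Target Iceberg table")
    parser.add_argument("--entity_type", required=True,
                        choices=["user", "business", "tip"])
    parser.add_argument("--pipeline", required=True,
                        choices=["extract", "clean", "enrich"])
    parser.add_argument("--partition_column", default=None)
    parser.add_argument("--bucket_column", default=None)
    parser.add_argument("--buckets", type=int, default=8)
    parser.add_argument("--dimension_inputs", default=None,
                        help="Comma-separated dimension tables (enrich only)")
    parser.add_argument("--dimension_entity_types", default=None)
    return parser

# Parse the same arguments as the extract example below.
args = build_parser().parse_args([
    "--input", "s3a://yelp/json/yelp_academic_dataset_user.json",
    "--output", "lake.bronze.yelp.user",
    "--entity_type", "user",
    "--pipeline", "extract",
    "--bucket_column", "user_id",
    "--buckets", "8",
])
```

The --bucket_column/--buckets pair maps to Iceberg's bucket partition transform, which hashes the column value into a fixed number of buckets (8 here).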
/opt/spark/bin/spark-submit \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=4g \
--conf spark.driver.maxResultSize=4g \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=4g \
--conf spark.sql.shuffle.partitions=8 \
--num-executors=2 \
--py-files s3a://etl/spark/yelp-etl/yelp_etl.zip \
s3a://etl/spark/yelp-etl/app.py \
--input s3a://yelp/json/yelp_academic_dataset_user.json \
--output lake.bronze.yelp.user \
--entity_type user \
--pipeline extract \
--bucket_column user_id \
--buckets 8

Cleaning pipelines: --pipeline clean
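The cleaning logic itself lives in app.py and runs on Spark DataFrames. As a rough, hypothetical illustration of what a clean step typically does (trim strings, drop rows missing the key, dedupe on business_id):

```python
def clean_businesses(rows):
    """Hypothetical cleaning pass: trim string fields, drop rows without
    a business_id, and keep only the first occurrence of each id.
    Illustrative only; the real pipeline does this with Spark."""
    seen = set()
    out = []
    for row in rows:
        bid = (row.get("business_id") or "").strip()
        if not bid or bid in seen:
            continue  # drop orphans and duplicates
        seen.add(bid)
        out.append({k: v.strip() if isinstance(v, str) else v
                    for k, v in row.items()})
    return out

raw = [
    {"business_id": " b1 ", "name": "  Cafe  "},
    {"business_id": "b1", "name": "Cafe dup"},   # duplicate id, dropped
    {"business_id": None, "name": "orphan"},     # missing id, dropped
]
cleaned = clean_businesses(raw)
```

The real pipeline reads lake.bronze.yelp.business and writes lake.silver.yelp.business, as in the spark-submit invocation below.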
/opt/spark/bin/spark-submit \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=4g \
--conf spark.driver.maxResultSize=4g \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=4g \
--conf spark.sql.shuffle.partitions=8 \
--num-executors=2 \
--py-files s3a://etl/spark/yelp-etl/yelp_etl.zip \
s3a://etl/spark/yelp-etl/app.py \
--input lake.bronze.yelp.business \
--output lake.silver.yelp.business \
--entity_type business \
--pipeline clean \
--bucket_column business_id \
--buckets 8

Enrichment pipelines: --pipeline enrich
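The enrich step joins the tip fact table against the business and user dimensions (the --dimension_inputs below) and derives date_year, the partition column. A hypothetical plain-Python sketch of that shape (the real pipeline uses Spark joins):

```python
def enrich_tips(tips, businesses, users):
    """Hypothetical sketch of the enrich step: join each tip to its
    business and user dimension rows and derive the date_year partition
    column. Illustrative only; the real pipeline runs on Spark."""
    biz = {b["business_id"]: b for b in businesses}
    usr = {u["user_id"]: u for u in users}
    enriched = []
    for tip in tips:
        b = biz.get(tip["business_id"])
        u = usr.get(tip["user_id"])
        if b is None or u is None:
            continue  # drop tips with no matching dimension row
        enriched.append({
            **tip,
            "business_name": b["name"],
            "user_name": u["name"],
            "date_year": int(tip["date"][:4]),  # "2019-05-01" -> 2019
        })
    return enriched

rows = enrich_tips(
    tips=[{"business_id": "b1", "user_id": "u1",
           "date": "2019-05-01", "text": "Great coffee"}],
    businesses=[{"business_id": "b1", "name": "Cafe"}],
    users=[{"user_id": "u1", "name": "Ann"}],
)
```

Partitioning the output on date_year plus bucketing on business_id matches the --partition_column/--bucket_column flags in the invocation below.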
/opt/spark/bin/spark-submit \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=4g \
--conf spark.driver.maxResultSize=4g \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=4g \
--conf spark.sql.shuffle.partitions=8 \
--num-executors=2 \
--py-files s3a://etl/spark/yelp-etl/yelp_etl.zip \
s3a://etl/spark/yelp-etl/app.py \
--input lake.silver.yelp.tip \
--output lake.silver.yelp.user_business_tip \
--entity_type tip \
--pipeline enrich \
--partition_column date_year \
--bucket_column business_id \
--buckets 8 \
--dimension_inputs lake.silver.yelp.business,lake.silver.yelp.user \
--dimension_entity_types business,user

TODO: Add gold pipeline examples.
You will need Python ~3.8.1 with Poetry ~1.6.0 installed.
Install Python app & dependencies.
poetry install --with dev

Run tests.
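The standard unittest runner discovers TestCase subclasses in the repo's test modules. A minimal, hypothetical example of the shape it picks up (the helper and test here are illustrative, not the repo's actual code):

```python
import unittest

def split_csv(value):
    """Tiny hypothetical helper: split a comma-separated CLI value,
    like --dimension_inputs, into a list of table names."""
    return [part for part in value.split(",") if part]

class TestSplitCsv(unittest.TestCase):
    # `python -m unittest` discovers TestCase subclasses like this one.
    def test_two_tables(self):
        self.assertEqual(
            split_csv("lake.silver.yelp.business,lake.silver.yelp.user"),
            ["lake.silver.yelp.business", "lake.silver.yelp.user"],
        )
```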
poetry run python -m unittest

Format & lint with pre-commit.
poetry run pre-commit install
# Format & lint happen automatically on commit.

An opinionated guide to set up your local environment.
Install brew.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Python 3.8 with pyenv, then install Poetry.
brew update
brew install pyenv
echo 'eval "$(pyenv init -)"' >> ~/.zshrc
exec zsh -l
pyenv install 3.8
pyenv local 3.8
curl -sSL https://install.python-poetry.org | python3 -
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
mkdir -p ~/.zfunc
poetry completions zsh > ~/.zfunc/_poetry
exec zsh -l
poetry config virtualenvs.prefer-active-python true
poetry config virtualenvs.in-project true