- Java
- Poetry
- Docker
- https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store (selected for the notebook because of its large .csv)
- https://www.kaggle.com/datasets/gsimonx37/letterboxd
- https://www.kaggle.com/datasets/olegshpagin/usa-stocks-prices-ohlcv
- https://www.kaggle.com/datasets/jayitabhattacharyya/hotels-details
- Clone this repository
- Run `poetry install` in the root directory of the project
- Get a dataset of your choice; we use the ecommerce data for its sheer size
- Once the data has been saved in Delta format, be sure to kill the Spark session spawned from the notebook
- Start docker-compose to expose Spark's Thrift server for dbt to use
- Create the external tables within the container's context (steps below)
- Change directory into dbt-spark
- Run `poetry run dbt run` to start dbt
- Connect to Hive with the Beeline client inside the container:

```shell
docker exec -it delta-lake-dbt-spark3-thrift-1 beeline -u "jdbc:hive2://localhost:10000/default" -n root
```

- Create the external tables by importing the data from Delta format:
```sql
CREATE SCHEMA raw;
CREATE SCHEMA rfn;
CREATE SCHEMA ast;

CREATE TABLE raw.ecommerce
USING DELTA
LOCATION '/data/delta/raw/ecommerce';

CREATE TABLE rfn.ecommerce
USING DELTA
LOCATION '/data/delta/rfn/ecommerce';
```
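Once the external tables exist, dbt models can select from them directly. A minimal sketch of such a model (the file name and column list are hypothetical; the columns shown are the ones documented for the Kaggle ecommerce dataset):

```sql
-- models/example/ecommerce_purchases.sql (hypothetical model)
{{ config(materialized='table') }}

select
    event_time,
    product_id,
    brand,
    price,
    user_id
from raw.ecommerce
where event_type = 'purchase'
```

`poetry run dbt run` would then materialize this query as a table through the Thrift server connection configured in the dbt profile.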
Most of the code here takes inspiration from this awesome blog
Most of the Dockerfiles came from this repo