If you already have a functioning Apache Spark configuration, you can use your own. For convenience, the provided docker-compose.yml is based on the jupyter/pyspark-notebook image; you will need Docker and Docker Compose configured on your machine. See the Docker Desktop documentation for details.
Run docker-compose up and follow the prompt to open the Jupyter Notebook UI (a URL of the form http://127.0.0.1:8888/?token=<SOME_TOKEN>).
The provided data/ directory is mounted as a Docker volume at ~/data/ for easy access:
import os

from pyspark.sql import SparkSession

# Start a local Spark session
spark = SparkSession.builder.master('local').getOrCreate()

# Read the sample CSV from the mounted data volume, treating the first
# row as a header and inferring column types from the data
df = spark.read.options(
    header='True',
    inferSchema='True',
    delimiter=',',
).csv(os.path.expanduser('~/data/DataSample.csv'))

Please host your solution as one or more notebooks (.ipynb) in a public git repository and reply with its link to the email thread in which you initially received this work sample.
