This project uses Docker to launch you into an environment with Spark installed and some data pre-populated.
- install Docker (http://docs.docker.com/mac/started/)
- brew install docker-machine
- brew install docker
- docker-machine create --driver virtualbox learn-spark
- docker-machine env learn-spark
- eval "$(docker-machine env learn-spark)"
Then, from the project directory, build the image and start a container (the build will take a long time):
- docker build -t aces/learn-spark .
- docker run -it --rm aces/learn-spark
Startup may take a while, but eventually you will see the Spark ASCII art and the scala> prompt. From there, try the following:
val df = spark.read.json("data/enron-data.json.gz")  // load the gzipped JSON into a DataFrame
df.count()        // how many messages are in the dataset
df.printSchema()  // the schema Spark inferred from the JSON
df.show()         // the first 20 rows
df.select("sender").show()          // a single column
df.select("sender", "date").show()  // multiple columns
df.groupBy("sender").count().show()                               // messages per sender
df.groupBy("sender").count().sort($"count".desc).show()           // same, busiest senders first
df.groupBy("sender").count().sort($"count".desc).limit(2).show()  // just the top two senders