Docker image for starting Apache Zeppelin.
Note: the `binary` build command must be run with Java 8 installed.
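A quick way to verify the active JVM before building (the exact version string varies by JDK distribution):

```sh
# Should report a 1.8.x version, e.g. openjdk version "1.8.0_282"
java -version
```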
```sh
# Download dependencies
./build download --zeppelin_version=v0.8.2 --spark_version=2.4.3 --hadoop_version=2.7

# Build binaries
./build binary --zeppelin_version=v0.8.2 --spark_version=2.4.3 --hadoop_version=2.7

# Build docker container
./build docker --repo=${DOCKER_NAMESPACE:-datascienceplatform}/zeppelind --commit=$(git rev-parse --short HEAD)
```

Alternatively, run all of the above steps at once:

```sh
./build all --zeppelin_version=v0.8.2 --spark_version=2.4.3 --hadoop_version=2.7
```

You can either start the image directly with Docker, or use the Nomad-Docker-Wrapper if you are running your containers on Nomad.
There are two options for running Zeppelin: multi-user or single-user mode. If you wish to test the SSSD integration, you also need to run the sssd container, which is explained in the sssd project documentation. Both the zeppelin container and the sssd container must share the volume `/var/sssd/{{id}}/var/lib/sss/pipes:/var/lib/sss/pipes:rw`, and the sssd container must be started before the zeppelin container; a sketch follows below.
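A minimal sketch of that shared mount; the sssd image name and the container name are placeholders, not part of this project:

```sh
# Start the sssd container first so the pipes directory is populated.
docker run -d --name sssd \
  -v /var/sssd/{{id}}/var/lib/sss/pipes:/var/lib/sss/pipes:rw \
  <sssd-image>

# Then start Zeppelin with the same pipes volume mounted.
docker run -p 8080:8080 \
  -v /var/sssd/{{id}}/var/lib/sss/pipes:/var/lib/sss/pipes:rw \
  -t pactosystems/zeppelind:f9d604cf-zv0.8.2-s2.4.3-h2.7
```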
```sh
# Multi-user login
docker run -p 8080:8085 \
  -e ZEPPELIN_PROCESS_USER_NAME="zeppelin" \
  -e ZEPPELIN_MEM="-Xmx1024m" \
  -e ZEPPELIN_PROCESS_USER_ID=12345 \
  -e ZEPPELIN_SERVER_PORT=8085 \
  -e ZEPPELIN_SPARK_DRIVER_MEMORY="512M" \
  -e ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.GitNotebookRepo \
  -e ZEPPELIN_PROCESS_GROUP_NAME="DSP1_USERS" \
  -e ZEPPELIN_PYSPARK_PYTHON=/usr/bin/python \
  -e ZEPPELIN_SPARK_UI_PORT=4045 \
  -e ZEPPELIN_PROCESS_GROUP_ID=12340 \
  -e ZEPPELIN_SPARK_MASTER="local[*]" \
  -e ZEPPELIN_PASSWORD="secret" \
  -e ZEPPELIN_USER_TYPE=multiuser \
  -v $(pwd)/notebooks:/usr/local/zeppelin/notebooks \
  -v $(pwd)/conf:/usr/local/zeppelin/conf \
  -v $(pwd)/hive:/hive \
  -t pactosystems/zeppelind:f9d604cf-zv0.8.2-s2.4.3-h2.7
```
```sh
# Single-user login
docker run -p 8080:8080 \
  -e ZEPPELIN_PROCESS_USER_NAME="zeppelin" \
  -e ZEPPELIN_MEM="-Xmx1024m" \
  -e ZEPPELIN_PROCESS_USER_ID=12345 \
  -e ZEPPELIN_SERVER_PORT=8080 \
  -e ZEPPELIN_SPARK_DRIVER_MEMORY="512M" \
  -e ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.GitNotebookRepo \
  -e ZEPPELIN_PROCESS_GROUP_NAME="DSP1_USERS" \
  -e ZEPPELIN_PYSPARK_PYTHON=/usr/bin/python \
  -e ZEPPELIN_SPARK_UI_PORT=4040 \
  -e ZEPPELIN_PROCESS_GROUP_ID=12340 \
  -e ZEPPELIN_SPARK_MASTER="local[*]" \
  -e ZEPPELIN_PASSWORD="secret" \
  -e ZEPPELIN_USER_TYPE=singleuser \
  -v $(pwd)/notebooks:/usr/local/zeppelin/notebooks \
  -v $(pwd)/conf:/usr/local/zeppelin/conf \
  -v $(pwd)/hive:/hive \
  -t pactosystems/zeppelind:f9d604cf-zv0.8.2-s2.4.3-h2.7
```

The docker image requires some environment variables to be set; they are used to configure your Zeppelin instance.
| Variable | Description |
|---|---|
| ZEPPELIN_SPARK_MASTER | URL of the Spark master that Zeppelin should use. |
| ZEPPELIN_PASSWORD | Password for authenticating as the zeppelin user on the UI. |
| ZEPPELIN_NOTEBOOK_STORAGE | Notebook storage implementation to use. |
| ZEPPELIN_PROCESS_USER_NAME | User name to execute the Zeppelin process as. |
| ZEPPELIN_PROCESS_USER_ID | User ID to execute the Zeppelin process as. |
| ZEPPELIN_PROCESS_GROUP_NAME | Group name to assign to the Zeppelin user. |
| ZEPPELIN_PROCESS_GROUP_ID | Group ID to assign to the Zeppelin user. |
| ZEPPELIN_SERVER_PORT | Port to bind the Zeppelin server to. |
| ZEPPELIN_SPARK_UI_PORT | Port to use for the Spark UI. |
| ZEPPELIN_SPARK_DRIVER_MEMORY | Amount of memory to allocate to the Spark driver process (e.g. 512M). |
| ZEPPELIN_PYSPARK_PYTHON | Path to the Python executable for the Spark worker nodes. |
| ZEPPELIN_MEM | JVM options for the Zeppelin process (e.g. -Xmx1024m). |
| ZEPPELIN_USER_TYPE | Login mode: `multiuser` or `singleuser`. |
For automated image builds, the following environment variables should be defined in your Travis CI settings:
https://travis-ci.org/<your travis account name>/docker-zeppelin/settings
| Variable | Description |
|---|---|
| DOCKER_USER | Your Docker ID. |
| DOCKER_PASSWORD | Password for your Docker ID (make sure "display value in build log" is disabled for this variable). |
| DOCKER_NAMESPACE | Docker namespace to push the image(s) to; usually the same as DOCKER_USER. If empty, defaults to "datascienceplatform", which can lead to permission issues while pushing. |
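For illustration, a CI push step might consume these variables roughly as follows (a sketch only; `TAG` stands for the image tag produced by the build and is not defined by this project):

```sh
# Log in with the credentials from the CI settings, then push under the chosen namespace.
echo "$DOCKER_PASSWORD" | docker login -u "$DOCKER_USER" --password-stdin
docker push "${DOCKER_NAMESPACE:-datascienceplatform}/zeppelind:${TAG}"
```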
SQL databases are supported through SQLAlchemy (a Python package with a comprehensive set of tools for working with databases). This container includes the following dialects:
| Dialect | Target DB | 
|---|---|
| pymssql | Microsoft SQL Server | 
| psycopg2 | PostgreSQL | 
| cx_Oracle | Oracle | 
Support for additional databases can be added by installing additional dialects into an Anaconda environment and pointing ZEPPELIN_PYSPARK_PYTHON at that environment's Python executable, as sketched below.
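A minimal sketch of that setup; the environment name, the mysqlclient driver, and the conda path are illustrative assumptions, not part of this image:

```sh
# Create a conda environment with SQLAlchemy plus an extra driver (MySQL as an example).
conda create -y -n zeppelin-sql python=3.7 sqlalchemy
conda run -n zeppelin-sql pip install mysqlclient

# Then start the container with ZEPPELIN_PYSPARK_PYTHON pointing at that environment:
#   -e ZEPPELIN_PYSPARK_PYTHON=/opt/conda/envs/zeppelin-sql/bin/python
```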
```python
%spark.pyspark
# Microsoft SQL Server via pymssql
from sqlalchemy import create_engine

conn_str = "mssql+pymssql://<User>:<Password>@<Host>:<Port>"
engine = create_engine(conn_str)

# List all databases
res = engine.execute('SELECT * FROM master.sys.databases')
for row in res:
    print(row)
```

```python
%spark.pyspark
# PostgreSQL via psycopg2
from sqlalchemy import create_engine

conn_str = "postgresql+psycopg2://<User>:<Password>@<Host>:<Port>/<Database>"
engine = create_engine(conn_str)

# List all databases
res = engine.execute('SELECT * FROM pg_database')
for row in res:
    print(row)
```

```python
%spark.pyspark
# Oracle via cx_Oracle
from sqlalchemy import create_engine

conn_str = "oracle+cx_oracle://<User>:<Password>@<Host>:<Port>/<db>"
engine = create_engine(conn_str)

res = engine.execute('SELECT * FROM my_table')
for row in res:
    print(row)
```