Spark Docker

In this post we provide a comprehensive guide to building a Spark Docker image and using it to provision a ‘standalone’ Spark cluster composed of a master node and an arbitrary number of slave nodes, each running within its own Docker container. Additionally, we provide two examples of running (1) an interactive Spark-shell and (2) a WordCount Spark application on top of the cluster. The main technology drivers for this guide are Docker and Spark; we expect readers to be familiar with the Docker engine, Apache Spark, and its installation in standalone mode.


Overview

Apache Spark has been an active open source project since 2010, with continuously growing attention in the big data world. According to the Hadoop survey carried out by Syncsort earlier this year, 70% of the survey participants showed most interest in Apache Spark; higher than the current adoption leader, MapReduce. As more organizations turn to Spark, and considering the high development pace of the platform, with new features and bug fixes added and released every couple of weeks, a convenient way of testing the platform and trying out newly added features is crucial. This is where the Spark Docker image provided here comes in.

Spark provides a very simple standalone deployment mode. The only requirements to get a Spark node up and running are Java and a compiled version of Spark; for a standalone cluster, all nodes must run the same Java and Spark versions and be able to communicate with each other.

In this guide, we do the following:

  1. provide a Dockerfile for a Spark Docker image that is suitable for running either a single master, a single slave, or an interactive spark-shell in standalone mode;
  2. provide the Docker instructions necessary to build the image, deploy a standalone cluster composed of a master and n slave nodes, and run an interactive Spark-shell on top of the deployed cluster;
  3. provide a Dockerfile that extends the base Spark Docker image into a generic Spark-driver image that is suitable for running a Spark application packaged as a ‘fat jar’;
  4. use the driver Dockerfile to build a Spark-driver encapsulating a simple WordCount application and run it on top of the deployed cluster.

1. Dockerizing Spark

In this section we provide and describe the Dockerfile that automates the creation of the Spark Docker image. We start from Ubuntu 14.04 as our base image and incrementally RUN commands that assemble the desired image, mainly installing Java and downloading a pre-built version of Spark; once these are in place, we are ready to start any of our Spark components.

Spark Dockerfile

 
#specifying our base docker-image
FROM ubuntu:14.04

####updating the package lists, then installing [software-properties-common] (so that we can use [apt-add-repository] to add the repository [ppa:webupd8team/java] from which we install Java 8) and [wget] (used below to download Spark)
RUN apt-get update -y
RUN apt-get install -y software-properties-common wget
RUN apt-add-repository ppa:webupd8team/java -y
RUN apt-get update -y

####automatically accepting the Oracle license agreement that normally pops up while installing Java 8; RUN commands already execute as root, so no sudo is needed
RUN echo debconf shared/accepted-oracle-license-v1-1 select true | debconf-set-selections
RUN echo debconf shared/accepted-oracle-license-v1-1 seen true | debconf-set-selections

####installing java
RUN apt-get install -y oracle-java8-installer
#####################################################################################
####downloading and unpacking Spark 1.6.1 [prebuilt for Hadoop 2.6+ and scala 2.10]
RUN wget http://apache.mirror.triple-it.nl/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
RUN tar -xzf spark-1.6.1-bin-hadoop2.6.tgz

####moving the spark root folder to /opt/spark
RUN mv spark-1.6.1-bin-hadoop2.6 /opt/spark

####exposing port 8080 so we can later access the Spark master UI, e.g. to verify that Spark is running
EXPOSE 8080
#####################################################################################

The specified Dockerfile looks sufficient for starting any of the target Spark components. The most convenient way to start a Spark master/slave in standalone mode is via the corresponding launch script provided by Spark; this can be done from the Docker run command while starting the master/slave container. However, there is a small issue with that approach.

Docker containers are designed to run a single process; this process takes process id PID 1 in the running container and controls the lifetime of the container: whenever the process terminates, the container terminates. If we start a Spark master/slave container by directly running the corresponding launch script, this shell script will be the main process running in the container and, as expected, when the shell script finishes, the container will terminate; i.e. although the master/slave process starts successfully, the container itself terminates because the process with PID 1 (the shell script in this case) has completed.

The coolest workaround for this is to use supervisor. Supervisor is a process control system that allows you to monitor and control a number of processes on UNIX-like operating systems. Our trick is to use supervisor to run the Spark master/slave launch script. In that case, the supervisor process runs as PID 1 and, regardless of the state of the launch script, the supervisor process remains alive, and so does the container. To make that happen we need to add the following to our Dockerfile:

 
#####################################################################################
####installing supervisor
RUN apt-get install -y supervisor

####copy supervisor configuration files for master and slave nodes (described below)
COPY master.conf /opt/conf/master.conf
COPY slave.conf /opt/conf/slave.conf

#default command: running an interactive spark shell in the local mode
CMD ["/opt/spark/bin/spark-shell", "--master", "local[*]"]
########################################EOF##########################################

Note: While we chose to run the Spark master and slave via their corresponding launch scripts, this is not the case with the spark-shell. The spark-shell is always run “interactively”, which guarantees that the container remains alive. We set the default command of a Spark Docker container to run the spark-shell in local mode. This makes it convenient to run a complete Spark setup with the interactive shell from the image using the command docker run -it <spark-docker-image-name>.

Supervisor configuration file – master.conf

 
#setup supervisor to run in interactive mode and run spark master launch script
[supervisord]
nodaemon=true
[program:spark-master]
command=/opt/spark/sbin/start-master.sh

Supervisor configuration file – slave.conf

#setup supervisor to run in interactive mode and run spark slave launch script
[supervisord]
nodaemon=true
[program:spark-slave]
command=/opt/spark/sbin/start-slave.sh spark://master:7077

Note: the argument to the Spark slave launch script is the Spark master URL; we use master as the hostname of the Spark master. This will only work if master resolves to the IP address of the Spark master node; we can guarantee that by naming the master container master and letting the master and slave containers join the same Docker user-defined network; containers in the same Docker user-defined network can reach each other by their container name.
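
To illustrate this name-resolution behaviour, the following minimal throwaway sketch (not part of the cluster setup; the demo_net and demo_master names are our own) shows that a container on a user-defined network can be looked up by its container name:

#create a temporary user-defined network and a named container on it
docker network create demo_net
docker run -d --net demo_net --name demo_master ubuntu:14.04 sleep 60
#resolve the container name from a second container on the same network
docker run --rm --net demo_net ubuntu:14.04 getent hosts demo_master
#clean up the temporary container and network
docker rm -f demo_master && docker network rm demo_net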

2.1 Running a Standalone Spark cluster on the Localhost

Given the provided Dockerfile and the supervisor configuration files we have all the necessary items to build and start running our image.

  1. building the image: 
    • create a folder containing (only) the 3 files named Dockerfile, master.conf and slave.conf; this will act as the build context for our image
    • from the docker command line, cd to the created folder and type: docker build -t anchormen/spark . (don’t forget the dot ‘.’); this command will build the Spark Docker image with the name <anchormen/spark>
  2. running the cluster (the full sequence of commands is also consolidated in the sketch after this list):
    • create a user-defined network for the cluster components using the following command: docker network create spark_network; in this example we call it <spark_network>
    • run the master node using the following command: docker run -d --net spark_network --name master -p 8080:8080 anchormen/spark /usr/bin/supervisord --configuration=/opt/conf/master.conf here we (1) run the master container as a daemon, (2) explicitly let it join the spark_network, (3) name it ‘master’ so it can be reached by slave and driver nodes, and (4) map the exposed port 8080 to port 8080 on the localhost (so we can access the master’s UI); finally we run the supervisor process with the master.conf configuration file
    • run n slave containers by running the following command n times: docker run -d --net spark_network anchormen/spark /usr/bin/supervisord --configuration=/opt/conf/slave.conf here we explicitly let each worker join the spark_network so that it can reach the Spark master via its container name ‘master’
    • verify the cluster is up and running by accessing the master’s UI from your favorite browser: "http://localhost:8080"
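
For convenience, the individual commands above can be collected into the following shell sketch; the choice of 3 slaves is arbitrary, and the image and network names are the ones used in this guide (the anchormen/spark image is assumed to have been built already):

#!/bin/bash
#consolidated sketch of the cluster-setup commands described above
set -e

#create the user-defined network for the cluster
docker network create spark_network

#run the master: joins spark_network, is named 'master', and maps the master UI to localhost:8080
docker run -d --net spark_network --name master -p 8080:8080 anchormen/spark \
  /usr/bin/supervisord --configuration=/opt/conf/master.conf

#run 3 slaves (an arbitrary number); each joins spark_network and connects to spark://master:7077
for i in 1 2 3; do
  docker run -d --net spark_network anchormen/spark \
    /usr/bin/supervisord --configuration=/opt/conf/slave.conf
done

echo "Spark master UI: http://localhost:8080 (Linux) or http://<docker-machine ip>:8080 (Windows/OSX)"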

Windows & OSX users (do not use localhost)

Docker containers rely heavily on facilities provided by the Linux kernel. On Windows and OSX, the Docker daemon and the containers cannot run natively; only the Docker client runs on the Windows/OSX machine, while the daemon and the containers run inside a VirtualBox virtual machine that runs Linux.

For Windows and OSX users, when a port is exposed from the Docker image and mapped by the container onto the host machine, it is mapped onto the VirtualBox virtual machine, not the Windows/OSX host itself. Thus, to connect to a container via its mapped ports, you cannot use “localhost”, as Linux users do, but must use the IP address of the virtual machine that runs Linux. More on this can be found in the Windows and OSX installation guides.
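
Assuming the docker-machine tooling that ships with the Docker installers for Windows/OSX, and a machine with the default name default (an assumption; your machine name may differ), the IP address of that virtual machine can be obtained as follows:

#print the IP address of the VirtualBox virtual machine that runs the Docker daemon
#(assumes docker-machine with a machine named 'default')
docker-machine ip default
#use the printed address instead of localhost, e.g. http://192.168.99.100:8080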

2.2 Starting an interactive Spark-shell

The Spark shell provides a simple way to learn about the API, as well as a powerful tool to analyze data interactively. It is available in both Scala and Python. We can start a Scala Spark-shell and connect it to the cluster using the following command: docker run -it --net spark_network anchormen/spark /opt/spark/bin/spark-shell --master spark://master:7077. On the master UI, the Spark shell will be added to the Running Applications section.

Note: given the default command we set up earlier in the Dockerfile, we can also start a Spark-shell running in local mode (without the cluster) by running the command “docker run -it anchormen/spark”.

3. Spark-driver Dockerfile

In this section we provide and describe the Dockerfile that automates the creation of the generic Spark-driver Docker image. Starting from the previous image as our base image and a ‘fat jar’ application, the only thing required is to copy the ‘fat jar’ to a location within the image and use the spark-submit script to run the application. To keep the image generic, the name of the main class to be executed from the fat jar is passed in as an environment variable.

#using the spark-docker image we just created as our base image
FROM anchormen/spark

#app.jar is our Fat Jar to be run; here we assume it’s in the same build context as the Dockerfile;
COPY app.jar /opt/app.jar

#calling the spark-submit command; with the --class argument being an input environment variable
CMD /opt/spark/bin/spark-submit --class $SPARK_CLASS --master spark://master:7077 /opt/app.jar

Note: a cleaner way to achieve the same effect is to use the base Spark Docker image directly and run it with the fat jar mounted into the driver container as a data volume. Information about how to mount a host file as a data volume using Docker can be found here.
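
As a sketch of that volume-based approach (the host path /path/to/app.jar is a placeholder, and the cluster from section 2.1 is assumed to be running):

#run the base Spark image as a driver, mounting the fat jar from the host instead of baking it into the image
docker run --net spark_network \
  -v /path/to/app.jar:/opt/app.jar \
  anchormen/spark \
  /opt/spark/bin/spark-submit --class nl.anchormen.WordCount --master spark://master:7077 /opt/app.jar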

4. Building and running the driver application

  1. building the image: 
    • create a folder containing (only) the provided Dockerfile and the ‘fat jar’ of the application, named app.jar as referenced in the Dockerfile
    • from the docker command line, cd to the created folder and type: docker build -t anchormen/spark-driver .
  2. running the driver application on top of the cluster (see the sketch after this list for inspecting the driver’s output):
    • In this example we assume the main class name is “nl.anchormen.WordCount”
    • docker run --net spark_network -e "SPARK_CLASS=nl.anchormen.WordCount" anchormen/spark-driver
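
To follow the driver’s console output (logs and anything the application prints to standard output), one option is to give the driver container a name and read its logs afterwards; the wordcount-driver name below is our own choice and not required:

#run the driver with an explicit container name so its console output can be inspected later
docker run --net spark_network --name wordcount-driver -e "SPARK_CLASS=nl.anchormen.WordCount" anchormen/spark-driver

#afterwards (or from another terminal), read the driver's stdout/stderr
docker logs wordcount-driver
#the application should also show up under the Running/Completed Applications sections on the master UI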

Summary

In this post we provided a step-by-step guide to building a Spark Docker image and a generic Spark-driver Docker image, as well as examples of using these images to deploy a standalone Spark cluster and run Spark applications. To keep things simple, we have written the Dockerfiles in a very naïve way, omitting things that one would possibly include, e.g. setting up Spark logging, maintaining Spark’s configuration, installing HDFS, or building Spark for Scala 2.11 and the preferred Hadoop version. Another very interesting use case is to include web-based notebooks, such as Zeppelin or IPython, which enable more convenient interactive data analytics than the Spark-shell.

We have also assumed that the whole setup takes place on a single machine (localhost); however, using the same image and the same commands, it is possible to deploy the cluster on multiple Docker hosts spanning multiple machines. This does require familiarity with more advanced Docker technologies and networking, mainly Docker Swarm and overlay networks.

GitHub Bonus

On our GitHub page we provide updated Dockerfiles that follow the best practices for writing Dockerfiles. Additionally, as a bonus, we provide Docker Compose scripts that automate setting up and tearing down the cluster with a single command.
