Spark on Mesos

A new recipe

A recipe for a nice blend of Spark on Mesos. Preparation time: 10 minutes, using the recipe below. Afterwards, put it in a hot stove for 20 minutes and enjoy! 😛

This blog post covers the use of Spark in combination with Mesos: it explains what Mesos is and how it differs from the commonly used YARN container manager.

This blog post will answer the following questions:

  • What is Mesos?
  • How does it work?
  • What are the differences with respect to YARN?
  • How do you set it up and run a test Spark job?

What is Mesos?

Mesos is a distributed systems kernel that provides APIs to applications (such as Hadoop, Spark and others) for resource management and scheduling across cloud environments and datacenters.

YARN (Map-Reduce version 2) vs Mesos

Mesos is a resource manager and job scheduler, similar to Map-Reduce version 1 and Map-Reduce version 2 (YARN, Yet Another Resource Negotiator). However, there are some notable differences:

  • Mesos uses resource offers, i.e. it abstracts resource allocation and offers chunks of the available resources to a framework to run its job.
  • Mesos is more performant as it does not require a JVM and the (memory) overhead that comes with it.
  • Mesos is more flexible in that it abstracts at a higher level and therefore supports many different applications (frameworks), such as Hadoop, Spark, etc., with the possibility of defining your own custom framework. See: http://mesos.apache.org/documentation/latest/app-framework-development-guide/
  • YARN is much more mature (currently at version 2.7.2, versus 0.28.2 for Mesos), but it is much less cleanly separated from Hadoop than Mesos is.
  • YARN is limited to Hadoop frameworks / applications, and can therefore only run Map-Reduce and YARN workloads, while Mesos is able to run any number of applications based on different frameworks.

For more information about the Mesos architecture, see:

http://mesos.apache.org/documentation/latest/architecture/

Setup

In order to set up all the parts needed to run Spark on Mesos, we perform the following steps. First, add the Mesosphere archive key and determine the distribution details:

# Add the Mesosphere archive GPG key
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
# Determine the distribution (e.g. ubuntu) and codename (e.g. trusty)
DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
CODENAME=$(lsb_release -cs)

Add the repository

Add the repo using:

echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
sudo tee /etc/apt/sources.list.d/mesosphere.list
sudo apt-get -y update
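
Install Mesos

With the repository in place, install the Mesos package itself. This step is implied by the recipe but worth spelling out; on Ubuntu the Mesosphere mesos package also pulls in ZooKeeper as a dependency:

sudo apt-get -y install mesos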

ZooKeeper

Configure and start ZooKeeper using the steps found at: https://open.mesosphere.com/getting-started/install/
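
For a single-node setup, the configuration can be as small as the sketch below. The paths follow the Debian/Ubuntu package layout and the /etc/mesos/zk convention of the Mesosphere packages, so adjust them if your install differs:

# Give this ZooKeeper node a unique id (1-255)
echo 1 | sudo tee /etc/zookeeper/conf/myid

# Point Mesos (master and slave) at ZooKeeper
echo "zk://localhost:2181/mesos" | sudo tee /etc/mesos/zk

# Restart ZooKeeper so the configuration takes effect
sudo service zookeeper restart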

Start Mesos

Start Mesos using:

sudo service mesos-slave start
sudo service mesos-master start
ps aux | grep mesos

The last command should now show both the slave and the master process running.

Launch the ZooKeeper client

We launch the ZooKeeper client using:

sudo sh /<path_to_zookeeper>/bin/zkCli.sh

The path is typically:

/usr/share/zookeeper

List the Mesos znodes

We list the Mesos znode(s) using:

ls /mesos
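
On a fresh single-node setup this prints something like the following (the sequence numbers will differ per run):

[json.info_0000000000, log_replicas]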

Get contents

We get the contents using:

get /mesos/json.info_0000000000 # or higher number
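
The znode holds the elected master's info as JSON. Trimmed, illustrative output (fields and values will differ per setup):

{"address":{"hostname":"localhost","ip":"127.0.0.1","port":5050},
 "hostname":"localhost","pid":"master@127.0.0.1:5050",
 "port":5050,"version":"0.28.2"}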

Mesos UI

The Mesos user interface should now be running at:

http://127.0.0.1:5050/

In the respective tabs we verify that at least one slave is available. We can also determine the version of Mesos; for this tutorial we used version 0.28.2.
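
The same information is available over HTTP, which is handy on a headless box. A quick check against the master's state endpoint (valid for Mesos 0.28):

# Dump the master state; look for the activated_slaves field
curl -s http://127.0.0.1:5050/master/state.json | python -m json.tool | head -n 20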

Add Hadoop to the mix

In order to host Spark somewhere reachable by all components, we first download Hadoop (for HDFS):

wget http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz

And extract it using:

tar -zxvf hadoop-2.6.4.tar.gz
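
The hadoop fs -put step further down needs a running HDFS NameNode. A minimal single-node sketch, assuming JAVA_HOME is set, where <path_to_hadoop> is the extracted hadoop-2.6.4 directory and the fs.defaultFS value is our own choice (not prescribed by this recipe):

# Point Hadoop at a local HDFS instance
cat > <path_to_hadoop>/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# Format the NameNode (first run only) and start HDFS
<path_to_hadoop>/bin/hdfs namenode -format
<path_to_hadoop>/sbin/start-dfs.sh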

Add the Spark framework

We download the latest stable release of Spark, which at the time of writing is version 1.6.1, using:

wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1.tgz

And extract it using:

tar -zxvf spark-1.6.1.tgz

Host Spark on HDFS

To host Spark on HDFS, and have it available through a URI that can be provided to all the components of this yummy recipe, we perform:

./<path_to_hadoop_bin_folder>/hadoop fs -put <path_to>/spark-1.6.1.tgz <optional_path_on_hdfs>
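
A quick sanity check that the file landed on HDFS (using the same path as above):

./<path_to_hadoop_bin_folder>/hadoop fs -ls <optional_path_on_hdfs>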

Build custom Spark

We need to build our custom Spark distribution for Mesos (without YARN, which is the default) using:

./<path_to_spark>/make-distribution.sh --tgz
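
If you want the executors to run this freshly built distribution instead of the source tarball uploaded earlier, host it on HDFS the same way; the exact file name (something like spark-1.6.1-bin-<hadoop_version>.tgz) depends on the build profile:

./<path_to_hadoop_bin_folder>/hadoop fs -put \
  <path_to_spark>/spark-1.6.1-bin-*.tgz <optional_path_on_hdfs>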

Spark shell with Mesos

Now we use Spark with Mesos via the spark-shell:

./<path_to_spark>/bin/spark-shell --master mesos://host:5050
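
For this to work, the driver needs the Mesos native library and the executors need to know where to fetch Spark. Both settings come from the Spark on Mesos documentation listed in the sources below; the paths shown are typical for the Mesosphere packages and the HDFS layout chosen above, so adjust them as needed:

# Location of the Mesos native library installed by the mesos package
export MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so

# Tell the executors where to download the Spark distribution from
./<path_to_spark>/bin/spark-shell --master mesos://host:5050 \
  --conf spark.executor.uri=hdfs://localhost:9000/<optional_path_on_hdfs>/spark-1.6.1.tgz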

Submit a sample Spark job to Mesos

We submit a sample job using:

./<path_to_spark>/bin/run-example SparkPi 10

This in turn uses spark-submit under the hood.
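
For reference, the equivalent explicit spark-submit invocation looks roughly like this; the examples jar name varies with the build, so treat it as a sketch:

./<path_to_spark>/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://host:5050 \
  <path_to_spark>/lib/spark-examples-*.jar 10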

Progress

The progress of the Spark job is shown in the Spark UI at:

http://<hostname>:4040/

Screenshots

Shots fired!

[Screenshot: Mesos running the Spark job]

[Screenshot: Spark running the SparkPi job using Mesos]

Some sources

http://mesos.apache.org/

http://spark.apache.org/docs/latest/running-on-mesos.html

https://www.quora.com/How-does-YARN-compare-to-Mesos

The Mesos binaries and installation steps can be found at mesosphere.com: go to the downloads page, select Apache Mesos, and use the "Get Started" button.

Conclusion

So there you have it, a nice blend of Spark on Mesos! Enjoy!