Running spark-shell in browser with Apache Mesos and Marathon

A small trick on how to run spark-shell as a web app using Mesos and Marathon.
Tags: Spark, Mesos, Marathon

Published: November 13, 2015

I’d like to share a small trick on how to run spark-shell as a web app using Mesos and the Marathon framework.

This might be useful for debugging or just for trying out some Spark code, and you don’t need Spark installed on the cluster to do it.

All you need is a Mesos cluster, the Marathon framework running on top of Mesos, and a Java runtime installed on the Mesos slaves.

Sorry guys, I’m not going to discuss how to set up a Mesos cluster here; that would be another story. I’ll just assume you already have a running environment with the components mentioned above.

The other secret ingredient is the gotty tool, which simply wraps a terminal session into a web app. So the idea is to run spark-shell inside gotty, which in turn is run by Marathon. Sounds simple? Here is how it can be done.
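If you want to get a feel for gotty before wiring it into Marathon, you can try it locally first. This is just a quick sanity check; the port and the wrapped command (top) are arbitrary choices of mine:

# serve an interactive top session at http://localhost:8080
./gotty -w -p 8080 --title-format 'Demo' top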

Let’s create a Marathon app definition in a spark-shell.json file. You can read more about app definitions here.

spark-shell.json
{
  "id": "/spark-shell",
  "cmd": "./gotty -w -p $PORT0 --title-format 'Spark Shell' spark-1.5.1-bin-hadoop2.6/bin/spark-shell",
  "cpus": 0.2,
  "mem": 256,
  "ports": [
    0
  ],
  "instances": 1,
  "env": {
    "MESOS_NATIVE_JAVA_LIBRARY": "/usr/lib/libmesos.so",
    "SPARK_EXECUTOR_URI": "http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz",
    "MASTER": "mesos://zk://zk.cluster:2181/mesos"
  },
  "uris": [
    "http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz",
    "https://github.com/yudai/gotty/releases/download/v0.0.12/gotty_linux_amd64.tar.gz"
  ]
}

Here you might need to update MESOS_NATIVE_JAVA_LIBRARY, as its value depends on your system and installation. MASTER should point to your Mesos master. You might also want to change the port value if you use any kind of service discovery with Marathon.
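If you are unsure where libmesos is installed on your slaves, a quick search in the usual library locations should find it (the paths below are just common defaults):

find /usr/lib /usr/local/lib -name 'libmesos*' 2>/dev/null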

gotty is used with the following keys:

  -w permits clients to write to the terminal (without it the shell would be read-only),
  -p $PORT0 makes gotty listen on the port assigned by Marathon,
  --title-format sets the browser tab title.

You can submit the app to Marathon using your preferred HTTP client; e.g. with httpie just run

http POST http://marathon.cluster/v2/apps < spark-shell.json
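If you prefer curl, the equivalent request against the same Marathon endpoint would be:

curl -X POST -H 'Content-Type: application/json' --data @spark-shell.json http://marathon.cluster/v2/apps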

After that you’ll be able to see the new app in the Marathon web UI, along with its URL once it has started. When you visit that URL you should see the Spark shell prompt.

You should understand all the potential security problems of exposing a remote session in a browser, so you might want to suspend or destroy the app in Marathon when you finish playing with spark-shell.
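Both operations go through the same Marathon REST API used above; suspending is just scaling the app down to zero instances (again assuming the marathon.cluster host from before):

# suspend: scale the app down to zero instances
curl -X PUT -H 'Content-Type: application/json' --data '{"instances": 0}' http://marathon.cluster/v2/apps/spark-shell

# destroy: remove the app completely
curl -X DELETE http://marathon.cluster/v2/apps/spark-shell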

You might also notice that it works fine for a single browser session, but when two sessions are opened simultaneously the second one fails with an error like:

Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@8ddbb68c
...

That happens because gotty starts a separate spark-shell process for each connection, and spark-shell creates a metastore_db in its working directory, so the second spark-shell instance conflicts with the already running one.

To solve this, let’s create a random working directory for each session; the following trick worked well for me:

"cmd": "./gotty -w -p $PORT0 --title-format 'Spark Shell' sh -c 'dir=$RANDOM; mkdir $dir && cd $dir && ../spark-1.5.1-bin-hadoop2.6/bin/spark-shell'"

So the complete example is:

spark-shell.json
{
  "id": "/spark-shell",
  "cmd": "./gotty -w -p $PORT0 --title-format 'Spark Shell' sh -c 'dir=$RANDOM; mkdir $dir && cd $dir && ../spark-1.5.1-bin-hadoop2.6/bin/spark-shell'",
  "cpus": 0.2,
  "mem": 256,
  "ports": [
    0
  ],
  "instances": 1,
  "env": {
    "MESOS_NATIVE_JAVA_LIBRARY": "/usr/lib/libmesos.so",
    "SPARK_EXECUTOR_URI": "http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz",
    "MASTER": "mesos://zk://zk.cluster:2181/mesos"
  },
  "uris": [
    "http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz",
    "https://github.com/yudai/gotty/releases/download/v0.0.12/gotty_linux_amd64.tar.gz"
  ]
}
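One caveat with this cmd: $RANDOM is a bash/ksh feature rather than POSIX, so if /bin/sh on your slaves is a stricter shell such as dash, the variable expands to an empty string. A mktemp-based variant (my own suggestion, not part of the original setup) sidesteps that and also avoids name collisions between sessions:

"cmd": "./gotty -w -p $PORT0 --title-format 'Spark Shell' sh -c 'dir=$(mktemp -d ./session.XXXXXX) && cd $dir && ../spark-1.5.1-bin-hadoop2.6/bin/spark-shell'"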

That’s all you need to start playing with spark-shell on Mesos. From here you can tweak the settings to your needs, like enabling coarse-grained mode for Spark, adding libraries, and so on.
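For instance, coarse-grained mode should be a matter of passing the corresponding Spark property to spark-shell in the cmd (spark.mesos.coarse is the property name used by Spark 1.5):

"cmd": "./gotty -w -p $PORT0 --title-format 'Spark Shell' spark-1.5.1-bin-hadoop2.6/bin/spark-shell --conf spark.mesos.coarse=true"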