I’d like to share a small trick for running spark-shell
as a web app using Mesos and the Marathon framework.
This might be useful for debugging or just for trying out some Spark code, and you don’t need to install Spark on the cluster to do it.
All you need is a Mesos cluster, the Marathon framework running on top of Mesos, and a Java runtime installed on the Mesos slaves.
Sorry guys, I’m not going to discuss how to set up a Mesos cluster here, that would be another story. I’ll just assume that you already have a running environment with the mentioned components.
Another secret ingredient is the gotty tool, which simply wraps a terminal session into a web app. So the idea is to run spark-shell
with gotty,
which in turn is run by Marathon. Sounds simple? Here’s how it can be done.
Let’s create a Marathon app definition in a spark-shell.json
file. You can read more about app definitions here.
spark-shell.json
{
  "id": "/spark-shell",
  "cmd": "./gotty -w -p $PORT0 --title-format 'Spark Shell' spark-1.5.1-bin-hadoop2.6/bin/spark-shell",
  "cpus": 0.2,
  "mem": 256,
  "ports": [
    0
  ],
  "instances": 1,
  "env": {
    "MESOS_NATIVE_JAVA_LIBRARY": "/usr/lib/libmesos.so",
    "SPARK_EXECUTOR_URI": "http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz",
    "MASTER": "mesos://zk://zk.cluster:2181/mesos"
  },
  "uris": [
    "http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz",
    "https://github.com/yudai/gotty/releases/download/v0.0.12/gotty_linux_amd64.tar.gz"
  ]
}
Here you might need to update MESOS_NATIVE_JAVA_LIBRARY; its value depends on your system and installation. MASTER should point to your Mesos master. You might also want to change the port value if you use any kind of service discovery with Marathon.
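If you’re not sure where the Mesos native library lives on your slaves, a quick check like this can help (the paths are just the usual suspects for a package-based install, not a guarantee):

ls /usr/lib/libmesos* /usr/local/lib/libmesos* 2>/dev/null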
gotty is used with the following flags:
- -w allows writes to the terminal session, so you can interact with the shell.
- -p $PORT0 sets the port gotty listens on; the $PORT0 variable is assigned by Marathon.
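Before wiring gotty into Marathon, you can sanity-check it locally. A minimal try-out, assuming you’ve extracted the gotty binary into the current directory:

./gotty -w -p 8080 --title-format 'Test Shell' bash

Open http://localhost:8080 in a browser and you should get an interactive bash session.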
You can submit the app to Marathon using your preferred HTTP client tool, e.g. with httpie just run
http POST http://marathon.cluster/v2/apps < spark-shell.json
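If you prefer curl, an equivalent request would be:

curl -X POST -H 'Content-Type: application/json' -d @spark-shell.json http://marathon.cluster/v2/apps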
After that you’ll be able to see the new app in the Marathon web UI, along with its URL once it has started. When you visit that URL you should see the Spark shell prompt.
Be aware of the potential security problems of running a remote shell session in a browser, so you might want to suspend or destroy the app in Marathon when you finish playing with spark-shell.
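With httpie that could look like the following; suspending just means scaling the app to zero instances, while DELETE removes it completely:

http PUT http://marathon.cluster/v2/apps/spark-shell instances:=0
http DELETE http://marathon.cluster/v2/apps/spark-shell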
You might also notice that it works fine for one browser session, but when two sessions are open simultaneously, the second one fails with an error like:
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@8ddbb68c
...
That happens because gotty starts a separate spark-shell process for each connection, and spark-shell
creates a metastore_db in its working directory, so the second spark-shell
instance conflicts with the already running one.
To solve this, let’s create a random working directory for each session; the following trick worked well for me:
"cmd": "./gotty -w -p $PORT0 --title-format 'Spark Shell' sh -c 'dir=$RANDOM; mkdir $dir && cd $dir && ../spark-1.5.1-bin-hadoop2.6/bin/spark-shell'"
So the complete example is
spark-shell.json
{"id": "/spark-shell",
"cmd": "./gotty -w -p $PORT0 --title-format 'Spark Shell' sh -c 'dir=$RANDOM; mkdir $dir && cd $dir && ../spark-1.5.1-bin-hadoop2.6/bin/spark-shell'",
"cpus": 0.2,
"mem": 256,
"ports": [
0
,
]"instances": 1,
"env": {
"MESOS_NATIVE_JAVA_LIBRARY": "/usr/lib/libmesos.so",
"SPARK_EXECUTOR_URI": "http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz",
"MASTER": "mesos://zk://zk.cluster:2181/mesos"
,
}"uris": [
"http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz",
"https://github.com/yudai/gotty/releases/download/v0.0.12/gotty_linux_amd64.tar.gz"
] }
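Since it’s easy to break JSON while editing, it doesn’t hurt to validate the file before submitting (assuming Python is available on your machine):

python -m json.tool spark-shell.json

And because the app already exists in Marathon at this point, update it with PUT instead of POST:

http PUT http://marathon.cluster/v2/apps/spark-shell < spark-shell.json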
That’s all you need to start playing with spark-shell on Mesos. From here you can tweak the settings for your needs, like enabling coarse-grained mode for Spark, adding libraries, and so on.
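For example, to enable coarse-grained mode you could pass the corresponding property straight to spark-shell (spark.mesos.coarse is the relevant setting for Spark 1.5; treat this cmd as a sketch):

"cmd": "./gotty -w -p $PORT0 --title-format 'Spark Shell' sh -c 'dir=$RANDOM; mkdir $dir && cd $dir && ../spark-1.5.1-bin-hadoop2.6/bin/spark-shell --conf spark.mesos.coarse=true'"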