
spark-shell without Spark

spark-shell is an interactive shell that comes with the Spark distribution. The shell is useful for learning the API, quick experiments, prototyping, etc. But to do all that you don’t need a cluster, or even a Spark distribution installed.

All you need is basic knowledge of Scala and sbt. Just create a build.sbt file with the following content:

build.sbt
scalaVersion := "2.11.8"

val sparkVersion = "2.0.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion
)

It defines two Spark components as dependencies: core and sql; we will use sql later for testing. You can add other Spark libraries that you need, such as MLlib or Streaming, as sketched below.
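The artifact names follow the same spark-&lt;module&gt; pattern as core and sql, so adding MLlib and Streaming to build.sbt would look roughly like this:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-mllib"     % sparkVersion,  // MLlib: machine learning
  "org.apache.spark" %% "spark-streaming" % sparkVersion   // Spark Streaming
)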

And now you’re all set to start experimenting with Spark.

First, run

$ sbt console

This command starts the Scala REPL with the specified dependencies on the classpath, so you can use Spark classes right in the REPL.
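As an optional convenience, sbt can also evaluate code for you every time the console starts, via its initialCommands setting. A minimal sketch to add to build.sbt, with whatever imports you find yourself typing repeatedly:

initialCommands in console := """
  import org.apache.spark.sql.SparkSession
"""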

Let’s start with setting up a SparkSession.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.
  builder().
  appName("Console Demo").
  master("local[*]").
  getOrCreate()

SparkSession is a new class in Spark 2.0 that gives you unified access to Spark SQL functionality.

The trick here is to set the master URL to “local” to run Spark locally; local[*] means that Spark will use as many worker threads as there are logical cores on your machine.
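Other local master URLs follow the same pattern. As a sketch, this is what a session pinned to exactly two worker threads would look like; run it instead of the one above, since getOrCreate() otherwise just hands back the already-running session:

val sparkTwoThreads = SparkSession.
  builder().
  appName("Console Demo").
  master("local[2]").  // "local" = 1 thread, "local[N]" = N threads, "local[*]" = all cores
  getOrCreate()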

And now you can run some Spark code.

val events = spark.sparkContext.parallelize(
  """{"action":"create","timestamp":"2016-11-06T00:01:17Z"}""" :: Nil)

val df = spark.read.json(events)

df.show()

By default you will see tons of Spark logs in the console, but at the end you will see the result:

+------+--------------------+
|action|           timestamp|
+------+--------------------+
|create|2016-11-06T00:01:17Z|
+------+--------------------+
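If the log output is too noisy, you can turn it down from the session itself; and since this is a full SparkSession, you can also register the DataFrame as a temporary view and query it with SQL. A quick sketch of both:

// keep only warnings and errors from now on
spark.sparkContext.setLogLevel("WARN")

// run SQL against the DataFrame created above
df.createOrReplaceTempView("events")
spark.sql("SELECT action FROM events").show()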

If you don’t need Spark SQL functionality, you can create just a SparkContext with a local master.

import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf().
  setAppName("Console Demo").
  setMaster("local[*]")

val sc = new SparkContext(conf)
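With only a SparkContext you still have the full RDD API; a quick sanity check that the context works:

val numbers = sc.parallelize(1 to 10)
numbers.sum()    // Double = 55.0
numbers.count()  // Long = 10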

Some Spark features might also require setting additional Spark options, depending on the feature.
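Such options can be set on the SparkConf before the context is created. As a sketch, spark.ui.enabled is one such option; disabling the web UI is handy in a throwaway console session:

// set options on the conf before calling new SparkContext(conf);
// only one SparkContext can run per JVM
val conf = new SparkConf().
  setAppName("Console Demo").
  setMaster("local[*]").
  set("spark.ui.enabled", "false")  // example option: don’t start the Spark web UI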

And that’s all you need to do to run Spark locally in the sbt console.

Have fun playing with Spark in the console!
