Quick Way to Play With Spark

Apache Spark continues to grow in importance, with two major announcements this week:

  1. Apache Mahout is being ported from old-fashioned MapReduce over to Spark.
  2. Spark SQL has been merged into the main Spark repository. This allows SQL querying against RDDs without having to use Shark/Hive, which doesn't natively talk to RDDs.

Neither of these has been released yet, but if you want a head start playing with Spark, without paying for cloud resources and without the trouble of installing Hadoop at home, you can use the pre-installed Hadoop VM that Cloudera makes freely available for download. Below are the steps.

  1. Because the VM is 64-bit, your computer must be configured to run 64-bit VMs. This is usually the default for computers made since 2012, but for computers made between 2006 and 2011, you will probably have to enable hardware virtualization (Intel VT-x or AMD-V) in the BIOS settings.
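     If your current machine already runs Linux, a quick way to check is to look for the virtualization flags in /proc/cpuinfo (vmx is Intel VT-x, svm is AMD-V); a count of 0 means the feature is absent or disabled in the BIOS:

     ```shell
     # Count cpuinfo lines advertising hardware virtualization support.
     # 0 means VT-x/AMD-V is unavailable or disabled in the BIOS.
     grep -Ec 'vmx|svm' /proc/cpuinfo || true
     ```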
  2. Install VirtualBox from https://www.virtualbox.org/wiki/Downloads. (I use VirtualBox since it's more freely licensed than VMware Player.)
  3. Download and unzip the 2GB QuickStart VM for VirtualBox from Cloudera.
  4. Launch VirtualBox and from its drop-down menu select File->Import Appliance, then import the VM you just unzipped.
  5. Click the Start icon to launch the VM.
  6. From the VM Window's drop-down menu, select Devices->Shared Clipboard->Bidirectional
  7. From the CentOS drop-down menu, select System->Shutdown->Restart. I have found this to be necessary to get HDFS to start working the first time on this particular VM.
  8. The VM comes with OpenJDK 1.6, but Spark and Scala need Oracle JDK 1.7, which is also supported by Cloudera 4.4. From within CentOS, launch Firefox and navigate to http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html. Click the "Accept License Agreement" radio button, then download jdk-7u51-linux-x64.rpm (the 64-bit RPM), opting to "save" rather than "open" it, i.e., save it to ~/Downloads.
  9. From the CentOS drop-down menu, select Applications->System Tools->Terminal and then run:
    sudo rpm -Uvh ~/Downloads/jdk-7u51-linux-x64.rpm
    echo "export JAVA_HOME=/usr/java/latest" >>~/.bashrc
    echo "export PATH=\$JAVA_HOME/bin:\$PATH" >>~/.bashrc
    source ~/.bashrc
    wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating.tgz
    tar xzvf spark-0.9.0-incubating.tgz
    cd spark-0.9.0-incubating
    SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly

That sbt assembly command also has the nice side effect of installing Scala and sbt for you, so you can start writing Scala code that uses Spark instead of just using the Spark shell.
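One nice property of the RDD API is that it mirrors Scala's own collection API, so you can prototype a transformation on a plain List and port it to an RDD largely unchanged. A minimal sketch (hypothetical input; in the Spark shell you would start from `sc.textFile(...)` instead of a List, and use `reduceByKey` where the comment notes):

```scala
// Word count on a plain Scala collection; the flatMap/map calls are
// identical on a Spark RDD, and groupBy+sum becomes reduceByKey(_ + _).
val lines = List("to be or not to be", "that is the question")

val counts = lines
  .flatMap(_.split("\\s+"))          // same call on an RDD
  .map(word => (word, 1))            // same call on an RDD
  .groupBy(_._1)                     // on an RDD: reduceByKey(_ + _)
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }

println(counts)                      // e.g. Map(be -> 2, to -> 2, ...)
```

Once the assembly above finishes, pasting the transformation pipeline into `bin/spark-shell` against an RDD is the natural next step.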