Quick Way to Play With Spark

Apache Spark continues to grow in importance, with two major announcements this week:

  1. Apache Mahout is being ported from old-fashioned MapReduce over to Spark.
  2. Spark SQL has been merged into the main Spark repository. This allows SQL querying against RDDs without having to use Shark/Hive, which doesn't natively talk to RDDs.

Neither of these has been released yet, but if you want a head start playing with Spark, without paying for cloud resources and without the trouble of installing Hadoop at home, you can use the pre-installed Hadoop VM that Cloudera makes freely available for download. Below are the steps.

  1. Because the VM is 64-bit, your computer must be configured to run 64-bit VMs. This is usually the default for computers made since 2012, but for computers made between 2006 and 2011, you will probably have to enable hardware virtualization (Intel VT-x or AMD-V) in the BIOS settings.
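     If your current machine already runs Linux, a quick way to check is to look for the virtualization flags in /proc/cpuinfo (vmx is Intel VT-x, svm is AMD-V); a count of 0 means the feature is absent or disabled in the BIOS:

     ```shell
     # Count cpuinfo lines advertising hardware virtualization support.
     # 0 means VT-x/AMD-V is unavailable or disabled in the BIOS.
     grep -Ec 'vmx|svm' /proc/cpuinfo || true
     ```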
  2. Install VirtualBox from https://www.virtualbox.org/wiki/Downloads. (I use VirtualBox since it's more freely licensed than VMware Player.)
  3. Download and unzip the 2GB QuickStart VM for VirtualBox from Cloudera.
  4. Launch VirtualBox and from its drop-down menu select File->Import Appliance, then import the VM you just unzipped.
  5. Click the Start icon to launch the VM.
  6. From the VM Window's drop-down menu, select Devices->Shared Clipboard->Bidirectional
  7. From the CentOS drop-down menu, select System->Shutdown->Restart. I have found this to be necessary to get HDFS to start working the first time on this particular VM.
  8. The VM comes with OpenJDK 1.6, but Spark and Scala need Oracle JDK 1.7, which is also supported by Cloudera 4.4. From within CentOS, launch Firefox and navigate to http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html. Click the "Accept License Agreement" radio button, then download jdk-7u51-linux-x64.rpm (the 64-bit RPM), opting to "save" rather than "open" it, i.e., save it to ~/Downloads.
  9. From the CentOS drop-down menu, select Applications->System Tools->Terminal and then run:
    sudo rpm -Uvh ~/Downloads/jdk-7u51-linux-x64.rpm
    echo "export JAVA_HOME=/usr/java/latest" >>~/.bashrc
    echo "export PATH=\$JAVA_HOME/bin:\$PATH" >>~/.bashrc
    source ~/.bashrc
    wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating.tgz
    tar xzvf spark-0.9.0-incubating.tgz
    cd spark-0.9.0-incubating
    SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly

That sbt assembly command also has the nice side effect of installing Scala and sbt for you, so you can start writing Scala code that uses Spark instead of just using the Spark shell.
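One nice property of the RDD API is that it mirrors Scala's own collection API, so you can prototype a transformation on a plain List and port it to an RDD largely unchanged. A minimal sketch (hypothetical input; in the Spark shell you would start from `sc.textFile(...)` instead of a List, and use `reduceByKey` where the comment notes):

```scala
// Word count on a plain Scala collection; the flatMap/map calls are
// identical on a Spark RDD, and groupBy+sum becomes reduceByKey(_ + _).
val lines = List("to be or not to be", "that is the question")

val counts = lines
  .flatMap(_.split("\\s+"))          // same call on an RDD
  .map(word => (word, 1))            // same call on an RDD
  .groupBy(_._1)                     // on an RDD: reduceByKey(_ + _)
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }

println(counts)                      // e.g. Map(be -> 2, to -> 2, ...)
```

Once the assembly above finishes, pasting the transformation pipeline into `bin/spark-shell` against an RDD is the natural next step.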