
Spark Standalone with HDFS

Overview

This bundle is a 4-node cluster designed to scale out. Built around Apache
Big Data components, it contains the following units:

  • 1 HDFS Master
  • 1 Compute Slave
  • 1 Spark
  • 1 Plugin (colocated on the Spark unit)
  • 1 Benchmark GUI

This bundle deploys Spark and HDFS in your environment. Spark is
preconfigured for standalone mode (spark_execution_mode=standalone), which
runs a single Spark master and worker with as many executor threads as there
are logical cores on the host.
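
The execution mode is an ordinary charm config option, so it can be inspected
or changed after deployment. A minimal sketch, assuming the juju 1.x CLI used
elsewhere in this README:

juju get spark                                   # inspect config, including spark_execution_mode
juju set spark spark_execution_mode=standalone   # (re)assert standalone mode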

This is an ideal deployment if YARN Resource Management and associated Node
Managers are not needed. For a Spark deployment that does utilize YARN in
addition to HDFS, see the
apache-hadoop-spark bundle.

Usage

Deploy this bundle using juju-quickstart:

sudo add-apt-repository ppa:juju/stable
sudo apt-get update
sudo apt-get install juju-core juju-quickstart
juju quickstart apache-hdfs-spark-standalone

See juju quickstart --help for deployment options, including machine
constraints and how to deploy a locally modified version of this bundle's
bundle.yaml.
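
For example, to deploy a locally modified copy of the bundle, quickstart can
be pointed at the file directly (a sketch; adjust the path to wherever your
edited copy lives):

juju quickstart ./bundle.yaml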

Testing the deployment

Smoke test HDFS admin functionality

Once the deployment is complete and the cluster is running, SSH to the HDFS
Master unit:

juju ssh hdfs-master/0

As the ubuntu user, create a temporary directory on the Hadoop file system.
The steps below verify HDFS functionality:

hdfs dfs -mkdir -p /tmp/hdfs-test
hdfs dfs -chmod -R 777 /tmp/hdfs-test
hdfs dfs -ls /tmp # verify the newly created hdfs-test subdirectory exists
hdfs dfs -rm -R /tmp/hdfs-test
hdfs dfs -ls /tmp # verify the hdfs-test subdirectory has been removed
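
# Optionally, round-trip a small file for a deeper check (an extra step, not
# part of the documented test):
echo "hdfs smoke test" > /tmp/smoke.txt
hdfs dfs -put /tmp/smoke.txt /tmp/smoke.txt
hdfs dfs -cat /tmp/smoke.txt   # should print "hdfs smoke test"
hdfs dfs -rm /tmp/smoke.txt
rm /tmp/smoke.txt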
exit

Smoke test Spark

SSH to the Spark unit and run the SparkPi demo as follows:

juju ssh spark/0
~/sparkpi.sh
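
# For reference, the demo can also be submitted by hand. A sketch, assuming
# SPARK_HOME points at a Spark 1.x install and the standalone master listens
# on this unit at the default port 7077 (both may differ):
spark-submit --class org.apache.spark.examples.SparkPi \
  --master spark://$(hostname -f):7077 \
  $SPARK_HOME/lib/spark-examples*.jar 10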
exit

Scale Out Usage

This bundle was designed to scale out. To increase the number of compute
slaves, add units to the compute-slave service. To add one unit:

juju add-unit compute-slave

Or you can add multiple units at once:

juju add-unit -n4 compute-slave
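
New units appear in juju status as they come online, and you can scale back
in the same way; a sketch (the unit name to remove depends on what juju
assigned):

juju status compute-slave          # watch new units reach the started state
juju remove-unit compute-slave/4   # example: remove one unit to scale back in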

Benchmarking

Run the Spark Bench benchmarking
suite to gauge the performance of your environment. Each enabled test is a
separate action and can be called as follows:

$ juju action do spark/0 pagerank
Action queued with id: 88de9367-45a8-4a4b-835b-7660f467a45e
$ juju action fetch --wait 0 88de9367-45a8-4a4b-835b-7660f467a45e
results:
  meta:
    composite:
      direction: asc
      units: secs
      value: "77.939000"
    raw: |
      PageRank,2015-12-10-23:41:57,77.939000,71.888079,.922363,0,PageRank-MLlibConfig,,,,,10,12,,200000,4.0,1.3,0.15
    start: 2015-12-10T23:41:34Z
    stop: 2015-12-10T23:43:16Z
  results:
    duration:
      direction: asc
      units: secs
      value: "77.939000"
    throughput:
      direction: desc
      units: x/sec
      value: ".922363"
status: completed
timing:
  completed: 2015-12-10 23:43:59 +0000 UTC
  enqueued: 2015-12-10 23:42:10 +0000 UTC
  started: 2015-12-10 23:42:15 +0000 UTC

Valid action names at this time are listed below; a sketch for queueing them
all in one go follows the list:

  • logisticregression
  • matrixfactorization
  • pagerank
  • sql
  • streaming
  • svdplusplus
  • svm
  • trianglecount
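
To queue every benchmark, a simple shell loop over the names above works;
each action runs asynchronously, so fetch results by the returned ids as
shown earlier:

for action in logisticregression matrixfactorization pagerank sql streaming \
              svdplusplus svm trianglecount; do
  juju action do spark/0 $action
done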

This bundle includes the Benchmarking GUI so you can easily see results of
your benchmark runs. To access this GUI, do the following:

juju set benchmark-gui juju-pass=`juju api-info password`
juju expose benchmark-gui

Then visit http://BENCHMARK-GUI-IP in a browser.
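
The GUI's public address can be read from juju status once the service is
exposed; for example:

juju status benchmark-gui | grep public-address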
