hadoop-es constraints

Supports: trusty

Hadoop OSCON demo bundle

Deploys a single Hadoop master (namenode + resourcemanager) node along with the following ancillary services: one to many hadoop-slaves (datanode + nodemanager), hive2, elasticsearch, and mysql.

How to Deploy:

This bundle can be deployed in two ways:

Bare Metal

For bare-metal deployments we do not specify constraints; we just max out whatever hardware you have. If you're using an Orange Box, use this.

juju-quickstart bundle:~jorge/hadoop-es/baremetal

Public Clouds

For public clouds we specify constraints to get big boxes; see the bottom of this README for the details:

juju-quickstart bundle:~jorge/hadoop-es/constraints
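
Either way, you can watch the services come up with:

watch juju status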

Services Description

Hadoop-Master: The Hadoop master (namenode + resourcemanager) is the data manager and resource manager for all running map-reduce jobs within the Hadoop cluster. It distributes these jobs among the hadoop-slave nodes, orchestrating the jobs and allocating worker resources.

Hadoop-Slave: The Hadoop slaves (datanode + nodemanager) are the data storage and computation (worker) nodes. They are responsible for loading map-reduce applications, running the actual computation units, and storing the resulting output.
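
You can see which worker nodes have registered with the resourcemanager by running a standard Hadoop 2 command from the master unit (after juju ssh hadoop-master/0):

yarn node -list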

Hive2: Hive2 is a big-data warehousing unit, which ships with HiveQL, an SQL-like query language. You can read more about HiveQL here.

MySQL: MySQL is a popular open-source SQL database. The relation in this deployment configuration sets up MySQL as the metadata manager for Hive's data warehouse.
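
The bundle sets up this relation for you; doing it by hand would look roughly like the following (service names are the ones used in the constraints section below, and depending on the charm you may need to name the interface explicitly, e.g. mysql:db):

juju add-relation hive-server mysql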

ElasticSearch: ElasticSearch is a distributed, real-time search and analytics engine. It's used to index the data warehoused in Hive2 and Hadoop. Note: ElasticSearch's configuration was taken from the ES upstream documentation here.
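
A quick way to confirm the search cluster is healthy, assuming the default HTTP port of 9200 and substituting the address of your elasticsearch unit:

curl 'http://<elasticsearch-host>:9200/_cluster/health?pretty'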

Working with the cluster

Validating Hadoop

First, check to ensure the hadoop master node can talk to the slaves:

juju ssh hadoop-master/0
hdfs dfsadmin -report
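
The report lists each datanode; as a sanity check, the number of live datanodes should match the number of slaves you deployed (the exact wording of the report varies a little between Hadoop versions):

hdfs dfsadmin -report | grep -i datanodes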

Next, run a terasort. This can be done via juju run, as well as by ssh'ing into the master node; here we will use juju run:

juju run --unit hadoop-master/0 /usr/local/hadoop/terasort.sh

This runs a teragen followed by a terasort. It is recommended that you have more than 4 GB of memory per Hadoop node. Depending on the hardware configuration, the job should complete within a few minutes, though it can take up to 20 minutes.

Validating Hive2

Hive as a warehouse manager can be validated through an ssh session with your hive-server node.

juju ssh hive-server/0
hive

Once you are in the interactive Hive shell, create an example database and a table within that database:

 CREATE DATABASE example;
 USE example;  
 CREATE TABLE page_view(viewTime INT, userid BIGINT,
            page_url STRING, referrer_url STRING,
            ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 STORED AS SEQUENCEFILE;
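
To confirm both were created, still inside the Hive shell:

 SHOW DATABASES;
 SHOW TABLES;
 DESCRIBE page_view;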

Using Hive and ElasticSearch
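
The bundle's relations take care of indexing Hive data into ElasticSearch; if you want to experiment by hand, the usual route is the elasticsearch-hadoop connector, which exposes an ElasticSearch index to Hive as an external table. A minimal sketch, assuming the connector jar is present on the hive-server unit (the jar path, index name, and host are placeholders; the storage-handler class and table properties are the ones documented by elasticsearch-hadoop, not something this bundle ships by default):

 ADD JAR /path/to/elasticsearch-hadoop.jar;
 CREATE EXTERNAL TABLE page_view_es(viewTime INT, userid BIGINT, page_url STRING)
 STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
 TBLPROPERTIES('es.resource' = 'demo/page_view',
               'es.nodes' = '<elasticsearch-host>:9200');
 INSERT OVERWRITE TABLE page_view_es
 SELECT viewTime, userid, page_url FROM page_view;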

Recommended Minimum Deployment Constraints (starting values; these can grow as demand increases)

These assume 1 master and 3 slave nodes. You can set the constraints by hand or just use the hadoop-es/constraints bundle (see the top of this README).

  • hadoop-master: juju set-constraints hadoop-master "mem=16G cpu-cores=8 root-disk=1T"
  • hadoop-slave: juju set-constraints hadoop-slavecluster "mem=4G cpu-cores=4 root-disk=1T"
  • Hive2: juju set-constraints hive-server "mem=4G cpu-cores=4 root-disk=1T"
  • ElasticSearch: juju set-constraints elasticsearch "mem=8G cpu-cores=4"
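
To grow the cluster later, add more slave units (using the service name from the list above):

juju add-unit hadoop-slavecluster -n 2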
