Hadoop OSCON demo bundle
Deploys a single Hadoop master (namenode + resourcemanager) node with the following ancillary services: one or more hadoop-slaves (datanode + nodemanager), Hive2, ElasticSearch, and MySQL.
How to Deploy:
This bundle can be deployed in two ways:
For bare-metal deployments, no constraints are specified; the bundle simply uses whatever hardware you have. If you're using an Orange Box, use this.
For public clouds, constraints are specified to request large instances; see the bottom of this README for the details:
Hadoop-Master: The Hadoop master (namenode + resourcemanager) manages HDFS metadata and allocates resources for all running MapReduce jobs within the Hadoop cluster. Jobs are distributed among the hadoop-slave nodes, with hadoop-master orchestrating the jobs and allocating worker resources.
Hadoop-Slave: Hadoop slaves (datanode + nodemanager) are the data storage and computation (worker) nodes. They are responsible for loading MapReduce applications and running the actual computation units, as well as storing the resulting output.
Hive2: Hive2 is a big-data warehousing unit, which ships with HiveQL, an SQL-like language. You can read more about HiveQL here.
MySQL: MySQL is a popular open-source SQL database. The relation in this deployment configuration sets up MySQL as the metadata manager for Hive's data warehouse.
ElasticSearch: ElasticSearch is a distributed, real-time search and analytics engine. It is used to index the data warehoused in Hive2 and Hadoop. Note: ElasticSearch's configuration was taken from the upstream ES documentation here
Working with the cluster
First, check that the Hadoop master node can talk to the slaves:
juju ssh hadoop-master/0 hdfs dfsadmin -report
Now we will run a terasort. This can be done via juju run, or by ssh'ing into the master node (we will use juju ssh here):
You will run a teragen followed by a terasort. We recommend more than 4 GB of memory per Hadoop node. The process should complete within minutes, though depending on hardware configuration it can take up to 20 minutes.
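As a sketch, the teragen/terasort steps might look like the following. The examples jar path and the HDFS output paths are assumptions and may differ on your deployment; adjust them to match your cluster.

```shell
# Run the TeraSort benchmark on the master node over one ssh session.
# NOTE: the jar path below is typical for Apache Hadoop 2.x installs,
# but is an assumption -- verify it on your hadoop-master unit.
juju ssh hadoop-master/0 <<'EOF'
EXAMPLES=/usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar

# Generate 10 million 100-byte rows (~1 GB) of input data
hadoop jar $EXAMPLES teragen 10000000 /user/ubuntu/teragen

# Sort the generated data
hadoop jar $EXAMPLES terasort /user/ubuntu/teragen /user/ubuntu/terasort

# Check that the output is correctly sorted
hadoop jar $EXAMPLES teravalidate /user/ubuntu/terasort /user/ubuntu/teravalidate
EOF
```

The same commands can be issued with `juju run --unit hadoop-master/0 '<command>'` if you prefer not to open an interactive session.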
Hive's role as a warehouse manager can be validated through an ssh session to your hive2-server node:
juju ssh hive-server/0 hive
Once you are in the interactive Hive shell, we'll do a few things: create an example database, then create a table within that database.
CREATE DATABASE example;
USE example;
CREATE TABLE page_view(
    viewTime INT,
    userid BIGINT,
    page_url STRING,
    referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Using Hive and ElasticSearch
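One way to wire the two together is the elasticsearch-hadoop integration, which exposes an ElasticSearch index as a Hive table via its `EsStorageHandler`. The sketch below assumes the elasticsearch-hadoop jar is available on the hive-server unit; the jar path, index name (`demo/page_view`), and `es.nodes` address are placeholders you would replace with values from your deployment.

```sql
-- Assumed path to the elasticsearch-hadoop jar; adjust for your install.
ADD JAR /usr/lib/hive/lib/elasticsearch-hadoop.jar;

-- External table backed by an ElasticSearch index instead of HDFS.
CREATE EXTERNAL TABLE es_page_view(
    viewTime INT,
    userid BIGINT,
    page_url STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
    'es.resource' = 'demo/page_view',      -- index/type to write into
    'es.nodes'    = 'localhost:9200');     -- your elasticsearch unit's address

-- Index the warehoused data: writes flow from Hive into ElasticSearch.
INSERT OVERWRITE TABLE es_page_view
SELECT viewTime, userid, page_url FROM page_view;
```

Once indexed, the data can be queried directly through the ElasticSearch search API as well as from Hive.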
Recommended Minimum Deployment Constraints (starting values; these can and will grow as demand increases)
This assumes 1 master and 3 slave nodes. You can set these constraints by hand, or just use the hadoop-es/constraints bundle (see the top of this README).
juju set-constraints hadoop-master "mem=16G cpu-cores=8 root-disk=1T"
juju set-constraints hadoop-slavecluster "mem=4G cpu-cores=4 root-disk=1T"
juju set-constraints hive-server "mem=4G cpu-cores=4 root-disk=1T"
juju set-constraints elasticsearch "mem=8G cpu-cores=4"