Apache Hadoop #2
Description
Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.
- Tags: applications
Overview
What is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Apache Hadoop 2.4.1 includes significant improvements over the previous stable release (hadoop-1.x).
Here is a short overview of the improvements to both HDFS and MapReduce.
- HDFS Federation: In order to scale the name service horizontally, federation uses multiple independent Namenodes/Namespaces. The Namenodes are federated; that is, they are independent and do not require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handle commands from the Namenodes. More details are available in the HDFS Federation document:
  http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/Federation.html
- MapReduce NextGen (aka YARN, aka MRv2): The new architecture, introduced in hadoop-0.23, divides the two major functions of the JobTracker, resource management and job life-cycle management, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the sense of classic MapReduce jobs or a DAG of such jobs. The ResourceManager and the per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks (see the short CLI sketch below). More details are available in the YARN document:
  http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/YARN.html
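To see these daemons at work on a running cluster, the YARN command-line tools can list registered NodeManagers and running applications. This is a minimal sketch; it assumes you are on a unit where the hadoop binaries are on the PATH::
# list the NodeManagers registered with the ResourceManager
yarn node -list
# list applications currently known to the ResourceManager
yarn application -list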
Usage
This charm supports the following Hadoop roles:
- HDFS: namenode and datanode
- YARN: ResourceManager, JobHistory, and NodeManager
These roles support deploying Hadoop in a number of configurations.
Simple Usage: Combined HDFS and YARN
In this configuration, the YARN ResourceManager is deployed on the same service units as the HDFS namenode, and the HDFS datanodes also run the YARN NodeManager::
juju deploy hadoop yarn-hdfs-master
juju deploy hadoop compute-nodes
juju add-relation yarn-hdfs-master:namenode compute-nodes:datanode
juju add-relation yarn-hdfs-master:resourcemanager compute-nodes:nodemanager
juju add-unit -n 3 compute-nodes
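Once the units have settled, a quick smoke test is to run one of the example jobs that ship with Hadoop from the master unit. This is a sketch only; the jar path assumes the charm installs Hadoop 2.4.1 under /usr/local/hadoop::
juju ssh yarn-hdfs-master/0
# estimate pi with 4 map tasks of 100 samples each
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 4 100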
Scale Out Usage: Separate HDFS and YARN
In this configuration, the HDFS and YARN deployments operate on different service units as separate services::
juju deploy hadoop hdfs-master-server
juju deploy hadoop compute-nodes
juju deploy hadoop yarn-master-server
juju deploy hadoop hadoop-client
juju add-unit -n 3 compute-nodes
juju add-relation hdfs-master-server:namenode compute-nodes:datanode
juju add-relation yarn-master-server:mapred-namenode hdfs-master-server:namenode
juju add-relation compute-nodes:mapred-namenode hdfs-master-server:namenode
juju add-relation yarn-master-server:resourcemanager compute-nodes:nodemanager
juju add-relation hadoop-client:mapred-namenode hdfs-master-server:namenode
juju add-relation yarn-master-server:resourcemanager hadoop-client:nodemanager
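With the relations in place, the hadoop-client units provide a convenient place to interact with the cluster. A minimal sketch, assuming HDFS is healthy and the hadoop binaries are on the client unit's PATH (the ubuntu user is an assumption based on default Juju machines)::
juju ssh hadoop-client/0
# create a home directory in HDFS and confirm it is visible
hadoop fs -mkdir -p /user/ubuntu
hadoop fs -ls /user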
In the long term, Juju should support improved placement of services to better support this type of deployment. This would allow MapReduce services to be deployed onto machines with more processing power and HDFS services onto machines with larger storage.
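In the meantime, standard Juju machine constraints can approximate this placement. The constraint values below are illustrative only::
# ask for compute nodes with more cores and memory
juju deploy --constraints "cpu-cores=8 mem=16G" hadoop compute-nodes
# ask for an HDFS master with a larger root disk
juju deploy --constraints "root-disk=1T" hadoop hdfs-master-server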
To deploy a Hadoop service with an elasticsearch service::
# deploy ElasticSearch locally:
juju deploy elasticsearch elasticsearch
# the elasticsearch-hadoop.jar file will be added to the LIBJARS path
# it is recommended to use the hadoop -libjars option to include the elasticsearch-hadoop jar file
juju add-unit elasticsearch
# deploy a Hadoop service using any of the scenarios mentioned above
# associate the Hadoop master with elasticsearch
juju add-relation {hadoop master}:elasticsearch elasticsearch:client
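Once related, a MapReduce job can ship the connector jar with the job via the -libjars generic option (this requires the job's driver to use ToolRunner so that generic options are honored). The jar location, job jar, and main class below are hypothetical placeholders::
# LIBJARS, my-es-job.jar and com.example.EsJob are placeholders for your own job
export LIBJARS=/usr/local/hadoop/lib/elasticsearch-hadoop.jar
hadoop jar my-es-job.jar com.example.EsJob -libjars ${LIBJARS} /input /output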
Contact Information
Amir Sanjar <amir.sanjar@canonical.com>
Configuration
- dfs_block_size
- (int) The default block size for new files, in bytes. This charm defaults to 134217728 (128MB); increase this in larger deployments for better large data set performance.
- 134217728
- dfs_datanode_max_xcievers
- (int) The number of files that a datanode will serve at any one time. A Hadoop HDFS datanode has an upper bound on the number of files that it will serve at any one time. This defaults to 256 (which is low) in hadoop 1.x; this charm, however, increases it to 4096.
- 4096
- dfs_heartbeat_interval
- (int) Determines datanode heartbeat interval in seconds.
- 3
- dfs_namenode_handler_count
- (int) The number of server threads for the namenode. Increase this in larger deployments to ensure the namenode can cope with the number of datanodes that it has to deal with.
- 10
- dfs_namenode_heartbeat_recheck_interval
- (int) Determines the datanode heartbeat recheck interval, in milliseconds. It is used to calculate the final timeout after which the namenode considers a datanode dead. The calculation is as follows: 10 minutes 30 seconds = 2 x (dfs.namenode.heartbeat.recheck-interval = 5*60*1000 ms) + 10 * 1000 * (dfs.heartbeat.interval = 3 s).
- 300000
- dfs_replication
- (int) Default block replication. The actual number of replicas can be specified when the file is created. The default is used if replication is not specified at create time.
- 3
- hadoop_dir_base
- (string) The directory under which all other hadoop data is stored. Use this to take advantage of extra storage that might be available. You can change this in a running deployment, but all existing data in HDFS will become inaccessible; you can of course switch it back if you do this by mistake.
- /usr/local/hadoop/data
- hdfs_log_dir
- (string) Directory for storing the HDFS logs
- /var/log/hadoop/hdfs
- io_file_buffer_size
- (int) The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.
- 4096
- mapred_child_java_opts
- (string) Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to one gigabyte, pass a value of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc. The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.
- -Xmx200m
- mapred_job_tracker_handler_count
- (int) The number of server threads for the JobTracker. This should be roughly 4% of the number of tasktracker nodes.
- 10
- mapreduce_framework_name
- (string) Execution framework, set to Hadoop YARN. **DO NOT CHANGE**
- yarn
- mapreduce_reduce_shuffle_parallelcopies
- (int) The default number of parallel transfers run by reduce during the copy(shuffle) phase.
- 5
- mapreduce_task_io_sort_factor
- (int) The number of streams to merge at once while sorting files. This determines the number of open file handles.
- 10
- mapreduce_task_io_sort_mb
- (int) The memory limit, in megabytes, used while sorting data; higher values improve sort efficiency.
- 100
- tasktracker_http_threads
- (int) The number of worker threads for the HTTP server. This is used for map output fetching.
- 40
- yarn_local_dir
- (string) Space separated list of directories where YARN will store temporary data.
- /grid/hadoop/yarn/local
- yarn_local_log_dir
- (string) Space separated list of directories where YARN will store container log data.
- /grid/hadoop/yarn/local
- yarn_log_dir
- (string) Directory for storing the YARN logs
- /var/log/hadoop/yarn
- yarn_nodemanager_aux-services
- (string) Shuffle service that needs to be set for Map Reduce applications.
- mapreduce_shuffle
- yarn_nodemanager_aux-services_mapreduce_shuffle_class
- (string) Shuffle service that needs to be set for Map Reduce applications.
- org.apache.hadoop.mapred.ShuffleHandler
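Any of these options can be changed on a deployed service with the standard Juju configuration command. A minimal sketch, assuming the service was deployed as compute-nodes as in the usage examples above::
# raise the HDFS block size to 256MB (268435456 bytes)
juju set compute-nodes dfs_block_size=268435456
# lower the default replication factor
juju set compute-nodes dfs_replication=2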