Hortonworks HIVE Overview
Data warehouse infrastructure built on top of Hortonworks Hadoop.
Hortonworks Apache Hive 0.12.x is a data warehouse infrastructure built on top of Hortonworks Hadoop 2.4.1 that provides tools for easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, along with a simple query language called HiveQL, which is based on SQL and enables users familiar with SQL to query the data. The language also allows traditional map/reduce programmers to plug in their own custom mappers and reducers for more sophisticated analysis that may not be supported by its built-in capabilities.
- HiveQL - A SQL dialect for querying data in an RDBMS fashion
- UDF/UDAF/UDTF (User Defined [Aggregate/Table] Functions) - Allow users to create custom map/reduce-based functions for regular use
- Ability to do joins (inner/outer/semi) between tables
- Support (limited) for sub-queries
- Support for table 'Views'
- Ability to partition data into Hive partitions or buckets to enable faster querying
- Hive Web Interface - A web interface to Hive
- Hive Server2 - Supports multi-user querying using Thrift, JDBC and ODBC clients
- Hive Metastore - Ability to run a separate metadata storage process
- Hive CLI - A Hive command line that supports HiveQL
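A short sketch of the HiveQL features listed above (DDL, partitioned tables and inner joins). The table and column names are invented for illustration; the script is written to a file here so it can be reviewed before being run against a deployed unit.

```shell
# Hypothetical HiveQL demonstrating partitioned tables and an inner join.
# Table/column names are illustrative only.
cat > /tmp/hive_features_demo.hql <<'EOF'
-- Partitioned table: each dt value becomes a separate HDFS directory,
-- so queries that filter on dt can skip irrelevant data.
CREATE TABLE page_views (url STRING, user_id INT)
PARTITIONED BY (dt STRING);

CREATE TABLE users (user_id INT, name STRING);

-- Inner join between the two tables, restricted to one partition.
SELECT u.name, v.url
FROM page_views v
JOIN users u ON (v.user_id = u.user_id)
WHERE v.dt = '2014-07-01';
EOF

# On a deployed hdphive unit this would be executed with:
#   hive -f /tmp/hive_features_demo.hql
cat /tmp/hive_features_demo.hql
```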
See [http://hive.apache.org](http://hive.apache.org) for more information.
This charm provides the Hive Server and Metastore roles which form part of an overall Hive deployment.
Hortonworks HIVE Usage
A Hive deployment consists of a Hive service, an RDBMS (only MySQL is currently supported), an optional Metastore service and a Hadoop cluster.
To deploy a simple four node Hadoop cluster (see the Hadoop charm README for further information)::

    juju deploy hdp-hadoop yarn-hdfs-master
    juju deploy hdp-hadoop compute-node
    juju add-unit -n 2 compute-node
    juju add-relation yarn-hdfs-master:namenode compute-node:datanode
    juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager
A Hive server stores metadata in MySQL::
    juju deploy mysql
    # hive requires ROW binlog
    juju set mysql binlog-format=ROW
To deploy a Hive service without a Metastore service::
    # deploy Hive instance (hive-server2)
    juju deploy hdp-hive hdphive
    # associate Hive with MySQL
    juju add-relation hdphive:db mysql:db
    # associate Hive with HDFS Namenode
    juju add-relation hdphive:namenode yarn-hdfs-master:namenode
    # associate Hive with resourcemanager
    juju add-relation hdphive:resourcemanager yarn-hdfs-master:resourcemanager
    juju add-relation compute-node:hadoop-nodes hdphive:hadoop-nodes
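The deployment steps above can be collected into a single reusable script. The sketch below only writes the script to a file rather than executing it, since the juju commands require a bootstrapped Juju environment; the service and relation names match the README examples.

```shell
# Write the full deployment sequence from this README to a script file.
# Not executed here: running it requires a bootstrapped Juju environment.
cat > /tmp/deploy-hdp-hive.sh <<'EOF'
#!/bin/sh
set -e

# Four node Hadoop cluster
juju deploy hdp-hadoop yarn-hdfs-master
juju deploy hdp-hadoop compute-node
juju add-unit -n 2 compute-node
juju add-relation yarn-hdfs-master:namenode compute-node:datanode
juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager

# MySQL metadata store (hive requires ROW binlog)
juju deploy mysql
juju set mysql binlog-format=ROW

# Hive service (hive-server2) and its relations
juju deploy hdp-hive hdphive
juju add-relation hdphive:db mysql:db
juju add-relation hdphive:namenode yarn-hdfs-master:namenode
juju add-relation hdphive:resourcemanager yarn-hdfs-master:resourcemanager
juju add-relation compute-node:hadoop-nodes hdphive:hadoop-nodes
EOF
chmod +x /tmp/deploy-hdp-hive.sh
```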
Once you have a cluster running, just run:
1) juju ssh yarn-hdfs-master/0  <<= ssh to hadoop master
2) Smoke test HDFS admin functionality - as the HDFS user, create /user/$CLIENT_USER in the
   Hadoop file system. The steps below verify/demo HDFS functionality:
   a) sudo su $HDFS_USER
   b) hdfs dfs -mkdir -p /user/ubuntu
   c) hdfs dfs -chown -R ubuntu:hdfs /user
   d) hdfs dfs -chmod -R 755 /user/ubuntu
   e) exit
3) Smoke test YARN and MapReduce - run the smoke test as the $CLIENT_USER, using Terasort to sort 10GB of data:
   a) hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar teragen 10000 /user/ubuntu/teragenout
   b) hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar terasort /user/ubuntu/teragenout /user/ubuntu/terasortout
4) Smoke test HDFS functionality from the ubuntu user space - delete the MapReduce output from HDFS:
   hdfs dfs -rm -r /user/ubuntu/teragenout

HIVE+HDFS Usage:

1) juju ssh hdphive/0  <<= ssh to hive server
2) sudo su $HIVE_USER
3) hive
4) from the Hive console:
   show databases;
   create table test(col1 int, col2 string);
   show tables;
   exit;
5) exit from the $HIVE_USER session
6) sudo su $HDFS_USER
7) hadoop dfsadmin -report  <<== verify connection to the remote HDFS cluster
8) hdfs dfs -ls /apps/hive/warehouse  <<== verify that the "test" directory has been created on the remote HDFS cluster
Scale Out Usage
In order to increase the number of slaves, you must add units. To add one unit::

    juju add-unit compute-node

Or you can add multiple units at once::

    juju add-unit -n4 compute-node
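When scaling to a target cluster size, the `-n` count is just the difference between the target and the current number of units. A minimal sketch of that arithmetic; `scale_compute_nodes` is a hypothetical helper, and in practice the current count would come from `juju status`.

```shell
#!/bin/sh
# Emit the juju command needed to grow compute-node to a target size.
# scale_compute_nodes is a hypothetical helper for illustration; the
# current unit count would normally be read from `juju status`.
scale_compute_nodes() {
    current=$1
    target=$2
    extra=$((target - current))
    if [ "$extra" -gt 0 ]; then
        echo "juju add-unit -n$extra compute-node"
    else
        echo "cluster already has $current units"
    fi
}

scale_compute_nodes 2 6   # prints: juju add-unit -n4 compute-node
```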
Amir Sanjar <firstname.lastname@example.org>
Upstream Project Name
- (int) The maximum heap size in MB to allocate for daemon processes within the service units managed by this charm.