Apache™ Pig allows you to write complex MapReduce transformations using a
simple scripting language. Pig Latin (the language) defines a set of
transformations on a data set such as aggregate, join and sort.
Pig translates the Pig Latin script into MapReduce so that it can be executed
within Hadoop®. Pig Latin is sometimes extended using UDFs
(User Defined Functions), which the user can write in Java or a scripting language and then call directly from Pig Latin.
Hortonworks Pig overview
Hortonworks HDP 2.1 Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs. The salient property of Pig
programs is that their structure is amenable to substantial parallelization,
which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that
produces sequences of Map-Reduce programs, for which large-scale parallel
implementations already exist (e.g., the Hadoop subproject). Pig's language
layer currently consists of a textual language called Pig Latin, which has the
following key properties:
- Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.
Pig has two execution modes or exectypes:
- Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
- Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
This charm provides a Pig client that supports both of the execution modes described above.
Hortonworks Pig usage
Step-by-step instructions on using the charm:
**Local Mode**

    juju deploy hdp-pig hdp-pig

**Mapreduce Mode - remote hadoop cluster**

- Install Hadoop HDP 2.1 cluster

        juju deploy hdp-hadoop yarn-hdfs-master
        juju deploy hdp-hadoop compute-node
        juju add-unit -n 2 compute-node
        juju add-relation yarn-hdfs-master:namenode compute-node:datanode
        juju add-relation yarn-hdfs-master:resourcemanager compute-node:nodemanager

- Install HDP Pig

        juju deploy hdp-pig hdp-pig
        juju add-relation hdp-pig:namenode yarn-hdfs-master:namenode
        juju add-relation hdp-pig:resourcemanager yarn-hdfs-master:resourcemanager
Smoke test local mode deployment:
1) pig -x local
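
Running `pig -x local` on its own only opens the interactive Grunt shell. A non-interactive local smoke test could look like the sketch below. It is an illustration rather than part of the charm: the helper name `write_id_script` and the temporary working directory are assumptions, and the Pig script is the same three-line script used in the mapreduce test, pointed at the local `/etc/passwd`.

```shell
#!/bin/sh
# Write a Pig script that projects the first ':'-separated field of a file.
# $1 = script path, $2 = input file, $3 = output directory (must not exist yet).
# write_id_script is a hypothetical helper name, not something the charm ships.
write_id_script() {
    cat > "$1" <<EOF
A = load '$2' using PigStorage(':');
B = foreach A generate \$0 as id;
store B into '$3';
EOF
}

# Scratch directory so the output path is guaranteed not to exist beforehand.
workdir=$(mktemp -d)
write_id_script "$workdir/id.pig" /etc/passwd "$workdir/id.out"

if command -v pig >/dev/null 2>&1; then
    # Local mode reads and writes the local file system; no cluster is needed.
    pig -x local -l "$workdir/pig.log" "$workdir/id.pig" \
        && cat "$workdir"/id.out/part-*
else
    echo "pig not found on PATH; script left in $workdir/id.pig" >&2
fi
```

The `\$0` inside the here-document is a shell escape; the file written to disk contains a plain `$0`, which is Pig's positional reference to the first field.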
Smoke test mapreduce deployment:
Verify connections to remote cluster:

1) juju ssh hdp-pig
2) sudo su $HDFS_USER
3) hadoop version <= verifies that the hadoop client is installed
4) hdfs dfsadmin -report <= verifies that the Pig client is connected to the remote HDFS server
5) yarn rmadmin -getGroups <= verifies that the Pig client is connected to the remote ResourceManager server

Run a Pig Script Test:

1) hdfs dfs -mkdir -p /user/hduser
2) hdfs dfs -copyFromLocal /etc/passwd /user/hduser/passwd
3) vim /tmp/id.pig
4) add the following Pig script commands, save and exit:

        A = load '/user/hduser/passwd' using PigStorage(':');
        B = foreach A generate $0 as id;
        store B into '/tmp/id.out';

5) pig -l /tmp/pig.log /tmp/id.pig
6) hadoop fs -cat /tmp/id.out/part-m-00000 <= check the result on the hadoop cluster
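
To know what the result in the last step should look like, the same projection can be computed without Pig. The sketch below assumes a POSIX shell with `cut`; the function name `expected_ids` and the two sample records are hypothetical and exist only for illustration. It prints the first colon-separated field of each line, which is the field the test script stores as `id`.

```shell
#!/bin/sh
# Print the first ':'-separated field of each line in file $1. This mirrors
# the projection the Pig test script applies to the passwd file.
# expected_ids is a hypothetical helper name used only in this sketch.
expected_ids() {
    cut -d: -f1 "$1"
}

# Hypothetical two-line sample in /etc/passwd format, for illustration only.
sample=$(mktemp)
printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
              'hdfs:x:1001:1001::/home/hdfs:/bin/bash' > "$sample"
expected_ids "$sample"    # prints "root" then "hdfs"
rm -f "$sample"
```

On the Pig client, `expected_ids /etc/passwd | sort` should match `hadoop fs -cat /tmp/id.out/part-m-00000 | sort`, assuming the job wrote a single part file and `/etc/passwd` has not changed since it was copied to HDFS.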
Developer Contact Information
amir sanjar email@example.com
Upstream Hortonworks Links
- Hortonworks Upstream website http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.3/bk_installing_manually_book/content/rpm-chap1.html