Overview
Apache Pig is a platform that forms part of the Big Data ecosystem. It is used to process large data sets in parallel and runs on top of Apache Hadoop and MapReduce. MapReduce is the programming model used for Hadoop applications, and Pig provides an abstraction over it to make programming easier: a high-level, SQL-like language for developing MapReduce programs. Instead of writing MapReduce code directly, developers write a Pig script that runs automatically in parallel in a distributed environment.
Introduction
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, together with the infrastructure for evaluating those programs. The most important property of Pig programs is that their structure is open to substantial parallelization, which in turn enables them to handle very large data sets.
At present, the infrastructure layer of Pig consists of a compiler that generates sequences of MapReduce programs, for which large-scale parallel implementations already exist in the Hadoop framework.
The language layer of Pig consists of a textual language called Pig Latin. It has the following key features:
- Ease of programming: It is trivial to achieve parallel execution of simple, naturally parallel data analysis tasks. Complex tasks made up of multiple interrelated data transformations are explicitly encoded as data flow sequences. As a result, Pig Latin scripts are easy to write, understand and maintain.
- Optimization: The tasks are encoded in a way that permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility: We can create our own functions to perform special-purpose processing.
Pig Installation and Execution
Apache Pig can be downloaded from the official website, usually as an archive file. We just need to extract the archive and set the environment variables. Pig can also be installed from the RPM package on Red Hat systems or from the deb package on Debian systems. Once the installation is done, we start Pig in local mode using the following command:
Listing 1: Starting Pig in local mode
$ pig -x local
...
grunt>
Executing this gives us the Grunt shell, which allows us to enter and execute Pig statements interactively.
Listing 2: Sample Pig script
input_lines = LOAD '/tmp/myLocalCopyOfMyWebsite' AS (line:chararray);
-- Extract words from each line and put them into a Pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Filter out any words that are just white space
filtered_words = FILTER words BY word MATCHES '\\w+';
-- Create a group for each word
word_groups = GROUP filtered_words BY word;
-- Count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- Order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/numberOfWordsInMyWebsite';
The script above is compiled into parallel tasks that can be distributed across multiple machines in a Hadoop cluster. It counts how often each word occurs in a data set, which could be as large as “all the web pages on the internet”.
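To make the data flow of Listing 2 easier to follow, here is a minimal single-machine sketch of the same steps in plain Python. It only illustrates the logic and is not how Pig executes the script: the function name `word_count` and the sample lines are hypothetical, and splitting on whitespace is a simplification of Pig's TOKENIZE.

```python
import re
from collections import Counter


def word_count(lines):
    """Emulate the steps of the word-count script on an in-memory list of lines."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per row.
    # Assumption: split on whitespace only; Pig's TOKENIZE also splits on
    # a few other characters such as commas and parentheses.
    words = [word for line in lines for word in line.split()]
    # FILTER ... MATCHES: keep rows that consist entirely of word characters.
    filtered_words = [w for w in words if re.fullmatch(r"\w+", w)]
    # GROUP ... BY word, then COUNT the entries in each group.
    counts = Counter(filtered_words)
    # ORDER ... BY count DESC (ties are broken alphabetically to keep the
    # output stable; the Pig script does not specify a tie order).
    return sorted(((n, w) for w, n in counts.items()), key=lambda t: (-t[0], t[1]))


sample = ["the quick brown fox", "the lazy dog", "the fox"]
for count, word in word_count(sample):
    print(count, word)
```

Each intermediate variable corresponds to one relation in the script, which is the main idea behind a data flow language: every statement takes a relation and produces a new one.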
Pig in MapReduce
To use Pig in MapReduce mode, we should first ensure that Hadoop is up and running. This can be done by executing the following command at the shell prompt:
Listing 3: Checking Hadoop Availability
$ hadoop dfs -ls /
Found 3 items
drwxrwxrwx - hue supergroup 0 2011-12-08 05:20 /tmp
drwxr-xr-x - hue supergroup 0 2011-12-08 05:20 /user
drwxr-xr-x - mapred supergroup 0 2011-12-08 05:20 /var
$
This command lists one or more entries if Hadoop is up and running. Now that we have ensured that Hadoop is running, let us check Pig. We first need the grunt prompt again, this time starting Pig in MapReduce mode rather than the local mode used in Listing 1, so that Pig connects to the Hadoop file system.
Listing 4: Testing Pig with Hadoop
$ pig -x mapreduce
2013-12-06 06:39:44,276 [main] INFO org.apache.pig.Main - Logging error messages to...
2013-12-06 06:39:44,601 [main] INFO org.apache.pig.... Connecting to hadoop file system at: hdfs://0.0.0.0:8020
2013-12-06 06:39:44,988 [main] INFO org.apache.pig.... connecting to map-reduce job tracker at: 0.0.0.0:8021
grunt> cd hdfs:///
grunt> ls
hdfs://0.0.0.0/tmp
hdfs://0.0.0.0/user
hdfs://0.0.0.0/var
grunt>
Now we can see the Hadoop file system from within Pig. Next, we should try to read some data into it from our local file system. To do this, we first copy a file from the local file system into HDFS using Pig.
Listing 5: Getting the test data
grunt> mkdir tomcatwebFol
grunt> cd tomcatwebFol
grunt> copyFromLocal /usr/share/apache-tomcat/webapps/MywebApp/WEB-INF/web.xml webXMLFile
grunt> ls
hdfs://0.0.0.0/tomcatwebFol/webXMLFile 10,924
Using this sample test data within Hadoop’s file system, we can now try to execute another script. For example, we can do a cat on the file within Pig to see its contents. To work with the data, we need to load webXMLFile from HDFS into a Pig relation.
Listing 6: Load and parse the file
grunt> webXMLFile = LOAD '/tomcatwebFol/webXMLFile' USING PigStorage('>') AS (context_param:chararray, param_name:chararray, param_value:chararray);
grunt> DUMP webXMLFile;
(RootDir, /usr/Oracle/AutoVueIntegrationSDK/FileSys/Repository/filesysRepository)
...
grunt>
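As a rough illustration of what loading with `PigStorage('>')` does in Listing 6, the following plain-Python sketch splits each input line on the delimiter and assigns the pieces to named fields. The helper `load_delimited`, the field names and the sample lines are hypothetical, and the sketch ignores Pig's type handling.

```python
def load_delimited(lines, delimiter, field_names):
    """Split each line on `delimiter` and map the pieces to `field_names`."""
    relation = []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        # Assumption: pad short rows with None and ignore surplus pieces,
        # so that every tuple has exactly one value per declared field.
        parts = (parts + [None] * len(field_names))[:len(field_names)]
        relation.append(dict(zip(field_names, parts)))
    return relation


rows = load_delimited(["<a>1>x", "<b>2"], ">", ["first", "second", "third"])
for row in rows:
    print(row)
```

Note that splitting on `>` treats the XML file as plain delimited text. It does not parse the XML structure, so the resulting fields still contain fragments of the tags.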
Pig also provides the GROUP operator, which groups the tuples of a relation by the value of a chosen field.
Operators in Pig
Apache Pig has a number of relational and diagnostic operators. The most important ones are listed in the table below:
| Operator Name | Type | Description |
|---|---|---|
| FILTER | Relational | Select a set of tuples from a relation based on a condition. |
| FOREACH | Relational | Iterate over the tuples of a relation and generate a data transformation. |
| GROUP | Relational | Group the data in one or more relations. |
| JOIN | Relational | Join two or more relations (inner or outer join). |
| LOAD | Relational | Load data from the file system. |
| ORDER | Relational | Sort a relation based on one or more fields. |
| SPLIT | Relational | Partition a relation into two or more relations. |
| STORE | Relational | Store data in the file system. |
| DESCRIBE | Diagnostic | Return the schema of a relation. |
| DUMP | Diagnostic | Dump the contents of a relation to the screen. |
| EXPLAIN | Diagnostic | Display the MapReduce execution plans. |
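The behaviour described for the relational operators above can be sketched on a single machine in plain Python. This is only an analogy to help read the descriptions, not Pig code: the relations `employees` and `departments` and their fields are made-up sample data, and a relation is modelled simply as a list of tuples.

```python
from collections import defaultdict

# A "relation" is modelled as a list of tuples: (name, dept, salary).
employees = [("ann", "eng", 120), ("bob", "ops", 90), ("cid", "eng", 100)]
# Second relation for the join: (dept, floor).
departments = [("eng", 3), ("ops", 1)]

# FILTER: keep the tuples that satisfy a condition.
well_paid = [t for t in employees if t[2] >= 100]

# FOREACH ... GENERATE: produce a transformed tuple for every input tuple.
names = [(t[0].upper(),) for t in employees]

# GROUP ... BY: collect the tuples that share a key into one bag per key.
by_dept = defaultdict(list)
for t in employees:
    by_dept[t[1]].append(t)

# JOIN (inner): pair up tuples from two relations whose keys match.
joined = [e + d for e in employees for d in departments if e[1] == d[0]]

# ORDER ... BY: sort the tuples on a field.
by_salary = sorted(employees, key=lambda t: t[2], reverse=True)

# SPLIT: route tuples into two or more relations by condition.
high = [t for t in employees if t[2] >= 100]
low = [t for t in employees if t[2] < 100]

print(well_paid, names, dict(by_dept), joined, by_salary, high, low, sep="\n")
```

In Pig, each of these steps would be one statement, and the platform decides how to run it in parallel across the cluster.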
User Defined Functions
Although Pig is a powerful and useful scripting tool, it can be made even more powerful with the help of user-defined functions (UDFs). Pig scripts can call functions that we define ourselves for tasks such as parsing input data or formatting output data, and even for implementing custom operators. These UDFs are written in Java and allow Pig to support custom processing.
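The idea behind a UDF can be sketched in a few lines. The example below is a plain-Python analogy rather than a real Pig UDF, which as noted above would be written in Java against Pig's own API. The names `to_upper` and `foreach_generate` and the sample relation are hypothetical and exist only for this illustration.

```python
def to_upper(value):
    """Hypothetical UDF: upper-case a string, passing None (null) through."""
    return None if value is None else value.upper()


def foreach_generate(relation, udf, field_index):
    """Apply `udf` to one field of every tuple, like a FOREACH ... GENERATE step."""
    return [
        tuple(udf(v) if i == field_index else v for i, v in enumerate(row))
        for row in relation
    ]


words = [("pig", 3), (None, 1), ("hadoop", 2)]
print(foreach_generate(words, to_upper, 0))
```

The point of the sketch is the division of labour: the user supplies a small function that handles one value, and the platform takes care of applying it to every tuple in the relation.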
Summary
- Apache Pig is a part of the Big Data ecosystem.
- Apache Pig is a platform for analyzing large data sets; it consists of a high-level language for expressing data analysis programs.
- Apache Pig can be downloaded and installed from the official website.
- It can easily be configured and run on top of Hadoop and the Hadoop Distributed File System.
Hope you have enjoyed the article. Keep reading!
About the Author
Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.