How to Process Your Data Using Apache Pig

Kaushik Pal provides some samples and tips on how to use Apache Pig for efficient analysis of large data sets.



Overview

Apache Pig is a platform in the Big Data ecosystem for processing large data sets in parallel. It runs on top of Apache Hadoop and MapReduce, the programming model used for Hadoop applications. Pig provides an abstraction over the MapReduce model to make programming easier: it offers a SQL-like scripting language for developing MapReduce programs. So instead of writing MapReduce code directly, developers can write a Pig script that runs automatically in parallel in a distributed environment.

Introduction

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, together with the infrastructure for evaluating those programs. The most important property of Pig programs is that their structure is open to substantial parallelization, which in turn enables them to handle very large data sets.



At present, the infrastructure layer of Pig consists of a compiler that generates sequences of MapReduce programs, for which large-scale parallel implementations already exist in Hadoop.
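To make the idea of a generated MapReduce program more concrete, here is a minimal single-process sketch in Python of the three phases such a job goes through. It is only an illustration of the programming model, not what the Pig compiler actually emits; the function names (map_phase, shuffle, reduce_phase, run_job) are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: collect all values that share a key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["pig runs on hadoop", "hadoop runs mapreduce"]))
```

On a real cluster the mapper and reducer calls run on different machines and the shuffle moves data over the network; a Pig script spares the developer from writing these pieces by hand.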

The language layer of Pig consists of a textual language called Pig Latin. It has the following key features:

  • Ease of programming: It is trivial to achieve parallel execution of simple, naturally parallel data analysis tasks. Complex tasks made up of multiple interrelated data transformations are explicitly encoded as data flow sequences. As a result, applications written in Pig Latin are easy to write, understand and maintain.

  • Optimization: The tasks are encoded in a way that permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

  • Extensibility: We can create our own functions to perform special-purpose processing.

Pig Installation and Execution

Apache Pig can be downloaded from the official website. It usually comes as an archive file, so we just need to extract the archive and set the environment variables. Pig can also be installed from an rpm package on Red Hat systems or from a deb package on Debian systems. Once the installation is done, we can start Pig in local mode using the following command:

Listing 1: Starting Pig in local mode

$ pig -x local
….
….
grunt>

Executing this gives us the grunt shell, which allows us to enter and execute Pig statements interactively.

Listing 2: Sample Pig script

input_lines = LOAD '/tmp/myLocalCopyOfMyWebsite' AS (line:chararray);

-- Extract the words from each line into a Pig bag,
-- then flatten the bag to get one word per row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Filter out any words that are just white space
filtered_words = FILTER words BY word MATCHES '\\w+';

-- Create a group for each word
word_groups = GROUP filtered_words BY word;

-- Count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- Order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/numberOfWordsInMyWebsite';

The script above is compiled into parallel executable tasks that can be distributed across multiple machines in a Hadoop cluster, so the same word count works on a data set as large as "all the web pages on the internet".
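To see what each stage of the word-count script produces, the same data flow can be mirrored in a few lines of plain Python. This is only a single-machine sketch for reasoning about the script, not how Pig executes it; the function name word_count is an assumption, and str.split() is a simplification of Pig's TOKENIZE, which also splits on a few punctuation characters.

```python
import re
from collections import Counter


def word_count(lines):
    """Mirror the Pig word-count script stage by stage on an in-memory list."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per row.
    words = [word for line in lines for word in line.split()]

    # FILTER ... BY word MATCHES '\\w+': the whole token must be word characters.
    filtered_words = [w for w in words if re.fullmatch(r"\w+", w)]

    # GROUP ... BY word followed by COUNT(...) on each group.
    counts = Counter(filtered_words)

    # ORDER ... BY count DESC; rows are (count, word), like the Pig relation.
    rows = [(n, w) for w, n in counts.items()]
    return sorted(rows, key=lambda row: -row[0])


if __name__ == "__main__":
    for count, word in word_count(["pig pig hadoop", "hadoop pig --"]):
        print(count, word)
```

Each intermediate list corresponds to one named relation in the Pig script (words, filtered_words, word_count, ordered_word_count), which is a handy way to check a script's logic on a small sample before running it on a cluster.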

Pig in MapReduce

To use Pig in MapReduce mode we should first ensure that Hadoop is up and running. This can be done by executing the following command at the shell prompt:

Listing 3: Checking Hadoop Availability

$ hadoop dfs -ls /
Found 3 items
drwxrwxrwx   - hue    supergroup          0 2011-12-08 05:20 /tmp
drwxr-xr-x   - hue    supergroup          0 2011-12-08 05:20 /user
drwxr-xr-x   - mapred supergroup          0 2011-12-08 05:20 /var
$

This command lists one or more entries if Hadoop is up and running. Now that we have ensured that Hadoop is running, let us check Pig. To start with, we need the grunt prompt again, this time with Pig started in MapReduce mode (the default when -x local is not given) so that it connects to the cluster.

Listing 4: Testing Pig with Hadoop

$ pig -x mapreduce
2013-12-06 06:39:44,276 [main] INFO org.apache.pig.Main - Logging error messages to...
2013-12-06 06:39:44,601 [main] INFO org.apache.pig.... Connecting to hadoop file \
    system at: hdfs://0.0.0.0:8020
2013-12-06 06:39:44,988 [main] INFO org.apache.pig.... connecting to map-reduce \
    job tracker at: 0.0.0.0:8021
grunt> cd hdfs:///
grunt> ls
hdfs://0.0.0.0/tmp <dir>
hdfs://0.0.0.0/user <dir>
hdfs://0.0.0.0/var <dir>
grunt>

So, now we can see the Hadoop file system from within Pig. Next we should try to read some data into it from our local file system. To do this we first copy a file from the local file system into HDFS using Pig.

Listing 5: Getting the test data

grunt> mkdir tomcatwebFol
grunt> cd tomcatwebFol
grunt> copyFromLocal /usr/share/apache-tomcat/webapps/MywebApp/WEB-INF/web.xml webXMLFile
grunt> ls
hdfs://0.0.0.0/tomcatwebFol/webXMLFile <r 1> 10,924

Now, using this sample test data within Hadoop's file system, we can try to execute another script. For example, we can display the contents of the file from within Pig, much like cat. To do this we need to load webXMLFile from HDFS into a Pig relation.

Listing 6: Load and parse the file

grunt> webXMLFile = LOAD '/usr/share/apache-tomcat/webapps/MywebApp/WEB-INF/web.xml'
           USING PigStorage('>')
           AS (context_param:chararray,
               param_name:chararray,
               param_value:chararray);
grunt> DUMP webXMLFile;
(RootDir, /usr/Oracle/AutoVueIntegrationSDK/FileSys/Repository/filesysRepository)
...
grunt>
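Listing 6 relies on PigStorage('>') to cut every input line into fields at the '>' character and to bind those fields to the names in the AS clause. The Python sketch below imitates that behavior for a single line, assuming (as the Pig documentation describes for declared schemas) that missing fields become null and surplus fields are dropped. The function name pig_storage_split and the default field names are illustrative, not part of Pig.

```python
def pig_storage_split(line, delimiter=">",
                      field_names=("context_param", "param_name", "param_value")):
    """Split one line on a delimiter and bind the pieces to schema field names."""
    parts = line.rstrip("\n").split(delimiter)
    # Pad short rows with None (Pig's null) and drop any surplus fields.
    parts = (parts + [None] * len(field_names))[:len(field_names)]
    return dict(zip(field_names, parts))


if __name__ == "__main__":
    print(pig_storage_split("<context-param><param-name>RootDir"))
```

Running it on a fragment of XML shows that the opening angle brackets stay inside the field values, which is worth keeping in mind when choosing a delimiter for markup files.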

Pig also provides the GROUP operator, which groups the tuples of a relation by the value of a field or expression.

Operators in Pig

Apache Pig has a number of relational and diagnostic operators. The most important ones are listed in the table below:

Operator Name   Type         Description
FILTER          Relational   Select a set of tuples from a relation based on a condition.
FOREACH         Relational   Iterate over the tuples of a relation and generate a data transformation.
GROUP           Relational   Group the data in one or more relations.
JOIN            Relational   Join two or more relations (inner or outer join).
LOAD            Relational   Load data from the file system.
ORDER           Relational   Sort a relation based on one or more fields.
SPLIT           Relational   Partition a relation into two or more relations.
STORE           Relational   Store data in the file system.
DESCRIBE        Diagnostic   Return the schema of a relation.
DUMP            Diagnostic   Dump the contents of a relation to the screen.
EXPLAIN         Diagnostic   Display the MapReduce execution plans.
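As a rough mental model, each relational operator can be thought of as a simple transformation on a list of tuples. The Python sketch below shows that model on two tiny, made-up relations (employees and departments are hypothetical sample data); it illustrates the semantics only and says nothing about how Pig runs these operators on a cluster.

```python
from collections import defaultdict

# Hypothetical sample relations: (name, dept_id) and (dept_id, dept_name).
employees = [("ann", 1), ("bob", 2), ("cid", 1)]
departments = [(1, "search"), (2, "ads")]

# FILTER: keep only the tuples that satisfy a condition.
in_dept_1 = [t for t in employees if t[1] == 1]

# FOREACH ... GENERATE: produce one transformed tuple per input tuple.
upper_names = [(name.upper(),) for name, _ in employees]

# GROUP ... BY: one entry per key, holding a bag of the matching tuples.
by_dept = defaultdict(list)
for t in employees:
    by_dept[t[1]].append(t)

# JOIN (inner): pair up tuples from both relations that share a key.
joined = [e + d for e in employees for d in departments if e[1] == d[0]]

# ORDER ... BY ... DESC: sort the relation on a field.
ordered = sorted(employees, key=lambda t: t[0], reverse=True)

# SPLIT: route tuples into several relations according to conditions.
first_dept = [t for t in employees if t[1] == 1]
other_depts = [t for t in employees if t[1] != 1]

if __name__ == "__main__":
    print(joined)
```

LOAD and STORE correspond to reading and writing these lists from and to files, while DESCRIBE, DUMP and EXPLAIN only inspect a relation or its plan and do not change the data.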

 

User Defined Functions

Although Pig is a powerful and useful scripting tool, it can be made even more powerful with the help of user-defined functions (UDFs). Pig scripts can call functions that we define ourselves for tasks such as parsing input data, formatting output data, or implementing custom operations. These UDFs are typically written in Java and allow Pig to support custom processing.
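A Java UDF needs the Pig libraries on the classpath to compile, so as a lighter illustration the sketch below uses Pig's support for Python (Jython) UDFs instead of Java. The function normalize_word is a hypothetical example, and the fallback decorator exists only so that the file can also be run and tested outside Pig; inside Pig, the outputSchema decorator is supplied by the Jython script engine.

```python
try:
    outputSchema  # supplied by Pig when the script is registered as a Jython UDF
except NameError:
    def outputSchema(schema):
        """Stand-in decorator so the file also runs outside Pig."""
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap


@outputSchema("word:chararray")
def normalize_word(word):
    """Hypothetical UDF: lower-case a token and strip surrounding punctuation."""
    if word is None:
        return None  # Pig passes nulls through as None
    return word.strip(".,;:!?\"'()").lower()


if __name__ == "__main__":
    print(normalize_word('"Hadoop,'))
```

Assuming the file is saved as udfs.py (a made-up name), a Pig script could register it with REGISTER 'udfs.py' USING jython AS myfuncs; and then call myfuncs.normalize_word(word) inside a FOREACH ... GENERATE statement.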

Summary

  • Apache Pig is a part of the Big Data ecosystem.
  • Apache Pig is a platform for analyzing large data sets; it consists of a high-level language for expressing data analysis programs and the infrastructure to run them.
  • Apache Pig can be downloaded and installed from the official website.
  • It can easily be configured to run on Hadoop and work with the Hadoop Distributed File System (HDFS).

Hope you have enjoyed the article. Keep reading!

 

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, java/j2ee, Open source and big data technologies. You can find more of his work at www.techalpine.com and you can email him here.



   