How to Process Your Data Using Apache Pig

Apache Pig is a platform in the Big Data ecosystem used to process very large data sets in parallel. It works on top of Apache Hadoop and its MapReduce programming model. The Apache Pig platform provides an abstraction over the MapReduce model to make programming easier: it offers a SQL-like language for developing MapReduce programs, so instead of writing MapReduce jobs directly, developers can write a Pig script that automatically runs in parallel across a distributed environment.


Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, together with the infrastructure for evaluating those programs. The most important property of Pig is that its structure is open to substantial parallelization, which in turn enables it to handle very large data sets.

At present, the infrastructure layer of Pig consists of a compiler that generates a sequence of underlying Map-Reduce programs, for which large-scale parallel implementations already exist in the Hadoop framework.

The language layer of Pig consists of a textual language called Pig Latin. It has the following key features:

  • Ease of programming: It is trivial to achieve parallel execution of simple data analysis tasks. Complex tasks comprising multiple interrelated data transformations are explicitly encoded as data flow sequences, so applications written in Pig Latin are easy to write, understand and maintain.
  • Optimization: The tasks are encoded in a way that permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility: We can create our own functions to perform special-purpose processing.

Pig Installation and Execution

Apache Pig can be downloaded from the official website as an archive file; we just need to extract the archive and set the environment variables. Pig can also be installed from the rpm package on Red Hat-based systems or from the deb package on Debian-based systems. Once the installation is done, we start Pig in local mode using the following command:

Listing 1: Sample showing starting the Pig

$ pig -x local
...
grunt>

On executing this we get the grunt shell which allows us to interactively enter and execute the Pig statements.

Listing 2: Sample Pig script

input_lines = LOAD '/tmp/myLocalCopyOfMyWebsite' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- Create a group for each word
word_groups = GROUP filtered_words BY word;
-- Count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- Order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/numberOfWordsInMyWebsite';

The script above is compiled into parallel executable tasks that are distributed across the machines of a Hadoop cluster to count the words in a data set as large as “all the web pages on the internet”.
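Before storing the final output, the result can be spot-checked interactively from the grunt shell. A minimal sketch, assuming the ordered_word_count relation from Listing 2 is already defined:

```pig
-- Assuming ordered_word_count from Listing 2 is defined:
-- keep only the ten most frequent words and print them to the console
top_words = LIMIT ordered_word_count 10;
DUMP top_words;
```

Because DUMP triggers execution of only the statements needed to produce top_words, this is a cheap way to validate a script before running the final STORE.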

Pig in MapReduce

To use Pig in MapReduce mode, we should first ensure that Hadoop is up and running. This can be done by executing the following command at the shell prompt:

Listing 3: Checking Hadoop Availability

$ hadoop dfs -ls /
Found 3 items
drwxrwxrwx   - hue    supergroup          0 2011-12-08 05:20 /tmp
drwxr-xr-x   - hue    supergroup          0 2011-12-08 05:20 /user
drwxr-xr-x   - mapred supergroup          0 2011-12-08 05:20 /var
$

This command lists one or more entries if Hadoop is up and running. Now that we have ensured that Hadoop is running, let us check Pig. To start with, we should first get the grunt prompt as shown in Listing 1.

Listing 4: Testing Pig with Hadoop

$ pig -x mapreduce
2013-12-06 06:39:44,276 [main] INFO  org.apache.pig.Main - Logging error messages to...
2013-12-06 06:39:44,601 [main] INFO  org.apache.pig.... Connecting to hadoop file system at: hdfs://...
2013-12-06 06:39:44,988 [main] INFO  org.apache.pig.... connecting to map-reduce job tracker at: ...
grunt> cd hdfs:///
grunt> ls
hdfs://...     hdfs://...     hdfs://...
grunt>

So, now we can see the Hadoop file system from within Pig. Once we achieve this, we should try to read some data into it from our local file system. To do this, we first copy a file from the local file system into HDFS using Pig.

Listing 5: Getting the test data

grunt> mkdir tomcatwebFol
grunt> cd tomcatwebFol
grunt> copyFromLocal /usr/share/apache-tomcat/webapps/MywebApp/WEB-INF/web.xml webXMLFile
grunt> ls
hdfs://  10,924

Now, using this sample test data within Hadoop’s file system, we can try to execute another script. For example, we can do a cat on the file within Pig to see its contents. To achieve this, we need to load the webXMLFile from HDFS into a Pig relation.

Listing 6: Load and parse the file

grunt> webXMLFile = LOAD 'webXMLFile' USING PigStorage('>') AS (context_param:chararray, param_name:chararray, param_value:chararray);
grunt> DUMP webXMLFile;
(RootDir, /usr/Oracle/AutoVueIntegrationSDK/FileSys/Repository/filesysRepository)
...
grunt>

Pig also provides the GROUP operator, which helps in grouping the tuples of a relation based on a key.
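As a minimal sketch of GROUP, the following groups a relation of words by their length; the input file words.txt and its schema are assumptions made up for illustration:

```pig
-- words.txt is a hypothetical two-column input: word and its length
records = LOAD 'words.txt' AS (word:chararray, len:int);
-- GROUP produces one tuple per distinct length; 'group' holds the key
by_len = GROUP records BY len;
-- count how many words share each length
counts = FOREACH by_len GENERATE group AS len, COUNT(records) AS n;
DUMP counts;
```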

Operators in Pig

Apache Pig has a number of relational and diagnostic operators. The most important ones are listed in the table below:

Operator Name       Description

FILTER              Select a set of tuples from a relation based on a condition.
FOREACH…GENERATE    Iterate the tuples of a relation and generate a data transformation.
GROUP               Group the data in one or more relations.
JOIN                Join two or more relations (inner or outer join).
LOAD                Load data from the file system.
ORDER               Sort a relation based on one or more fields.
SPLIT               Partition a relation into two or more relations.
STORE               Store data in the file system.
DESCRIBE            Return the schema of a relation.
DUMP                Dump the contents of a relation to the screen.
EXPLAIN             Display the MapReduce execution plans.
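To see how a few of these operators fit together, here is a short sketch; the access_log input file and its schema are hypothetical:

```pig
-- Hypothetical input: one (ip, bytes) pair per line
logs = LOAD 'access_log' AS (ip:chararray, bytes:long);
by_ip = GROUP logs BY ip;
totals = FOREACH by_ip GENERATE group AS ip, SUM(logs.bytes) AS total;
DESCRIBE totals;   -- prints the schema of the relation
EXPLAIN totals;    -- prints the logical, physical and MapReduce plans
```

DESCRIBE and EXPLAIN are diagnostic operators: they inspect a relation without triggering a full MapReduce run, which makes them useful while developing a script.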


User Defined Functions

Although Pig is a powerful and useful scripting tool, it can be made even more powerful with the help of user-defined functions (UDFs). Pig scripts can use functions that we define ourselves, for scenarios such as parsing input data or formatting output data, and even custom operators. These UDFs are written in Java and allow Pig to support custom processing.
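On the Pig Latin side, a UDF is made available with REGISTER and, optionally, aliased with DEFINE. The jar name and Java class below are hypothetical placeholders, not a real library:

```pig
-- myudfs.jar and com.example.pig.ToUpper are hypothetical:
-- ToUpper would be a Java class extending Pig's EvalFunc
REGISTER myudfs.jar;
DEFINE TO_UPPER com.example.pig.ToUpper();
lines = LOAD 'input.txt' AS (line:chararray);
upper = FOREACH lines GENERATE TO_UPPER(line);
DUMP upper;
```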


Summary

  • Apache Pig is a part of the Big Data ecosystem.
  • Apache Pig is a platform used to analyze large data sets, built around a high-level language for expressing data analysis programs.
  • Apache Pig can be downloaded and installed from its official website.
  • It can easily be configured to execute against the Hadoop Distributed File System.

Hope you have enjoyed the article. Keep reading!


About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture and design, Java/J2EE, open source, and big data technologies.
