Introduction to Hadoop Streaming

Introduction

Hadoop Streaming is a generic API that allows you to write Mappers and Reducers in any language, while the basic concept remains the same: Mappers and Reducers receive their input and produce their output as (key, value) pairs on stdin and stdout. Apache Hadoop uses standard UNIX streams as the interface between your application and the Hadoop system, which makes Streaming a natural fit for text processing. The data view is line oriented, and each line is processed as a key/value pair separated by the ‘tab’ character. The program reads each line and processes it as required.

Working with Hadoop Streaming

In any MapReduce job, input and output take the form of key/value pairs, and the same holds for the Streaming API. In Streaming, input and output are always represented as text, with the ‘tab’ character separating key from value. A Streaming program splits each input line at the first ‘tab’ character to form a key/value pair, and it follows the same convention for output, writing to stdout in the format shown below.

key1 	 value1 
key2 	 value2 
key3 	 value3 

In this process, each line contains exactly one key/value pair. The input to the reducer is sorted by key, so all identical keys are placed adjacent to one another.

Any program or tool can be used as Mapper or Reducer if it is capable of handling input in the text format described above. Scripts written in Perl, Python, Bash or other languages can also be used, provided all the nodes have an interpreter that understands the language.
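As a quick illustration, a minimal word-count Mapper written in Python might look like the following sketch (the file name mapper.py is an assumption, not anything Hadoop requires):

    #!/usr/bin/env python
    # mapper.py: read raw text lines from stdin and
    # emit one "word<tab>1" pair per word on stdout
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

The matching Reducer relies on the sorted input described above, keeping a running count until the key changes:

    #!/usr/bin/env python
    # reducer.py: input arrives sorted by key, so a running
    # count can be accumulated until the key changes
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, _, count = line.rstrip("\n").partition("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))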

Execution Steps

The Hadoop streaming utility allows any script or executable to work as Mapper or Reducer, provided it can read from stdin and write to stdout. In this section I will describe the steps for running the Streaming utility, assuming two sample programs that act as Mapper and Reducer.

First, let us check the command used to run a Streaming job. When the command is given no arguments, it prints its usage options, as shown below.


Figure 1: Showing Streaming command and usage
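For reference, the bare command is just the streaming jar submitted through hadoop jar with no further arguments. The jar's exact location varies across Hadoop versions and distributions, so the path below is a common convention rather than a given:

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar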

Now let us assume streamMapProgram and streamReduceProgram will work as Mapper and Reducer. These programs can be scripts, executables or any other components capable of taking input from stdin and producing output on stdout. The following command shows how the Mapper and Reducer arguments are combined with the Streaming command.


Figure 2: Showing input and output arguments
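A typical invocation along those lines might look like the following (the HDFS input and output paths are assumptions):

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input /user/hadoop/input \
        -output /user/hadoop/output \
        -mapper streamMapProgram \
        -reducer streamReduceProgram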

It is assumed that the Mapper and Reducer programs are present on all the nodes before the Streaming job starts.

First, the Mapper task converts its input into lines and places them on the stdin of the process. The Mapper then collects the output of the process from stdout and converts each line into a key/value pair. These key/value pairs are the actual output of the Mapper task. The key is everything up to the first ‘tab’ character, and the remaining portion of the line is considered the value. If there is no ‘tab’ character, the entire line is considered the key and the value is null.

The same process is followed when the reducer task runs. First it converts the key/value pairs into lines and puts them on the stdin of the process. The reducer then collects the line-oriented output from the stdout of the process and prepares key/value pairs as the final output of the reduce job. The key and value are separated in the same way, at the first ‘tab’ character.
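The splitting rule that both tasks apply can be sketched in Python as follows (an illustration of the rule, not Hadoop's actual code):

    def split_key_value(line):
        # The key is everything up to the first tab; the rest is the value.
        # With no tab in the line, the entire line becomes the key and
        # the value is null (None here).
        tab = line.find("\t")
        if tab == -1:
            return line, None
        return line[:tab], line[tab + 1:]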

The following diagram shows the process flow in a Streaming job.


Figure 3: Streaming process flow

Design Difference Between Java API and Streaming

There is a design difference between the Java MapReduce API and Hadoop Streaming, mainly in the processing implementation. In the standard Java API, the mechanism is to process one record at a time, so the framework calls the map() method (on your Mapper) once for each record. With the Streaming API, however, the map program controls how the input is consumed: it can read and process multiple lines at a time because it owns the reading loop. The same can be implemented in Java, but only with some extra machinery, such as using instance variables to accumulate multiple lines and then processing them together.
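For example, a Streaming Mapper in Python can drain its entire input before emitting a single record, aggregating as it goes, something the record-at-a-time map() contract does not offer directly. A sketch:

    #!/usr/bin/env python
    # This mapper owns its read loop: it consumes all of stdin,
    # accumulates per-word counts, and only then writes its output.
    import sys
    from collections import defaultdict

    counts = defaultdict(int)
    for line in sys.stdin:
        for word in line.strip().split():
            counts[word] += 1

    for word, count in counts.items():
        print("%s\t%d" % (word, count))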

Streaming Commands

The Hadoop Streaming API supports both streaming command options and generic command options. Some important streaming command options are listed below.

  • -input: the HDFS location of the input data for the Mapper
  • -output: the HDFS location where the Reducer output is written
  • -mapper: the command, script or executable to run as Mapper
  • -reducer: the command, script or executable to run as Reducer
  • -combiner: the command, script or executable to run as Combiner
  • -file: ships the Mapper/Reducer script (or any supporting file) to the compute nodes
  • -numReduceTasks: the number of reduce tasks to run
  • -cmdenv: passes an environment variable to the streaming commands

Additional Configuration Variables

In a streaming job, additional configuration variables can be specified with the -D option ("-D <property>=<value>").

  • The following can be used to change the local temp directory
    -D dfs.data.dir=/tmp
  • The following can be used to specify additional local temp directories
    -D mapred.local.dir=/tmp/local/streamingjob
  • The following can be used to specify a map-only job
    -D mapred.reduce.tasks=0
  • The following can be used to specify the number of reducers
    -D mapred.reduce.tasks=4
  • The following can be used to specify line split options
    -D stream.map.output.field.separator=.
    -D stream.num.map.output.key.fields=6
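Putting the generic -D options together with the streaming options, a complete job submission might look like the following (generic options must come before the streaming options; the paths and script names are assumptions):

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -D mapred.reduce.tasks=4 \
        -D stream.map.output.field.separator=. \
        -D stream.num.map.output.key.fields=6 \
        -input /user/hadoop/logs \
        -output /user/hadoop/logs-out \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py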

Conclusion

The Apache Hadoop framework and MapReduce programming are the industry standard for processing large volumes of data. The MapReduce programming framework is used to do the actual processing and logic implementation. The Java MapReduce API is the standard option for writing MapReduce programs, but the Hadoop Streaming API provides the option to write MapReduce jobs in other languages. This is one of the best examples of the flexibility available to MapReduce programmers with experience in languages other than Java. Even executables can be used with the Streaming API to work as a MapReduce job. The only condition is that the program or executable must be able to read input from stdin and produce output on stdout.


About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.
