Introduction to Hadoop Streaming

Introduction

Hadoop Streaming is a generic API that allows Mappers and Reducers to be written in any language. The basic concept remains the same: Mappers and Reducers receive their input on stdin and write their output to stdout as (key, value) pairs. Apache Hadoop uses standard UNIX streams as the interface between your application and the Hadoop system. Streaming is best suited to text processing. The data view is line oriented, and each line is processed as a key/value pair separated by a ‘tab’ character. The program reads each line and processes it as required.

Working with Hadoop Streams

In any MapReduce job, input and output are key/value pairs. The same holds for the Streaming API. In Streaming, input and output are always represented as text, and the ‘tab’ character separates the key from the value. The Streaming program uses the ‘tab’ character to split each line into a key/value pair. The same procedure is followed for output: the Streaming program writes its output to stdout in the same format, as shown below.

key1 	 value1 
key2 	 value2 
key3 	 value3 

In this process, each line contains only one key/value pair. The input to the reducer is sorted so that all lines with the same key are adjacent to one another.

Any program or tool can be used as a Mapper or Reducer if it can handle input in the text format described above. Scripts written in languages such as Perl, Python or Bash can also be used, provided every node has an interpreter for that language.
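As an illustration of a non-Java Mapper, here is a minimal Python sketch. The word-count task and the names map_words and main are assumptions chosen for this example; they are not the sample programs used later in the article. The sketch reads text lines and emits one tab-separated key/value line per word.

```python
import sys


def map_words(lines):
    """Yield one 'word<TAB>1' output line for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical entry point: read lines from stdin, write key/value lines to stdout."""
    for out_line in map_words(stdin):
        stdout.write(out_line + "\n")
```

A script used as a Streaming Mapper would call main() as its last statement, so that Hadoop can feed it lines on stdin and collect the result from stdout.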

Execution Steps

The Hadoop streaming utility allows any script or executable to work as Mapper/Reducer provided they can work with stdin and stdout. In this section I will describe the implementation steps of the Streaming utility. I will assume two sample programs to work as Mapper and Reducer.

First, let us look at the command used to run a Streaming job. When the command is run without any arguments, it prints the available usage options, as shown below.


Figure 1: Showing Streaming command and usage

Now let us assume that streamMapProgram and streamReduceProgram will work as Mapper and Reducer. These programs can be scripts, executables or any other component capable of taking input from stdin and producing output on stdout. The following command shows how the Mapper and Reducer arguments are combined with the Streaming command.


Figure 2: Showing input and output arguments

It is assumed that the Mapper and Reducer programs are present in all the nodes before starting the Streaming job.
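The shape of such a command can be sketched in Python by assembling its argument list. The -input, -output, -mapper and -reducer options are standard Streaming options, but the jar name and the HDFS paths below are placeholders assumed for this example; the real jar location depends on the Hadoop installation.

```python
def build_streaming_command(jar, input_path, output_path, mapper, reducer):
    """Return the argument list for a 'hadoop jar' Streaming invocation."""
    return [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


command = build_streaming_command(
    jar="hadoop-streaming.jar",        # assumed name; locate the jar in your installation
    input_path="/user/demo/input",     # hypothetical HDFS input directory
    output_path="/user/demo/output",   # hypothetical HDFS output directory
    mapper="streamMapProgram",
    reducer="streamReduceProgram",
)
print(" ".join(command))
```

Building the command as a list rather than one string makes it easy to pass to a process launcher without worrying about shell quoting.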

First, the Mapper task converts its input into lines and feeds them to the stdin of the process. The Mapper then collects the line output of the process from stdout and converts each line into a key/value pair. These key/value pairs are the actual output of the Mapper task. The key is the part of the line up to the first ‘tab’ character, and the rest of the line is the value. If there is no ‘tab’ character, the whole line is treated as the key and the value is ‘null’.
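The splitting rule described above can be sketched in a few lines of Python. The function name split_key_value is an assumption made for this example, and Python's None stands in for the ‘null’ value.

```python
def split_key_value(line):
    """Split an output line into (key, value) at the first tab character."""
    line = line.rstrip("\n")
    key, tab, value = line.partition("\t")
    if not tab:
        # No tab in the line: the whole line is the key and there is no value.
        return line, None
    return key, value
```

Only the first tab matters, so any later tabs stay inside the value.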

The same process is followed when the reducer task runs. First, it converts the key/value pairs into lines and feeds them to the stdin of the process. Then the reducer collects the line output from the stdout of the process and prepares key/value pairs as the final output of the reduce job. Key and value are separated in the same way, at the first ‘tab’ character.
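Because the lines reaching the Reducer are sorted by key, a reducer script only needs to group consecutive lines. The following Python sketch sums the values for each key; the function name sum_by_key and the assumption that every line holds a tab followed by an integer value are illustrative choices, not part of the original programs.

```python
from itertools import groupby


def sum_by_key(sorted_lines):
    """Sum integer values per key from tab-separated lines already sorted by key."""
    # Assumes every line has the form 'key<TAB>integer'.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"
```

Note that groupby only merges adjacent items, which is exactly why the sorted input matters: with unsorted input the same key could be emitted more than once.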

The following diagram shows the process flow in a streaming job.


Figure 3: Streaming process flow
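The flow can also be imitated locally, without a cluster, by chaining the three stages: map, sort by key, reduce. The Python sketch below is only an approximation under stated assumptions: the helper names and the toy mapper and reducer are invented for this example, and a plain string sort stands in for the sorting that Hadoop performs between the two phases.

```python
from itertools import groupby


def key_of(line):
    """Return the part of a line before the first tab."""
    return line.split("\t", 1)[0]


def run_locally(input_lines, mapper, reducer):
    """Imitate a streaming job: map the input, sort by key, then reduce."""
    mapped = sorted(mapper(input_lines), key=key_of)
    return list(reducer(mapped))


def first_letter_mapper(lines):
    """Toy mapper: emit 'first letter<TAB>word' for every word."""
    for line in lines:
        for word in line.split():
            yield f"{word[0]}\t{word}"


def count_reducer(sorted_lines):
    """Toy reducer: count how many lines share each key."""
    for key, group in groupby(sorted_lines, key=key_of):
        yield f"{key}\t{sum(1 for _ in group)}"


result = run_locally(["apple avocado\n", "banana apple\n"],
                     first_letter_mapper, count_reducer)
print(result)
```

A local run like this is a convenient way to check Mapper and Reducer logic on a small sample before submitting a real job.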

Design Difference Between Java API and Streaming

There is a design difference between the Java MapReduce API and Hadoop Streaming, mainly in how processing is implemented. In the standard Java API, records are processed one at a time, so the framework calls the map() method on your Mapper for each record. With the Streaming API, the map program controls how the input is read, so it can read and process multiple lines at a time. The same can be achieved in Java, but only with some additional mechanism, such as using instance variables to accumulate multiple lines and then processing them together.
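To make the difference concrete, here is a minimal Python sketch of a streaming map program that decides for itself how many lines to read at once. The batch size, the function name map_in_batches and the per-batch character count are all assumptions chosen for illustration.

```python
from itertools import islice


def map_in_batches(lines, batch_size=3):
    """Read up to batch_size lines at a time and emit one summary line per batch."""
    iterator = iter(lines)
    batch_number = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # Process the whole batch together: here, count its characters.
        total_chars = sum(len(line.rstrip("\n")) for line in batch)
        yield f"batch{batch_number}\t{total_chars}"
        batch_number += 1
```

Since the program owns the read loop, no framework callback is needed to accumulate lines between calls.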

Streaming Commands

The Hadoop Streaming API supports both streaming and generic command options. Some important streaming command options are listed below.

Additional Configuration Variables

In a streaming job, additional configuration variables can be specified with the -D option ("-D <property>=<value>").

  • The following command can be used to change local temp directory
    -D dfs.data.dir=/tmp
  • The following command can be used to specify additional local temp directories
    -D mapred.local.dir=/tmp/local/streamingjob
  • The following command can be used to specify Map-Only job
    -D mapred.reduce.tasks=0
  • The following command can be used to specify number of reducers
    -D mapred.reduce.tasks=4
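  • The following command can be used to customize how a map output line is split into key and value. Here ‘.’ is the field separator and the key is the prefix of the line up to the sixth ‘.’
    -D stream.map.output.field.separator=. -D stream.num.map.output.key.fields=6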
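The effect of the last pair of options can be sketched in Python. This is an illustration of the splitting rule as I understand it from the Hadoop Streaming documentation, not Hadoop's own code; the function name split_on_nth_separator is an assumption, and an empty string stands in for the empty value.

```python
def split_on_nth_separator(line, separator=".", num_key_fields=6):
    """Split a line into (key, value) at the Nth occurrence of the separator."""
    line = line.rstrip("\n")
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        # Fewer than num_key_fields separators: the whole line is the key.
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value
```

With the defaults used here, the line a.b.c.d.e.f.g.h is split into the key a.b.c.d.e.f and the value g.h.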

Conclusion

The Apache Hadoop framework and MapReduce programming are the industry standard for processing a large volume of data. The MapReduce programming framework is used to do the actual processing and logic implementation. The Java MapReduce API is the standard option for writing MapReduce programs. But the Hadoop Streaming API provides options to write MapReduce jobs in other languages. This is one of the best examples of flexibility available to MapReduce programmers who have experience in other languages apart from Java. Even executables can be used with the Streaming API to work as a MapReduce job. The only condition is that the program/executable should be able to take input from STDIN and produce output at STDOUT.

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.
