Login | Register   
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Create an Apache Hadoop MapReduce Job Using Spring

Follow this simple six step process to create a MapReduce job in Apache Hadoop using Spring.


advertisement

Spring is a widely used framework in enterprise applications development and includes different components such as Spring ORM, Spring JDBC and more, to support various features. Spring for Apache Hadoop is the framework used to support application building with Hadoop components such as HDFS, MapReduce, Hive, etc. Spring provides APIs to work with all these components and also supports integration of Hadoop with other Spring ecosystem projects for real life application development. This article will discuss the usage of Spring for Apache Hadoop frameworks.

Introduction

Apache Hadoop is an open source software framework that is used to store and process data sets of large volume. Spring is also an open source framework that is widely used in Java/J2EE applications. Spring's dependency injection (DI) or inversion of control (IO) mechanism have become popular alternatives to the Enterprise Java Beans (or EJB) model. Spring has the advantage of being flexible enough to be easily plugged into any other development framework. Learn how to use Spring and Apache Hadoop together to get the maximum benefit from each framework.

Getting Started

First, we'll discuss how to create a Hadoop MapReduce job using Spring. This involves the following steps:

Step 1: Obtain the required dependencies using Maven

As you know, Maven is highly dependent on the pom.xml file, so you need to make the following entries in our pom.xml file. These dependency entries are for Hadoop core and Spring framework.



Listing 1: Sample configuration entries in pom.xml file

< !-- Spring Data Apache Hadoop -- >
< dependency >
    < groupId > org.springframework.data </ groupId >
    < artifactId  > spring-data-hadoop </ artifactId >
    < version > 1.0.0.RELEASE </ version >
< /dependency >
< !-- Apache Hadoop Core –- >
< dependency >
    < groupId > org.apache.hadoop </ groupId >
    < artifactId > hadoop-core </ artifactId >
    < version > 1.0.3 </version >
</dependency>

Step 2: Create the mapper component

The mapper component is used to break the actual problem into smaller components so they are easier to solve. You can have your own customized mapper component by extending the Apache MapReduce Mapper class. You need to override the map method of this class. The mapper class expects the following four parameters:

For input: The following parameters are for input key and value

  • KEYIN – This parameter describes the key type that is provided as an input to the mapper component.
  • VALUEIN – This parameter describes the type of the value that is provided as an input to the mapper component.

For output: Following parameters are for output key and value

  • KEYOUT – This parameter describes the type of the output key parameter from the mapper component.
  • VALUEOUT – This parameter describes the type of the output value from the mapper component.

Each of these parameters must implement the writableinterface. In the given example, we have used our mapper to read the contents of a file one line at a time and prepare key-value pairs of every line.

Our implementation of the map method performs the following tasks

  1. Split each single line into words
  2. Iterate through every single word and take out all the Unicode characters that are neither letters nor characters.
  3. Construct a key-value pair using the write method of the Mapper.Context class that is compatible with the expected output key-value pair.

Listing 2: Sample customized Mapper class

public class MyWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private Text myword = new Text();

@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer lineTokenz = new StringTokenizer(line);
        while (lineTokenz.hasMoreTokens()) {
            String cleaned_data = removeNonLettersNonNumbers(lineTokenz.nextToken());
           myword.set(cleaned_data);
            context.write(myword, new IntWritable(1));
        }
    }

    /**
     * Replace all Unicode characters that are neither numbers nor letters with an empty string.
     * @param original, It is the original string
     * @return a string object that contains only letters and numbers
     */
    private String removeNonLettersNonNumbers (String original) {
        return original.replaceAll("[^\\p{L}\\p{N}]", "");
    }
}

Step 3: Create the Reducer Component

A reducer is a component that deletes the unwanted intermediate values and forwards only those key value pairs that are relevant. To have our customized reducer, our class should extend the Reducer class and override the reduce method. The reducer class expects the following four parameters:

For input: The following parameters are for input key and value

  • KEYIN – This parameter describes the key type that is provided as an input to the mapper component.
  • VALUEIN – This parameter describes the type of the value that is provided as an input to the mapper component.

For output: Following parameters are for output key and value

  • KEYOUT – This parameter describes the type of the output key parameter from the mapper component.
  • VALUEOUT – This parameter describes the type of the output value from the mapper component.

While implementing we must make sure that the datatype of the 'keyin' and 'keyout' parameters are the same. Also, the 'valuein' and valueout' parameters should be of same type. Our implementation of the reduce method performs the following steps:

  1. Check that the input key contains the desired word.
  2. If the above step is true, get the number of occurrences of the word.
  3. Construct a new key-value pair by calling the write method of the reducer class.

Listing 3: Sample customized Reducer class

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected static final String MY_TARGET_TEXT = "Hadoop";
    
@Override
 protected void reduce(Text keyTxt, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        if (containsTargetWord(keyTxt)) {
            int wCount = 0;
            for (IntWritable value: values) {
               wCount += value.get();
            }
            context.write(key, new IntWritable(wCount));
        }
    }
    private boolean containsTargetWord(Text keyTxt) {
        return keyTxt.toString().equals(MY_TARGET_TEXT);
    }
}

Step 4: Create the application context

The next step is to create the application context using XML. We can configure the application context of our application in this way:

  • Create a properties file that contains the value of the configuration properties. A sample application properties file is shown below.

    fs.default.name=hdfs://localhost:9000
    mapred.job.tracker=localhost:9001
    input.path=/path/to/input/file/
    output.path=/path/to/output/file
  • Configure a property place holder that is used to fetch the values of configuration properties from the created properties file. This can be done by adding the following in our application context XML file:

    <context:property-placeholder location="classpath:application.properties" />
  • Configure Apache Hadoop and its job. We can configure the default file system and its job tracker by adding the following in our application context file

    <hdp:configuration>
        fs.default.name=${fs.default.name}
        mapred.job.tracker=${mapred.job.tracker}
    </hdp:configuration>

    We should add the following in our application context XML file to define the job tracker.

    <hdp:job id="wordCountJobId"
     input-path="${input.path}"
     output-path="${output.path}"
     jar-by-class="net.qs.spring.data.apachehadoop.Main"
     mapper="net.qs.spring.data.apachehadoop.MyWordMapper"
     reducer="net.qs.spring.data.apachehadoop.MyWordReducer"/>

  • Configure the job runner that runs the created Hadoop job. The Job runner can be configured by adding the following in our application context XML file

    <hdp:job-runner id="wordCountJobRunner" job-ref="wordCountJobId" run-at-startup="true"/>

Step 5: Loading the application context at startup

We can now execute the created Hadoop job by loading the application context when the application starts up. We can do this by creating the instance of the ClasspathXmlApplicationContext object that accepts the name of our application context file as input parameter to the constructor.

Listing 4: Sample showing loading of application context

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class Main {
    public static void main(String[] arguments) {
        ApplicationContext ctx = new ClassPathXmlApplicationContext("applicationContext.xml");
    }
}

Step 6: Run the MapReduce job

We can start our map reduce job using the following steps:

  1. Upload an input file into HDFS – We can do this by executing the following command at the command prompt:

    hadoop dfs -put sample.txt /input/sample.txt

    The following is a sample input file. The target key word 'Hadoop' is highlighted in GREEN. The word 'Hadoop' exists 4 times in the sample.


    Figure 1: Sample input file

  2. Check if the file has been uploaded successfully by running the following command. It will show the input file.

    hadoop dfs -ls /input
  3. Run the MapReduce job. This can be done by executing the main method of our java file from the IDE. If all the steps work as expected then the following will be the output.

    Output: Hadoop 4

Summary

To recap, both Spring and Hadoop are useful frameworks from the open source community and by combining them, we can get the benefit of both the frameworks. To create a MapReduce job using Spring, you can follow this simple six step process.

 

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, java/j2ee, Open source and big data technologies. You can find more of his work at www.techalpine.com and you can email him here.



   
Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap