By submitting your information, you agree that devx.com may send you DevX offers via email, phone and text message, as well as email offers about other products and services that DevX believes may be of interest to you. DevX will process your information in accordance with the Quinstreet Privacy Policy.

Design patterns are common in almost all levels of software development and are nothing more than proven and tested design techniques used to solve business problems. MapReduce is no different and also has its own design patterns to solve computation issues.

Introduction

MapReduce is used to process data that resides on more than one computer. It provides clear and distinctive boundaries between what we can and what we cannotdo. This minimizes the number of options we need to consider to solve a given problem. At the same time, we can figure out how to solve a problem with constraints. Design patterns have been tested for many problems and have shown the right level of abstraction.

MapReduce design patterns occupy the same role in a smaller space of problems and solutions. They provide a general framework to solve our data computation-related issues, without concentrating on the problem domain. MapReduce design patterns also provide a common language for teams working together on MapReduce applications.

MapReduce Design Pattern

We will discuss different design approaches one by one in the following section.

Calculating

Problem: Let us consider that we have a stack of documents where each document contains a set of terms. We have a requirement to calculate the number of occurrences of each term in the document.

Solution: In the following code snippet, we have a Mapper that simply records '1' for each term it encounters. The reducer here traverses throughout the list of such ones, sums them up and produces the result.

Listing 1: Computing Code 1

class Mapper
method Map(docid id, doc d)
for all term t in doc d do
Emit(term t, count 1)
class Reducer
method Reduce(term t, counts [c1, c2,...])
cnt = 0
for all count c in [c1, c2,...] do
cnt = cnt + c
Emit(term t, count cnt)

Even though it's simple, the code above has an obvious disadvantage--a large amount of dummy counters are emitted by the Mapper. Now we can reduce this by summing the counters for each document.

Listing 2: Computing Code 2

class Mapper
method Map(docid id, doc d)
HArray = new AssociativeArray
for all term t in doc d do
HArray{t} = HArray{t} + 1
for all term t in HArray do
Emit(term t, count HArray{t})

Now, in order to accumulate the counters for all the documents we need the combiners:

Listing 3: Computing Code 3

class Mapper
method Map(docid id, doc d)
for all term ter in doc d do
Emit(term ter, count 1)
class Combiner
method Combine(term ter, [c1, c2,...])
sum = 0
for all count cnt in [c1, c2,...] do
sum = sum + cnt
Emit(term t, count sum)
class Reducer
method Reduce(term t, counts [c1, c2,...])
totalCnt = 0
for all count cnt in [c1, c2,...] do
totalCnt = totalCnt + cnt
Emit(term t, count totalCnt)

Collating

Problem: We have a set of items and some function of some items. It is required that we save all the items that have the same value of function or perform some other computation that requires all such items to be processed in a group.

Solution: We have the solution where the mapper computes the given function for each item and returns the value of the function as a key and the item as its value. The role of the reducer is to fetch all the grouped items and process them or save them.

Filtering (Grep), Parsing and Validation

Problem: Let us consider we have a set of records and the requirement is to collect all the records that meet some condition or transform these records into some other representation formats. The second part of the problem includes tasks, such as text parsing and extraction of the values.

Solution: The solution to this problem is quite straight forward – we have a mapper that takes one record at a time and returns the items that meet the criteria.

Distributed Task Execution

Problem: Let us consider we have a large computational problem that can be divided into multiple parts and results from all of these parts needs to be combined in order to obtain the final result.

Solution: The solution to this problem is to split the specifications into a set of specifications that are stored as input data for the mappers. Each of these mappers takes one specification at a time as input data and processes them and produces the results. The job of the reducer is to combine all of these results and produces the final result.

Iterative Message Passing

Problem: Let us consider we have a network of entities and there exists some relationship amongst them. We are required to calculate the state of each entity based on the property of other entity in the neighborhood. This state can be used to represent a distance to other nodes which is an indication that there is a neighbor having certain properties and characteristics.

Solution: We have a network that stores a set of nodes and each node contains the information of a list of adjacent node IDs. Conceptually, the MapReduce jobs are performed in iterative way and at each iteration the node sends messages to its neighbors. And then each neighbor updates its state on the basis of the message that is received. These iterations are terminated by some condition e.g. fixed maximal number of iterations. From the technical point of view, the Mapper returns the messages for each node using the ID of the adjacent node as a key. As a result, all of these messages are grouped by the incoming node and hence the reducer is able to recalculate the state and rewrites the node with the new state.

Listing 4: Iterative messaging passing

class Mapper
method Map(id nId, object NObj)
Emit(id nId, object NObj)
for all id mId in NObj.OutgoingRelations do
Emit(id mId, message getMessage(NObj))
class Reducer
method Reduce(id m, [s1, s2,...])
M = null
messages = []
for all s in [s1, s2,...] do
if IsObject(s) then
M = s
else // s is a message
messages.add(s)
M.State = calculateState(messages)
Emit(id m, item M)

Breadth First Search (Case Study)

Problem: Let us consider we have a graph and it is required to calculate the distance from one source node to all other nodes present in the graph. This is also called the number of hops.

Solution: The solution can be: first the source node emits 0 to all its neighbors, then the neighbors propagate this counter after incrementing it by 1 for each hop.

Distinctive Values

Problem: Let us consider we have a set of records that contain fields M and N. The requirement is to count the total number of unique values of field M for each subset of the same Group N.

Solution: The solution to this problem can be addressed in two stages. In the first stage, the mapper produces dummy counters for each pair of M and N. Then the reducer counts the total number of occurrences for each pair. The objective of this phase is to maintain uniqueness of M values. In the second phase, pairs are grouped by N and the total numbers of items are calculated for each group.

Summary

This article has discussed different design approaches that are commonly used to solve data computation issues. MapReduce design patterns are continuously evolving, so we will see more design approaches in near future.

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, java/j2ee, Open source and big data technologies. You can find more of his work at www.techalpine.com and you can email him here.