Hadoop, the distributed computing framework, uses its own distributed file system (HDFS) for handling large datasets. But developers who use Hadoop have options for other aspects of their bulk data processing applications. In particular, they can use protocol buffers for data serialization, LZO for data compression, and Elephant-Bird for generating input and output format classes:
- Protocol buffers are a flexible, efficient and automated mechanism for serializing structured data — think XML, only smaller, faster and simpler.
- LZO is a fast compression and decompression technique. It can compress large datasets to increase the available space in HDFS, and it supports fast decompression, which helps speed up data processing in a distributed environment.
- Elephant-Bird is a framework for generating protocol buffer-based input and output format classes for Hadoop.
This article explains how to set up Hadoop with protocol buffers and LZO for large dataset processing. Because LZO files are splittable in HDFS, you can store them across a Hadoop cluster for distributed processing.
Enabling LZO in Hadoop
Take these steps to set up LZO compression in a Hadoop environment:
- Install the following .rpm files on your Hadoop system:
- lzo-2.04-1.el5.rf.i386.rpm
- lzo-devel-2.04-1.el5.rf.i386.rpm
- hadoop-gpl-packaging-0.2.0-1.i386.rpm
- Add hadoop-lzo-0.4.15.jar to the lib folder of Hadoop.
- Add the following properties to core-site.xml. This file is in the conf folder of Hadoop.
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
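Once the properties are in place, you can check that Hadoop actually resolves .lzo files to the LZO codec. The following is a minimal sketch (the class and file names are mine, and it assumes the hadoop-lzo jar and native libraries are on the classpath); it uses Hadoop's CompressionCodecFactory, which maps file extensions to the codecs registered in io.compression.codecs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class LzoCodecCheck {
  public static void main(String[] args) {
    // Loads core-site.xml from the classpath, including io.compression.codecs
    Configuration conf = new Configuration();
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);

    // Resolve a codec by file extension; .lzo should map to LzopCodec
    CompressionCodec codec = factory.getCodec(new Path("sample.lzo"));
    System.out.println(codec == null
        ? "LZO codec not registered - check core-site.xml and the lib folder"
        : "Resolved codec: " + codec.getClass().getName());
  }
}
```

If the codec resolves to com.hadoop.compression.lzo.LzopCodec, the configuration is working and .lzo files in HDFS will be decompressed transparently.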
Using Elephant-Bird in Hadoop
Elephant-Bird is a framework for generating protocol buffer and LZO compression-based input and output format classes for dataset processing. These input and output format classes are generated from a Proto file. Here are the steps for generating them:
- Copy and extract the kevinweil-elephant-bird-v2.1.8-8.zip file. It will create a kevinweil-elephant-bird-v2.1.8-8 folder.
- Install the Proto compiler (protoc), which will allow you to generate Java files from the proto input file.
- Install the Ant build tool, which will allow you to generate the input and output format classes.
- Add a Proto file in the kevinweil-elephant-bird-v2.1.8-8/examples/src/proto folder.
- Here is the Proto file of a sample called address_book.
package com.test.proto;

option java_outer_classname = "AddressBookProtos";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}
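To see what this Proto file gives you, here is a short sketch of building and serializing a Person record with the protoc-generated AddressBookProtos class (the field values and demo class name are mine; it assumes protoc has already been run on address_book.proto and protobuf-java is on the classpath).

```java
import com.test.proto.AddressBookProtos;

public class AddressBookDemo {
  public static void main(String[] args) throws Exception {
    // Build a Person record; name and id are required, email is optional
    AddressBookProtos.Person person = AddressBookProtos.Person.newBuilder()
        .setName("John Doe")
        .setId(1234)
        .setEmail("jdoe@example.com")
        .build();

    // Serialize to a compact binary form and parse it back
    byte[] bytes = person.toByteArray();
    AddressBookProtos.Person copy = AddressBookProtos.Person.parseFrom(bytes);
    System.out.println(copy.getName() + " / " + copy.getId());
  }
}
```

This compact binary form is what the generated Elephant-Bird format classes read and write inside LZO-compressed files.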
- The config-twadoop.yml file, located in the kevinweil-elephant-bird-v2.1.8-8/examples folder, configures which input and output format classes are generated from the Proto file. Here is a sample file, where address_book is the proto file name.
address_book:
  - com.twitter.elephantbird.proto.codegen.LzoProtobufB64LineInputFormatGenerator
  - com.twitter.elephantbird.proto.codegen.LzoProtobufB64LineOutputFormatGenerator
  - com.twitter.elephantbird.proto.codegen.LzoProtobufBlockInputFormatGenerator
  - com.twitter.elephantbird.proto.codegen.LzoProtobufBlockOutputFormatGenerator
  - com.twitter.elephantbird.proto.codegen.ProtobufWritableGenerator
- Run the ant examples command. It will create a gen-java folder in the examples/src folder. The new folder contains the source files of the input and output format classes. Here is the folder structure:
  - com/test/proto/mapreduce/input
  - com/test/proto/mapreduce/io
  - com/test/proto/mapreduce/output
Hadoop MapReduce for Protocol Buffers with LZO Compression
Hadoop uses MapReduce for processing large datasets. The following steps explain how to use protocol buffers with LZO compression in a MapReduce program to process large datasets.
- Define an LZO protobuf output format class to store plain records in LZO-compressed protobuf form. This example is the output format class for the Person protobuf.
public class LzoPersonProtobufBlockOutputFormat extends LzoProtobufBlockOutputFormat<Person> {
  public LzoPersonProtobufBlockOutputFormat() {
    setTypeRef(new TypeRef<Person>() {});
  }
}
- Define an LZO protobuf input format class to read the LZO-compressed protobuf records. This example is the input format class for the Person protobuf.
public class LzoPersonProtobufBlockInputFormat extends LzoProtobufBlockInputFormat<Person> {
  public LzoPersonProtobufBlockInputFormat() {
    setTypeRef(new TypeRef<Person>() {});
  }
}
- Here is the Hadoop Job configuration for writing records in LZO-compressed protobuf form.
TextInputFormat.addInputPaths(job, inputFilePath);
LzoPersonProtobufBlockOutputFormat.setOutputPath(job, outputFilePath);
job.setMapOutputValueClass(ProtobufPersonWritable.class);
job.setMapOutputKeyClass(NullWritable.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(ProtobufPersonWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(ProtoMapper.class);
LzoPersonProtobufBlockOutputFormat.setCompressOutput(job, true);
job.setOutputFormatClass(LzoPersonProtobufBlockOutputFormat.class);
- The ProtobufPersonWritable class wraps each record so that it can be written as an LZO-compressed protobuf record to the output file when the compression option is enabled.
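The write job's ProtoMapper referenced in the configuration could look like the sketch below. The tab-separated input layout (name, id, email) is my assumption, and it assumes the generated ProtobufPersonWritable class from the gen-java folder is on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.test.proto.AddressBookProtos.Person;
import com.test.proto.mapreduce.io.ProtobufPersonWritable;

public class ProtoMapper
    extends Mapper<LongWritable, Text, NullWritable, ProtobufPersonWritable> {

  private final ProtobufPersonWritable writable = new ProtobufPersonWritable();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumed input layout: name <TAB> id <TAB> email
    String[] fields = value.toString().split("\t");
    Person person = Person.newBuilder()
        .setName(fields[0])
        .setId(Integer.parseInt(fields[1]))
        .setEmail(fields[2])
        .build();
    // Wrap the protobuf record; the LZO block output format compresses it
    writable.set(person);
    context.write(NullWritable.get(), writable);
  }
}
```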
- Here is the Hadoop Job configuration for reading LZO-compressed protobuf records.
LzoPersonProtobufBlockInputFormat.addInputPaths(job, args[0]);
FileOutputFormat.setOutputPath(job, outputPath);
job.setMapOutputValueClass(Text.class);
job.setMapOutputKeyClass(NullWritable.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(LzoPersonProtobufBlockInputFormat.class);
job.setMapperClass(ProtoMapper.class);
job.setOutputFormatClass(TextOutputFormat.class);
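The read job's ProtoMapper (a separate class from the write job's mapper) would do the reverse: unwrap each decompressed Person record and emit it as text. This is a sketch under the assumption that LzoProtobufBlockInputFormat delivers LongWritable keys and the generated protobuf writable as values; the output layout is mine.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.test.proto.AddressBookProtos.Person;
import com.test.proto.mapreduce.io.ProtobufPersonWritable;

public class ProtoMapper
    extends Mapper<LongWritable, ProtobufPersonWritable, NullWritable, Text> {

  @Override
  protected void map(LongWritable key, ProtobufPersonWritable value, Context context)
      throws IOException, InterruptedException {
    // The input format has already decompressed and parsed the record
    Person person = value.get();
    context.write(NullWritable.get(),
        new Text(person.getName() + "\t" + person.getId() + "\t" + person.getEmail()));
  }
}
```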
Benefits of LZO and Protobuf Compression
When you have a working protocol buffers and LZO setup in your Hadoop environment, you can take advantage of the following benefits:
- You save disk space, because data is stored in compressed form in every HDFS location.
- Because LZO files are splittable, each split can be processed in parallel across the cluster.
- LZO supports fast decompression.
- Protocol buffers store each record as a compact serialized object.