

Hadoop, Protocol Buffers and LZO for Bulk Data Processing

Developers who use Hadoop for big data processing have productive options for data serialization, data compression and input/output format class generation.


Hadoop, the distributed computing framework, uses its own distributed file system (HDFS) to handle large datasets. But developers who use Hadoop have options for other aspects of their bulk data processing applications. In particular, they can use protocol buffers for data serialization, LZO for data compression and Elephant-Bird for generating input and output format classes:

  • Protocol buffers are a flexible, efficient and automated mechanism for serializing structured data -- think XML, only smaller, faster and simpler.
  • LZO is a compression and decompression technique for structured data. It can compress large datasets to free space in HDFS, and its fast decompression speeds data processing in a distributed environment.
  • Elephant-Bird is a framework for generating Hadoop input and output format classes from Proto files.

This article explains how to set up Hadoop with protocol buffers and LZO for large dataset processing. Because LZO files are splittable in HDFS, you can store them across a Hadoop cluster for distributed processing.
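To see why binary serialization beats XML for bulk data, compare a hand-rolled, length-prefixed binary record with its XML equivalent. The sketch below is illustrative only and uses just the JDK; protocol buffers use a more sophisticated tag/varint wire format, but the size advantage comes from the same idea. The Person fields mirror the sample Proto file used later in this article.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class BinaryVsXml {
    // Encode a person record as length-prefixed binary -- the general idea
    // behind protocol buffers' compact wire format.
    static byte[] toBinary(String name, int id, String email) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(name);   // 2-byte length prefix + UTF-8 bytes
            out.writeInt(id);     // fixed 4 bytes (protobuf would use a varint)
            out.writeUTF(email);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with in-memory streams
        }
    }

    // The same record as XML, for size comparison.
    static byte[] toXml(String name, int id, String email) {
        String xml = "<person><name>" + name + "</name><id>" + id
                   + "</id><email>" + email + "</email></person>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bin = toBinary("John Doe", 42, "john@example.com");
        byte[] xml = toXml("John Doe", 42, "john@example.com");
        // The binary form is well under half the XML size for this record.
        System.out.println("binary: " + bin.length + " bytes, xml: " + xml.length + " bytes");
    }
}
```

Multiplied across billions of records in HDFS, that per-record overhead is why a compact binary format such as protocol buffers pays off.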

Enabling LZO in Hadoop

Take these steps to set up LZO compression in a Hadoop environment:

  1. Install the following .rpm files in your Hadoop system.
    • lzo-2.04-1.el5.rf.i386.rpm
    • lzo-devel-2.04-1.el5.rf.i386.rpm
    • hadoop-gpl-packaging-0.2.0-1.i386.rpm
  2. Add hadoop-lzo-0.4.15.jar to the lib folder of Hadoop.
  3. Register the compression codecs, including the LZO codec classes, in core-site.xml. This file is in the conf folder of Hadoop.
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>
    <property>
      <name>io.compression.codec.lzo.class</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
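Once the setup steps are complete, you can verify LZO end to end by compressing a file, loading it into HDFS, and indexing it. The LzoIndexer class ships in the hadoop-lzo jar; the file paths below are examples, and this assumes the lzop command-line tool is also installed on the node.

```shell
# Compress a local file with lzop
lzop -o /tmp/dataset.lzo /tmp/dataset.txt

# Load it into HDFS (example path)
hadoop fs -put /tmp/dataset.lzo /user/hadoop/input/

# Build an .index file next to the .lzo so Hadoop can split it
# across mappers; without the index, the whole file goes to one mapper.
hadoop jar $HADOOP_HOME/lib/hadoop-lzo-0.4.15.jar \
  com.hadoop.compression.lzo.LzoIndexer /user/hadoop/input/dataset.lzo
```

The indexing step is what makes LZO files splittable for distributed processing, as described above.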

Using Elephant-Bird in Hadoop

Elephant-Bird is a framework for generating protocol buffer- and LZO-based input and output format classes for dataset processing. These classes are generated from a Proto file. Here are the steps for generating input and output format classes:

  1. Copy and extract the kevinweil-elephant-bird-v2.1.8-8.zip file. It will create a kevinweil-elephant-bird-v2.1.8-8 folder.
  2. Install the Proto compiler (protoc), which you will use to create Java files from the Proto input file.
  3. Install the Ant build tool, which you will use to generate the input and output format classes.
  4. Add a Proto file in the kevinweil-elephant-bird-v2.1.8-8\examples\src\proto folder.
  5. Here is the Proto file of a sample called address_book.
    package com.test.proto;
    option java_outer_classname = "AddressBookProtos";
    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
    }
  6. The config-twadoop.yml file, located in the kevinweil-elephant-bird-v2.1.8-8\examples folder, sets the input and output format generators for the code generation (based on the Proto file). Here is a sample entry, where address_book is the Proto file name.
    address_book:
      - com.twitter.elephantbird.proto.codegen.LzoProtobufB64LineInputFormatGenerator
      - com.twitter.elephantbird.proto.codegen.LzoProtobufB64LineOutputFormatGenerator
      - com.twitter.elephantbird.proto.codegen.LzoProtobufBlockInputFormatGenerator
      - com.twitter.elephantbird.proto.codegen.LzoProtobufBlockOutputFormatGenerator
      - com.twitter.elephantbird.proto.codegen.ProtobufWritableGenerator
  7. Run the examples target with Ant. It creates a gen-java folder in the examples\src folder, containing the source files of the input and output format classes. Here is the folder structure.
    • com\test\proto\mapreduce\input
    • com\test\proto\mapreduce\io
    • com\test\proto\mapreduce\output
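Alongside the format classes, protoc generates an immutable AddressBookProtos.Person class with a builder-style API. The generated source is long; the hand-written sketch below is only an approximation of its shape (the real generated class also handles wire-format serialization and full required-field validation), matching the fields of the Person message above.

```java
// Hand-written approximation of the builder API protoc generates for
// the Person message in the sample address_book Proto file.
public class Person {
    private final String name;   // required string name = 1;
    private final int id;        // required int32 id = 2;
    private final String email;  // optional string email = 3;

    private Person(Builder b) {
        this.name = b.name;
        this.id = b.id;
        this.email = b.email;
    }

    public String getName() { return name; }
    public int getId() { return id; }
    public String getEmail() { return email; }
    public boolean hasEmail() { return email != null; } // optional field check

    public static Builder newBuilder() { return new Builder(); }

    public static class Builder {
        private String name;
        private int id;
        private String email;

        public Builder setName(String name) { this.name = name; return this; }
        public Builder setId(int id) { this.id = id; return this; }
        public Builder setEmail(String email) { this.email = email; return this; }

        public Person build() {
            if (name == null) throw new IllegalStateException("name is required");
            return new Person(this);
        }
    }
}
```

In a MapReduce job, code would typically build a message with Person.newBuilder().setName("John Doe").setId(1).build() and wrap it in one of the generated writable classes before emitting it.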
