Hadoop, Protocol Buffers and LZO for Bulk Data Processing

Hadoop, the distributed computing framework, uses its own distributed file system (HDFS) for handling large datasets. But developers who use Hadoop have some options for other aspects of their bulk data processing applications. In particular, they can use protocol buffers for data serialization, LZO for data compression, and Elephant Bird for generating input and output format classes:

  • Protocol buffers are a flexible, efficient and automated mechanism for serializing structured data — think XML, only smaller, faster and simpler.
  • LZO is a compression and decompression technique for structured data. It can compress large datasets to increase the available space in the HDFS system, and it supports fast decompression, which helps for fast data processing in a distributed environment.
  • Elephant Bird is a framework for generating input and output format classes.
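Protocol buffers owe much of their compactness to the base-128 "varint" wire encoding, which stores an integer in one byte per 7 bits of payload rather than at a fixed width. The following self-contained sketch illustrates the idea; it demonstrates the wire format only and is not the protobuf library itself.

```java
import java.io.ByteArrayOutputStream;

public class VarintDemo {
    // Encode a non-negative value as a protobuf-style base-128 varint:
    // each byte carries 7 payload bits; the high bit means "more bytes follow".
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 300 fits in two varint bytes (0xAC 0x02) instead of a fixed four-byte int
        byte[] b = encodeVarint(300);
        System.out.printf("%d bytes: %02X %02X%n", b.length, b[0], b[1]);
    }
}
```

Small values and field tags therefore cost only a byte or two on the wire, which is where the "smaller than XML" claim comes from.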

This article explains how to set up Hadoop with protocol buffers and LZO for large dataset processing. Because LZO files are splittable in HDFS, you can store them across a Hadoop cluster for distributed processing.

Enabling LZO in Hadoop

Take these steps to set up LZO compression in a Hadoop environment:

  1. Install the following .rpm files in your Hadoop system.
    • lzo-2.04-1.el5.rf.i386.rpm
    • lzo-devel-2.04-1.el5.rf.i386.rpm
    • hadoop-gpl-packaging-0.2.0-1.i386.rpm
  2. Add hadoop-lzo-0.4.15.jar to the lib folder of Hadoop.
  3. Add the following properties in core-site.xml. This file is in the conf folder of Hadoop.
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>
    <property>
      <name>io.compression.codec.lzo.class</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
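The installation steps above can be scripted roughly as follows. This is a sketch, not a definitive procedure: it assumes the .rpm and jar files are in the current directory and that $HADOOP_HOME points at the Hadoop install.

```shell
# Install the LZO runtime, its headers, and the GPL packaging for Hadoop
rpm -ivh lzo-2.04-1.el5.rf.i386.rpm
rpm -ivh lzo-devel-2.04-1.el5.rf.i386.rpm
rpm -ivh hadoop-gpl-packaging-0.2.0-1.i386.rpm

# Copy the LZO codec jar into Hadoop's lib folder ($HADOOP_HOME is assumed)
cp hadoop-lzo-0.4.15.jar "$HADOOP_HOME/lib/"
```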

Using Elephant-Bird in Hadoop

Elephant Bird is a framework for generating protocol buffer- and LZO compression-based input and output format classes for dataset processing. These input and output format classes are generated from a Proto file. Here are the steps for generating them:

  1. Copy and extract the kevinweil-elephant-bird-v2.1.8-8.zip file. It will create a kevinweil-elephant-bird-v2.1.8-8 folder.
  2. Install the Proto compiler, which compiles the Proto input file into Java source files.
  3. Install the Ant build tool, which builds the input and output format classes.
  4. Add a Proto file in the kevinweil-elephant-bird-v2.1.8-8/examples/src/proto folder.
  5. Here is the Proto file of a sample called address_book.
    package com.test.proto;

    option java_outer_classname = "AddressBookProtos";

    message Person {
      required string name = 1;
      required int32 id = 2;
      optional string email = 3;
    }
  6. The config-twadoop.yml file, located in the kevinweil-elephant-bird-v2.1.8-8/examples folder, is used to set the input and output format in the code generation setup (based on the Proto file). Here is a sample file format where address_book is the proto file name.
    address_book:
      - com.twitter.elephantbird.proto.codegen.LzoProtobufB64LineInputFormatGenerator
      - com.twitter.elephantbird.proto.codegen.LzoProtobufB64LineOutputFormatGenerator
      - com.twitter.elephantbird.proto.codegen.LzoProtobufBlockInputFormatGenerator
      - com.twitter.elephantbird.proto.codegen.LzoProtobufBlockOutputFormatGenerator
      - com.twitter.elephantbird.proto.codegen.ProtobufWritableGenerator
  7. Run the ant command ant-examples. It will create a gen-java folder in the examples/src folder. The new folder contains the source files of the input and output format classes. Here is the folder structure.
    • com/test/proto/mapreduce/input
    • com/test/proto/mapreduce/io
    • com/test/proto/mapreduce/output

Hadoop MapReduce for Protocol Buffers with LZO Compression

Hadoop uses MapReduce for processing large datasets. The following steps explain how to use protocol buffers with LZO compression in a MapReduce program.

  1. Define the LZO protobuf output format class, which stores plain records in LZO-compressed protobuf form. This example is for the Person protobuf output format class.
    public class LzoPersonProtobufBlockOutputFormat extends
          LzoProtobufBlockOutputFormat<Person> {
       public LzoPersonProtobufBlockOutputFormat() {
          setTypeRef(new TypeRef<Person>() {
          });
       }
    }
  2. Define the LZO protobuf input format class, which reads LZO-compressed protobuf records. This example is for the Person protobuf input format class.
    public class LzoPersonProtobufBlockInputFormat extends
          LzoProtobufBlockInputFormat<Person> {
       public LzoPersonProtobufBlockInputFormat() {
          setTypeRef(new TypeRef<Person>() {
          });
       }
    }
  3. Here is the Hadoop Job configuration for writing records in an LZO-compressed protobuf form.
    TextInputFormat.addInputPaths(job, Inputfilepath);
    LzoPersonProtobufBlockOutputFormat.setOutputPath(job, outputfilePath);
    job.setMapOutputValueClass(ProtobufPersonWritable.class);
    job.setMapOutputKeyClass(NullWritable.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(ProtobufPersonWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(ProtoMapper.class);
    LzoPersonProtobufBlockOutputFormat.setCompressOutput(job, true);
    job.setOutputFormatClass(LzoPersonProtobufBlockOutputFormat.class);
  4. The ProtobufPersonWritable class wraps a Person record so that, with compression enabled, it can be written to the output file as an LZO-compressed protobuf record.
  5. Here is the Hadoop Job configuration for reading an LZO-compressed protobuf record.
    LzoPersonProtobufBlockInputFormat.addInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, outputPath);
    job.setMapOutputValueClass(Text.class);
    job.setMapOutputKeyClass(NullWritable.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(LzoPersonProtobufBlockInputFormat.class);
    job.setMapperClass(ProtoMapper.class);
    job.setOutputFormatClass(TextOutputFormat.class);
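The ProtoMapper used in the job configurations above is not listed in this article; on the write path its task is to split each input text line into fields before building a Person message to emit as a ProtobufPersonWritable. A minimal, Hadoop-free sketch of that parsing step follows; the comma-separated name,id,email layout and the PersonLineParser name are assumptions for illustration.

```java
public class PersonLineParser {
    // Hypothetical stand-in for the generated AddressBookProtos.Person
    // message (name, id, email fields, as declared in the sample proto).
    static final class Person {
        final String name; final int id; final String email;
        Person(String name, int id, String email) {
            this.name = name; this.id = id; this.email = email;
        }
    }

    // Parse one input line. A real ProtoMapper would do this in each map()
    // call and emit a ProtobufPersonWritable rather than a plain object.
    static Person parse(String line) {
        String[] parts = line.split(",", 3);
        // email is optional in the proto definition, so it may be absent
        String email = parts.length > 2 ? parts[2] : null;
        return new Person(parts[0].trim(), Integer.parseInt(parts[1].trim()), email);
    }

    public static void main(String[] args) {
        Person p = parse("Alice,42,alice@example.com");
        System.out.println(p.name + " / " + p.id + " / " + p.email);
    }
}
```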

Benefits of LZO and Protobuf Compression

When you have a working protocol buffers and LZO setup in your Hadoop environment, you can take advantage of the following benefits:

  • You save disk space, because data is stored in compressed form at every HDFS location.
  • Because LZO files are splittable, each split can be processed in parallel across the cluster.
  • LZO supports fast decompression.
  • Protocol buffers store each record as a compact serialized object.