Hadoop, Protocol Buffers and LZO for Bulk Data Processing : Page 2

Developers who use Hadoop for big data processing have some productive options for data serialization, data compression and class formatting.





Hadoop MapReduce for Protocol Buffers with LZO Compression

Hadoop uses MapReduce to process large datasets. The following steps explain how to use protobuf records with LZO compression in a MapReduce program.

  1. Define an LZO protobuf output format class so that plain records are stored as LZO-compressed protobuf records. This example shows the output format class for the Person protobuf.

    public class LzoPersonProtobufBlockOutputFormat extends LzoProtobufBlockOutputFormat<Person> {
        public LzoPersonProtobufBlockOutputFormat() {
            setTypeRef(new TypeRef<Person>() { });
        }
    }

  2. Define an LZO protobuf input format class to read the LZO-compressed protobuf records. This example shows the input format class for the Person protobuf.

    public class LzoPersonProtobufBlockInputFormat extends LzoProtobufBlockInputFormat<Person> {
        public LzoPersonProtobufBlockInputFormat() {
            setTypeRef(new TypeRef<Person>() { });
        }
    }

  3. Configure the Hadoop job that reads plain text and writes the records in LZO-compressed protobuf form.

    // Input: plain text files
    TextInputFormat.addInputPaths(job, Inputfilepath);
    job.setInputFormatClass(TextInputFormat.class);

    // Mapper and its output types
    job.setMapperClass(ProtoMapper.class);
    job.setMapOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(ProtobufPersonWritable.class);

    // Output: LZO-compressed protobuf records
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(ProtobufPersonWritable.class);
    job.setOutputFormatClass(LzoPersonProtobufBlockOutputFormat.class);
    LzoPersonProtobufBlockOutputFormat.setOutputPath(job, outputfilePath);
    LzoPersonProtobufBlockOutputFormat.setCompressOutput(job, true);

  4. The ProtobufPersonWritable class wraps each plain record so that it can be written to the output file as an LZO-compressed protobuf record. Use it as the output value class when the compression option is enabled.
  5. Configure the Hadoop job that reads the LZO-compressed protobuf records.

    // Input: LZO-compressed protobuf records
    LzoPersonProtobufBlockInputFormat.addInputPaths(job, args[0]);
    job.setInputFormatClass(LzoPersonProtobufBlockInputFormat.class);

    // Mapper and its output types
    job.setMapperClass(ProtoMapper.class);
    job.setMapOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(Text.class);

    // Output: plain text files
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, outputPath);
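
The write job in step 3 relies on a mapper to turn each text line into a Person record, and the mapper's source is not shown on this page. As a rough sketch of the per-line parsing such a mapper might perform, the plain-Java helper below splits one delimited line into fields. The class name PersonLineParser, the field names (id, name, email) and the comma delimiter are all assumptions made for illustration, not part of the original example. In a real mapper the parsed values would presumably be copied into the protobuf-generated Person builder and wrapped in the writable before being emitted.

```java
import java.util.Optional;

/** Hypothetical helper: parses one "id,name,email" text line into fields. */
final class PersonLineParser {

    /** Plain holder for the parsed values (the field names are assumptions). */
    static final class PersonFields {
        final int id;
        final String name;
        final String email;

        PersonFields(int id, String name, String email) {
            this.id = id;
            this.name = name;
            this.email = email;
        }
    }

    /** Returns the parsed fields, or an empty Optional if the line is malformed. */
    static Optional<PersonFields> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        // A limit of -1 keeps trailing empty fields, so the count check is reliable.
        String[] parts = line.split(",", -1);
        if (parts.length != 3) {
            return Optional.empty();
        }
        try {
            int id = Integer.parseInt(parts[0].trim());
            return Optional.of(new PersonFields(id, parts[1].trim(), parts[2].trim()));
        } catch (NumberFormatException e) {
            // Non-numeric id: report the line as malformed instead of throwing.
            return Optional.empty();
        }
    }
}
```

Returning an empty Optional for a malformed line lets the caller skip bad input rather than fail the whole task, which is one common choice for bulk jobs; counting the skipped lines would be a reasonable addition.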

Benefits of Protobuf with LZO Compression

Once you have a working Protocol Buffers and LZO setup in your Hadoop environment, you can take advantage of the following benefits:

  • You save disk space, because the data is stored in compressed form everywhere it resides in HDFS.
  • Because LZO files are splittable, Hadoop can process the splits of a single file in parallel across the cluster.
  • LZO supports fast decompression.
  • Protocol Buffers store each record as a compact serialized object.
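
To make the splittability point concrete, here is a simplified sketch of how a file can be divided into byte ranges that separate tasks work on independently. It is only a model: the class SplitPlanner and its method are hypothetical, and Hadoop's real split computation also takes block locations and configured split sizes into account, while LZO input formats additionally align splits with compressed block boundaries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of dividing a file into fixed-size splits. */
final class SplitPlanner {

    /** Returns {offset, length} pairs that together cover fileLength bytes. */
    static List<long[]> planSplits(long fileLength, long splitSize) {
        if (fileLength < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and splitSize > 0");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            // The last split may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }
}
```

For example, a 300-byte file with a 128-byte split size yields three ranges: two of 128 bytes and a final one of 44 bytes. A format that is not splittable would force a single task to read the entire file.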

Sivakumar Kuppusamy is a product technical lead involved in the design and development of Java EE applications.