Login | Register   
RSS Feed
Download our iPhone app
Browse DevX
Sign up for e-mail newsletters from DevX

By submitting your information, you agree that devx.com may send you DevX offers via email, phone and text message, as well as email offers about other products and services that DevX believes may be of interest to you. DevX will process your information in accordance with the Quinstreet Privacy Policy.


Processing Large Datasets in a Grid Environment

Using LZO compression and a protocol buffer can improve the speed and scalability of your large dataset processing.




Building the Right Environment to Support AI, Machine Learning and Deep Learning

Distributed computing enables large-scale dataset processing in a grid environment. An interconnected network of individual computers forms the core. Each computer has its own processor and memory, which are used to process assigned tasks separately in the grid. Large datasets are generally split into equally sized blocks, each of which is stored in an individual computer in the grid.

Larger data sets can occupy a terabyte or more of disk space. To avoid the problem of space crunch, dataset blocks are further compressed and stored in each computer in the grid and then decompressed and processed in each computer when needed. This process of compression and decompression is vital to larger data set processing.

In addition to the compression/decompression process, object conversion can enable fast record writing, reading and updating. By converting each record in the dataset block to object form and then compressing and storing the corresponding block file, you can boost performance when handling larger data sets. A protocol buffer is an effective way to handle this object conversion.

In this article, we provide an overview of our solution for handling a large data set file in a grid environment and explain the benefits along the way. We use LZO compression/decompression and a protocol buffer based on the open source framework from Google, Protobuf. Our approach supports object serialization and deserialization by basically converting file records into serialized objects and vice versa.

LZO Compression and Decompression

In a distributed computing scenario, dataset blocks are stored across the network -- replicated on each computer system in the cluster -- for parallel processing. A job (or one process in distributed computing) is executed across the clustered systems or networked systems in parallel. The job under execution picks up the dataset block file of each computing system in the cluster. This process requires considerable disk space, which is where the LZO compression technique comes in. LZO compression is based on data loss compression and the memory-based fast decompression methodology.

Twitter has an internal project to build a solution for splittable LZO for Hadoop. We created our proof of concept based on the Twitter/Hadoop project but instead of using Hadoop to maintain bulk data we used the Protobuf protocol buffer for fast data serialization and deserialization. The combination of a protocol buffer and LZO compression improved the scalability of large dataset processing for us.

Based on the Twitter/Hadoop project findings, here are some data points about how LZO compression compares with the GZIP compression/decompression algorithm.

  1. The LZO compression ratio is less than GZIP's. For example, GZIP compressed a 64KB file to 20KB, while LZO compressed the same file to 35KB size file.
  2. The decompression ratio of LZO is faster than GZIP decompression, and it supports splittable files for decompression.
  3. LZO compression is data-loss compression.

Processing Large Datasets in a Single-Node System

To process a large dataset in a single-node system, we opted for the multithreading option. Large dataset file records are converted to a proto object and stored in an equally sized block file. We use LZO compression to compress the block file and storing it in the system. For faster processing, we use n number of threads to process the block file based on random memory availability in the single-node system. Figure 1 illustrates the compression flow for a large dataset file in a single-node system.

Click here for larger image

Figure 1. Compression Flow for a Large Dataset File

After splitting the large dataset file into multiple block files, multiple proto object block files will be created for parallel processing (based on the block size). These proto object block files are compressed using LZO compression. They are also used for parallel processing in single- and multi-node systems.

Our previous DevX article explains how to use a protocol buffer for object management.

Figure 2 illustrates the decompression flow.

Figure 2. Decompression Flow for a Large Dataset File

The compressed block file is decompressed and converted back to the proto object. For decompression, we used multithreading to decompress the compressed block file and read the proto object from an uncompressed file. Here are the steps for reading the LZO compressed block file for decompression.

  1. Check the available free memory from the single-node system.

  2. Calculate the possible thread count based on block size and free memory. Here is the code for determining thread count.

    public static int getNoOfThreadUsage(int blockSize){
    Runtime runtime = Runtime.getRuntime ();
    return (int) (runtime.freeMemory()/blockSize);
  3. Get the file count of the compressed block file, which is called a block count.

  4. Calculate the parallel thread limit for uncompressing the compressed block file.

    threadLimit = noThread < blockCount? noThread : blockCount;
  5. Set the thread count value for parallel execution and increase the value until it reaches the thread limit. This means uncompress all compressed block files one by one based on the available memory in the single-node system. This helps to avoid deadlocks and out-of-memory issues.

Comment and Contribute






(Maximum characters: 1200). You have 1200 characters left.



Thanks for your registration, follow us on our social networks to keep up-to-date