Processing Large Datasets in a Grid Environment

Processing Large Datasets in a Grid Environment

Distributed computing enables large-scale dataset processing in a grid environment. An interconnected network of individual computers forms the core. Each computer has its own processor and memory, which are used to process assigned tasks separately in the grid. Large datasets are generally split into equally sized blocks, each of which is stored in an individual computer in the grid.

Larger data sets can occupy a terabyte or more of disk space. To avoid the problem of space crunch, dataset blocks are further compressed and stored in each computer in the grid and then decompressed and processed in each computer when needed. This process of compression and decompression is vital to larger data set processing.

In addition to the compression/decompression process, object conversion can enable fast record writing, reading and updating. By converting each record in the dataset block to object form and then compressing and storing the corresponding block file, you can boost performance when handling larger data sets. A protocol buffer is an effective way to handle this object conversion.

In this article, we provide an overview of our solution for handling a large data set file in a grid environment and explain the benefits along the way. We use LZO compression/decompression and a protocol buffer based on the open source framework from Google, Protobuf. Our approach supports object serialization and deserialization by basically converting file records into serialized objects and vice versa.

LZO Compression and Decompression

In a distributed computing scenario, dataset blocks are stored across the network — replicated on each computer system in the cluster — for parallel processing. A job (or one process in distributed computing) is executed across the clustered systems or networked systems in parallel. The job under execution picks up the dataset block file of each computing system in the cluster. This process requires considerable disk space, which is where the LZO compression technique comes in. LZO compression is based on data loss compression and the memory-based fast decompression methodology.

Twitter has an internal project to build a solution for splittable LZO for Hadoop. We created our proof of concept based on the Twitter/Hadoop project but instead of using Hadoop to maintain bulk data we used the Protobuf protocol buffer for fast data serialization and deserialization. The combination of a protocol buffer and LZO compression improved the scalability of large dataset processing for us.

Based on the Twitter/Hadoop project findings, here are some data points about how LZO compression compares with the GZIP compression/decompression algorithm.

  1. The LZO compression ratio is less than GZIP’s. For example, GZIP compressed a 64KB file to 20KB, while LZO compressed the same file to 35KB size file.
  2. The decompression ratio of LZO is faster than GZIP decompression, and it supports splittable files for decompression.
  3. LZO compression is data-loss compression.

Processing Large Datasets in a Single-Node System

To process a large dataset in a single-node system, we opted for the multithreading option. Large dataset file records are converted to a proto object and stored in an equally sized block file. We use LZO compression to compress the block file and storing it in the system. For faster processing, we use n number of threads to process the block file based on random memory availability in the single-node system. Figure 1 illustrates the compression flow for a large dataset file in a single-node system.

After splitting the large dataset file into multiple block files, multiple proto object block files will be created for parallel processing (based on the block size). These proto object block files are compressed using LZO compression. They are also used for parallel processing in single- and multi-node systems.

Our previous DevX article explains how to use a protocol buffer for object management.

Figure 2 illustrates the decompression flow.


Figure 2. Decompression Flow for a Large Dataset File

The compressed block file is decompressed and converted back to the proto object. For decompression, we used multithreading to decompress the compressed block file and read the proto object from an uncompressed file. Here are the steps for reading the LZO compressed block file for decompression.

  1. Check the available free memory from the single-node system.
  2. Calculate the possible thread count based on block size and free memory. Here is the code for determining thread count.
    public static int getNoOfThreadUsage(int blockSize){Runtime runtime = Runtime.getRuntime ();return (int) (runtime.freeMemory()/blockSize);}
  3. Get the file count of the compressed block file, which is called a block count.
  4. Calculate the parallel thread limit for uncompressing the compressed block file.
    threadLimit = noThread < blockCount? noThread : blockCount;
  5. Set the thread count value for parallel execution and increase the value until it reaches the thread limit. This means uncompress all compressed block files one by one based on the available memory in the single-node system. This helps to avoid deadlocks and out-of-memory issues.

Putting It All Together: LZO Compression Use Case

This section walks through the code for an LZO compression use case of our solution. This sample use case program provides parallel processing in a single-node system using large dataset files. (Click here to download the source code.)

The following employee proto file is used for the proto object.

option java_package = "com.emp";option java_outer_classname = "EmpProtos";message Employee {required string name = 1;required int32 id = 2; // Unique ID number for this employeeoptional string email = 3;}message Emprecord {repeated Employee emp = 1;}

The following code snippet uses a proto Java file for object management, and it writes the employee object into the output file as a serialized form.

package com.emp;import java.io.*;import com.emp.EmpProtos.*; //importing the Java proto which is created by proto compiler.//writing the employee record into the output filepublic static void addEmployee(int id ,String name,String email, FileOutputStream fout) throws Exception {Emprecord.Builder recBuilder = Emprecord.newBuilder();Employee.Builder empBuilder = Employee.newBuilder();empBuilder.setName(name);empBuilder.setId(id);empBuilder.setEmail(email);recBuilder.addEmp(empBuilder.build());recBuilder.build().writeTo(fout);}

The following code snippet helps read the employee record in the input file.

//Reading the employee record from the filepublic static void readEmprecord(FileInputStream fin) throws Exception {Emprecord empList =Emprecord.parseFrom(fin);System.out.println("==Employee List ==");if(empList != null){for (Employee emp: empList.getEmpList()) {System.out.println(emp.getId()+" "+emp.getName()+" "+getEmail());}}}

The following code snippet is for compressing the block of the proto object file. We use the lzocomdecomp.jar to compress and decompress the proto object block file. The code shows the syntax for calling the LZO compression class. This class will return the .(dot)LZO extn file, which means it compressed via LZO compression.

LzoCompress.compress( )

The following code snippet is for uncompressing the compressed proto object block file. We create an object for the class and pass the value as a compressed proto object file. This class supports the threading and you can customize it based on your proto object for reading from the block. The output of the file extension is .(dot)unlzo.

FileUncompress( )

The Required JARs

The following are the supported JAR files required for executing the sample use case. Click here to download the source code.

  • protobuf-java-.jar
  • protobuf-format-java-1.1.jar
  • lzocomdecomp.jar: This is our own JAR for LZO compression and decompression.
  • Sampleusecase_emp.jar: This JAR contains multithreading way of processing large dataset.

Here is the syntax for executing the sample use program.

Empaddread   

The input file is comma separated in this format:

,,

Here is the field format:

  • Id: integer
  • name: string
  • email: string
devx-admin

devx-admin

Share the Post:
Learn Web Security

An Easy Way to Learn Web Security

The Web Security Academy has recently introduced new educational courses designed to offer a comprehensible and straightforward journey through the intricate realm of web security.

Military Drones Revolution

Military Drones: New Mobile Command Centers

The Air Force Special Operations Command (AFSOC) is currently working on a pioneering project that aims to transform MQ-9 Reaper drones into mobile command centers

Tech Partnership

US and Vietnam: The Next Tech Leaders?

The US and Vietnam have entered into a series of multi-billion-dollar business deals, marking a significant leap forward in their cooperation in vital sectors like

Huge Savings

Score Massive Savings on Portable Gaming

This week in tech bargains, a well-known firm has considerably reduced the price of its portable gaming device, cutting costs by as much as 20

Cloudfare Protection

Unbreakable: Cloudflare One Data Protection Suite

Recently, Cloudflare introduced its One Data Protection Suite, an extensive collection of sophisticated security tools designed to protect data in various environments, including web, private,

Drone Revolution

Cool Drone Tech Unveiled at London Event

At the DSEI defense event in London, Israeli defense firms exhibited cutting-edge drone technology featuring vertical-takeoff-and-landing (VTOL) abilities while launching two innovative systems that have

Learn Web Security

An Easy Way to Learn Web Security

The Web Security Academy has recently introduced new educational courses designed to offer a comprehensible and straightforward journey through the intricate realm of web security. These carefully designed learning courses

Military Drones Revolution

Military Drones: New Mobile Command Centers

The Air Force Special Operations Command (AFSOC) is currently working on a pioneering project that aims to transform MQ-9 Reaper drones into mobile command centers to better manage smaller unmanned

Tech Partnership

US and Vietnam: The Next Tech Leaders?

The US and Vietnam have entered into a series of multi-billion-dollar business deals, marking a significant leap forward in their cooperation in vital sectors like artificial intelligence (AI), semiconductors, and

Huge Savings

Score Massive Savings on Portable Gaming

This week in tech bargains, a well-known firm has considerably reduced the price of its portable gaming device, cutting costs by as much as 20 percent, which matches the lowest

Cloudfare Protection

Unbreakable: Cloudflare One Data Protection Suite

Recently, Cloudflare introduced its One Data Protection Suite, an extensive collection of sophisticated security tools designed to protect data in various environments, including web, private, and SaaS applications. The suite

Drone Revolution

Cool Drone Tech Unveiled at London Event

At the DSEI defense event in London, Israeli defense firms exhibited cutting-edge drone technology featuring vertical-takeoff-and-landing (VTOL) abilities while launching two innovative systems that have already been acquired by clients.

2D Semiconductor Revolution

Disrupting Electronics with 2D Semiconductors

The rapid development in electronic devices has created an increasing demand for advanced semiconductors. While silicon has traditionally been the go-to material for such applications, it suffers from certain limitations.

Cisco Growth

Cisco Cuts Jobs To Optimize Growth

Tech giant Cisco Systems Inc. recently unveiled plans to reduce its workforce in two Californian cities, with the goal of optimizing the company’s cost structure. The company has decided to

FAA Authorization

FAA Approves Drone Deliveries

In a significant development for the US drone industry, drone delivery company Zipline has gained Federal Aviation Administration (FAA) authorization, permitting them to operate drones beyond the visual line of

Mortgage Rate Challenges

Prop-Tech Firms Face Mortgage Rate Challenges

The surge in mortgage rates and a subsequent decrease in home buying have presented challenges for prop-tech firms like Divvy Homes, a rent-to-own start-up company. With a previous valuation of

Lighthouse Updates

Microsoft 365 Lighthouse: Powerful Updates

Microsoft has introduced a new update to Microsoft 365 Lighthouse, which includes support for alerts and notifications. This update is designed to give Managed Service Providers (MSPs) increased control and

Website Lock

Mysterious Website Blockage Sparks Concern

Recently, visitors of a well-known resource website encountered a message blocking their access, resulting in disappointment and frustration among its users. While the reason for this limitation remains uncertain, specialists

AI Tool

Unleashing AI Power with Microsoft 365 Copilot

Microsoft has recently unveiled the initial list of Australian clients who will benefit from Microsoft 365 (M365) Copilot through the exclusive invitation-only global Early Access Program. Prominent organizations participating in

Microsoft Egnyte Collaboration

Microsoft and Egnyte Collaboration

Microsoft has revealed a collaboration with Egnyte, a prominent platform for content cooperation and governance, with the goal of improving real-time collaboration features within Microsoft 365 and Microsoft Teams. This

Best Laptops

Top Programming Laptops of 2023

In 2023, many developers prioritize finding the best laptop for programming, whether at home, in the workplace, or on the go. A high-performing, portable, and user-friendly laptop could significantly influence

Renaissance Gaming Magic

AI Unleashes A Gaming Renaissance

In recent times, artificial intelligence has achieved remarkable progress, with resources like ChatGPT becoming more sophisticated and readily available. Pietro Schirano, the design lead at Brex, has explored the capabilities

New Apple Watch

The New Apple Watch Ultra 2 is Awesome

Apple is making waves in the smartwatch market with the introduction of the highly anticipated Apple Watch Ultra 2. This revolutionary device promises exceptional performance, robust design, and a myriad

Truth Unveiling

Unveiling Truths in Bowen’s SMR Controversy

Tony Wood from the Grattan Institute has voiced his concerns over Climate and Energy Minister Chris Bowen’s critique of the Coalition’s support for small modular nuclear reactors (SMRs). Wood points

Avoiding Crisis

Racing to Defy Looming Financial Crisis

Chinese property developer Country Garden is facing a liquidity challenge as it approaches a deadline to pay $15 million in interest associated with an offshore bond. With a 30-day grace

Open-Source Development

Open-Source Software Development is King

The increasingly digital world has led to the emergence of open-source software as a critical factor in modern software development, with more than 70% of the infrastructure, products, and services

Home Savings

Sensational Savings on Smart Home Security

For a limited time only, Amazon is offering massive discounts on a variety of intelligent home devices, including products from its Ring security range. Running until October 2 or while

Apple Unleashed

A Deep Dive into the iPhone 15 Pro Max

Apple recently unveiled its groundbreaking iPhone 15 Pro and iPhone 15 Pro Max models, featuring a revolutionary design, extraordinary display technology, and unrivaled performance. These new models are the first

Renewable Crypto Miners

Crypto Miners Embrace Renewable Energy?

As the cryptocurrency sector deals with the fallout from the FTX and Celsius exchange collapses, Bitcoin miners are increasingly exploring alternative energy sources to reduce expenses and maintain profitability. Specialists

Laptop Savings

The HP Omen 16 is a Gamer’s Dream

Best Buy is currently offering an unbeatable deal on the HP Omen 16 gaming laptop, giving potential buyers the chance to save a significant $720 on their purchase. Originally priced

How to Check for Vulnerabilities in Exchange Server

It is imperative to keep your systems and infrastructure up-to-date to mitigate security issues and loopholes, and to protect them against any known vulnerabilities and security risks. There are many