Apache Hadoop and Hive for Big Data Storage and Processing

Big data is an aptly named concept. Nowadays, huge amounts of data and information are generated by hundreds of millions of applications, devices and human beings all round the world. Be they user’s personal data maintained by social networking sites, public sites hosting blogs, weather-related data generated by different types of sensors, or customer and product information maintained by a large enterprise organization — they are all contributing to big data.

The amount of data sets and the volume of information being generated, processed and analyzed particularly for business intelligence and decision making is growing rapidly as well, making traditional warehousing solutions expensive, difficult to leverage and mostly ineffective. There clearly is a need for a generalized, flexible and scalable tool that can cope with the challenges of big data.

When developers deal with big data, the challenge generally boils down to either:

  • storing the data in a way so that it is easy to access and manage
  • processing the whole set of data in a way that makes the processing easy, efficient and faster

Enter Hadoop for Big Data

The Apache Hadoop is a popular software framework that supports data-intensive distributed applications. It is a map-reduce implementation inspired by Google’s MapReduce and Google File System (GFS) papers. Hadoop is written in Java, and many large-scale applications and organizations (e.g. Yahoo and Facebook) use it to run large distributed computations on commodity hardware. It is designed in a way that it can scale up from a single server node to thousands of machines, each offering local computation and storage. All these features and capabilities make Hadoop the best candidate for developers dealing with big data.

However, the map-reduce programming used by Hadoop is very low level and Hadoop lacks the SQL-like expressiveness of query languages, which forces developers to spend a lot of time writing programs for even simple analyses. Hadoop also is not easy for those developers who are not familiar with the map-reduce concept. It requires them to write custom programs for each operation, even for simple tasks like getting the number of rows from a log or averages over some columns. Even the code generated is hard to maintain and reuse for different applications.

Apache Hive, a NoSQL type of database system that runs over Hadoop, overcomes these limitations and provides a neat and simple interface. Hive efficiently meets the challenges of storing, managing and processing big data, which is difficult with traditional RDBMS solutions.

Apache Hive Features

The following features enable Hive to meet big data challenges such as storing, managing and processing large data sets.

Powerful CLI and HiveQL

Hive supports a SQL-like query language known as the Hive query language (HiveQL) over one or multiple data files located either in a local file system or in HDFS. HiveQL runs over the Hadoop map-reduce framework itself, but hides the complexity from the developer. HiveQL is composed of a subset of SQL features and some useful extensions that are helpful for batch processing systems.

HiveQL supports basic SQL-like features such as CREATE tables, DROP tables, SELECT … FROM … WHERE clauses, various types of joins (inner, left outer, right outer and outer joins), Cartesian products, GROUP BY, SORT BY, aggregations, union and many useful functions on primitive and complex data types. Metadata browsing features such as list databases, tables and so on are also provided. This enables developers not familiar with Hadoop or MapReduce to begin querying the system right away through the Hive CLI (command line interface).

However, HiveQL does have some limitations compared with traditional RDBMS SQL. For instance, Hive currently does not support inserting data into an existing table. All insert operations into a table or partition overwrite the existing data.

Managing a Wide Variety of Data

Using Hive, developers can structure and map file data into RDBMS concepts such as tables, columns, rows, and partitions. Other than the primitive data types, such as boolean, integers, floats, doubles, strings and so on, Hive also supports all the major complex types such as list, map and struct. Even more complex types can be generated by arbitrarily combining these types.

Hive and SerDe

SerDe, the Hive Serialization/Deserialization module, takes an implementation of the SerDe Java interface provided by the user and associates it to a Hive table or partition. This enables a developer to interpret and query custom data formats easily with HiveQL.

The default SerDe implementation in Hive assumes that the rows are delimited by a newline (ASCII code 13) and the columns within a row are delimited by Ctrl-A (ASCII code 1). The SerDe can also be used to read data that uses any other delimiter character between columns using regular expression (e.g. ([^ ]*) ([^ ]*)) provided at the time of creating table.

Setting Up Hive Over Hadoop

Setting up Hive on Hadoop is a multi-step process that requires a few technologies:

  1. Hadoop: Latest available version of Hadoop already installed.
  2. Java 1.6 or above: Hadoop and Hive can run only on Java6 or above.
  3. SSH client: To run the hive commands over the Hadoop machine

The following steps explain the procedures needed for installing and configuring Apache Hive over a Linux system.

  1. Download the most recent stable release of Hive from one of the Apache download mirrors on Hive Releases page.
  2. Unpack the tarball using the following command. This will result in the creation of a subdirectory named hive-x.y.z.
         $ tar -xzvf hive-x.y.z.tar.gz
  3. Make sure that the environment variable HADOOP_HOME is defined with the Hadoop installation directory. If not, use the following command to set the same.
         $ export HADOOP_HOME=

    Similarly, set the environment variable HIVE_HOME to point to the installation directory as:

         $ cd hive-x.y.z     $ export HIVE_HOME={{pwd}}
  4. Optionally, you can add $HIVE_HOME/bin to the terminal environment variable PATH:
         $ export PATH=$HIVE_HOME/bin:$PATH
  5. To run Hive, you need to create a Hive metastore warehouse directory in HDFS and provide the appropriate permission using these commands.
         $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp     $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp     $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse     $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
  6. After doing all the above configurations successfully, use the following command to start Hive CLI from the shell:
         $ $HIVE_HOME/bin/hive

All the metadata for Hive tables and partitions are stored in Hive Metastore. The Hive metastore can be set up in three different ways:

  1. Embedded Metastore
  2. Local Metastore
  3. Remote Metastore

The Hive configuration variables are defined in an XML file HIVE_HOME/conf/hive-default.xml. They can be changed by (re-)defining them in HIVE_HOME /conf/hive-site.xml. For local metastore setup, each Hive Client will open a connection to the datastore and make SQL queries against it for getting the metadata information. The following table lists a few important configuration parameters for setting up a metastore in a local MySQL server.

Config Param Config Value
javax.jdo.option.ConnectionURL jdbc:mysql://server_host:3306/db_name
javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver
hive.metastore.local True
hive.metastore.warehouse.dir /user/hive/warehouse
Table 1. Configuration Parameters for a Metastore in MySQL

You should change the values based on your own setup.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin


The Latest

your company's audio

4 Areas of Your Company Where Your Audio Really Matters

Your company probably relies on audio more than you realize. Whether you’re creating a spoken text message to a colleague or giving a speech, you want your audio to shine. Otherwise, you could cause avoidable friction points and potentially hurt your brand reputation. For example, let’s say you create a

chrome os developer mode

How to Turn on Chrome OS Developer Mode

Google’s Chrome OS is a popular operating system that is widely used on Chromebooks and other devices. While it is designed to be simple and user-friendly, there are times when users may want to access additional features and functionality. One way to do this is by turning on Chrome OS

homes in the real estate industry

Exploring the Latest Tech Trends Impacting the Real Estate Industry

The real estate industry is changing thanks to the newest technological advancements. These new developments — from blockchain and AI to virtual reality and 3D printing — are poised to change how we buy and sell homes. Real estate brokers, buyers, sellers, wholesale real estate professionals, fix and flippers, and beyond may