devxlogo

Apache Hadoop and Hive for Big Data Storage and Processing

Apache Hadoop and Hive for Big Data Storage and Processing

Big data is an aptly named concept. Nowadays, huge amounts of data and information are generated by hundreds of millions of applications, devices and human beings all round the world. Be they user’s personal data maintained by social networking sites, public sites hosting blogs, weather-related data generated by different types of sensors, or customer and product information maintained by a large enterprise organization — they are all contributing to big data.

The amount of data sets and the volume of information being generated, processed and analyzed particularly for business intelligence and decision making is growing rapidly as well, making traditional warehousing solutions expensive, difficult to leverage and mostly ineffective. There clearly is a need for a generalized, flexible and scalable tool that can cope with the challenges of big data.

When developers deal with big data, the challenge generally boils down to either:

  • storing the data in a way so that it is easy to access and manage
  • processing the whole set of data in a way that makes the processing easy, efficient and faster

Enter Hadoop for Big Data

The Apache Hadoop is a popular software framework that supports data-intensive distributed applications. It is a map-reduce implementation inspired by Google’s MapReduce and Google File System (GFS) papers. Hadoop is written in Java, and many large-scale applications and organizations (e.g. Yahoo and Facebook) use it to run large distributed computations on commodity hardware. It is designed in a way that it can scale up from a single server node to thousands of machines, each offering local computation and storage. All these features and capabilities make Hadoop the best candidate for developers dealing with big data.

However, the map-reduce programming used by Hadoop is very low level and Hadoop lacks the SQL-like expressiveness of query languages, which forces developers to spend a lot of time writing programs for even simple analyses. Hadoop also is not easy for those developers who are not familiar with the map-reduce concept. It requires them to write custom programs for each operation, even for simple tasks like getting the number of rows from a log or averages over some columns. Even the code generated is hard to maintain and reuse for different applications.

Apache Hive, a NoSQL type of database system that runs over Hadoop, overcomes these limitations and provides a neat and simple interface. Hive efficiently meets the challenges of storing, managing and processing big data, which is difficult with traditional RDBMS solutions.

Apache Hive Features

The following features enable Hive to meet big data challenges such as storing, managing and processing large data sets.

Powerful CLI and HiveQL

Hive supports a SQL-like query language known as the Hive query language (HiveQL) over one or multiple data files located either in a local file system or in HDFS. HiveQL runs over the Hadoop map-reduce framework itself, but hides the complexity from the developer. HiveQL is composed of a subset of SQL features and some useful extensions that are helpful for batch processing systems.

HiveQL supports basic SQL-like features such as CREATE tables, DROP tables, SELECT … FROM … WHERE clauses, various types of joins (inner, left outer, right outer and outer joins), Cartesian products, GROUP BY, SORT BY, aggregations, union and many useful functions on primitive and complex data types. Metadata browsing features such as list databases, tables and so on are also provided. This enables developers not familiar with Hadoop or MapReduce to begin querying the system right away through the Hive CLI (command line interface).

However, HiveQL does have some limitations compared with traditional RDBMS SQL. For instance, Hive currently does not support inserting data into an existing table. All insert operations into a table or partition overwrite the existing data.

Managing a Wide Variety of Data

Using Hive, developers can structure and map file data into RDBMS concepts such as tables, columns, rows, and partitions. Other than the primitive data types, such as boolean, integers, floats, doubles, strings and so on, Hive also supports all the major complex types such as list, map and struct. Even more complex types can be generated by arbitrarily combining these types.

Hive and SerDe

SerDe, the Hive Serialization/Deserialization module, takes an implementation of the SerDe Java interface provided by the user and associates it to a Hive table or partition. This enables a developer to interpret and query custom data formats easily with HiveQL.

The default SerDe implementation in Hive assumes that the rows are delimited by a newline (ASCII code 13) and the columns within a row are delimited by Ctrl-A (ASCII code 1). The SerDe can also be used to read data that uses any other delimiter character between columns using regular expression (e.g. ([^ ]*) ([^ ]*)) provided at the time of creating table.

Setting Up Hive Over Hadoop

Setting up Hive on Hadoop is a multi-step process that requires a few technologies:

  1. Hadoop: Latest available version of Hadoop already installed.
  2. Java 1.6 or above: Hadoop and Hive can run only on Java6 or above.
  3. SSH client: To run the hive commands over the Hadoop machine

The following steps explain the procedures needed for installing and configuring Apache Hive over a Linux system.

  1. Download the most recent stable release of Hive from one of the Apache download mirrors on Hive Releases page.
  2. Unpack the tarball using the following command. This will result in the creation of a subdirectory named hive-x.y.z.
         $ tar -xzvf hive-x.y.z.tar.gz
  3. Make sure that the environment variable HADOOP_HOME is defined with the Hadoop installation directory. If not, use the following command to set the same.
         $ export HADOOP_HOME=

    Similarly, set the environment variable HIVE_HOME to point to the installation directory as:

         $ cd hive-x.y.z     $ export HIVE_HOME={{pwd}}
  4. Optionally, you can add $HIVE_HOME/bin to the terminal environment variable PATH:
         $ export PATH=$HIVE_HOME/bin:$PATH
  5. To run Hive, you need to create a Hive metastore warehouse directory in HDFS and provide the appropriate permission using these commands.
         $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp     $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp     $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse     $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
  6. After doing all the above configurations successfully, use the following command to start Hive CLI from the shell:
         $ $HIVE_HOME/bin/hive

All the metadata for Hive tables and partitions are stored in Hive Metastore. The Hive metastore can be set up in three different ways:

  1. Embedded Metastore
  2. Local Metastore
  3. Remote Metastore

The Hive configuration variables are defined in an XML file HIVE_HOME/conf/hive-default.xml. They can be changed by (re-)defining them in HIVE_HOME /conf/hive-site.xml. For local metastore setup, each Hive Client will open a connection to the datastore and make SQL queries against it for getting the metadata information. The following table lists a few important configuration parameters for setting up a metastore in a local MySQL server.

Config Param Config Value
javax.jdo.option.ConnectionURL jdbc:mysql://server_host:3306/db_name
javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
hive.metastore.local True
hive.metastore.warehouse.dir /user/hive/warehouse
Table 1. Configuration Parameters for a Metastore in MySQL

You should change the values based on your own setup.

devx-admin

Share the Post: