Apache Hadoop and Hive for Big Data Storage and Processing


Apache Hive Features

The following features enable Hive to meet big data challenges such as storing, managing and processing large data sets.

Powerful CLI and HiveQL

Hive supports a SQL-like query language known as the Hive query language (HiveQL) over one or more data files located either in a local file system or in HDFS. HiveQL queries run on the Hadoop MapReduce framework, but Hive hides that complexity from the developer. HiveQL consists of a subset of SQL features plus some extensions that are useful for batch processing systems.

HiveQL supports basic SQL-like features such as CREATE TABLE, DROP TABLE, SELECT ... FROM ... WHERE clauses, various types of joins (inner, left outer, right outer and full outer joins), Cartesian products, GROUP BY, SORT BY, aggregations, unions and many useful functions on primitive and complex data types. Metadata browsing features, such as listing databases and tables, are also provided. This enables developers who are not familiar with Hadoop or MapReduce to begin querying the system right away through the Hive CLI (command line interface).

However, HiveQL does have some limitations compared with traditional RDBMS SQL. For instance, Hive currently does not support appending rows to an existing table: every insert operation into a table or partition overwrites the existing data.
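
As a rough illustration of the features described above, the following shell sketch writes a small HiveQL script and passes it to the Hive CLI with `hive -f` when the CLI is on the PATH. The table and column names (`page_views`, `users`, `views_by_country`) are invented for the example, and the statements assume a working Hive installation in which those tables do not already exist.

```shell
#!/bin/sh
# Sketch only: table and column names below are hypothetical.
workdir=$(mktemp -d)
script="$workdir/example.hql"

cat > "$script" <<'EOF'
-- Metadata browsing
SHOW TABLES;

-- Table creation
CREATE TABLE page_views (user_id INT, url STRING);
CREATE TABLE users (user_id INT, country STRING);
CREATE TABLE views_by_country (country STRING, views BIGINT);

-- Join plus aggregation; the insert replaces whatever the target table held
INSERT OVERWRITE TABLE views_by_country
SELECT u.country, COUNT(1)
FROM page_views v JOIN users u ON (v.user_id = u.user_id)
GROUP BY u.country;

SELECT country, views FROM views_by_country WHERE views > 100;
EOF

# Run the script only if a hive launcher is available; otherwise just report the path.
if command -v hive >/dev/null 2>&1; then
    hive -f "$script"
else
    echo "hive not found; script written to $script"
fi
```

On a machine without Hive the block only writes the script, so the HiveQL can be reviewed before it is run against a real warehouse.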

Managing a Wide Variety of Data

Using Hive, developers can structure and map file data into RDBMS concepts such as tables, columns, rows, and partitions. In addition to the primitive data types, such as boolean, integers, floats, doubles and strings, Hive supports the major complex types: list (array), map and struct. Even more complex types can be built by nesting these types arbitrarily.
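
To make the type system more concrete, here is a minimal sketch that writes a table definition mixing primitive and complex column types to a file. The `employees` table and all of its columns are hypothetical. HiveQL spells the list type as `ARRAY`, and the last column shows complex types nested inside one another.

```shell
#!/bin/sh
# Sketch only: the employees table and its columns are made up for illustration.
ddl_file=$(mktemp)

cat > "$ddl_file" <<'EOF'
CREATE TABLE employees (
  name      STRING,
  active    BOOLEAN,
  salary    DOUBLE,
  skills    ARRAY<STRING>,
  phones    MAP<STRING, STRING>,
  address   STRUCT<street:STRING, city:STRING, zip:INT>,
  projects  ARRAY<STRUCT<title:STRING, hours:MAP<STRING, INT>>>
);

-- Element access: array index, map key, struct field
SELECT name, skills[0], phones['home'], address.city FROM employees;
EOF

# Print the statements; they could be run with: hive -f "$ddl_file"
cat "$ddl_file"
```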

Hive and SerDe

SerDe is Hive's serialization/deserialization mechanism. A user supplies an implementation of the SerDe Java interface and associates it with a Hive table or partition, which lets a developer interpret and query custom data formats easily with HiveQL.

The default SerDe implementation in Hive assumes that rows are delimited by a newline (ASCII code 10) and that the columns within a row are delimited by Ctrl-A (ASCII code 1). A SerDe can also read data that uses any other delimiter between columns, by means of a regular expression (e.g. ([^ ]*) ([^ ]*)) supplied when the table is created.
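
Because Ctrl-A is invisible in most editors, it can help to build a tiny file in the default layout and inspect it. The sketch below uses only standard shell tools, not Hive itself, and the sample rows are invented. It also applies the example regular expression to one space-delimited row to show how each parenthesized group corresponds to one column.

```shell
#!/bin/sh
# Sketch only: sample data is made up; no Hive installation is needed.
data_file=$(mktemp)

# \001 is the octal escape for Ctrl-A (ASCII 1); \n ends each row.
printf '1\001alice\001NY\n2\001bob\001SF\n' > "$data_file"

# Replace Ctrl-A with a pipe so the column boundaries become visible.
visible=$(tr '\001' '|' < "$data_file")
echo "$visible"

# Count the columns in the first row by splitting on Ctrl-A.
columns=$(head -n 1 "$data_file" | tr '\001' '\n' | wc -l | tr -d ' ')
echo "columns in first row: $columns"

# The example regex splits a space-delimited row into two capture groups.
regex_row=$(echo 'alice NY' | sed -E 's/^([^ ]*) ([^ ]*)$/col1=\1 col2=\2/')
echo "$regex_row"
```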

Setting Up Hive Over Hadoop

Setting up Hive on Hadoop is a multi-step process with a few prerequisites:

  1. Hadoop: a recent stable release of Hadoop, already installed.
  2. Java 1.6 or above: Hadoop and Hive run only on Java 6 or later.
  3. SSH client: to run the Hive commands on the Hadoop machine.

The following steps explain how to install and configure Apache Hive on a Linux system.

  1. Download the most recent stable release of Hive from one of the Apache download mirrors listed on the Hive Releases page.

  2. Unpack the tarball using the following command. This will result in the creation of a subdirectory named hive-x.y.z.
         $ tar -xzvf hive-x.y.z.tar.gz
  3. Make sure that the environment variable HADOOP_HOME points to the Hadoop installation directory. If it does not, set it with the following command.
         $ export HADOOP_HOME=<hadoop-install-dir>
    Similarly, set the environment variable HIVE_HOME to point to the installation directory as:
         $ cd hive-x.y.z
         $ export HIVE_HOME=$(pwd)
  4. Optionally, you can add $HIVE_HOME/bin to the terminal environment variable PATH:
         $ export PATH=$HIVE_HOME/bin:$PATH
  5. To run Hive, you need to create /tmp and the Hive metastore warehouse directory in HDFS and give both group write permission using these commands.
         $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
         $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
         $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
         $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
  6. After completing the steps above, use the following command to start the Hive CLI from the shell:
         $ $HIVE_HOME/bin/hive
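
Before starting the CLI, it can be useful to confirm that the environment variables from steps 3 and 4 point at real installations. The following is a minimal sketch of such a check. `check_home` is a hypothetical helper written for this example, not something shipped with Hadoop or Hive.

```shell
#!/bin/sh
# Sketch only: check_home is a hypothetical helper, not part of Hadoop or Hive.
# It verifies that a *_HOME variable is set and contains an executable launcher.
check_home() {
    name=$1      # variable name, e.g. HIVE_HOME
    launcher=$2  # path relative to that directory, e.g. bin/hive
    eval "dir=\${$name:-}"
    if [ -z "$dir" ]; then
        echo "$name is not set"
        return 1
    fi
    if [ ! -x "$dir/$launcher" ]; then
        echo "$name=$dir has no executable $launcher"
        return 1
    fi
    echo "$name looks usable: $dir"
}

# Report on both installations without aborting if one is missing.
check_home HADOOP_HOME bin/hadoop || true
check_home HIVE_HOME bin/hive || true
```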

All the metadata for Hive tables and partitions is stored in the Hive metastore. The metastore can be set up in three different ways:

  1. Embedded Metastore
  2. Local Metastore
  3. Remote Metastore

The Hive configuration variables are defined in the XML file $HIVE_HOME/conf/hive-default.xml. They can be changed by (re-)defining them in $HIVE_HOME/conf/hive-site.xml. In a local metastore setup, each Hive client opens a connection to the datastore and issues SQL queries against it to retrieve the metadata. The following table lists a few important configuration parameters for setting up a metastore in a local MySQL server.

Config Param                             Config Value
---------------------------------------  -------------------------------------
javax.jdo.option.ConnectionURL           jdbc:mysql://server_host:3306/db_name
javax.jdo.option.ConnectionDriverName    com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName      <user name>
javax.jdo.option.ConnectionPassword      <password>
hive.metastore.local                     true
hive.metastore.warehouse.dir             /user/hive/warehouse

Table 1. Configuration Parameters for a Metastore in MySQL

You should change the values based on your own setup.
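
One way to turn these parameters into a configuration file is sketched below. The script renders each name/value pair as a `<property>` element and writes the result to a temporary file, which can be reviewed and then copied to the conf directory as hive-site.xml. The `property` helper is hypothetical, and `server_host`, `db_name`, `hive_user` and `change_me` are placeholders, not real values.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder to be replaced.
site_file=$(mktemp)

# Hypothetical helper: emit one <property> element for a name/value pair.
property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

{
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    property javax.jdo.option.ConnectionURL        'jdbc:mysql://server_host:3306/db_name'
    property javax.jdo.option.ConnectionDriverName 'com.mysql.jdbc.Driver'
    property javax.jdo.option.ConnectionUserName   'hive_user'
    property javax.jdo.option.ConnectionPassword   'change_me'
    property hive.metastore.local                  'true'
    property hive.metastore.warehouse.dir          '/user/hive/warehouse'
    echo '</configuration>'
} > "$site_file"

# Review the output before copying it into the Hive conf directory.
cat "$site_file"
```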

Kamalkumar Mistry is a Technology Analyst at Infosys Ltd. India. He has five years of industry experience designing and developing software across a variety of technologies such as aerospace network system simulation, cloud computing, big data analytics and applications based on various virtualization technologies.