Apache Hive Query Patterns: Generalized HiveQL Queries for Common Operations

When developing Hive-based applications, developers always have to deal with different HiveQL queries to perform basic CRUD (Create, Read, Update and Delete) and other operations on Hive table data. The queries also vary from one application to another. For applications where multiple types of Hive tables are used and many of them are created on the fly during the execution of the application, managing a table’s lifecycle and querying data dynamically become very difficult. Developers have to hardcode various HiveQL queries for each operation. This makes developing, managing and maintaining the application very difficult.

To help developers address the challenges related to applying HiveQL queries in large applications, we present in this article different design patterns and solutions for Hive-based operations (e.g. create pattern, insert pattern, select pattern and delete pattern). By extending the approach we explain here, you can add further design patterns specific to your scenario. The approach presented here is not limited to only Hive applications, but it can easily be applied to other similar environments such as HBase, Mongo DB, relational databases, and so on.

Model Classes in Apache Hive

In the core of our solution patterns are various model classes. In Hive, the structure of the data being stored and processed is represented as a table. The Hive table metadata stores the format of the file data (e.g. Web server log file) in the form of columns and row formatters (i.e. column and line separators). So, you need a class to store the Hive table metadata.

The following class diagram of a Java class, called HiveTable, represents the Hive table metadata information.

HiveTable
String databaseName;String tableName;String tableComment;boolean external;List tableFields;List tablePartitions;RowFormat rowFormat;

The class TableField represents the column metadata information (i.e. information about a Hive table column and its data type.

TableField
String fieldName;String fieldType;String comments;

The TablePartition class holds the Hive table partition details.

TablePartition
String partitionName;String partitionType;String partitionExpression;String partitionComment;

The class RowFormat represents the column and line separator information.

RowFormat
String fieldsTerminatedBy;String linesTerminatedBy;boolean fieldsTerminatorInOctal;boolean linesTerminatorInOctal;

All these model classes will be used in our proposed design patterns. The relation between these model classes, other Hive environment components and the proposed design patterns are depicted in Figure 1.

Apache Hive Query Patterns

Hive provides a SQL-like query based solution (HiveQL) for MapReduced problems over a large volume of input data. Hence, while developing Hive-based applications, programmers of any language have to deal with HiveQL queries. These queries are generally hardcoded into the program with some input data and conditions. For example when one needs to create a table in Hive that stores information about employees, the Create Table query in HiveQL looks like as follows.

CREATE TABLE if not exists default.Employee (     name STRING COMMENT 'Name of the Employee ',      empid INT COMMENT 'Employee ID ',     age DOUBLE COMMENT 'Employee age ',     salary DOUBLE COMMENT 'Salary of the Employee ') COMMENT 'Employee details table.'PARTITIONED BY (city STRING COMMENT 'City of the Employee ') ROW FORMAT DELIMITED      FIELDS TERMINATED BY '01'      LINES  TERMINATED BY '12'

Likewise, for every new table, programmers have to write a new Create Table query based on the table format and the type of data it stores. This is equally applicable for other operations such as Load, Select, Delete, and so on.

What we propose in this section is a common solution for handling multiple queries for various operations on Hive Tables. For each Hive table operation, we propose a design pattern in the form of query builders, which can generate any possible query for the given operation based on user needs. The query builder will take a specific model class object as input and generate a HiveQL query for the particular operation. This way, only one piece code is required to generate a particular type of HiveQL query. This enhances the code’s reusability and maintainability. It also makes it easy to track errors and fix bugs during the query formation process, since each type of operation has a unique query builder, which is used across all the components required to generate the HiveQL queries.

The following sections explain the basic design patterns and respective query builders with the required model classes in Java. We will present four commonly used patterns:

  1. Create Pattern: a generic solution for Create Table in Hive
  2. Delete Pattern: a generic solution for Delete Table in Hive
  3. Select Pattern: a generic solution for Select … FROM … WHERE in Hive
  4. Load pattern: a generic solution for loading data into the Hive table

You can extend this concept further and come up with your own solution pattern(s) and query builder(s) for your particular needs.

Apache Hive Create Pattern

In order to create a Hive table, one has to write a Create Table query specific to the table and its columns/fields in HiveQL. To avoid that, we propose a generic method that returns the needed Create Table query specific to the particular table. The same method can be used to generate a Create Table query for all the possible types of Hive tables. This generic Create Table query builder takes the HiveTable model class object as input and deliver the Create Table query as return value based on the information provided through the HiveTable model class object. Figure 2 shows the solution pattern.

The following algorithm is the “Create Table …” HiveQL query generator solution using the HiveTable model class explained in the previous section.

Algorithm-1: Generate CreateTable QueryInput: 	HiveTable model class objectOutput:	String object having HiveQL Create table query.Step-1:	If (Table name is NULL or TableFields object is NULL")Then1.	Report error "Unable to create table query".2.	Terminate the algorithm.end ifStep-2:	Start creating the Hive Create Table query by adding following components, if they are not null, into the query string.	Add database and table name information.	For each column in "List tableFields"	-	Add field name, field data type and comment details.	For each partition in "List tablePartitions"	-	Add partition name, data type and comment details.	Add row formatters if any in "RowFormat rowFormat" object. Step-3:	Return the Hive Create Table query.

Apache Hive Delete Pattern

The solution pattern for Delete is illustrated in Figure 3 for the Create pattern. Here, the input HiveTable model class object will be used to create the HiveQL query to delete the table. The algorithm will be similar to Algorithm-1 described in the Create pattern, but it will create a “DROP TABLE …” HiveQL query using the information available from the input HiveTable model class object.

Apache Hive Select Pattern

Select (i.e. reading data from one or more tables) is one of the most heavily used operations in Hive-based applications. Generating a Select query is really complicated because the resulting Select query can contain information about:

  • one or more columns in the result set
  • one or more tables in the case of Join operations
  • different conditional clauses and other information about grouping, aggregation and ordering of result set data

Due to all these permutations, a developer has to hardcode individual Select queries for each requirement. To address this challenge, we propose a Select query builder model class, which will be provided as input to the Hive select Query builder to generate a Hive Select query. The solution pattern for Select is described in Figure 4.

The following class diagram represents the Select query builder class SelectQueryBuilder, which includes the method getSelectQuery() for generating the Select query. The other one is the WhereClause model class used by the SelectQueryBuilder class.

SelectQueryBuilder
String databaseName;String tableName;String alias;List resultSetColumn;List whereClauses;List groupByColumns;List sortByColumns;int limit;
String getSelectQuery();
WhereClause
String fieldName;String fieldCriteria;Object criteriaValue;String criteriaValueType;String clauseCondition;
String  getWhereClauseExpression();

The following algorithm is the “SELECT … FROM … WHERE” HiveQL query generator pattern solution using the SelectQueryBuilder class.

Algorithm-2: Generate Select_From_Where QueryStep-1:	If (Table name is NULL")Then3.	Report error "Unable to Select table query".4.	Terminate the algorithm.end ifStep-2:	Start creating the Hive Select Table query by adding following components, if they are not null, into the result query string.	Add database and table name information.	Add result set columns information available in List resultSetColumn	For each where clause in "List whereClauses"	-	Add where clause expression into the Select query by calling getWhereClauseExpression() method on respective where clause object in the list.	For each column name in "List groupByColumns"	-	Add group by column name, into the Select query.	For each column name in "List sortByColumns"	-	Add sort by column name, into the Select query.	Add limit factor available in limit variable.Step-3:	Return the Hive Select Table query.

Apache Hive Insert Pattern

The way data is loaded or inserted into the Hive table is different from the common insert operation performed over the table in traditional RDBMS systems. Hive allows the user to load data directly from a file into the table instead of inserting one row at a time. The insert pattern shown in Figure 5, proposes the solution approach for generating a “LOAD DATA … INTO TABLE …” HiveQL query for loading file contents into an existing Hive table.

Developing a Java Client Program for Hive

In this section, we explain how to develop a client program for Hive using Java. The client program can call any of the above query builder pattern solutions to generate the needed HiveQL query and pass the query string to the query executor. We will explain two types of Hive clients: one using the Hive JDBC client and the other using Hive Thrift Client. Both are written in Java.

Developing a Hive Client Using JDBC APIs

The Hive client written using JDBC APIs looks exactly the same as a client program written for other RDBMS (e.g. MySQL) in Java using JDBC APIs. The only difference will be in the driver name (org.apache.hadoop.hive.jdbc.HiveDriver) and the URI string (jdbc:hive:// for embedded mode setup of the Hive server and jdbc:hive://host:port/dbname for the standalone server). Here, host and port are determined by where the Hive server is running.

The following is a sample Hive client program using JDBC APIs:

public class HiveTutorialJdbcClient {      private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";          public static void main(String[] args) throws SQLException{          try {               Class.forName (driverName);          }           catch (ClassNotFoundException e) {               e.printStackTrace();               System.exit(1);          }          // Provide appropriate user and password information below.          Connection con = DriverManager.getConnection (                    "jdbc:hive://localhost:10000/default", "User", "Password");          Statement stmt = con.createStatement();          ResultSet resultSet = stmt.executeQuery(HiveQL_Query);     }}

Here, HiveQL_Query is the HiveQL query (e.g. “CREATE TABLE …”) generated by using any of the above query builder pattern solutions or any valid HiveQL query string.

Developing a Thrift Hive Client

The following is the sample Thrift Hive client program written in Java.

public class HiveTutorialThriftClient {          private static Client getClient(String hiveServer, Integer hivePort) {          final int SOME_BIG_NUMBER = 99999999;                  Client client=null;        try {             TSocket transport = new TSocket(hiveServer, hivePort);             transport.setTimeout(SOME_BIG_NUMBER);             transport.open();             TBinaryProtocol protocol = new TBinaryProtocol(transport);             client = new ThriftHive.Client(protocol);             return client;        }        catch (Exception e) {           e.printStackTrace();           return null;        }     }          public static void main(String[] args) throws Exception {     // Provide appropriate Server Hostname and Port number.          String HIVE_SERVER = "localhost";          Integer HIVE_PORT  = new Integer(10000);                    Client client = getClient(HIVE_SERVER, HIVE_PORT);          client.execute(HiveQL_Query);     }}

Here, the Client is a Thrift Hive client class from the package org.apache.hadoop.hive.service.ThriftHive in the hive-service-.jar library. The other classes are TSocket from the package org.apache.thrift.transport and TBinaryProtocol from the package org.apache.thrift.protocol in the hive-exec-.jar library.

Starting Hive Server and Running a Client Program

To run both the sample Hive client programs, follow these steps.

  1. Make sure that all the Hadoop services and daemons are running. If not, go to the Hadoop installation directory (i.e. HADOOP_HOME) and execute the command:
    $ bin/start-all.sh
  2. Open a new terminal and start the Hive server using this command:
    $ $ HIVE_PORT=10000 /bin/hive --service hiveserver

    Make sure that the port you provide in the command is the same as the one used in the client program. In our sample program, we used Hive server port 10000. We assume that the HADOOP_HOME and HIVE_HOME environment variables are set with the appropriate values, and the required JARs will be in the Java CLASSPATH.

  3. Open a new terminal and compile the Hive client Java programs as follows:
    $ javac HiveTutorialJdbcClient.java$ javac HiveTutorialThriftClient.java
  4. Run the client program using this command:
    $ java HiveTutorialJdbcClient $ java HiveTutorialThriftClient

Applying UDFs in a Generalized Way in Hive

This section discusses user defined function (UDF) support in Hive. Apart from the built-in functions, such as date(), cast(), year(), and month(). Hive provides the capability to define and use your own functions in HiveQL queries. In order to create a UDF, you have to create a new class that extends the Hive class UDF and overrides one or more evaluate() methods in it. After compiling the class to a JAR, you need to register the JAR into Hive and then it can be used in queries.

Let’s dive into how you can create a UDF and use it in Hive queries. Suppose one of the columns in a Hive table holds BLOB content that represents an image’s data. Due to size constraints, instead of holding the complete image in BLOB format in the table, let’s convert it into an image and store the image in a particular location. In the table column, instead of holding the BLOB content, just specify the image file location where that image is stored. By doing this, querying the table will become much faster.

Below is a UDF class with an overridden evaluate() method, which converts incoming BLOB content to an image, stores the image to a specified location, and returns the stored image path. Note that we do not provide the complete code but specify the necessary steps in the comments.

package com.tutorial;public class BlobToImageUdf extends UDF{     public Text evaluate(String tableName, int primaryKey, Text imageBlob, String fileType, String storagePath)     {          // Convert blob to an image file.          // Write image to the storage location "storagePath".          // Returns the storage location path String.     }}

The following steps explain how to use the above UDF in a Hive query.

  1. Every UDF which is created needs to be compiled in to JAR to be used by Hive. So, you have to compile the BlobToImageUdf class into a JAR file BlobUdf.jar.
  2. Every UDF that is created needs to get recognized by Hive, and it should be registered with the same Hive client session that is running the query with the UDF. To register the UDF, run these commands:
    hive> create temporary function Blob2Image as 'com.tutorial.BlobToImageUdf'hive> add JAR /path_to_jar_dir/BlobUdf.jar

After registering the UDF Blob2Image, you can use this function in a Hive query. Suppose, you have a table named EmployeeData with column set {empId, empName, empAddress, empImage}. First, let’s load the data from an input file in the EmployeeData table using the query below. Here, inputFilePath represents the file with appropriate employee data content.

hive> load data inpath inputFilePath overwrite into table EmployeeData

The data that gets loaded in the table will have the BLOB content in the empImage column. Therefore, running the query below, which applies the Blob2Image function which you have created, will convert the BLOB to an image, store it on the target image storage, and store the path of the image in the same column (empImage).

hive> insert overwrite table EmployeeData select empId, empName, empAddress, 
     Blob2Image("EmployeeData", empId, "jpeg", "/home/user/image_repos/employees/", empImage) from EmployeeData;

When you run the above query, the evaluate() method of the Blob2Image class is called. From the input parameters, tableName (i.e. EmployeeData) and primaryKey (i.e. empId) are used to generate a unique name for the imageFile. The image type is used to give the proper extension to the image file (i.e. jpeg). The path is where the output file will be stored (i.e. /home/user/image_repos/employees/), and finally the column name empImage, which contains the BLOB content, is the one on which this UDF will run row by row.

Therefore, if an empId series starts from 101, the output of the UDF will be /home/user/image_repos/employees/EmployeeData_101.jpeg, /home/user/image_repos/employees/EmployeeData_102.jpeg and so on, stored at /home/user/image_repos/employees/ in our example.

What Have You Learned?

The need for storing and processing large volumes of data is increasing, raising challenges for dealing with data sets in the scale of terabytes and petabytes. To make application development easier, more efficient and less erroneous in Hive environments, we have proposed four basic design patterns. We believe our proposed approach greatly improves the process of developing Hive-based applications by enhancing code reusability and reducing the chance of introducing errors during the HiveQL query formation process. We also explained how to create and use user defined functions in Hive through an effective example.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

Recent Articles: