When developing Hive-based applications, developers always have to deal with different HiveQL queries to perform basic CRUD (Create, Read, Update and Delete) and other operations on Hive table data. The queries also vary from one application to another. For applications where multiple types of Hive tables are used and many of them are created on the fly during the execution of the application, managing a table's lifecycle and querying data dynamically become very difficult. Developers have to hardcode various HiveQL queries for each operation. This makes developing, managing and maintaining the application very difficult.
To help developers address the challenges related to applying HiveQL queries in large applications, we present in this article different design patterns and solutions for Hive-based operations (e.g. create pattern, insert pattern, select pattern and delete pattern). By extending the approach we explain here, you can add further design patterns specific to your scenario. The approach presented here is not limited to only Hive applications, but it can easily be applied to other similar environments such as HBase, Mongo DB, relational databases, and so on.
Model Classes in Apache Hive
In the core of our solution patterns are various model classes. In Hive, the structure of the data being stored and processed is represented as a table. The Hive table metadata stores the format of the file data (e.g. Web server log file) in the form of columns and row formatters (i.e. column and line separators). So, you need a class to store the Hive table metadata.
The following class diagram of a Java class, called HiveTable, represents the Hive table metadata information.
The class TableField represents the column metadata information (i.e. information about a Hive table column and its data type.
The TablePartition class holds the Hive table partition details.
The class RowFormat represents the column and line separator information.
All these model classes will be used in our proposed design patterns. The relation between these model classes, other Hive environment components and the proposed design patterns are depicted in Figure 1.
Click here for larger image
Figure 1. Relation Between Hive Components, Model Classes and Design Pattern Blocks
Apache Hive Query Patterns
Hive provides a SQL-like query based solution (HiveQL) for MapReduced problems over a large volume of input data. Hence, while developing Hive-based applications, programmers of any language have to deal with HiveQL queries. These queries are generally hardcoded into the program with some input data and conditions. For example when one needs to create a table in Hive that stores information about employees, the Create Table query in HiveQL looks like as follows.
CREATE TABLE if not exists default.Employee (
name STRING COMMENT 'Name of the Employee ',
empid INT COMMENT 'Employee ID ',
age DOUBLE COMMENT 'Employee age ',
salary DOUBLE COMMENT 'Salary of the Employee ')
COMMENT 'Employee details table.'
PARTITIONED BY (city STRING COMMENT 'City of the Employee ')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\012'
Likewise, for every new table, programmers have to write a new Create Table query based on the table format and the type of data it stores. This is equally applicable for other operations such as Load, Select, Delete, and so on.
What we propose in this section is a common solution for handling multiple queries for various operations on Hive Tables. For each Hive table operation, we propose a design pattern in the form of query builders, which can generate any possible query for the given operation based on user needs. The query builder will take a specific model class object as input and generate a HiveQL query for the particular operation. This way, only one piece code is required to generate a particular type of HiveQL query. This enhances the code's reusability and maintainability. It also makes it easy to track errors and fix bugs during the query formation process, since each type of operation has a unique query builder, which is used across all the components required to generate the HiveQL queries.
The following sections explain the basic design patterns and respective query builders with the required model classes in Java. We will present four commonly used patterns:
- Create Pattern: a generic solution for Create Table in Hive
- Delete Pattern: a generic solution for Delete Table in Hive
- Select Pattern: a generic solution for Select … FROM … WHERE in Hive
- Load pattern: a generic solution for loading data into the Hive table
You can extend this concept further and come up with your own solution pattern(s) and query builder(s) for your particular needs.