Apache Hive Query Patterns: Generalized HiveQL Queries for Common Operations



Developing a Java Client Program for Hive

In this section, we explain how to develop a client program for Hive using Java. The client program can call any of the above query builder pattern solutions to generate the needed HiveQL query and pass the query string to the query executor. We explain two types of Hive clients, both written in Java: one using the Hive JDBC driver and the other using the Hive Thrift client.

Developing a Hive Client Using JDBC APIs

A Hive client written using JDBC APIs looks exactly the same as a client program written for any other RDBMS (e.g. MySQL) in Java using JDBC APIs. The only differences are the driver name (org.apache.hadoop.hive.jdbc.HiveDriver) and the URI string (jdbc:hive:// for an embedded-mode Hive server and jdbc:hive://host:port/dbname for a standalone server). Here, host and port are determined by where the Hive server is running.
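For the embedded-mode URI, only the connection call changes. A minimal sketch (in embedded mode the user and password arguments are typically ignored):

// Embedded mode: Hive runs inside the client JVM; no standalone server is required.
Connection con = DriverManager.getConnection("jdbc:hive://", "", "");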



The following is a sample Hive client program using JDBC APIs:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveTutorialJdbcClient {
     private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

     public static void main(String[] args) throws SQLException {
          // Load the Hive JDBC driver.
          try {
               Class.forName(driverName);
          }
          catch (ClassNotFoundException e) {
               e.printStackTrace();
               System.exit(1);
          }
          // Provide appropriate user and password information below.
          Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "User", "Password");
          Statement stmt = con.createStatement();
          // Any valid HiveQL query string, e.g. one generated by the
          // query builder patterns (see the note below).
          String HiveQL_Query = "SHOW TABLES";
          ResultSet resultSet = stmt.executeQuery(HiveQL_Query);
     }
}

Here, HiveQL_Query holds the HiveQL query string (e.g. "CREATE TABLE …") generated by any of the above query builder pattern solutions, or any other valid HiveQL query.

Developing a Thrift Hive Client

The following is a sample Thrift Hive client program written in Java:

import org.apache.hadoop.hive.service.ThriftHive;
import org.apache.hadoop.hive.service.ThriftHive.Client;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class HiveTutorialThriftClient {
     private static Client getClient(String hiveServer, Integer hivePort) {
          final int SOME_BIG_NUMBER = 99999999;
          Client client = null;
          try {
               // Open a Thrift socket transport to the Hive server.
               TSocket transport = new TSocket(hiveServer, hivePort);
               transport.setTimeout(SOME_BIG_NUMBER);
               transport.open();
               TBinaryProtocol protocol = new TBinaryProtocol(transport);
               client = new ThriftHive.Client(protocol);
               return client;
          } catch (Exception e) {
               e.printStackTrace();
               return null;
          }
     }

     public static void main(String[] args) throws Exception {
          // Provide the appropriate server hostname and port number.
          String HIVE_SERVER = "localhost";
          Integer HIVE_PORT = new Integer(10000);

          Client client = getClient(HIVE_SERVER, HIVE_PORT);
          // Any valid HiveQL query string, as with the JDBC client above.
          String HiveQL_Query = "SHOW TABLES";
          client.execute(HiveQL_Query);
     }
}

Here, Client is the Thrift Hive client class from the package org.apache.hadoop.hive.service.ThriftHive in the hive-service-<version>.jar library. The other classes are TSocket from the package org.apache.thrift.transport and TBinaryProtocol from the package org.apache.thrift.protocol, both in the hive-exec-<version>.jar library.
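If the executed HiveQL returns rows (e.g. a SELECT), the Thrift client also provides fetch methods for reading them back. A minimal sketch, assuming the client object from the program above:

// Each fetched row is returned as a single tab-delimited string,
// so this is suitable only for small result sets.
List<String> rows = client.fetchAll(); // java.util.List
for (String row : rows) {
     System.out.println(row);
}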

Starting Hive Server and Running a Client Program

To run both the sample Hive client programs, follow these steps.

  1. Make sure that all the Hadoop services and daemons are running. If not, go to the Hadoop installation directory (i.e. HADOOP_HOME) and execute the command:

    $ bin/start-all.sh

  2. Open a new terminal and start the Hive server using this command:

    $ HIVE_PORT=10000 <HIVE_HOME dir>/bin/hive --service hiveserver

    Make sure that the port you provide in the command is the same as the one used in the client program. In our sample program, we used Hive server port 10000. We assume that the HADOOP_HOME and HIVE_HOME environment variables are set with the appropriate values, and the required JARs will be in the Java CLASSPATH.
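    One way to set the CLASSPATH before compiling and running (a sketch; the exact JAR locations vary by Hive and Hadoop version):

    $ export CLASSPATH=.:$HIVE_HOME/lib/*:$HADOOP_HOME/*:$HADOOP_HOME/lib/*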


  3. Open a new terminal and compile the Hive client Java programs as follows:

    $ javac HiveTutorialJdbcClient.java
    $ javac HiveTutorialThriftClient.java

  4. Run the client program using this command:

    $ java HiveTutorialJdbcClient
    $ java HiveTutorialThriftClient

Applying UDFs in a Generalized Way in Hive

This section discusses user defined function (UDF) support in Hive. Apart from built-in functions such as date(), cast(), year(), and month(), Hive provides the capability to define and use your own functions in HiveQL queries. To create a UDF, you create a new class that extends the Hive class UDF and overrides one or more evaluate() methods. After compiling the class into a JAR, you register the JAR with Hive, and the function can then be used in queries.
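As a minimal, self-contained illustration of this pattern (a hypothetical ToUpperUdf, not part of the example that follows), a complete UDF can be as small as this:

package com.tutorial;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial UDF that upper-cases a string column value.
public class ToUpperUdf extends UDF {
     public Text evaluate(Text input) {
          if (input == null) {
               return null;
          }
          return new Text(input.toString().toUpperCase());
     }
}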

Let's dive into how you can create a UDF and use it in Hive queries. Suppose one of the columns in a Hive table holds BLOB content that represents an image. Due to size constraints, instead of holding the complete image in BLOB format in the table, we convert it into an image file and store that file at a particular location; the table column then holds only the location where the image is stored. By doing this, querying the table becomes much faster.

Below is a UDF class with an overridden evaluate() method, which converts incoming BLOB content to an image, stores the image to a specified location, and returns the stored image path. Note that we do not provide the complete code but specify the necessary steps in the comments.

package com.tutorial;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class BlobToImageUdf extends UDF {
     public Text evaluate(String tableName, int primaryKey, String fileType,
               String storagePath, Text imageBlob)
     {
          // Convert the BLOB content to an image file.
          // Write the image file to the storage location "storagePath".
          // Return the storage location path of the image as a Text value.
          return null; // Placeholder: the conversion code is omitted here.
     }
}

The following steps explain how to use the above UDF in a Hive query.

  1. Every UDF must be compiled into a JAR before Hive can use it. So, compile the BlobToImageUdf class into a JAR file named BlobUdf.jar.


  2. Every UDF must be recognized by Hive: it should be registered with the same Hive client session that runs the query using the UDF. To register the UDF, run these commands:

    hive> add jar /path_to_jar_dir/BlobUdf.jar;
    hive> create temporary function Blob2Image as 'com.tutorial.BlobToImageUdf';

After registering the UDF Blob2Image, you can use this function in a Hive query. Suppose you have a table named EmployeeData with the column set {empId, empName, empAddress, empImage}. First, let's load the data from an input file into the EmployeeData table using the query below. Here, inputFilePath represents the file with the appropriate employee data content.

hive> load data inpath 'inputFilePath' overwrite into table EmployeeData;

The data loaded into the table will have the BLOB content in the empImage column. Therefore, running the query below, which applies the Blob2Image function you created, will convert each BLOB to an image, store it in the target image storage, and store the path of the image in the same column (empImage).

hive> insert overwrite table EmployeeData select empId, empName, empAddress,
     Blob2Image("EmployeeData", empId, "jpeg", "/home/user/image_repos/employees/", empImage) from EmployeeData;

When you run the above query, the evaluate() method of the BlobToImageUdf class (registered as Blob2Image) is called for each row. Of the input parameters, tableName (i.e. EmployeeData) and primaryKey (i.e. empId) are used to generate a unique name for the image file. The file type gives the image file the proper extension (i.e. jpeg), the path determines where the output file is stored (i.e. /home/user/image_repos/employees/), and finally the column empImage contains the BLOB content on which this UDF runs row by row.

Therefore, if the empId series starts from 101, the output of the UDF will be /home/user/image_repos/employees/EmployeeData_101.jpeg, /home/user/image_repos/employees/EmployeeData_102.jpeg, and so on.
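A plausible sketch of the name construction inside evaluate(), matching the naming scheme above (hypothetical; the actual implementation is omitted in this article):

// e.g. "/home/user/image_repos/employees/" + "EmployeeData" + "_" + 101 + ".jpeg"
String imagePath = storagePath + tableName + "_" + primaryKey + "." + fileType;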

What Have You Learned?

The need for storing and processing large volumes of data is increasing, raising challenges for dealing with data sets at the scale of terabytes and petabytes. To make application development easier, more efficient, and less error-prone in Hive environments, we have proposed four basic design patterns. We believe our proposed approach greatly improves the process of developing Hive-based applications by enhancing code reusability and reducing the chance of introducing errors during HiveQL query formation. We also explained how to create and use user defined functions in Hive through an effective example.



Ira Agrawal works as a Technical Manager with Infosys Labs, where she has worked on different aspects of distributed computing including various middleware and products based on DSM, SOA and virtualization technologies.