Login | Register   
LinkedIn
Google+
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

How to Create Your First Hive Script

Apache Hive provides a mechanism to manage data in a distributed environment and query it using an SQL-like language called Hive Query Language, or HiveQL. This article will discuss Hive scripts and execution.


advertisement

Overview

Apache Hive is an integral part of Hadoop eco-system. Hive can be defined as a data warehouse-like software that facilitates query and large data management on HDFS (Hadoop distributed file system). One must remember that Hive is not data warehouse software, rather it provides a mechanism to manage data on distributed environment and query it by using an SQL-like language called Hive Query Language, or HiveQL. Hive scripts can be defined as a group of Hive commands bundled together to reduce the execution time. In this article, we will discuss Hive scripts and execution.

Introduction

Hadoop Distributed File System, or HDFS, provides a scalable and fault tolerant enabled data storage. HIVE provides a simple SQL like query language – HiveQL. HiveQL allows the traditional map reduce developers to plug-in their custom mappers and reducers to do more sophisticated analysis.

Limitation of Hive



Latency for Hive queries is usually very high because of the substantial overhead in job submission and scheduling. Hive does not offer real time queries and row level updates. It is best used for log analysis.

Hive Data Units

Hive data is organized into following four categories:

  • Databases: This consists of the namespaces that separate the tables and other data units to avoid name conflicts.
  • Tables: These are homogeneous units of data having a common schema. A commonly used example could be of a page view table where each row can have the following columns:
    • USERID
    • IPADDRESS
    • LAST ACCESSED
    • PAGE URL
    This example lists the record of the usage of a website or an application for individual users.
  • Partitions: Partitions determine how the data is stored. Each table can have one or multiple partitions. Partitions also help users to efficiently identify the rows that satisfy a certain selection criteria.
  • Buckets or Clusters: Data in each partition may be further subdivided into buckets or clusters or blocks. The data in the above example can be clustered based on the user id or on the ip address or on the page url column.

Hive Data Types

Based on the need, Hive supports primitive and complex data types as described below:

  • Primitive types:
    • INTEGERS
      • TINY INT 1 byte integer
      • SMALL INT 2 byte integer
      • INT 4 byte integer
      • BIGINT 8 byte integer
    • BOOLEAN
      • BOOLEAN TRUE or FALSE
    • FLOATING POINT numbers
      • FLOAT Single precision
      • DOUBLE Double precision
    • STRING type
      • STRING Sequence of characters
  • Complex Types: Complex types can be constructed using primitive data types and other composite types with the help of:
    • Structs
    • Maps or key value pairs
    • Arrays – Indexed lists

Hive Scripting

Similar to any other scripting language, Hive scripts are used to execute a set of Hive commands collectively. Hive scripting helps us to reduce the time and effort invested in writing and executing the individual commands manually. Hive scripting is supported in Hive 0.10.0 or higher versions of Hive. To write and execute a Hive script, we need to install Cloudera distribution for Hadoop CDH4.

Writing Hive Scripts

First, open a terminal in your Cloudera CDH4 distribution and give the command below to create a Hive Script.

Command: gedit sample.sql

Similar to any other query language, the Hive script file should be saved with .sql extension. This will enable the execution of the commands. Now open the file in Edit mode and write your Hive commands that will be executed using this script. In this sample script, we will do the following tasks sequentially (create, describe and then load the data into the table. And then retrieve the data from table).

  • Create a table 'product' in Hive:

Command: create table product_dtl ( product_id: int, product-name: string, product_price: float, product_category: string) rows format delimited fields terminated by ',' ;

Here { product_id, product-name, product_price, product_category} are names of the columns in the 'product_dtl' table. "Fields terminated by ',' " indicates that the columns in the input file are separated by the ',' delimiter. You can also use other delimiters as per your requirement. For example, we can consider the records in an input file separated by a new line ('\n') character.

  • Describe the Table:

Command: describe product_dtl;

  • Load the data into the Table:

Now, let's check the data loading part. Create an input file that contains the records that needs to be inserted into the table.

Command: sudo gedit input.txt

Now let's create few records in the input text file as shown in the figure below:


Figure 1: Input file.

So, our input file will look like:

  1. Laptop, 45000, Computers
  2. Pencils, 2, Stationery
  3. Rice, 64.45, Grocery
  4. Furniture, 65000, Interiors

To load the data from this file we need to execute the following:

Command: load data local inpath '/home/cloudera/input.txt' into table product_dtl;

Retrieving the Data

To retrieve the data we use the simple select statement as under –

Command: select * from product_dtl;

The above command will execute and fetch all the records from the table 'product'. The script will look like the following image:


Figure 2: The sample SQL file.

Save this sample.sql file and run the following command

Command: Hive –f /home/cloudera/sample.sql

While executing the script, mention the entire path of the script location. Here the sample script is present in the current directory; I haven't provided the complete path of the script.

<p>The following output shows that the table is created and the data from our sample input file is stored in the database.

 

1                            Laptop                                45000                   Computers

2                            Pencils                                 2                            Stationery

3                            Rice                                     64.45                    Groceries

4                            Furniture                             65000                   Interiors

 

Summary

Before concluding our discussion, we must take note of the following points:

  • Apache Hive is an integral part of HDFS
  • Hive is an SQL-like query language
  • Hive script is easy to understand and implement
  • Hive supports both primitive data types and complex data types.

 

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, java/j2ee, Open source and big data technologies. You can find more of his work at www.techalpine.com and you can email him here.



   
Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap