
Hive and Hadoop for Data Analytics on Large Web Logs

In this Apache Hive tutorial, we explain how to use Hive in a Hadoop distributed processing environment to enable Web analytics on large datasets. The process involves analyzing weblog details such as URLs accessed, cookies, access dates and times, and IP addresses. This information is used to analyze visitors’ website usage as well as their browsing patterns and behavior. Armed with this information, site owners can predict what a particular user likes on the site and personalize it accordingly. For their part, developers can add extra tracking values to the weblog for additional analytics.

Apache Hadoop and Hive for Data Processing

Apache Hadoop, the open source distributed computing framework for handling large datasets, uses the HDFS file system for storing files and the MapReduce model for processing large datasets. Apache Hive, a sub-project of Hadoop, is a data warehouse infrastructure used to query and analyze large datasets stored in Hadoop files. Although Hive is not a relational database, it supports a SQL-like query language (HiveQL) that users can run against Hive tables. Hive works on top of Hadoop and ZooKeeper, a centralized Hadoop service for maintaining configuration information and naming, and for providing distributed synchronization and group services.

Weblogs can be processed by the Hadoop MapReduce program and stored in HDFS. Meanwhile, Hive supports fast reading of the data in the HDFS location, basic SQL, joins, and batch data load to the Hive database.

Weblog Formats and Processing

Both Apache Web Server and Microsoft IIS record website requests into log files, but the formats of those logs differ. Apache’s preferred weblog format is the combined log format, which logs all the details of Web usage. Here is an example of the combined weblog format:

LogFormat "%h %v %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""

Here is the IIS preferred format, Microsoft IIS W3C Extended Log File Format:

c-ip cs-username date time sc-bytes sc-status cs-uri-stem cs(Referer) cs(User-Agent) cs(Cookie)

Large retail applications are accessed by many users around the world, so their weblog file sizes might be between 10 and 15 gigabytes. For example, Amazon.com’s weblog is more than 15 gigabytes. This weblog information is used to predict customer interest and personalize the site.

Page tagging is used for tagging the Web page and tracking usage of the page. Web analytics providers such as WebTrends and Google Analytics use this option for tracking page views and storing this view information in log files.

Hadoop MapReduce for Parsing Weblogs

Below is a sample Apache log record, in which the fields are terminated by a space. The OpenCSV framework is used for parsing these logs: it uses the field-terminating character to split each record into its fields.

127.0.0.1 - - [10/Apr/2007:10:39:11 +0300] "GET / HTTP/1.1" 500 606 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.3) 
Gecko/20061201 Firefox/2.0.0.3 (Ubuntu-feisty)"
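To make the splitting concrete, here is a small, dependency-free Java sketch of the quote-aware, space-delimited splitting that OpenCSV's CSVParser(' ', '"') performs on such a record. The class name and the hand-rolled splitter are illustrative assumptions, not part of the attached MapReduce program:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of splitting a combined-format log line on spaces
// while treating double-quoted sections as single fields.
public class LogSplitter {
    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;       // toggle quoted state, drop the quote
            } else if (c == ' ' && !inQuotes) {
                fields.add(cur.toString()); // a space outside quotes ends a field
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - \"GET / HTTP/1.1\" 500 606";
        System.out.println(split(line)); // [127.0.0.1, -, -, GET / HTTP/1.1, 500, 606]
    }
}
```

Note that the quoted request string, user agent, and cookie values survive as single fields even though they contain spaces, which is why a plain split on spaces is not enough.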

Here are the steps for parsing a log file using Hadoop MapReduce:

    1. Load log files into the HDFS location using this Hadoop command:
      hadoop fs -put   
    2. The Opencsv2.3.jar framework is used for parsing log records.
    3. Below is the Mapper program for parsing the log file from the HDFS location.
    // Note: the generic type parameters and the loop body were truncated in the
    // original listing; the reconstruction below is a reasonable completion.
    public static class ParseMapper extends Mapper<Object, Text, Text, NullWritable> {
      private Text word = new Text();

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        // Split the space-delimited record, honoring double-quoted fields
        CSVParser parse = new CSVParser(' ', '"');
        String sp[] = parse.parseLine(value.toString());
        int spSize = sp.length;
        StringBuffer rec = new StringBuffer();
        // Join the parsed fields back into a single delimited record
        for (int i = 0; i < spSize; i++) {
          rec.append(sp[i]);
          if (i != spSize - 1) rec.append(',');
        }
        word.set(rec.toString());
        context.write(word, NullWritable.get());
      }
    }
    4. The command below runs the Hadoop-based log parse. The MapReduce program is attached to this article. You can add extra parsing methods to the class; be sure to create a new JAR with any change and move it to the Hadoop distributed job tracker system.
    hadoop jar    
    5. The output file is stored in the HDFS location, and the output file name starts with "part-".

Hadoop Hive Configuration

The sections to follow explain how to configure Hive for weblog analytics.

Requirements

  • Java 1.6
  • Hadoop 0.20.x.

Installing Hadoop Hive from a Stable Release

First, download the latest stable release of Hive from one of the Apache download mirrors.

Next, unpack the tarball, which will create a subdirectory named hive-x.y.z:

 $ tar -xzvf hive-x.y.z.tar.gz

Point the environment variable HIVE_HOME to the installation directory:

      $ cd hive-x.y.z
      $ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH:

       $ export PATH=$HIVE_HOME/bin:$PATH

Running Hadoop Hive

Because Hive uses Hadoop, you must either:

  • have Hadoop in your path, OR
  • export HADOOP_HOME=

In addition, before you can create a table in Hive, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) in HDFS and make them group-writable (chmod g+w).

Commands to perform this setup are as follows:

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

I also find it useful, but not necessary, to set HIVE_HOME as follows:

$ export HIVE_HOME=

To use the Hive command line interface (CLI) from the shell:

$ $HIVE_HOME/bin/hive

Hive Runtime Configuration

  • Hive queries are executed as MapReduce jobs. Therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.
  • The CLI command SET can be used to set any Hadoop (or Hive) configuration variable. For example:
    hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
    hive> SET -v;

Hive, MapReduce and Local-Mode

The Hive compiler generates MapReduce jobs for most queries. These jobs are then submitted to the MapReduce cluster indicated by this variable:

mapred.job.tracker

This usually points to a MapReduce cluster with multiple nodes, but Hadoop also provides an option to run MapReduce jobs locally on the user's PC. This can be very useful for running queries over small data sets because in such cases local mode execution is usually much faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode runs with only one reducer and can be very slow when processing larger data sets.

Starting with version 0.7, Hive fully supports local mode execution, which you can activate by enabling the following option:

hive> SET mapred.job.tracker=local;

In addition, mapred.local.dir should point to a path that's valid on the local machine (for example, /tmp/<username>/mapred/local); otherwise, the user will get an exception about allocating local disk space.

Starting with version 0.7, Hive also supports a mode that runs MapReduce jobs in local mode automatically. The relevant option is hive.exec.mode.local.auto, which is disabled (false) by default; enable it as follows:

 hive> SET hive.exec.mode.local.auto=true;

Hadoop Hive Data Load

Hive provides tools to enable easy data ETL, a mechanism to put structures on the data, and defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, Hive QL also allows programmers familiar with MapReduce to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Below is the create table command for Hive.

create table weblogs (
  client_ip         string,
  full_request_date string,
  day               string,
  month             string,
  month_num         int,
  year              string,
  hour              string,
  minute            string,
  second            string,
  timezone          string,
  http_verb         string,
  uri               string,
  http_status_code  string,
  bytes_returned    string,
  referrer          string,
  user_agent        string
)
row format delimited fields terminated by ' ' stored as textfile;

The command below loads data from an HDFS location into a Hive table.

LOAD DATA INPATH '' INTO TABLE 

After loading the data into the table, any user can query Hive using HiveQL. Below is an example query that counts the requests from each client IP.

SELECT client_ip  , COUNT(client_ip) FROM weblogs GROUP BY client_ip 
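Conceptually, this query just builds a per-IP counter. Here is a small Java sketch of the same aggregation done in memory; the class and method names are hypothetical, for illustration only:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the same per-client_ip counting that the
// GROUP BY query performs, done in memory over a list of IPs.
public class IpCounts {
    public static Map<String, Integer> countByIp(List<String> clientIps) {
        Map<String, Integer> counts = new HashMap<>();
        for (String ip : clientIps) {
            counts.merge(ip, 1, Integer::sum); // increment this IP's count
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> ips = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.1");
        System.out.println(countByIp(ips)); // e.g. {10.0.0.2=1, 10.0.0.1=2}
    }
}
```

The difference, of course, is that Hive distributes this counting across the cluster as a MapReduce job, so it scales to weblogs far larger than memory.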

Hadoop Hive JDBC Support

Hive also supports JDBC connections. To connect to Hive over JDBC, you need to start the Hive Thrift server as follows:

$ export HIVE_PORT=9999
$ hive --service hiveserver

Here are the steps to establish a Hive JDBC connection:

    1. Add hive-jdbc0.7.jar to the classpath; this is a type-4 driver.
    2. Use the org.apache.hadoop.hive.jdbc.HiveDriver driver for the connection.
    3. Connection String : jdbc:hive://:/
    4. Use Hive QL to query the table in Hive, and it will return the result set.
    5. Using the result set, you can project the output in graphs or charts easily.
    6. Here is a sample Hive JDBC program:

    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection("jdbc:hive://10.0.0.1:9999/default", "", "");
    Statement stmt = con.createStatement();
    ResultSet res = stmt.executeQuery("SELECT client_ip, COUNT(client_ip) FROM weblogs GROUP BY client_ip");
    while (res.next()) {
      // Column 1 is the client IP (a string); column 2 is its count
      System.out.println(res.getString(1) + "\t" + res.getLong(2));
    }