In this Apache Hive tutorial, we explain how to use Hive in a Hadoop distributed processing environment to enable Web analytics on large datasets. The Web analytics process will involve analyzing weblog details such as URLs accessed, cookies, access dates with times, and IP addresses. This information will be used to analyze visitors’ website usage as well as their browsing patterns and behavior. Armed with this information, site owners can predict what a particular user likes on the site and personalize it accordingly. For their part, developers can add extra tracking values in the weblog for additional analytics.
Apache Hadoop and Hive for Data Processing
Apache Hadoop, the open source distributed computing framework for handling large datasets, uses the HDFS file system for storing files and the MapReduce model for processing large datasets. Apache Hive, a sub-project of Hadoop, is a data warehouse infrastructure used to query and analyze large datasets stored in Hadoop files. Hive is not a relational SQL database, but it supports a subset of SQL through its own query language, Hive QL, which users can use to query Hive tables. Hive works on top of Hadoop and ZooKeeper, a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Weblogs can be processed by the Hadoop MapReduce program and stored in HDFS. Meanwhile, Hive supports fast reading of the data in the HDFS location, basic SQL, joins, and batch data load to the Hive database.
Weblog Formats and Processing
Both Apache Web Server and Microsoft IIS record website requests in log files, but the formats of those logs differ. Apache's preferred weblog format is the combined log format, which records the full details of each Web request. Here is an example of the combined weblog format.
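To make the combined format concrete, the sketch below parses the sample entry given in the Apache HTTP Server documentation with a regular expression. The class name, field names, and the regular expression itself are illustrative assumptions, not part of the attached MapReduce program, and the pattern does not handle escaped quotes inside quoted fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedLogParser {
    // host ident user [timestamp] "request" status bytes "referer" "user-agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    // Hypothetical field names, listed in the order of the capture groups above
    private static final String[] FIELDS = {
        "host", "ident", "user", "timestamp", "request",
        "status", "bytes", "referer", "userAgent"
    };

    /** Returns the named fields of one log line, or null if the line does not match. */
    static Map<String, String> parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length; i++) {
            fields.put(FIELDS[i], m.group(i + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample combined-format entry from the Apache HTTP Server documentation
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
            + "\"http://www.example.com/start.html\" "
            + "\"Mozilla/4.08 [en] (Win98; I ;Nav)\"";
        parse(line).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The fields are, in order: client host, identity, user, timestamp, request line, status code, response size, referrer, and user agent.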
Here is the IIS preferred format, Microsoft IIS W3C Extended Log File Format:
c-ip cs-username date time sc-bytes sc-status cs-uri-stem cs(Referer) cs(User-Agent) cs(Cookie)
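As a rough illustration of how such a field list lines up with a log entry, the sketch below pairs each field name with the value at the same position. It assumes that values are separated by spaces, that spaces inside a value are written as "+", and that an empty field is written as "-", which is how IIS normally writes this format. The class name and the sample entry are invented for illustration, and the header-derived fields are written with the parenthesized names that IIS uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class W3cLogLine {
    /** Pairs each field name with the value at the same position in the entry. */
    static Map<String, String> toRecord(String fieldNames, String dataLine) {
        String[] names = fieldNames.trim().split("\\s+");
        String[] values = dataLine.trim().split("\\s+");
        if (names.length != values.length) {
            throw new IllegalArgumentException(
                "expected " + names.length + " fields but found " + values.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            // Assumption: "-" marks an empty field, stored here as null
            record.put(names[i], "-".equals(values[i]) ? null : values[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        String fields = "c-ip cs-username date time sc-bytes sc-status "
            + "cs-uri-stem cs(Referer) cs(User-Agent) cs(Cookie)";
        // Hypothetical entry, invented for illustration
        String line = "192.0.2.10 - 2011-05-02 17:42:15 3120 200 /default.htm "
            + "- Mozilla/4.0+(compatible;+MSIE+7.0) -";
        toRecord(fields, line).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```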
Large retail applications are accessed by many users around the world, so their weblog file sizes might be between 10 and 15 gigabytes. For example, Amazon.com’s weblog is more than 15 gigabytes. This weblog information is used to predict customer interest and personalize the site.
Page tagging is used for tagging the Web page and tracking usage of the page. Web analytics providers such as WebTrends and Google Analytics use this option for tracking page views and storing this view information in log files.
Hadoop MapReduce for Parsing Weblogs
Below is a sample Apache log where the log fields are terminated by a space. The OpenCSV framework is used for parsing these logs. It splits each log record into fields at the field terminator character.
Load log files into the HDFS location using this Hadoop command:
hadoop fs -put
The OpenCSV library (the version 2.3 JAR) is used for parsing log records.
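The splitting step can be sketched in plain Java. The class below is a simplified stand-in for what a separator-and-quote parser does. It is not the OpenCSV API, and its names are hypothetical. It splits on single spaces and keeps double-quoted sections together.

```java
import java.util.ArrayList;
import java.util.List;

class SpaceDelimitedSplitter {
    /** Splits a record on single spaces, keeping double-quoted sections together. */
    static List<String> split(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : record.toCharArray()) {
            if (c == '"') {
                // Quotes delimit a field but are not part of its value
                inQuotes = !inQuotes;
            } else if (c == ' ' && !inQuotes) {
                // A separator outside quotes ends the current field
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // The last field has no trailing separator
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String record = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        split(record).forEach(System.out::println);
    }
}
```

Note that the bracketed timestamp is not quoted, so a purely space-based split breaks it into two fields (the date-time part and the time zone offset), while the quoted request line stays whole.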
Below is the Mapper program for parsing the log file from the HDFS location.
public static class ParseMapper extends Mapper
The command below runs the Hadoop-based log parser. The MapReduce program is attached to this article. You can add extra parsing methods to the class. Be sure to build a new JAR after any change and move it to the Hadoop job tracker system.
hadoop jar
The output file is stored in the HDFS location, and the output file name starts with "part-".
Hadoop Hive Configuration
The sections to follow explain how to configure Hive for weblog analytics.
After downloading a Hive release, unpack the tarball, which will create a subdirectory named hive-x.y.z:
$ tar -xzvf hive-x.y.z.tar.gz
Point the environment variable HIVE_HOME to the installation directory:
$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH
Running Hadoop Hive
Because Hive uses Hadoop, either:
you must have Hadoop in your path, OR
export HADOOP_HOME=
In addition, you must create /tmp and /user/hive/warehouse (the default value of hive.metastore.warehouse.dir) in HDFS and make both directories group-writable (chmod g+w) before a table can be created in Hive.
I also find it useful, but not necessary, to set HIVE_HOME as follows:
$ export HIVE_HOME=
To use the Hive command line interface (CLI) from the shell:
$ $HIVE_HOME/bin/hive
Hive Runtime Configuration
Hive queries are executed as MapReduce jobs. Therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.
The CLI command SET can be used to set any Hadoop (or Hive) configuration variable. For example, the first command below sets the job tracker address, and SET -v lists the current configuration variables:
hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET -v;
Hive, MapReduce and Local-Mode
The Hive compiler generates MapReduce jobs for most queries. These jobs are then submitted to the MapReduce cluster indicated by this variable:
mapred.job.tracker
This usually points to a MapReduce cluster with multiple nodes, but Hadoop also provides an option to run MapReduce jobs locally on the user's PC. This can be very useful for running queries over small data sets because in such cases local mode execution is usually much faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode runs with only one reducer and can be very slow when processing larger data sets.
Starting with version 0.7, Hive fully supports local mode execution, which you can activate by enabling the following option:
hive> SET mapred.job.tracker=local;
In addition, mapred.local.dir should point to a path that's valid on the local machine (for example, /tmp/<username>/mapred/local). Otherwise, the user will get an exception when allocating local disk space.
Starting with version 0.7, Hive can also run MapReduce jobs in local mode automatically. The relevant option is shown below with its default value, false; set it to true to enable the feature:
hive> SET hive.exec.mode.local.auto=false;
Hadoop Hive Data Load
Hive provides tools for easy data ETL, a mechanism for putting structure on the data, and a simple SQL-like query language, called Hive QL, that enables users familiar with SQL to query the data. At the same time, Hive QL also allows programmers familiar with MapReduce to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
Below is the create table command for Hive.
create table weblogs (
  client_ip string,
  full_request_date string,
  day string,
  month string,
  month_num int,
  year string,
  hour string,
  minute string,
  second string,
  timezone string,
  http_verb string,
  uri string,
  http_status_code string,
  bytes_returned string,
  referrer string,
  user_agent string
)
row format delimited
fields terminated by ' '
stored as textfile;
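The table stores each request timestamp split into day, month, month_num, year, hour, minute, second, and timezone columns. How the attached MapReduce program derives these values is not shown here. One possible approach, assuming the usual Apache timestamp layout (for example 10/Oct/2000:13:55:36 -0700), is sketched below with java.time. The class and method names are hypothetical.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.TextStyle;
import java.util.Locale;

class TimestampColumns {
    // Assumed layout of the access-log timestamp, without the surrounding brackets
    private static final DateTimeFormatter APACHE_TIME =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    /** Returns day, month, month_num, year, hour, minute, second, timezone, in that order. */
    static String[] toColumns(String timestamp) {
        OffsetDateTime t = OffsetDateTime.parse(timestamp, APACHE_TIME);
        return new String[] {
            String.valueOf(t.getDayOfMonth()),
            t.getMonth().getDisplayName(TextStyle.SHORT, Locale.ENGLISH),
            String.valueOf(t.getMonthValue()),
            String.valueOf(t.getYear()),
            String.valueOf(t.getHour()),
            String.valueOf(t.getMinute()),
            String.valueOf(t.getSecond()),
            t.getOffset().getId()
        };
    }

    public static void main(String[] args) {
        // Joined with a space, the field terminator declared for the weblogs table
        System.out.println(String.join(" ", toColumns("10/Oct/2000:13:55:36 -0700")));
    }
}
```

Because the table is space-delimited, every value written for these columns must itself be free of spaces.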
The below command is used for loading data from an HDFS location to a Hive table.
LOAD DATA INPATH '' INTO TABLE
After the data is loaded into the table, users can query it with Hive QL. Below is an example query that returns the number of requests from each client IP address.
SELECT client_ip , COUNT(client_ip) FROM weblogs GROUP BY client_ip
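To make the result of this query concrete, the following plain-Java sketch computes the same kind of per-IP counts over a small in-memory list. It only illustrates the grouping semantics and says nothing about how Hive executes the query. The class name and sample addresses are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class RequestsPerClient {
    /** Counts how often each client IP occurs, like COUNT(client_ip) with GROUP BY client_ip. */
    static Map<String, Long> countByClientIp(List<String> clientIps) {
        return clientIps.stream()
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical client_ip values, one per logged request
        List<String> ips = Arrays.asList("192.0.2.10", "192.0.2.11", "192.0.2.10");
        countByClientIp(ips).forEach((ip, n) -> System.out.println(ip + " " + n));
    }
}
```

Each output row pairs one distinct client IP with the number of log records that carry it, which is the shape of the result set the Hive query returns.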
Hadoop Hive JDBC Support
Hive also supports JDBC connections. To connect Hive with JDBC, you need to start the Hive Thrift Server as follows.
$ export HIVE_PORT=9999
$ hive --service hiveserver
Here are the steps to establish a Hive JDBC connection:
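Add hive-jdbc0.7.jar to the classpath; this is a type-4 driver.
Use the org.apache.hadoop.hive.jdbc.HiveDriver driver for the connection.
Connection string: jdbc:hive://:/ (the host, port, and database name fill the three positions, for example jdbc:hive://10.0.0.1:9999/default)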
Use Hive QL to query the table in Hive, and it will return the result set.
Using the result set, you can project the output in graphs or charts easily.
Here is a sample Hive JDBC program:
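import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Load the Hive JDBC driver and connect to the Hive Thrift server.
Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
Connection con = DriverManager.getConnection("jdbc:hive://10.0.0.1:9999/default", "", "");
Statement stmt = con.createStatement();
ResultSet res = stmt.executeQuery("SELECT client_ip, COUNT(client_ip) FROM weblogs GROUP BY client_ip");
// Column 1 is the client IP (a string); column 2 is its request count.
while (res.next()) {
    System.out.println(res.getString(1) + " " + res.getLong(2));
}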