Hive and Hadoop for Data Analytics on Large Web Logs

Developers can use Apache Hive and Hadoop for data analytics on large web logs, analyzing users' browsing patterns and behavior.


In this Apache Hive tutorial, we explain how to use Hive in a Hadoop distributed processing environment to enable Web analytics on large datasets. The Web analytics process will involve analyzing weblog details such as URLs accessed, cookies, access dates with times, and IP addresses. This information will be used to analyze visitors' website usage as well as their browsing patterns and behavior. Armed with this information, site owners can predict what a particular user likes on the site and personalize it accordingly. For their part, developers can add extra tracking values in the weblog for additional analytics.

Apache Hadoop and Hive for Data Processing

Apache Hadoop, the open source distributed computing framework for handling large datasets, uses the Hadoop Distributed File System (HDFS) for storing files and the MapReduce model for processing them. Apache Hive, which began as a sub-project of Hadoop, is a data warehouse infrastructure used to query and analyze large datasets stored in Hadoop files. Hive is not a relational database, but it supports a SQL-like query language, HiveQL, with which users can query Hive tables. Hive runs on top of Hadoop and can use ZooKeeper, a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Weblogs can be processed by a Hadoop MapReduce program and stored in HDFS. Meanwhile, Hive supports fast reading of the data in the HDFS location, basic SQL, joins, and batch loading of data into Hive tables.

Weblog Formats and Processing

Both Apache Web Server and Microsoft IIS record website requests in log files, but the formats of those logs differ. Apache's preferred weblog format is the combined log format, which records the client, the request, the response status and size, the referrer, and the user agent. Here is an example LogFormat directive based on the combined format, extended to log the cookie as well:

LogFormat "%h %v %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""
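As a rough illustration of how a record written with this LogFormat could be broken into named fields, here is a minimal Java sketch that uses a regular expression. The class name `CombinedLogLine`, its field names, and the sample line in `main` are hypothetical; the field order is taken from the directive above. It is not the program attached to this article, and it does not handle requests that contain escaped quotes.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses one record written with the LogFormat shown above.
final class CombinedLogLine {
    // One capture group per directive, in the order used by the LogFormat:
    // %h %v %u %t "%r" %>s %b "Referer" "User-Agent" "Cookie"
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)"
        + " \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\"$");

    // Apache writes %t as day/month/year:hour:minute:second zone.
    private static final DateTimeFormatter TIME =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    final String host;
    final String serverName;
    final String user;
    final OffsetDateTime time;
    final String request;
    final int status;
    final long bytes;
    final String referer;
    final String userAgent;
    final String cookie;

    private CombinedLogLine(Matcher m) {
        host = m.group(1);
        serverName = m.group(2);
        user = m.group(3);
        time = OffsetDateTime.parse(m.group(4), TIME);
        request = m.group(5);
        status = Integer.parseInt(m.group(6));
        // %b logs "-" when no response body was sent.
        bytes = "-".equals(m.group(7)) ? 0L : Long.parseLong(m.group(7));
        referer = m.group(8);
        userAgent = m.group(9);
        cookie = m.group(10);
    }

    static CombinedLogLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized log line: " + line);
        }
        return new CombinedLogLine(m);
    }

    public static void main(String[] args) {
        // Made-up sample record for illustration only.
        String sample = "192.0.2.10 www.example.com - [10/Apr/2007:10:39:11 +0300] "
            + "\"GET /index.html HTTP/1.1\" 200 606 \"-\" \"Mozilla/5.0\" \"session=abc123\"";
        CombinedLogLine rec = parse(sample);
        System.out.println(rec.host + " requested [" + rec.request + "] at " + rec.time
            + " with status " + rec.status);
    }
}
```

Keeping the timestamp as an `OffsetDateTime` rather than a string makes later grouping by date or hour straightforward.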

Here is the IIS preferred format, Microsoft IIS W3C Extended Log File Format:

c-ip cs-username date time sc-bytes sc-status cs-uri-stem cs(Referer) cs(User-Agent) cs(Cookie)
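W3C Extended log files name their columns in a `#Fields:` directive line, which is followed by space-separated records. Here is a minimal Java sketch, assuming that layout, that pairs each field name with its value. The class name `W3cLogReader` and the sample lines in `main` are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for W3C Extended log lines.
final class W3cLogReader {
    // Column names taken from the most recent "#Fields:" directive.
    private String[] fieldNames = new String[0];

    // Returns null for directive or blank lines, and a name-to-value map for records.
    Map<String, String> readLine(String line) {
        String trimmed = line.trim();
        if (trimmed.startsWith("#Fields:")) {
            fieldNames = trimmed.substring("#Fields:".length()).trim().split("\\s+");
            return null;
        }
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null; // other directives such as #Version or #Date
        }
        String[] values = trimmed.split("\\s+");
        if (values.length != fieldNames.length) {
            throw new IllegalArgumentException(
                "Expected " + fieldNames.length + " fields but found " + values.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            record.put(fieldNames[i], values[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        W3cLogReader reader = new W3cLogReader();
        // Made-up sample lines for illustration only.
        reader.readLine("#Fields: c-ip date time sc-status cs-uri-stem");
        Map<String, String> record =
            reader.readLine("192.0.2.10 2007-04-10 10:39:11 500 /index.html");
        System.out.println(record);
    }
}
```

Because the column names come from the file itself, the same reader works when the logged fields are reconfigured, for example to add extra tracking values.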

Large retail applications are accessed by many users around the world, so their weblog file sizes might be between 10 and 15 gigabytes. For example, Amazon.com's weblog is more than 15 gigabytes. This weblog information is used to predict customer interest and personalize the site.

Page tagging embeds a small tag or script in each Web page to track usage of the page. Web analytics providers such as WebTrends and Google Analytics use this option for tracking page views and storing this view information in log files.

Hadoop MapReduce for Parsing Weblogs

Below is a sample Apache log record in which the fields are separated by spaces. The OpenCSV framework is used for parsing these logs; it splits each record into fields using the field terminator character.

 - - [10/Apr/2007:10:39:11 +0300] "GET / HTTP/1.1" 500 606 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20061201 Firefox/ (Ubuntu-feisty)"
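To make the splitting rule concrete, here is a small standalone Java sketch of quote-aware, space-delimited tokenizing, the general technique a separator-and-quote parser applies to this kind of record. It is an illustration only and not OpenCSV's implementation; the names `QuotedSplitter` and `split` are hypothetical, and the input in `main` is a shortened version of the sample record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of separator/quote tokenizing.
final class QuotedSplitter {
    // Splits on the separator, except where it appears between quote characters.
    static List<String> split(String line, char separator, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;          // quote characters are dropped
            } else if (c == separator && !inQuotes) {
                fields.add(current.toString()); // end of one field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());         // last field
        return fields;
    }

    public static void main(String[] args) {
        String line = "- - [10/Apr/2007:10:39:11 +0300] \"GET / HTTP/1.1\" 500 606 \"-\"";
        for (String field : split(line, ' ', '"')) {
            System.out.println(field);
        }
    }
}
```

Under this rule the quoted request `GET / HTTP/1.1` stays together as one field, while the bracketed timestamp, which contains a space but no quotes, arrives as two fields (`[10/Apr/2007:10:39:11` and `+0300]`). Any downstream table definition has to account for that.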

Here are the steps for parsing a log file using Hadoop MapReduce:

    1. Load the log files into the HDFS location using this Hadoop command:
      hadoop fs -put <local file path of weblogs>  <hadoop HDFS location>
    2. The OpenCSV library (Opencsv2.3.jar) is used for parsing log records.
    3. Below is the Mapper program for parsing the log file from the HDFS location.
      // Needs java.io.IOException, au.com.bytecode.opencsv.CSVParser,
      // org.apache.hadoop.io.NullWritable, org.apache.hadoop.io.Text,
      // and org.apache.hadoop.mapreduce.Mapper.
      public static class ParseMapper
           extends Mapper<Object, Text, NullWritable, Text> {
        private Text word = new Text();
        public void map(Object key, Text value, Context context
                       ) throws IOException, InterruptedException {
          // Split on spaces; text inside double quotes stays one field.
          CSVParser parse = new CSVParser(' ', '\"');
          String[] sp = parse.parseLine(value.toString());
          int spSize = sp.length;
          StringBuilder rec = new StringBuilder();
          for (int i = 0; i < spSize; i++) {
            // Assumption: the loop body is cut off in the source; a tab-joined record is a guess.
            if (i > 0) rec.append('\t');
            rec.append(sp[i]);
          }
          word.set(rec.toString());
          context.write(NullWritable.get(), word);
        }
      }
    4. The command below runs the Hadoop-based log parser. The MapReduce program is attached to this article. You can add extra parsing methods to the class. Be sure to build a new JAR after any change and move it to the Hadoop distributed job tracker system.
      hadoop jar <path of logparse jar> <hadoop HDFS logfile path>  <output path of parsed log file>
    5. The output file is stored in the HDFS location, and the output file name starts with "part-".
