The World Wide Web has an estimated 2 billion users and contains anywhere from 15 to 45 billion Web pages, with around 10 million pages added each day. With such large numbers, almost every website owner and developer who has a decent presence on the Internet faces a complex problem: how to make sense of their web pages and all the users who visit their websites.
Every Web server worth its salt logs the user activities for the websites it supports and the Web pages it serves up to the virtual world. These Web logs are mostly used for debugging issues or to get insight into the details, which are interesting from a business or performance point of view. Over time, the size of the logs keeps increasing until it becomes very difficult to manually extract any important information out of them, particularly for busy websites. The Hadoop framework does a good job at tackling this challenge in a timely, reliable, and cost-efficient manner.
This article explains the formidable task of Web log analysis using the Hadoop framework and Pig scripting language, which are well suited to handle large amounts of unstructured data. We propose a solution based on the Pig framework that aggregates data at an hourly, daily or yearly granularity. The proposed architecture features a data-collection and a database layer as an end-to-end solution, but in this article we focus on the analysis layer, which is implemented in the Pig Latin language.
Figure 1 shows an illustration of the layered architecture.
Click here for larger image
Figure 1. Log Analysis Solution Architecture
Analyzing Logs Generated by Web Servers
The challenge for the proposed solution is to analyze Web logs generated by Apache Web Server. Apache Web logs follow a standard pattern that is customizable. Their description and sample Web logs can be found easily on the Web. These logs contain information in various fields, such as timestamp, IP address, page visited, referrer, and browser, among others. Each row in the Web log corresponds to a visit or event on a Web page.
The size of Web logs can range anywhere from a few KB to hundreds of GB. We have to design solutions that based on different dimensions such as timestamp, browser and country can extract patterns and information out of these logs and provide us vital bits of information, such as the number of hits for a particular website or Web page, the number of unique users, and so on. Each potential problem can be divided into a particular use case and can then be solved. For our purpose here, we will take two use cases:
Find number of hits and number of unique users for Year dimension.
Find number of hits and number of unique users for Time, Country, and City dimensions.
These use cases will demonstrate the basic approach and steps required to solve these problems. Other problems can be tackled easily by making slight modifications to the implementations and/or approach used in solving the above-mentioned use cases.
The technologies used are the Apache Hadoop framework, Apache Pig, the Java programming language, and regular expressions (regex). Although this article focuses on Apache Web logs, the approach is generic enough so that it can be implemented to logs generated by any other server or system.
Hadoop Solution Architecture
The proposed architecture is a layered architecture, and each layer has components. It scales according to the number of logs generated by your Web servers, enables you to harness the data to get key insights, and is based on an economical, scalable platform. You can continue adding and retaining data for longer periods to enrich your knowledge base.
The Log Analysis Software Stack
- Hadoop is an open source framework that allows users to process very large data in parallel. It's based on the framework that supports Google search engine. The Hadoop core is mainly divided into two modules:
- HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using multiple commodity servers connected in a cluster.
- Map-Reduce (MR) is a framework for parallel processing of large data sets. The default implementation is bonded with HDFS.
- The database can be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, as we can keep historical processed data for reporting purposes. HBase is an open source columnar DB or NoSQL DB, which uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets -- HBase can save very large tables having million of rows. It's a distributed database and can also keep multiple versions of a single row.
- The Pig framework is an open source platform for analyzing large data sets and is implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers who write code in the Map-Reduce format, since code in Map-Reduce format needs to be written in Java. In contrast, Pig enables users to write code in a scripting language.
- Flume is a distributed, reliable and available service for collecting, aggregating and moving a large amount of log data (src flume-wiki). It was built to push large logs into Hadoop-HDFS for further processing. It's a data flow solution, where there is an originator and destination for each node and is divided into Agent and Collector tiers for collecting logs and pushing them to destination storage.
Data Flow and Components
- Content will be created by multiple Web servers and logged in local hard discs. This content will then be pushed to HDFS using FLUME framework. FLUME has agents running on Web servers; these are machines that collect data intermediately using collectors and finally push that data to HDFS.
- Pig Scripts are scheduled to run using a job scheduler (could be cron or any sophisticated batch job solution). These scripts actually analyze the logs on various dimensions and extract the results. Results from Pig are by default inserted into HDFS, but we can use storage implementation for other repositories also such as HBase, MongoDB, etc. We have also tried the solution with HBase (please see the implementation section). Pig Scripts can either push this data to HDFS and then MR jobs will be required to read and push this data into HBase, or Pig scripts can push this data into HBase directly. In this article, we use scripts to push data onto HDFS, as we are showcasing the Pig framework applicability for log analysis at large scale.
- The database HBase will have the data processed by Pig scripts ready for reporting and further slicing and dicing.
- The data-access Web service is a REST-based service that eases the access and integrations with data clients. The client can be in any language to access REST-based API. These clients could be BI- or UI-based clients.