As the world becomes increasingly digital, data analytics -- extracting information and generating knowledge from raw data -- is becoming increasingly important. Parsing weblogs to retrieve information for analysis is one application of data analytics, and many companies now rely on it for basic business insight.
For example, Walmart might want to identify the bestselling product category for a region so that it could notify users living in that region about the latest products in that category. Another use case could be to use IP address information to determine which regions produce the most visits to the site.
All user transactions and on-site actions are normally captured in weblogs on a company's websites. To retrieve this information, developers must parse these weblogs, which can be huge. Sequential parsing would be slow and time consuming; parallelizing the parsing process makes it fast and efficient. But parallelized parsing requires developers to split the weblogs into smaller chunks, and the data must be partitioned in such a way that the final results can be consolidated without losing any vital information from the original data.
Hadoop's MapReduce framework is a natural choice for parallel processing. Through Hadoop's MapReduce utility, the weblog files can be split into smaller chunks and distributed across different nodes/systems over the cluster to produce their respective results. These results are then consolidated and the final results are obtained as per the user's requirements.
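As a rough local analogue of this flow, the shell pipeline below plays the "map" step by extracting the client IP from each Apache log line and the "reduce" step by counting hits per IP. The log lines are made-up samples in Common Log Format:

```shell
# "Map": extract the first field (the client IP) from each request line.
# "Reduce": count the occurrences of each IP.
printf '%s\n' \
  '10.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET /a.html HTTP/1.0" 200 2326' \
  '10.0.0.2 - - [10/Oct/2011:13:55:37 -0700] "GET /b.html HTTP/1.0" 200 100' \
  '10.0.0.1 - - [10/Oct/2011:13:55:38 -0700] "GET /c.html HTTP/1.0" 404 0' \
  | awk '{print $1}' \
  | sort \
  | uniq -c
```

Hadoop applies the same split/extract/aggregate pattern, except that each chunk's map step runs on a different node of the cluster.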
In this article, I will walk through the complete process of weblog parsing using Hadoop, which is divided into three phases:
- Setting up the Hadoop cluster
- Transferring the Apache weblog files over HDFS
- Parsing the Apache weblogs
Setting Up the Hadoop Cluster
Here are the hardware and software prerequisites for implementing weblog parsing with Hadoop:
- Hadoop: The latest available version (this article uses Hadoop 0.19.2)
- Java 6 or later: Hadoop requires Java 6 or a later version
- SSH client: To run the commands over the Hadoop machine
- Number of machines in the cluster: 2
- Operating system: CentOS 5.5 or Solaris, etc.
- Memory: 1GB
- Processor: 1GHz
Download the Hadoop tarball and extract it under a directory of your choice:
tar xvf hadoop-0.19.2.tar
A folder named hadoop-0.19.2 will be created under that path.
After Hadoop has been set up, a few configuration changes are required. Certain files under
yourdirpath/hadoop-0.19.2/conf need to be configured:
- hadoop-env.sh -- This file contains some environment variable settings used by Hadoop. The only variable you should need to change in this file is JAVA_HOME, which specifies the path to the Java installation used by Hadoop.
export JAVA_HOME=yourdirpath/Java/jdk1.6.0_10 (where java has been installed)
- hadoop-site.xml -- This file contains site-specific settings for all Hadoop daemons and MapReduce jobs. It is empty by default. Settings in this file override the defaults in
hadoop-default.xml. This file should contain settings that must be respected by all servers and clients in a Hadoop installation (e.g., the location of the namenode and the jobtracker). Set the variable
fs.default.name to the Namenode's intended host:port. Set the variable
mapred.job.tracker to the jobtracker's intended host:port. Also define
hadoop.tmp.dir for the default storage location.
- masters -- This file lists the hosts, one per line, where the Hadoop master daemon will run. By default, this contains the single entry localhost.
- slaves -- This file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. By default, this contains the single entry localhost.
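The hadoop-site.xml settings described above can be sketched as follows. The hostnames, ports, and temporary directory shown here are placeholders; adapt them to your own cluster:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- URI of the namenode; clients resolve HDFS paths against this -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master-host:9000</value>
  </property>
  <!-- host:port where the jobtracker listens for MapReduce jobs -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master-host:9001</value>
  </property>
  <!-- base directory for Hadoop's temporary storage -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-tmp</value>
  </property>
</configuration>
```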
Finally, run Hadoop's start script to start the datanodes and tasktrackers on the different host machines.
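Assuming the standard Hadoop 0.19 directory layout, the cluster is started from the bin directory of the extracted folder; start-all.sh reads the masters and slaves files and launches the daemons on each listed host:

```shell
# Format the namenode once, before the very first start
# (destructive -- run only when setting up a new cluster)
bin/hadoop namenode -format

# Start the namenode, datanodes, jobtracker, and tasktrackers
# on the hosts listed in conf/masters and conf/slaves
bin/start-all.sh
```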
Transferring the Apache Weblog Files Over HDFS
HDFS stands for Hadoop Distributed File System. The Apache weblog files that need to be parsed are transferred into HDFS so that their data is distributed over the Hadoop cluster for faster access. The files can then be processed in parallel over the cluster for better throughput as well as reliable data availability.
Here are the steps for loading the weblog files into the HDFS:
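A sketch of the loading step, using the dfs subcommand that manages HDFS in Hadoop 0.19; the local and HDFS paths here are placeholders:

```shell
# Create a directory in HDFS for the weblogs
bin/hadoop dfs -mkdir /user/hadoop/weblogs

# Copy the local Apache weblog file into HDFS
bin/hadoop dfs -put /local/path/access.log /user/hadoop/weblogs/

# List the directory to confirm the upload
bin/hadoop dfs -ls /user/hadoop/weblogs
```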
You can check HDFS to see whether your file has been uploaded through the namenode's web interface.
Note: The exact host and port values depend on the settings in the hadoop-site.xml file.
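For example, with Hadoop's default HTTP port, the namenode status page (which includes a browser for the HDFS file tree) would be reachable at a URL like the following, where the hostname is a placeholder for wherever your namenode runs:

```
http://master-host:50070/
```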