Hadoop for Data Analytics: Implementing a Weblog Parser

With the digitalization of the world, the data analytics function of extracting information or generating knowledge from raw data is becoming increasingly important. Parsing Weblogs to retrieve important information for analysis is one of the applications of data analytics. Many companies have turned to this application of data analytics for their basic needs.

For example, Walmart would want to analyze the bestselling product category for a region so that they could notify users living in that region about the latest products under that category. Another use case could be to capture the area details — using IP address information — about the regions that produce the most visits to their site.

All user transactions and on-site actions are normally captured in weblogs on a company’s websites. To retrieve all this information, developers must parse these weblogs, which are huge. While sequential parsing would be very slow and time consuming, parallelizing the parsing process makes it fast and efficient. But the process of parallelized parsing requires developers to split the weblogs into smaller chunks of data, and the partition of the data should be done in such a way that the final results will be consolidated without losing any vital information from the original data.

Hadoop’s MapReduce framework is a natural choice for parallel processing. Through Hadoop’s MapReduce utility, the weblog files can be split into smaller chunks and distributed across different nodes/systems over the cluster to produce their respective results. These results are then consolidated and the final results are obtained as per the user’s requirements.

In this article, I will walk through the complete process of weblog parsing using Hadoop, which is divided into three phases:

  1. Setting up the Hadoop cluster
  2. Transferring the Apache weblog files over HDFS
  3. Parsing the Apache weblogs

Setting Up the Hadoop Cluster

Here are the hardware and software prerequisites for implementing weblog parsing with Hadoop:

    Software:

  • Hadoop: Latest available version of Hadoop
  • Java 6 or later: Hadoop can run only on Java 6 or later versions
  • SSH client: To run the commands over the Hadoop machine
    Hardware:

  • Number of machines in the cluster: 2
  • Operating system: CentOS 5.5 or Solaris, etc.
  • Memory: 1GB
  • Processor: 1GHz

Download and unzip Hadoop using this command under a directory path:

gunzip hadoop-0.19.2.tar.gztar vxf hadoop-0.19.2.tar

A folder, hadoop-0.19.2, will be created under your mentioned path.

After Hadoop has been set up, a few configuration changes are required. Certain files under yourdirpath/hadoop-0.19.2/config need to be configured:

  • hadoop-env.sh — This file contains some environment variable settings used by Hadoop. The only variable you should need to change in this file is JAVA_HOME, which specifies the path to the Java installation used by Hadoop.
     export JAVA_HOME=yourdirpath/Java/jdk1.6.0_10 (where java has been installed)
  • hadoop-site.xml — This file contains site-specific settings for all Hadoop daemons and MapReduce jobs. It is empty by default. Settings in this file override those in hadoop-default.xml and mapred-default.xml. This file should contain settings that must be respected by all servers and clients in a Hadoop installation (e.g., the location of the namenode and the jobtracker). Set the variable fs.default.name to the Namenode’s intended host:port. Set the variable mapred.job.tracker to the jobtracker’s intended host:port. Also define hadoop.tmp.dir for the default storage location.

    For example:

    fs.default.namehdfs://10.66.163.60:8020/        mapred.job.tracker       10.66.163.60:8021     dfs.datanode.http.address    0.0.0.0:54311      dfs.http.address    0.0.0.0:54314  
  • masters — This file lists the hosts, one per line, where the Hadoop master daemon will run. It can be single entry as localhost also.
  • slaves — This file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. By default, this contains the single entry localhost.

Finally, run the command to start the datanodes and tasktrackers on different host machines:

[email protected]:~/hadoop$ bin/start-all.sh

Transferring the Apache Weblog Files Over HDFS

HDFS stands for Hadoop Distributed File System. The Apache weblog files that need to be parsed are transferred into HDFS. This is so the data contained in those files can be distributed over the Hadoop cluster through HDFS for faster access and better performance. These files then can be processed in parallel over the cluster for better throughput as well as for reliable data availability.

Here are the steps for loading the weblog files into the HDFS:

  • Make a directory structure over HDFS:
    [email protected]:hadoop$ bin/hadoop dfs –mkdir /user/yourUserName
  • Upload the weblog file from your local file system to HDFS:
    [email protected]:hadoop$ bin/hadoop dfs -put /dirpathlocalsystem/webLogFile.txt /user/yourUserName

You can check the HDFS to see whether your file has been uploaded using this link.

http://machineName:dfsDatanodeHttpAddressPortNumber/browsedirectory.jsp?dir=/&namenodeInfoPort=portNumber

Note: The required values will depend on the hadoop-site.xml file, for example:

http://hadoopserver:54311/browseDirectory.jsp?dir=/&namenodeInfoPort=54314

Parsing the Apache Weblogs

Parsing incoming Apache logs from the Web server can retrieve useful information such as the number of times the site has been accessed, from which locations (using the IP address), and how frequently different browsers were used. The output data can be dumped in the database for further analysis. This entire process (see Figure 1) is achieved with the Hadoop MapReduce feature.


Figure 1. Flow of the MapReduce Program

Here are the steps to implement the parsing program presented in Figure 1.

1. Load the Apache weblog files in the Hadoop DFS:

[email protected]:hadoop$ bin/hadoop dfs -put /dirpathlocalsystem/webLogFile.txt 
/user/yourUserName/hadoop/dfsdata/weblogfilename.log

2. Create the weblog parser class. Create a class that will include the map/reduce implementation of the log parser, for example, ApacheWebLogParser.

3. Create the initialize method. Create a method to get the connection handle for the database being used and call this method at the initialization of the class.

private void createConnection(String driverClassName, String url) throws Exception{Class.forName(driverClassName);   connection = DriverManager.getConnection(url);   connection.setAutoCommit(false);}private void initialize(String driverClassName, String url) throws Exception{   if(!this.initialized)   {      createConnection(driverClassName, url);      this.initialized = true;   }}

4. Create a class that will map the output of the reducer class to the database. This class implements the DBWritable, which makes sure that this class object will be dumped into the database. Below is the code for the reduce record.

 static class ReduceRecord implements Writable, DBWritable{   String browser;       /*browser name*/long b_frequency;   /*browser frequency (frequency at which different browsers are                                                                                         being used)*/   String ip_address;   /*ip address*/long ip_frequency;  /*ip address frequency(frequency of requests coming from different                                                                                             regions)*/public ReduceRecord(String browser, long b_frequency, String ip_address, long ip_frequency)   {            /*create a database table with below mentioned fields*/      this.browser = browser;      this.b_frequency = b_frequency;      this.ip_address = ip_address;      this.ip_frequency = ip_frequency;   }    public void readFields(DataInput in) throws IOException   {}public void write(DataOutput out) throws IOException   {}   public void readFields(ResultSet resultSet) throws SQLException   {}public void write(PreparedStatement statement) throws SQLException   {      statement.setString(1, browser);      statement.setLong(2, b_frequency);      statement.setString(3, ip_address);      statement.setLong(4, ip_frequency);   }}

5. Create the mapper class. Inside the ApacheWebLogParser class, create a mapper class and write the map method, which includes the logs parser.

The mapper program will read the Apache weblog files, parse the logs line by line, and collect the IP address and Browser-like information from the logs. The weblogs are delimited by semi-colon. Therefore, every line is split using the delimiter and the required information is collected and passed to the reducer program. Below is the code for the mapper class.

static class LogMapper extends MapReduceBase implements Mapper{public void map(LongWritable key, Text value,OutputCollector output, Reporter reporter) throws IOException   {      String line = value.toString();      if(line.startsWith("#") == false) /*to see if line is not a comment*/      {         String strTokens[] = line.split(" ");         String strIPAddress = strTokens[3];         String strBrowser = strTokens[9];                     final IntWritable one = new IntWritable(1);         int i = 0;         while(i < 2)         {            if(i == 0)            {               output.collect(new Text(strBrowser), one);            }            else if(i == 1)            {               output.collect(new Text(strIPAddress), one);            }            i++;         }      }   }}

7. Write the Data in the database. The run method of the ToolRunner interface, which ApacheWebLogParser class is implementing to, must be implemented. This method will start the flow of the complete program. It will first get the database details and create the database connections using the database configurations. It also sets the DBOutputFormat to the field names of the table in which the data needs to be dumped. This method then creates the job with the input and output details of the map and the reduce class. Below is the code for the run() method.

public int run(String[] args) throws Exception{   String driverClassName = DRIVER_CLASS; /* driver class is for the database       driver that is being used*/   String url = DB_URL; /*path where the database has been installed*/   initialize(driverClassName, url);   JobConf job = new JobConf(getConf(), ApacheLogParser.class);   job.setJobName("ApacheLogParser");   job.setMapperClass(LogMapper.class);   job.setReducerClass(LogReducer.class);   DBConfiguration.configureDB(job, driverClassName, url);String [] LogFieldNames = {"browser", "b_frequency", "ip_address", "ip_frequency"};   DBOutputFormat.setOutput(job, "apachelog_data", LogFieldNames);   /*apachelog_data is the table name in the database*/   job.setMapOutputKeyClass(Text.class);   job.setMapOutputValueClass(IntWritable.class);   job.setOutputKeyClass(ReduceRecord.class);   job.setOutputValueClass(NullWritable.class);   List other_args = new ArrayList();   for(int i=0; i < args.length; ++i)    {          try           {             if("-m".equals(args[i]))              {                job.setNumMapTasks(Integer.parseInt(args[++i]));             }              else if("-r".equals(args[i]))              {                job.setNumReduceTasks(Integer.parseInt(args[++i]));             }             else             {                other_args.add(args[i]);             }          }          catch(NumberFormatException except)           {System.out.println("ERROR: Integer expected instead of " + args[i]);             return printUsage();          }          catch(ArrayIndexOutOfBoundsException except)           {System.out.println("ERROR: Required parameter missing from " +args[i-1]);             return printUsage();          }       }       if(other_args.size() != 2)        {System.out.println("ERROR: Wrong number of parameters: " +other_args.size() + " instead of 2.");          return printUsage();       }       FileInputFormat.setInputPaths(job, other_args.get(0));       JobClient.runJob(job);       return 0;}

8. Run the parser. Now the ApacheWebLogParser program is converted to a JAR. For example, LogParser.jar and stored in all the Hadoop cluster nodes using the below mentioned command:

[email protected]:~/hadoop$ bin/hadoop jar path/LogParser.jar package.ApacheWebLogparser dfspath/weblog.log dfspath/reduceoutput

Note: The orange highlight is the path specific to the installed location.

9. Analyze the data from the database. After the data is in the database it can be used for further analysis. But this is the most critical part of the whole process. As the final objective is to deduce the desired information from the loads of the raw data, the analysis techniques and tools implemented at this stage are the most importance.

For example, the targeted information at this stage of the implementation is:

  • Different country aggregated data so that we can fetch all the IP address to find locations and their individual frequency of hitting the website
  • The information related to the kind of browsers being used and the frequency of their use

After this data is collected, different charting tools can generate the results graphically for better understanding. In this case, GoogleCharts can be used to generate pie and bar charts for the collected data (see Figures 2-4).

Conclusion

When handling huge amounts of data, normal serialized processing will be very slow and inefficient. On the other hand, Hadoop DFS is capable of handling this data in parallel by:

  • Splitting the data into smaller chunks and distributing it over Hadoop cluster nodes
  • Distributing the business-processing logic over the Hadoop cluster nodes

Hadoop DFS also is scalable because new nodes can be added in the clusters.

Hadoop DFS also always freedom from JDBC. The map output, after getting reduced using the reducer program, is dumped directly into the database, so there is no need to make a JDBC call every time.

Overall, the parallel processing of data, the scalability, and the bypassing of JDBC overhead results in good performance.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

The Latest

homes in the real estate industry

Exploring the Latest Tech Trends Impacting the Real Estate Industry

The real estate industry is changing thanks to the newest technological advancements. These new developments — from blockchain and AI to virtual reality and 3D printing — are poised to change how we buy and sell homes. Real estate brokers, buyers, sellers, wholesale real estate professionals, fix and flippers, and beyond may

man on floor with data

DevX Quick Guide to Data Ingestion

One of the biggest trends of the 21st century is the massive surge in internet usage. With major innovations such as smart technology, social media, and online shopping sites, the internet has become an essential part of everyday life for a large portion of the population. Due to this internet

payment via phone

7 Ways Technology Has Changed Traditional Payments

In today’s digital world, technology has changed how we make payments. From contactless cards to mobile wallets, it’s now easier to pay for goods and services without carrying cash or using a checkbook. This article will look at seven of the most significant ways technology has transformed traditional payment methods.