Hadoop for Data Analytics: Implementing a Weblog Parser : Page 2

How does Hadoop handle big data? Really well, actually. Find out exactly how by implementing a Hadoop-based log parser.


Parsing the Apache Weblogs

Parsing the incoming Apache logs from the web server yields useful information such as how many times the site has been accessed, from which locations (via the IP address), and how frequently different browsers were used. The output data can be dumped into a database for further analysis. This entire process (see Figure 1) is implemented with the Hadoop MapReduce feature.

Figure 1. Flow of the MapReduce Program

Here are the steps to implement the parsing program presented in Figure 1.

1. Load the Apache weblog files in the Hadoop DFS:

you@your-machine:hadoop$ bin/hadoop dfs -put /dirpathlocalsystem/webLogFile.txt webLogFile.txt

2. Create the weblog parser class. Create a class that will include the map/reduce implementation of the log parser, for example, ApacheWebLogParser.

3. Create the initialize method. Create a method that obtains a connection handle for the database being used, and call it when the class is initialized.

private void createConnection(String driverClassName, String url) throws Exception
{
   Class.forName(driverClassName);   /*register the JDBC driver*/
   connection = DriverManager.getConnection(url);
}

private void initialize(String driverClassName, String url) throws Exception
{
   createConnection(driverClassName, url);
   this.initialized = true;
}

4. Create a class that will map the output of the reducer class to the database. This class implements the DBWritable interface, which ensures that objects of this class can be written to the database. Below is the code for the reduce record.

static class ReduceRecord implements Writable, DBWritable
{
   String browser;      /*browser name*/
   long b_frequency;    /*browser frequency (how often each browser is used)*/
   String ip_address;   /*ip address*/
   long ip_frequency;   /*ip address frequency (how often requests come from each address)*/

   public ReduceRecord(String browser, long b_frequency, String ip_address, long ip_frequency)
   {  /*a database table with the below-mentioned fields must be created*/
      this.browser = browser;
      this.b_frequency = b_frequency;
      this.ip_address = ip_address;
      this.ip_frequency = ip_frequency;
   }

   public void readFields(DataInput in) throws IOException
   {
      this.browser = Text.readString(in);
      this.b_frequency = in.readLong();
      this.ip_address = Text.readString(in);
      this.ip_frequency = in.readLong();
   }

   public void write(DataOutput out) throws IOException
   {
      Text.writeString(out, browser);
      out.writeLong(b_frequency);
      Text.writeString(out, ip_address);
      out.writeLong(ip_frequency);
   }

   public void readFields(ResultSet resultSet) throws SQLException
   {
      this.browser = resultSet.getString(1);
      this.b_frequency = resultSet.getLong(2);
      this.ip_address = resultSet.getString(3);
      this.ip_frequency = resultSet.getLong(4);
   }

   public void write(PreparedStatement statement) throws SQLException
   {
      statement.setString(1, browser);
      statement.setLong(2, b_frequency);
      statement.setString(3, ip_address);
      statement.setLong(4, ip_frequency);
   }
}
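For ReduceRecord objects to reach the database, the job must be wired to Hadoop's DBOutputFormat. The article does not show this wiring, so the following is only a sketch using the old (mapred) DB APIs; the driver class, JDBC URL, table name, and column names are illustrative assumptions, not values from the article. The column order must match the PreparedStatement indices in ReduceRecord's write method.

```java
/*Hedged job-configuration sketch: the driver, URL, table, and column
  names below are assumptions for illustration only.*/
JobConf conf = new JobConf(ApacheWebLogParser.class);
DBConfiguration.configureDB(conf,
      "com.mysql.jdbc.Driver",              /*assumed JDBC driver*/
      "jdbc:mysql://localhost/weblogdb");   /*assumed database URL*/
/*column order matches the statement indices 1-4 in ReduceRecord.write()*/
DBOutputFormat.setOutput(conf, "weblog_stats",
      "browser", "b_frequency", "ip_address", "ip_frequency");
```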

5. Create the mapper class. Inside the ApacheWebLogParser class, create a mapper class and write the map method, which contains the log-parsing logic.

The mapper program reads the Apache weblog files, parses the logs line by line, and collects information such as the IP address and browser from each entry. The log fields are space-delimited, so every line is split on that delimiter and the required fields are extracted and passed to the reducer program. Below is the code for the mapper class.

static class LogMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
   public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
   {
      String line = value.toString();
      if(line.startsWith("#") == false)   /*skip comment lines*/
      {
         String strTokens[] = line.split(" ");
         String strIPAddress = strTokens[3];
         String strBrowser = strTokens[9];
         final IntWritable one = new IntWritable(1);
         output.collect(new Text(strBrowser), one);     /*emit (browser, 1)*/
         output.collect(new Text(strIPAddress), one);   /*emit (ip address, 1)*/
      }
   }
}
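The splitting step at the heart of the mapper can be tried standalone, outside Hadoop. The sample log line below is an illustrative W3C-extended-style entry (not taken from the article); the token indices, 3 for the IP address and 9 for the browser, mirror the mapper above and would need adjusting for a different log layout.

```java
// Standalone sketch of the tokenizing step used in LogMapper.
// The sample line and field positions are illustrative assumptions.
public class LogLineSplitter {
    // Returns {ipAddress, browser} from a space-delimited log line,
    // using the same indices (3 and 9) as the mapper above.
    public static String[] extract(String line) {
        String[] tokens = line.split(" ");
        return new String[] { tokens[3], tokens[9] };
    }

    public static void main(String[] args) {
        String sample =
            "2012-05-01 00:00:01 W3SVC1 192.168.0.10 GET /index.html - 80 - Mozilla/4.0 200";
        String[] fields = extract(sample);
        System.out.println("IP: " + fields[0] + ", browser: " + fields[1]);
    }
}
```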
