Hadoop for Data Analytics: Implementing a Weblog Parser, Page 2

How does Hadoop handle big data? Really well, actually. Find out exactly how by implementing a Hadoop-based log parser.

Parsing the Apache Weblogs

Parsing the incoming Apache logs from the Web server can yield useful information such as how many times the site has been accessed, from which locations (based on the IP address), and how frequently different browsers were used. The output data can then be loaded into a database for further analysis. This entire process (see Figure 1) is implemented with Hadoop MapReduce.


Figure 1. Flow of the MapReduce Program



Here are the steps to implement the parsing program presented in Figure 1.

1. Load the Apache weblog files into the Hadoop DFS:

you@your-machine:hadoop$ bin/hadoop dfs -put /dirpathlocalsystem/webLogFile.txt /user/yourUserName/hadoop/dfsdata/weblogfilename.log
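
You can confirm that the file landed in the DFS with a directory listing of the same path:

you@your-machine:hadoop$ bin/hadoop dfs -ls /user/yourUserName/hadoop/dfsdata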

2. Create the weblog parser class. Create a class that will include the map/reduce implementation of the log parser, for example, ApacheWebLogParser.
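
Before filling in the pieces from the following steps, it helps to see the outer shell of the class. The sketch below is an assumption about how the parts fit together; it uses the classic org.apache.hadoop.mapred API that the later snippets rely on, so adjust the imports to your Hadoop version.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class ApacheWebLogParser {
    /* The connection helpers (step 3), the ReduceRecord class (step 4)
       and the LogMapper class (step 5) all go inside this class. */
}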

3. Create the initialize method. Create a method that obtains a connection handle for the database being used, and call this method when the class is initialized.

private Connection connection;
private boolean initialized = false;

/* Open a JDBC connection using the supplied driver class and database URL. */
private void createConnection(String driverClassName, String url) throws Exception {
    Class.forName(driverClassName);
    connection = DriverManager.getConnection(url);
    connection.setAutoCommit(false);
}

/* Create the connection once, on first use. */
private void initialize(String driverClassName, String url) throws Exception {
    if (!this.initialized) {
        createConnection(driverClassName, url);
        this.initialized = true;
    }
}

4. Create a class that will map the output of the reducer to the database. This class implements the DBWritable interface, which ensures that its objects can be written to the database. Below is the code for the reduce record.

static class ReduceRecord implements Writable, DBWritable {
    String browser;        /* browser name */
    long b_frequency;      /* frequency at which different browsers are being used */
    String ip_address;     /* IP address */
    long ip_frequency;     /* frequency of requests coming from different regions */

    public ReduceRecord(String browser, long b_frequency, String ip_address, long ip_frequency) {
        /* create a database table with the fields mentioned above */
        this.browser = browser;
        this.b_frequency = b_frequency;
        this.ip_address = ip_address;
        this.ip_frequency = ip_frequency;
    }

    public void readFields(DataInput in) throws IOException {}

    public void write(DataOutput out) throws IOException {}

    public void readFields(ResultSet resultSet) throws SQLException {}

    /* Bind the record's fields to the prepared INSERT statement used by DBOutputFormat. */
    public void write(PreparedStatement statement) throws SQLException {
        statement.setString(1, browser);
        statement.setLong(2, b_frequency);
        statement.setString(3, ip_address);
        statement.setLong(4, ip_frequency);
    }
}
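
The constructor comment refers to a database table whose columns mirror these four fields. A minimal sketch of creating such a table through the JDBC connection from step 3 might look like the following; the table name weblog_stats and the column sizes are assumptions for illustration, so adjust them to your own schema.

/* Hypothetical table; the name and column sizes are assumptions. */
java.sql.Statement stmt = connection.createStatement();
stmt.executeUpdate(
    "CREATE TABLE weblog_stats (" +
    " browser VARCHAR(255)," +
    " b_frequency BIGINT," +
    " ip_address VARCHAR(64)," +
    " ip_frequency BIGINT)");
stmt.close();
connection.commit();   /* auto-commit was switched off in createConnection() */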

5. Create the mapper class. Inside the ApacheWebLogParser class, create a mapper class and write the map method, which contains the log-parsing logic.

The mapper program reads the Apache weblog files, parses the logs line by line, and extracts the IP address and browser information from each entry. The fields in each log line are separated by spaces, so every line is split on that delimiter, and the required fields are collected and passed to the reducer program. Below is the code for the mapper class.

static class LogMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        if (!line.startsWith("#")) {           /* skip comment lines */
            String strTokens[] = line.split(" ");
            String strIPAddress = strTokens[3];
            String strBrowser = strTokens[9];
            final IntWritable one = new IntWritable(1);
            /* emit one count for the browser and one for the IP address */
            output.collect(new Text(strBrowser), one);
            output.collect(new Text(strIPAddress), one);
        }
    }
}
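
The map output still has to be aggregated and written to the database as ReduceRecord rows, and that code is not shown on this page. The sketch below is one possible shape, not the article's own implementation: a hypothetical LogReducer that tells IP addresses and browser names apart by pattern, and a driver that points the job output at the assumed weblog_stats table through Hadoop's DBConfiguration and DBOutputFormat classes (this fragment also needs imports for java.util.Iterator, java.util.regex.Pattern, NullWritable, Path, and the org.apache.hadoop.mapred job classes).

/* Hypothetical reducer: sums the counts for each key and emits one ReduceRecord
   per key. A key that looks like an IPv4 address fills the IP columns; anything
   else is treated as a browser name. */
static class LogReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, ReduceRecord, NullWritable> {

    private static final Pattern IP_PATTERN = Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<ReduceRecord, NullWritable> output, Reporter reporter)
            throws IOException {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        ReduceRecord record = IP_PATTERN.matcher(key.toString()).matches()
                ? new ReduceRecord(null, 0, key.toString(), sum)    /* IP row */
                : new ReduceRecord(key.toString(), sum, null, 0);   /* browser row */
        output.collect(record, NullWritable.get());
    }
}

/* Hypothetical driver: wires the mapper and reducer together and sends the job
   output to the assumed weblog_stats table via DBOutputFormat. */
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ApacheWebLogParser.class);
    conf.setJobName("apache-weblog-parser");

    conf.setMapperClass(LogMapper.class);
    conf.setReducerClass(LogReducer.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    conf.setOutputKeyClass(ReduceRecord.class);
    conf.setOutputValueClass(NullWritable.class);

    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf,
            new Path("/user/yourUserName/hadoop/dfsdata/weblogfilename.log"));

    /* Driver class, connection URL, table and column names are all assumptions. */
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
            "jdbc:mysql://localhost/weblogdb", "dbuser", "dbpassword");
    conf.setOutputFormat(DBOutputFormat.class);
    DBOutputFormat.setOutput(conf, "weblog_stats",
            "browser", "b_frequency", "ip_address", "ip_frequency");

    JobClient.runJob(conf);
}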


