

Hadoop for Data Analytics: Implementing a Weblog Parser : Page 3

How does Hadoop handle big data? Really well, actually. Find out exactly how by implementing a Hadoop-based log parser.





7. Write the data to the database. The ApacheWebLogParser class implements the Tool interface, so it must implement that interface's run method; ToolRunner invokes this method to start the flow of the complete program. The method first gets the database details and creates the database connection from the database configuration. It also configures DBOutputFormat with the table and the field names into which the data will be written. It then creates the job with the input and output details of the map and reduce classes. Below is the code for the run() method.

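public int run(String[] args) throws Exception {
   /* driver class of the database driver that is being used */
   String driverClassName = DRIVER_CLASS;
   /* URL of the installed database */
   String url = DB_URL;

   initialize(driverClassName, url);

   JobConf job = new JobConf(getConf(), ApacheWebLogParser.class);

   DBConfiguration.configureDB(job, driverClassName, url);
   String[] logFieldNames = {"browser", "b_frequency", "ip_address", "ip_frequency"};
   /* apachelog_data is the table name in the database */
   DBOutputFormat.setOutput(job, "apachelog_data", logFieldNames);

   /* the mapper, reducer, and key/value classes from the earlier steps
      are set on the job at this point */

   /* Optional flags are followed by an integer; every other argument is a path.
      The branch bodies are assumed from the standard Hadoop example parser;
      only the "-r" test and the catch clauses are given in this listing. */
   List<String> other_args = new ArrayList<String>();
   for (int i = 0; i < args.length; ++i) {
      try {
         if ("-m".equals(args[i])) {
            job.setNumMapTasks(Integer.parseInt(args[++i]));
         } else if ("-r".equals(args[i])) {
            job.setNumReduceTasks(Integer.parseInt(args[++i]));
         } else {
            other_args.add(args[i]);
         }
      } catch (NumberFormatException except) {
         System.out.println("ERROR: Integer expected instead of " + args[i]);
         return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
         System.out.println("ERROR: Required parameter missing from " + args[i-1]);
         return printUsage();
      }
   }
   if (other_args.size() != 2) {
      System.out.println("ERROR: Wrong number of parameters: " + other_args.size() + " instead of 2.");
      return printUsage();
   }
   FileInputFormat.setInputPaths(job, other_args.get(0));

   /* assumed: submit the job and wait for it to finish */
   JobClient.runJob(job);
   return 0;
}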
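
The argument handling in run() can be tried out without a cluster. The following is a minimal sketch in plain Java, assuming the optional flags are -m and -r, each followed by an integer, and that every other argument is a path. The class ArgSketch, its Parsed holder, and the parse method are hypothetical names used for illustration; they are not part of the article's parser or of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

class ArgSketch {
    /** Hypothetical holder for the parsed values. */
    static final class Parsed {
        int mapTasks = -1;      // -1 means the flag was not given
        int reduceTasks = -1;   // -1 means the flag was not given
        final List<String> paths = new ArrayList<>();
    }

    /** Returns the parsed arguments, or null when they are malformed. */
    static Parsed parse(String[] args) {
        Parsed parsed = new Parsed();
        for (int i = 0; i < args.length; ++i) {
            try {
                if ("-m".equals(args[i])) {
                    parsed.mapTasks = Integer.parseInt(args[++i]);
                } else if ("-r".equals(args[i])) {
                    parsed.reduceTasks = Integer.parseInt(args[++i]);
                } else {
                    parsed.paths.add(args[i]);
                }
            } catch (NumberFormatException e) {
                // args[i] is the value that failed to parse
                System.out.println("ERROR: Integer expected instead of " + args[i]);
                return null;
            } catch (ArrayIndexOutOfBoundsException e) {
                // the flag at args[i - 1] was the last argument
                System.out.println("ERROR: Required parameter missing from " + args[i - 1]);
                return null;
            }
        }
        if (parsed.paths.size() != 2) {
            System.out.println("ERROR: Wrong number of parameters: "
                    + parsed.paths.size() + " instead of 2.");
            return null;
        }
        return parsed;
    }

    public static void main(String[] args) {
        Parsed parsed = parse(new String[] {
            "-r", "4", "dfspath/weblog.log", "dfspath/reduceoutput"});
        System.out.println(parsed.reduceTasks + " reducers, input " + parsed.paths.get(0));
    }
}
```

Returning null on bad input keeps the sketch short; the run() method above returns printUsage() in the same places.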

8. Run the parser. Package the ApacheWebLogParser program as a JAR (for example, LogParser.jar), store it on the Hadoop cluster nodes, and run it with the following command:

you@your-machine:~/hadoop$ bin/hadoop jar path/LogParser.jar package.ApacheWebLogParser dfspath/weblog.log dfspath/reduceoutput

Note: The path, package, and dfspath portions of the command are specific to the installed location.

9. Analyze the data from the database. Once the data is in the database, it can be used for further analysis. This is the most critical part of the whole process: because the final objective is to extract the desired information from large volumes of raw data, the analysis techniques and tools applied at this stage are of the utmost importance.

For example, the targeted information at this stage of the implementation is:

  • Data aggregated by country: all the IP addresses are fetched so that their locations and their individual frequency of hitting the website can be determined
  • The information related to the kind of browsers being used and the frequency of their use

After this data is collected, charting tools can present the results graphically for better understanding. In this case, Google Charts can be used to generate pie and bar charts for the collected data (see Figures 2-4).
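
Before charting, frequency counts usually need to be turned into shares of the total. The following is a minimal sketch, assuming the browser counts have already been read from the apachelog_data table into a map. The class ShareSketch, the toShares method, and the sample counts in main are hypothetical and only illustrate the arithmetic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ShareSketch {
    /** Converts raw frequencies into percentage shares, keeping the input order. */
    static Map<String, Double> toShares(Map<String, Long> frequencies) {
        long total = 0;
        for (long frequency : frequencies.values()) {
            total += frequency;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : frequencies.entrySet()) {
            // guard against division by zero when every count is 0
            double share = total == 0 ? 0.0 : 100.0 * entry.getValue() / total;
            shares.put(entry.getKey(), share);
        }
        return shares;
    }

    public static void main(String[] args) {
        // made-up sample counts, not results from the article
        Map<String, Long> browsers = new LinkedHashMap<>();
        browsers.put("Firefox", 50L);
        browsers.put("Chrome", 30L);
        browsers.put("Other", 20L);
        toShares(browsers).forEach((name, share) ->
                System.out.println(name + ": " + share + "%"));
    }
}
```

The resulting map can be handed to whichever charting tool is in use as the slice sizes of a pie chart.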


When handling huge amounts of data, ordinary serial processing is slow and inefficient. Hadoop, on the other hand, can process this data in parallel by:

  • Splitting the data into smaller chunks and distributing it over Hadoop cluster nodes
  • Distributing the business-processing logic over the Hadoop cluster nodes
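
The two points above can be illustrated on a single machine. The following is a minimal sketch, assuming a thread pool stands in for the cluster nodes and that each log line starts with the client IP address, as in Apache's common log format. ChunkSketch and its methods are hypothetical names, and the sample lines are made up; a real Hadoop job distributes the chunks across machines rather than threads.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkSketch {
    /** Processing logic applied to one chunk: count hits per IP address. */
    static Map<String, Integer> countChunk(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String ip = trimmed.split("\\s+")[0];
            counts.merge(ip, 1, Integer::sum);
        }
        return counts;
    }

    /** Splits the lines into chunks, counts each chunk on a worker, merges the results. */
    static Map<String, Integer> countInParallel(List<String> lines, int chunkSize, int workers) {
        if (chunkSize <= 0 || workers <= 0) {
            throw new IllegalArgumentException("chunkSize and workers must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Integer>>> parts = new ArrayList<>();
            for (int start = 0; start < lines.size(); start += chunkSize) {
                List<String> chunk =
                        lines.subList(start, Math.min(start + chunkSize, lines.size()));
                Callable<Map<String, Integer>> task = () -> countChunk(chunk);
                parts.add(pool.submit(task));
            }
            Map<String, Integer> total = new TreeMap<>();
            for (Future<Map<String, Integer>> part : parts) {
                part.get().forEach((ip, n) -> total.merge(ip, n, Integer::sum));
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("chunk processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // made-up log lines for illustration
        List<String> lines = List.of(
                "10.0.0.1 - - [01/Jan/2012:10:00:00 +0000] \"GET / HTTP/1.1\" 200 512",
                "10.0.0.2 - - [01/Jan/2012:10:00:01 +0000] \"GET /a HTTP/1.1\" 200 128",
                "10.0.0.1 - - [01/Jan/2012:10:00:02 +0000] \"GET /b HTTP/1.1\" 404 0");
        System.out.println(countInParallel(lines, 2, 2));
    }
}
```

Each chunk is counted independently, which is what lets the work spread across workers; only the final merge needs to see all partial results.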

Hadoop DFS is also scalable because new nodes can be added to the cluster.

Hadoop also frees the program from hand-written JDBC calls. The map output, after being reduced by the reducer program, is written to the database through DBOutputFormat, so the program does not need to make a JDBC call of its own for every record.

Overall, the parallel processing of data, the scalability, and the bypassing of JDBC overhead result in good performance.

Ira Agrawal works as a Technical Manager with Infosys Labs, where she has worked on different aspects of distributed computing including various middleware and products based on DSM, SOA and virtualization technologies.