

Hive and Hadoop for Data Analytics on Large Web Logs: Page 2

Developers can use Apache Hive and Hadoop for data analytics on large web logs, analyzing users' browsing patterns and behavior.


Hadoop Hive Configuration

The sections to follow explain how to configure Hive for weblog analytics. The prerequisites are:

  • Java 1.6
  • Hadoop 0.20.x

Installing Hadoop Hive from a Stable Release

First, download the latest stable release of Hive from one of the Apache download mirrors.

Next, unpack the tarball, which will create a subdirectory named hive-x.y.z:

 $ tar -xzvf hive-x.y.z.tar.gz

Point the environment variable HIVE_HOME to the installation directory:

 $ cd hive-x.y.z
 $ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH:

       $ export PATH=$HIVE_HOME/bin:$PATH
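If you want these settings to persist across shell sessions, the two exports can go in your shell profile. A minimal sketch (the install path under $HOME is illustrative; use wherever you actually unpacked the tarball):

```shell
# Illustrative ~/.bashrc fragment; replace hive-x.y.z with your
# actual unpacked Hive directory.
export HIVE_HOME="$HOME/hive-x.y.z"
export PATH="$HIVE_HOME/bin:$PATH"
```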

Running Hadoop Hive

Because Hive uses Hadoop, you must do one of the following:

  • have the hadoop executable in your PATH, or
  • export HADOOP_HOME=<hadoop-install-dir>

In addition, before you can create a table in Hive, you must create the HDFS directories /tmp and /user/hive/warehouse (the default value of hive.metastore.warehouse.dir) and make them group-writable with chmod g+w.

Commands to perform this setup are as follows:

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
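The four commands above can be collected into a small function so the setup is repeatable. A sketch, assuming HADOOP_HOME points at a working Hadoop install with HDFS running (the function name hive_hdfs_setup is illustrative):

```shell
# One-shot version of the HDFS setup for Hive (a sketch).
# "|| true" keeps the script going if a directory already exists.
hive_hdfs_setup() {
  "$HADOOP_HOME/bin/hadoop" fs -mkdir /tmp || true
  "$HADOOP_HOME/bin/hadoop" fs -mkdir /user/hive/warehouse || true
  "$HADOOP_HOME/bin/hadoop" fs -chmod g+w /tmp
  "$HADOOP_HOME/bin/hadoop" fs -chmod g+w /user/hive/warehouse
}
```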

I also find it useful, but not necessary, to set HIVE_HOME as follows:

$ export HIVE_HOME=<hive-install-dir>

To use the Hive command line interface (CLI) from the shell:

$ $HIVE_HOME/bin/hive
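Once the CLI starts, you can sanity-check the install by creating a toy table before loading real web logs. A sketch of such a session (the weblog table name and its columns are illustrative, not part of the article):

```
hive> CREATE TABLE weblog (ip STRING, request_time STRING, url STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> SHOW TABLES;
hive> DESCRIBE weblog;
```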

Hive Runtime Configuration

  • Hive queries are executed as MapReduce jobs, so their behavior can be controlled through the Hadoop configuration variables.
  • The CLI command SET can be used to set any Hadoop (or Hive) configuration variable, and SET -v lists all current settings. For example:
    hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
    hive> SET -v;

Hive, MapReduce and Local-Mode

The Hive compiler generates MapReduce jobs for most queries. These jobs are then submitted to the MapReduce cluster indicated by the mapred.job.tracker variable.
This usually points to a MapReduce cluster with multiple nodes, but Hadoop also provides an option to run MapReduce jobs locally on the user's PC. This can be very useful for running queries over small data sets because in such cases local mode execution is usually much faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode runs with only one reducer and can be very slow when processing larger data sets.

Starting with version 0.7, Hive fully supports local mode execution, which you can activate by enabling the following option:

hive> SET mapred.job.tracker=local;

In addition, mapred.local.dir should point to a path that is valid on the local machine (for example, /tmp/<username>/mapred/local); otherwise, the user will get an exception when Hive tries to allocate local disk space.

Starting with version 0.7, Hive can also decide automatically whether to run MapReduce jobs in local mode. This behavior is disabled (false) by default; to enable it, set:

 hive> SET hive.exec.mode.local.auto=true;
