Hive and Hadoop for Data Analytics on Large Web Logs: Page 2



Hadoop Hive Configuration

The sections to follow explain how to configure Hive for weblog analytics.

Requirements

  • Java 1.6
  • Hadoop 0.20.x.

Installing Hadoop Hive from a Stable Release

First, download the latest stable release of Hive from one of the Apache download mirrors.



Next, unpack the tarball, which will create a subdirectory named hive-x.y.z:

$ tar -xzvf hive-x.y.z.tar.gz

Point the environment variable HIVE_HOME to the installation directory:

$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH:

$ export PATH=$HIVE_HOME/bin:$PATH
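The environment setup above can be combined into one shell snippet. This is a sketch that assumes the tarball has already been unpacked into a directory named hive-x.y.z (the placeholder version used throughout this article) under the current working directory:

```shell
# Sketch of the environment setup, assuming ./hive-x.y.z already exists.
HIVE_HOME="$PWD/hive-x.y.z"          # point HIVE_HOME at the unpacked directory
export HIVE_HOME
export PATH="$HIVE_HOME/bin:$PATH"   # put the hive launcher on the PATH

# Quick sanity check that the variables are wired up as expected.
echo "HIVE_HOME=$HIVE_HOME"
```

Putting these lines in your shell profile (e.g. ~/.bashrc) saves repeating them in every session.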

Running Hadoop Hive

Because Hive uses Hadoop, one of the following is required:

  • Hadoop must be in your PATH, or
  • export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse (also known as hive.metastore.warehouse.dir) in HDFS and make them group-writable (chmod g+w) before you can create a table in Hive.

Commands to perform this setup are as follows:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

I also find it useful, but not necessary, to set HIVE_HOME as follows:

$ export HIVE_HOME=<hive-install-dir>

To start the Hive command-line interface (CLI) from the shell:

$ $HIVE_HOME/bin/hive
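Once the CLI starts, a short session can confirm that the warehouse directory and metastore are working. The table name and columns below are purely illustrative placeholders for weblog data, not anything defined earlier in this article:

```sql
-- Hypothetical weblog table; the schema is an illustrative sketch only.
hive> CREATE TABLE weblogs (
          ip STRING,
          request_time STRING,
          url STRING,
          status INT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t';
hive> SHOW TABLES;
hive> DROP TABLE weblogs;   -- clean up the test table
```

If the CREATE TABLE fails with a permissions error, revisit the chmod g+w step on /user/hive/warehouse above.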

Hive Runtime Configuration

  • Hive queries are executed as MapReduce jobs, so their behavior can be controlled through Hadoop configuration variables.
  • The CLI command SET can be used to set any Hadoop (or Hive) configuration variable. For example:

    hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
    hive> SET -v;

Hive, MapReduce and Local-Mode

The Hive compiler generates MapReduce jobs for most queries. These jobs are then submitted to the MapReduce cluster indicated by this variable:

mapred.job.tracker

This usually points to a MapReduce cluster with multiple nodes, but Hadoop also provides an option to run MapReduce jobs locally on the user's PC. This can be very useful for running queries over small data sets because in such cases local mode execution is usually much faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode runs with only one reducer and can be very slow when processing larger data sets.

Starting with version 0.7, Hive fully supports local mode execution, which you can activate by enabling the following option:

hive> SET mapred.job.tracker=local;

In addition, mapred.local.dir should point to a path that is valid on the local machine (for example, /tmp/<username>/mapred/local); otherwise, the user will get an exception when Hive tries to allocate local disk space.

Starting with version 0.7, Hive also supports running MapReduce jobs in local mode automatically. The controlling option, which is disabled by default, is:

hive> SET hive.exec.mode.local.auto=false;
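The setting above merely shows the default (off). To turn automatic local-mode selection on, set the flag to true. The two threshold variables below also govern when a job qualifies for local execution, though the specific values shown here are illustrative, not recommendations:

```sql
hive> SET hive.exec.mode.local.auto=true;
-- Run a job locally only when its total input is small enough (bytes):
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
-- ...and when it needs no more than this many tasks:
hive> SET hive.exec.mode.local.auto.tasks.max=4;
```

With these set, Hive picks local or cluster execution per query, which is convenient when small exploratory queries are mixed with large batch jobs.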


