Hadoop Hive Configuration
The sections that follow explain how to configure Hive for weblog analytics.
Requirements
Installing Hadoop Hive from a Stable Release
First, download the latest stable release of Hive from one of the Apache download mirrors.
Next, unpack the tarball, which will create a subdirectory named hive-x.y.z:
$ tar -xzvf hive-x.y.z.tar.gz
Point the environment variable HIVE_HOME to the installation directory:
$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH
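To confirm the installation is wired up, you can check that the shell resolves the Hive launcher from the updated PATH. This is a sketch; note that the --version flag may not be available on very old releases:

```shell
# Confirm the hive launcher is found on the PATH
$ which hive
# Print the installed Hive version (flag availability varies by release)
$ hive --version
```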
Running Hadoop Hive
Because Hive uses Hadoop, you must either:
- have Hadoop in your path, OR
- export HADOOP_HOME=<hadoop-install-dir>
In addition, you must create /tmp and /user/hive/warehouse (a.k.a. hive.metastore.warehouse.dir) in HDFS and set them chmod g+w before a table can be created in Hive. The commands to perform this setup are as follows:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
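As a quick sanity check after running the commands above, you can list the directories and confirm the group-write bit took effect (a sketch; permissions should show something like drwxrwxr-x):

```shell
# Verify /tmp and the warehouse directory exist with group write
$ $HADOOP_HOME/bin/hadoop fs -ls /
$ $HADOOP_HOME/bin/hadoop fs -ls /user/hive
```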
I also find it useful, but not necessary, to set HIVE_HOME as follows:
$ export HIVE_HOME=<hive-install-dir>
To use the Hive command line interface (CLI) from the shell:
$ $HIVE_HOME/bin/hive
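Once the CLI is available, you can also run a single HiveQL statement non-interactively with the -e flag, which is handy for a quick smoke test:

```shell
# Pass a query string directly to the CLI and exit
$ $HIVE_HOME/bin/hive -e 'SHOW TABLES;'
```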
Hive Runtime Configuration
Hive, MapReduce and Local-Mode
The Hive compiler generates MapReduce jobs for most queries. These jobs are then submitted to the MapReduce cluster indicated by the variable:
mapred.job.tracker
This usually points to a MapReduce cluster with multiple nodes, but Hadoop also provides an option to run MapReduce jobs locally on the user's PC. This can be very useful for running queries over small data sets because in such cases local mode execution is usually much faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode runs with only one reducer and can be very slow when processing larger data sets.
Starting with version 0.7, Hive fully supports local mode execution, which you can activate with the following setting:
hive> SET mapred.job.tracker=local;
In addition, mapred.local.dir should point to a path that is valid on the local machine (for example, /tmp/<username>/mapred/local). Otherwise, the user will get an exception about allocating local disk space.
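Putting the two settings together, you can apply both for a session when launching the CLI (a sketch; the local directory path is illustrative, with $USER standing in for your username):

```shell
# Launch the CLI with local-mode execution and a valid local scratch dir
$ $HIVE_HOME/bin/hive -hiveconf mapred.job.tracker=local \
    -hiveconf mapred.local.dir=/tmp/$USER/mapred/local
```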
Starting with version 0.7, Hive also supports a mode to run MapReduce jobs in local mode automatically. The relevant option is:
hive> SET hive.exec.mode.local.auto=false;
Note that this feature is disabled by default.
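To turn the automatic behavior on, set the option to true; Hive then decides per query whether the input is small enough to run locally. A minimal sketch, setting it for the whole session at launch:

```shell
# Let Hive choose local-mode execution automatically for small inputs
$ $HIVE_HOME/bin/hive -hiveconf hive.exec.mode.local.auto=true
```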