
Exploring Various Hadoop Installation Modes

This tutorial provides a simple explanation of the three different installation modes and the reasons for the use of each mode.



Overview

Apache Hadoop can be installed in different modes depending on the requirement. The mode is selected through configuration, and by default Hadoop runs in standalone mode. The other modes are pseudo distributed mode and distributed mode. The purpose of this tutorial is to explain the installation modes in a simple way so that readers can follow along and set up their own environment.

In this article, I will discuss different installation modes and their details.

Introduction



Apache Hadoop is an open source framework that allows distributed processing of large data sets across clusters of computers using simple programming models. Hadoop can scale from a single server up to thousands of machines, so choosing the right installation mode is an important first decision. We can install Hadoop in three different modes:

  • Standalone mode - Single Node Cluster
  • Pseudo distributed mode - Single Node Cluster
  • Distributed mode - Multi Node Cluster

Purpose for Different Installation Modes

When Apache Hadoop is used in a production environment, multiple server nodes are used for distributed computing. But for understanding the basics and playing around with Hadoop, single node installation is sufficient. There is another mode known as 'pseudo distributed' mode. This mode is used to simulate the multi node environment on a single server.

In this document we will discuss how to install Hadoop on Ubuntu Linux. In every mode, the system must have Java version 1.6.x installed.

Standalone Mode Installation

Now, let us check the standalone mode installation process by following the steps mentioned below.

Install Java

Java (JDK version 1.6.x), either from Sun/Oracle or OpenJDK, is required.

  • Step 1 - If you cannot use OpenJDK and need the proprietary Sun JDK/JRE, install sun-java6 from the Canonical Partner Repository.

    Note: The Canonical Partner Repository contains closed source third-party software that is free of cost. Canonical does not have access to the source code; it only packages and tests the software.

    Add the Canonical partner repository to the apt sources using -

    $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"

  • Step 2 - Update the source list.

    $ sudo apt-get update

  • Step 3 - Install JDK version 1.6.x from Sun/Oracle.

    $ sudo apt-get install sun-java6-jdk

  • Step 4 - Once the JDK installation is over, make sure that it is set up correctly by checking the reported version -

    user@ubuntu:~# java -version
    java version "1.6.0_45"
    Java(TM) SE Runtime Environment (build 1.6.0_45-b02)
    Java HotSpot(TM) Client VM (build 16.4-b01, mixed mode, sharing)
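The version check in Step 4 can also be scripted. The following is a minimal sketch, assuming the version string has the form shown above; the function names java_major_minor and is_java_16 are made up for this illustration.

```shell
#!/bin/sh
# Extract the "major.minor" part of the version string printed by `java -version`.
# The text is read from stdin, so the function works on any captured output.
java_major_minor() {
    sed -n 's/.*version "\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeeds only when the reported version is 1.6.x, the version this article assumes.
is_java_16() {
    [ "$(java_major_minor)" = "1.6" ]
}

# Usage (java -version writes to stderr, hence 2>&1):
#   java -version 2>&1 | is_java_16 && echo "JDK 1.6 found"
```

Reading from stdin keeps the check independent of where the JDK is installed.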

Add Hadoop User

  • Step 5 - Add a dedicated Hadoop Unix user to your system, as shown below, to isolate this installation from other software -

    $ sudo adduser hadoop_admin

Download the Hadoop binary and install

  • Step 6 - Download Apache Hadoop from the Apache web site. Hadoop is distributed as a gzipped tar archive (.tar.gz). Copy this archive into the /usr/local/installables folder; the installables folder must be created under /usr/local before this step. Now run the following commands -

    $ cd /usr/local/installables
    $ sudo tar xzf hadoop-0.20.2.tar.gz
    $ sudo chown -R hadoop_admin /usr/local/installables/hadoop-0.20.2
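Step 6 requires the target folder to exist before the archive is unpacked. A small helper can combine both actions; this is only a sketch, and the function name unpack_hadoop and its two arguments are assumptions made for illustration.

```shell
#!/bin/sh
# Unpack a gzipped tar archive into a target directory, creating the directory first.
# $1 = path to the archive, $2 = destination directory.
unpack_hadoop() {
    tarball=$1
    dest=$2
    [ -f "$tarball" ] || { echo "missing archive: $tarball" >&2; return 1; }
    mkdir -p "$dest" && tar xzf "$tarball" -C "$dest"
}

# Usage with the names from this article (run with sufficient privileges):
#   unpack_hadoop hadoop-0.20.2.tar.gz /usr/local/installables
```

Using tar's -C option avoids having to change into the destination directory first.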

Define env variable - JAVA_HOME

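  • Step 7 - Open the Hadoop configuration file hadoop-env.sh, located at /usr/local/installables/hadoop-0.20.2/conf/hadoop-env.sh, and define JAVA_HOME as shown below -

    export JAVA_HOME=path/where/jdk/is/installed

    (JAVA_HOME must point to the JDK installation directory, not to the java executable; with the sun-java6-jdk package on Ubuntu this is typically /usr/lib/jvm/java-6-sun.)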
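If the JDK directory is not known, it can usually be derived from the resolved location of the java executable, because JAVA_HOME is the directory that contains bin/java (or jre/bin/java). The sketch below does only the path manipulation; the function name java_home_from_binary is hypothetical, and the example paths are illustrative.

```shell
#!/bin/sh
# Derive a JAVA_HOME candidate from the fully resolved path of the java binary
# by dropping the trailing /bin/java and, if present, a trailing /jre.
java_home_from_binary() {
    home=${1%/bin/java}
    home=${home%/jre}
    printf '%s\n' "$home"
}

# Usage (readlink -f follows the symlinks behind /usr/bin/java on Ubuntu):
#   java_home_from_binary "$(readlink -f "$(command -v java)")"
```

The printed directory is the value to place after export JAVA_HOME= in hadoop-env.sh.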

Installation in Single mode

  • Step 8 - Now go to the HADOOP_HOME directory (location where HADOOP is extracted) and run the following command –

    $ bin/hadoop

    The following output will be displayed –

    Usage: hadoop [--config confdir] COMMAND

    Some of the COMMAND options are mentioned below. There are other options available and can be checked using the command mentioned above.

    namenode -format     format the DFS filesystem
    secondarynamenode    run the DFS secondary namenode
    namenode             run the DFS namenode
    datanode             run a DFS datanode
    dfsadmin             run a DFS admin client
    mradmin              run a Map-Reduce admin client
    fsck                 run a DFS filesystem checking utility

The above output indicates that Standalone installation is completed successfully. Now you can run the sample examples of your choice by calling –

$ bin/hadoop jar hadoop-*-examples.jar <NAME> <PARAMS>
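As one concrete instance of the command above, the examples jar includes a grep job that searches input files for a regular expression. The sketch below wraps it in a function; the function name run_grep_example and its argument are assumptions, while the input and output directory names follow the usual Hadoop quick start convention.

```shell
#!/bin/sh
# Run the bundled "grep" example in standalone mode.
# $1 = directory where Hadoop was extracted. The body runs in a subshell,
# so the caller's working directory is left unchanged.
run_grep_example() (
    hadoop_home=$1
    [ -x "$hadoop_home/bin/hadoop" ] || { echo "bin/hadoop not found in $hadoop_home" >&2; exit 1; }
    cd "$hadoop_home" || exit 1
    # Use the shipped configuration files as sample input.
    # The output directory must not exist before the job starts.
    mkdir -p input && cp conf/*.xml input &&
        bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' &&
        cat output/*
)

# Usage: run_grep_example /usr/local/installables/hadoop-0.20.2
```

In standalone mode the job reads and writes the local filesystem, so no daemons need to be running.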

Pseudo Distributed Mode Installation

This is a simulated multi node environment based on a single node server.

Here, the first step is to configure SSH, which Hadoop uses to access and manage the different nodes. SSH access to every node is mandatory, and in this mode that means access to localhost. Once SSH is configured, enabled, and accessible, we can start configuring Hadoop. The following configuration files need to be modified:

  • conf/core-site.xml
  • conf/hdfs-site.xml
  • conf/mapred-site.xml

Open each of the configuration files in the vi editor and update the configuration.

Configure core-site.xml file:

$ vi conf/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
</configuration>

Configure hdfs-site.xml file:

$ vi conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Configure mapred-site.xml file:

$ vi conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
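Instead of editing each file by hand, the same settings can be written by a script. The following is a sketch under the assumption that the property values above are used unchanged; the function name write_pseudo_conf is hypothetical, and the MapReduce file is written as mapred-site.xml, which is the name Hadoop 0.20.x reads.

```shell
#!/bin/sh
# Write the three pseudo distributed configuration files into a conf directory.
# $1 = conf directory. The quoted 'EOF' keeps ${user.name} literal, so the
# shell does not try to expand it.
write_pseudo_conf() {
    conf_dir=$1
    mkdir -p "$conf_dir" || return 1

    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
}

# Usage, from the directory where Hadoop was extracted:
#   write_pseudo_conf conf
```

Note that the function overwrites existing files, so any other settings in them should be backed up first.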

Once these changes are done, we need to format the name node by using the following command. The command prompt will show a series of messages and finally a success message.

$ bin/hadoop namenode -format

Our setup for pseudo distributed mode is done. Let's now start the single node cluster by using the following command. It will again show a set of messages on the command prompt and start the server processes.

$ bin/start-all.sh

Now we should check the status of Hadoop process by executing the jps command as shown below. It will show all the running processes.

$ jps
14799 NameNode
14977 SecondaryNameNode
15183 DataNode
15596 JobTracker
15897 TaskTracker
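Checking the jps listing by eye is easy to get wrong, since NameNode also appears inside SecondaryNameNode. The sketch below reports which of the five daemons listed above are absent; the function name missing_daemons is made up for this example.

```shell
#!/bin/sh
# Print the names of expected daemons that do not appear in jps output (stdin).
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare against the whole second field, so that
        # "SecondaryNameNode" is not mistaken for "NameNode".
        printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            printf '%s\n' "$daemon"
    done
}

# Usage: jps | missing_daemons
```

An empty result means all five daemons of the single node cluster are running.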

Stopping the Single node Cluster: We can stop the single node cluster by using the following command. The command prompt will display all the stopping processes.

$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Distributed Mode Installation

Before we start the distributed mode installation, we must ensure that we have the pseudo distributed setup done and we have at least two machines, one acting as master and the other acting as a slave. Now we run the following commands in sequence.

  • $ bin/stop-all.sh - Make sure none of the nodes are running.
  • Open the /etc/hosts file and add the following entries for master and slave
    <IP ADDRESS> master
    <IP ADDRESS> slave
  • $ ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave - This command should be executed on the master to enable passwordless SSH to the slave. We should log in using the same username on all the machines. If a password is needed, it can be set manually.
  • Now we open the two files conf/masters and conf/slaves. The conf/masters file defines the machines on which Hadoop starts the secondary NameNode in our multi node cluster. The conf/slaves file lists the hosts where the Hadoop slave daemons (DataNode and TaskTracker) will run.
  • Edit the conf/core-site.xml file to have the following entries -

    <property>
      <name>fs.default.name</name>
      <value>hdfs://master:54310</value>
    </property>

  • Edit the conf/mapred-site.xml file to have the following entries -

    <property>
      <name>mapred.job.tracker</name>
      <value>master:54311</value>
    </property>

  • Edit the conf/hdfs-site.xml file to have the following entries -

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>

  • Edit the conf/mapred-site.xml file again to add the following entries -

    <property>
      <name>mapred.local.dir</name>
      <value>${hadoop.tmp.dir}/mapred/local</value>
    </property>
    <property>
      <name>mapred.map.tasks</name>
      <value>50</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>5</value>
    </property>
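The two host list files from the steps above are plain text with one host name per line; Hadoop 0.20.x reads them as conf/masters and conf/slaves. A minimal sketch for writing them follows. The function name write_cluster_hosts is hypothetical, and the host names master and slave are the /etc/hosts aliases used in this article.

```shell
#!/bin/sh
# Write conf/masters and conf/slaves for a small cluster.
# $1 = conf directory, $2 = master host name, remaining arguments = slave host names.
write_cluster_hosts() {
    [ $# -ge 3 ] || { echo "usage: write_cluster_hosts CONF_DIR MASTER SLAVE..." >&2; return 1; }
    conf_dir=$1
    master=$2
    shift 2
    mkdir -p "$conf_dir" || return 1
    printf '%s\n' "$master" > "$conf_dir/masters"
    # printf repeats its format for every remaining argument: one slave per line.
    printf '%s\n' "$@" > "$conf_dir/slaves"
}

# Usage on the master node, from the directory where Hadoop was extracted:
#   write_cluster_hosts conf master slave
```

The start scripts read conf/slaves on the master to decide where to launch the slave daemons over SSH.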

Now start the HDFS daemons by running the following command on the master.

$ bin/start-dfs.sh

Once started, check the status on the master by using jps command. You should get the following output –

14799 NameNode
15314 Jps
16977 SecondaryNameNode

On the slave, the output should be as shown below:

15183 DataNode
15616 Jps

Now start the MapReduce daemons using the following command.

$ bin/start-mapred.sh

Once started, check the status on the master by using jps command. You should get the following output:

16017 Jps
14799 NameNode
15596 JobTracker
14977 SecondaryNameNode

On the slaves, the output should be as shown below.

15183 DataNode
15897 TaskTracker
16284 Jps

Summary

This article has covered the different Hadoop installation modes and their technical details. You should always be careful when selecting the installation mode, as each has its own purpose. Beginners should start with the standalone mode installation and then proceed to the other options.

 

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.



   