Exploring Various Hadoop Installation Modes

Overview

Apache Hadoop can be installed in different modes as per the requirement. The mode is chosen during installation; by default, Hadoop is installed in standalone mode. The other modes are pseudo-distributed mode and distributed mode. The purpose of this tutorial is to explain the different installation modes in a simple way so that readers can follow them and do their own work.

In this article, I will discuss different installation modes and their details.

Introduction

Apache Hadoop is an open source framework that allows distributed processing of large data sets across clusters of machines using a simple programming model. Hadoop can scale from a single server up to thousands of machines, so choosing the right installation approach becomes critical. We can install Hadoop in three different modes:

  • Standalone mode – Single Node Cluster
  • Pseudo-distributed mode – Single Node Cluster
  • Distributed mode – Multi Node Cluster

Purpose for Different Installation Modes

When Apache Hadoop is used in a production environment, multiple server nodes are used for distributed computing. But for understanding the basics and experimenting with Hadoop, a single node installation is sufficient. There is another mode known as 'pseudo-distributed' mode, which is used to simulate a multi-node environment on a single server.

In this document, we will discuss how to install Hadoop on Ubuntu Linux. For any mode, the system should have Java version 1.6.x installed on it.

Standalone Mode Installation

Now, let us check the standalone mode installation process by following the steps mentioned below.

Install Java

Java (JDK version 1.6.x), either from Sun/Oracle or OpenJDK, is required.

  • Step 1 – If you cannot switch to OpenJDK and need the proprietary Sun JDK/JRE, install sun-java6 from the Canonical Partner Repository by using the following command.

    Note: The Canonical Partner Repository contains free-of-cost, closed-source third-party software. Canonical does not have access to the source code; it only packages and tests the software.

    Add the Canonical Partner Repository to the apt repositories using –

    	$ sudo add-apt-repository "deb http://archive.canonical.com/lucid partner"      
  • Step 2 – Update the source list.
    		$ sudo apt-get update 	
  • Step 3 – Install JDK version 1.6.x from Sun/Oracle.
    	$ sudo apt-get install sun-java6-jdk	
  • Step 4 – Once the JDK installation is complete, verify that it is set up correctly and reports version 1.6.x from Sun/Oracle.
    	user@ubuntu:~# java -version
    	java version "1.6.0_45"
    	Java(TM) SE Runtime Environment (build 1.6.0_45-b02)
    	Java HotSpot(TM) Client VM (build 16.4-b01, mixed mode, sharing)

Add Hadoop User

  • Step 5 – Add a dedicated Hadoop Unix user to your system, as shown below, to isolate this installation from other software –
    	$ sudo adduser hadoop_admin	

Download the Hadoop binary and install

  • Step 6 – Download Apache Hadoop from the Apache web site. Hadoop is distributed as a tar.gz archive. Copy this archive into the /usr/local/installables folder. The installables folder should be created under /usr/local before this step. Now run the following commands as sudo –
    	$ cd /usr/local/installables
    	$ sudo tar xzf hadoop-0.20.2.tar.gz
    	$ sudo chown -R hadoop_admin /usr/local/installables/hadoop-0.20.2

Define the JAVA_HOME Environment Variable

  • Step 7 – Open the Hadoop configuration file (hadoop-env.sh) located at /usr/local/installables/hadoop-0.20.2/conf/hadoop-env.sh and define JAVA_HOME as shown below –
     export JAVA_HOME=path/where/jdk/is/installed 

    JAVA_HOME should point to the JDK installation directory rather than the java binary (e.g. /usr/lib/jvm/java-6-sun when the sun-java6-jdk package is installed).

Installation in Standalone Mode

  • Step 8 – Now go to the HADOOP_HOME directory (the location where Hadoop was extracted) and run the following command –
    	$ bin/hadoop              

    The following output will be displayed –

               	Usage: hadoop [--config confdir] COMMAND	

    Some of the COMMAND options are mentioned below. Other options are available and can be listed by using the command mentioned above.

      	namenode -format                format the DFS filesystem
      	secondarynamenode               run the DFS secondary namenode
      	namenode                        run the DFS namenode
      	datanode                        run a DFS datanode
      	dfsadmin                        run a DFS admin client
      	mradmin                         run a Map-Reduce admin client
      	fsck                            run a DFS filesystem checking utility

The above output indicates that the standalone installation completed successfully. Now you can run the sample examples of your choice by calling –

   $  bin/hadoop jar hadoop-*-examples.jar  
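For example, the bundled examples jar can run a simple grep job against the configuration files. The following is only a sketch, assuming the hadoop-0.20.2 release used earlier in this tutorial (the examples jar name varies with the version) and that the commands are run from the HADOOP_HOME directory –

    # create a local input directory and copy some sample files into it
    $ mkdir input
    $ cp conf/*.xml input
    # run the grep example from the 0.20.2 examples jar; results go to the local 'output' folder
    $ bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'
    $ cat output/*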

Pseudo Distributed Mode Installation

This is a simulated multi node environment based on a single node server.

Here, the first required step is to configure SSH in order to access and manage the different nodes; passwordless SSH access to the nodes is mandatory. Once SSH is configured, enabled and accessible, we can start configuring Hadoop. The following configuration files need to be modified:

  • conf/core-site.xml
  • conf/hdfs-site.xml
  • conf/mapred-site.xml

Before editing these files, make sure that passwordless SSH to localhost works for the hadoop_admin user; a minimal setup sketch is shown below. Then open all of the configuration files in the vi editor and update the configuration.
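A minimal sketch of the SSH setup, assuming the openssh-server package is installed and you are logged in as the hadoop_admin user, looks like this –

    # generate an RSA key pair with an empty passphrase for the hadoop_admin user
    $ ssh-keygen -t rsa -P ""
    # authorize the new key for logins to the local machine
    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    # verify that SSH to localhost now works without a password prompt
    $ ssh localhost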

Configure core-site.xml file:

$ vi conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
</configuration>

Configure hdfs-site.xml file:

$ vi conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Configure mapred-site.xml file:

$ vi conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

Once these changes are done, we need to format the name node by using the following command. The command prompt will show a series of messages, ending with a success message.

$ bin/hadoop namenode -format

Our pseudo-distributed setup is now done. Let's start the single node cluster by using the following command. It will again show a set of messages on the command prompt and start the server processes.

$ bin/start-all.sh

Now we should check the status of the Hadoop processes by executing the jps command as shown below. It lists all the running Java processes.

$ jps
14799 NameNode
14977 SecondaryNameNode
15183 DataNode
15596 JobTracker
15897 TaskTracker

Stopping the Single node Cluster: We can stop the single node cluster by using the following command. The command prompt will display all the stopping processes.

$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Distributed Mode Installation

Before we start the distributed mode installation, we must ensure that we have the pseudo distributed setup done and we have at least two machines, one acting as master and the other acting as a slave. Now we run the following commands in sequence.

  • $ bin/stop-all.sh – Make sure that none of the Hadoop daemons are running on either machine.
  • Open the /etc/hosts file on each machine and add entries mapping the IP addresses of the master and slave machines to the following hostnames –
    master
    slave
  • $ ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave – Execute this command on the master to enable passwordless SSH to the slave. The same username should be used on all the machines. If a password is required, it can be set manually.
  • Now we open the two files – conf/masters and conf/slaves. The conf/masters file defines the master (name node) of our multi-node cluster, and the conf/slaves file lists the hosts on which the Hadoop slave daemons will run (example contents are shown after this list).
  • Edit the conf/core-site.xml file to have the following entries –
        <property>
          <name>fs.default.name</name>
          <value>hdfs://master:54310</value>
        </property>
  • Edit the conf/mapred-site.xml file to have the following entries –
        <property>
          <name>mapred.job.tracker</name>
          <value>hdfs://master:54311</value>
        </property>
  • Edit the conf/hdfs-site.xml file to have the following entries –
        <property>
          <name>dfs.replication</name>
          <value>2</value>
        </property>
  • Edit the conf/mapred-site.xml file to have the following entries –
        <property>
          <name>mapred.local.dir</name>
          <value>${hadoop.tmp.dir}/mapred/local</value>
        </property>
        <property>
          <name>mapred.map.tasks</name>
          <value>50</value>
        </property>
        <property>
          <name>mapred.reduce.tasks</name>
          <value>5</value>
        </property>
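As a reference, and assuming the hostnames master and slave from the /etc/hosts entries above, the conf/masters and conf/slaves files on the master machine might look like the following sketch (one hostname per line; include master in conf/slaves only if it should also run the slave daemons) –

    # conf/masters
    master

    # conf/slaves
    master
    slave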

Now start the master by using the following command.

 $ bin/start-dfs.sh 

Once started, check the status on the master by using the jps command. You should get the following output:

14799 NameNode
15314 Jps
16977 SecondaryNameNode

On the slave, the output should be as shown below:

15183 DataNode
15616 Jps

Now start the MapReduce daemons using the following command.

$ bin/start-mapred.sh

Once started, check the status on the master by using the jps command. You should get the following output:

16017 Jps
14799 NameNode
15596 JobTracker
14977 SecondaryNameNode

On the slaves, the output should be as shown below.

15183 DataNode
15897 TaskTracker
16284 Jps

Summary

This article has covered the different Hadoop installation modes and their technical details. Choose the installation mode carefully, as each has its own purpose: beginners should start with the standalone installation and then proceed to the other options.


About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture and design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.
