Apache Hadoop can be installed in different modes, depending on the requirement. The mode is configured during installation; by default, Hadoop is installed in standalone mode. The other modes are pseudo-distributed mode and distributed mode. The purpose of this tutorial is to explain the different installation modes in a simple way so that readers can follow them and do their own work.
In this article, I will discuss different installation modes and their details.
We all know that Apache Hadoop is an open source framework that allows distributed processing of large data sets across clusters of machines using simple programming models. Hadoop can scale from a single server up to thousands of computers. Under these conditions, installing Hadoop correctly becomes critical. We can install Hadoop in three different modes:
- Standalone mode - Single Node Cluster
- Pseudo distributed mode - Single Node Cluster
- Distributed mode - Multi Node Cluster
Purpose of the Different Installation Modes
When Apache Hadoop is used in a production environment, multiple server nodes are used for distributed computing. But for understanding the basics and playing around with Hadoop, single node installation is sufficient. There is another mode known as 'pseudo distributed' mode. This mode is used to simulate the multi node environment on a single server.
In this document we will discuss how to install Hadoop on Ubuntu Linux. Whichever mode you choose, the system should have Java version 1.6.x installed on it.
Standalone Mode Installation
Now, let us check the standalone mode installation process by following the steps mentioned below.
Java (JDK version 1.6.x), either from Sun/Oracle or OpenJDK, is required.
- Step 1 - If you cannot switch to OpenJDK and need the proprietary Sun JDK/JRE, install sun-java6 from the Canonical Partner Repository using the following command.
Note: The Canonical Partner Repository contains closed-source third-party software that is free of cost. Canonical does not have access to the source code; it only packages and tests the software.
Add the Canonical partner repository to the apt sources using -
$ sudo add-apt-repository "deb http://archive.canonical.com/lucid partner"
- Step 2 - Update the source list.
$ sudo apt-get update
- Step 3 - Install JDK version 1.6.x from Sun/Oracle.
$ sudo apt-get install sun-java6-jdk
- Step 4 - Once the JDK installation is over, make sure it is correctly set up by checking the version.
user@ubuntu:~# java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b02)
Java HotSpot(TM) Client VM (build 16.4-b01, mixed mode, sharing)
- Step 5 - Add a dedicated Hadoop user.
- Step 6 - Download the Hadoop binary and install it.
- Step 7 - Define the JAVA_HOME environment variable.
Installation in Standalone Mode
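The three steps above can be sketched as follows. The user name hduser, the Hadoop version and the download mirror are examples only and should be adapted to your environment.

```shell
# Add a dedicated group and user for Hadoop (names are examples)
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

# Download and extract a Hadoop 1.x release (version/mirror are examples)
wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar -xzf hadoop-1.2.1.tar.gz

# Point Hadoop at the installed JDK by setting JAVA_HOME in conf/hadoop-env.sh
# (the path below assumes the sun-java6-jdk package installed earlier)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```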
- Step 8 - Now go to the HADOOP_HOME directory (the location where Hadoop was extracted) and run the hadoop script without any arguments -
$ bin/hadoop
The following output will be displayed -
Usage: hadoop [--config confdir] COMMAND
Some of the COMMAND options are mentioned below. Other options are available and can be listed by running the command above.
namenode -format format the DFS filesystem
secondarynamenode run the DFS secondary namenode
namenode run the DFS namenode
datanode run a DFS datanode
dfsadmin run a DFS admin client
mradmin run a Map-Reduce admin client
fsck run a DFS filesystem checking utility
The above output indicates that the standalone installation completed successfully. Now you can run a sample example of your choice by calling -
$ bin/hadoop jar hadoop-*-examples.jar <NAME> <PARAMS>
For example, to estimate the value of pi:
$ bin/hadoop jar hadoop-*-examples.jar pi 10 100
Pseudo Distributed Mode Installation
This is a simulated multi node environment based on a single node server.
Here, the first step is to configure SSH so that the different nodes can be accessed and managed; SSH access to the nodes is mandatory. Once SSH is configured, enabled and accessible, we can start configuring Hadoop. The following configuration files need to be modified. Open each configuration file in the vi editor and update the configuration.
Configure core-site.xml file:
$ vi conf/core-site.xml
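For example, a minimal core-site.xml for pseudo-distributed mode points HDFS at the local host. The port 9000 is a conventional example value, not a requirement.

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```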
Configure hdfs-site.xml file:
$ vi conf/hdfs-site.xml
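In hdfs-site.xml, the replication factor is typically set to 1, since a pseudo-distributed cluster has only a single DataNode:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```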
Configure mapred-site.xml file:
$ vi conf/mapred-site.xml
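In mapred-site.xml, point the JobTracker at the local host; the port 9001 is again a conventional example.

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```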
Once these changes are done, we need to format the name node using the following command. The command prompt will show the progress messages one after another and, finally, a success message.
$ bin/hadoop namenode -format
Our setup for the pseudo-distributed node is done. Let's now start the single node cluster using the following command. It will again show a set of messages on the command prompt and start the server processes.
$ bin/start-all.sh
Now we should check the status of the Hadoop processes by executing the jps command, as shown below.
$ jps
On a pseudo-distributed setup, the listing should include the NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker processes.
Stopping the Single node Cluster: We can stop the single node cluster using the following command. The command prompt will display all the stopping processes.
$ bin/stop-all.sh
localhost: stopping tasktracker
localhost: stopping datanode
localhost: stopping secondarynamenode
Distributed Mode Installation
Before we start the distributed mode installation, we must ensure that we have the pseudo distributed setup done and we have at least two machines, one acting as master and the other acting as a slave. Now we run the following commands in sequence.
$ bin/stop-all.sh - Make sure none of the nodes are running
- Open the /etc/hosts file and add the following entries for master and slave
<IP ADDRESS> master
<IP ADDRESS> slave
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave - This command should be executed on the master to enable passwordless SSH. We should log in using the same username on all the machines. If we need a password, we can set it manually.
- Now we open two files - conf/masters and conf/slaves. Despite its name, conf/masters defines the host(s) on which the secondary NameNode runs; the conf/slaves file lists the hosts where the Hadoop slave daemons (DataNode and TaskTracker) will be running.
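With the two-machine setup assumed in this article (host names master and slave), the files could look like this; running slave daemons on the master as well is optional.

```
conf/masters:
master

conf/slaves:
master
slave
```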
- Edit the conf/core-site.xml file to have the following entries -
- Edit the conf/mapred-site.xml file to have the following entries -
- Edit the conf/hdfs-site.xml file to have the following entries -
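For the two-machine cluster assumed here, the key changes are to point the file system and JobTracker entries at the master host and raise the replication factor. The ports are conventional examples.

```xml
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```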
Now start the HDFS daemons from the master using the following command.
$ bin/start-dfs.sh
Once started, check the status on the master using the jps command. The listing should include the NameNode and SecondaryNameNode processes.
On the slave, jps should show a running DataNode process.
Now start the MapReduce daemons using the following command.
$ bin/start-mapred.sh
Once started, check the status on the master using the jps command again. A JobTracker process should now appear in the listing.
On the slaves, jps should additionally show a TaskTracker process.
This article has covered the different Hadoop installation modes and their technical details. You should be careful when selecting an installation mode, as each has its own purpose. Beginners should start with the standalone installation and then proceed to the other modes.
About the Author
Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, java/j2ee, open source and big data technologies. You can find more of his work at www.techalpine.com.