Overview
Apache Hadoop can be installed in different modes to suit different requirements. The mode is configured during installation; by default, Hadoop is installed in Standalone mode. The other two modes are Pseudo distributed mode and Distributed mode. The purpose of this tutorial is to explain the different installation modes in a simple way so that readers can follow along and do the setup themselves.
In this article, I will discuss different installation modes and their details.
Introduction
We all know that Apache Hadoop is an open source framework that allows distributed processing of large data sets across clusters of computers using simple programming models. Hadoop can scale from a single server up to thousands of machines. Under these conditions, installing Hadoop correctly becomes critical. We can install Hadoop in three different modes:
- Standalone mode – Single Node Cluster
- Pseudo distributed mode – Single Node Cluster
- Distributed mode – Multi Node Cluster
Purpose for Different Installation Modes
When Apache Hadoop is used in a production environment, multiple server nodes are used for distributed computing. But for understanding the basics and playing around with Hadoop, single node installation is sufficient. There is another mode known as ‘pseudo distributed’ mode. This mode is used to simulate the multi node environment on a single server.
In this document we will discuss how to install Hadoop on Ubuntu Linux. In any mode, the system should have Java version 1.6.x installed on it.
Standalone Mode Installation
Now, let us check the standalone mode installation process by following the steps mentioned below.
Install Java
Java (JDK version 1.6.x), either the Sun/Oracle JDK or OpenJDK, is required.
- Step 1 – If you cannot switch to OpenJDK and need the proprietary Sun JDK/JRE, install sun-java6 from the Canonical Partner Repository using the following commands.
Note: The Canonical Partner Repository contains closed source third party software that is free of cost. Canonical does not have access to the source code; it only packages and tests the software.
Add the Canonical Partner Repository to the apt sources using –
$ sudo add-apt-repository "deb http://archive.canonical.com/lucid partner"
- Step 2 – Update the source list.
$ sudo apt-get update
- Step 3 – Install JDK version 1.6.x from Sun/Oracle.
$ sudo apt-get install sun-java6-jdk
- Step 4 – Once the JDK installation is over, make sure that it is correctly set up by checking the installed version:
$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b02)
Java HotSpot(TM) Client VM (build 16.4-b01, mixed mode, sharing)
Add Hadoop User
- Step 5 – Add a dedicated Hadoop unix user to your system as shown below to isolate this installation from other software –
$ sudo adduser hadoop_admin
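If you prefer, you can also create a dedicated group and add the user to it, so that ownership of the installation can be managed at the group level. A minimal sketch, with hadoop_group as an assumed group name:
$ sudo addgroup hadoop_group
$ sudo adduser --ingroup hadoop_group hadoop_admin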
Download the Hadoop binary and install
- Step 6 – Download Apache Hadoop from the Apache web site. Hadoop comes in tar.gz format. Copy this binary into the /usr/local/installables folder. The installables folder should be created under /usr/local before this step. Now run the following commands as sudo –
$ cd /usr/local/installables
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo chown -R hadoop_admin /usr/local/installables/hadoop-0.20.2
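If you prefer to fetch the tarball from the command line instead of a browser, a sketch that also creates the folder first (the mirror URL is an assumption – pick one from the Apache download page):
$ sudo mkdir -p /usr/local/installables
$ cd /usr/local/installables
$ sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz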
Define env variable – JAVA_HOME
- Step 7 – Open the Hadoop configuration file (hadoop-env.sh) in the location /usr/local/installables/hadoop-0.20.2/conf/hadoop-env.sh and define JAVA_HOME as under –
export JAVA_HOME=/path/where/jdk/is/installed
(e.g. /usr/lib/jvm/java-6-sun; note that JAVA_HOME must point to the JDK installation directory, not to the java binary itself)
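If you are not sure where the JDK is installed, one way to find it (assuming the java command is on your PATH) is to resolve the symlink chain:
$ readlink -f $(which java)
Strip the trailing /jre/bin/java (or /bin/java) from the result to get the directory that JAVA_HOME should point to.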
Installation in Standalone Mode
- Step 8 – Now go to the HADOOP_HOME directory (the location where Hadoop was extracted) and run the following command –
$ bin/hadoop
The following output will be displayed –
Usage: hadoop [--config confdir] COMMAND
Some of the COMMAND options are mentioned below. Other options are available and can be listed by running the command mentioned above.
namenode -format     format the DFS filesystem
secondarynamenode    run the DFS secondary namenode
namenode             run the DFS namenode
datanode             run a DFS datanode
dfsadmin             run a DFS admin client
mradmin              run a Map-Reduce admin client
fsck                 run a DFS filesystem checking utility
The above output indicates that the standalone installation was completed successfully. Now you can run sample examples of your choice by calling –
$ bin/hadoop jar hadoop-*-examples.jar
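For example, to run the bundled pi estimator from the 0.20.2 release used in this tutorial (the two arguments below – the number of maps and the number of samples per map – are just illustrative values):
$ bin/hadoop jar hadoop-0.20.2-examples.jar pi 2 100000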
Pseudo Distributed Mode Installation
This is a simulated multi node environment based on a single node server.
Here, the first step required is to configure SSH in order to access and manage the different nodes; it is mandatory to have SSH access to every node. Once SSH is configured, enabled and accessible, we can start configuring Hadoop (a minimal key-setup sketch follows the list below). The following configuration files need to be modified:
- conf/core-site.xml
- conf/hdfs-site.xml
- conf/mapred-site.xml
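A minimal sketch of the SSH setup, assuming the hadoop_admin user created earlier and a passphrase-less RSA key (adjust this to your own security policy):
$ su - hadoop_admin
$ ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost
The last command verifies that you can log in to localhost without being prompted for a password.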
Open each of these configuration files in the vi editor and update the configuration as shown below.
Configure core-site.xml file:
$ vi conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
</configuration>
Configure hdfs-site.xml file:
$ vi conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Configure mapred-site.xml file:
$ vi conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
Once these changes are done, we need to format the name node by using the following command. The command prompt will show all the messages one after another and finally a success message.
$ bin/hadoop namenode -format
Our setup for the pseudo distributed mode is done. Let’s now start the single node cluster by using the following command. It will again show a set of messages on the command prompt and start the server processes.
$ bin/start-all.sh
Now we should check the status of the Hadoop processes by executing the jps command as shown below. It will show all the running processes.
$ jps
14799 NameNode
14977 SecondaryNameNode
15183 DataNode
15596 JobTracker
15897 TaskTracker
Stopping the Single node Cluster: We can stop the single node cluster by using the following command. The command prompt will display all the stopping processes.
$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Distributed Mode Installation
Before we start the distributed mode installation, we must ensure that we have the pseudo distributed setup done and we have at least two machines, one acting as master and the other acting as a slave. Now we run the following commands in sequence.
$ bin/stop-all.sh
- Make sure none of the nodes are running.
- Open the /etc/hosts file and add the following entries for master and slave:
<IP-of-master> master
<IP-of-slave> slave
- Execute the following command on the master to set up passwordless ssh to the slave:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave
We should log in with the same username on all the machines. If we need a password, we can set it manually.
- Now we open the two files – conf/masters and conf/slaves. The conf/masters file defines the host that runs the secondary NameNode of our multi node cluster, and the conf/slaves file lists the hosts where the Hadoop slave daemons (DataNode and TaskTracker) will be running.
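As an illustration, with the two machines assumed in this tutorial, the files on the master could look as under:
$ cat conf/masters
master
$ cat conf/slaves
master
slave
(Listing master in conf/slaves makes the master run a DataNode and TaskTracker as well; omit it if the master should run only the master daemons.)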
- Edit the conf/core-site.xml file to have the following entries –
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
- Edit the conf/mapred-site.xml file to have the following entries –
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>
- Edit the conf/hdfs-site.xml file to have the following entries –
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
- Also add the following entries to the conf/mapred-site.xml file –
<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>50</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>5</value>
</property>
Now start the master by using the following command.
$ bin/start-dfs.sh
Once started, check the status on the master by using the jps command. You should get the following output –
14799 NameNode
15314 Jps
16977 SecondaryNameNode
On the slave, the output should be as shown below:
15183 DataNode
15616 Jps
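Apart from jps, you can confirm that the slave's DataNode has registered with the master by running a DFS admin report on the master – the dfsadmin client was listed in the COMMAND options earlier:
$ bin/hadoop dfsadmin -report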
Now start the MapReduce daemons using the following command.
$ bin/start-mapred.sh
Once started, check the status on the master by using the jps command. You should get the following output:
16017 Jps
14799 NameNode
15596 JobTracker
14977 SecondaryNameNode
On the slaves, the output should be as shown below.
15183 DataNode
15897 TaskTracker
16284 Jps
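At this point the cluster is ready for a trial job. A quick end-to-end check (the input and output paths below are illustrative) is to copy a file into HDFS and run the bundled wordcount example:
$ bin/hadoop fs -mkdir /input
$ bin/hadoop fs -put conf/core-site.xml /input
$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount /input /output
$ bin/hadoop fs -cat /output/part*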
Summary
This article has covered different Hadoop installation modes and their technical details. You should be careful when selecting an installation mode, as each has its own purpose. Beginners should start with the standalone (single node) installation and then proceed to the other options.
About the Author
Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source, and big data technologies. You can find more of his work at www.techalpine.com.