Learn how HDFS Federation enhances the existing HDFS architecture by providing a clear separation between namespace and storage, enabling scalability and isolation at the cluster level.
Hadoop Federation separates the namespace layer from the storage layer and makes the block storage layer available as a generic service. It also extends the architecture of an existing HDFS cluster to allow new implementations and use cases. The current HDFS architecture has two layers:
- Namespace – This layer manages files, directories and blocks. It supports basic file system operations such as listing, creating, modifying and deleting files and directories.
- Block Storage – This layer has two parts:
  - Block Management – This part manages the DataNodes in the cluster and provides block operations such as creation, deletion, modification and lookup. It also takes care of replication management.
  - Physical Storage – This part stores the blocks and provides access to them for read and write operations.
In the current HDFS architecture there is only one namespace for the whole cluster, and it is managed by a single NameNode. This makes an HDFS cluster easy to implement, and the layering works well for smaller setups. For larger organizations, where huge volumes of data must be handled quickly, the approach has limitations that Hadoop Federation addresses. Hadoop Federation can therefore be described as an extended architecture that overcomes the limitations of the current HDFS implementation.
The limitations are as follows:
- Tightly Coupled Block Storage and Namespace – In the current architecture the block storage and the namespace are tightly coupled. This makes alternate NameNode implementations difficult and prevents other services from using the block storage directly.
- Namespace Scalability – An HDFS cluster scales horizontally by adding DataNodes, but the namespace cannot be scaled horizontally in an existing cluster. It can only be scaled vertically on a single NameNode. Because the NameNode keeps the complete file system metadata in memory, the number of blocks, files and directories the file system can support is limited by what fits in the memory of that single NameNode.
- Performance – File system operations are limited to the throughput of a single NameNode, which at present supports 60,000 concurrent tasks. The upcoming MapReduce from Apache will support more than 100,000 concurrent tasks and will therefore require multiple NameNodes.
- Isolation – HDFS deployments usually run in a multi-tenant environment where a single cluster is shared by multiple organizations. With a single namespace, it is not possible to give one application or one organization a namespace of its own.
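To make the namespace scalability limit more concrete, here is a back-of-the-envelope sketch. It assumes roughly 150 bytes of NameNode heap per namespace object (file, directory or block), which is a commonly quoted rule of thumb and not a figure from this article. The function name and the sample object counts are hypothetical.

```shell
# Rough heap estimate for a single NameNode.
# ASSUMPTION: about 150 bytes of heap per namespace object (file, directory
# or block). This is a rule of thumb, not a measured or guaranteed value.
BYTES_PER_OBJECT=150

# estimate_heap_mb FILES DIRS BLOCKS
# Prints the estimated heap in MiB (integer arithmetic, rounded down).
estimate_heap_mb() {
  echo $(( ($1 + $2 + $3) * $BYTES_PER_OBJECT / 1024 / 1024 ))
}

# Hypothetical namespace: 100 million files, 10 million directories,
# 150 million blocks, all held by one NameNode.
estimate_heap_mb 100000000 10000000 150000000
```

Under the same assumption, spreading those objects across several federated NameNodes would shrink the per-NameNode estimate in proportion to each NameNode's share.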
Hadoop Federation allows the name service to scale horizontally. It uses several NameNodes, or namespaces, that are independent of each other. These NameNodes are federated: they do not need to coordinate with one another. The DataNodes are used as common storage by all the NameNodes. Each DataNode registers with every NameNode in the cluster, sends periodic heartbeats and block reports, and responds to commands from the NameNodes.
A block pool is the set of blocks that belong to a single namespace. The DataNodes in a cluster store blocks for all the block pools, and each block pool is managed independently. This lets a namespace generate block IDs for new blocks without coordinating with the other namespaces. If one NameNode fails for any reason, the DataNodes keep serving the other NameNodes.
A namespace and its block pool are together called a namespace volume. When a NameNode or namespace is deleted, the corresponding block pool on the DataNodes is deleted as well. During a cluster upgrade, each namespace volume is upgraded as a unit.
Benefits of Hadoop Federation
Hadoop Federation offers the following benefits:
- Scalability and Isolation – Multiple NameNodes scale the file system namespace horizontally. Users and categories of applications can be given separate namespace volumes, which isolates them from one another.
- Generic Storage Service – The block pool abstraction makes it possible to build new file systems on top of the block storage. New applications can be built directly on the block storage layer without going through the file system interface. Custom categories of block pools, different from the default block pool, can also be built.
- Simple Design – NameNodes and namespaces are independent of each other. The design requires very little change to the existing NameNode, so each NameNode remains as robust as before. Federation is also backward compatible: existing single-NameNode deployments continue to work without any configuration changes.
Configuring an HDFS Federation
Federation configuration is designed so that all the nodes in the cluster can use the same configuration. The configuration is carried out in the following steps:
- Step 1: The following parameter needs to be added to the existing configuration:
  - dfs.nameservices – This is configured with a comma-separated list of NameServiceIDs. DataNodes use this parameter to determine all the NameNodes in the cluster.
- Step 2: The configuration keys of the following components need to be suffixed with the corresponding NameServiceID in the common configuration file:
  - Secondary NameNode
A sample configuration file for two NameNodes is shown below.
Listing 1: A Sample configuration file for two nodes
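A minimal sketch of what such a file might contain, written as a small shell function that prints an hdfs-site.xml fragment. The property names dfs.nameservices, dfs.namenode.rpc-address, dfs.namenode.http-address and dfs.namenode.secondary.http-address are standard HDFS keys. The NameServiceIDs ns1 and ns2, the host names, the port placeholders and the helper functions are assumptions made for illustration only.

```shell
# Print one <property> element of an hdfs-site.xml file.
property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# federation_conf NAMESERVICE_ID...
# Prints a federated hdfs-site.xml sketch for the given NameServiceIDs.
# Host names and ports are placeholders (assumptions), not real defaults.
federation_conf() (
  echo '<configuration>'
  # Comma-separated list of NameServiceIDs, read by the DataNodes
  property dfs.nameservices "$(echo "$*" | tr ' ' ',')"
  for ns in "$@"; do
    # Per-NameNode keys carry the NameServiceID as a suffix
    property "dfs.namenode.rpc-address.$ns" "nn-host-$ns:rpc-port"
    property "dfs.namenode.http-address.$ns" "nn-host-$ns:http-port"
    property "dfs.namenode.secondary.http-address.$ns" "snn-host-$ns:http-port"
  done
  echo '</configuration>'
)

# Two NameNodes, identified by the hypothetical NameServiceIDs ns1 and ns2
federation_conf ns1 ns2
```

Generating the fragment from a list of NameServiceIDs keeps the suffixes consistent with the value of dfs.nameservices, which is the part most easily mistyped when editing the file by hand.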
Formatting the NameNode
- Step 1: Format the first NameNode using the following command:
$HADOOP_PREFIX_HOME/bin/hdfs namenode -format [-clusterId <cluster_id>]
The cluster ID must be unique and must not conflict with any other existing cluster ID. If it is not provided, a unique cluster ID is generated at the time of formatting.
- Step 2: Format each additional NameNode using the following command:
$HADOOP_PREFIX_HOME/bin/hdfs namenode -format -clusterId <cluster_id>
The cluster ID given here must be the same as the one used in step 1. If the two differ, the additional NameNode will not be part of the federated cluster.
Starting and stopping the cluster
Check the commands to start and stop the cluster.
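As a sketch of the commands this section refers to: Apache Hadoop ships the scripts start-dfs.sh and stop-dfs.sh in its sbin directory, and they can be run from any node where the HDFS configuration is available. The wrapper function dfs_cluster, the RUN switch and the /opt/hadoop fallback below are hypothetical; HADOOP_PREFIX_HOME is assumed to point at the Hadoop installation directory, as in the format commands.

```shell
# ASSUMPTION: HADOOP_PREFIX_HOME points at the Hadoop installation directory;
# /opt/hadoop is only a hypothetical fallback.
HADOOP_DIR="${HADOOP_PREFIX_HOME:-/opt/hadoop}"

# dfs_cluster start|stop
# Prints the script that would be run; set RUN=1 to actually execute it.
dfs_cluster() {
  case "$1" in
    start|stop) script="$HADOOP_DIR/sbin/$1-dfs.sh" ;;
    *) echo "usage: dfs_cluster start|stop" >&2; return 2 ;;
  esac
  if [ "${RUN:-0}" = "1" ]; then
    "$script"
  else
    echo "$script"
  fi
}

dfs_cluster start   # dry run: prints the start script path
dfs_cluster stop    # dry run: prints the stop script path
```

The dry-run default is a deliberate choice for this sketch, so that the script can be inspected before anything is started or stopped on a live cluster.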
Add a new namenode to an existing cluster
Multiple NameNodes are at the heart of Hadoop Federation, so it is important to understand how to add new NameNodes and scale horizontally.
The following steps are needed to add new NameNodes:
- Add the dfs.nameservices parameter to the configuration.
- Update the configuration keys with the NameServiceID suffix.
- Add the configuration for the new NameNode to the configuration files.
- Propagate the configuration file to all the nodes in the cluster.
- Start the new NameNode and the Secondary NameNode.
- Refresh the DataNodes so that they pick up the newly added NameNode. In Apache Hadoop this is done with the hdfs dfsadmin -refreshNamenodes command.
- The refresh command must be executed against all the DataNodes in the cluster.
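The refresh step can be sketched as a loop over the DataNodes. The hdfs dfsadmin -refreshNamenodes command takes a DataNode host and its IPC port. The helper function, the host names and the port value 9867 below are assumptions; check dfs.datanode.ipc.address in your own configuration for the real port.

```shell
# refresh_commands IPC_PORT DATANODE_HOST...
# Prints one refresh command per DataNode; nothing is executed here.
refresh_commands() (
  port="$1"
  shift
  for host in "$@"; do
    # Single quotes keep $HADOOP_PREFIX_HOME literal, so it is expanded
    # only when the printed line is actually run.
    printf '$HADOOP_PREFIX_HOME/bin/hdfs dfsadmin -refreshNamenodes %s:%s\n' "$host" "$port"
  done
)

# Hypothetical DataNode hosts; 9867 is an assumed DataNode IPC port.
refresh_commands 9867 dn1.example.com dn2.example.com
```

Once the printed commands have been reviewed, they can be run one by one or piped to a shell on a node where HADOOP_PREFIX_HOME is set.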
HDFS Federation was introduced to overcome the limitations of the earlier HDFS implementation. Scalability at the namespace layer is the most important feature of the HDFS Federation architecture. HDFS Federation is also backward compatible, so an existing single-NameNode configuration continues to work without any changes.
About the Author
Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.