This article explores the basics of the Hadoop Distributed File System (HDFS), the underlying file system of the Apache Hadoop framework. HDFS is a distributed storage layer that can span thousands of commodity hardware nodes. It provides fault tolerance, high throughput, streaming data access and reliability. The architecture of HDFS is suited to storing large volumes of data and processing them quickly. HDFS is part of the Apache ecosystem.
Apache Hadoop is a software framework provided by the open source community. It is used to store and process large-scale data sets on clusters of commodity hardware. Hadoop is licensed under the Apache License 2.0.
The Apache Hadoop framework consists of the following modules:
- Hadoop Common – The common module contains libraries and utilities that are required by other modules of Hadoop.
- Hadoop Distributed File System (HDFS) – This is the distributed file-system that stores data on the commodity machines. This also provides a very high aggregate bandwidth across the cluster.
- Hadoop YARN – This is the resource-management platform that is responsible for managing compute resources in the cluster and using them to schedule users' applications.
- Hadoop MapReduce – This is the programming model used for large scale data processing.
All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of a single machine or an entire rack) are common and thus should be handled automatically in software by the Hadoop framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS), respectively.
Hadoop Distributed File System (HDFS)
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode and a number of DataNodes. The NameNode manages the file system metadata, and the DataNodes store the actual data.
The HDFS architecture diagram shows the basic interactions among the NameNode, the DataNodes, and the clients. A client contacts the NameNode for file metadata or file modifications. The client then performs the actual file I/O directly with the DataNodes.
Figure 1: HDFS Architecture
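The interaction described above can be illustrated with a small in-memory model. This is only a sketch, not the Hadoop client API: the class names, the `get_block_locations` and `read_block` methods, and the block IDs are all hypothetical and exist only to show that metadata comes from the NameNode while file bytes come from the DataNodes.

```python
# Toy, in-memory model of the HDFS read path (NOT the real Hadoop API).
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Stores actual block contents, keyed by a (hypothetical) block ID."""
    blocks: dict = field(default_factory=dict)

    def read_block(self, block_id):
        return self.blocks[block_id]


@dataclass
class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""
    file_to_blocks: dict = field(default_factory=dict)   # path -> [block_id, ...]
    block_locations: dict = field(default_factory=dict)  # block_id -> [datanode name, ...]

    def get_block_locations(self, path):
        return [(b, self.block_locations[b]) for b in self.file_to_blocks[path]]


def read_file(path, namenode, datanodes):
    """Client side: ask the NameNode for metadata, then read from DataNodes directly."""
    data = b""
    for block_id, locations in namenode.get_block_locations(path):
        # Simplification: use the first listed replica for each block.
        data += datanodes[locations[0]].read_block(block_id)
    return data


datanodes = {
    "dn1": DataNode({"blk_1": b"hello "}),
    "dn2": DataNode({"blk_2": b"hdfs"}),
}
namenode = NameNode(
    file_to_blocks={"/demo.txt": ["blk_1", "blk_2"]},
    block_locations={"blk_1": ["dn1"], "blk_2": ["dn2"]},
)
print(read_file("/demo.txt", namenode, datanodes))
```

Note that the NameNode object in this sketch never touches file contents, which mirrors the separation of metadata and data in the architecture.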
Salient Features of HDFS
The following are some of the most important features:
- Hadoop, including HDFS, is well suited to distributed storage and distributed processing on low-cost commodity hardware. Hadoop is scalable, fault tolerant and simple to expand. MapReduce is well known for its simplicity and its applicability to a large set of distributed applications.
- HDFS is highly configurable. The default configuration setup is good enough for most applications. In general, the default configuration needs to be tuned only for very large clusters.
- Hadoop is written in Java and is supported on nearly all major platforms.
- Hadoop supports shell and shell-like commands to communicate with HDFS directly.
- The NameNode and DataNodes have their own built-in web servers that make it easy to check the current status of the cluster.
- New features and updates are frequently implemented in HDFS. The following list is a subset of the useful features available in HDFS:
- File permissions and authentication.
- Rack awareness: This helps to take a node's physical location into account while scheduling tasks and allocating storage.
- Safemode: This is an administrative mode mainly used for maintenance purposes.
- fsck: This is a utility used to diagnose the health of the file system and to find missing files or blocks.
- fetchdt: This is a utility used to fetch a DelegationToken and store it in a file on the local system.
- Rebalancer: This is a tool used to balance the cluster when the data is unevenly distributed across DataNodes.
- Upgrade and rollback: Once the software is upgraded, it is possible to roll back to the HDFS state before the upgrade in case of any unexpected problem.
- Secondary NameNode: This node performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.
- Checkpoint node: This node performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes made to HDFS. It replaces the role previously filled by the Secondary NameNode. The NameNode allows multiple Checkpoint nodes at the same time, as long as no Backup nodes are registered with the system.
- Backup node: This can be defined as an extension of the Checkpoint node. Along with checkpointing, it also receives a stream of edits from the NameNode. It therefore maintains its own in-memory copy of the namespace, which is always in sync with the state of the active NameNode namespace. Only one Backup node can be registered with the NameNode at a time.
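The checkpointing idea behind the Secondary NameNode and Checkpoint node entries above can be sketched as follows. This is a deliberately simplified model under stated assumptions: the namespace is represented as a plain set of paths, and the edit log as a list of hypothetical `("create", path)` and `("delete", path)` tuples. The real on-disk image and edit log formats are far richer.

```python
# Simplified checkpoint: merge the edit log into the namespace image,
# then start a fresh (empty) log so it does not grow without bound.

def apply_edit(namespace, edit):
    """Apply one logged modification to the in-memory namespace (a set of paths)."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Return (new_image, new_edit_log) after replaying every edit onto a copy of the image."""
    new_image = set(image)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


image = {"/a", "/b"}
edits = [("create", "/c"), ("delete", "/a")]
image, edits = checkpoint(image, edits)
print(sorted(image), edits)
```

After the checkpoint, the image alone describes the namespace and the log is empty, which is why periodic checkpoints keep the log size within limits.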
Goal of HDFS
Hadoop aims to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives. For better performance, the MapReduce framework tries to schedule work on the servers where the data to be processed is stored. This is known as data locality. Because of this, using a storage area network (SAN) or network-attached storage (NAS) is not recommended in a Hadoop environment. For Hadoop deployments using a SAN or NAS, the extra network communication overhead can cause performance bottlenecks, especially in larger clusters.
Now, consider a cluster of 1,000 machines, each with three internal disk drives. Think of the failure rate of a cluster composed of 3,000 inexpensive drives and 1,000 inexpensive servers! The component mean time to failure (MTTF) you are going to experience in a Hadoop cluster is likely similar to the zipper on your kid's jacket - it is bound to fail. The best part about Hadoop is that the reality of the MTTF rates associated with inexpensive hardware is well understood and accepted.
This is part of the strength of Hadoop: it has built-in fault tolerance and fault-compensation capabilities. The same goes for HDFS, as the data is divided into blocks, and copies of these blocks are stored on other servers across the Hadoop cluster.
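Data locality can be illustrated with a small scheduling sketch. The function below is hypothetical and is not Hadoop's scheduler: it simply prefers a server that already stores a replica of the block and falls back to any server with a free slot, which would then have to read the block over the network.

```python
# Toy locality-aware task assignment (illustrative only, not Hadoop's scheduler).

def assign_task(block_hosts, free_slots):
    """Pick a server for a task that processes one block.

    block_hosts: servers that store a replica of the block.
    free_slots:  dict of server name -> number of free task slots (mutated).
    Returns (server, is_local); server is None if the cluster is full.
    """
    # First choice: a server that already holds the data (data-local).
    for host in block_hosts:
        if free_slots.get(host, 0) > 0:
            free_slots[host] -= 1
            return host, True
    # Fallback: any server with capacity; the data must travel over the network.
    for host, slots in free_slots.items():
        if slots > 0:
            free_slots[host] -= 1
            return host, False
    return None, False


free_slots = {"s1": 1, "s2": 0, "s3": 1}
first = assign_task(["s2", "s1"], free_slots)   # s2 is busy, s1 holds a replica
second = assign_task(["s2", "s1"], free_slots)  # both replica holders are busy now
print(first, second)
```

In the second call the task still runs, but on a server without the data, which is exactly the extra network traffic that data locality tries to avoid.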
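To put rough numbers on the drive failures discussed above, here is a back-of-the-envelope calculation. The 2% annual failure rate per drive is an assumption chosen purely for illustration (it does not come from this article or from any measurement), and the model further assumes that drives fail independently.

```python
# Back-of-the-envelope failure arithmetic for a 3,000-drive cluster.
# ASSUMPTIONS: 2% annual failure rate per drive, independent failures.

def prob_at_least_one_failure(n_components, annual_failure_rate, days):
    """Probability that at least one of n independent components fails within `days`."""
    per_day_survival = (1.0 - annual_failure_rate) ** (1.0 / 365.0)
    return 1.0 - per_day_survival ** (n_components * days)


DRIVES = 3000
ASSUMED_AFR = 0.02  # hypothetical annual failure rate per drive

expected_failures_per_year = DRIVES * ASSUMED_AFR
p_day = prob_at_least_one_failure(DRIVES, ASSUMED_AFR, days=1)
p_month = prob_at_least_one_failure(DRIVES, ASSUMED_AFR, days=30)

print(f"expected drive failures per year: {expected_failures_per_year:.0f}")
print(f"P(at least one drive fails today): {p_day:.3f}")
print(f"P(at least one drive fails in 30 days): {p_month:.3f}")
```

Under these assumptions a drive failure somewhere in the cluster is close to certain within a month, which is why the software has to treat failure as a routine event.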
Let us consider a file that contains the telephone numbers of all the residents of the United States of America. In a simple partitioning scheme, people whose last names start with A could be stored on server 1, those whose last names begin with B on server 2, and so on.
In a Hadoop environment, pieces of this phonebook would be distributed across the entire cluster. To reconstruct the entire phonebook, your program would need to access the blocks from every server in the cluster. To achieve higher availability, HDFS replicates these smaller pieces of data onto two additional servers by default. This is redundant storage, but the redundancy is justified because it guards against failures and provides fault tolerance.
This redundancy can be increased or decreased on a per-file basis or for the whole environment, and it offers multiple benefits. The most obvious is that the data is highly available. In addition, the data redundancy allows the Hadoop cluster to break work up into smaller chunks and run those smaller jobs on all the servers in the cluster for better scalability. Finally, as end users we get the benefit of data locality, which is critical when working with large data sets.
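The block-and-replica idea can be sketched in a few lines. This is a minimal illustration and not HDFS's actual placement policy: the helper names are hypothetical, the block size is tiny so the output stays readable, and replicas are placed round-robin, whereas real HDFS also takes rack location into account.

```python
# Toy block splitting and replica placement (round-robin; NOT HDFS's rack-aware policy).

def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, servers, replication=3):
    """Map each block index to `replication` distinct servers."""
    if replication > len(servers):
        raise ValueError("not enough servers for the requested replication")
    return {
        b: [servers[(b + r) % len(servers)] for r in range(replication)]
        for b in range(num_blocks)
    }


def readable_blocks(placement, failed_servers):
    """Blocks that still have at least one replica on a healthy server."""
    return [
        b for b, hosts in placement.items()
        if any(h not in failed_servers for h in hosts)
    ]


blocks = split_into_blocks(b"ABCDEFGHIJ", block_size=4)
servers = ["s1", "s2", "s3", "s4", "s5"]
placement = place_replicas(len(blocks), servers, replication=3)

print(blocks)
print(placement)
print(readable_blocks(placement, failed_servers={"s2", "s3"}))
```

With three copies of each block, the loss of two servers in this example still leaves every block readable. Lowering the `replication` argument shows how quickly that guarantee weakens.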
We have seen that HDFS is one of the major components of the Apache Hadoop ecosystem. It is the underlying storage layer and, unlike a local file system, it spreads and replicates data across many machines.
I hope you have enjoyed the article and understood the basic concepts of HDFS. Keep reading.
About the Author
Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.