Exploring the Hadoop Distributed File System (HDFS)

This article explores the basics of the Hadoop Distributed File System (HDFS), the underlying file system of the Apache Hadoop framework. HDFS is a distributed storage layer that can span thousands of commodity hardware nodes. It provides fault tolerance, high throughput, streaming data access and reliability. The architecture of HDFS is suited to storing large volumes of data and processing them quickly. HDFS is part of the Apache Hadoop ecosystem.

Introduction

Apache Hadoop is an open source software framework for storing and processing large-scale data sets on clusters of commodity hardware. Hadoop is licensed under the Apache License 2.0.

The Apache Hadoop framework consists of the following modules:

  • Hadoop Common: The common module contains libraries and utilities that are required by other modules of Hadoop.
  • Hadoop Distributed File System (HDFS): This is the distributed file-system that stores data on the commodity machines. This also provides a very high aggregate bandwidth across the cluster.
  • Hadoop YARN: This is the resource-management platform that is responsible for managing compute resources over the clusters and using them for scheduling of users’ applications.
  • Hadoop MapReduce: This is the programming model used for large scale data processing.

All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of a single machine or an entire rack) are common and should therefore be handled automatically in software by the Hadoop framework. Apache Hadoop’s MapReduce and HDFS components were originally derived from Google’s MapReduce and Google File System (GFS) papers, respectively.

Hadoop Distributed File System (HDFS)

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode and a number of DataNodes. The NameNode manages the file system metadata, and the DataNodes store the actual data.

The HDFS architecture diagram shows the basic interactions among the NameNode, the DataNodes, and the clients. A client contacts the NameNode for file metadata or file modifications. The client then performs the actual file I/O directly with the DataNodes.


Figure 1: HDFS Architecture
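
The interaction described above can be sketched with a small toy model. This is only an illustration, not the real Hadoop client API: the class names, the `get_block_locations` method and the block ids are all hypothetical. It shows the two steps of a read: a metadata lookup at the NameNode, followed by data reads that go directly to the DataNodes.

```python
# Toy model of the HDFS read path; NOT the real Hadoop client API.
# All class, method and block names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes


@dataclass
class NameNode:
    # file path -> ordered list of (block id, names of DataNodes holding it)
    metadata: dict = field(default_factory=dict)

    def get_block_locations(self, path):
        return self.metadata[path]


def read_file(path, namenode, datanodes):
    """Ask the NameNode where the blocks are, then read them from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        node = datanodes[holders[0]]  # use the first listed replica
        data += node.blocks[block_id]
    return data


dn1 = DataNode("dn1", {"blk_1": b"hello "})
dn2 = DataNode("dn2", {"blk_2": b"world"})
nn = NameNode({"/demo.txt": [("blk_1", ["dn1"]), ("blk_2", ["dn2"])]})
print(read_file("/demo.txt", nn, {"dn1": dn1, "dn2": dn2}))
```

Note that the NameNode in this sketch never touches file contents; it only answers the question of which DataNode holds which block.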

Salient Features of HDFS

The following are some of the most important features:

  • Hadoop, including HDFS, is well suited to distributed storage and distributed processing on low-cost commodity hardware. Hadoop is scalable, fault tolerant and simple to expand. MapReduce is well known for its simplicity and its applicability to a large set of distributed applications.
  • HDFS is highly configurable. The default configuration is good enough for most applications. In general, it needs to be tuned only for very large clusters.
  • Hadoop is written in Java and is supported on all major platforms.
  • Hadoop supports shell-like commands for interacting with HDFS directly.
  • The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.
  • New features and updates are frequently implemented in HDFS. The following list is a subset of the useful features available in HDFS:
    • File permissions and authentication.
    • Rack awareness: This helps to take a node’s physical location into account while scheduling tasks and allocating storage.
    • Safemode: This is an administrative mode mainly used for maintenance purposes.
    • fsck: This is a utility used to diagnose the health of the file system and to find missing files or blocks.
    • fetchdt: This is a utility used to fetch a DelegationToken and store it in a file on the local system.
    • Rebalancer: This is a tool used to balance the cluster when the data is unevenly distributed across DataNodes.
    • Upgrade and rollback: After a software upgrade, it is possible to roll back to the state HDFS was in before the upgrade if unexpected problems occur.
    • Secondary NameNode: This node performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.
    • Checkpoint node: This node performs periodic checkpoints of the namespace and helps minimize the size of the log, stored at the NameNode, of changes made to HDFS. It replaces the role previously filled by the Secondary NameNode. The NameNode allows multiple Checkpoint nodes at the same time, as long as there are no Backup nodes registered with the system.
    • Backup node: This is an extension of the Checkpoint node. In addition to checkpointing, it receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode’s namespace state. Only one Backup node may be registered with the NameNode at a time.
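
To make the idea of a checkpoint more concrete, here is a minimal sketch of what the Secondary NameNode, Checkpoint node and Backup node have in common: a saved namespace image is combined with a log of modifications to produce a fresh image, after which the log can start empty again. The operation names (`create`, `delete`, `rename`) and the data layout are invented for this sketch and do not reflect the actual on-disk formats used by HDFS.

```python
# Toy checkpoint: merge an edit log into a namespace image.
# The operation names and data layout are invented for this sketch.

def apply_edits(image, edits):
    """Return a new namespace image with the logged edits applied in order."""
    namespace = dict(image)  # copy so the old image stays untouched
    for op, path, *args in edits:
        if op == "create":
            namespace[path] = args[0]  # args[0]: list of block ids
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)  # args[0]: new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


def checkpoint(image, edits):
    """Produce a fresh image and an empty edit log."""
    return apply_edits(image, edits), []


image = {"/a.txt": ["blk_1"]}
edits = [
    ("create", "/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_image, new_edits = checkpoint(image, edits)
print(new_image)
print(new_edits)
```

Without periodic checkpoints the log would keep growing, and replaying it at startup would take longer and longer. That is the problem these nodes address.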

Goal of HDFS

Hadoop aims to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives. For better performance, MapReduce tries to assign each workload to a server that already stores the data to be processed. This is known as data locality. Because of this, using a storage area network (SAN) or network attached storage (NAS) is not recommended in a Hadoop environment. For Hadoop deployments that use a SAN or NAS, the extra network communication overhead can cause performance bottlenecks, especially in larger clusters.
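
Data locality can be illustrated with a simplified assignment routine. This is a sketch under stated assumptions and not how Hadoop's schedulers are actually implemented: the function name, the slot model and the server names are hypothetical. Each block's task goes to a server that already holds a replica whenever one has capacity, and only falls back to a remote server otherwise.

```python
# Toy locality-aware task assignment; real Hadoop schedulers are more involved.
# The function name, slot model and server names are hypothetical.

def assign_tasks(block_locations, free_slots):
    """Map each block to a server, preferring servers that hold the block.

    block_locations: block id -> list of servers holding a replica
    free_slots: server -> number of tasks it can still accept
    """
    slots = dict(free_slots)
    assignment = {}
    for block, holders in block_locations.items():
        local = [s for s in holders if slots.get(s, 0) > 0]
        if local:
            server = local[0]  # data-local: no block transfer needed
        else:
            # Fall back to the server with the most free slots (remote read).
            server = max(slots, key=slots.get)
            if slots[server] <= 0:
                raise RuntimeError("no free slots left in the cluster")
        assignment[block] = server
        slots[server] -= 1
    return assignment


blocks = {
    "blk_1": ["s1", "s2"],
    "blk_2": ["s1", "s3"],
    "blk_3": ["s1"],
}
print(assign_tasks(blocks, {"s1": 1, "s2": 1, "s3": 1}))
```

In this run the task for `blk_3` ends up on a server that does not hold the block, because the only server holding it is already busy. That remote read is exactly the network cost that data locality tries to avoid.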

Now, consider a cluster of 1,000 machines, each with three internal disk drives. Think of the failure rate of a cluster composed of 3,000 inexpensive drives and 1,000 inexpensive servers! The component mean time to failure (MTTF) you will experience in a Hadoop cluster is likely similar to that of the zipper on your kid’s jacket: it is bound to fail. The best part about Hadoop is that the reality of the MTTF rates associated with inexpensive hardware is well understood and accepted.
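
A rough calculation shows why failures at this scale are routine. The article gives no failure rate, so the 4% annual drive failure rate below is purely an assumed, illustrative value, and the model assumes that drives fail independently at a constant rate.

```python
# Back-of-the-envelope failure arithmetic for the 3,000-drive cluster.
# ASSUMPTION: the 4% annual failure rate is a made-up illustrative value.

def expected_failures_per_year(components, annual_failure_rate):
    return components * annual_failure_rate


def prob_any_failure_in_days(components, annual_failure_rate, days):
    """Chance that at least one component fails within `days` days,
    assuming independent failures at a constant rate."""
    survive_one = (1.0 - annual_failure_rate) ** (days / 365.0)
    return 1.0 - survive_one ** components


DRIVES = 3000
AFR = 0.04  # hypothetical annual failure rate per drive

print(expected_failures_per_year(DRIVES, AFR))
print(prob_any_failure_in_days(DRIVES, AFR, days=1))
print(prob_any_failure_in_days(DRIVES, AFR, days=7))
```

Under that assumed rate, the cluster would lose about 120 drives a year, or roughly one every three days, before even counting server failures.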

This is part of the strength of Hadoop, which has built-in fault tolerance and fault-compensation capabilities. The same goes for HDFS: data is divided into blocks, and copies of these blocks are stored on other servers across the Hadoop cluster.

Case Study

Let us consider a file that contains the telephone numbers of all the residents of the United States of America. People whose last name starts with A could be stored on server 1, people whose last name begins with B on server 2, and so on.

In a Hadoop environment, pieces of this phonebook would be distributed across the entire cluster. To reconstruct the entire phonebook, your program would need to access the blocks from every server in the cluster. To achieve higher availability, HDFS by default replicates each of these smaller pieces onto two additional servers. One could object to this redundancy, but the argument for it is that it guards against failures and provides fault tolerance.
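
The phonebook example can be sketched in a few lines. This is a toy model under stated assumptions: the round-robin placement is a simplification and not the placement policy HDFS actually uses, and the names, numbers and server ids are made up. Each piece is stored on three servers, so the whole file can still be rebuilt when some servers are down.

```python
# Toy phonebook stored as replicated blocks. Placement is round-robin here,
# which is a simplification and NOT the placement policy HDFS actually uses.

REPLICATION = 3  # one copy plus two additional servers, as described above


def place_blocks(blocks, servers, replication=REPLICATION):
    """Return server -> {block id: data}, each block on `replication` servers."""
    storage = {s: {} for s in servers}
    for i, (block_id, data) in enumerate(sorted(blocks.items())):
        for k in range(replication):
            server = servers[(i + k) % len(servers)]
            storage[server][block_id] = data
    return storage


def reconstruct(block_ids, storage, failed=()):
    """Rebuild the file from any surviving replica of each block."""
    parts = []
    for block_id in block_ids:
        for server, held in storage.items():
            if server not in failed and block_id in held:
                parts.append(held[block_id])
                break
        else:
            raise IOError(f"all replicas of {block_id} are unavailable")
    return "".join(parts)


# Made-up entries, one block per initial letter.
phonebook = {
    "A": "Adams 555-0101\n",
    "B": "Baker 555-0102\n",
    "C": "Clark 555-0103\n",
}
servers = ["s1", "s2", "s3", "s4", "s5"]
storage = place_blocks(phonebook, servers)
print(reconstruct(["A", "B", "C"], storage, failed={"s1", "s2"}))
```

With two of the five servers marked as failed, every block still has at least one surviving replica. Only when all three servers holding the same block fail does the read raise an error.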

This redundancy can be increased or decreased on a per-file basis or for the whole environment, and it offers multiple benefits. The most obvious is that the data is highly available. In addition, data redundancy allows the Hadoop cluster to break work up into smaller chunks and run those smaller jobs on all the servers in the cluster for better scalability. Finally, as end users we get the benefit of data locality, which is critical when working with large data sets.
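
The effect of the replication factor on availability can be estimated with a one-line formula. Two assumptions are made here: servers fail independently, which is optimistic given that a whole rack can fail at once, and any one server has a 1% chance of being down, which is an illustrative figure only. Under those assumptions, a block is unreachable only when every server holding a replica is down at the same time.

```python
# Chance that every replica of one block is unavailable at the same time.
# ASSUMPTIONS: servers fail independently; the 1% figure is illustrative only.

def block_unavailable_prob(server_down_prob, replication):
    return server_down_prob ** replication


for r in (1, 2, 3):
    print(r, block_unavailable_prob(0.01, r))
```

In this simplified model each additional replica cuts the chance of losing access to a block by a factor of one hundred. The price is storing the data one more time.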

Conclusion

We have seen that HDFS is one of the major components of the Apache Hadoop ecosystem. It is the underlying storage layer, and because it distributes and replicates data across many servers, it can hold far more data, and tolerate far more failures, than a local file system on a single machine.

I hope you have enjoyed the article and now understand the basic concepts of HDFS. Keep reading.

About the Author

Kaushik Pal is a technical architect with 15 years of experience in enterprise application and product development. He has expertise in web technologies, architecture/design, Java/J2EE, open source and big data technologies. You can find more of his work at www.techalpine.com.
