What is Hadoop Distributed File System (HDFS)? Architecture, Features & Operations

Last updated:
3rd Feb, 2020

Hadoop Distributed File System, or HDFS, is Hadoop’s primary storage system. It stores large data files across clusters of commodity hardware. This storage system is scalable, easily expandable, and fault-tolerant.

When there is too much data to store on one physical machine, it becomes necessary to divide storage across several machines to avoid loss of data. HDFS is one such distributed file storage system: it manages storage operations across many physical machines. Here is an HDFS tutorial to help you understand how this system works. Let’s start with its architecture.

HDFS Architecture

Hadoop Distributed File System has a master-slave architecture with the following components:

  1. Namenode: The namenode is a commodity machine that runs the GNU/Linux operating system and the namenode software. The machine running the namenode acts as the master server. Its tasks include regulating how clients access files, managing the file system namespace, and executing operations such as opening, closing, and renaming files and directories.
  2. Datanode: A datanode is a commodity machine that runs the GNU/Linux operating system and the datanode software. Each worker node in a cluster runs a datanode, which manages the storage attached to that machine. Datanodes serve read and write requests from clients, and they create, replicate, and delete blocks according to the instructions given by the namenode.
  3. Block: All user data is stored in HDFS files. Every file is divided into one or more segments, which are then stored on datanodes. These file segments are called blocks, and the block is the basic unit in which HDFS reads and writes data. By default, a block is 64 MB in older Hadoop releases and 128 MB in Hadoop 2.x and later; the size can be changed in the HDFS configuration (the dfs.blocksize property).
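
To make the block idea concrete, here is a minimal sketch of how a file's size maps to a list of block sizes. The function name split_into_blocks and the 128 MB default are assumptions made for this illustration; a real cluster uses whatever block size its configuration specifies.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes in bytes of the blocks needed for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the bytes that are left over.
        sizes.append(remainder)
    return sizes

blocks = split_into_blocks(300 * MB)
print(len(blocks), [size // MB for size in blocks])  # 3 [128, 128, 44]
```

Note that the last block is smaller than the others: a 300 MB file needs two full 128 MB blocks and one 44 MB block.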

The HDFS architecture gives a clear picture of how HDFS works. A cluster consists of several datanodes but just a single namenode. The namenode stores the metadata, while the datanodes do the actual work of storing the data on their local disks. Datanodes are organised into racks, and the replicas of a block are spread across racks to improve fault tolerance and data reliability. Clients have to interact with the namenode to read or write a file. The datanodes and the namenode are in constant contact with each other. Datanodes are also responsible for replicating data to other datanodes.

Read and write operations in HDFS take place at the block level. The concept of data replication is central to how HDFS works: replicas of each block are created and distributed across the cluster, so the data stays available when a node fails.
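
The idea of distributing replicas across the cluster can be sketched as follows. This is a deliberately simplified round-robin policy written for illustration, not the placement algorithm HDFS itself uses; the function place_replicas, the rack names, and the datanode names are all hypothetical.

```python
def place_replicas(racks, replication=3):
    """Pick datanodes for one block, spreading replicas over as many racks as possible.

    racks maps a rack name to the list of datanode names on that rack.
    Simplified round-robin policy, for illustration only.
    """
    queues = [list(nodes) for _, nodes in sorted(racks.items())]
    chosen = []
    while len(chosen) < replication and any(queues):
        for queue in queues:
            if queue and len(chosen) < replication:
                chosen.append(queue.pop(0))
    return chosen

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(racks))  # ['dn1', 'dn3', 'dn2']
```

With three replicas and two racks, the sketch puts copies on both racks, so losing a single node or a single rack still leaves at least one copy reachable.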

HDFS Operations

HDFS and the Linux file system are quite similar to each other. HDFS therefore lets us perform the operations we are used to performing on local file systems: we can create a directory, change permissions, copy files, and do a lot more. It also has familiar file access rights, including read, write, and execute.
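
As a small illustration, the sketch below builds a few command lines for the hdfs dfs shell that ships with Hadoop (create a directory, upload a file, change permissions, list a directory). The helper hdfs_dfs and the paths /user/demo/input and local.txt are assumptions made for this example. The code only prints the commands; it does not run them.

```python
import shlex

def hdfs_dfs(*args):
    """Build the argument list for one `hdfs dfs` call (nothing is executed here)."""
    return ["hdfs", "dfs", *args]

commands = [
    hdfs_dfs("-mkdir", "-p", "/user/demo/input"),             # create a directory
    hdfs_dfs("-put", "local.txt", "/user/demo/input/"),       # copy a local file into HDFS
    hdfs_dfs("-chmod", "640", "/user/demo/input/local.txt"),  # change permissions
    hdfs_dfs("-ls", "/user/demo/input"),                      # list the directory
]

for argv in commands:
    print(shlex.join(argv))
```

On a machine with a configured Hadoop client, each list could be passed to subprocess.run(argv, check=True) to execute it.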

Read operation in HDFS: If you want to read a file stored in HDFS, you first have to interact with the namenode. As already mentioned, all the metadata is stored in the namenode. When you contact the namenode, it gives you the addresses of the datanodes where the blocks of the file are stored. You then contact those datanodes and read the data from there.

In practice, you call the file system’s API, which asks the namenode for the block addresses. Before giving out this information, the namenode checks whether you have the right to access the data. After this check, the namenode either shares the block locations or denies access.

The namenode also gives you a token, which you must present to the respective datanode to access a file. This is a security mechanism that HDFS employs to ensure that only authorised clients access the data. The datanode lets you read the file only after you present the token.
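
The read sequence described above (permission check, block locations plus a token, then reading from the datanodes) can be modelled with a toy in-memory example. Every class, method, path, and user name here is hypothetical, and the way this sketch registers tokens with datanodes is a simplification chosen for brevity; it is not how a real HDFS cluster validates tokens.

```python
import secrets

class DataNode:
    def __init__(self):
        self.blocks = {}            # block id -> bytes
        self.valid_tokens = set()   # tokens accepted by this toy datanode

    def read(self, block_id, token):
        if token not in self.valid_tokens:
            raise PermissionError("invalid or missing token")
        return self.blocks[block_id]

class NameNode:
    def __init__(self):
        self.files = {}  # path -> (set of allowed users, [(block id, datanode), ...])

    def open(self, path, user):
        allowed, locations = self.files[path]
        if user not in allowed:
            raise PermissionError(f"{user} may not read {path}")
        token = secrets.token_hex(8)
        for _, datanode in locations:
            datanode.valid_tokens.add(token)  # toy shortcut, see lead-in
        return token, locations

def read_file(namenode, path, user):
    """Ask the namenode for locations and a token, then read each block in order."""
    token, locations = namenode.open(path, user)
    return b"".join(dn.read(block_id, token) for block_id, dn in locations)

dn1, dn2 = DataNode(), DataNode()
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"world"
nn = NameNode()
nn.files["/demo.txt"] = ({"alice"}, [("blk_1", dn1), ("blk_2", dn2)])

print(read_file(nn, "/demo.txt", "alice"))  # b'hello world'
```

A user who is not in the allowed set is rejected by the namenode, and a datanode rejects any read that arrives without a token it knows.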

Write operation: The write operation follows the same initial pattern. You ask the namenode for permission to write data. In return, it gives you the location of the datanode on which the write operation has to be performed. The datanode that receives the data then replicates the written blocks to other datanodes. Once the replication is done, you receive an acknowledgement. The authentication mechanism for the write operation is the same as for the read operation.
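
The write-then-replicate-then-acknowledge flow can be sketched as a chain of datanodes, each storing the block and passing it on. This is a toy model under stated assumptions: datanodes are plain dictionaries, and the function write_block is hypothetical.

```python
def write_block(block_id, data, pipeline):
    """Store data on the first datanode and forward it down the pipeline.

    Each datanode is modelled as a dict of block id -> bytes.
    Returns True (the acknowledgement) only once every node holds the block.
    """
    if not pipeline:
        return False  # nowhere to write, so nothing can be acknowledged
    head, rest = pipeline[0], pipeline[1:]
    head[block_id] = data
    # The acknowledgement travels back only after the downstream nodes confirm.
    return write_block(block_id, data, rest) if rest else True

dn1, dn2, dn3 = {}, {}, {}
ack = write_block("blk_1", b"some data", [dn1, dn2, dn3])
print(ack, all("blk_1" in dn for dn in (dn1, dn2, dn3)))  # True True
```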

HDFS Features

  1. Availability: HDFS offers high availability. It replicates data in the form of block replicas on the datanodes (slaves) throughout a cluster. To access the data, a client reads from any datanode that holds the blocks it is looking for.
  2. Reliability: HDFS is a highly reliable data storage system that can store data in the range of petabytes. It splits the data into blocks and stores those blocks on the nodes of a cluster.
  3. Fault tolerance: This feature describes how well HDFS keeps working when conditions are less than ideal. It safeguards your data from the effects of unforeseen events. As already mentioned, data is replicated on different machines. What happens when one of these machines stops working? In a system without replication this could be a significant problem, but not in HDFS: you can access your data from any other machine that holds a copy of the data blocks you are looking for.
  4. Scalability: HDFS uses different nodes in a cluster to store data. When storage requirements rise, you can scale the cluster. HDFS provides two mechanisms to do so: horizontal scaling (adding more nodes) and vertical scaling (adding capacity to existing nodes).
  5. Replication: Replication is a core feature of HDFS. It minimises data loss due to unfavourable events such as node crashes and hardware failures. Blocks are replicated on different machines, so there is no data loss if a machine goes down. You can use any other machine that holds a replica to get your data.
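
The fault-tolerance behaviour described in the list above, reading from another machine when one replica is unavailable, can be sketched like this. The function read_with_failover and the datanode names are hypothetical, and the datanodes are again modelled as plain dictionaries.

```python
def read_with_failover(block_id, replicas, down=frozenset()):
    """Return (datanode name, data) from the first live datanode holding the block.

    replicas maps a datanode name to its dict of block id -> bytes.
    down is the set of datanode names that are currently unreachable.
    """
    for name, store in replicas.items():
        if name in down:
            continue  # skip failed machines and try the next replica
        if block_id in store:
            return name, store[block_id]
    raise OSError(f"no live replica holds {block_id}")

replicas = {
    "dn1": {"blk_1": b"payload"},
    "dn2": {"blk_1": b"payload"},
    "dn3": {"blk_1": b"payload"},
}
print(read_with_failover("blk_1", replicas, down={"dn1"}))  # ('dn2', b'payload')
```

The read only fails when every machine holding a replica is down at the same time.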

HDFS Objectives

  1. Managing huge data sets: HDFS has the architecture required to manage applications that work with huge data sets. Depending on the size of the data sets in question, a cluster can have hundreds of nodes.
  2. Fault detection and recovery: Because HDFS runs on a large number of commodity machines, component failures are frequent. This is not a disadvantage in itself, since every system built on commodity hardware is open to failure. The question is whether the system can detect those failures quickly and automatically and then recover from them. HDFS is designed to do exactly that.
  3. Increased throughput: HDFS processes tasks efficiently because the actual computation is carried out near the data itself. This is especially important when we are dealing with huge data sets. This mechanism increases throughput and significantly reduces network traffic.
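
Fault detection and recovery can be illustrated with a heartbeat check: datanodes report to the namenode regularly, and a node that stays silent for too long is treated as failed so that its blocks can be copied elsewhere. The sketch below is a simplified model; the function names, the 30-second timeout, and the sample timestamps are all assumptions made for illustration and are not HDFS defaults.

```python
def find_dead_nodes(last_heartbeat, now, timeout=30.0):
    """Return the datanodes whose last heartbeat is older than timeout seconds."""
    return sorted(name for name, seen in last_heartbeat.items() if now - seen > timeout)

def blocks_to_rereplicate(block_locations, dead, replication=3):
    """Return {block id: number of missing copies} for under-replicated blocks."""
    missing = {}
    for block_id, nodes in block_locations.items():
        live = [node for node in nodes if node not in dead]
        if len(live) < replication:
            missing[block_id] = replication - len(live)
    return missing

heartbeats = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0}
dead = find_dead_nodes(heartbeats, now=105.0)
locations = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn1", "dn2", "dn4"]}

print(dead)                                    # ['dn3']
print(blocks_to_rereplicate(locations, dead))  # {'blk_1': 1}
```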

To conclude, we would like to say that HDFS can store huge amounts of data in a reliable manner and without feeling the effects of hardware failure. It is also highly fault-tolerant, highly available, and highly scalable.

Utkarsh Singh

Blog Author
Frequently Asked Questions (FAQs)

1. What is Hadoop?

Hadoop is an open-source software framework developed by the Apache Software Foundation for building data-intensive, distributed computing applications. Hadoop is built to scale from a single computer to thousands of computers. A core Hadoop idea is that errors are handled at the application layer rather than by relying on hardware for dependability. Its design was influenced by Google's work on MapReduce and the Google File System. It is a top-level Apache project, written in the Java programming language by a global community of programmers.

2. What do distributed file systems do?

A distributed file system, commonly known as DFS, is a file system that keeps data on one or more servers. The data is accessed and processed as if it were stored locally on the client computer. A DFS allows users on a network to share information and files in a controlled, authorised manner. Client users can transfer files and store data as if they were doing so locally, while the servers keep full control over the data and grant access to the clients. In this way, a DFS provides practical and well-managed shared data storage on a network.

3. What are the disadvantages of HDFS?

Hadoop's fundamental flaw is that it isn't designed for small data sets. Because its architecture targets large capacity, HDFS does not handle random reads of small files efficiently. Hadoop is a framework for batch processing: it accepts a large quantity of data, processes it, and outputs the result. Batch processing is efficient for large amounts of data, but how long it takes depends on the size of the data and the system's computational capability, which can cause a substantial delay in the output. The MapReduce framework also does not make full use of the Hadoop cluster's memory. As a result, Hadoop is not suited to streaming data or real-time processing.
