Hadoop Distributed File System (HDFS) is Hadoop’s primary storage system. It stores large data files across clusters of commodity hardware. This storage system is scalable, easily expandable, and fault-tolerant.
When there is too much data to store on one physical machine, storage has to be divided across several machines. HDFS is one such distributed file system: it manages storage operations across many physical machines and keeps copies of the data so that it is not lost when a machine fails. Here is an HDFS tutorial to help you understand how this system works. Let’s start with its architecture.
Hadoop Distributed File System has a master-slave architecture with the following components:
- Namenode: The namenode is commodity hardware that runs the GNU/Linux operating system and the namenode software. The machine hosting the namenode acts as the master server. Its tasks include regulating client access to files, managing the file system namespace, and executing namespace operations such as opening, closing, and renaming files and directories.
- Datanode: A datanode is commodity hardware that runs GNU/Linux and the datanode software. Every worker node in a cluster runs a datanode, which manages the storage attached to that machine. Datanodes serve read and write requests from clients, and they create, replicate, and delete blocks on instruction from the namenode.
- Block: All user data is stored in HDFS files. Each file is divided into one or more segments, which are then stored on datanodes. These file segments are called blocks, so a block is the smallest unit of data that HDFS reads or writes. The default block size is 64 MB in Hadoop 1.x (128 MB from Hadoop 2.x onwards), and it can be changed in the HDFS configuration.
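To make the block idea concrete, here is a minimal sketch of how a file's size maps to blocks. The function name `split_into_blocks` is hypothetical (it is not part of any Hadoop API), and the 64 MB default is only an assumption that matches the figure quoted above. The sketch does the arithmetic only; it does not touch a real cluster.

```python
MB = 1024 * 1024

def split_into_blocks(file_size: int, block_size: int = 64 * MB) -> list[int]:
    """Return the sizes (in bytes) of the blocks a file would be split into.

    Hypothetical helper for illustration; block_size is an assumed default.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the bytes that are left over.
        sizes.append(remainder)
    return sizes

blocks = split_into_blocks(200 * MB)
print(len(blocks), [size // MB for size in blocks])  # 4 [64, 64, 64, 8]
```

A 200 MB file therefore needs three full blocks plus one 8 MB block. With a larger configured block size, the same file would need fewer blocks and fewer entries in the namenode's metadata.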
The HDFS architecture gives a clear picture of how HDFS works. A cluster consists of several datanodes but just a single namenode. The namenode stores the metadata, while the datanodes do the actual work of storing the data. Nodes are organised into racks, and the replicas of a block are spread across racks to improve fault tolerance and data reliability. Clients have to interact with the namenode to read or write a file. Each datanode stores its blocks on its local disk. Datanodes and the namenode stay in constant contact with each other. Datanodes are also responsible for replicating blocks to other datanodes.
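Datanodes stay in touch with the namenode by sending periodic heartbeats. The sketch below is a toy model of that bookkeeping, not Hadoop code: the class `NameNodeMonitor`, its methods, and the 30-second timeout are all assumptions chosen for illustration, and the timeout is not the real HDFS default.

```python
class NameNodeMonitor:
    """Toy model: track the last heartbeat received from each datanode."""

    def __init__(self, timeout_s: float = 30.0):
        # Illustrative timeout only; a real cluster configures its own value.
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, datanode_id: str, now: float) -> None:
        """Record that a datanode reported in at time `now` (seconds)."""
        self.last_seen[datanode_id] = now

    def dead_nodes(self, now: float) -> list[str]:
        """Return datanodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )

monitor = NameNodeMonitor(timeout_s=30.0)
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=25.0)   # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=40.0))  # ['dn2']
```

Passing the clock in as an argument keeps the example deterministic. A real monitor would read the system clock instead.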
Read and write operations in HDFS take place at the smallest level, i.e. the block level. The concept of data replication is central to how HDFS works: high availability of data during node failure is ensured by creating replicas of blocks and distributing them across the cluster.
HDFS and the Linux file system are quite similar to each other. So, HDFS allows us to perform many of the operations that we are used to performing with local file systems: we can create a directory, change permissions, copy files, and do a lot more. We also have several file access rights, including read, write, and execute.
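As a rough illustration of these file operations, the sketch below builds a few `hdfs dfs` command lines in Python. The subcommands `-mkdir -p`, `-put`, `-chmod`, and `-ls` are standard HDFS shell commands, but the helper `hdfs_cmd` and the paths are hypothetical. The script only prints the commands, so it runs without a Hadoop installation.

```python
import shlex

def hdfs_cmd(*args: str) -> list[str]:
    """Build an argument list for the `hdfs dfs` command-line client."""
    return ["hdfs", "dfs", *args]

# Hypothetical paths, chosen only for illustration.
commands = [
    hdfs_cmd("-mkdir", "-p", "/user/demo/input"),               # create a directory
    hdfs_cmd("-put", "local.txt", "/user/demo/input/"),         # copy a local file in
    hdfs_cmd("-chmod", "640", "/user/demo/input/local.txt"),    # change permissions
    hdfs_cmd("-ls", "/user/demo/input"),                        # list the directory
]

for cmd in commands:
    print(shlex.join(cmd))
```

On a machine with a configured Hadoop client, each list could be passed to `subprocess.run(cmd, check=True)` to actually execute it.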
Read operation in HDFS: To read a file stored in HDFS, you first have to interact with the namenode, because, as already mentioned, it holds all the metadata. The namenode gives you the addresses of the datanodes where the blocks of the file are stored. You can then contact those datanodes directly and read the data from there.
In practice, you interact with the file system’s API, which asks the namenode for the block locations. Before giving out this information, the namenode checks whether you have the right to access the data. Once this check is done, the namenode either shares the block locations or denies access.
The namenode also gives you a token, which you have to present to the respective datanode to access the file. This is a security mechanism that HDFS employs to ensure that only authorised clients access data. The datanode will only let you read the data after you present the token.
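The read path described above can be sketched as a small in-memory model. Everything here is hypothetical: the `NameNode` and `DataNode` classes, the `acl` and `block_map` fields, and the way tokens are registered on datanodes are simplifications for illustration and do not reflect how Hadoop actually issues or verifies tokens.

```python
import secrets
from dataclasses import dataclass, field

@dataclass
class DataNode:
    name: str
    blocks: dict[str, bytes] = field(default_factory=dict)
    valid_tokens: set[str] = field(default_factory=set)

    def read_block(self, block_id: str, token: str) -> bytes:
        # The datanode serves a block only if the client presents a known token.
        if token not in self.valid_tokens:
            raise PermissionError(f"{self.name}: invalid token")
        return self.blocks[block_id]

@dataclass
class NameNode:
    # Metadata only: file path -> ordered list of (block id, datanode).
    block_map: dict[str, list[tuple[str, DataNode]]] = field(default_factory=dict)
    # Simplified access control: file path -> users allowed to read it.
    acl: dict[str, set[str]] = field(default_factory=dict)

    def open(self, user: str, path: str) -> list[tuple[str, DataNode, str]]:
        """Check permissions, then return block locations plus a token each."""
        if user not in self.acl.get(path, set()):
            raise PermissionError(f"{user} may not read {path}")
        located = []
        for block_id, datanode in self.block_map[path]:
            token = secrets.token_hex(8)
            datanode.valid_tokens.add(token)  # simplification of token checking
            located.append((block_id, datanode, token))
        return located

def read_file(namenode: NameNode, user: str, path: str) -> bytes:
    """Ask the namenode where the blocks are, then read them from datanodes."""
    parts = [dn.read_block(block_id, token)
             for block_id, dn, token in namenode.open(user, path)]
    return b"".join(parts)

dn1 = DataNode("dn1", blocks={"blk_1": b"hello "})
dn2 = DataNode("dn2", blocks={"blk_2": b"world"})
nn = NameNode(
    block_map={"/demo/greeting.txt": [("blk_1", dn1), ("blk_2", dn2)]},
    acl={"/demo/greeting.txt": {"alice"}},
)
print(read_file(nn, "alice", "/demo/greeting.txt"))  # b'hello world'
```

Note that the file contents never pass through the namenode in this model: it hands out locations and tokens, and the client fetches the bytes from the datanodes.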
Write operation: The write operation follows the same initial pattern. You ask the namenode for permission to write data. In return, it gives you the location of the datanode on which the write has to be performed. The datanode that receives your data then replicates the written blocks to other datanodes. Once the replication is done, you receive an acknowledgement. The authentication mechanism for writes is the same as for reads.
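The write-then-replicate flow can be sketched as a chain of datanodes, where each node stores the block, forwards it to the next node, and passes the acknowledgement back. This is a toy model under stated assumptions: datanodes are plain dictionaries, `write_block` is a hypothetical function, and three replicas are used because that is the usual HDFS default replication factor.

```python
def write_block(block_id: str, data: bytes,
                pipeline: list[dict[str, bytes]]) -> int:
    """Store a block on each datanode in the pipeline, in order.

    Returns the number of acknowledgements that flowed back to the caller.
    """
    if not pipeline:
        return 0
    head, rest = pipeline[0], pipeline[1:]
    head[block_id] = data                                  # store the replica locally
    downstream_acks = write_block(block_id, data, rest)    # forward to the next node
    return downstream_acks + 1                             # ack travels back upstream

# Three toy datanodes, each modelled as a dict of block id -> bytes.
dn1, dn2, dn3 = {}, {}, {}
acks = write_block("blk_1", b"payload", [dn1, dn2, dn3])
print(acks)  # 3
```

In this model the write counts as complete when the number of acknowledgements equals the length of the pipeline, which mirrors the acknowledgement the client receives once replication is done.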
HDFS offers the following key features:
- Availability: HDFS is designed for high availability. It replicates data in the form of block replicas on the datanodes (slaves) throughout a cluster. To access the data, a client interacts with any datanode that holds the blocks it is looking for.
- Reliability: Hadoop Distributed File System is a highly reliable data storage system. The amount of data stored on HDFS can run into petabytes. Data is split into blocks, and the blocks are stored on the nodes of the cluster.
- Fault tolerance: This feature is what keeps HDFS working when conditions are less than ideal. It safeguards your data against unforeseen events. As already mentioned, data is replicated on different machines. What happens when one of these machines stops working? With a system that keeps a single copy this could be a significant problem, but not with HDFS: you can access your data from any other machine that holds a copy of the data blocks you are looking for.
- Scalability: HDFS uses different nodes in a cluster to store data. When storage requirements rise, you can scale the cluster. HDFS supports two ways of doing so: vertical scaling (adding capacity to existing nodes) and horizontal scaling (adding more nodes to the cluster).
- Replication: Replication is a core feature of HDFS. It minimises data loss due to unfavourable events such as node crashes and hardware failures. Blocks are replicated on different machines, and replicas are maintained continuously, so there is no data loss if a machine goes down. You can use another machine to get your data.
- Managing huge data sets: HDFS has the architecture required to support applications that work with huge data sets. Depending on the size of the data sets in question, a single cluster can have hundreds of nodes.
- Fault detection and recovery: HDFS is strong at detecting faults and dealing with them appropriately. Because a cluster contains a large number of commodity machines, component failures are frequent. This is not a disadvantage in itself, since every system built on commodity hardware is exposed to failure. What matters is whether the system can detect those failures quickly and automatically and recover from them, and HDFS can.
- Increased throughput: HDFS handles tasks efficiently because the actual computation is carried out near the data itself. This is especially important when dealing with huge data sets. This mechanism increases throughput and significantly reduces network traffic.
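The fault-tolerance and recovery behaviour in the list above can be illustrated with a small sketch. Both functions, the node and block names, and the target of three replicas are assumptions made for illustration. The code models the idea of falling back to a surviving replica and spotting under-replicated blocks, not Hadoop's actual implementation.

```python
def read_with_failover(block_id: str, replicas: dict[str, list[str]],
                       live: set[str]) -> str:
    """Return the first live datanode that holds a replica of the block."""
    for node in replicas[block_id]:
        if node in live:
            return node
    raise IOError(f"no live replica for {block_id}")

def under_replicated(replicas: dict[str, list[str]], live: set[str],
                     target: int = 3) -> dict[str, int]:
    """Map each block with too few live replicas to its number of missing copies."""
    missing = {}
    for block_id, nodes in replicas.items():
        live_count = sum(1 for node in nodes if node in live)
        if live_count < target:
            missing[block_id] = target - live_count
    return missing

# Hypothetical cluster state: dn1 has failed, the other nodes are alive.
replicas = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
live = {"dn2", "dn3", "dn4"}

print(read_with_failover("blk_1", replicas, live))  # dn2
print(under_replicated(replicas, live))             # {'blk_1': 1}
```

The read succeeds even though `dn1` is down, and the second function shows which blocks would need a new copy to get back to the target replica count.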
To conclude, HDFS can store huge amounts of data reliably, even when individual pieces of hardware fail. It is also highly fault-tolerant, highly available, and highly scalable.