
What Is Hadoop YARN Architecture & Its Components

Last updated:
5th Oct, 2022

Hadoop YARN Introduction

In Hadoop v2.0, YARN (Yet Another Resource Negotiator) is the core resource-management component. It lets batch, stream, interactive, and graph processing engines all work on data stored in HDFS. In Hadoop 1.0, the Job Tracker handled both application management and resource management; YARN’s architecture decouples these functions. 

Through YARN, various data processing engines can share cluster resources while working on data in HDFS, which improves overall cluster utilization. YARN schedules application processing through its distinct components and allocates resources to applications dynamically as they are needed. This systematic resource management is crucial for processing vast volumes of data effectively. 

In my experience, leveraging YARN has streamlined our data processing workflows, enhancing overall efficiency and scalability. Its ability to manage resources dynamically has been instrumental in meeting our organization’s growing data processing needs. Aspiring professionals entering the field should grasp the significance of YARN in modern data processing architectures for optimal performance and scalability. 


Why YARN?

In Hadoop v1.0, also known as MapReduce Version 1 (MRV1), MapReduce performed both resource management and data processing. A single master, the Job Tracker, was responsible for both. 


In Hadoop version 1.0, also known as MapReduce version 1 (MRV1), MapReduce performed both processing and resource management by itself. Its Job Tracker module was responsible for everything: as the single master, it allocated resources to applications, scheduled work on demand, and monitored the processing jobs in the system. The Job Tracker assigned map and reduce tasks to worker processes called Task Trackers, which periodically reported their progress back to it.

The main problem was the design of a single master for everything, which became a bottleneck. Computational resources were also used inefficiently, so scalability became an issue with this version of Hadoop.

YARN, a core component of the successor Hadoop version 2.0, introduced in 2012 by Yahoo and Hortonworks, resolves this issue. The basic idea is to separate resource management and job scheduling from MapReduce instead of relying on a single master. YARN is now responsible for job scheduling and resource management. 

 In Hadoop 2.0, YARN introduced the concepts of the Application Master and the Resource Manager. The Resource Manager monitors resource utilization across the Hadoop cluster. 

  YARN became popular because of the following features:

  1. Multi-tenancy: YARN allows multiple data processing engines, such as batch, stream, interactive, and graph processing engines, to share the same cluster. This gives an organization the benefit of multi-tenancy.
  2. Cluster Utilization: YARN allocates cluster resources dynamically, so the cluster is used more efficiently.
  3. Compatibility: YARN is backward compatible with Hadoop 1.0 in the sense that existing MapReduce applications written for Hadoop 1.0 can run on YARN.
  4. Scalability: The scheduler in YARN’s Resource Manager allows Hadoop to manage and extend clusters of thousands of nodes.


Components of YARN

  • Container: 

A Container is a bundle of physical resources on a single node, such as RAM, CPU cores, and disk. Containers are invoked through a Container Launch Context (CLC), a record that holds data such as dependencies, security tokens, and environment variables.

  1. A Container grants an application the right to use a specified amount of resources (memory, CPU) on a specific host. The application can use those resources only after the Container has been granted. 
  2. The Container Launch Context (CLC) is used to manage YARN Containers. This record stores the commands needed to create the process, along with the payload for Node Manager services, security tokens, dependencies, and a map of environment variables. 
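
To make the idea of a launch record concrete, here is a minimal Python sketch of a Container and its Container Launch Context as plain data structures. The class and field names are illustrative assumptions chosen to mirror the description above; they are not the Hadoop API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContainerLaunchContext:
    """Illustrative record only; field names are assumptions, not the Hadoop API."""
    commands: List[str]                                             # commands that create the process
    environment: Dict[str, str] = field(default_factory=dict)      # map of environment variables
    dependencies: Dict[str, str] = field(default_factory=dict)     # name -> location of needed files
    security_tokens: bytes = b""                                    # opaque security tokens
    service_payload: Dict[str, bytes] = field(default_factory=dict)  # payload for Node Manager services


@dataclass
class Container:
    """A grant of resources on one specific host."""
    container_id: str
    host: str
    memory_mb: int
    vcores: int


def describe_launch(container: Container, clc: ContainerLaunchContext) -> str:
    """Summarise what a Node Manager would be asked to start."""
    return (f"{container.container_id} on {container.host}: "
            f"{container.memory_mb} MB, {container.vcores} vcores, "
            f"run {' && '.join(clc.commands)}")


# Hypothetical example values.
clc = ContainerLaunchContext(commands=["python worker.py"],
                             environment={"APP_ENV": "demo"})
container = Container("container_01", "node-1", memory_mb=2048, vcores=2)
print(describe_launch(container, clc))
```

The point of the sketch is the separation: the Container says how much of which host may be used, while the launch context says what to run there and with which environment.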
  • Application Master: 

An application is a single job submitted to a framework. The Application Master is responsible for monitoring the application’s progress, tracking its status, and negotiating resources with the Resource Manager. To run its work, the Application Master requests a Container from the Node Manager by sending a Container Launch Context (CLC), which includes everything the application needs to run. Once the application has started, the Application Master sends health reports to the Resource Manager from time to time.

  •  Node Manager: 

The Node Manager takes care of an individual node in the Hadoop cluster and manages the containers on that node. It registers with the Resource Manager and sends it the node’s health status. Its primary goal is to manage the containers that the Resource Manager assigns to its node.

When the Application Master requests a Container by sending a CLC (Container Launch Context), which includes everything the application needs to execute, the Node Manager creates the requested container process and runs it. The Node Manager also monitors resource usage by individual Containers and reports it to the Resource Manager, so the two collaborate to manage resource usage on each node in the cluster. The Node Manager can kill containers if directed by the Resource Manager, and it handles log management for its node. 

In short, each Node Manager looks after one node: it manages the node’s application workflow, performs log management, monitors resource usage, creates container processes at the Application Master’s request, and kills containers when the Resource Manager directs it to. 
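
The responsibilities listed above can be sketched as a small Python class. This is a toy model under stated assumptions: the class name, method names, and the rule that an oversized request is rejected on the node are all hypothetical and are not taken from Hadoop’s NodeManager implementation.

```python
class NodeManagerSketch:
    """Toy model of one node; not the Hadoop NodeManager class."""

    def __init__(self, node_id, memory_mb, vcores):
        self.node_id = node_id
        self.capacity = {"memory_mb": memory_mb, "vcores": vcores}
        self.containers = {}  # container_id -> resources held by that container
        self.log = []         # stand-in for the node's log management

    def usage(self):
        """Total resources currently held by running containers."""
        return {
            "memory_mb": sum(c["memory_mb"] for c in self.containers.values()),
            "vcores": sum(c["vcores"] for c in self.containers.values()),
        }

    def start_container(self, container_id, memory_mb, vcores):
        """Create a container process if the node still has room (assumed rule)."""
        used = self.usage()
        if (used["memory_mb"] + memory_mb > self.capacity["memory_mb"]
                or used["vcores"] + vcores > self.capacity["vcores"]):
            self.log.append(f"rejected {container_id}")
            return False
        self.containers[container_id] = {"memory_mb": memory_mb, "vcores": vcores}
        self.log.append(f"started {container_id}")
        return True

    def kill_container(self, container_id):
        """Kill a container, as the Resource Manager may direct."""
        if self.containers.pop(container_id, None) is not None:
            self.log.append(f"killed {container_id}")

    def heartbeat(self):
        """Status report that would be sent to the Resource Manager."""
        return {"node": self.node_id, "used": self.usage(),
                "containers": sorted(self.containers)}


nm = NodeManagerSketch("node-1", memory_mb=4096, vcores=4)
nm.start_container("c1", memory_mb=2048, vcores=2)
print(nm.heartbeat())
```

Each call appends to `log`, which stands in for the log management mentioned above; the heartbeat dictionary is the kind of usage report the Resource Manager would consume.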

  • Resource Manager: 

The Resource Manager is the master daemon of YARN and is responsible for resource management and assignment across all applications. It forwards the requests it receives to the corresponding Node Manager and allocates resources to each application according to its needs. 

  1. It optimizes cluster utilization, keeping resources in use while respecting constraints such as SLAs, fairness, and capacity guarantees.
  2. It allocates the available resources.
  3. It arbitrates cluster resources among competing applications.
  4. The actual processing of requests takes place on the nodes and is managed by the Node Managers. Whenever the Resource Manager receives a processing request, it passes the request in parts to the corresponding Node Managers.
  5. It is the highest authority for the allocation of resources.

  The Resource Manager has two primary components:

  •  Application Manager – 

The Application Manager is responsible for managing the set of submitted applications. It first validates a submitted application’s specifications and may reject the application if not enough resources are available. It also ensures that no already-submitted application has the same ID, which could be caused by an erroneous or malicious client. After validation, it forwards the application to the Scheduler. Finally, it tracks application states and manages finished applications to save Resource Manager memory: it keeps a cache of finished applications and evicts the oldest ones to make room for newly submitted applications.
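
The admission checks and the finished-application cache described above can be illustrated with a short Python sketch. Everything here is a simplifying assumption: the class name, the single memory figure used as the resource check, and the cache size are hypothetical, not YARN’s actual implementation.

```python
from collections import OrderedDict


class ApplicationsManagerSketch:
    """Toy admission control plus a bounded cache of finished applications."""

    def __init__(self, cluster_memory_mb, max_finished=2):
        self.cluster_memory_mb = cluster_memory_mb
        self.max_finished = max_finished
        self.active = {}               # app_id -> requested memory in MB
        self.finished = OrderedDict()  # app_id -> final state, oldest first

    def submit(self, app_id, memory_mb):
        """Validate a submission; an accepted one would go on to the scheduler."""
        if app_id in self.active or app_id in self.finished:
            return "REJECTED: duplicate application ID"
        if memory_mb > self.cluster_memory_mb:
            return "REJECTED: request exceeds cluster resources"
        self.active[app_id] = memory_mb
        return "ACCEPTED"

    def finish(self, app_id, state="FINISHED"):
        """Move an application to the finished cache, evicting the oldest entries."""
        self.active.pop(app_id)
        self.finished[app_id] = state
        while len(self.finished) > self.max_finished:
            self.finished.popitem(last=False)  # drop the oldest finished app


manager = ApplicationsManagerSketch(cluster_memory_mb=8192, max_finished=2)
print(manager.submit("app_1", 1024))
print(manager.submit("app_1", 1024))
```

Using an ordered mapping keeps eviction simple: the oldest finished application is always at the front, so trimming the cache is a constant-time operation.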

  •  Scheduler –

 The Scheduler allocates resources to applications based on resource availability and the applications’ resource requirements. It performs no other work: it does not monitor or track tasks, and it does not restart jobs after failure. The YARN Scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition cluster resources.
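
As a rough illustration of how the two plugin styles partition a cluster, the Python sketch below computes memory shares in the simplest possible way. It is a deliberately reduced model: the real Capacity Scheduler and Fair Scheduler support features such as hierarchical queues, weights, and preemption that are not modelled here, and the function names are made up for this example.

```python
def capacity_shares(cluster_memory_mb, queue_percent):
    """Capacity-style split: each queue is guaranteed a configured percentage."""
    if sum(queue_percent.values()) != 100:
        raise ValueError("queue capacities must sum to 100")
    return {queue: cluster_memory_mb * pct // 100
            for queue, pct in queue_percent.items()}


def fair_shares(cluster_memory_mb, running_apps):
    """Fair-style split: every running application gets an equal share."""
    if not running_apps:
        return {}
    share = cluster_memory_mb // len(running_apps)
    return {app: share for app in running_apps}


# Hypothetical 10 GB cluster.
print(capacity_shares(10240, {"prod": 70, "dev": 30}))
print(fair_shares(10240, ["app_a", "app_b"]))
```

The contrast to notice is that the capacity-style split depends only on configuration, while the fair-style split changes whenever an application starts or finishes.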


YARN advantages

As someone deeply familiar with Hadoop YARN architecture, I can attest to its advantages in the realm of big data. YARN, or Yet Another Resource Negotiator, changes how distributed applications are processed in Hadoop v2.0. Its architecture separates the processing layer from the resource management layer, which adds flexibility and scalability. YARN enables diverse engines for batch processing, stream processing, and graph processing to work efficiently on data stored in the Hadoop Distributed File System (HDFS) while sharing the cluster’s resources. This improves cluster utilization and performance across various distributed applications. In my experience, leveraging YARN architecture in big data environments boosts productivity and streamlines data processing workflows. 

Steps of Workflow of Application in Hadoop YARN

 A client submits an application.

  1. The Resource Manager allocates a Container in which to start the Application Master.
  2. The Application Master registers with the Resource Manager.
  3. The Application Master negotiates Containers from the Resource Manager.
  4. The Application Master notifies the Node Manager to launch the Containers.
  5. The application code executes in the Containers.
  6. The client contacts the Resource Manager or the Application Master to monitor the application’s status.
  7. Once processing is complete, the Application Master unregisters from the Resource Manager.
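
The workflow above can be traced with a small Python function that lists the events for one application in order. This is only a sketch for following the sequence: the function name and event strings are invented for illustration, RM, AM, and NM abbreviate the Resource Manager, the per-application master, and the Node Manager, and no real cluster calls are made.

```python
def run_application(app_id, tasks):
    """Return the ordered events for one application; tasks is a list of task names."""
    events = [f"client submits {app_id}"]
    events.append("RM allocates a container for the AM")
    events.append("AM registers with RM")
    for number, task in enumerate(tasks, start=1):
        events.append(f"AM negotiates container {number} from RM")
        events.append(f"AM asks NM to launch container {number}")
        events.append(f"container {number} runs {task}")
    events.append("client asks RM or AM for status")
    events.append("AM unregisters from RM")
    return events


for event in run_application("app_1", ["map", "reduce"]):
    print(event)
```

In a real cluster the container steps overlap rather than running strictly one after another; the sketch linearises them only to make the order of responsibilities easy to read.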


Features of Hadoop Yarn Architecture

With the basics of YARN covered, let’s take a look at the main features of the Hadoop YARN architecture.

  • The engineers retained all of the MapReduce functionality of Hadoop version 1 when building version 2. As a result, existing version 1 MapReduce programs continue to work after upgrading. The majority of programs are binary-compatible; at worst, in a few uncommon circumstances, programs might need to be recompiled.
  • Version 2 also introduced more flexible resource containers. A resource container consists of one or more computing cores and a specific quantity of memory. In Hadoop version 1, the equivalent units were fixed “mapper slots” and “reducer slots”; in version 2 they are generic containers that are allocated dynamically. For many clusters, this alone improves efficiency after upgrading to Hadoop version 2.
  • YARN uses a dedicated resource scheduler to overcome the scalability problem. Running jobs or keeping track of job status is not the resource scheduler’s duty. The YARN ResourceManager doesn’t care what kind of task the user is running; it just allocates resources and leaves the rest to the application. Additionally, the architecture supports a failover ResourceManager, removing the single point of failure.
  • Because YARN is a generic scheduler, Hadoop clusters can now accept non-MapReduce workloads. Applications must interact with the ResourceManager and make resource requests, but the core computation can be driven by the needs of the user rather than the MapReduce data flow. One such implementation is Apache Spark, which may be considered a memory-resident MapReduce. Hadoop MapReduce typically moves the computation to the node where the data is stored on disk and also saves intermediate results on the node’s disks. Spark avoids many of these disk writes and reads by keeping data in memory. To put it another way, Spark creates fast and scalable applications by moving processing to where the data is in memory.
  • Application agility is another feature of the YARN architecture. Version 1 needed an update of the entire cluster before any modifications to the MapReduce process could be made, and a separate cluster was necessary even for testing new MapReduce versions. Version 2 allows different MapReduce versions to run simultaneously. The same agility applies to other applications built to run on YARN: new versions may be tested on the same cluster and with the same data as the production version. Finally, because YARN applications don’t have to be written in Java, YARN gives the user the option to move away from Java.
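
To see why generic containers can raise efficiency compared with typed mapper and reducer slots, consider the back-of-the-envelope Python sketch below. The node sizes and task counts are made-up numbers, and the model ignores memory and CPU differences between tasks; it only shows the effect of slots that cannot be repurposed.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fraction of slots busy when each slot accepts only one task type."""
    busy = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return busy / (map_slots + reduce_slots)


def container_utilization(containers, map_tasks, reduce_tasks):
    """Fraction of containers busy when any container can run any task."""
    busy = min(containers, map_tasks + reduce_tasks)
    return busy / containers


# Hypothetical map-heavy phase: 10 map tasks waiting, no reduce tasks yet.
typed = slot_utilization(map_slots=6, reduce_slots=4, map_tasks=10, reduce_tasks=0)
generic = container_utilization(containers=10, map_tasks=10, reduce_tasks=0)
print(f"typed slots: {typed:.0%}, generic containers: {generic:.0%}")
```

With typed slots, the four reducer slots sit idle while map tasks queue; with generic containers the same capacity is fully used.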

Here’s why Hadoop Yarn is worth using 

Having covered what YARN is in Hadoop, here is why it is worth using:

  • With YARN, MapReduce is no longer the only application engine in Hadoop.
  • Hadoop is an environment for data processing that offers a foundation for handling any kind of data.
  • The idea of a “data lake” becomes feasible, where very large volumes of unstructured raw data may be kept for upcoming or ongoing analysis.
  • Because Hadoop can perform Extract, Transform, and Load (ETL) at runtime, developers can choose from a growing number of analysis engines, including MapReduce, graph processing, in-memory, bespoke, and high-performance computing engines.

Wrapping Up


If you’re keen on delving deeper into the Domain of Big Data, I highly recommend exploring our Advanced Certificate Programme in Big Data from IIIT Bangalore. 

Additionally, consider enrolling in online Software Development Courses offered by leading global universities. Whether you opt for Executive PG Programs, Advanced Certificate Programs, or Masters Programs, these courses can significantly accelerate your career trajectory. 

In my experience, pursuing specialized programs in Big Data and Software Development has been instrumental in advancing my career. These courses provided me with invaluable insights and practical skills that I could immediately apply in my professional endeavors. 

By investing in continuous learning and upskilling, you can stay abreast of industry trends and remain competitive in today’s rapidly evolving job market.

Profile
Siddhant Khanvilkar is an experienced Content Marketer with a high degree of expertise in SEO and Web Analytics. Siddhant has a Degree in Mass Media with a Specialization in Advertising.

Frequently Asked Questions (FAQs)

1. What are the major differences between MapReduce and YARN?

YARN, or Yet Another Resource Negotiator, is a framework for managing the allocation of cluster resources such as CPU and memory, whereas MapReduce is a data processing model. YARN was introduced in Hadoop v2.0, whereas MapReduce has been around since Hadoop v1.0. YARN is now responsible for resource management, which MapReduce previously handled together with data processing. YARN’s model is more generic than MapReduce’s: applications that don’t fit the MapReduce model can still run on YARN, whereas MapReduce runs only programs written to its own model. YARN is also more scalable than MapReduce version 1. Separately, the default HDFS block size is 128 MB in Hadoop 2.x, compared with 64 MB in Hadoop 1.x.

2. Which is a better package manager, YARN or NPM?

This question concerns Yarn, the JavaScript package manager, which is unrelated to Hadoop YARN. Yarn and npm are both package managers, i.e., tools that automate processes such as installing and configuring packages for developers. In terms of performance, Yarn is known for downloading packages in parallel and caching them. In terms of adoption, both are well established and supported by large communities, so both are likely to remain in use. Yarn verifies the integrity of its packages with checksums before their code is executed, whereas npm can run package scripts automatically during installation.

3. Why does Hadoop need YARN in v2.0?

Hadoop v1.0 had several shortcomings, such as scalability issues and being limited to batch processing with MapReduce. Although it could process large datasets, MapReduce was the only way to do so. YARN, introduced in v2.0, addresses these shortcomings by taking over resource management, and it can run non-MapReduce applications alongside MapReduce batch jobs.

