Hadoop YARN Introduction
YARN is the main component of Hadoop v2.0. It opens up Hadoop by allowing data stored in HDFS to be processed and run by various data processing engines: batch processing, stream processing, interactive processing, graph processing and more. In this way, it lets Hadoop run distributed applications beyond MapReduce.
In the YARN architecture, the processing layer is separated from the resource management layer. In Hadoop 1.0, a single Job Tracker carried both responsibilities; YARN splits them between an application master and a resource manager, which increases the efficiency of the system. The processing of an application is scheduled through YARN's different components, and the many kinds of resources are allocated progressively for optimum utilization. YARN thus ensures proper usage of the available resources, which is essential when processing high volumes of data.
Why YARN?
Hadoop v1.0, also known as MapReduce Version 1 (MRV1), performed both resource management and data processing through MapReduce itself. A single master, the Job Tracker, handled everything: it allocated resources to applications, scheduled work on demand and monitored the processing jobs in the system. The Job Tracker assigned map and reduce tasks to several sub-processes called Task Trackers, which reported the progress of their tasks periodically. The real problem was the design of a single master for everything, which resulted in a bottleneck. Computational resources were also utilized inefficiently, so scalability became an issue with this version of Hadoop. On the bright side, this issue was resolved by YARN, a vital core component of its successor, Hadoop version 2.0, introduced in 2012 by Yahoo and Hortonworks. The basic idea behind the fix is to separate resource management and job scheduling from MapReduce rather than leaving everything to a single master. YARN is now responsible for job scheduling and resource management.
In Hadoop 2.0, YARN introduced the concepts of the Application Master and the Resource Manager. The Resource Manager monitors the utilization of resources across the Hadoop cluster.
YARN became popular because of several features:
- Multi-tenancy: YARN allows multiple data processing engines, such as batch, stream, interactive and graph processing engines, to share the same cluster, giving organizations the benefit of multi-tenancy.
- Cluster Utilization: YARN allocates cluster resources dynamically, so clusters are utilized in an optimized way.
- Compatibility: YARN is backward compatible with the first version of Hadoop, i.e. Hadoop 1.0: existing MapReduce applications run on YARN without modification (see the sketch after this list).
- Scalability: The scheduler in YARN's Resource Manager allows Hadoop to manage and extend to thousands of nodes and clusters.
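To illustrate the compatibility point, existing MRV1-style MapReduce code typically needs only its job configuration pointed at YARN. A minimal sketch using the real mapreduce.framework.name property (the class and method names here are otherwise hypothetical):

```java
import org.apache.hadoop.conf.Configuration;

public class Mrv1OnYarn {
    public static Configuration yarnBackedConf() {
        Configuration conf = new Configuration();
        // Existing MapReduce code runs unchanged; this setting tells the
        // client to submit jobs to YARN instead of the classic Job Tracker.
        conf.set("mapreduce.framework.name", "yarn");
        return conf;
    }
}
```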
Components of YARN
- Container:
A Container holds physical resources on a single node: RAM, CPU cores and disk. On a specific host, an application may use only the amount of CPU and memory that has been granted to it. Containers are invoked and managed through a record called the Container Launch Context (CLC), which stores everything needed to create the container process: the necessary commands, a map of environment variables, dependencies, security tokens and the payload for Node Manager services.
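To make the CLC concrete, here is a minimal sketch of assembling one with YARN's Java API; ContainerLaunchContext and Records are real classes, while the command and environment variable are placeholders:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class ClcSketch {
    public static ContainerLaunchContext buildClc() {
        // A CLC is a plain record; YARN fills in defaults for anything unset.
        ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);

        // The command the Node Manager will run inside the container
        // (placeholder command, for illustration only).
        clc.setCommands(Collections.singletonList("echo hello-from-container"));

        // Environment variables visible to the container process.
        clc.setEnvironment(Collections.singletonMap("APP_MODE", "demo"));

        // A real application would also attach local resources (jars, files)
        // and security tokens before handing the CLC to a Node Manager.
        return clc;
    }
}
```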
- Application Master:
In a framework, a single submitted job is called an application. The Application Master is responsible for negotiating resources with the Resource Manager, tracking the application's status and monitoring its progress. It requests a container from the Node Manager by sending it a Container Launch Context (CLC), which covers everything the application needs to run. Once the application has started, the Application Master sends a health report to the Resource Manager from time to time.
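A minimal sketch of the Application Master's side of this negotiation, using YARN's real AMRMClient API (the memory and core numbers are arbitrary):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();

        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(conf);
        rm.start();

        // Register this Application Master with the Resource Manager.
        rm.registerApplicationMaster("", 0, "");

        // Ask for one container: 256 MB of memory, 1 vcore (arbitrary numbers).
        Resource capability = Resource.newInstance(256, 1);
        rm.addContainerRequest(
                new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        // allocate() doubles as the periodic heartbeat/health report to the
        // Resource Manager and returns any containers granted so far.
        rm.allocate(0.0f);

        // ... launch work in the granted containers, then deregister.
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rm.stop();
    }
}
```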
- Node Manager:
The Node Manager takes care of an individual node in the Hadoop cluster and manages the containers on that node. It registers with the Resource Manager and periodically reports the node's health status. Its primary goal is to manage the application containers assigned to its node by the Resource Manager. When the Application Master requests a container, it sends the Node Manager a Container Launch Context (CLC) that includes everything the application needs to execute; the Node Manager then creates the requested container process and runs it. The Node Manager also monitors the resource usage of each individual container and reports it to the Resource Manager, so the two daemons together communicate between nodes and track resource usage across the cluster. It kills containers when directed to by the Resource Manager, and it logs everything through its built-in log management system.
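A minimal sketch of handing a CLC to a Node Manager through YARN's real NMClient API; the Container argument would come from an allocation like the one sketched above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NmSketch {
    // 'container' is a container the Resource Manager has already granted.
    public static void launch(Container container, ContainerLaunchContext clc)
            throws Exception {
        Configuration conf = new YarnConfiguration();

        NMClient nm = NMClient.createNMClient();
        nm.init(conf);
        nm.start();

        // The Node Manager creates the container process from the CLC and runs it.
        nm.startContainer(container, clc);
    }
}
```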
- Resource Manager:
The Resource Manager is the master daemon of YARN, responsible for resource management and assignment across all applications. It forwards the requests it receives to the corresponding Node Manager and allocates resources to each application according to its needs.
- It optimizes cluster utilization, keeping all resources in use while respecting constraints such as SLAs, fairness and capacity guarantees.
- It allocates the available resources to competing applications.
- It arbitrates cluster resources.
- The actual processing of requests takes place on the nodes and is managed by the Node Managers; whenever a processing request arrives, the Resource Manager transfers it, in parts, to the corresponding Node Managers.
- The Resource Manager is the ultimate authority for resource allocation.
The Resource Manager has two primary components:
- Application Manager:
The Application Manager manages the set of submitted applications. It first verifies and validates a submitted application's specifications, and it may reject the application if there are not enough resources available. It also ensures that no already-submitted application exists with the same ID, which could be caused by an erroneous or malicious client. After validation, it forwards the application to the scheduler. Finally, it observes the states of applications and manages finished ones to save Resource Manager memory: it keeps a cache of finished applications and moves out old entries to make room for freshly submitted applications.
- Scheduler:
The Scheduler allocates resources to applications based on resource availability and application requirements. It is a pure scheduler and performs no other task: it does not restart jobs after failure, and it does not track or monitor tasks. YARN supports pluggable scheduler implementations, such as the Fair Scheduler and the Capacity Scheduler, for partitioning cluster resources.
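The scheduler plugin is selected in the Resource Manager's configuration (normally in yarn-site.xml). A minimal sketch of the equivalent setting through the Java Configuration API, using the real property and scheduler class names:

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerConfigSketch {
    public static YarnConfiguration withCapacityScheduler() {
        YarnConfiguration conf = new YarnConfiguration();
        // Tell the Resource Manager which scheduler plugin to load; swap in
        // ...scheduler.fair.FairScheduler to use the Fair Scheduler instead.
        conf.set(
            "yarn.resourcemanager.scheduler.class",
            "org.apache.hadoop.yarn.server.resourcemanager"
                + ".scheduler.capacity.CapacityScheduler");
        return conf;
    }
}
```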
Workflow of an Application in Hadoop YARN
- The client submits an application.
- The Resource Manager allocates a container to start the Application Master.
- The Application Master registers itself with the Resource Manager.
- The Application Master negotiates containers from the Resource Manager.
- The Application Master notifies the Node Manager to launch the containers.
- The application code is executed in the container.
- The client contacts the Resource Manager or the Application Master to monitor the application's status.
- Once the processing is complete, the Application Master un-registers itself with the Resource Manager.
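A minimal client-side sketch of the submission and monitoring steps above, using YARN's real YarnClient API (the application name and command are placeholders):

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();

        YarnClient client = YarnClient.createYarnClient();
        client.init(conf);
        client.start();

        // Step 1: the client asks the Resource Manager for a new application.
        YarnClientApplication app = client.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("yarn-demo"); // placeholder name

        // CLC for the Application Master's container (placeholder command).
        ContainerLaunchContext amClc =
                Records.newRecord(ContainerLaunchContext.class);
        amClc.setCommands(Collections.singletonList("echo my-application-master"));
        ctx.setAMContainerSpec(amClc);
        ctx.setResource(Resource.newInstance(256, 1)); // 256 MB, 1 vcore

        ApplicationId id = client.submitApplication(ctx);

        // Monitoring step: the client polls the Resource Manager for status.
        System.out.println(client.getApplicationReport(id).getYarnApplicationState());
        client.stop();
    }
}
```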
Features of Hadoop Yarn Architecture
Having seen what YARN is, let's take a look at the diverse features of the Hadoop YARN architecture.
- The engineers retained all of the MapReduce functionality from Hadoop version 1 while building the latest edition, so upgrading to version 2 carried no porting cost: all version 1 MapReduce programs continue to function. The majority of programs are binary-compatible; at worst, in a few uncommon circumstances, a program might need to be recompiled.
- Version 2 of the YARN architecture also made resource containers more flexible. A resource container consists of one or more compute cores and a specific quantity of memory. In Hadoop version 1 the equivalent resources were fixed "mapper slots" and "reducer slots"; in version 2 they are generic slots that are dynamically controlled. Upgrading to Hadoop version 2 therefore immediately increases efficiency for the majority of clusters.
- Hadoop YARN, a dedicated resource scheduler, was introduced to overcome the scalability problem. Running jobs or keeping track of job status is not the resource scheduler's duty: the YARN ResourceManager component does not care what kind of task the user is running; it just allocates resources and stays out of the way. Additionally, the architecture supports a failover ResourceManager component, removing the single point of failure.
- Because YARN is a generic scheduler, Hadoop clusters can now accept non-MapReduce workloads. Applications must interact with the ResourceManager and make resource requests, but their core computation can be driven by the user's needs rather than the MapReduce data flow. One such implementation is Apache Spark, which may be considered a memory-resident MapReduce. Hadoop MapReduce typically moves the computation to the node where the data is stored on disk and saves interim results to the node's drives; Spark keeps everything in memory, bypassing disk writes and reads. In other words, Spark builds extremely fast and scalable applications by moving processing to where the data sits in memory.
- Application agility is another feature of the updated YARN architecture. Version 1 required an upgrade of the entire cluster before any modification to the MapReduce process could be made, and even testing a new MapReduce iteration required a separate cluster. Version 2 allows different MapReduce versions to operate simultaneously, and the same agility applies to other applications built to run on YARN: new versions can be tested on the same cluster, against the same data, as the production version. Finally, because YARN applications do not have to be written in Java, YARN gives users the option to move beyond Java.
Here's why Hadoop YARN is worth using
Having seen what YARN is and how it works, here is why it is worth using:
- MapReduce isn’t the only application engine in Hadoop.
- Hadoop is an environment for data processing that offers a foundation for handling any kind of data.
- Now, the idea of a “data lake” is conceivable, where essentially infinite volumes of unstructured raw data may be kept for upcoming or ongoing analysis.
- Developers can choose from a growing number of analysis engines, such as MapReduce, graph processing, in-memory, bespoke and high-performance computing engines, and Hadoop can conduct Extract, Transform and Load (ETL) processing at runtime.
Wrapping Up
YARN separates resource management from data processing, removing Hadoop 1.0's single-master bottleneck and opening the cluster to many engines beyond MapReduce. That separation is what makes Hadoop 2.0 more scalable, more efficient and far more flexible.