Apache YARN

In this article we will talk about two new components introduced in Hadoop 2.0, YARN and MR2, and answer the following questions:

    • What is YARN?
    • Why was there a need for YARN (Yet Another Resource Negotiator), a new framework, in Hadoop 2.0?
    • What are the benefits of the YARN framework over the earlier MapReduce framework of Hadoop 1.0?
    • What is the difference between MR1 in Hadoop 1.0 and MR2 in Hadoop 2.0?

YARN (Yet Another Resource Negotiator) is a new component added in Hadoop 2.0. Let’s have a look at how the Hadoop architecture has changed from Hadoop 1.0 to Hadoop 2.0.

HDFS federation brings important measures of scalability and reliability to Hadoop. YARN, the other major advance in Hadoop 2, brings significant performance improvements for some applications, supports additional processing models, and implements a more flexible execution engine.

YARN is a resource manager that was created by separating the processing engine and resource management capabilities of MapReduce as it was implemented in Hadoop 1. YARN is often called the operating system of Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing high availability features of Hadoop.

Like an operating system on a server, YARN is designed to allow multiple, diverse user applications to run on a multi-tenant platform. In Hadoop 1, users had the option of writing MapReduce programs in Java, in Python, Ruby or other scripting languages using streaming, or using Pig, a data transformation language. Regardless of which method was used, all fundamentally relied on the MapReduce processing model to run.

YARN supports multiple processing models in addition to MapReduce. One of the most significant benefits of this is that we are no longer limited to working with the often I/O-intensive, high-latency MapReduce framework. This advance means Hadoop users should be familiar with the pros and cons of the new processing models and understand when to apply them to particular use cases.

[Figure: YARN-in-Hadoop-1]

As shown, in Hadoop 2.0 a new layer has been introduced between HDFS and MapReduce. This is the YARN framework, which is responsible for cluster resource management.

Cluster Resource Management:
Cluster resource management means managing the resources of the Hadoop cluster, and by resources we mean memory, CPU, and so on.

YARN took over this task of cluster resource management from MapReduce, and MapReduce was streamlined to perform only data processing, which is what it does best.
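To make “resources” concrete, here is a minimal sketch of how YARN itself describes a unit of cluster resources, assuming the Hadoop 2.x YARN client libraries (hadoop-yarn-api and its dependencies) are on the classpath: an allocation is simply an amount of memory in MB plus a number of virtual CPU cores.

    import org.apache.hadoop.yarn.api.records.Resource;

    public class ResourceExample {
        public static void main(String[] args) {
            // A YARN "resource" is just memory (in MB) plus virtual cores.
            // Here we describe a request for 2 GB of memory and 1 vcore.
            Resource capability = Resource.newInstance(2048, 1);
            System.out.println("Memory (MB): " + capability.getMemory());
            System.out.println("Virtual cores: " + capability.getVirtualCores());
        }
    }

Every container that YARN hands out to an application is sized in exactly these terms.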

[Figure: YARN-in-Hadoop-2]

Why was YARN needed? Before we understand the need for YARN, we should understand how cluster resource management was done in Hadoop 1.0 and what the problems with that approach were.

Cluster Resource Management in Hadoop 1.0: In Hadoop 1.0, there is tight coupling between cluster resource management and the MapReduce programming model.
The JobTracker, which does resource management, is part of the MapReduce framework.
[Figure: YARN-in-Hadoop-3]

In the MapReduce framework, a MapReduce job (MapReduce application) is divided into a number of tasks called mappers and reducers. Each task runs on one of the machines (DataNodes) of the cluster, and each machine has a limited number of predefined slots (map slots, reduce slots) for running tasks concurrently.

Here, the JobTracker is responsible for both managing the cluster’s resources and driving the execution of the MapReduce job. It reserves and schedules slots for all tasks, configures, runs and monitors each task, and if a task fails, it allocates a new slot and reattempts the task. After a task finishes, the JobTracker cleans up temporary resources and releases the task’s slot to make it available for other jobs.
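To illustrate how tightly job execution was tied to the JobTracker, here is a minimal, hedged sketch of an MR1 job driver using the old org.apache.hadoop.mapred API. The input and output paths are placeholders; IdentityMapper and IdentityReducer are stock library classes, so nothing else is needed.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class Mr1JobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(Mr1JobDriver.class);
            conf.setJobName("mr1-identity-job");

            // The default TextInputFormat produces <LongWritable, Text> records,
            // which the identity mapper and reducer simply pass through.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // In Hadoop 1.x this call hands the entire job to the JobTracker,
            // which reserves map/reduce slots, launches tasks on TaskTrackers,
            // monitors them and retries any failures.
            JobClient.runJob(conf);
        }
    }

Notice that everything behind runJob() – scheduling, monitoring, retries – happens inside the MapReduce framework itself, which is exactly the coupling described above.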

Problems with this approach in Hadoop 1.0:
    1. It limits scalability: The JobTracker runs on a single machine, performing several tasks such as
        • Resource management
        • Job and task scheduling and
        • Monitoring

      Although there are many machines (DataNodes) available, they cannot share this management work; everything funnels through the single JobTracker. This limits scalability.

    2. Availability issue: In Hadoop 1.0, the JobTracker is a single point of failure. This means that if the JobTracker fails, all jobs must restart.
    3. Problem with resource utilization: In Hadoop 1.0, each TaskTracker has a predefined number of map slots and reduce slots. Resource utilization issues occur because map slots might be ‘full’ while reduce slots are empty (and vice versa). Compute resources (DataNodes) reserved for reduce slots can sit idle even when there is an immediate need for those resources to be used as map slots (the small configuration sketch after this list shows the two properties that fix these slot counts).
    4. Limitation in running non-MapReduce applications: In Hadoop 1.0, the JobTracker was tightly integrated with MapReduce, and only applications that follow the MapReduce programming model can run on Hadoop.
      Let’s try to understand point 4 in more detail. The Hadoop Distributed File System (HDFS) makes it cheap to store large amounts of data, and the scalable MapReduce analysis engine makes it possible to extract insights from that data. MapReduce works on batch-driven data analysis, where the input data is partitioned into smaller batches that can be processed in parallel across many machines in the Hadoop cluster. But MapReduce, while powerful enough to express many data analysis algorithms, is not always the optimal choice of programming paradigm. It is often desirable to run other computation paradigms in the Hadoop cluster – here are some examples.

        • Problem in performing real-time analysis: MapReduce is batch driven. What if we want to perform real-time analysis instead of batch processing (where results are available only after several hours)? There are many applications, such as fraud detection, that need results in real time. Real-time engines like Apache Storm can work better in this case, but in Hadoop 1.0, due to the tight coupling, these engines cannot run independently.
        • Problem in running a message-passing approach: Message passing involves stateful processes running on each node of a distributed network. The processes communicate with each other by sending messages and alter their state based on the messages they receive. This is not possible in MapReduce.
        • Problem in running ad-hoc queries: Many users like to query their big data using SQL. Apache Hive can execute a SQL query as a series of MapReduce jobs, but it has shortcomings in terms of performance.
          Recently, some new approaches such as Apache Tajo, Facebook’s Presto and Cloudera’s Impala have drastically improved performance, but they need to run services in a form other than MapReduce.
          It is not possible to run such non-MapReduce jobs on a Hadoop 1.0 cluster. Such jobs have to “disguise” themselves as mappers and reducers in order to be able to run on Hadoop 1.0.
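As referenced in point 3 above, here is a minimal, hedged sketch of the two Hadoop 1.x properties that fix the slot counts on every TaskTracker. It reads them through a client-side JobConf purely for illustration; on a real cluster they are daemon-side settings in each TaskTracker’s mapred-site.xml, not per-job settings.

    import org.apache.hadoop.mapred.JobConf;

    public class SlotConfigExample {
        public static void main(String[] args) {
            // Each Hadoop 1.x TaskTracker advertises a fixed number of map and
            // reduce slots; both properties default to 2 if not overridden.
            JobConf conf = new JobConf();
            System.out.println("Map slots per TaskTracker: "
                    + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
            System.out.println("Reduce slots per TaskTracker: "
                    + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
        }
    }

Because these counts are static, a cluster full of idle reduce slots cannot lend that capacity to a map-heavy job, which is the utilization problem described above.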

Hadoop 2.0 solves all these problems with YARN:

[Figure: YARN-in-Hadoop-4]

YARN has a central ResourceManager component which manages cluster resources and allocates them to applications. Multiple applications can run on Hadoop via YARN, and all of them share common resource management.
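As a rough illustration (not the full application-submission flow), the sketch below shows the first step of that negotiation: a client connecting to the central ResourceManager through the Hadoop 2.x YarnClient API and asking for a new application id. A real application would go on to build an ApplicationSubmissionContext and launch an ApplicationMaster.

    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnClientSketch {
        public static void main(String[] args) throws Exception {
            // Picks up the ResourceManager address from yarn-site.xml.
            YarnConfiguration conf = new YarnConfiguration();

            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Ask the ResourceManager to register a new application. The
            // response also reports the largest container it will grant.
            YarnClientApplication app = yarnClient.createApplication();
            System.out.println("Application id: "
                    + app.getNewApplicationResponse().getApplicationId());
            System.out.println("Max container capability: "
                    + app.getNewApplicationResponse().getMaximumResourceCapability());

            yarnClient.stop();
        }
    }

Any framework that speaks this protocol – not just MapReduce – can be scheduled by the same ResourceManager.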

Advantages of YARN:
    1. YARN utilizes cluster resources efficiently.
      There are no more fixed map and reduce slots. YARN provides a central resource manager. With YARN, you can now run multiple applications in Hadoop, all sharing common resource management.
    2. YARN can even run applications that do not follow the MapReduce model.
      YARN decouples MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. For example, Hadoop clusters can now run interactive querying and streaming data applications simultaneously with MapReduce batch jobs. This also streamlines MapReduce to do what it does best – process data.

A Few Important Notes about YARN:

    1. YARN is backward compatible.
      This means that existing MapReduce jobs can run on Hadoop 2.0 without any change.
    2. No more JobTracker and TaskTracker are needed in Hadoop 2.0.
      The JobTracker and TaskTracker have totally disappeared. YARN splits the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into two separate daemons (components):

        • Resource Manager
        • Node Manager (node specific)

      The central Resource Manager and the node-specific Node Managers together constitute YARN.

[Figure: YARN-in-Hadoop-5]
MapReduce: Difference between MR1 and MR2:
  • The earlier version of the MapReduce framework in Hadoop 1.0 is called MR1. The new version of MapReduce is known as MR2.
  • No more JobTracker and TaskTracker are needed in Hadoop 2. With the introduction of YARN in Hadoop 2, the terms JobTracker and TaskTracker disappeared. MapReduce is now streamlined to perform data processing only.
  • The new model is more isolated and scalable compared to the earlier MR1 system. MR2 is one kind of distributed application that runs the MapReduce framework on top of YARN, as sketched below. MapReduce performs data processing via YARN, and other tools can also perform data processing via YARN. Hence the YARN execution model is more generic than the earlier MapReduce model.
  • MR1 was not able to do so; it could only run MapReduce applications.
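For comparison with the MR1 driver shown earlier, here is a hedged sketch of an equivalent job written against the newer org.apache.hadoop.mapreduce API. On a Hadoop 2.x cluster, waitForCompletion() submits the job to YARN: the ResourceManager grants containers and a per-job MRAppMaster – not a JobTracker – drives the map and reduce tasks. The paths are placeholders, and the base Mapper and Reducer classes act as identity functions, so the example is self-contained.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Mr2JobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "mr2-identity-job");
            job.setJarByClass(Mr2JobDriver.class);

            // The base Mapper and Reducer classes simply pass records through.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // On a YARN cluster, MR2 runs as just another YARN application.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Any other YARN application is submitted to the same ResourceManager in the same way, which is what makes the YARN execution model more generic than MR1.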

Reference: http://saphanatutorial.com/

