MapReduce is a framework for writing applications that process huge amounts of data in parallel, reliably, on large clusters of commodity hardware.
MapReduce is a processing technique and a programming model for distributed computing; Hadoop's implementation of it is written in Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the order of the words in the name MapReduce implies, the reduce task is always performed after the map job.
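To make the two tasks concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for illustration, and the grouping that a real framework performs across machines is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_fn(line):
    """Map: break one input line into (word, 1) tuples."""
    for word in line.lower().split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Reduce: combine all values for one key into a single tuple."""
    return (word, sum(counts))


def run_mapreduce(lines, mapper, reducer):
    """Hypothetical driver: map every record, group by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    # Reduce runs only after all map output has been collected.
    return [reducer(key, values) for key, values in sorted(grouped.items())]


lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

The mapper never sees more than one line and the reducer never sees more than one key, which is what allows a framework to spread both across many machines.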
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.
- PayLoad– Applications implement the Map and the Reduce functions, and form the core of the job.
- Mapper– Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
- NameNode– Node that manages the Hadoop Distributed File System (HDFS).
- DataNode– Node where data is present in advance before any processing takes place.
- MasterNode– Node where JobTracker runs and which accepts job requests from clients.
- SlaveNode– Node where Map and Reduce program runs.
- JobTracker– Schedules jobs and tracks the jobs assigned to the TaskTracker.
- TaskTracker– Tracks the task and reports status to JobTracker.
- Job– An execution of a Mapper and Reducer across a dataset.
- Task– An execution of a Mapper or a Reducer on a slice of data.
- Task Attempt– A particular instance of an attempt to execute a task on a SlaveNode.
- Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
- A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
- Map stage: The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
- Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
- During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
- The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
- Most of the computing takes place on nodes that hold the data on their local disks, which reduces network traffic.
- After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
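The three stages listed above can be sketched in a single process as follows. This is an illustration under stated assumptions, not Hadoop's implementation: the input format (`year,temperature` lines), the function names, the reducer count, and the use of CRC32 to choose a reducer are all invented for the example.

```python
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # assumed value for the illustration


def map_stage(lines):
    """Map stage: read input line by line and emit (year, temperature) pairs."""
    for line in lines:
        year, temp = line.split(",")
        yield (year, int(temp))


def partition(key, num_reducers):
    """Pick a reducer for a key with a deterministic hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_stage(pairs, num_reducers):
    """Shuffle stage: route each pair to a reducer, then sort each partition by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return [sorted(p, key=itemgetter(0)) for p in partitions]


def reduce_stage(sorted_pairs):
    """Reduce stage: one output record per key, here the maximum temperature."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, max(value for _, value in group))


records = ["1949,111", "1950,22", "1949,78", "1950,0", "1951,35"]
partitions = shuffle_stage(map_stage(records), NUM_REDUCERS)
results = sorted(r for p in partitions for r in reduce_stage(p))
print(results)
```

Because every pair with the same key is routed to the same partition, each reducer can compute its result without consulting any other reducer.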
A MapReduce job splits the input data set into independent “chunks” that are processed by map tasks in parallel. The framework sorts the map outputs, which are then input to reduce tasks. Job inputs and outputs are stored in the file system. The MapReduce framework and the HDFS are typically on the same set of nodes, which enables the framework to schedule tasks on nodes that contain data. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per node. The master is responsible for scheduling job component tasks on the slaves, monitoring tasks, and re-executing failed tasks.
The slaves execute tasks as directed by the master. Minimally, applications specify input and output locations and supply map and reduce functions through implementation of appropriate interfaces or abstract classes. Although the Hadoop framework is implemented in Java, MapReduce applications do not have to be written in Java. HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode.
A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location.
For example: if node A contains data (x, y, z) and node B contains data (a, b, c), the job tracker schedules node B to perform map or reduce tasks on (a, b, c) and node A would be scheduled to perform map or reduce tasks on (x, y, z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer.
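The node A / node B example can be sketched as a toy scheduler. This is only an illustration of the data-locality idea, not the JobTracker's actual algorithm; the function `schedule_tasks`, the slot counts, and the block layout are hypothetical.

```python
def schedule_tasks(block_locations, free_slots):
    """Assign each data block to a node, preferring a node that stores it.

    block_locations: block id -> list of nodes holding a copy (hypothetical layout)
    free_slots: node -> number of tasks it can still accept
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignments = {}
    for block, holders in block_locations.items():
        # Prefer a data-local node with a free slot.
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node = local[0]
        else:
            # Fall back to any node with capacity; the data must cross the network.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free task slots in the cluster")
            node = remote[0]
        slots[node] -= 1
        assignments[block] = node
    return assignments


block_locations = {
    "x": ["A"], "y": ["A"], "z": ["A"],
    "a": ["B"], "b": ["B"], "c": ["B"],
}
assignments = schedule_tasks(block_locations, {"A": 3, "B": 3})
print(assignments)
```

With three free slots on each node, every task runs where its data already lives. If node A had only two slots, the third of its blocks would fall back to node B and its data would have to be copied over the network.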
This technique is conceptually simple but very powerful when combined with the Hadoop framework. There are two major steps:
- Map :
In the Map step, the master node takes the input, divides it into smaller chunks, and distributes them to the worker nodes. Map is thus a function that parcels out work to different nodes in the distributed cluster.
- Reduce :
In the Reduce step, all the partial solutions to the problem are collected and returned as one unified answer. Reduce is thus a function that collates the work and resolves the results into a single value. Both steps use functions that operate on key-value pairs. The process runs on many nodes in parallel, which is what makes the framework fast.
The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and re-assigns the work to other nodes.
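The report-back-or-be-reassigned behaviour can be sketched with a toy master. This is a simplified model, not Hadoop's code: the `Master` class, its method names, the 10-second timeout, and the round-robin reassignment policy are assumptions made for the example, and time is passed in explicitly so the run is deterministic.

```python
class Master:
    """Toy master that reassigns work from nodes that stop reporting."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence tolerated (assumed policy)
        self.last_seen = {}      # node -> time of its last heartbeat
        self.tasks = {}          # task id -> node currently running it

    def heartbeat(self, node, now):
        """Record a status report from a node."""
        self.last_seen[node] = now

    def assign(self, task, node):
        self.tasks[task] = node

    def check_nodes(self, now):
        """Move tasks off silent nodes onto nodes that are still reporting."""
        alive = [n for n, t in self.last_seen.items() if now - t <= self.timeout]
        if not alive:
            return []
        reassigned = []
        for task, node in sorted(self.tasks.items()):
            if node not in alive:
                # Spread orphaned tasks over the live nodes in round-robin order.
                new_node = alive[len(reassigned) % len(alive)]
                self.tasks[task] = new_node
                reassigned.append((task, node, new_node))
        return reassigned


master = Master(timeout=10)
master.heartbeat("node1", now=0)
master.heartbeat("node2", now=0)
master.assign("map-0", "node1")
master.assign("map-1", "node2")
master.heartbeat("node1", now=12)  # node2 stays silent
moved = master.check_nodes(now=15)
print(moved)
```

Here node2 has been silent for 15 seconds, longer than the tolerated 10, so its task is handed to node1, which reported 3 seconds ago.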