Hadoop MapReduce is a programming model and software framework for writing applications that process large amounts of data. A MapReduce program has two phases, Map and Reduce.
The Map task splits and maps the data: it takes a dataset and converts it into another set of data in which the individual elements are broken down into tuples, i.e. key/value pairs. The Reduce task then shuffles and reduces the data, which means it combines the tuples that share a key and updates the value for that key accordingly.
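The two tasks described above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Hadoop code: the names `map_fn`, `reduce_fn` and `word_count` are made up for this example, and a dictionary stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_fn(line):
    """Map: break one line of text into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(key, values):
    """Reduce: combine all values that share a key into a single value."""
    return key, sum(values)


def word_count(lines):
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)  # collect the values emitted for each key
    return dict(reduce_fn(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data node"])
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

The map function never looks at more than one line and the reduce function never looks at more than one key, which is what lets a framework run many copies of each in parallel.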
In the Hadoop framework, the MapReduce model is the core component for data processing. With this model, an application can be scaled to run on hundreds or thousands of machines in a cluster by making only a configuration change. This is possible because programs written in this model are parallel by nature. Hadoop can run MapReduce programs written in several languages, such as Java, Ruby, Python and C++.
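For languages other than Java, one common route is Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below imitates that line protocol in Python. The log-level data and the function names are assumptions made for this example, and the functions take any iterable of lines so they can be tried without a cluster; in a real streaming script they would be fed `sys.stdin`.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'level<TAB>1' for each log line such as 'ERROR disk full'."""
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            yield f"{parts[0]}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


log = ["ERROR disk full", "INFO started", "ERROR timeout", ""]
mapped = sorted(mapper(log))  # stands in for the framework's sort between stages
output = list(reducer(mapped))
print(output)  # ['ERROR\t2', 'INFO\t1']
```

The reducer relies on its input arriving sorted by key, which is why `sorted` is called between the two stages here.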
Inputs and Outputs
The MapReduce model operates on <key, value> pairs. It views the input to a job as a set of <key, value> pairs and produces a different set of <key, value> pairs as the output of the job. Data input is supported by two classes in this framework, namely InputFormat and RecordReader.
The former is consulted to determine how the input data should be partitioned for the map tasks, while the latter reads the data from the inputs. Data output is likewise supported by two classes, OutputFormat and RecordWriter. The former performs a basic validation of the data sink properties, and the latter writes each reducer output to the data sink.
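The division of labour between these four classes can be illustrated with a toy model in Python. These are not Hadoop's Java classes or method names: every class and method below (`ToyInputFormat`, `get_splits`, and so on) is hypothetical, the key is simply the line's position within its split, and an in-memory `io.StringIO` object stands in for the data sink.

```python
import io


class ToyInputFormat:
    """Decides how the input is partitioned (here: a fixed number of lines per split)."""

    def __init__(self, lines_per_split):
        self.lines_per_split = lines_per_split

    def get_splits(self, lines):
        n = self.lines_per_split
        return [lines[i:i + n] for i in range(0, len(lines), n)]


class ToyRecordReader:
    """Reads one split and turns it into <key, value> records."""

    def read(self, split):
        for position, line in enumerate(split):
            yield position, line  # key: position within the split, value: the text


class ToyRecordWriter:
    """Writes each output pair to the sink as a tab-separated line."""

    def __init__(self, sink):
        self.sink = sink

    def write(self, key, value):
        self.sink.write(f"{key}\t{value}\n")


class ToyOutputFormat:
    """Performs a basic check on the sink, then hands out a writer for it."""

    def get_writer(self, sink):
        if not sink.writable():
            raise ValueError("output sink is not writable")
        return ToyRecordWriter(sink)


lines = ["a b", "c", "d e f"]
splits = ToyInputFormat(lines_per_split=2).get_splits(lines)
records = [list(ToyRecordReader().read(split)) for split in splits]

sink = io.StringIO()
writer = ToyOutputFormat().get_writer(sink)
writer.write("a", 1)
```

Here `splits` holds two groups of lines, one per map task, and `records` holds the <key, value> pairs each map task would receive.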
What are the Phases of MapReduce?
In MapReduce, data goes through the following phases.
Input Splits: The input to a MapReduce job is divided into small fixed-size parts called input splits. Each split is consumed by a single map task. The input data is generally a file or directory stored in HDFS.
Mapping: This is the first phase in the execution of a MapReduce program. The data in each split is passed, line by line, to a mapper function, which processes it and produces output values.
Shuffling: This phase consumes the output of the Mapping phase and consolidates the relevant records. It consists of merging and sorting. Merging combines all the key-value pairs that have the same key, and sorting takes the merged pairs and orders them by key. The result is a sorted set of key-value pairs.
Reduce: The values for each key from the shuffling phase are combined and a single output value is returned, which summarizes the entire dataset.
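The four phases above can be traced step by step with a small in-memory simulation. This is a sketch under stated assumptions rather than Hadoop code: the "city,amount" sales records, the split size of two lines and all function names are invented for the example.

```python
from itertools import groupby
from operator import itemgetter


def make_splits(records, split_size):
    """Input Splits: cut the input into fixed-size parts."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_split(split):
    """Mapping: process a split line by line and emit (city, amount) pairs."""
    pairs = []
    for line in split:
        city, amount = line.split(",")
        pairs.append((city, int(amount)))
    return pairs


def shuffle(mapped_outputs):
    """Shuffling: merge all map outputs, sort by key, group values per key."""
    merged = [pair for output in mapped_outputs for pair in output]
    merged.sort(key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]


def reduce_group(key, values):
    """Reduce: combine the values for one key into a single output value."""
    return key, sum(values)


records = ["pune,10", "delhi,5", "pune,7", "agra,1", "delhi,2"]
splits = make_splits(records, 2)
mapped = [map_split(split) for split in splits]
shuffled = shuffle(mapped)
result = [reduce_group(key, values) for key, values in shuffled]
print(result)  # [('agra', 1), ('delhi', 7), ('pune', 17)]
```

Printing `splits`, `mapped` and `shuffled` in turn shows what the data looks like after each phase.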
How does MapReduce Organize Work?
Hadoop divides a job into two kinds of tasks: Map tasks, which include Splits and Mapping, and Reduce tasks, which include Shuffling and Reducing. These are the phases described in the section above. The execution of these tasks is controlled by two kinds of entities: a JobTracker and multiple TaskTrackers.
For every job submitted for execution, there is one JobTracker, which resides on the NameNode, and multiple TaskTrackers, which reside on the DataNodes. A job is divided into multiple tasks that run on multiple data nodes in the cluster. The JobTracker coordinates the activity by scheduling tasks to run on the various data nodes.
Each TaskTracker looks after the execution of individual tasks and sends progress reports to the JobTracker. It also periodically sends a heartbeat signal to the JobTracker to report the current state of the system. When a task fails, the JobTracker reschedules it on a different TaskTracker.
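The rescheduling behaviour can be sketched as a toy simulation in Python. The classes below only borrow the names JobTracker and TaskTracker; the round-robin scheduling order, the `healthy` flag and the report strings are assumptions made for this example and do not reflect how Hadoop actually schedules tasks.

```python
class TaskTracker:
    """Runs individual tasks and reports the outcome (toy model)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, task):
        if not self.healthy:
            raise RuntimeError(f"{self.name} failed while running {task}")
        return f"{task} done on {self.name}"


class JobTracker:
    """Schedules tasks and reschedules a failed task on a different tracker."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, tasks):
        reports = []
        for i, task in enumerate(tasks):
            # Start each task on a different tracker (simple round-robin order).
            start = i % len(self.trackers)
            order = self.trackers[start:] + self.trackers[:start]
            for tracker in order:
                try:
                    reports.append(tracker.run(task))
                    break
                except RuntimeError:
                    continue  # task failed: reschedule it on the next tracker
            else:
                raise RuntimeError(f"no tracker could run {task}")
        return reports


trackers = [TaskTracker("tt-1"), TaskTracker("tt-2", healthy=False), TaskTracker("tt-3")]
reports = JobTracker(trackers).run_job(["map-0", "map-1", "reduce-0"])
print(reports)
# ['map-0 done on tt-1', 'map-1 done on tt-3', 'reduce-0 done on tt-3']
```

The task `map-1` is first assigned to the unhealthy `tt-2`, fails there, and completes on `tt-3` instead.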
Advantages of MapReduce
Applications that use this model gain a number of advantages:
- Big data can be handled easily.
- Datasets can be processed in parallel.
- All types of data, whether structured, unstructured or semi-structured, can be processed easily.
- High scalability is provided.
- Counting occurrences of words is easy, even across massive data collections.
- Large samples of respondents can be accessed quickly.
- In data analysis, a generic tool can be used to search tools.
- Load balancing is provided in large clusters.
- Contexts such as user locations and situations can be extracted easily.
- Good generalization performance and convergence are provided to these applications.
We have described MapReduce in Hadoop in detail. The introduction gave a brief description of the framework along with definitions of both Map and Reduce. The terms used in this model were then defined, along with the details of its inputs and outputs.
The explanation of the phases of the MapReduce framework showed how work gets organized, and the list of advantages gives a clear picture of the model's use and relevance.