Hadoop MapReduce is a programming model and software framework for writing applications that process large amounts of data. A MapReduce program has two phases: Map and Reduce.
The Map task splits and maps the data: it takes a dataset and converts it into another set of data in which the individual elements are broken down into tuples, i.e. key/value pairs. The Reduce task then shuffles and reduces that data: it combines the tuples that share a key and computes a new value for each key.
In the Hadoop framework, the MapReduce model is the core component for data processing. Because MapReduce programs are inherently parallel, an application written in this model can be scaled to run over hundreds or thousands of machines in a cluster with only a configuration change. Hadoop can run MapReduce programs written in many languages, such as Java, Ruby, Python, and C++.
Inputs and Outputs
The MapReduce model operates on <key, value> pairs. It views the input to a job as a set of <key, value> pairs and produces a different set of <key, value> pairs as the job's output. Data input is supported by two classes in this framework, InputFormat and RecordReader.
The former is consulted to determine how the input data should be partitioned for the map tasks, while the latter reads the records from the input. Data output is likewise supported by two classes, OutputFormat and RecordWriter. The first performs a basic validation of the data sink properties, and the second is used to write each reducer's output to the data sink.
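As an illustration of the RecordReader's role, Hadoop's default text input presents each line of a file to the mapper as a (byte offset, line) pair. The following pure-Python sketch mimics that behaviour (it is an illustration only, not Hadoop's actual implementation, and it counts characters rather than bytes, which coincide for ASCII input):

```python
import io

def line_records(stream):
    """Yield (offset, line) pairs, mimicking how Hadoop's default text
    input format presents each line of a file to a mapper."""
    offset = 0
    for line in stream:
        yield offset, line.rstrip("\n")
        offset += len(line)

data = io.StringIO("first line\nsecond line\n")
print(list(line_records(data)))  # [(0, 'first line'), (11, 'second line')]
```

The key (offset) is rarely useful to the mapper itself; it mainly guarantees that every record has a key, since the model operates uniformly on <key, value> pairs.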
What are the Phases of MapReduce?
In MapReduce, data goes through the following phases.
Input Splits: The input to a MapReduce job is divided into fixed-size pieces called input splits. Each split is consumed by a single map task. The input data is generally a file or directory stored in HDFS.
Mapping: In this first phase of program execution, the data in each split is passed, record by record, to a mapper function, which processes it and produces output key-value pairs.
Shuffling: This phase consolidates the relevant records from the mapping output. It consists of merging and sorting: all key-value pairs that share a key are grouped together, and the groups are sorted by key before being handed to the reducers.
Reduce: For each key, all the values from the shuffling phase are combined into a single output value, thereby summarizing the dataset.
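The four phases above can be sketched in a few lines of plain Python. This is a single-process simulation of the model, not Hadoop itself; the word-count mapper and summing reducer are the classic illustrative example:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(splits, mapper, reducer):
    """Simulate the MapReduce phases: map each record of each split,
    shuffle (sort and group the intermediate pairs by key), then reduce."""
    # Mapping: run the mapper over every record of every split.
    intermediate = [pair for split in splits for record in split
                    for pair in mapper(record)]
    # Shuffling: sort by key and group pairs that share a key.
    intermediate.sort(key=itemgetter(0))
    grouped = groupby(intermediate, key=itemgetter(0))
    # Reducing: collapse each key's values into a single output value.
    return {key: reducer(key, [v for _, v in pairs])
            for key, pairs in grouped}

splits = [["Deer Bear River"], ["Car Car River"], ["Deer Car Bear"]]
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda key, values: sum(values)
print(run_mapreduce(splits, mapper, reducer))
# {'Bear': 2, 'Car': 3, 'Deer': 2, 'River': 2}
```

In real Hadoop, each split would be mapped on a different node and the shuffle would move data across the network, but the data flow is the same.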
How does MapReduce Organize Work?
Hadoop divides a job into two kinds of tasks: Map tasks, which cover the Splits and Mapping phases, and Reduce tasks, which cover the Shuffling and Reducing phases described in the section above. The execution of these tasks is controlled by two entities, a JobTracker and multiple TaskTrackers.
For every job submitted for execution, there is one JobTracker, which resides on the NameNode, and multiple TaskTrackers, which reside on the DataNodes. A job is divided into multiple tasks that run on multiple data nodes in the cluster, and the JobTracker coordinates the activity by scheduling tasks to run on various data nodes.
The TaskTracker looks after the execution of individual tasks and sends progress reports to the JobTracker. It also periodically sends a heartbeat signal to the JobTracker to report the current state of the system. When a task fails, the JobTracker reschedules it on a different TaskTracker.
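The rescheduling behaviour can be sketched as a toy scheduling loop. This is a hypothetical simplification for illustration (the tracker and task names are made up, and the real JobTracker protocol involves heartbeats, slots, and data locality, none of which are modelled here):

```python
def schedule(tasks, trackers, failed=frozenset()):
    """Toy JobTracker loop: assign tasks round-robin to task trackers;
    if the chosen tracker has failed, reschedule on a healthy one.
    (Hypothetical sketch, not the real JobTracker protocol.)"""
    healthy = [t for t in trackers if t not in failed]
    assignment = {}
    for i, task in enumerate(tasks):
        choice = trackers[i % len(trackers)]
        if choice in failed:
            # Task failure: pick a different, healthy task tracker.
            choice = healthy[i % len(healthy)]
        assignment[task] = choice
    return assignment

print(schedule(["map-0", "map-1", "reduce-0"],
               ["tracker-A", "tracker-B"], failed={"tracker-B"}))
# {'map-0': 'tracker-A', 'map-1': 'tracker-A', 'reduce-0': 'tracker-A'}
```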
Advantages of MapReduce
Applications that use this model gain a number of advantages:
- Big data can be handled easily.
- Datasets can be processed in parallel.
- All types of data (structured, semi-structured, and unstructured) can be processed easily.
- High scalability is provided.
- Counting occurrences of words is easy, even over massive data collections.
- Large samples of respondents can be accessed quickly.
- Generic analysis tools can be applied across many data sources.
- Load is balanced automatically across large clusters.
- Contexts such as user locations and situations can be extracted easily.
- Good generalization performance and convergence are provided to these applications.
We have described MapReduce in Hadoop in detail, starting with a brief description of the framework and definitions of both Map and Reduce in the introduction, followed by the terms used in this model and the details of its inputs and outputs.
A detailed explanation of the phases involved in the MapReduce framework illustrated how work gets organized, and the list of advantages of using MapReduce gives a clear picture of its use and relevance.
What are the features of MapReduce?
Apache Hadoop works with humongous amounts of data and, because of the efficiency with which it stores that data, is very scalable: it distributes data across multiple inexpensive servers that work in parallel, so adding more servers to the cluster increases its capacity. By using Hadoop MapReduce, organisations save a lot of money, as applications can utilise large clusters of data nodes holding petabytes of data. MapReduce also enables companies to analyse new data sources: organisations can process both structured and unstructured data, depending on their requirements. Another feature of MapReduce is its scalability, as the framework easily accommodates very large datasets. Finally, Hadoop utilises the Hadoop Distributed File System, which maps data to the cluster nodes that process it, making MapReduce a faster way to work.
What are the applications of MapReduce with Hadoop?
The practical use of MapReduce with Hadoop can be seen in sectors like e-commerce, entertainment, and social networks. Giant companies like Amazon and Walmart run MapReduce over purchase histories, logs, and similar records to figure out the buying patterns of customers, and use the results to drive their product recommendations. Social network platforms like Twitter and Facebook use the MapReduce programming model to evaluate information such as numbers of likes, status updates, and so on. In the entertainment industry, platforms like Netflix utilise MapReduce to count logged-in customers, total clicks made, and similar statistics, and then recommend titles based on each subscriber's interests.
What are the parameters that are specified to run a MapReduce job?
To run a MapReduce job, the user must explicitly specify: the job's input location and output location in the distributed file system, the input format, the output format, the class that holds the map function, the class that holds the reduce function, and the JAR file containing the mapper, reducer, and driver classes.
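These required parameters can be pictured as a simple checklist. The sketch below uses descriptive placeholder names, not real Hadoop configuration property names, and the paths and class names are hypothetical:

```python
# Illustrative checklist of what a MapReduce job submission must specify.
# Key names are descriptive placeholders, not Hadoop property names.
job_spec = {
    "input_path": "/data/input",          # job input location in HDFS
    "output_path": "/data/output",        # job output location in HDFS
    "input_format": "TextInputFormat",    # how input is split and read
    "output_format": "TextOutputFormat",  # how reducer output is written
    "mapper_class": "WordCountMapper",    # class holding the map function
    "reducer_class": "WordCountReducer",  # class holding the reduce function
    "job_jar": "wordcount.jar",           # JAR with mapper/reducer/driver
}

def missing_fields(spec):
    """Report any required job parameters that were not supplied."""
    required = {"input_path", "output_path", "input_format",
                "output_format", "mapper_class", "reducer_class", "job_jar"}
    return sorted(required - spec.keys())

print(missing_fields(job_spec))  # []
```

A submission missing any of these fields (for example, one that omits the JAR) would be rejected before the job could be scheduled.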