If you’ve read our article on Big Data Tools and Technologies, you might remember coming across the term “MapReduce”. It’s one of the core components of the Hadoop architecture and forms the whole processing layer of Hadoop.
In this article, we’ll be talking about MapReduce in a bit more depth – but in a beginner-friendly way. For your ease of understanding, we’ve broken down the article as follows:
- Introduction to MapReduce
- Core Concepts
- A Complete SWOT Analysis
Introduction to MapReduce
MapReduce is essentially a concept that allows lightning-fast processing of massive datasets on a distributed cluster. It is a programming model patented by Google that was first adopted by the folks at Apache and is now at the heart of the whole Hadoop ecosystem. Its simplicity is what makes it so effective.
MapReduce follows a very simple logic, one that’s been followed for ages for managing day-to-day tasks – when you have to deal with a lot of material, employing a large number of workers will speed up the process (the age-old adage “too many cooks spoil the broth” doesn’t fit here!).
However, at this point, one question arises – does MapReduce work well ONLY with huge datasets?
The answer to this is very simple – the datasets needn’t be extremely large. However, for economic and computational reasons, it’s recommended to put MapReduce to work only if you have large datasets, ones too large for traditional computation on a single machine. It’s always better to process small datasets on your local machine itself – the way it was done before MapReduce was introduced. Because honestly, using MapReduce for small chunks of data is very much like trying to kill a spider using a machine gun. The spider will get killed, no doubt, but is it worth it?
The Core Concepts of MapReduce
By now, you have a fair idea of what exactly MapReduce is. In this section, we’ll talk a bit more about how exactly MapReduce works – the core concepts that set the wheels in motion.
If you’ve ever encountered basic mathematics in your life (which we hope you have if you’re reading this!), you must be aware of the concept of “ordered pairs”. An ordered pair is simply a way of expressing two pieces of data in (x, y) form. Ordered pairs are very useful for representing coordinates, fractions, employee IDs, and other similar forms of data. MapReduce, too, relies heavily on ordered pairs, which it calls key-value pairs – basically, any data can be “mapped” into the form of an ordered pair, and can be “reduced” in a variety of ways, depending on the problem statement at hand.
Suppose you’re left alone in a digital library containing a hundred thousand books and given the daunting task of finding the number of times a particular word occurs in all the books combined.
Phew, how would you start?
The first approach would be to write a program that’d iterate over every word, right from Book 1 – Word 1, to Book n – Word n, and keep incrementing a counter variable each time you encounter the needed word. Just imagining this would make you realize the amount of time it’ll take to accomplish this task.
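To make the sequential approach concrete, here is a minimal Python sketch of it. The three short strings stand in for the hundred thousand books, and the function name and the simple word-splitting rule are our own assumptions for illustration.

```python
import re

def count_word_sequentially(books, target):
    """Scan every word of every book, one after another, and count matches."""
    counter = 0
    target = target.lower()
    for text in books:                                      # Book 1 ... Book n
        for word in re.findall(r"[a-z']+", text.lower()):   # Word 1 ... Word n
            if word == target:
                counter += 1
    return counter

# Hypothetical sample data standing in for the library.
books = [
    "Hello world, hello again.",
    "Say hello to the library.",
    "No greeting in this one.",
]
total = count_word_sequentially(books, "hello")
print(total)  # 3
```

A single loop like this touches every word exactly once, so its running time grows in direct proportion to the size of the library.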
Now, let’s take a look at how MapReduce will respond to the task:
- MapReduce Requirements:
MapReduce works well in a distributed cluster. What is a distributed cluster, you ask?
Very simply, it is a large number (often thousands!) of commodity computers (low-cost computers) interwoven into a network that performs a single operation.
Now, assume you have one such cluster. For our problem statement, the input to this cluster will be the content of all the books. As soon as we feed this data into our system, it is split into blocks, and each block is copied to several computers (nodes) on the cluster (HDFS keeps three copies of each block by default). This is done to provide a level of fault tolerance – if one node fails, the data can still be read from another.
- Programming In MapReduce
Since MapReduce works in two phases – Map and Reduce, we have to write two programs, Map() and Reduce(), to get a MapReduce workflow up and running. Earlier you had to use Java to write these programs, but now, because of the rapid growth in Data Science, the MapReduce framework has been made flexible enough to handle code written in Python or R too.
The scripts you write will run one after the other, that is, in a successive manner. This brings us to one of the major drawbacks of the MapReduce paradigm – the fact that you cannot run both the scripts in parallel. Accomplishing that would make MapReduce a lot faster, and it has been a subject of extensive research.
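As a rough idea of what the two scripts can look like in Python, here is a sketch in the style of Hadoop Streaming, where the scripts exchange plain text lines of the form key, tab, value. To keep the example runnable on one machine, the functions below take lists of lines instead of reading standard input, and the local sort stands in for the work the framework does between the two phases. The function names are our own.

```python
from itertools import groupby

def mapper(lines):
    """Map script: emit one 'word<TAB>1' line per word read."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Reduce script: sum the counts for each run of identical keys.

    Assumes the input is already sorted by key.
    """
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the cluster: map first, then sort, then reduce.
mapped = list(mapper(["hello big data", "hello again"]))
reduced = list(reducer(sorted(mapped)))
print(reduced)
```

Notice that the reducer cannot start until the mapper output has been fully produced and sorted, which is the successive behaviour described above.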
- Workflow Parallelization:
The input (content of all the books) will be divided into various segments (equal to the number of computers in the cluster, essentially), and each computer will be assigned a particular segment. Suppose we’ve divided the file based on lines; then node 1 might be asked to take care of lines 1 to 10,000, node 2 might be asked to handle lines 10,001 to 20,000, and so on. (Remember, these are just illustrative figures – the segments are much larger in real life.)
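The line-based split can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the input fits in a list, every node gets one contiguous segment, and the helper name `split_into_segments` is hypothetical.

```python
def split_into_segments(lines, num_nodes):
    """Divide `lines` into `num_nodes` contiguous segments of near-equal size."""
    base, extra = divmod(len(lines), num_nodes)
    segments, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)  # spread the remainder evenly
        segments.append(lines[start:start + size])
        start += size
    return segments

lines = [f"line {i}" for i in range(1, 11)]  # 10 lines of hypothetical input
segments = split_into_segments(lines, 3)
for node, segment in enumerate(segments, start=1):
    print(f"node {node}: {segment[0]} .. {segment[-1]} ({len(segment)} lines)")
```

With 10 lines and 3 nodes, the segments hold 4, 3, and 3 lines, and no line is lost or duplicated.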
Each machine will then run your Map() program.
- The Map() Program:
The Map() program does exactly what you were going to do before giving the task to MapReduce – scanning the file and reading one word at a time. The benefit here is that it’ll be running simultaneously on multiple computers, so it’ll cut the time manifold. The output of the Map() function is an ordered pair. In this case, it will simply be (word, count), i.e. when the Map() function encounters the word “hello” (assuming that’s the word we’re looking for), it will simply emit (hello, 1) as an output. Since various parts of the input are being worked on at the same time, you will have a mapped version of your dataset in no time! To appreciate the simplicity of MapReduce, realize that the operation being performed here is fairly basic, requiring no intensive calculations.
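Here is a small Python sketch of a Map() function for the word “hello”. A thread pool stands in for the cluster, with one thread per segment. This is purely an assumption for illustration, since a real cluster runs Map() on separate machines. The segments and the function name are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def map_segment(segment_lines, target="hello"):
    """Map(): return a (word, 1) pair for every occurrence of `target`."""
    pairs = []
    for line in segment_lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            if word == target:
                pairs.append((word, 1))
    return pairs

# Hypothetical segments, one per node.
segments = [
    ["Hello there", "she said hello"],
    ["nothing to see here"],
    ["hello, hello, hello!"],
]

# Threads stand in for cluster nodes; each runs the same Map() on its segment.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_segment, segments))

print(mapped)
```

Each node produces its own list of (hello, 1) pairs without needing to know what the other nodes are doing.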
- Shuffling of the Mapped results:
Once we get the final ordered pairs from the Map() function, the results are shuffled. That simply means that (word, count) pairs with the same word are transferred to a single machine.
Now, if you’re acquainted with programming, you’ll realize that there is a limit to which this can be achieved. There are literally billions of words and only a finite number of computers in the cluster. At this point, it’s important to note that the shuffle phase is not mandatory for MapReduce to function; it simply makes the Reduce() script much faster by arranging things in an order. For now, let’s just say that we shuffled our Map outputs, and now have all the instances of the word “hello” on a single computer in the form of ordered pairs – (hello, 1) for every occurrence, or (hello, n) if the occurrences within one particular book have already been added up, where n is the number of times “hello” occurred in that book.
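The shuffle step can be sketched as a grouping of pairs by their first component. The example below also shows one common way to decide which machine receives a given word when there are more words than machines: hash the word and take the remainder. The function names and the use of CRC32 as the hash are our own assumptions, not a description of any particular framework.

```python
import zlib
from collections import defaultdict

def shuffle(mapped_outputs):
    """Collect the values of all (key, value) pairs under their key."""
    grouped = defaultdict(list)
    for node_output in mapped_outputs:
        for key, value in node_output:
            grouped[key].append(value)
    return dict(grouped)

def assign_reducer(key, num_reducers):
    """Pick a reducer for `key`; the same key always lands on the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Hypothetical Map() outputs from three nodes.
mapped_outputs = [
    [("hello", 1), ("world", 1)],
    [("hello", 1)],
    [("world", 1), ("hello", 1)],
]
grouped = shuffle(mapped_outputs)
print(grouped)  # {'hello': [1, 1, 1], 'world': [1, 1]}
print({word: assign_reducer(word, 2) for word in grouped})
```

Because the assignment depends only on the word itself, many different words can share a machine while every pair for one word still ends up in the same place.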
- Finally, the Reduce() phase:
Finally, we come to the last phase – Reduce(). If you were paying attention to the whole workflow, you might have correctly guessed that Reduce() will just count the number of times each word appears in the input file. All it has to do is add up the second components of our ordered pairs. If there are, say, 200 mentions of the word “hello” in all the books, the final output that we will get after this phase will be (hello, 200). Just by simple counting and addition, we’ve arrived at the result. All of this is made possible by employing a number of computers and making them work simultaneously.
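In code, the Reduce() step for this problem is little more than a sum. The sketch below assumes the pairs for “hello” have already been grouped on one machine. The input is made-up data matching the 200 mentions in the example above, and the function name is our own.

```python
def reduce_counts(word, counts):
    """Reduce(): add up the second components of all pairs for one word."""
    return (word, sum(counts))

# Hypothetical shuffled input: 200 (hello, 1) pairs grouped under one key.
grouped = {"hello": [1] * 200}

results = [reduce_counts(word, counts) for word, counts in grouped.items()]
print(results)  # [('hello', 200)]
```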
MapReduce SWOT Analysis
The MapReduce paradigm has made the lives of Data Scientists easier. Let’s have a look at the strengths, weaknesses, opportunities, and threats of this framework:
|Strengths|Processes massive datasets at lightning-fast speeds by processing them in parallel on distributed clusters.|
|Weaknesses|The Map() and Reduce() scripts run successively. Read/write operations are performed on HDFS, which is slow by comparison.|
|Opportunities|Generating insights from previously discarded sets of data and processing Big Data generated by organizations.|
|Threats|Standalone MapReduce operations might slow down the pace of the Big Data industry without sufficient advancement in analytics.|
Now that you’re familiar with the inner workings of MapReduce, you’ll appreciate that it is by no means a new technology – we have been using similar paradigms to handle large volumes of data, like election results, for a very long time now. You’ll also realize the importance of keeping things simple – MapReduce does nothing fancy at all, and that’s what makes it so powerful.