A Hitchhiker’s Guide to MapReduce

If you’ve read our article on Big Data Tools and Technologies, you might remember coming across the term “MapReduce”. It’s one of the core components of the Hadoop architecture and forms the whole processing layer of Hadoop.

In this article, we’ll be talking about MapReduce in a bit more depth – but in a beginner-friendly way. For your ease of understanding, we’ve broken down the article as follows:

  • Introduction
  • Core Concepts
  • A Complete SWOT Analysis

Introduction to MapReduce

MapReduce is essentially a programming model that allows massive datasets to be processed quickly by spreading the work across a distributed cluster. It was introduced and patented by Google, later adopted by the folks at Apache, and is now at the heart of the whole Hadoop ecosystem. Its simplicity is what makes it so effective.

MapReduce follows a very simple logic, one that’s been followed for ages for managing day-to-day tasks – when you have to deal with a lot of material, employing a large number of workers will speed up the process (the age-old adage “too many cooks spoil the broth” doesn’t fit here!).

However, at this point, one question arises – does MapReduce work well ONLY with huge datasets?

The answer to this is very simple – the datasets needn’t necessarily be extremely large. However, for economic and computational reasons, it’s recommended to put MapReduce to work only if you have large datasets, too large for traditional computations on a single machine. It’s always better to process small datasets on your local machine itself – the way it was done before MapReduce was introduced. Because honestly, using MapReduce for small chunks of data is very much like trying to kill a spider using a machine gun. The spider will get killed, no doubt, but is it worth it?


The Core Concepts of MapReduce

By now, you have a fair idea of what MapReduce is. In this section, we’ll talk a bit more about how MapReduce works – the core concepts that get the wheels in motion.

If you’ve ever encountered basic mathematics in your life (which we hope you have if you’re reading this!), you must be aware of the concept of “ordered pairs”. An ordered pair is simply a way of expressing two pieces of data in (x,y) form. Ordered pairs are very useful for representing coordinates, fractions, employee IDs, and other similar forms of data. MapReduce, too, relies heavily on the concept of ordered pairs – basically, any data can be “mapped” into the form of an ordered pair, and can be “reduced” in a variety of ways, depending on the problem statement at hand.
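
To make the idea of “mapping” data into ordered pairs and “reducing” them in different ways a little more concrete, here is a minimal sketch in plain Python. The sales records, the `to_pair` and `reduce_pairs` helpers, and the region names are all made up for illustration; this is not how any particular framework does it.

```python
# Hypothetical raw records: "region,amount"
records = ["north,120", "south,80", "north,45", "south,200"]

def to_pair(record):
    """Map one raw record into an ordered pair (region, amount)."""
    region, amount = record.split(",")
    return (region, int(amount))

def reduce_pairs(pairs, combine):
    """Collapse all pairs that share a key, using the given combine rule."""
    result = {}
    for key, value in pairs:
        result[key] = combine(result[key], value) if key in result else value
    return result

pairs = [to_pair(r) for r in records]

# The same pairs can be reduced in more than one way.
totals = reduce_pairs(pairs, lambda a, b: a + b)  # sum per region
largest = reduce_pairs(pairs, max)                # biggest sale per region

print(totals)
print(largest)
```

The mapping step never changes here; only the rule passed to `reduce_pairs` does, which is the sense in which the reduction depends on the problem at hand.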

Suppose you’re left alone in a digital library containing a hundred thousand books and given a daunting task of finding the number of times a particular word occurs in all the books combined.

Phew, how would you start?

The first approach would be to write a program that iterates over every word, right from the first word of Book 1 to the last word of Book n, and increments a counter variable each time it encounters the needed word. Just imagining this would make you realize the amount of time it’ll take to accomplish this task.
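
The single-machine approach described above might look roughly like this. It is only a sketch: the two tiny “books”, the function name `count_word_sequential`, and the simple regular expression used to split text into words are assumptions made for the example.

```python
import re

def count_word_sequential(books, target):
    """Scan every word of every book, one after another, on one machine."""
    counter = 0
    for text in books:
        # Lowercase the text and pull out runs of letters as "words".
        for word in re.findall(r"[a-z']+", text.lower()):
            if word == target:
                counter += 1
    return counter

# Two made-up "books" standing in for the hundred thousand.
books = ["Hello world, hello again.", "Say hello to the library."]
print(count_word_sequential(books, "hello"))
```

Every book has to wait its turn behind the previous one, so the running time grows in step with the total number of words.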

Now, let’s take a look at how MapReduce will respond to the task:

  • MapReduce Requirements:

MapReduce works well in a distributed cluster. What is a distributed cluster, you ask?
Very simply, it is a large number (often thousands!) of commodity computers (low-cost computers) interwoven into a network that performs a single operation.

Now, assume you have one such cluster. For our problem statement, the input to this cluster will be the content of all the books. As soon as we feed this data into our system, it is split into blocks, and each block is copied onto several computers (nodes) in the cluster (HDFS keeps three copies of each block by default). This is done to provide a level of fault tolerance – if one node fails, the data can still be read from another copy.
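
To make the fault-tolerance idea concrete, here is a toy sketch of keeping several copies of each piece of the data on different nodes. The names `place_replicas` and `all_blocks_readable`, the round-robin placement, and the five-node cluster are assumptions made for illustration; real HDFS uses its own placement policy, with a default of three copies per block.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

def all_blocks_readable(placement, failed_nodes):
    """True if every block still has at least one copy on a live node."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(num_blocks=4, nodes=nodes)

print(placement[0])
print(all_blocks_readable(placement, {"node1", "node2"}))
print(all_blocks_readable(placement, {"node1", "node2", "node3"}))
```

With three copies per block, losing any two nodes still leaves every block readable in this toy layout; losing all three nodes that hold the same block does not.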

  • Programming In MapReduce

Since MapReduce works in two phases – Map and Reduce, we have to write two programs, Map() and Reduce(), to get a MapReduce workflow up and running. Earlier you had to use Java to write these programs, but now, because of the rapid growth in Data Science, the MapReduce framework has been made flexible enough to handle code written in Python or R too (for example, through Hadoop Streaming).

The scripts you write will run one after the other, that is, in a successive manner: Reduce() cannot produce its result until the Map() output it depends on is ready. This brings us to one of the major drawbacks of the MapReduce paradigm – the fact that you cannot run both the scripts in parallel. Accomplishing that would make MapReduce a lot faster, and it has been a subject of extensive research.
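
As a rough idea of what the two programs can look like in Python, here is a sketch in the style used with Hadoop Streaming, where records travel between steps as lines of text with a tab between key and value. The function names `mapper` and `reducer` and the sample sentences are assumptions for the example, and the framework’s sort step is imitated here with Python’s `sorted`.

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit one 'word<TAB>1' line for every word seen."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Reduce step: expects input sorted by word, so equal words are adjacent."""
    parsed = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Run the two steps one after the other, as the framework would.
map_output = list(mapper(["hello world", "hello again"]))
reduce_output = list(reducer(sorted(map_output)))

print(reduce_output)
```

In an actual streaming job the two functions would typically live in separate scripts that read from standard input and print to standard output; they are kept in one file here so the example can be run directly.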

  • Workflow Parallelization:

The input (content of all the books) will be divided into various segments (equal to the number of computers in the cluster, essentially), and each computer will be assigned a particular segment. Suppose we’ve divided the file based on lines; then node 1 might be asked to take care of lines 1 to 10,000, node 2 might be asked to handle lines 10,001 to 20,000, and so on. (Remember these are just illustrative figures – the segments are much larger in real life.)
Each machine will then run your Map() program.
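
The line-based division described above can be sketched as follows. The helper name `split_into_segments` and the ten-line input are assumptions for illustration; a real framework decides its own split sizes.

```python
def split_into_segments(lines, num_nodes):
    """Divide lines into num_nodes contiguous segments of near-equal size."""
    base, extra = divmod(len(lines), num_nodes)
    segments, start = [], 0
    for node in range(num_nodes):
        # The first `extra` nodes take one additional line each.
        size = base + (1 if node < extra else 0)
        segments.append(lines[start:start + size])
        start += size
    return segments

lines = [f"line {i}" for i in range(1, 11)]  # ten made-up input lines
segments = split_into_segments(lines, num_nodes=3)

for node, segment in enumerate(segments, start=1):
    print(f"node {node}: {segment[0]} to {segment[-1]}")
```

Each segment is a contiguous run of lines, so no line is dropped or handed to two nodes.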

  • The Map() Program:

The Map() program does exactly what you were going to do before giving the task to MapReduce – scanning the file and reading one word at a time. The benefit here is that it’ll be running simultaneously on multiple computers, so it’ll cut the time many times over. The output of the Map() function is an ordered pair. In this case, it will simply be (word, count), i.e. when the Map() function encounters the word “hello” (assuming that’s the word we’re looking for), it will simply emit (hello, 1) as an output. Since various parts of the input are being worked on at the same time, you will have a mapped version of your dataset in no time! To appreciate the simplicity of MapReduce, realize that the operation being performed here is fairly basic, requiring no intensive calculations.
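
Here is a small sketch of that Map step running on several segments at once. Threads from Python’s standard library stand in for the separate computers of a cluster, which is a simplification; the name `map_segment`, the target word, and the sample segments are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor

TARGET = "hello"              # hypothetical word we are searching for
PUNCTUATION = '.,!?;:"\'()'   # stripped from the ends of each token

def map_segment(segment):
    """Emit (word, 1) every time the target word appears in this segment."""
    pairs = []
    for line in segment:
        for raw in line.lower().split():
            word = raw.strip(PUNCTUATION)
            if word == TARGET:
                pairs.append((word, 1))
    return pairs

# Three made-up segments, one per pretend node.
segments = [
    ["Hello there", "say hello"],
    ["nothing to see here"],
    ["hello, hello!"],
]

# Each worker thread plays the role of one node running Map().
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    mapped = list(pool.map(map_segment, segments))

print(mapped)
```

Note that no adding up happens at this stage: each occurrence simply becomes its own (hello, 1) pair.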

  • Shuffling of the Mapped results:

Once we get the final ordered pairs from the Map() function, the results are shuffled. That simply means that (word, count) pairs with the same word are transferred to a single machine.
Now, if you’re acquainted with programming, you’ll realize that we cannot give every word a machine of its own: there are literally billions of words and only a finite number of computers in the cluster, so each machine ends up handling many different words. At this point, it’s important to note that a job with no Reduce() step can skip the shuffle altogether; when there is a Reduce() step, the shuffle is what lets it work quickly, by guaranteeing that all the pairs for a given word arrive together at one machine. For now, let’s just say that we shuffled our Map outputs, and now have all the instances of the word “hello” on a single computer in the form of ordered pairs – (hello, n), where n is the count contributed by one part of the input, such as one particular book.
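
One common way to decide which machine receives which word is to hash the word and take the remainder by the number of reducer machines. The sketch below assumes that scheme; the names `partition` and `shuffle`, the use of CRC32 as the hash, and the sample Map outputs are illustrative choices rather than what any specific framework does.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key; the same key always gets the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapped_outputs, num_reducers):
    """Route every (key, value) pair to its reducer and group values by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers

# Made-up Map outputs from three nodes.
mapped_outputs = [
    [("hello", 1), ("world", 1)],
    [("hello", 1)],
    [("world", 1), ("hello", 1)],
]

shuffled = shuffle(mapped_outputs, num_reducers=2)
for index, groups in enumerate(shuffled):
    print(f"reducer {index}: {dict(groups)}")
```

Because the hash of a word never changes, every pair for “hello” ends up on one reducer, even though a reducer may be responsible for many different words.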

  • Finally, the Reduce() phase:

Finally, we come to the last phase – Reduce(). If you were paying attention to the whole workflow, you might have correctly guessed that Reduce() will just count the number of times each word appears in the input. All it has to do is add up the second components of our ordered pairs. If there are, say, 200 mentions of the word “hello” in all the books, the final output that we will get after this phase will be – (hello, 200). Just by simple counting and addition, we’ve arrived at the result, and it was all made possible by employing a number of computers and making them work simultaneously.
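
The Reduce step itself is little more than a sum. Here is a minimal sketch, assuming the shuffle has already grouped the partial counts by word; the name `reduce_counts` and the numbers are made up, chosen so that “hello” adds up to the 200 used in the text.

```python
def reduce_counts(word, partial_counts):
    """Reduce step: add up the second component of every (word, n) pair."""
    return (word, sum(partial_counts))

# Hypothetical shuffled input: each word with the counts that arrived for it.
shuffled = {"hello": [120, 50, 30], "world": [7, 3]}

results = [reduce_counts(word, counts) for word, counts in shuffled.items()]
print(results)
```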


MapReduce SWOT Analysis

The MapReduce paradigm has made the lives of Data Scientists easier. Let’s have a look at the strengths, weaknesses, opportunities, and threats of this framework:

  • Strengths: Processes massive datasets quickly by working on them in parallel across distributed clusters.
  • Weaknesses: The Map() and Reduce() scripts run successively. Read/write operations are performed on HDFS, which is comparatively slow because it goes to disk.
  • Opportunities: Generating insights from previously discarded sets of data and processing Big Data generated by organizations.
  • Threats: Standalone MapReduce operations might slow down the pace of the Big Data industry without sufficient advancement in analytics.


In Conclusion…
Now that you’re familiar with the inner workings of MapReduce, you’ll appreciate that it is by no means a new idea – we have been using similar paradigms to handle large volumes of data, like election results, for a very long time now. You’ll also realize the importance of keeping things simple – MapReduce does nothing fancy at all, and that’s what makes it so powerful.

If you are interested to know more about Big Data, check out our Advanced Certificate Programme in Big Data from IIIT Bangalore.


What is the Hitchhiker’s Guide all about?

The Hitchhiker's Guide to the Galaxy is a satirical science fiction novel by Douglas Adams. It is all about the misadventures of its permanently confused human central character, Arthur Dent. After the Earth is destroyed, Dent sets out to wander the universe along with alien travel author Ford Prefect and some other odd creatures. This science fiction story says more about modern times on Earth than about the future. And even though it has been more than 40 years since the book was first published, its themes remain relatable even today. It explores various aspects of modern life and is one of the best-loved bestsellers of all time.

What is MapReduce?

MapReduce is a programming model within the Hadoop framework, used to work with Big Data stored in HDFS (Hadoop Distributed File System). Using the MapReduce framework, developers can write programs that process huge volumes of data, i.e., Big Data, in a reliable way. This framework is based on Java and contains two parts – Map and Reduce. It offers a tremendous advantage of scalability, which is an essential feature in applications that deal with Big Data. This is a simple yet very powerful capability that MapReduce offers, and this is what makes it a hugely popular framework among programmers.

Is Hadoop linked to the Cloud?

Hadoop can be described as an ecosystem of open-source software applications that allows distributed computing marked by scalability and reliability. The scalable and distributed computing capabilities of Hadoop make it ideal for storing and processing Big Data. The Cloud is well suited to provide the computational power that is needed to process huge volumes of data in parallel. The Cloud offers the flexibility and agility required to accommodate increasing volumes of Big Data as and when needed. Hadoop as a service is available on cloud platforms. Hadoop on the cloud is ideal for quickly processing medium to large-scale data.
