
A Hitchhiker’s Guide to MapReduce

Last updated:
5th Feb, 2018

If you’ve read our article on Big Data Tools and Technologies, you might remember coming across the term “MapReduce”. It’s one of the core components of the Hadoop architecture and forms the whole processing layer of Hadoop.

In this article, we’ll be talking about MapReduce in a bit more depth – but in a beginner-friendly way. For your ease of understanding, we’ve broken down the article as follows:

  • Introduction
  • Core Concepts
  • A Complete SWOT Analysis

Introduction to MapReduce

MapReduce is essentially a programming model that allows fast processing of massive datasets on a distributed cluster. It was introduced by Google, which also holds a patent on it, and was then adopted by the Apache Hadoop project; it now sits at the heart of the whole Hadoop ecosystem. Its simplicity is what makes it so effective.

MapReduce follows a very simple logic, one that’s been used for ages to manage day-to-day tasks – when you have a lot of material to deal with, employing a large number of workers speeds up the process (the age-old adage “too many cooks spoil the broth” doesn’t fit here!).


However, at this point, one question arises – does MapReduce work well ONLY with huge datasets?

The answer is simple – the datasets needn’t be extremely large. However, for economic and computational reasons, it’s recommended to put MapReduce to work only when your datasets are too large for traditional computation on a single machine. It’s always better to process small datasets on your local machine itself – the way it was done before MapReduce was introduced. Because honestly, using MapReduce for small chunks of data is very much like trying to kill a spider using a machine gun. The spider will get killed, no doubt, but is it worth it?


The Core Concepts of MapReduce

By now, you have a fair idea of what exactly MapReduce is. In this section, we’ll talk a bit more about how exactly MapReduce works – the core concepts that get the wheels in motion.

If you’ve ever encountered basic mathematics in your life (which we hope you have if you’re reading this!), you must be aware of the concept of “ordered pairs”. An ordered pair is simply a way of expressing two pieces of data in (x, y) form. Ordered pairs are very useful for representing coordinates, fractions, employee IDs, and other similar forms of data. MapReduce, too, relies heavily on ordered pairs, which it treats as (key, value) pairs – basically, any data can be “mapped” into the form of an ordered pair, and the pairs can then be “reduced” in a variety of ways, depending on the problem statement at hand.
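To make the idea concrete, here is a minimal single-machine sketch in Python. The sales records, the `to_pair` helper and the `reduce_pairs` helper are all made up for illustration; they are not part of Hadoop or any library. The point is only that the same pairs can be reduced in more than one way.

```python
# Hypothetical raw records of the form "city,amount".
records = ["delhi,120", "mumbai,80", "delhi,60"]

def to_pair(record):
    """Map one raw record to an ordered (key, value) pair."""
    city, amount = record.split(",")
    return (city, int(amount))

pairs = [to_pair(r) for r in records]

def reduce_pairs(pairs, combine):
    """Fold all values that share a key into one value using combine."""
    result = {}
    for key, value in pairs:
        result[key] = combine(result[key], value) if key in result else value
    return result

# The same mapped pairs, reduced in two different ways.
totals = reduce_pairs(pairs, lambda a, b: a + b)   # sum per city
largest = reduce_pairs(pairs, max)                 # largest sale per city

print(totals)
print(largest)
```

Which reduction you pick depends entirely on the question being asked; the mapping step stays the same.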

Suppose you’re left alone in a digital library containing a hundred thousand books and given a daunting task of finding the number of times a particular word occurs in all the books combined.

Phew, how would you start?

The first approach would be to write a program that iterates over every word, from Book 1, Word 1 all the way to the last word of Book n, and increments a counter variable each time it encounters the needed word. Just imagining this makes you realize how long a single machine would take to accomplish the task.
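For reference, that naive approach might look something like the sketch below. The three short strings stand in for books and are purely hypothetical, as is the function name.

```python
def count_word_sequentially(books, target):
    """Scan every word of every book in order and count matches of target."""
    counter = 0
    for book in books:                  # Book 1 ... Book n
        for word in book.split():       # every word in that book
            if word.lower() == target:
                counter += 1
    return counter

# Hypothetical miniature "library": each string stands in for one book.
books = ["hello world", "Hello again hello", "goodbye world"]
print(count_word_sequentially(books, "hello"))  # prints 3
```

Nothing is wrong with this code; the trouble is that one machine has to touch every single word, one after another.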

Now, let’s take a look at how MapReduce will respond to the task:

  • MapReduce Requirements:

MapReduce works well in a distributed cluster. What is a distributed cluster, you ask?
Very simply, it is a large number (often thousands!) of commodity computers (low-cost computers) interwoven into a network that performs a single operation.

Now, assume you have one such cluster. For our problem statement, the input to this cluster will be the content of all the books. As soon as we feed this data into our system, it is split into blocks, and each block is copied onto several computers (nodes) in the cluster; HDFS keeps three copies of every block by default. This is done to provide a level of fault tolerance – to help you in case of data loss.

  • Programming In MapReduce

Since MapReduce works in two phases – Map and Reduce – we have to write two programs, Map() and Reduce(), to get a MapReduce workflow up and running. Earlier you had to use Java to write these programs, but now, because of the rapid growth in Data Science, the MapReduce framework has been made flexible enough to handle code written in Python or R too.
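As a rough idea of what the two programs can look like in Python: Hadoop Streaming, the utility that lets scripts act as the mapper and the reducer, hands records to the scripts as lines of text, with key and value separated by a tab by default. The sketch below imitates that shape with plain functions over lists of lines, so it runs on its own without Hadoop. The function names and the sample lines are assumptions made for illustration, and the `sorted()` call merely stands in for the sorting the framework would do between the two programs.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word.

    Assumes the lines arrive sorted by word.
    """
    parsed = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# sorted() stands in for the sort the framework performs between phases.
mapped = sorted(mapper(["hello world", "hello again"]))
result = list(reducer(mapped))
print(result)
```

In a real streaming job the two functions would live in separate scripts that read standard input and write standard output.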

The scripts you write will run one after the other, that is, in a successive manner. This brings us to one of the major drawbacks of the MapReduce paradigm – the two phases cannot run in parallel, because Reduce() cannot start producing results until every Map() task has finished. Overlapping them would make MapReduce a lot faster, and that has been a subject of extensive research.
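The ordering constraint can be imitated on a single machine. In this sketch, threads from Python’s standard `concurrent.futures` module stand in for cluster nodes (an assumption made purely for illustration; a real cluster uses separate machines), and the segment strings are made up. The map tasks run side by side, but the reduce step only begins once all of them have returned.

```python
from concurrent.futures import ThreadPoolExecutor

def map_segment(segment):
    """Count occurrences of 'hello' in one segment of text."""
    return sum(1 for word in segment.split() if word == "hello")

def reduce_counts(counts):
    """Combine the per-segment counts into one total."""
    return sum(counts)

# Hypothetical segments, one per worker.
segments = ["hello world hello", "no match here", "hello again"]

# Map phase: the segments are processed concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    # list() waits for every map task, so nothing below starts early.
    partial_counts = list(pool.map(map_segment, segments))

# Reduce phase: starts only after the whole map phase has finished.
total = reduce_counts(partial_counts)
print(partial_counts, total)
```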

  • Workflow Parallelization:

The input (content of all the books) will be divided into various segments (equal to the number of computers in the cluster, essentially), and each computer will be assigned a particular segment. Suppose we’ve divided the file based on lines; then node 1 might be asked to take care of lines 1 to 10,000, node 2 might be asked for lines 10,001 to 20,000, and so on. (Remember, these are just illustrative figures – the segments are much larger in real life.)
Each machine will then run your Map() program.
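The line-based split described above can be sketched in a few lines of Python. The helper name `split_into_segments` and the ten sample lines are hypothetical; in a real cluster the framework decides how the input is split, so treat this only as a picture of the idea.

```python
def split_into_segments(lines, num_nodes):
    """Divide lines into num_nodes contiguous segments of near-equal size."""
    base, extra = divmod(len(lines), num_nodes)
    segments, start = [], 0
    for node in range(num_nodes):
        # The first `extra` nodes take one additional line each.
        size = base + (1 if node < extra else 0)
        segments.append(lines[start:start + size])
        start += size
    return segments

lines = [f"line {i}" for i in range(1, 11)]   # 10 hypothetical lines
segments = split_into_segments(lines, 3)
print([len(s) for s in segments])  # prints [4, 3, 3]
```

Every line lands in exactly one segment, and the segments stay in the original order.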

  • The Map() Program:

The Map() program does exactly what you were going to do before giving the task to MapReduce – scanning the file and reading one word at a time. The benefit here is that it’ll be running simultaneously on multiple computers, so it’ll cut the time many times over. The output of the Map() function is an ordered pair. In this case, it will simply be (word, count), i.e. when the Map() function encounters the word “hello” (assuming that’s the word we’re looking for), it will simply emit (hello, 1) as an output. Since various parts of the dataset are being worked on at the same time, you will have a mapped version of your dataset in no time! To appreciate the simplicity of MapReduce, realize that the operation being performed here is fairly basic and requires no intensive calculations.
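A Map() function for this word hunt could be as small as the sketch below. The name `map_words`, the punctuation handling and the sample segment are illustrative choices, not anything prescribed by Hadoop.

```python
def map_words(segment, target):
    """Emit (target, 1) for every occurrence of target in the segment."""
    pairs = []
    for line in segment:
        for word in line.lower().split():
            # Ignore punctuation stuck to the word, e.g. "hello,".
            if word.strip('.,!?;:"') == target:
                pairs.append((target, 1))
    return pairs

# Hypothetical segment handed to one node.
segment = ["Hello, world!", "hello hello"]
print(map_words(segment, "hello"))
# prints [('hello', 1), ('hello', 1), ('hello', 1)]
```

Each node runs the same function on its own segment and knows nothing about what the other nodes found.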

  • Shuffling of the Mapped results:

Once we get the final ordered pairs from the Map() function, the results are shuffled. That simply means that (word, count) pairs with the same word are transferred to a single machine.
Now, if you’re acquainted with programming, you’ll realize that there is a limit to how far this can be taken. There are literally billions of words and only a finite number of computers in the cluster, so each machine ends up handling many different words rather than just one. At this point, it’s important to note that the shuffle is skipped only for jobs that have no Reduce() step at all; whenever there is a Reduce(), the shuffle is what groups and sorts the pairs by word so that Reduce() receives all the counts for a word together. For now, let’s just say that we shuffled our Map outputs, and now have all the instances of the word “hello” in a single computer in the form of ordered pairs – (hello, n), where n is the number of times “hello” occurred in one particular book (this assumes the (hello, 1) pairs from each book have already been added up locally, an optional step that Hadoop calls a combiner).
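Here is a small sketch of what shuffling amounts to: grouping values by key, and deciding which machine each key goes to. The function names and the sample data are hypothetical. Sending a key to the machine numbered “hash of the key modulo the number of reducers” mirrors the idea behind Hadoop’s default partitioner, but the CRC32 hash used here is just a convenient stand-in from Python’s standard library.

```python
import zlib
from collections import defaultdict

def shuffle(mapped_outputs):
    """Group values by key across the outputs of all map tasks."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(grouped)

def assign_reducer(key, num_reducers):
    """Pick a reducer for a key; the same key always maps to the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Hypothetical map outputs from three books, already counted per book.
mapped_outputs = [
    [("hello", 120), ("world", 7)],
    [("hello", 50)],
    [("hello", 30), ("world", 2)],
]

grouped = shuffle(mapped_outputs)
print(grouped)  # prints {'hello': [120, 50, 30], 'world': [7, 2]}
```

Because `assign_reducer` depends only on the key, all the pairs for “hello” end up on one machine no matter which node produced them.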

  • Finally, the Reduce() phase:

Finally, we come to the last phase – Reduce(). If you were paying attention to the whole workflow, you might have correctly guessed that Reduce() just totals the number of times each word appears in the input. All it has to do is add up the second components of our ordered pairs. If there are, say, 200 mentions of the word “hello” in all the books, the final output that we will get after this phase will be (hello, 200). Just by simple counting and addition, we’ve arrived at the result, all made possible by employing a number of computers and making them work simultaneously.
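In code, the Reduce() step for this problem is little more than a sum. In the sketch below, the function name and the per-book counts are made up, chosen so that they add up to the 200 mentions used in the example above.

```python
def reduce_counts(word, counts):
    """Add up all the counts collected for one word."""
    return (word, sum(counts))

# Hypothetical per-book counts for "hello", as delivered by the shuffle.
counts_for_hello = [120, 50, 30]
print(reduce_counts("hello", counts_for_hello))  # prints ('hello', 200)
```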


MapReduce SWOT Analysis

The MapReduce paradigm has made the lives of Data Scientists easier. Let’s have a look at the strengths, weaknesses, opportunities, and threats of this framework:

  • Strengths: Processes massive datasets at high speed by working on them in parallel across distributed clusters.
  • Weaknesses: The Map() and Reduce() scripts run successively. Read/write operations are performed on HDFS, which is slow compared with working in memory.
  • Opportunities: Generating insights from previously discarded sets of data and processing Big Data generated by organizations.
  • Threats: Standalone MapReduce operations might slow down the pace of the Big Data industry without sufficient advancement in analytics.


In Conclusion…
Now that you’re familiar with the inner workings of MapReduce, you’ll appreciate that the idea behind it is by no means new – we have been using similar paradigms to handle large volumes of data, like election results, for a very long time now. You’ll also realize the importance of keeping things simple – MapReduce does nothing fancy at all, and that’s what makes it so powerful.

If you are interested to know more about Big Data, check out our Advanced Certificate Programme in Big Data from IIIT Bangalore.



Mohit Soni

Blog Author
Mohit Soni is working as the Program Manager for the BITS Pilani Big Data Engineering Program. He has been working with the Big Data Industry and BITS Pilani for the creation of this program. He is also an alumnus of IIT Delhi.

Frequently Asked Questions (FAQs)

1. What is the Hitchhiker’s Guide all about?

The Hitchhiker's Guide to the Galaxy is a satirical science fiction novel by Douglas Adams. It is all about the misadventures of its permanently confused human central character, Arthur Dent. After the Earth is destroyed, Dent sets out to wander the universe along with alien travel author Ford Prefect and some other odd creatures. The story is more about modern life on Earth than about the future. And even though it has been more than 40 years since the book was first published, its themes remain relatable even today. It explores various aspects of modern life and is one of the best-loved bestsellers of all time.

2. What is MapReduce?

MapReduce is a model for programming or a pattern in the Hadoop framework used to work with Big Data stored in HDFS (Hadoop Distributed File System). Using this MapReduce framework, developers can write programs that can help process huge volumes of data, i.e., Big Data, in a reliable way. This framework is based on Java and contains two parts – Map and Reduce. It offers a tremendous advantage of scalability, which is an essential feature in applications that deal with Big Data. This is a simple yet very powerful capability that MapReduce offers, and this is what makes it a hugely popular framework among programmers.

3. Is Hadoop linked to the Cloud?

Hadoop can be described as an ecosystem of open-source software applications that allows scalable and reliable distributed computing. These capabilities make Hadoop ideal for storing and processing Big Data. The Cloud is well suited to provide the computational power needed to process huge volumes of data in parallel, and it offers the flexibility and agility required to accommodate increasing volumes of Big Data as and when needed. Hadoop is available as a service on cloud platforms, and Hadoop on the cloud is ideal for quickly processing medium to large-scale data.
