The What’s What of Data Warehousing and Data Mining

Last updated:
21st Feb, 2018

Before Big Data took the world by storm, enterprise data was stored in information silos that were physically separate from one another, with each silo serving a specialized function. With today's data volumes, it is practically impossible to keep working that way. Just imagine the number of data extracts you would need from so many physically separated silos, only to run a simple query. The cause is the massive pile of data that organizations now hold.

Let’s take a close look at how Data Warehousing and Data Mining enter the scene. Data Warehouses were developed to combat this problem of data storage. Essentially, a Data Warehouse can be thought of as a unified repository of data that comes from various sources and in various formats. Data Mining, on the other hand, is the process of extracting knowledge from that Data Warehouse.

In this article, we’ll take a detailed look at Data Warehousing and Data Mining. For better understanding, we’ve structured the article as follows:

  • What is Data Warehousing?
  • Data Warehouse Processes
  • What is Data Mining?
  • KDD Process
  • Real Life Use-Cases of Data Mining

What is Data Warehousing?

A Data Warehouse can be defined as a subject-oriented, time-variant, non-volatile, and integrated collection of data. It can also include data compiled from external sources. The purpose of a Warehouse is to support analysis and business decisions by reporting data at different levels of aggregation. Before moving further, let’s first look at what these terms mean in the context of a Data Warehouse:

  • Subject-Oriented

    Organizations can use the Data Warehouse to analyze a specific subject area. Suppose you want to see how well your sales team has performed in the last 5 years – you can query your Warehouse, and it’ll tell you all you need to know. In this case, “sales” can be treated as a subject.

  • Time-Variant

    Data Warehouses are responsible for storing historical data for organizations. For example, a transaction system can hold the most recent address of a customer, but a Data Warehouse will hold all the previous addresses too. It continuously keeps adding data from various sources, apart from keeping the historical data – that’s what makes it a time-variant model. The data stored will always vary with time.

  • Non-Volatile

    Once data is stored in a Data Warehouse, it can’t be altered or modified. We can only add a modified copy of the data we want to modify.

  • Integrated

    As we said earlier, a Data Warehouse holds data from multiple sources. Say we have two data sources – A and B. Both the sources might have completely different types of data stored in them, but when they are brought to a Warehouse, they’re made to undergo preprocessing. That is how a Data Warehouse integrates data from a number of sources.
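To make the time-variant and non-volatile properties concrete, here is a minimal Python sketch of an append-only address history. The `AddressRecord` and `AddressHistory` names, the customer ID, and the sample addresses are all made up for illustration; a real warehouse would keep this in database tables, not in a Python list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AddressRecord:
    """One immutable row in a hypothetical customer-address history table."""
    customer_id: int
    address: str
    valid_from: date


class AddressHistory:
    """Append-only store: rows are added, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def record_change(self, customer_id, address, valid_from):
        # A "modification" is stored as a new row; older rows stay untouched.
        self._rows.append(AddressRecord(customer_id, address, valid_from))

    def history(self, customer_id):
        # Every address the customer has had, oldest first.
        rows = [r for r in self._rows if r.customer_id == customer_id]
        return sorted(rows, key=lambda r: r.valid_from)

    def current(self, customer_id):
        # The latest row is what a transaction system would show.
        rows = self.history(customer_id)
        return rows[-1] if rows else None


warehouse = AddressHistory()
warehouse.record_change(42, "12 Lake Road", date(2015, 3, 1))
warehouse.record_change(42, "7 Hill Street", date(2017, 9, 15))

print(warehouse.current(42).address)
print([r.address for r in warehouse.history(42)])
```

The transaction system's view corresponds to `current()`, while the warehouse's view corresponds to `history()`: nothing is overwritten, so the older address is still there to be analyzed.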

Data Warehouse Processes

[Figure: Data Warehousing and Data Mining]

Take a look at the above image. The data that is collected from various sources (operational system, ERP, CRM, Flat Files, etc.) is made to undergo an ETL process before it’s inserted into the data warehouse. This is essentially done to remove anomalies, if any, from the data – so that no harm is caused to the Data Warehouse. ETL stands for – Extraction, Transformation, and Loading. Let’s have a look at each of these processes in detail. To understand better, we’ll use an analogy – think of a gold rush and read on!

  • Extraction

    Extraction is essentially done to collect all the required data from the source systems using as few resources as possible.

Think of this step as panning the river in search of the biggest gold nuggets you can find.

  • Transformation

    The main aim is to insert the extracted data into the database in a general format. This is because different sources will have different formats of storing the data – for example, one data source might have data in “dd/mm/yyyy” format, and the other might have it in “dd-mm-yy” format. In this step, we’ll convert this into a generalized format – one that’ll be used for data from all the sources.

Now you have a gold nugget. What do you do? Melt it down and remove the impurities.

  • Loading

    In this step, the transformed data is loaded into the target database.

Now you have pure gold – mould it into a ring and sell it away!
The process of bringing data from various sources and storing it in the Data Warehouse (after the ETL process, of course), is what is known as Data Warehousing.
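
The three ETL steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the two in-memory sources, the customer names, and the choice of ISO `yyyy-mm-dd` as the common date format are ours, and a plain list stands in for the target database.

```python
from datetime import datetime

# Hypothetical source extracts; each source writes dates in its own format.
SOURCE_A = [{"customer": "Asha", "joined": "21/02/2018"}]
SOURCE_B = [{"customer": "Ravi", "joined": "05-11-17"}]

# The two date formats mentioned above: dd/mm/yyyy and dd-mm-yy.
KNOWN_FORMATS = ("%d/%m/%Y", "%d-%m-%y")


def extract(*sources):
    """Extraction: pull the rows we need from every source system."""
    for source in sources:
        yield from source


def to_iso_date(text):
    """Try each known format and return the date as yyyy-mm-dd."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def transform(rows):
    """Transformation: bring every row into one common format."""
    for row in rows:
        yield {"customer": row["customer"], "joined": to_iso_date(row["joined"])}


def load(rows, target):
    """Loading: append the cleaned rows to the target table."""
    target.extend(rows)
    return target


warehouse_table = load(transform(extract(SOURCE_A, SOURCE_B)), [])
for row in warehouse_table:
    print(row)
```

Rows with a date in neither format raise an error instead of slipping through, which is one simple way to keep anomalies out of the Warehouse.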
Now, you have your data in place – all cleaned up and ready to go. What should be the next step? Extracting knowledge – yes!

Data Mining to the rescue!

What is Data Mining?

Data Mining is, quite simply, the process of extracting previously unknown but potentially useful information from data sets. By “previously unknown”, we mean knowledge that can be acquired only after deeply mining the data warehouse, i.e., knowledge that isn’t visible on the surface. Data Mining essentially searches for the relationships and global patterns that exist between the data elements.

For example, imagine you run a supermarket. A customer’s purchase history might not seem to reveal a lot on the surface, but when it is analyzed carefully for patterns, it can give away a great deal. If you haven’t guessed it yet, we’re talking about Target, a retail chain that reportedly figured out that a teenage customer was pregnant just by carefully studying her purchase history and looking for trends and patterns. Information that looked trivial on the surface turned out to be of great value when mined carefully, and that is exactly what we mean by “previously unknown knowledge”.

It would be unfair to give you a flavor of Data Warehousing and Data Mining and completely ignore the big picture: Knowledge Discovery in Databases (KDD). Data Mining forms one of the steps of a KDD process. Let’s talk a bit more about KDD.

Knowledge Discovery In Databases (KDD)

Data mining is one of the more crucial steps in the process of KDD. KDD basically covers everything from the selection of data to finally evaluating the mined data. The complete KDD cycle is shown in the image below:

[Figure: Data Warehousing and Data Mining]

Selection

It is of utmost importance to know exactly which data you are targeting. Selecting the relevant subset of data from the Data Warehouse is a very important step, because removing unrelated data elements reduces the search space during the Data Mining phase.

Pre-processing

In this step, the selected data is freed from any anomalies and outliers. Basically, the data is cleaned thoroughly in this phase. If some data fields are missing, they’re filled with appropriate values. For example, suppose the table that stores the details of your organization’s employees has a column for “Middle Name”. Chances are, it’ll be empty for many employees. In such a scenario, an appropriate placeholder value is chosen (for example, “N/A”).
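
Here is a minimal Python sketch of that “Middle Name” clean-up. The employee rows and the `fill_missing` helper are hypothetical, and we assume that both `None` and an empty string count as missing.

```python
# Hypothetical employee rows; None or an empty string marks a missing value.
employees = [
    {"first": "Meera", "middle": "K", "last": "Nair"},
    {"first": "John", "middle": None, "last": "Dsouza"},
    {"first": "Li", "middle": "", "last": "Wei"},
]


def fill_missing(rows, column, placeholder="N/A"):
    """Return copies of the rows with empty values in `column` replaced."""
    cleaned = []
    for row in rows:
        value = row.get(column)
        fixed = dict(row)  # copy, so the input rows are left unchanged
        if value is None or str(value).strip() == "":
            fixed[column] = placeholder
        cleaned.append(fixed)
    return cleaned


for row in fill_missing(employees, "middle"):
    print(row)
```

Working on copies keeps the original rows intact, which is handy if you want to compare the data before and after cleaning.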

Transformation

This phase attempts to reduce the variety of data elements while preserving the quality of the information.

Data mining

This is the main phase of a KDD process. The transformed data is subjected to data-mining methods like grouping, clustering, regression, etc. This is done iteratively to bring the best results. Different techniques can be used depending on the requirements.
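
To give one of these methods a face, here is a tiny regression sketch in plain Python. It fits a straight line with ordinary least squares; the ad-spend and units-sold figures are invented for illustration, and real projects would normally reach for a statistics or machine learning library instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical transformed data: monthly ad spend vs. units sold.
ad_spend = [1, 2, 3, 4, 5]
units_sold = [12, 14, 16, 18, 20]

slope, intercept = fit_line(ad_spend, units_sold)
print(f"units_sold = {slope:.1f} * ad_spend + {intercept:.1f}")
```

For this made-up data the fitted line has a slope of 2 and an intercept of 10, so each extra unit of ad spend goes with two more units sold. That relationship is the kind of knowledge the mining phase is after.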

Evaluation

This is the final step. Here, the obtained knowledge is documented and presented for further analysis. Various Data Visualisation tools are used in this step to depict the acquired knowledge in a clear and understandable way.

Real Life Use-Cases of Data Mining

Organizations from Amazon, Flipkart, and Netflix to Facebook, Twitter, and Instagram, and even Walmart, are putting Data Mining to good use. In this section, we’ll talk about four broad use cases of Data Mining that are an integral part of your day-to-day life.

  • Service Providers

    Telecom service providers use Data Mining to predict “churn”, their term for a customer ditching them for another provider. They collate billing information, website visits, customer care interactions, and other such things to give each customer a probability score. Customers who are at a higher risk of churning are then given offers and incentives.

  • E-Commerce

    E-commerce is easily the best-known use case when it comes to Data Mining. One of the most famous examples is, of course, Amazon, which uses extremely sophisticated mining techniques. Check out the “People who viewed that product, also liked this” functionality, for instance!

  • Supermarkets

    Supermarkets are also an interesting use case of Data Mining. Mining the purchase history of customers allows them to understand their purchasing patterns. This information is then used by the supermarkets to provide personalized offers to the customers. Oh, and did we tell you about what Target did using Data Mining? (Yes, we did!)

  • Retail

    Retailers club their customers into Recency, Frequency, and Monetary (RFM) groups. Using Data Mining, they target their marketing at these groups. A customer who spends little but buys frequently, and whose last purchase was fairly recent, will be handled differently from a customer who spent a lot but only once.
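
The RFM idea from the Retail use case can be sketched as follows. Everything here is an assumption for illustration: the purchase log, the `rfm_table` and `segment` helpers, and especially the thresholds, which a real retailer would derive from its own data.

```python
from datetime import date

# Hypothetical purchase log: (customer, purchase date, amount spent).
purchases = [
    ("asha", date(2018, 2, 10), 15.0),
    ("asha", date(2018, 2, 17), 12.0),
    ("asha", date(2018, 1, 20), 18.0),
    ("ravi", date(2017, 6, 5), 900.0),
]


def rfm_table(purchases, today):
    """Summarise each customer as recency (days), frequency, monetary."""
    table = {}
    for customer, day, amount in purchases:
        row = table.setdefault(
            customer, {"last": day, "frequency": 0, "monetary": 0.0}
        )
        row["last"] = max(row["last"], day)  # most recent purchase so far
        row["frequency"] += 1
        row["monetary"] += amount
    return {
        customer: {
            "recency_days": (today - row["last"]).days,
            "frequency": row["frequency"],
            "monetary": row["monetary"],
        }
        for customer, row in table.items()
    }


def segment(rfm, recent_days=30, frequent=3, big_spend=500.0):
    """Illustrative rules only; the thresholds are arbitrary assumptions."""
    if rfm["recency_days"] <= recent_days and rfm["frequency"] >= frequent:
        return "frequent recent buyer"
    if rfm["monetary"] >= big_spend and rfm["frequency"] == 1:
        return "one-time big spender"
    return "other"


table = rfm_table(purchases, today=date(2018, 2, 21))
for customer, rfm in table.items():
    print(customer, rfm, segment(rfm))
```

The two sample customers mirror the contrast described above: one spends little but often and recently, the other spent a lot but only once, and they land in different segments.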

Wrapping Up…

Data Warehousing and Data Mining are two of the most important processes that are quite literally running the world today. Almost every big thing today is a result of sophisticated data mining, because un-mined data is as useful (or useless) as no data at all.

Again, to understand the difference between Data Mining and Data Warehousing, start with what each one does. Data Warehousing is a method of centralizing data from disparate sources in one database. We can define a Data Warehouse as compiled historical data, or a real-time data feed, that gives back mostly well-organized and integrated information.

We hope this article gave you clarity on what Data Warehousing and Data Mining are, and much more. To conclude, Data Warehousing is the process of collecting, storing, and organizing information in a single database, whereas Data Mining is mostly about extracting meaningful information from that data by looking at it from different perspectives. The useful information that is collected can later be used to solve issues that might hold back the growth of the company, and it can even cut costs. If you are looking for a bright and fascinating future, and exploration is your passion, then learning the what’s what of Data Warehousing and Data Mining is an excellent place to start.

Sumit Shukla

Blog Author
Sumit is a Level-1 Data Scientist, Sports Data Analyst and a Content Strategist for Artificial Intelligence and Machine Learning at UpGrad. He's certified in sports technology and science from FC Barcelona's technology innovation hub.

Frequently Asked Questions (FAQs)

1. How do businesses use Data Warehousing and Data Mining?

Both data mining and data warehousing are business intelligence techniques for transforming information (or data) into usable knowledge.

Data mining is a statistical analysis method. Technical tools are used by analysts to query and sort through gigabytes of data in search of trends. Businesses then utilise this data to make better business decisions based on their understanding of the behaviours of their consumers and suppliers.

Data Warehousing is the process of designing how data is stored in order to facilitate reporting and analysis. According to data warehouse specialists, the numerous data stores are both conceptually and physically integrated and related to one another. The data of a company is typically saved in multiple databases.

2. What is the core difference between Data Warehousing and Data Mining? Which is more practical in the business world?

A data warehouse is a data storage system. It usually entails a variety of data kinds acquired from multiple sources for a variety of objectives. The process of storing this data with discipline so that it may be retrieved later is known as data warehousing.

The process of extracting useful information from data is known as data mining. It entails locating the most pertinent information for a particular goal. It might come from your data warehouse, or from somewhere else entirely. You can expect to refine and clean the data you mine, just as you would with real ore.

The better your warehousing systems are, the easier it will be to mine.

3. Are Data Mining and KDD process similar?

Although KDD and Data Mining are the terms that are frequently interchanged, they refer to two distinct but related concepts.

Data Mining is a component within the KDD process that deals with recognising patterns in data, whereas KDD is the whole process of extracting knowledge from data. To put it another way, Data Mining is just the application of a specific algorithm to achieve the KDD process’s ultimate purpose.
