Before Big Data took the world by storm, enterprise data was stored in information silos that were physically separate from other data repositories, and each silo served a specialised function. With datasets as large as today’s, it’s practically impossible to keep working that way. Just imagine the number of data extracts it would take, from so many physically separated silos, only to run a simple query, all because of the massive piles of data that organisations hold these days.
Data Warehouses were developed to combat this problem of data storage. Essentially, a Data Warehouse can be thought of as a unified repository of data that comes from various sources and in various formats. Data Mining, on the other hand, is the process of extracting knowledge from that Data Warehouse.
In this article, we’ll take a detailed look at Data Warehouse and Data Mining. For better understanding, we’ve structured the article as follows:
- What is a Data Warehouse?
- Data Warehouse Processes
- What is Data Mining?
- KDD Process
- Real Life Use-Cases of Data Mining
What is a Data Warehouse?
Technically, a Data Warehouse can be defined as a subject-oriented, time-variant, non-volatile, and integrated collection of data. Before moving further from here, let’s first look at what these terms mean in the context of a Data Warehouse:
Subject-oriented means that organisations can use the Data Warehouse to analyze a specific subject area. Suppose you want to see how well your sales team has performed in the last 5 years: you can query your Warehouse, and it’ll tell you all you need to know. In this case, “sales” can be treated as a subject.
Data Warehouses are responsible for storing historical data for organisations. For example, a transaction system may hold only the most recent address of a customer, but a Data Warehouse will hold all the previous addresses too. It keeps adding data from various sources while retaining the historical data, so the stored data always varies with time. That’s what makes it a time-variant model.
Non-volatile means that once data is stored in a Data Warehouse, it can’t be altered or modified. To record a change, we can only add a modified copy of the data.
As we said earlier, a Data Warehouse holds data from multiple sources. Say we have two data sources, A and B. The two sources might store completely different types of data, but when their data is brought to a Warehouse, it’s made to undergo preprocessing into a consistent form. That is how a Data Warehouse integrates data from a number of sources.
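To make the time-variant and non-volatile properties a little more concrete, here is a minimal Python sketch of the customer-address example. The `AddressRecord` and `AddressHistory` names and the sample addresses are our own illustrative assumptions, not part of any particular warehouse product. Rows can only be appended, and a change of address is recorded as a new row next to the old one.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a stored row cannot be modified afterwards
class AddressRecord:
    customer_id: int
    address: str
    valid_from: date


class AddressHistory:
    """Hypothetical append-only store: rows are added, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def add(self, record):
        self._rows.append(record)

    def history(self, customer_id):
        """All addresses ever recorded for a customer, oldest first."""
        rows = [r for r in self._rows if r.customer_id == customer_id]
        return sorted(rows, key=lambda r: r.valid_from)

    def current(self, customer_id):
        """The most recent address, i.e. what a transaction system would hold."""
        rows = self.history(customer_id)
        return rows[-1] if rows else None


warehouse = AddressHistory()
warehouse.add(AddressRecord(1, "12 Old Street", date(2015, 3, 1)))
warehouse.add(AddressRecord(1, "98 New Avenue", date(2018, 6, 12)))

print([r.address for r in warehouse.history(1)])
print(warehouse.current(1).address)
```

A real warehouse would typically enforce this in the database rather than in application code, but the idea is the same: history is kept, and nothing is overwritten.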
Data Warehouse Processes
Take a look at the above image. The data that is collected from various sources (operational systems, ERP, CRM, flat files, etc.) is made to undergo an ETL process before it’s inserted into the Data Warehouse. This is essentially done to remove anomalies, if any, from the data, so that no harm is caused to the Data Warehouse. ETL stands for Extraction, Transformation, and Loading. Let’s have a look at each of these processes in detail. To understand better, we’ll use an analogy: think of a gold rush and read on!
Extraction is essentially done to collect all the required data from the source systems using as few resources as possible.
Think of this step as panning the river in search of the biggest gold nuggets you can find.
The main aim of Transformation is to bring the extracted data into a general format before it goes into the database. This is because different sources store data in different formats. For example, one data source might have dates in “dd/mm/yyyy” format, and another might have them in “dd-mm-yy” format. In this step, we’ll convert them into one generalised format that’ll be used for data from all the sources.
Now you have a gold nugget. What do you do? Melt it down and remove the impurities.
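The date example lends itself to a short sketch. Assuming the only source formats are the two mentioned above, and assuming ISO “YYYY-MM-DD” as the generalised target format (our choice for illustration), a hypothetical helper `to_iso_date` could look something like this:

```python
from datetime import datetime

# Assumed source formats, taken from the example in the text:
# "dd/mm/yyyy" and "dd-mm-yy".
SOURCE_FORMATS = ("%d/%m/%Y", "%d-%m-%y")


def to_iso_date(raw):
    """Convert a date string in any known source format to 'YYYY-MM-DD'."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognised date format: {raw!r}")


print(to_iso_date("25/12/2017"))  # 2017-12-25
print(to_iso_date("25-12-17"))    # 2017-12-25
```

Anything that matches neither format raises an error instead of being guessed at, which keeps anomalies out of the warehouse.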
In the Loading step, the transformed data is loaded into the target database.
Now you have pure gold. Mould it into a ring and sell it!
The process of bringing data from various sources and storing it in the Data Warehouse (after the ETL process, of course) is what is known as Data Warehousing.
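Putting the three steps together, a toy ETL run might be sketched as follows. This is only an illustration under our own assumptions: the two CSV extracts, the `sales` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the Data Warehouse.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical source extracts; real ones would come from a CRM, ERP, flat files, etc.
SOURCE_A = "customer,order_date,amount\nAsha,05/01/2018,120.50\nRavi,17/02/2018,80.00\n"
SOURCE_B = "customer,order_date,amount\nMeera,09-03-18,42.25\n"


def extract(text):
    """Extraction: read the raw rows from one source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, date_format):
    """Transformation: bring every row into one common format."""
    return [
        (
            row["customer"].strip(),
            datetime.strptime(row["order_date"], date_format).date().isoformat(),
            float(row["amount"]),
        )
        for row in rows
    ]


def load(conn, rows):
    """Loading: insert the cleaned rows into the target table."""
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, order_date TEXT, amount REAL)")

load(conn, transform(extract(SOURCE_A), "%d/%m/%Y"))
load(conn, transform(extract(SOURCE_B), "%d-%m-%y"))

for row in conn.execute("SELECT * FROM sales ORDER BY order_date"):
    print(row)
```

Because both sources end up with the same date format, a single query can sort and filter across them.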
Now, you have your data in place – all cleaned up and ready to go. What should be the next step? Extracting knowledge – yes!
Data Mining to the rescue!
What is Data Mining?
Data Mining is, quite simply, the process of extracting previously unknown but potentially useful information from data sets. By “previously unknown”, we mean knowledge that can be acquired only after deeply mining the data warehouse, i.e., knowledge that isn’t apparent on the surface. Data Mining essentially searches for the relationships and global patterns that exist between the data elements.
For example, imagine you run a supermarket. A customer’s purchase history might not seem to reveal a lot on the surface, but if it’s analyzed carefully for possible patterns, this information alone can give away a lot. If you haven’t guessed it yet, we’re talking about Target, a supermarket that reportedly figured out a teen girl (a customer) was pregnant just by carefully studying her purchase history and looking for trends and patterns. So, information that looked trivial on the surface turned out to be of great value when mined carefully, and that is exactly what we mean by “previously unknown knowledge”.
We feel it’d be unfair to you if we gave you a flavor of Data Warehouse and Data Mining and completely ignored the big picture: Knowledge Discovery in Databases (KDD). Data Mining forms one of the steps of a KDD process. Let’s talk a bit more about KDD.
Knowledge Discovery In Databases (KDD)
Data mining is one of the more crucial steps in the process of KDD. KDD basically covers everything from the selection of data to finally evaluating the mined data. The complete KDD cycle is shown in the image below:
It is of utmost importance to know the exact target data. Selecting the subset of the Data Warehouse that’s to be analyzed is a very important step, because removing unrelated data elements reduces the search space during the Data Mining phase.
In the preprocessing (cleaning) step, the selected data is freed from any anomalies and outliers. Basically, the data is completely cleaned in this phase. For instance, if there are some missing data fields, they’re filled with appropriate values. In the table that stores the details of your organisation’s employees, suppose there’s a column for “Middle Name”. Chances are, it’ll be empty for many employees. In such a scenario, an appropriate value is chosen (“N/A”, for example).
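As a small sketch of the “Middle Name” case, the following Python snippet fills empty or missing values with a placeholder. The employee records and the `fill_missing` helper are made up for illustration.

```python
# Hypothetical employee rows; two of them have no middle name.
employees = [
    {"first_name": "Anita", "middle_name": "Rao", "last_name": "Kulkarni"},
    {"first_name": "John", "middle_name": "", "last_name": "Mathew"},
    {"first_name": "Li", "middle_name": None, "last_name": "Wei"},
]


def fill_missing(rows, column, placeholder="N/A"):
    """Return copies of the rows with empty or missing values in `column` replaced."""
    cleaned = []
    for row in rows:
        value = row.get(column)
        is_missing = value is None or str(value).strip() == ""
        cleaned.append({**row, column: placeholder if is_missing else value})
    return cleaned


cleaned_employees = fill_missing(employees, "middle_name")
print([e["middle_name"] for e in cleaned_employees])
```

The original rows are left untouched and cleaned copies are returned, so the raw extract stays available if the cleaning rule needs to change later.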
The transformation phase attempts to reduce the variety of data elements while preserving the quality of the information.
Data mining is the main phase of a KDD process. The transformed data is subjected to data-mining methods like grouping, clustering, regression, etc. This is done iteratively to get the best results. Different techniques can be used depending on the requirements.
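To give a feel for what “clustering, done iteratively” can mean, here is a bare-bones k-means sketch on one-dimensional data. The `monthly_spend` figures and the starting centroids are invented for the example, and real projects would normally reach for a library rather than hand-rolled code like this.

```python
def k_means_1d(values, centroids, max_iterations=100):
    """Cluster numbers around the given starting centroids (plain k-means)."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so the clusters are stable
            break
        centroids = updated
    return centroids, clusters


# Hypothetical monthly spend per customer.
monthly_spend = [12, 15, 14, 90, 95, 102, 11, 99]
centroids, clusters = k_means_1d(monthly_spend, centroids=[0, 50])
print(centroids)
print(clusters)
```

With this data, the loop settles on one group of low spenders and one group of high spenders after a couple of passes.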
Evaluation is the final step. Here, the obtained knowledge is documented and presented for further analysis. Various Data Visualisation tools are used in this step to depict the acquired knowledge in a clear and visually appealing way.
Real Life Use-Cases of Data Mining
Organisations from Amazon, Flipkart, and Netflix to Facebook, Twitter, and Instagram, and even Walmart, are putting Data Mining to good use. In this section, we’ll talk about four broad use cases of Data Mining that are an integral part of your day-to-day life.
Telecom service providers use Data Mining to predict “churn”, their term for a customer ditching them for another provider. They collate billing information, website visits, customer care interactions, and other such things to give each customer a probability score. Customers at a higher risk of “churning” are then given offers and incentives.
E-commerce is easily the best-known use case when it comes to Data Mining. One of the most famous examples is, of course, Amazon, which uses extremely sophisticated mining techniques. Check out the “People who viewed that product, also liked this” functionality, for instance!
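We don’t know how Amazon actually builds that feature, but the basic idea behind “also viewed” suggestions can be sketched with simple co-occurrence counting. The browsing sessions and the `also_viewed` function below are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical browsing sessions: each set holds the products one visitor viewed.
sessions = [
    {"phone", "phone case", "charger"},
    {"phone", "phone case"},
    {"phone", "headphones"},
    {"laptop", "laptop bag"},
]

# For every product, count how often each other product appears in the same session.
co_views = defaultdict(Counter)
for session in sessions:
    for viewed, other in permutations(session, 2):
        co_views[viewed][other] += 1


def also_viewed(product, top_n=2):
    """Products most often viewed together with `product`, ties broken alphabetically."""
    ranked = sorted(co_views[product].items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


print(also_viewed("phone"))
```

Production recommenders go far beyond raw counts, yet even this toy version surfaces the phone case as the most natural companion to the phone.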
Supermarkets are also an interesting use case of Data Mining. Mining the purchase history of customers allows supermarkets to understand their customers’ purchasing patterns. This information is then used to provide personalised offers to the customers. Oh, and did we tell you about what Target did using Data Mining? (Yes, we did!)
Retailers group their customers into Recency, Frequency, and Monetary (RFM) segments. Using Data Mining, they target marketing to these groups. A customer who spends little but buys frequently, and whose last purchase was fairly recent, will be handled differently from a customer who spent a lot but only once.
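Here is a rough sketch of how the three RFM numbers could be computed from a purchase log. The customers, dates, and amounts are invented, and the `rfm` function is our own illustration rather than a standard API.

```python
from datetime import date

# Hypothetical purchase log: (customer, purchase date, amount spent).
purchases = [
    ("asha", date(2018, 5, 2), 15.0),
    ("asha", date(2018, 5, 20), 12.0),
    ("asha", date(2018, 6, 1), 18.0),
    ("ravi", date(2017, 11, 3), 900.0),
]


def rfm(purchases, today):
    """Return {customer: (recency in days, frequency, monetary total)}."""
    summary = {}
    for customer, day, amount in purchases:
        last, count, total = summary.get(customer, (day, 0, 0.0))
        summary[customer] = (max(last, day), count + 1, total + amount)
    return {
        customer: ((today - last).days, count, total)
        for customer, (last, count, total) in summary.items()
    }


scores = rfm(purchases, today=date(2018, 6, 10))
for customer, (recency, frequency, monetary) in scores.items():
    print(customer, recency, frequency, monetary)
```

Here “asha” is the frequent, recent, low-spending customer and “ravi” is the one-off big spender, which are exactly the two profiles contrasted above.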
Data Warehousing and Data Mining are two of the most important processes that are quite literally running the world today. Almost every big thing today is a result of sophisticated data mining, because un-mined data is as useful (or useless) as no data at all.
We hope this article gave you clarity on what these two terms mean and much more!