As the name suggests, association rule mining deals with association rules: simple if/then statements that help discover relationships between seemingly independent items in relational databases or other data repositories.
Most machine learning algorithms work with numeric datasets and hence tend to be mathematical. However, association rule mining is suitable for non-numeric, categorical data and requires just a little bit more than simple counting.
Association rule mining is a procedure which aims to observe frequently occurring patterns, correlations, or associations from datasets found in various kinds of databases such as relational databases, transactional databases, and other forms of repositories.
But what is an association rule?
Association rule learning is a technique that helps identify dependencies between data items. Once a dependency is found, it can be acted on, for example to make a business more profitable. The technique also looks for interesting associations among the variables of a dataset. It is one of the important concepts of machine learning and has been used in areas such as data mining and continuous production, among others. However, like all other techniques, association rule mining has its own set of disadvantages, which are discussed briefly in this article.
An association rule has 2 parts:
- an antecedent (if) and
- a consequent (then)
An antecedent is something that’s found in data, and a consequent is an item that is found in combination with the antecedent. Have a look at this rule for instance:
“If a customer buys bread, they are 70% likely to buy milk.”
In the above association rule, bread is the antecedent and milk is the consequent. Simply put, it can be understood as a retail store’s association rule to target their customers better. If the above rule is a result of a thorough analysis of some data sets, it can be used to not only improve customer service but also improve the company’s revenue.
Association rules are created by thoroughly analyzing data and looking for frequent if/then patterns. Then, depending on the following two parameters, the important relationships are observed:
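- Support: Support indicates how frequently the items in the if/then relationship appear together in the database, usually expressed as the fraction of all transactions that contain them.
- Confidence: Confidence indicates how often the relationship has been found to be true, that is, the fraction of transactions containing the antecedent that also contain the consequent.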
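As a minimal sketch of these two measures, the snippet below counts them over a tiny set of transactions. The transaction data and the helper names (`count`, `support`, `confidence`) are made up for illustration and are not taken from any particular library.

```python
# Hypothetical toy transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent,
    the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

support_bread_milk = support({"bread", "milk"}, transactions)
confidence_bread_milk = confidence({"bread"}, {"milk"}, transactions)

print(support_bread_milk)     # 3 of 5 transactions -> 0.6
print(confidence_bread_milk)  # 3 of the 4 bread transactions -> 0.75
```

In this toy data, the rule "if bread, then milk" has a support of 0.6 and a confidence of 0.75.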
So, in a given transaction with multiple items, Association Rule Mining primarily tries to find the rules that govern how or why such products/items are often bought together. For example, peanut butter and jelly are frequently purchased together because a lot of people like to make PB&J sandwiches.
Association Rule Mining is sometimes referred to as “Market Basket Analysis”, as it was the first application area of association mining. The aim is to discover associations of items occurring together more often than you’d expect from randomly sampling all the possibilities. The classic anecdote of Beer and Diaper will help in understanding this better.
The story goes like this: young American men who go to the stores on Fridays to buy diapers have a predisposition to grab a bottle of beer too. However unrelated and vague that may sound to us laymen, association rule mining shows us how and why!
Let’s do a little analytics ourselves, shall we?
Suppose an X store’s retail transactions database includes the following data:
- Total number of transactions: 600,000
- Transactions containing diapers: 7,500 (1.25 percent)
- Transactions containing beer: 60,000 (10 percent)
- Transactions containing both beer and diapers: 6,000 (1.0 percent)
From the above figures, we can conclude that if there were no relation between beer and diapers (that is, if they were statistically independent), we would expect only 10% of diaper purchasers to buy beer too.
However, as surprising as it may seem, the figures tell us that 80% (=6000/7500) of the people who buy diapers also buy beer.
This is a significant jump, by a factor of 8, over the expected probability. This factor of increase is known as lift, which is the ratio of the observed frequency of co-occurrence of our items to the expected frequency.
How did we determine the lift?
Simply by counting the transactions in the database and performing simple mathematical operations.
So, for our example, one plausible association rule can state that the people who buy diapers will also purchase beer with a Lift factor of 8. If we talk mathematically, the lift can be calculated as the ratio of the joint probability of two items x and y, divided by the product of their probabilities.
Lift = P(x,y)/[P(x)P(y)]
However, if the two items are statistically independent, then the joint probability of the two items will be the same as the product of their probabilities. Or, in other words,
P(x,y)=P(x)P(y),
which makes the Lift factor = 1. An interesting point worth mentioning here is that anti-correlation yields Lift values less than 1, which corresponds to items that rarely occur together. For items that are strictly mutually exclusive, the Lift is 0.
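The arithmetic for the beer and diapers example can be checked with a few lines of Python. This is only a sketch: the function name `lift` and its parameters are chosen here for illustration, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

def lift(n_xy, n_x, n_y, n_total):
    """Lift = P(x,y) / [P(x) P(y)], computed from raw transaction counts."""
    p_xy = Fraction(n_xy, n_total)
    p_x = Fraction(n_x, n_total)
    p_y = Fraction(n_y, n_total)
    return p_xy / (p_x * p_y)

# Figures from the example above.
total = 600_000
diapers = 7_500
beer = 60_000
both = 6_000

confidence = Fraction(both, diapers)           # share of diaper buyers who buy beer
beer_diaper_lift = lift(both, diapers, beer, total)

print(float(confidence))   # 0.8
print(beer_diaper_lift)    # 8
```

If beer and diapers were independent, the count of joint transactions would be 600,000 × 0.0125 × 0.10 = 750, and the same function would return 1.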
Association Rule Mining has helped data scientists find out patterns they never knew existed.
Types Of Association Rules In Data Mining
There are typically four different types of association rules in data mining. They are:
- Multi-relational association rules
- Generalized association rules
- Interval information association rules
- Quantitative association rules
Multi-Relational Association Rule
Also known as MRAR, multi-relational association rule is defined as a new class of association rules that are usually derived from different or multi-relational databases. Each rule under this class has one entity with different relationships that represent the indirect relationships between entities.
Generalized Association Rule
Moving on to the next type of association rule, the generalized association rule is largely used for getting a rough idea about the interesting patterns that often tend to stay hidden in data.
Quantitative Association Rules
This type stands out from the other association rules. What sets it apart is that at least one attribute on the left or right side of a quantitative association rule is numeric. This is in contrast to the generalized association rule, where the left and right sides consist of categorical attributes.
Algorithms Of Association Rule In Data Mining
There are mainly three different types of algorithms that can be used to generate association rules in data mining. Let’s take a look at them.
- Apriori Algorithm
Apriori algorithm identifies the frequent individual items in a given database and then expands them to larger item sets, keeping in check that the item sets appear sufficiently often in the database.
- Eclat Algorithm
ECLAT stands for Equivalence Class Clustering and bottom-up Lattice Traversal. It is another widely used method for association rule mining. Some even consider it to be a better and more efficient version of the Apriori algorithm.
- FP-growth Algorithm
Short for frequent pattern growth, this algorithm is particularly useful for finding frequent patterns without the need for candidate generation. It mainly operates in two stages, namely FP-tree construction and extraction of frequent item sets.
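To make the level-wise idea behind Apriori concrete, here is a minimal, unoptimized sketch in plain Python. The function name `apriori`, the parameter `min_support_count`, and the demo transactions are assumptions made for illustration; real implementations use more efficient candidate generation and counting.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {itemset (frozenset): count} for every itemset that appears
    in at least `min_support_count` transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        # Count each candidate and keep only those that occur often enough.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support_count}

    # Level 1: frequent individual items.
    singles = {frozenset([item]) for t in transactions for item in t}
    current = count_frequent(singles)
    frequent = dict(current)

    k = 2
    while current:
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = count_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical demo data.
demo = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
frequent_itemsets = apriori(demo, min_support_count=2)
for itemset, n in sorted(frequent_itemsets.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The prune step is what keeps the search manageable: an item set can only be frequent if all of its subsets are frequent, so most larger candidates are discarded without ever being counted.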
Now that you have a basic understanding of what an association rule is, let’s look at some areas where Association Rule Mining has helped quite a lot:
- Market Basket Analysis:
This is the most typical example of association mining. Most supermarkets collect data using barcode scanners. The resulting database, known as the “market basket” database, consists of a large number of records of past transactions. A single record lists all the items bought by a customer in one sale. Knowing which groups of customers are inclined towards which sets of items gives these shops the freedom to adjust the store layout and the store catalog so that items are placed optimally relative to one another.
The purpose of ARM analysis is to characterise the most interesting patterns efficiently. Market Basket Analysis (MBA), a name often used interchangeably with ARM analysis, is a technique for identifying consumer patterns by mining associations from store transactional databases. Practically every commodity today carries a bar code, and businesses quickly recognised that the information recorded this way has enormous potential value in marketing. Commercial businesses are particularly interested in “association rules” that pinpoint trends in which the presence of one item in a basket indicates the purchase of one or more further items. The outcomes of this “market basket analysis” can then be used to suggest product pairings, which helps managers make efficient decisions.
Data Mining (DM) methods are also used to identify groups of items that are bought at the same time. Choosing which goods to place next to one another on store shelves can help raise sales significantly. The ARM problem can be decomposed into the following two phases.
- Find the groups of items, or item sets, whose transaction support is higher than a specified minimum support. Item sets that meet the minimum support are called frequent (or large) item sets.
- Use the large item sets to generate the association rules for the database.
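The second phase can be sketched as follows: given item sets that already passed the minimum support check, split each one into an antecedent and a consequent and keep the splits whose confidence is high enough. The function name `generate_rules`, the `min_confidence` parameter, and the support counts below are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def generate_rules(frequent, min_confidence):
    """Build (antecedent, consequent, confidence) rules from a dict that maps
    each frequent itemset (frozenset) to its support count. The dict is
    assumed to contain every subset of every itemset it lists."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for size in range(1, len(itemset)):
            for items in combinations(sorted(itemset), size):
                antecedent = frozenset(items)
                consequent = itemset - antecedent
                # confidence = count(antecedent and consequent) / count(antecedent)
                conf = count / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, conf))
    return rules

# Hypothetical output of phase one: itemsets with their support counts.
frequent = {
    frozenset({"bread"}): 4,
    frozenset({"milk"}): 4,
    frozenset({"butter"}): 2,
    frozenset({"bread", "milk"}): 3,
    frozenset({"bread", "butter"}): 2,
}

rules = generate_rules(frequent, min_confidence=0.8)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), conf)
```

With these counts only "if butter, then bread" survives the 0.8 threshold, because every butter transaction also contains bread, while bread and milk imply each other only 75% of the time.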
- Medical Diagnosis:
Association rules can be useful in medical diagnosis for assisting physicians in treating patients. Diagnosis is not an easy process and leaves room for errors, which may lead to unreliable results. Using relational association rule mining, we can identify the probability of an illness occurring given various factors and symptoms. Further, using learning techniques, such a system can be extended by adding new symptoms and defining relationships between the new signs and the corresponding diseases.
- Census Data:
Every government has tonnes of census data. This data can be used to plan efficient public services (education, health, transport) as well as to help businesses (for setting up new factories, shopping malls, and even marketing particular products). This application of association rule mining and data mining has immense potential in supporting sound public policy and the efficient functioning of a democratic society.
- Protein Sequence:
Proteins are sequences made up of twenty types of amino acids. Each protein has a unique 3D structure which depends on the sequence of these amino acids. A slight change in the sequence can cause a change in structure, which might change the functioning of the protein. This dependency of protein functioning on the amino acid sequence has been a subject of great research. Earlier it was thought that these sequences are random, but now it’s believed that they aren’t. Nitin Gupta, Nitin Mangal, Kamal Tiwari, and Pabitra Mitra have deciphered the nature of associations between different amino acids that are present in a protein. Knowledge and understanding of these association rules will be extremely helpful during the synthesis of artificial proteins.
- Building an Intelligent Transportation System
The Intelligent Transportation System (ITS) integrates cutting-edge beam technology, intelligent technology, and switch technology across the board. A flexible, precise, on-time, and organised interconnected transportation control system is the foundation of an intelligent transportation system.
An ITS is built on an information network. It uses sensors in parking lots, weather centres, cars, and transfer stations, together with transmission equipment, to carry traffic information to the information centres.
The system gathers and analyses real-time data on traffic conditions, parking availability, and other travel-related information, and then uses this data to choose the best routes. The following requirements should be met for the application of ITS:
- Credible, correct, and genuine road and traffic data collection.
- Efficient, reliable information exchange between traffic management and road management facilities.
- The use of self-learning software applications by traffic toll management centres to decide on route choices.
Best tools for Association Rule Mining
One of the best ways to understand association rule mining is to understand its tools and how they work. Association rule mining uses diverse models and tools to analyse patterns in data sets. Here is a list of some open-source tools that are great for working with association rules in data mining.
- WEKA – Waikato Environment for Knowledge Analysis
WEKA is a free and open-source tool for association rule mining. It can be accessed through a graphical user interface or standard terminal applications. It is also accessible through a Java API and is used for data preparation, machine learning algorithm development, and data visualisation on just about any system. WEKA includes a number of ML techniques that can be used to address real-world data mining problems.
- RapidMiner
Another well-known open-source advanced analytics tool is RapidMiner. It is known for its user-friendly visual interface. It enables users to connect to any source of data, including social networking, cloud storage, commercial applications, and corporate data stores. Additionally, RapidMiner includes automatic in-database processing for data preparation and analysis. It is a great tool for association rule mining.
- Orange
Orange is an open-source tool used primarily for data processing and visualisation. It is used to explore and preprocess data, and also serves as a modelling tool. Orange is written in Python. To make use of ARM in Orange, one must install the “Associate” add-on. Other add-ons enable network analysis, text mining, and NLP. Orange is one of the most popular tools for association rule mining.
Association rule mining is also known as affinity analysis, and it leverages these tools to find patterns and co-occurrences. These tools should be enough to answer your questions and doubts about what association rule mining is and how it works!
With that, I hope I was able to clarify everything you needed to know about association rule mining.
If you happen to have any doubts, queries, or suggestions – do drop them in the comments below!