As the name suggests, Association Rule Mining is about association rules: simple If/Then statements that help discover relationships between seemingly independent items of data in relational databases or other data repositories.
Most machine learning algorithms work with numeric datasets and hence tend to be mathematical. However, association rule mining is suitable for non-numeric, categorical data and requires just a little bit more than simple counting.
Association rule mining is a procedure which aims to observe frequently occurring patterns, correlations, or associations from datasets found in various kinds of databases such as relational databases, transactional databases, and other forms of repositories.
An association rule has 2 parts:
- an antecedent (if) and
- a consequent (then)
An antecedent is something that’s found in data, and a consequent is an item that is found in combination with the antecedent. Have a look at this rule for instance:
“If a customer buys bread, they are 70% likely to buy milk as well.”
In the above association rule, bread is the antecedent and milk is the consequent. Simply put, it can be understood as a retail store’s association rule to target their customers better. If the above rule is a result of a thorough analysis of some data sets, it can be used to not only improve customer service but also improve the company’s revenue.
Association rules are created by thoroughly analyzing data and looking for frequent if/then patterns. Then, depending on the following two parameters, the important relationships are observed:
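- Support: Support indicates how frequently the if/then relationship appears in the database, that is, the fraction of all transactions that contain both the antecedent and the consequent.
- Confidence: Confidence tells how often the relationship has been found to be true, that is, the fraction of the transactions containing the antecedent that also contain the consequent.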
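To make the two parameters concrete, here is a minimal Python sketch that computes them by plain counting. The five transactions and the helper names `support` and `confidence` are made up for illustration; they are not taken from any real dataset or library.

```python
# Hypothetical toy transactions, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent. Assumes the antecedent occurs at least once."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)

print(f"support(bread -> milk)    = {rule_support:.2f}")
print(f"confidence(bread -> milk) = {rule_confidence:.2f}")
```

In this toy data, bread and milk appear together in 3 of 5 transactions (support 0.60), and 3 of the 4 bread transactions also contain milk (confidence 0.75).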
So, in a given transaction with multiple items, Association Rule Mining primarily tries to find the rules that govern how or why such products/items are often bought together. For example, peanut butter and jelly are frequently purchased together because a lot of people like to make PB&J sandwiches.
Association Rule Mining is sometimes referred to as “Market Basket Analysis”, as it was the first application area of association mining. The aim is to discover associations of items occurring together more often than you’d expect from randomly sampling all the possibilities. The classic beer-and-diapers anecdote will help in understanding this better.
The story goes like this: young American men who go to the stores on Fridays to buy diapers have a predisposition to grab a bottle of beer too. However unrelated and vague that may sound to us laymen, association rule mining is how such a pattern gets spotted and measured!
Let’s do a little analytics ourselves, shall we?
Suppose an X store’s retail transactions database includes the following data:
- Total number of transactions: 600,000
- Transactions containing diapers: 7,500 (1.25 percent)
- Transactions containing beer: 60,000 (10 percent)
- Transactions containing both beer and diapers: 6,000 (1.0 percent)
From the above figures, we can conclude that if there were no relation between beer and diapers (that is, if they were statistically independent), we would expect only 10% of diaper purchasers to buy beer too, the same share as among all customers.
However, as surprising as it may seem, the figures tell us that 80% (=6000/7500) of the people who buy diapers also buy beer.
This is a significant jump, by a factor of 8, over the expected probability. This factor of increase is known as Lift, which is the ratio of the observed frequency of co-occurrence of our items to the expected frequency.
How did we determine the lift?
Simply by counting the transactions in the database and performing simple arithmetic.
So, for our example, one plausible association rule can state that the people who buy diapers will also purchase beer, with a Lift factor of 8. Mathematically, the lift is the ratio of the joint probability of two items x and y to the product of their individual probabilities.
Lift = P(x,y)/[P(x)P(y)]
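As a quick check, the same arithmetic can be written out in Python. The counts below are the ones from the store example; the variable names are just illustrative choices.

```python
# Counts taken from the store example in this article.
total = 600_000      # all transactions
diapers = 7_500      # transactions containing diapers
beer = 60_000        # transactions containing beer
both = 6_000         # transactions containing both

p_diapers = diapers / total
p_beer = beer / total
p_both = both / total

# Observed share of diaper buyers who also buy beer (the rule's confidence).
confidence = both / diapers
# Share we would expect if beer and diapers were independent.
expected = p_beer

# Lift = P(x, y) / [P(x) P(y)]
lift = p_both / (p_diapers * p_beer)

print(f"observed: {confidence:.0%}, expected: {expected:.0%}, lift: {lift:.1f}")
```

Note that dividing the observed 80% by the expected 10% gives the same factor of 8, since P(x,y)/P(x) is exactly the confidence of the rule.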
However, if the two items are statistically independent, then the joint probability of the two items will be the same as the product of their probabilities. Or, in other words,
P(x,y) = P(x)P(y)
which makes the Lift factor = 1. An interesting point worth mentioning here is that anti-correlation yields Lift values less than 1, which corresponds to items that occur together less often than chance would predict. For strictly mutually exclusive items, which never occur together, the Lift is 0.
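Both cases are easy to verify with a small Python sketch. The counts here are hypothetical, chosen only so that one pair comes out independent and the other anti-correlated; `lift_from_counts` is an illustrative helper, not a library function.

```python
def lift_from_counts(n_total, n_x, n_y, n_xy):
    """Lift = P(x, y) / [P(x) P(y)], computed directly from counts.

    Algebraically, (n_xy / n) / ((n_x / n) * (n_y / n)) = n_xy * n / (n_x * n_y).
    """
    return n_xy * n_total / (n_x * n_y)


# Hypothetical counts: 1,000 transactions, x in 200 of them, y in 500.
# Independence predicts 0.2 * 0.5 * 1000 = 100 joint transactions.
independent = lift_from_counts(1000, 200, 500, 100)

# The same items, but seen together in only 20 transactions.
anti_correlated = lift_from_counts(1000, 200, 500, 20)

print(independent)       # lift of 1: no association
print(anti_correlated)   # lift below 1: together less often than expected
```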
Association Rule Mining has helped data scientists find out patterns they never knew existed.
Let’s look at some areas where Association Rule Mining has helped quite a lot:
Market Basket Analysis:
This is the most typical example of association mining. Data is collected using barcode scanners in most supermarkets. This database, known as the “market basket” database, consists of a large number of records on past transactions. A single record lists all the items bought by a customer in one sale. Knowing which groups are inclined towards which sets of items gives these shops the freedom to adjust the store layout and the store catalog so that related items are placed optimally with respect to one another.
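Here is a rough sketch of what mining such a database could look like in Python. It is deliberately naive: it only considers rules between pairs of items and counts everything by brute force, whereas real tools typically rely on algorithms such as Apriori or FP-Growth. The baskets, the thresholds, and the function name `mine_pair_rules` are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations, permutations


def mine_pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence, lift) tuples
    for single-item rules that pass both thresholds."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = set(basket)  # ignore duplicate items within one sale
        item_counts.update(items)
        pair_counts.update(
            frozenset(pair) for pair in combinations(sorted(items), 2)
        )

    rules = []
    for pair, count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Try the rule in both directions: a -> c and c -> a.
        for a, c in permutations(sorted(pair), 2):
            confidence = count / item_counts[a]
            if confidence >= min_confidence:
                lift = confidence / (item_counts[c] / n)
                rules.append((a, c, support, confidence, lift))
    # Strongest lift first; ties broken alphabetically for stable output.
    return sorted(rules, key=lambda r: (-r[4], r[0], r[1]))


# Hypothetical baskets, invented for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "peanut butter", "jelly"],
    ["peanut butter", "jelly"],
    ["bread", "milk", "peanut butter", "jelly"],
    ["milk"],
]

rules = mine_pair_rules(baskets)
for a, c, s, conf, lift in rules:
    print(f"{a} -> {c}: support={s:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

On these toy baskets, only the peanut butter and jelly rules (in both directions) clear the thresholds, with support 0.60, confidence 1.00, and a lift of about 1.67.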
Medical Diagnosis:
Association rules in medical diagnosis can be useful for assisting physicians in treating patients. Diagnosis is not an easy process and leaves room for errors, which may result in unreliable end-results. Using relational association rule mining, we can identify the probability of the occurrence of an illness with respect to various factors and symptoms. Further, using learning techniques, this interface can be extended by adding new symptoms and defining relationships between the new signs and the corresponding diseases.
Census Data:
Every government has vast amounts of census data. This data can be used to plan efficient public services (education, health, transport) as well as to help public businesses (for setting up new factories, shopping malls, and even marketing particular products). This application of association rule mining and data mining has immense potential in supporting sound public policy and the efficient functioning of a democratic society.
Protein Sequence:
Proteins are sequences made up of twenty types of amino acids. Each protein bears a unique 3D structure which depends on the sequence of these amino acids. A slight change in the sequence can cause a change in structure, which might change the functioning of the protein. This dependency of the protein functioning on its amino acid sequence has been a subject of great research. Earlier it was thought that these sequences are random, but now it’s believed that they aren’t. Nitin Gupta, Nitin Mangal, Kamal Tiwari, and Pabitra Mitra have deciphered the nature of associations between different amino acids that are present in a protein. Knowledge and understanding of these association rules will be extremely helpful during the synthesis of artificial proteins.
With that, I hope I was able to clarify everything you needed to know about association rule mining.
If you happen to have any doubts, queries, or suggestions – do drop them in the comments below!