If you’ve ever dipped even a toe into the world of data science or Python, you will have heard of R.
Developed as a GNU project, R is both a language and an environment designed for graphics and statistical computing. It is similar to the S language and can thus be considered a different implementation of it.
As a language, R is highly extensible. It provides a variety of statistical and graphical techniques like time-series analysis, linear modeling, non-linear modeling, clustering, classification, and classical statistical tests.
It is one of these techniques that we will explore more deeply here: clustering, or cluster analysis!
What is cluster analysis?
In the simplest of terms, clustering is a data segmentation method whereby data is partitioned into several groups on the basis of similarity.
How is the similarity assessed? On the basis of inter-observation distance measures. These can be either Euclidean or correlation-based distance measures.
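To make the two kinds of distance concrete, here is a minimal sketch in plain Python (chosen only for illustration; the function names are hypothetical and not part of any library). It assumes that "correlation-based distance" means one minus the Pearson correlation, which is a common convention but not the only one.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two observations of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length observations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distance(x, y):
    """Assumed definition: 1 - Pearson correlation (0 means same profile)."""
    return 1 - pearson_correlation(x, y)

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]  # same shape as a, but on a much larger scale

print(euclidean_distance(a, b))    # large: the magnitudes differ
print(correlation_distance(a, b))  # close to zero: the profiles move together
```

The two observations are far apart in Euclidean terms yet nearly identical in correlation terms, which is why the choice of measure changes which observations end up in the same cluster.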
Cluster analysis is one of the most popular and in a way, intuitive, methods of data analysis and data mining. It is ideal for cases where there is voluminous data and we have to extract insights from it. In this case, the bulk data can be broken down into smaller subsets or groups.
The little groups that are formed and derived from the whole dataset are known as clusters. These are obtained by performing one or more statistical operations. Clusters, though containing different elements, share the following properties:
- Their numbers are not known in advance.
- They are obtained by carrying out a statistical operation.
- Each cluster contains objects that are similar and have common characteristics.
Even without the ‘fancy’ name of cluster analysis, the technique is used a lot in day-to-day life.
At the individual level, we make clusters of the things we need to pack when we set out on a vacation. First clothes, then toiletries, then books, and so on. We make categories and then tackle them individually.
Companies use cluster analysis, too, when they carry out segmentation on their email lists and categorize customers on the basis of age, economic background, previous buying behaviour, etc.
Cluster analysis is a form of ‘unsupervised machine learning’ and is also referred to as pattern recognition. Unsupervised because we aren’t assigning samples to categories that were defined in advance. Learning because the algorithm itself learns how to cluster.
3 Methods of Clustering
We have three methods that are most often used for clustering. These are:
- Agglomerative Hierarchical Clustering
- Relational clustering / Condorcet method
- k-means clustering
1. Agglomerative Hierarchical Clustering
This is the most common type of hierarchical clustering. The algorithm for AHC works in a bottom-up manner. It begins by regarding each data point as a cluster in itself (called a leaf).
It then combines together the two clusters that are the most similar. These new and bigger clusters are called nodes. The grouping is repeated until the entire dataset comes together as a single, big cluster called the root.
Visualizing and drawing each step of the AHC process leads to the generation of a tree called a dendrogram.
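The bottom-up merging described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a reference implementation: the text does not say how the similarity between two clusters is measured, so the sketch assumes single linkage (the distance between the closest pair of points), and the function names are hypothetical.

```python
import math

def single_linkage(c1, c2):
    """Assumed linkage: distance between the closest pair of points."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[p] for p in points]  # every point starts as its own leaf
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        dist = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]  # the new, bigger node
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges  # the last merge produces the root

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
for left, right, dist in agglomerative(points):
    print(f"merge {left} + {right} at distance {dist:.2f}")
```

Each printed line corresponds to one junction of the dendrogram, and the distance at which two clusters merge is the height at which that junction would be drawn.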
Reversing the AHC process gives divisive clustering, which works top-down: it starts with the whole dataset as a single cluster and splits it repeatedly into smaller clusters.
[Figure: example dendrogram]
In conclusion, if you want an algorithm that is good at identifying small clusters, go for AHC. If you want one that is good at identifying large clusters, then the divisive clustering method should be your choice.
2. Relational clustering / Condorcet method
‘Clustering by Similarity Aggregation’ is another name for this method. It works as follows:
The individual objects that make up the global clustering are compared in pairs. Each pair of individuals (A, B) is assigned two values, m(A, B) and d(A, B). Here, m(A, B) is the number of variables on which A and B have the same value, whereas d(A, B) is the number of variables on which they have different values.
The Condorcet criterion for two individuals A and B is then defined as:
c(A, B) = m(A, B) - d(A, B)
For an individual A and a cluster S, the Condorcet criterion is:
c(A, S) = Σ_i c(A, B_i)
The summation runs over all B_i ∈ S.
With the above definitions in place, the clusters are constructed: each individual A is placed in the cluster S for which c(A, S) is the largest, provided that this value is at least 0.
Finally, the global Condorcet criterion is calculated. This is done by summing c(A, S_A) over all individuals A, where S_A is the cluster that contains A.
The above steps are repeated until the global Condorcet criterion no longer improves or the maximum number of iterations is reached.
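The criterion is easier to follow with a small worked example. The sketch below, in plain Python with hypothetical function names, runs a single assignment pass over four records and then computes the global criterion. Two points are assumptions not spelled out in the text: a record whose best score is negative opens a new cluster, and the global sum includes each record's comparison with itself.

```python
def pair_score(a, b):
    """c(A, B) = m(A, B) - d(A, B) for two records of categorical values."""
    m = sum(1 for x, y in zip(a, b) if x == y)  # variables with the same value
    d = len(a) - m                              # variables with different values
    return m - d

def cluster_score(a, cluster):
    """c(A, S): the sum of c(A, B_i) over every B_i in S."""
    return sum(pair_score(a, b) for b in cluster)

def assign(records):
    """One pass: place each record in its best-scoring cluster."""
    clusters = []
    for a in records:
        scores = [cluster_score(a, s) for s in clusters]
        if scores and max(scores) >= 0:
            clusters[scores.index(max(scores))].append(a)
        else:
            # assumption: a negative best score (or no cluster yet) opens a new cluster
            clusters.append([a])
    return clusters

def global_criterion(clusters):
    """Sum of c(A, S_A) over every record A (self-comparison included)."""
    return sum(cluster_score(a, s) for s in clusters for a in s)

records = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
]
clusters = assign(records)
print(clusters)
print(global_criterion(clusters))
```

A fuller version would repeat the assignment pass, moving records between clusters, and stop once the global criterion no longer improves.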
3. k-means clustering
This is one of the most popular partitioning algorithms. It divides the data into k clusters, where k is chosen in advance, and every data point (sometimes also called an observation) is assigned to exactly one of these clusters. Here is a breakdown of how the algorithm proceeds:
- Select k rows at random from the dataset to serve as the initial centroids, one per cluster.
- Assign each data point to the centroid closest to it.
- Recalculate each centroid as the average of all the data points assigned to it.
- Reassign the data points to the updated centroids, shifting the centroids again as needed.
- Repeat steps 3 and 4 until no data points change cluster.
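The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function name `k_means`, the `seed` and `max_iter` parameters, and the choice to keep an old centroid when its cluster becomes empty are all assumptions made for the example, and Euclidean distance is used throughout.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: k random rows as centroids
    labels = None
    for _ in range(max_iter):
        # steps 2 and 4: assign every point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # step 5: stop once no point changes cluster
            break
        labels = new_labels
        # step 3: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 and which receives label 1 depends on the random starting rows, so only the grouping itself is meaningful, not the label numbers.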
The distance between a data point and a centroid is calculated using one of the following methods:
- Euclidean distance
- Manhattan distance
- Minkowski distance
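The most popular of these, the Euclidean distance, is calculated as follows:
d(x, c) = √( Σ_i (x_i - c_i)² )
Here, x is a data point, c is a centroid, and the sum runs over all the variables i.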
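The three measures are closely related: the Minkowski distance of order p reduces to the Manhattan distance when p = 1 and to the Euclidean distance when p = 2. A short sketch in plain Python shows this (the function name is hypothetical, used only for illustration):

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length observations."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

point = (1.0, 2.0)
centroid = (4.0, 6.0)

manhattan = minkowski_distance(point, centroid, p=1)  # |3| + |4|
euclidean = minkowski_distance(point, centroid, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

Because only the exponent changes, a k-means implementation can expose p as a parameter instead of hard-coding one measure.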
Each time the algorithm is run, different groups may be returned. This is because the initial choice of the k centroids is completely random, which makes k-means very sensitive to its starting point. As a result, it becomes almost impossible to get the same clustering on every run unless the number of groups and overall observations is small.
How to assign a value to k?
The value of k has to be chosen before the algorithm runs, and it dictates the direction that the results head in. To help make a good choice, it is helpful to keep in mind a commonly cited rule of thumb:
k ≈ √(n / 2)
Here, n is the number of data points in the dataset.
Regardless of the presence of a formula, the number of clusters would be heavily dependent on the nature of the dataset, the industry and business it belongs to, etc. Hence, it is advisable to pay heed to one’s own experience and intuition as well.
With the wrong number of clusters, the grouping may not be as effective and can lead to overfitting. Due to overfitting, new data points might not find a place in any cluster, since the algorithm has eked out the little details and all generalization is lost.
Applications of Cluster Analysis
So, where exactly are the powerful clustering methods used? We cursorily mentioned a few examples above. Below are some more instances:
Medicine and health
By clustering patients on the basis of age and genetic makeup, doctors are able to provide a better diagnosis. This ultimately leads to treatment that is more beneficial and better aligned with the patient. New medicines can also be discovered this way. Clustering in medicine is termed nosology.
In social spheres, clustering people on the basis of demographics, age, occupation, residence location, etc. helps the government to enforce laws and shape policies that suit diverse groups.
In marketing, the term clustering is replaced by segmentation or typological analysis. It is used to explore and select potential buyers of a particular product. Companies then test the elements of each cluster to learn which customers display behavior that favors retention.
In web analytics, the web pages that a user has accessed in the past serve as the input to the clustering algorithm. These web pages are then clustered. In the end, a profile of the user, based on their browsing activity, is generated. From personalization to cyber safety, this result can be leveraged in many places.
Retail outlets also benefit from clustering customers on the basis of age, colour preferences, style preferences, past purchases, etc. This helps retailers to create customized experiences and also plan future offerings aligned to customer desires.
As is evident, cluster analysis is a highly valuable method, no matter the language or environment it is implemented in. Whether one wants to derive insights, eke out patterns, or carve out profiles, cluster analysis is a useful tool with results that can be put into practice. Proficiency in working with the various clustering algorithms can lead one to perform accurate and truly valuable data analysis.