Mahout is an open-source project of the Apache Software Foundation that data scientists use to build distributed, scalable machine learning algorithms. Mahout primarily focuses on linear algebra, and its algorithms are written on top of the Hadoop infrastructure. Popular data mining techniques implemented by the framework include recommendation, classification, and clustering. Distance measures in Mahout are an essential topic for clustering problems.
Because Mahout gives developers a ready-to-use framework for managing bulk data quickly and effectively, it has become one of Apache's top projects. Companies such as Twitter, Facebook, LinkedIn, Adobe, and Yahoo use it for their internal data mining tasks.
What are distance measures?
As the name suggests, a distance measure quantifies the distance between data points. Distance measures in Mahout calculate how close two arbitrary vectors are to each other, which indicates how similar the corresponding points are. Let us now consider some examples.
- Suppose you run a telephone company, and you want to set up a network of towers in a certain region. To ensure optimum signal strength, you need to determine the locations for erecting the towers.
- The regional administration wants to open a series of public emergency-care wards. The location of these units across the region should be such that they lie in the proximity of the accident-prone areas.
- For effective law enforcement and stringent surveillance in areas with high crime rates, you can evaluate the vicinity in which the patrol vans should be stationed.
In all these scenarios, distance measures lie at the core of the clustering algorithm. In unsupervised learning problems, this computation is one of the most crucial factors in decision-making, and your choice of distance measure influences the results to a great extent.
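To see how the choice of measure can change an outcome, here is a minimal sketch in plain Python (not Mahout's Java API). The coordinates and the helper name `nearest` are hypothetical and were picked only so that the Euclidean and Manhattan measures disagree about which candidate is closer.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest(point, candidates, distance):
    """Return the name of the candidate closest to point under the given measure."""
    return min(candidates, key=lambda name: distance(point, candidates[name]))

# Hypothetical coordinates, chosen only so that the two measures disagree.
point = (0.0, 0.0)
candidates = {"A": (3.0, 3.0), "B": (0.0, 5.0)}

print(nearest(point, candidates, euclidean_distance))  # A (about 4.24 vs 5.0)
print(nearest(point, candidates, manhattan_distance))  # B (6.0 vs 5.0)
```

The same point is assigned to a different neighbor depending on the measure, which is why the measure has to match what "close" means for your data.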
You are also not limited to the techniques available in the Mahout library. You can apply a custom distance measure based on the context of your specific data or algorithm. All you need to do is implement the mathematical logic that takes two vectors and returns a distance value; the clustering algorithm uses that value to decide which centroid a point belongs to. The center of a cluster is referred to as the centroid.
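As an illustration of a custom measure, the sketch below uses a weighted Euclidean distance and assigns a point to whichever centroid is closest. It is plain Python rather than Mahout code, and the function names, weights, and coordinates are all assumptions made for the example.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance in which each dimension is scaled by its own weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def assign_to_centroid(point, centroids, weights):
    """Return the index of the centroid with the smallest custom distance."""
    distances = [weighted_distance(point, c, weights) for c in centroids]
    return distances.index(min(distances))

# Hypothetical data: two centroids and one point to assign.
centroids = [(0.0, 0.0), (10.0, 10.0)]
point = (4.0, 9.0)

# With equal weights the point is closer to the second centroid.
print(assign_to_centroid(point, centroids, (1.0, 1.0)))   # 1
# Down-weighting the second dimension moves it to the first centroid.
print(assign_to_centroid(point, centroids, (1.0, 0.01)))  # 0
```

Changing only the weights changes the assignment, which is the kind of domain-specific behavior a custom measure is meant to capture.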
Brushing up clustering basics
Before we delve into the different categories, let us first refresh our basics about clustering. A cluster is a group of data instances that are similar to one another and dissimilar to the instances in other clusters. Here are some real-life applications.
- Marketers can use clustering to segment customers and execute a targeted marketing strategy.
- As a clothing manufacturer, you may want to group people depending on similar T-shirt sizes, such as “Small,” “Medium,” and “Large.” A one-size-fits-all approach does not work every time. And customized T-shirts for each person can be expensive.
- In library management systems, clustering is used for organizing books and documents according to their content similarities.
- In an Earth observation database, clustering can help identify areas with similar land use.
- In biology, clustering can be used for categorizing genes having similar functionality and understanding structures inherent in different plant and animal populations.
Moreover, vast volumes of data are generated and used every day in this digital age. Hence, clustering has become one of the most widely used data mining techniques due to the convenience it offers.
The quality of clustering is determined by two primary aspects – the clustering algorithm and the distance function.
- Clustering algorithm (partitional, hierarchical, etc.)
- Distance function (similarity or dissimilarity)
Now that we have revised the foundational concepts, let us move on to the different types of distance measures available in Apache Mahout.
Distance measures in Mahout
Cosine distance measure
This type of distance measure is best suited for finding text similarity. Given a collection of text documents, it can produce a topic hierarchy by grouping them using the highest-weighted common words.
The cosine distance measure works on document vectors, which are typically built by weighting terms with the TF-IDF scheme. TF-IDF gives topic words higher weights than stop words, so similar documents share highly weighted topic words and their vectors point in similar directions. As a result, the centroid vector (or the cluster center) has a higher average weight for topic words.
Popular applications include the grouped search summaries you encounter on search engines such as Google. The algorithm first forms clusters and then finds their centroids. This procedure is also useful for information discovery in AI applications like Siri and Alexa.
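The following sketch shows the idea on three tiny documents. It is plain Python, not Mahout's implementation: the TF-IDF variant (term frequency times the log of inverse document frequency), the definition of cosine distance as 1 minus cosine similarity, and the sample sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity for sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention used in this sketch for all-zero vectors
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical mini-corpus: two related documents and one unrelated document.
docs = [
    "the cat chased the mouse".split(),
    "the cat caught the mouse".split(),
    "the stock market fell".split(),
]
vectors = tf_idf_vectors(docs)

print(vectors[0]["the"])                        # 0.0, the stop word carries no weight
print(cosine_distance(vectors[0], vectors[1]))  # below 1: shared topic words
print(cosine_distance(vectors[0], vectors[2]))  # 1.0: only "the" is shared
```

Because "the" appears in every document, its weight drops to zero, so the first and third documents end up maximally distant even though they share a word.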
Inter-cluster distance measure
The inter-cluster distance is the distance between objects that belong to two separate clusters, often summarized as the distance between the two centroids. It is appropriate for evaluating the quality of your clustering. If the centroids are too close to each other, the clusters overlap and it becomes hard to form groups with distinct features. Therefore, it is critical to draw clear distinctions between the members of different clusters. The overall goal is to partition or segment the data points into specific clusters.
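One common way to put a number on this is the distance between the two cluster centroids. The sketch below assumes that definition (other definitions, such as the closest pair of members, also exist) and uses made-up points; it is plain Python, not Mahout code.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inter_cluster_distance(cluster_a, cluster_b):
    """Distance between the centroids of two clusters."""
    return euclidean_distance(centroid(cluster_a), centroid(cluster_b))

# Hypothetical clusters with centroids at (2, 2) and (5, 6).
cluster_a = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]
cluster_b = [(4.0, 5.0), (6.0, 5.0), (4.0, 7.0), (6.0, 7.0)]

print(centroid(cluster_a))                           # (2.0, 2.0)
print(centroid(cluster_b))                           # (5.0, 6.0)
print(inter_cluster_distance(cluster_a, cluster_b))  # 5.0
```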
Intra-cluster distance measure
This measure gives you the distance between two members of the same cluster, so it is the counterpart of the inter-cluster distance measure. In a good clustering, intra-cluster distances are smaller than inter-cluster distances. Small distances between the members of a cluster indicate that the cluster is tight, and tight clusters are easier to discriminate from each other.
A good intra-cluster measure penalizes members that lie far apart and yields a small value for members that lie close together. Comparing it with the inter-cluster distance gives a simple quality check: clusters that are more separated have a higher ratio of inter-cluster to intra-cluster distance.
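A small sketch can make the comparison concrete. It assumes the intra-cluster distance is the mean distance over all pairs of members, and it reads the ratio mentioned above as centroid-to-centroid distance divided by mean intra-cluster distance; both are interpretations made for this example, and the helper `separation_ratio` and the sample points are hypothetical.

```python
import math
from itertools import combinations

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def intra_cluster_distance(cluster):
    """Mean distance over every pair of members (needs at least two members)."""
    pairs = list(combinations(cluster, 2))
    return sum(euclidean_distance(a, b) for a, b in pairs) / len(pairs)

def separation_ratio(cluster_a, cluster_b):
    """Centroid-to-centroid distance divided by the mean intra-cluster distance."""
    inter = euclidean_distance(centroid(cluster_a), centroid(cluster_b))
    intra = (intra_cluster_distance(cluster_a) + intra_cluster_distance(cluster_b)) / 2
    return inter / intra

# Hypothetical clusters: tight_b is compact and far away, loose_b is spread out and near.
tight_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
tight_b = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
loose_b = [(3.0, 3.0), (3.0, 6.0), (6.0, 3.0)]

print(intra_cluster_distance(tight_a))     # small: members are close together
print(separation_ratio(tight_a, tight_b))  # high: tight and well separated
print(separation_ratio(tight_a, loose_b))  # lower: looser and closer together
```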
Now, let us look at the following demonstration of similarity distance measures in cluster analysis.
A courier service can create different ‘delivery zones’ by grouping those locations that have minimal distance between them. In this way, the algorithm facilitates fast and effective delivery by the personnel. Our task is to keep the centroids of the clusters well separated, minimize intra-cluster variance, and ensure that the data points with the most similar characteristics are clustered together.
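The delivery-zone idea can be sketched with a basic k-means loop. This is a simplified, single-machine illustration in plain Python rather than Mahout's distributed implementation; the addresses, the starting centroids, and the function names are assumptions made for the example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def assign(points, centroids):
    """Group each point with its nearest centroid; returns one list per centroid."""
    zones = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda i: euclidean_distance(p, centroids[i]))
        zones[idx].append(p)
    return zones

def k_means(points, initial_centroids, max_iterations=20):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        zones = assign(points, centroids)
        # An empty zone keeps its previous centroid.
        updated = [mean_point(z) if z else c for z, c in zip(zones, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical delivery addresses on a simple 2D grid.
addresses = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
             (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, zones = k_means(addresses, [(0.0, 0.0), (10.0, 10.0)])

for center, zone in zip(centroids, zones):
    print(center, zone)
```

Each resulting zone contains addresses that are close to one another, and the centroid of a zone is a natural candidate location for its dispatch point.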
With this, we have explained the concept of distance measures in Mahout. Now that you have the gist of this important big data tool, you should be able to explain it clearly in a job interview. A clear understanding of the different distance measures will also help you achieve accurate results when implementing clustering algorithms.