Did you know? As of January 2025, the HDBSCAN implementation in R’s dbscan package supports long vectors, letting it process distance matrices with more than 2^31 elements. This removes a long-standing memory limitation and enables efficient clustering of much larger or higher-dimensional datasets.
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is an advanced clustering algorithm. It efficiently identifies clusters in data with varying densities while handling noise and outliers better than traditional methods.
It builds a cluster hierarchy and selects the most stable structures, which makes it especially useful for high-dimensional or noisy datasets; typical use cases include customer segmentation, anomaly detection, and unsupervised pattern recognition. This guide explains the core HDBSCAN algorithm, walks through implementing it with Python’s hdbscan library, and surveys its real-world applications.
Hone your skills in ML and AI with upGrad’s Artificial Intelligence & Machine Learning - Courses, designed in collaboration with the world’s top 1% universities. Join over 1,000 leading companies and unlock an average 51% salary hike while learning from industry experts!
HDBSCAN in ML is a clustering algorithm used to find patterns or groupings in unlabeled data without needing to specify the number of clusters. It improves on DBSCAN by handling datasets with varying densities and identifying noise more effectively.
HDBSCAN is commonly used in machine learning tasks like customer segmentation, anomaly detection, and image analysis, where traditional clustering methods may struggle. It is valued for its flexibility, minimal parameter tuning, and ability to reveal meaningful structure in complex datasets.
Planning to take your career to the next level by learning ML and AI? Here are some top-rated courses to help you get there:
Importance of HDBSCAN Clustering in ML
HDBSCAN in ML is crucial as it adapts to datasets with varying densities. It automatically identifies meaningful patterns and outliers without requiring a predefined number of clusters. Its ability to handle noise and reveal hierarchical structures makes it ideal for large-scale data analysis.
Unlock your potential with upGrad's Post Graduate Diploma in Machine Learning. Gain expertise in ML, Generative AI, and statistics. By the end of the program, you will be proficient in designing, developing, and deploying ML models for applications across various industries.
Also Read: Top 10 Machine Learning Applications in 2025 and the Role of Edge Computing
HDBSCAN has several key parameters that control its behavior and the quality of clustering. They can influence how clusters are formed, how noise is handled, and the granularity of the results.
Below are the main parameters and their significance:
min_cluster_size: Specifies the minimum number of points required to form a cluster; points that don't meet this threshold are classified as noise. Decreasing it allows smaller clusters to form, making the algorithm more sensitive and able to pick up small groups within the data, at the risk of overfitting into many tiny, irrelevant clusters. Increasing it yields fewer, larger, more stable clusters, but may exclude smaller patterns that could be important in some datasets.
min_samples: Defines the minimum number of neighboring points a data point must have to be considered a core point, the points around which clusters form. Higher values make the algorithm stricter in defining clusters, requiring denser regions and helping eliminate noisy or spurious clusters, though values that are too high may fail to identify smaller, valid clusters. Lower values make the algorithm more flexible, potentially finding more clusters but also admitting more noise or less meaningful groupings.
metric: Specifies the method for measuring the distance between points. Common options include Euclidean distance, the straight-line distance between two points, and Manhattan distance, which measures distance along axes at right angles (grid-based). The choice of metric shapes how clusters are defined and can significantly affect the algorithm's performance.
For example, Euclidean distance works well for continuous, spatial data, while Manhattan distance is often better suited for grid-like or non-continuous data. A custom metric can be used when domain-specific relationships between points need to be considered, such as in text or categorical data clustering.
cluster_selection_method: Determines how clusters are selected from the hierarchical tree. There are two main methods: eom (Excess of Mass), the default, which picks the clusters that persist longest across the hierarchy and tends to produce a few stable clusters, and leaf, which selects the leaves of the tree and produces many smaller, fine-grained clusters.
alpha: Controls the level of detail in the clustering hierarchy. A smaller alpha value leads to finer clusters, making the algorithm more sensitive to small variations in density. This can result in a more detailed and intricate cluster structure, but it may also create many small, potentially irrelevant clusters.
A larger alpha value reduces the detail, grouping points into broader clusters, and is useful when looking for more general trends in the data. Tuning this parameter helps strike a balance between precision and the ability to capture broader, overarching structures.
p: Used with the Minkowski distance formula, which generalizes both Euclidean (p = 2) and Manhattan (p = 1) distances. The value of p defines the power parameter in the Minkowski formula: d(x, y) = (Σ |x_i − y_i|^p)^(1/p). Larger values of p weight large per-coordinate differences more heavily.
Each of these parameters significantly impacts how the algorithm processes data, determines clusters, and identifies noise.
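To see how these options fit together, here is a minimal sketch using the Python hdbscan library; the dataset and every parameter value are illustrative choices, not recommendations:

import hdbscan
from sklearn.datasets import make_blobs

# Toy data; all parameter values below are examples, not tuned settings
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,             # smallest grouping reported as a cluster
    min_samples=5,                   # neighborhood size used for core distances
    metric='euclidean',              # or 'manhattan', 'minkowski' (with p), etc.
    cluster_selection_method='eom',  # 'eom' (excess of mass) or 'leaf'
    alpha=1.0,                       # hierarchy detail / robustness parameter
)
labels = clusterer.fit_predict(X)
print(labels[:10])  # cluster indices; -1 marks noise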
Are you a full-stack developer wanting to integrate AI into Python programming workflow? upGrad’s AI-Driven Full-Stack Development bootcamp can help you. You’ll learn how to build AI-powered software using OpenAI, GitHub Copilot, Bolt AI & more.
Also Read: Time Series Forecasting with ARIMA Models: Components, Advantages & Steps
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that works by building a hierarchy of clusters based on the density distribution of data points. It creates a mutual-reachability graph, where points are connected based on their similarity or distance.
The algorithm then uses this graph to identify clusters of varying densities, making it highly adaptable to complex datasets. Unlike traditional clustering methods, HDBSCAN does not require users to define the number of clusters beforehand, allowing for a more flexible clustering process.
Step 1. Mutual Reachability Graph Construction
The first step in HDBSCAN is constructing a mutual-reachability graph. This graph connects data points based on a special distance measure that accounts for local density. The algorithm treats points in denser regions as being closer together and creates a graph where points are connected by edges weighted by this measure.
Here’s a closer look at the key elements involved in this process (a small numeric sketch follows the list):
- Core distance: for each point, the distance to its k-th nearest neighbor, where k is set by min_samples. Points in dense regions have small core distances.
- Mutual reachability distance: for two points a and b, the maximum of core_distance(a), core_distance(b), and the distance between a and b. This inflates distances around sparse points, pushing low-density points away from clusters.
- Graph edges: conceptually, every pair of points is joined by an edge weighted by this mutual reachability distance.
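The following is a minimal NumPy sketch of these definitions on a toy dataset, not the library’s optimized implementation; the sample points and the value of k are arbitrary choices:

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],   # a dense patch
              [5.0, 5.0], [5.1, 4.9],               # a second patch
              [10.0, 0.0]])                         # an isolated point
k = 2  # plays the role of min_samples

D = cdist(X, X)                      # pairwise Euclidean distances
# Core distance: distance to the k-th nearest neighbor
# (column 0 of the sorted rows is the zero distance to the point itself)
core = np.sort(D, axis=1)[:, k]
# Mutual reachability: max(core[a], core[b], d(a, b)) for each pair
mreach = np.maximum(D, np.maximum.outer(core, core))
print(np.round(mreach, 2))  # note how the isolated point's distances are inflated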
Step 2. Building the Minimum Spanning Tree (MST)
After constructing the mutual-reachability graph, the next step in HDBSCAN is to create a Minimum Spanning Tree (MST). The algorithm connects all data points using the shortest possible connections based on the mutual reachability graph, forming a tree that links every point while minimizing the total edge weight.
Here’s a breakdown of how the MST plays a crucial role in cluster formation:
- The MST keeps, for every pair of points, the cheapest route connecting them through the mutual reachability graph, so the density structure survives even though most edges are discarded.
- Sorting the MST’s edges by weight and removing them from largest to smallest yields the hierarchy of clusters used in the next step; a continuation of the earlier sketch, shown after this step, builds the MST with SciPy.
This step ensures that the algorithm groups points with similar density, creating a structure where clusters can be identified at different levels of granularity.
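Continuing that sketch, SciPy’s minimum_spanning_tree can build the MST directly from the mutual reachability matrix; this illustrates the idea rather than reproducing the hdbscan library’s internal computation:

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

# Same toy data and mutual reachability matrix as in the Step 1 sketch
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.1, 4.9],
              [10.0, 0.0]])
k = 2
D = cdist(X, X)
core = np.sort(D, axis=1)[:, k]
W = np.maximum(D, np.maximum.outer(core, core))
np.fill_diagonal(W, 0)  # no self-edges

# The MST keeps only n-1 edges yet preserves density connectivity
mst = minimum_spanning_tree(W).tocoo()
for a, b, w in zip(mst.row, mst.col, mst.data):
    print(f"edge {a}-{b}: weight {w:.2f}")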
Step 3. Condensing the Tree
Once the Minimum Spanning Tree (MST) is built, HDBSCAN proceeds to condense the tree by removing the longest edges, which represent weak connections. This step gradually breaks the tree into smaller, more meaningful clusters, allowing the algorithm to identify stable structures while filtering out noise.
Here’s how the process works: walking down the hierarchy, each split is checked against min_cluster_size. If one side of a split contains fewer points than min_cluster_size, those points are treated as having “fallen out” of the parent cluster rather than as a new cluster; only splits where both children are large enough count as true cluster splits. The result is a much smaller condensed tree containing only candidate clusters.
This process isolates the most stable and meaningful clusters while filtering out noise and irrelevant data points.
Step 4. Extracting Clusters
After condensing the tree, the next step in HDBSCAN is to extract the final clusters. The cut is determined either by a user-defined minimum cluster size or by a stability-based heuristic, ensuring that the most stable and meaningful clusters are selected. Here’s how the process works: for every candidate cluster in the condensed tree, the algorithm computes a stability score, roughly how long the cluster persists as the density threshold tightens; it then selects the non-overlapping set of clusters that maximizes total stability, and any point not covered by a selected cluster is labeled as noise (-1).
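The hdbscan library exposes these intermediate structures after fitting, which makes steps 2 to 4 easy to inspect; a brief sketch, with attribute names following the library’s API:

import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=1)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(X)

# The single-linkage hierarchy built from the MST (steps 2-3)
print(clusterer.single_linkage_tree_.to_pandas().head())
# The condensed tree used for stability-based extraction (steps 3-4)
print(clusterer.condensed_tree_.to_pandas().head())
# Per-point cluster membership strength; low values often mark borderline points
print(clusterer.probabilities_[:10])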
With the extraction process complete, the final clusters are identified, leaving us with a clear distinction between meaningful groupings and noise. Next, let’s explore how this process can be implemented in Python with a practical example.
If you want to understand how to work with AI and ML, upGrad’s Executive Diploma in Machine Learning and AI can help you. With a strong hands-on approach, this AI ML program ensures that you apply theoretical knowledge to practical problems.
Also Read: Exponential Smoothing Method in Forecasting: Techniques and Applications
Implementing the HDBSCAN algorithm involves several steps, from installation to running the algorithm and visualizing the results. This section walks through the process step by step, showing how to use the hdbscan library alongside related libraries such as Scikit-learn.
Note: You do not need a different library for each step of the HDBSCAN algorithm itself; clustering can be performed entirely with the hdbscan library. However, additional libraries are commonly used for the tasks surrounding clustering, such as preparing data and visualizing results.
Step 1: Install Required Libraries
To get started, install the HDBSCAN library along with supporting libraries. For the purpose of this implementation, you only need to install HDBSCAN along with scikit-learn for data manipulation and matplotlib for visualization. You can install them using pip:
pip install hdbscan
pip install scikit-learn
pip install matplotlib
Once installed, you can use HDBSCAN to perform clustering tasks directly. Other libraries like scikit-learn are optional and can be integrated for specific tasks, such as creating synthetic datasets or standardizing the data.
Step 2: Import Libraries
Now that the libraries are installed, import the necessary packages to perform clustering and visualize the results.
import hdbscan
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
In this step, we import: hdbscan, which provides the clustering algorithm; pandas, handy for tabular data (this minimal example doesn’t use it directly); matplotlib.pyplot for plotting; make_blobs for generating a synthetic dataset; and StandardScaler for standardizing features before clustering.
Step 3: Prepare the Data
Preparing your dataset is essential before applying the HDBSCAN algorithm. This may involve normalizing or scaling the data or generating a synthetic dataset for testing purposes.
# Generate synthetic data
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.60, random_state=0)
X = StandardScaler().fit_transform(X) # Standardize data
In this example: make_blobs generates 200 two-dimensional points around 5 centers with a cluster standard deviation of 0.60, and StandardScaler rescales each feature to zero mean and unit variance so no single feature dominates the distance calculations.
Step 4: Apply the HDBSCAN Algorithm
Now, apply the HDBSCAN algorithm to the dataset. You can adjust key parameters like min_cluster_size and min_samples to tune the algorithm’s sensitivity to clusters.
# Create HDBSCAN model
hdb = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=5)
# Fit the model
hdb.fit(X)
What the Parameters Mean: min_cluster_size=10 means no grouping smaller than 10 points will be reported as a cluster, while min_samples=5 sets the neighborhood size used to estimate local density; raising it makes the algorithm more conservative and labels more points as noise.
Understanding the Output Labels:
After fitting the model, you can access the cluster assignments using hdb.labels_.
For instance, in customer segmentation, label 0 might correspond to frequent bargain shoppers and label 1 to occasional high-value buyers, while -1 would flag one-off purchasers that fit no segment (these interpretations are illustrative). A quick way to inspect the label distribution is shown below.
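Continuing with the hdb model fitted in Step 4, this snippet counts how many points landed in each cluster:

import numpy as np

# Inspect how many points fell into each cluster (label -1 is noise)
labels, counts = np.unique(hdb.labels_, return_counts=True)
for label, count in zip(labels, counts):
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {count} points")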
To visually interpret the result, move to the next step—visualizing the clusters.
Step 5: Visualize the Clusters
After the model is trained, visualize the resulting clusters using matplotlib. Each point is colored according to its assigned cluster, with -1 representing noise.
# Plotting the clusters
plt.scatter(X[:, 0], X[:, 1], c=hdb.labels_, cmap='viridis', marker='o')
plt.title("HDBSCAN Clustering")
plt.show()
This code will generate a scatter plot where each data point is color-coded according to the cluster it belongs to. Points labeled -1 are treated as noise and shown in a distinct color.
Step 6: Analyze the Results
Once clustering is complete, you can analyze the results by examining the labels_ attribute of the HDBSCAN model. The labels_ array contains the cluster assignments for each data point.
# Print the number of clusters
print("Number of clusters:", len(set(hdb.labels_)) - (1 if -1 in hdb.labels_ else 0))
# Print noise points count
print("Number of noise points:", list(hdb.labels_).count(-1))
Step 7: Fine-tuning Parameters
You can fine-tune the HDBSCAN algorithm to achieve better clustering by adjusting parameters like min_cluster_size and min_samples based on the characteristics of your data.
# Experiment with different parameters
hdb = hdbscan.HDBSCAN(min_cluster_size=20, min_samples=10)
hdb.fit(X)
# Visualize the results again
plt.scatter(X[:, 0], X[:, 1], c=hdb.labels_, cmap='viridis', marker='o')
plt.title("HDBSCAN Clustering - Adjusted Parameters")
plt.show()
Adjusting min_cluster_size and min_samples will change the sensitivity of the algorithm to density, resulting in more or fewer clusters depending on the settings.
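To make this tuning less ad hoc, you can sweep a small grid of parameter values and score each run, for example with the silhouette score on the non-noise points. The sketch below continues the example above; note that the hdbscan library also ships a density-based validity index, which may suit arbitrarily shaped clusters better than silhouette.

from itertools import product
from sklearn.metrics import silhouette_score

# Sweep a small grid; score only the points HDBSCAN did not label as noise
best = None
for mcs, ms in product([10, 20, 30], [5, 10]):
    model = hdbscan.HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit(X)
    mask = model.labels_ != -1
    # silhouette needs at least two clusters among the non-noise points
    if mask.sum() > 1 and len(set(model.labels_[mask])) > 1:
        score = silhouette_score(X[mask], model.labels_[mask])
        if best is None or score > best[0]:
            best = (score, mcs, ms)

if best is not None:
    print(f"Best silhouette {best[0]:.3f} with "
          f"min_cluster_size={best[1]}, min_samples={best[2]}")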
In this example, we apply HDBSCAN to cluster customer data based on purchasing behavior, helping an e-commerce company segment customers into meaningful groups. The algorithm will detect clusters of customers with similar buying patterns, while also identifying noise points as outliers (e.g., one-time buyers).
This can help the company better understand its customer base and target marketing efforts more effectively.
import hdbscan
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# Step 1: Generate synthetic customer purchase data (for illustration)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X) # Standardizing the data
# Step 2: Apply the HDBSCAN algorithm
hdb = hdbscan.HDBSCAN(min_cluster_size=30, min_samples=5)
hdb.fit(X)
# Step 3: Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=hdb.labels_, cmap='coolwarm', marker='o', s=50)
plt.title("HDBSCAN Customer Segmentation")
plt.xlabel("Feature 1 (e.g., frequency of purchases)")
plt.ylabel("Feature 2 (e.g., total spending)")
plt.colorbar(label='Cluster Label')
plt.show()
# Step 4: Analyze the results
num_clusters = len(set(hdb.labels_)) - (1 if -1 in hdb.labels_ else 0)
num_noise_points = list(hdb.labels_).count(-1)
print("Clusters:", num_clusters, "Noise points:", num_noise_points)  # Show the results
Explanation: The synthetic blobs stand in for two customer features (for example, purchase frequency and total spending). HDBSCAN is fitted with min_cluster_size=30 so that only segments of at least 30 customers are reported, and min_samples=5 to control density sensitivity; the scatter plot then colors each customer by segment, and the final step counts segments and noise points.
Output: with random_state=42, the four generated blobs are typically recovered as four clusters, with a small number of boundary points labeled -1 (noise).
In this practical example, HDBSCAN helps identify customer segments for targeted marketing or personalized offers, while noise points may represent anomalies or users who need separate analysis.
Also Read: Hierarchical Clustering in Python [Concepts and Analysis]
Preprocessing techniques like PCA, t-SNE, or UMAP can be used before applying HDBSCAN to reduce the data’s dimensionality. This is particularly useful for mitigating the curse of dimensionality in high-dimensional data, which can be noisy and computationally expensive to cluster.
Why Use It? In high-dimensional spaces, distances between points become less informative, so density estimates degrade and clustering becomes slower and less reliable. Projecting the data onto a handful of informative dimensions first often produces cleaner clusters and faster runtimes.
How is It Done? Fit a reducer such as PCA or UMAP on the feature matrix, transform the data into the lower-dimensional embedding, and then run HDBSCAN on that embedding, as the sketch after this section shows.
It’s optional. However, it’s recommended for high-dimensional datasets to improve cluster quality and computational performance. By reducing dimensions, clustering becomes more efficient and effective, making it a beneficial step in many cases.
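A minimal sketch of this pipeline using PCA from scikit-learn (UMAP from the umap-learn package would slot into the same place); the number of components is an illustrative choice:

import hdbscan
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data: 20 features
X, _ = make_blobs(n_samples=400, centers=4, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

# Reduce to a few components before clustering
X_reduced = PCA(n_components=5, random_state=0).fit_transform(X)

labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(X_reduced)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))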
Now that you have a better understanding of how to implement HDBSCAN clustering in Python, let’s look into some practical use cases of HDBSCAN clustering in real life.
If you need a better understanding of cybersecurity, upGrad’s free Fundamentals of Cybersecurity course can help you. You will learn key concepts, current challenges, and important terminology to protect systems and data.
Also Read: Image Segmentation Techniques [Step By Step Implementation]
HDBSCAN is increasingly used in real-life applications where identifying clusters of varying densities is essential. Its ability to automatically detect clusters without predefining their number makes it ideal for dynamic and complex datasets. From anomaly detection to customer segmentation, HDBSCAN provides valuable insights in various industries.
1. Industrial Process Monitoring
HDBSCAN is a valuable tool for industrial process monitoring, allowing for the identification of abnormal patterns or outliers in real-time data. By clustering sensor data from machines or production lines, HDBSCAN can quickly detect faults or inefficiencies, leading to better decision-making and predictive maintenance.
2. Fraud Detection
HDBSCAN plays a crucial role in fraud detection by identifying unusual patterns in transactional data. It helps to detect anomalous behaviors, such as fraudulent transactions, by clustering data points with similar characteristics and flagging outliers. Since fraud patterns can vary in density and structure, HDBSCAN’s ability to adapt to these variations makes it ideal for detecting subtle fraudulent activities.
3. Spatial Data Analysis
HDBSCAN is highly effective in spatial data analysis, where it helps identify clusters in geospatial data with varying densities. By clustering geographic locations based on proximity and other spatial features, HDBSCAN can uncover patterns such as areas of high activity or underutilized zones. Its ability to handle noise and irregular shapes makes it suitable for complex spatial datasets, including urban planning and environmental monitoring.
4. Genomics
HDBSCAN is valuable in genomics for clustering genetic data, such as gene expression levels or genetic variations, which often exhibit complex, non-linear relationships. It helps in identifying subgroups within genetic datasets, such as patient populations with similar genetic profiles, without requiring a predefined number of clusters.
5. Market Segmentation
HDBSCAN is highly effective in market segmentation by clustering customers based on purchasing behavior, demographics, and other factors. It helps identify distinct customer segments, making it ideal for dynamic and evolving markets. It ensures that both large and niche customer segments are accurately identified.
Now that you have a better understanding of how to implement HDBSCAN Clustering in Python, let’s look at some of its advantages and drawbacks.
HDBSCAN offers significant advantages, such as its ability to handle clusters of varying densities, manage noise, and not require pre-defined cluster counts. However, it also has limitations, including sensitivity to parameters and higher computational complexity compared to simpler algorithms like k-means.
Below is a table that explores both the strengths and challenges of using HDBSCAN.
Advantages | Limitations
--- | ---
Automatically detects the optimal number of clusters, reducing the need for user input. | Can be computationally expensive, especially for large datasets, due to the MST construction.
Handles data with varying densities, identifying clusters with different shapes. | Struggles with high-dimensional data, as clustering effectiveness decreases with more features.
Labels noise points separately, improving clustering quality and focusing on meaningful clusters. | Sensitive to parameter tuning, especially min_cluster_size and min_samples, which affect clustering quality.
To get the most out of HDBSCAN clustering, consider the following best practices:
- Standardize or scale features before clustering so no single feature dominates the distance metric.
- Set min_cluster_size to the smallest grouping you would act on, then adjust min_samples to control how aggressively points are labeled as noise.
- For high-dimensional data, apply a dimensionality reduction technique such as PCA or UMAP first.
- Inspect the condensed tree and per-point membership probabilities rather than relying on labels alone.
- Validate results with visualization and metrics such as silhouette scores computed on non-noise points.
Also Read: Top 10 Dimensionality Reduction Techniques for Machine Learning(ML) in 2025
Test your understanding of HDBSCAN clustering in machine learning with a short quiz covering its parameters, advantages, limitations, and best practices. Answering the following multiple-choice questions will help reinforce the concepts from this tutorial and prepare you for practical application.
Test your knowledge now!
1. What is the effect of increasing the min_cluster_size parameter in HDBSCAN?
a) It reduces the number of clusters by requiring larger clusters to be formed
b) It increases the number of clusters by allowing smaller clusters
c) It makes the algorithm more sensitive to noise
d) It makes the clustering process faster
2. In HDBSCAN, what does the min_cluster_size parameter control?
a) The maximum allowed distance between data points
b) The minimum number of points required to form a cluster
c) The density of points in each cluster
d) The number of outliers in the data
3. What is the role of the mutual-reachability distance in HDBSCAN?
a) It measures the Euclidean distance between points
b) It calculates the density difference between points
c) It defines the proximity between points based on density and distance
d) It measures the correlation between different clusters
4. Which of the following is a limitation of HDBSCAN?
a) It requires the number of clusters to be defined beforehand
b) It can’t handle high-dimensional data
c) It can be computationally expensive for very large datasets
d) It cannot handle noise or outliers
5. How does HDBSCAN handle noise and outliers in a dataset?
a) Noise points are assigned to the nearest cluster
b) It labels points that do not fit into any cluster as noise (label -1)
c) It excludes noisy data points from the dataset completely
d) Noise points are automatically removed from the data
6. What is the function of the condensed tree in HDBSCAN?
a) It visualizes the data points' distribution
b) It simplifies the hierarchical structure to focus on stable clusters
c) It automatically determines the number of clusters
d) It calculates the core distances of all data points
7. Which of the following techniques can be used to improve HDBSCAN's performance on high-dimensional data?
a) Increasing the number of clusters
b) Using dimensionality reduction techniques like PCA or UMAP
c) Using k-means for initial clustering
d) Predefining the noise threshold
8. What does the min_samples parameter influence in HDBSCAN?
a) The minimum number of points needed for a core point
b) The maximum allowed number of points in a cluster
c) The size of the minimum spanning tree
d) The number of outliers to be detected
9. Which of the following is a typical use case of HDBSCAN?
a) Clustering highly structured data with fixed boundaries
b) Identifying clusters in data with varying densities, such as customer segmentation
c) Clustering data that has well-separated groups
d) Data with known predefined clusters
10. How can you evaluate the quality of clusters produced by HDBSCAN?
a) By counting the number of clusters
b) By visualizing clusters and assessing silhouette scores
c) By measuring the distance between clusters only
d) By calculating the density of individual data points
You can also continue expanding your skills in machine learning with upGrad, which will help you deepen your understanding of advanced ML concepts and practical applications.
HDBSCAN clustering is a powerful technique used in machine learning and data analysis to group data points based on varying densities, uncovering hidden patterns and outliers. It plays a critical role in applications like anomaly detection and customer segmentation. Are you worried about understanding machine learning algorithms? upGrad can help you upskill and master complex algorithms and concepts of ML and data science, enhancing practical skills and facilitating career advancement.
upGrad offers online courses, live classes, and mentorship to help you excel in machine learning. With 10 million learners, 200+ programs, and 1,400+ hiring partners, upGrad provides comprehensive mentorship, interactive workshops, and hands-on projects that let learners apply theoretical knowledge in real-world scenarios.
While the course covered in the tutorial can significantly improve your knowledge, here are some free courses to facilitate your continued learning:
You can also get personalized career counseling with upGrad to guide your career path, or visit your nearest upGrad center and start hands-on training today!
HDBSCAN improves upon DBSCAN by allowing variable density clusters and not requiring a fixed eps parameter. Unlike DBSCAN, HDBSCAN automatically finds the number of clusters and can handle hierarchical relationships. Additionally, it uses a mutual-reachability graph to connect points based on both distance and density, providing more flexibility in identifying clusters of varying sizes and shapes.
Key parameters to tune in HDBSCAN include min_cluster_size and min_samples. min_cluster_size controls the smallest allowable cluster size, while min_samples affects the density threshold needed for a core point. You can experiment with different values and use metrics like the Silhouette Score to evaluate cluster quality and fine-tune these parameters for your specific dataset.
HDBSCAN is an unsupervised clustering algorithm that is particularly effective when you don't have labeled data or predefined groups. Unlike supervised methods, which rely on labeled data to predict or classify data points, HDBSCAN discovers hidden patterns based on the natural structure of the data. Its ability to detect noise and varying densities gives it a significant advantage in complex, unlabeled datasets.
HDBSCAN automatically classifies points that don't fit into any clusters as noise (labeled as -1). By focusing on core points and eliminating sparse data points, it reduces the impact of outliers on clustering results. This feature ensures that the clustering process remains robust, especially in datasets with irregular or noisy points.
Yes, HDBSCAN can be applied to time-series data for anomaly detection or clustering. By analyzing time-series data points in terms of their similarity and density over time, it can identify trends, seasonal patterns, and outliers in temporal datasets. However, preprocessing steps like smoothing or normalization are essential to ensure the algorithm identifies meaningful temporal patterns.
Yes, HDBSCAN can handle high-dimensional data, but like other clustering algorithms, its performance may degrade with increasing dimensions due to the "curse of dimensionality." Applying dimensionality reduction techniques like PCA or UMAP before clustering can improve its effectiveness by reducing noise and making the clustering process more efficient.
HDBSCAN builds a hierarchical tree of clusters, where each branch represents a potential cluster. It then condenses this tree, retaining only the most stable clusters, and allows users to select the final clusters based on cluster stability. This hierarchical approach enables the algorithm to detect clusters at multiple levels of granularity, providing more flexibility than flat clustering methods.
Yes, HDBSCAN can be used for real-time anomaly detection. It continuously clusters incoming data and flags points that do not belong to any cluster as anomalies (noise). In applications like network security or fraud detection, HDBSCAN can detect unusual activities or outliers in real time, alerting users to potential issues immediately.
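The hdbscan library supports this pattern through its prediction API; a sketch, assuming the model is fitted with prediction_data=True so new points can be scored without refitting:

import hdbscan
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.5, random_state=0)

# prediction_data=True stores what approximate_predict needs for new points
clusterer = hdbscan.HDBSCAN(min_cluster_size=25, prediction_data=True).fit(X)

# Score incoming points without refitting; label -1 flags likely anomalies
new_points = np.array([[0.0, 0.0], [50.0, 50.0]])
labels, strengths = hdbscan.approximate_predict(clusterer, new_points)
print(labels, strengths)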
Despite its strengths, HDBSCAN can be computationally expensive for very large datasets due to the need to build a minimum spanning tree and hierarchical clustering structure. It can also struggle with very high-dimensional data, where density estimates become unreliable, leading to possible misidentification of clusters. Moreover, parameter sensitivity, particularly with min_samples and min_cluster_size, can affect clustering quality if they are not properly tuned.
HDBSCAN can handle imbalanced datasets by detecting clusters of varying densities. The min_cluster_size and min_samples parameters help control how clusters of different sizes are treated. It can effectively identify small but significant clusters in imbalanced data, which traditional methods like k-means may overlook, making it ideal for imbalanced classification tasks.
For high-dimensional data, t-SNE or UMAP can be used for dimensionality reduction to map the clusters into 2D or 3D space. Once reduced to a lower-dimensional representation, you can visualize the clusters identified by HDBSCAN using scatter plots, with points color-coded by their cluster label. This allows for a more intuitive understanding of the clustering structure.
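A sketch of that workflow, assuming the umap-learn package is installed; here the clustering runs in the original feature space and UMAP is used only for the 2D display:

import hdbscan
import matplotlib.pyplot as plt
import umap  # from the umap-learn package
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, n_features=30, random_state=3)

labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(X)

# Project to 2D purely for display; clustering was done in the original space
embedding = umap.UMAP(n_components=2, random_state=3).fit_transform(X)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='viridis', s=20)
plt.title("HDBSCAN clusters in a UMAP projection")
plt.show()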