Every second, the world generates an unprecedented volume of data. As data has become a crucial component of businesses and organizations across all industries, it is essential to process, analyze, and visualize it appropriately to extract meaningful insights from large datasets. However, there’s a catch: more data does not automatically mean more productive or more accurate analysis. The more data we produce, the harder it becomes to analyze and visualize it and to draw valid inferences.
This is where Dimensionality Reduction comes into play.
What is Dimensionality Reduction?
In simple words, dimensionality reduction refers to the technique of reducing the number of dimensions (features) in a dataset. Machine learning datasets often contain hundreds of columns (i.e., features). As a geometric analogy, picture the data as points forming a sphere in three-dimensional space: dimensionality reduction brings the number of columns down to a manageable count, much like projecting the three-dimensional sphere onto a two-dimensional circle.
Now comes the question, why must you reduce the columns in a dataset when you can directly feed it into an ML algorithm and let it work out everything by itself?
The curse of dimensionality mandates the application of dimensionality reduction.
The Curse of Dimensionality
The curse of dimensionality refers to phenomena that arise when you work with (analyze and visualize) data in high-dimensional spaces and that do not occur in low-dimensional spaces.
The higher the number of features (a.k.a. variables or factors) in a feature set, the more difficult it becomes to visualize the training set and work on it. Another vital point is that many of the variables are often correlated. So, if you include every variable in the feature set, you will include many redundant factors in the training set.
Furthermore, the more variables you have, the more samples you need to represent all the possible combinations of feature values. As the number of variables increases, the model becomes more complex, which increases the likelihood of overfitting. When you train an ML model on a dataset with many features, it tends to become overly dependent on the training data. The result is an overfitted model that fails to perform well on real data.
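To make the sample-count argument concrete, here is a minimal sketch. It assumes each feature is discretized into a fixed number of bins; the function name `cells_to_cover` and the default of 10 bins are illustrative choices, not part of any library. It counts how many distinct combinations of feature values would each need at least one sample.

```python
def cells_to_cover(n_features: int, bins_per_feature: int = 10) -> int:
    """Number of distinct value combinations when every feature has the same number of bins."""
    return bins_per_feature ** n_features


# The count grows exponentially with the number of features.
for d in (1, 2, 3, 10):
    print(d, cells_to_cover(d))
# 1 10
# 2 100
# 3 1000
# 10 10000000000
```

With only ten features and ten bins each, covering every combination once would already take ten billion samples, which is why high-dimensional training sets are almost always sparse.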
A primary aim of dimensionality reduction is to avoid overfitting. Training data with considerably fewer features helps keep your model simple: it makes fewer assumptions.
Apart from this, dimensionality reduction has many other benefits, such as:
- It eliminates noise and redundant features.
- It helps improve the model’s accuracy and performance.
- It facilitates the use of algorithms that are unsuited to a large number of dimensions.
- It reduces the amount of storage space required (less data needs less storage space).
- It compresses the data, which reduces the computation time and facilitates faster training of the model.
Dimensionality Reduction Techniques
Dimensionality reduction techniques can be grouped into two broad categories:
1. Feature selection
The feature selection method aims to find a subset of the input variables (that are most relevant) from the original dataset. Feature selection includes three strategies, namely:
- Filter strategy
- Wrapper strategy
- Embedded strategy
2. Feature extraction
Feature extraction, a.k.a. feature projection, converts the data from the high-dimensional space to one with fewer dimensions. This data transformation may be either linear or nonlinear. The technique finds a smaller set of new variables, each of which is a combination of the input variables and which together retain most of the information in the input variables.
Without further ado, let’s dive into a detailed discussion of a few commonly used dimensionality reduction techniques!
1. Principal Component Analysis (PCA)
Principal Component Analysis is one of the leading linear techniques of dimensionality reduction. This method performs a linear mapping of the data to a lower-dimensional space in a way that maximizes the variance of the data in the low-dimensional representation.
Essentially, it is a statistical procedure that applies an orthogonal transformation to convert the n coordinates of a dataset into a new set of n coordinates, known as the principal components. The first principal component has the maximum possible variance. Each succeeding principal component has the highest possible variance under the condition that it is orthogonal to (and therefore uncorrelated with) the preceding components.
The PCA conversion is sensitive to the relative scaling of the original variables. Thus, the data columns must be normalized before you apply PCA. Another thing to remember is that each principal component is a mix of the original variables, so the transformed dataset loses its interpretability. If interpretability is crucial to your analysis, PCA is not the right dimensionality reduction method for your project.
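The steps described above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes NumPy is installed, the function name `pca` and the synthetic data are made up for the example, and the principal directions are obtained from a singular value decomposition of the standardized data.

```python
import numpy as np


def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X = np.asarray(X, dtype=float)
    # Standardize each column first: PCA is sensitive to the relative scale of the variables.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by zero
    Z = (X - X.mean(axis=0)) / std
    # SVD of the standardized data: the rows of Vt are orthogonal directions,
    # ordered by decreasing variance of the data along them.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]
    explained_variance_ratio = (S ** 2)[:n_components] / np.sum(S ** 2)
    return Z @ components.T, components, explained_variance_ratio


# Synthetic example: the second column is almost a rescaled copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base, 2 * base + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])

scores, components, ratio = pca(X, n_components=2)
print(scores.shape)  # (200, 2)
print(ratio)         # share of the total variance captured by each component
```

Because the first two columns are nearly redundant, two components are enough to capture almost all of the variance in the three original columns.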
2. Non-Negative Matrix Factorization (NMF)
NMF breaks down a non-negative matrix into the product of two non-negative matrices. This makes it a valuable tool in fields where only non-negative signals exist (for instance, astronomy). The multiplicative update rule by Lee & Seung is a well-known NMF algorithm, and later work extended the technique to include uncertainties, handle missing data, allow parallel computation, and support sequential construction.
These extensions contributed to making the NMF approach stable and linear. Unlike PCA, NMF does not remove the mean of the matrices, so it yields physical non-negative fluxes rather than unphysical negative ones. Thus, NMF can preserve more information than the PCA method.
Sequential NMF is characterized by a stable component basis during construction and a linear modeling process, which makes it well suited to astronomy. It can preserve the flux in the direct imaging of circumstellar structures, which is one way of detecting exoplanets, and it is particularly useful for imaging circumstellar disks.
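As an illustration of the basic factorization (not of the sequential variant used in astronomy), here is a minimal sketch of the Lee & Seung multiplicative updates for the squared reconstruction error. It assumes NumPy is installed; the function name `nmf`, the iteration count, and the small constant `eps` that guards against division by zero are choices made for this example.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (m x n) as W (m x rank) @ H (rank x n)."""
    V = np.asarray(V, dtype=float)
    if (V < 0).any():
        raise ValueError("V must be non-negative")
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H never become negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic example: a non-negative matrix that has rank 2 by construction.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 5))

W, H = nmf(V, rank=2)
print(W.shape, H.shape)               # (6, 2) (2, 5)
print(np.linalg.norm(V - W @ H))      # reconstruction error after the updates
```

The six-by-five matrix is summarized by two non-negative factors with 12 and 10 entries, and every entry of both factors can be read as an additive contribution.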
3. Linear Discriminant Analysis (LDA)
Linear discriminant analysis is a generalization of Fisher’s linear discriminant, a method widely applied in statistics, pattern recognition, and machine learning. The LDA technique aims to find a linear combination of features that can characterize or differentiate between two or more classes of objects. LDA represents data in a way that maximizes class separability: after projection, objects belonging to the same class lie close together, while objects from different classes lie far apart.
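For the two-class case, Fisher’s linear discriminant has a closed-form solution, which the sketch below illustrates. It assumes NumPy is installed and class labels of 0 and 1; the function name `fisher_lda_direction` and the synthetic data are made up for the example.

```python
import numpy as np


def fisher_lda_direction(X, y):
    """Return the unit vector w that best separates two classes labeled 0 and 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: the sum of the scatter matrices of the two classes.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Fisher's solution: w is proportional to Sw^-1 (m1 - m0).
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)


# Synthetic example: two Gaussian clouds in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 1.0, size=(100, 2)),
    rng.normal([4, 4], 1.0, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

w = fisher_lda_direction(X, y)
projected = X @ w  # each sample reduced from two features to one
print(projected[y == 0].mean(), projected[y == 1].mean())
```

Projecting onto `w` reduces the data to a single dimension while keeping the two class means as far apart as possible relative to the spread within each class.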
4. Generalized Discriminant Analysis (GDA)
Generalized discriminant analysis is a nonlinear discriminant analysis that uses a kernel function. Its underlying theory is close to that of support vector machines (SVM): the kernel implicitly maps the input vectors into a high-dimensional feature space. Just like LDA, GDA then seeks a projection of the variables into a lower-dimensional space by maximizing the ratio of between-class scatter to within-class scatter.
5. Missing Values Ratio
When you explore a given dataset, you might find that some values are missing. The first step in dealing with missing values is to identify the reason behind them. Accordingly, you can then impute the missing values or drop them altogether using appropriate methods. This approach works well when there are only a few missing values.
However, what should you do when a variable has too many missing values, say, over 50%? In such situations, you can set a threshold and use the missing values ratio method: if the percentage of missing values in a variable exceeds the threshold, you drop the variable. The lower the threshold, the more aggressive the dimensionality reduction.
Generally, data columns with numerous missing values rarely contain useful information. So, you can remove all the data columns whose share of missing values is higher than the set threshold.
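A minimal sketch of this filter in plain Python is shown below. It assumes the dataset is held as a dictionary that maps column names to non-empty lists, with `None` marking a missing value; the function name `drop_sparse_columns` and the sample data are illustrative.

```python
def drop_sparse_columns(columns, threshold=0.5):
    """Keep only the columns whose share of missing values is at most `threshold`.

    `columns` maps a column name to a non-empty list of values; None marks a missing value.
    """
    kept = {}
    for name, values in columns.items():
        missing_ratio = sum(v is None for v in values) / len(values)
        if missing_ratio <= threshold:
            kept[name] = values
    return kept


data = {
    "age": [34, None, 29, 41],                  # 25% missing
    "income": [None, None, None, 52000],        # 75% missing
    "city": ["Pune", "Delhi", "Goa", "Agra"],   # 0% missing
}

print(list(drop_sparse_columns(data, threshold=0.5)))
# ['age', 'city']
```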
6. Low Variance Filter
Just as the missing values ratio method handles variables with many missing values, the low variance filter handles constant variables. A constant variable cannot improve the model’s performance. Why? Because it has zero variance, so it carries no information that could help distinguish one observation from another.
Here too, you can set a threshold value to weed out the constant and near-constant variables: all data columns with a variance lower than the threshold are eliminated. However, remember that variance is range dependent. Thus, you must normalize the columns to a common range before applying this technique.
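The sketch below illustrates the filter with the standard library only. It assumes numeric columns stored as lists in a dictionary; the names `min_max_scale` and `low_variance_filter` and the threshold of 0.01 are illustrative. Min-max scaling is used as the normalization step on purpose: standardizing to z-scores would give every non-constant column a variance of 1, which would leave the filter nothing to compare.

```python
from statistics import pvariance


def min_max_scale(values):
    """Rescale values to the range [0, 1]; a constant column becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def low_variance_filter(columns, threshold=0.01):
    """Keep the columns whose variance after min-max scaling exceeds `threshold`."""
    kept = {}
    for name, values in columns.items():
        if pvariance(min_max_scale(values)) > threshold:
            kept[name] = values
    return kept


data = {
    "constant": [5.0] * 100,        # zero variance
    "rare_flag": [0] * 99 + [1],    # almost constant: scaled variance 0.0099
    "spread": list(range(100)),     # values spread evenly over the range
}

print(list(low_variance_filter(data)))
# ['spread']
```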
7. High Correlation Filter
If a dataset contains data columns with very similar patterns/trends, these columns are highly likely to carry nearly identical information. Also, highly correlated dimensions can adversely impact the model’s performance. In such cases, one of those variables is enough to feed the ML model.
For such situations, you can use the Pearson correlation matrix to identify the variables showing a high correlation. Once they are identified, you can decide which ones to drop using the VIF (Variance Inflation Factor): a common rule of thumb is to remove variables with a high value (VIF > 5). In this approach, you calculate the correlation coefficient between numerical columns (Pearson’s product-moment coefficient) and between nominal columns (Pearson’s chi-square value). Each pair of columns with a correlation coefficient higher than the set threshold is then reduced to a single column.
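Note that Pearson’s correlation coefficient itself is unaffected by linear rescaling of a column, so normalization does not change it; columns are often normalized anyway because other steps, such as the low variance filter, depend on it.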
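Here is a minimal sketch of the pairwise part of this filter for numerical columns, using only the standard library. It is a simplification: instead of computing the VIF, it greedily keeps the first column of every highly correlated pair. The names `pearson` and `high_correlation_filter`, the threshold of 0.9, and the sample data are illustrative, and the code assumes that no column is constant (a constant column would cause a division by zero).

```python
from math import sqrt


def pearson(x, y):
    """Pearson's product-moment correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def high_correlation_filter(columns, threshold=0.9):
    """Greedily keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


data = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same information in another unit
    "shoe_size": [42, 38, 44, 39, 41],
}

print(high_correlation_filter(data))
# ['height_cm', 'shoe_size']
```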
8. Backward Feature Elimination
In the backward feature elimination technique, you begin with all n dimensions. At a given iteration, the selected classification algorithm is trained on n input features. You then remove one input feature at a time and train the same model on n-1 input features, n times. Next, you remove the input feature whose elimination produces the smallest increase in the error rate, which leaves n-1 input features. You then repeat the procedure with n-2 features, and so on, until no other variable can be removed.
Each iteration k creates a model trained on n-k features with an error rate of e(k). Finally, you select the maximum tolerable error rate, which defines the smallest number of features needed to reach that classification performance with the given ML algorithm.
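The loop described above can be sketched as follows. The classifier is an assumption made for the example: a tiny nearest-centroid model written in plain Python, scored on its training error to keep the code short (in practice you would use your own model and a held-out validation set). The names `error_rate` and `backward_elimination` are illustrative.

```python
def error_rate(rows, labels, features):
    """Training error of a nearest-centroid classifier that only sees `features`."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [sum(r[f] for r in members) / len(members) for f in features]
    errors = 0
    for row, label in zip(rows, labels):
        predicted = min(
            sorted(centroids),  # sorted so that ties are broken deterministically
            key=lambda c: sum((row[f] - centroids[c][i]) ** 2 for i, f in enumerate(features)),
        )
        errors += predicted != label
    return errors / len(rows)


def backward_elimination(rows, labels, max_error):
    """Drop one feature per iteration while the error rate stays within `max_error`."""
    features = list(range(len(rows[0])))
    while len(features) > 1:
        # Retrain once per candidate removal and pick the removal that hurts least.
        candidates = [
            (error_rate(rows, labels, [f for f in features if f != drop]), drop)
            for drop in features
        ]
        best_error, drop = min(candidates)
        if best_error > max_error:
            break
        features.remove(drop)
    return features


# Feature 0 separates the two classes; features 1 and 2 carry no class information.
rows = [[0, 1, 2], [1, 2, 1], [0, 2, 2], [10, 1, 2], [11, 2, 1], [10, 2, 2]]
labels = [0, 0, 0, 1, 1, 1]

print(backward_elimination(rows, labels, max_error=0.0))
# [0]
```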
9. Forward Feature Construction
Forward feature construction is the opposite of the backward feature elimination method. Here, you begin with one feature and progress by adding one feature at a time, namely the variable that results in the greatest boost in performance.
Both forward feature construction and backward feature elimination are time- and computation-intensive. These methods are best suited for datasets that already have a low number of input columns.
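A minimal sketch of forward feature construction follows. As an assumption for the example, it scores feature sets with a small nearest-centroid classifier evaluated on its training error; the names `error_rate` and `forward_construction` and the sample data are illustrative, and a real project would plug in its own model and a validation set.

```python
def error_rate(rows, labels, features):
    """Training error of a nearest-centroid classifier that only sees `features`."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [sum(r[f] for r in members) / len(members) for f in features]
    errors = 0
    for row, label in zip(rows, labels):
        predicted = min(
            sorted(centroids),  # sorted so that ties are broken deterministically
            key=lambda c: sum((row[f] - centroids[c][i]) ** 2 for i, f in enumerate(features)),
        )
        errors += predicted != label
    return errors / len(rows)


def forward_construction(rows, labels, target_error):
    """Add the most helpful feature each round until the target error is reached."""
    remaining = list(range(len(rows[0])))
    selected = []
    current_error = 1.0
    while remaining and current_error > target_error:
        # Retrain once per candidate and keep the feature that lowers the error most.
        best_error, best = min((error_rate(rows, labels, selected + [f]), f) for f in remaining)
        if best_error >= current_error:
            break  # no remaining feature improves the model
        selected.append(best)
        remaining.remove(best)
        current_error = best_error
    return selected


# Feature 0 separates the two classes; features 1 and 2 carry no class information.
rows = [[0, 1, 2], [1, 2, 1], [0, 2, 2], [10, 1, 2], [11, 2, 1], [10, 2, 2]]
labels = [0, 0, 0, 1, 1, 1]

print(forward_construction(rows, labels, target_error=0.0))
# [0]
```

Each round retrains the model once per remaining feature, which is where the computational cost mentioned above comes from.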
10. Random Forests
Random forests are not only excellent classifiers but are also extremely useful for feature selection. In this approach, you build a large ensemble of trees against a target attribute. For instance, you can create a large set (say, 2000) of shallow trees (say, having two levels), where each tree is trained on a small fraction (say, three) of the total number of attributes.
The aim is to use each attribute’s usage statistics to identify the most informative subset of features. If an attribute is often selected as the best split, it is likely an informative feature that is worth retaining. Scoring each attribute’s usage statistics in the random forest relative to the other attributes tells you which attributes are the most predictive.
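The usage-statistics idea can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: each "tree" is a single-split stump rather than a two-level tree, 200 trees are grown instead of 2000 to keep the run short, splits are scored with Gini impurity, and the names `gini`, `best_split_attribute`, and `attribute_usage` are made up for the example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_attribute(rows, labels, attributes):
    """Return the candidate attribute whose best threshold gives the lowest weighted Gini."""
    best_attr, best_impurity = None, float("inf")
    for a in attributes:
        for threshold in sorted({r[a] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[a] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[a] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if impurity < best_impurity:
                best_attr, best_impurity = a, impurity
    return best_attr


def attribute_usage(rows, labels, n_trees=200, attrs_per_tree=3, seed=0):
    """Share of trees in which each attribute won the split, among trees where it was a candidate."""
    rng = random.Random(seed)
    n_attributes = len(rows[0])
    chosen, candidate = Counter(), Counter()
    for _ in range(n_trees):
        attributes = rng.sample(range(n_attributes), attrs_per_tree)
        sample = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample of the rows
        boot_rows = [rows[i] for i in sample]
        boot_labels = [labels[i] for i in sample]
        candidate.update(attributes)
        chosen[best_split_attribute(boot_rows, boot_labels, attributes)] += 1
    return {a: chosen[a] / candidate[a] for a in range(n_attributes) if candidate[a]}


# Synthetic example: five attributes, of which only attribute 2 determines the class.
rng = random.Random(1)
rows = [[rng.random() for _ in range(5)] for _ in range(60)]
labels = [int(r[2] > 0.5) for r in rows]

scores = attribute_usage(rows, labels)
print(max(scores, key=scores.get))  # index of the attribute used most often
```

Attributes with a high score are the ones worth keeping; attributes that rarely win a split even when they are offered as candidates can be dropped.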
To conclude, when it comes to dimensionality reduction, no technique is the absolute best. Each has its quirks and advantages. Thus, the best way to apply dimensionality reduction is to run systematic, controlled experiments to figure out which technique(s) work with your model and which deliver the best performance on a given dataset.