Did you know? The High Correlation Filter is one of the most effective and widely used feature selection techniques in machine learning today! This filter reduces redundancy by identifying and removing highly correlated features, typically those with a Pearson correlation coefficient above 0.8 or 0.9. It also minimizes multicollinearity and streamlines the dataset for faster and more accurate models.
High correlation filters are essential tools in machine learning and data preprocessing. They help identify and remove highly correlated features, ensuring the dataset remains lean, relevant, and free from redundant information. High correlation filters reduce multicollinearity and enhance model interpretability and performance by eliminating these overlaps.
These filters are handy when working with datasets containing numerous numerical features. By applying correlation thresholds, commonly 0.8 or 0.9, you can prevent the model from being biased by duplicated patterns or noise. This speeds up training and supports more robust and generalizable outcomes.
In this blog, we explore high correlation filters, when to use them, and their benefits to the machine learning workflow, from feature selection to model evaluation.
Improve your machine learning skills with upGrad’s online AI and ML courses. They help you build real-world problem-solving abilities. Learn to design intelligent systems and apply algorithms in practical scenarios.
When building a machine learning model, your dataset may include features that provide similar or identical information. For instance, features like "age in years" and "age in months," or "height in cm" and "height in meters," carry almost the same meaning. This redundancy can confuse your model because it might give weight to features that are not contributing any new, valuable information.
A High Correlation Filter in ML helps solve this problem by identifying pairs of strongly correlated features that move together similarly. These correlations are often measured using the Pearson correlation coefficient, which ranges from -1 to 1. A value above 0.8 or 0.9 usually indicates that the features are highly correlated.
This means the two features provide almost the same information and can be considered redundant. So one of them can be removed to improve the model's efficiency.
If you're looking to deepen your understanding of machine learning and apply it to real-world problems like these, consider exploring hands-on courses.
The Pearson correlation coefficient measures the linear relationship between two continuous variables and ranges from -1 to +1. A value near +1 indicates a strong positive correlation, -1 indicates a strong negative correlation, and 0 means no linear correlation. It’s calculated using the formula:
r = cov(X, Y) / (σₓ · σᵧ)
Where cov(X, Y) is the covariance between variables X and Y, and σₓ and σᵧ are their standard deviations. In machine learning, this helps identify features that move together. If two features have a high correlation (e.g., r > 0.85), one can be removed to reduce redundancy, simplify the model, and prevent multicollinearity.
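To make the formula concrete, here is a minimal sketch (the column names and toy values are hypothetical) that computes r both from the covariance formula and with pandas:

import numpy as np
import pandas as pd

# Two hypothetical, near-duplicate features
df = pd.DataFrame({
    "age_years": [25, 32, 47, 51, 62],
    "age_months": [300, 384, 564, 612, 744],
})

x, y = df["age_years"], df["age_months"]

# r = cov(X, Y) / (sigma_x * sigma_y)
r_manual = np.cov(x, y)[0, 1] / (x.std() * y.std())

# pandas computes the same value directly
r_pandas = x.corr(y)

print(round(r_manual, 4), round(r_pandas, 4))  # both 1.0 for this toy data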
The high correlation filter in ML uses statistical methods to measure how strongly two features are related. If the correlation coefficient is above a set threshold, say 0.8, the filter flags those two features as redundant. From the pair, it removes the feature that is less important or adds no additional value, reducing the overall complexity of your dataset.
This helps you simplify your data without losing essential information. By eliminating these redundant features, you avoid overfitting and enhance model efficiency. With fewer features, the model can focus on the most meaningful data, leading to more accurate predictions and faster training times.
Why is this important for you?
Ready to start your programming journey but not sure where to begin? upGrad’s free Python course for beginners teaches core concepts like loops, data structures, and object-oriented programming step by step. Build confidence with hands-on coding so you’re not stuck watching endless tutorials without real progress.
Now that you’ve understood what a high correlation filter is and why it matters, let’s look at how you can implement it in Python step by step.
In this section, you'll learn how to implement a high correlation filter in machine learning using Python. By applying this filter, you can identify and remove highly correlated features from your dataset, which helps reduce multicollinearity and improve your models' performance.
Here is a step-by-step guide using popular libraries like pandas and scikit-learn.
Step 1: Import Required Libraries
You'll need to import the necessary libraries to handle data and calculate correlations to get started.
import pandas as pd
import numpy as np
Step 2: Load and Prepare Your Data
Load your dataset into a pandas DataFrame. For this example, let’s use a sample CSV file. You can replace the path with your dataset.
df = pd.read_csv('your_dataset.csv')
Step 3: Compute the Correlation Matrix
Calculate the correlation matrix using pandas to identify how strongly features are related. The corr() function computes the Pearson correlation coefficient. The absolute values of correlations help spot highly correlated features. Removing such features reduces redundancy and improves model performance.
corr_matrix = df.corr().abs()
print(corr_matrix)
Step 4: Select the Upper Triangle of the Correlation Matrix
The correlation matrix is symmetric, meaning the correlation between feature A and feature B is the same as that between B and A. To avoid redundant checks, focus on the upper triangle, excluding the diagonal (representing a feature’s correlation with itself, always 1).
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
Step 5: Identify Features to Drop
Setting a correlation threshold is crucial for identifying redundant features. Standard thresholds are 0.8 or 0.95, depending on how strictly you want to filter similar features. However, the ideal threshold may vary based on the nature of your dataset. For example, a higher threshold might be more appropriate in high-dimensional datasets or those with naturally correlated features (like image or text embeddings).
This helps avoid over-filtering and ensures essential features aren’t mistakenly removed. Conversely, for tabular data with interpretable features, a lower threshold can help eliminate subtle redundancy. This step ensures you retain only features that contribute unique, non-overlapping information to your model.
threshold = 0.8
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > threshold)]
print("Features to drop:", to_drop)
Step 6: Drop Highly Correlated Features
Once you've identified the highly correlated features using the correlation matrix and set a threshold, removing the redundant features from your dataset is next. Dropping these features ensures that your model doesn’t suffer from multicollinearity, which can distort model performance and make interpretation difficult.
df_filtered = df.drop(columns=to_drop)
print("Filtered DataFrame shape:", df_filtered.shape)
Step 7: Use the Filtered Data for Modeling
Your DataFrame is now ready for training. You can use df_filtered as input for your machine learning models.
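As a minimal sketch of this step, assuming the dataset includes a numeric label column named 'target' (a hypothetical name) that was kept out of the correlation filtering:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 'target' is a hypothetical label column; replace it with your own
X = df_filtered.drop(columns=["target"])
y = df_filtered["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))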
Why This Approach Works:
Struggling with data manipulation and visualization? Check out upGrad’s free Learn Python Libraries: NumPy, Matplotlib & Pandas course. Gain the skills to handle complex datasets and create powerful visualizations. Start learning today!
You’ve seen how to implement a High Correlation Filter in Python; let’s look at the key benefits and limitations you should know.
Certain features often convey overlapping patterns or duplicate information in many machine learning projects, especially those involving large numerical datasets. A correlation filter in ML helps you detect these highly similar features and remove the less informative ones. This reduces data complexity and improves your model's focus on unique, meaningful signals.
While correlation filters can significantly boost performance and reduce overfitting, they also come with trade-offs. Understanding both sides is crucial before adding this step to your feature selection workflow.
Here are some benefits and limitations of using a high correlation filter in ML:
Benefits:
Limitations:
Also Read: 25+ Essential Machine Learning Projects GitHub with Source Code for Beginners and Experts in 2025
When applying a high correlation filter in ML, it's essential to be strategic. You can't just remove features blindly. Doing so might lead to the loss of valuable information. Instead, follow these best practices to optimize model performance without compromising feature quality.
Start with a Pearson correlation threshold of 0.8 or 0.9. This means that if two independent variables (not the target) correlate above the threshold, one of them should be considered for removal. The filter is applied only among input features; it is not based on correlation with the target.
How it helps:
Highly correlated features carry duplicated information, which can cause several issues:
Applying a correlation filter simplifies the dataset, avoids redundancy, and sometimes even improves generalization.
How to do it:
Example:
Suppose you're building a creditworthiness model for a bank. You have:
These two features correlate at 0.95. On paper, both seem similar. But after checking, you notice that:
In this case, you drop loan_amount_last_year because it adds less predictive value and could confuse the model due to its high correlation.
However, this threshold might differ in other cases. For example:
upGrad’s Linear Regression—Step by Step Guide course explains core concepts like simple and multiple regression in an easy-to-follow manner. You’ll also gain hands-on knowledge of performance metrics and how regression is used in real data science scenarios.
Also Read: A Guide to Linear Regression Using Scikit [With Examples]
Before applying a correlation filter, always visualize the correlation matrix. This helps you understand how features relate to each other and whether the correlations are isolated or part of a larger pattern. Looking only at raw correlation values without visualization can cause you to overlook clusters or misinterpret relationships. A heatmap helps you see clusters of highly correlated features, allowing you to make smarter decisions about which features to remove.
How it helps:
How to do it:
Example: Let’s say you're working on a churn prediction model. You have these features:
When you compute .corr() and plot the heatmap:
Such a cluster shows a set of features carrying similar information. Instead of blindly removing one pair at a time, the heatmap helps you see the group as a whole. You might then choose to keep only avg_daily_data_used because it’s easier to interpret and has a strong signal with the target.
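A minimal sketch of the heatmap check, assuming the usage columns exist in your DataFrame (the column names are illustrative):

import seaborn as sns
import matplotlib.pyplot as plt

usage_cols = ["avg_daily_data_used", "total_data_used_last_month", "avg_weekly_data_usage"]
corr = df[usage_cols].corr()

plt.figure(figsize=(5, 4))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)  # annotate each cell with r
plt.title("Correlation between data-usage features")
plt.tight_layout()
plt.show()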
Also Read: Top 15 Data Visualization Project Ideas: For Beginners, Intermediate, and Expert Professionals
When applying a high correlation filter, examine correlations only among the independent (input) features. Never drop a feature solely because it’s highly correlated with another if it also strongly correlates with the target variable. The goal is to reduce redundancy without harming the model's ability to learn relevant patterns.
How it helps:
Retaining features strongly related to the target ensures your model doesn't lose key predictive signals.
How to do it:
Example: You're developing a loan default prediction model. Two features, total_outstanding_amount and monthly_due_amount, are highly correlated (correlation = 0.91). Before removing either, you check their relationship with the target variable (default_status):
This indicates that monthly dues have a more substantial influence on default behavior. It reflects immediate financial pressure, which is more predictive of short-term default than total debt. In this case, even though both features are correlated, you keep monthly_due_amount and remove total_outstanding_amount, preserving a feature that has direct predictive power.
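A minimal sketch of this check, assuming default_status is encoded as 0/1 (the column names follow the example above):

# Correlation of each member of the pair with the target
pair = ["total_outstanding_amount", "monthly_due_amount"]
target_corr = df[pair].corrwith(df["default_status"]).abs()
print(target_corr.sort_values(ascending=False))

# Keep the feature more strongly related to the target; consider dropping the other
keep, drop = target_corr.idxmax(), target_corr.idxmin()
print(f"Keep {keep}, consider dropping {drop}")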
Additional Example:
Consider a model predicting credit default rates. If you remove the feature income, which is highly correlated with the target (default status), you might lose critical predictive power. Even though income might be associated with other features like total debt, it's still an essential predictor for credit risk. Removing it could severely impair the model's ability to predict defaults accurately.
A correlation filter should never work in isolation. Even when two features are highly correlated, removing one without understanding its domain relevance can hurt model performance or reduce explainability. Your decision should balance statistical correlation with practical usefulness.
How it helps:
Statistical correlation only shows mathematical similarity, not real-world value. Two features might be correlated, but only one might be directly tied to operational goals, policy rules, or user behavior. Retaining the more domain-relevant feature helps preserve important context for prediction and interpretability.
How to do it:
Example: Suppose you’re building a model to predict employee attrition. You find that number_of_days_absent_last_year and average_monthly_absence_days correlate at 0.92. Statistically, they seem redundant. But when you dig deeper:
In this case, you retain average_monthly_absence_days because it’s directly tied to organizational decision-making, while the other is less useful despite being correlated.
After applying a high correlation filter in ML, it's important to confirm that removing features hasn’t negatively impacted your model's performance. This is where cross-validation plays a critical role. You're not just checking if the model still runs—you’re checking if it generalizes well to unseen data after the feature reduction.
How it helps:
Feature removal may reduce overfitting, but it can also result in:
By using cross-validation, you evaluate the model on multiple splits of the data, helping you:
How to do it:
Example: You're working on a customer churn model. After applying correlation filtering, you drop total_data_used_last_month because it's highly correlated with average_weekly_data_usage. Before finalizing, you run 5-fold cross-validation on your model:
Even though accuracy dropped slightly, the standard deviation decreased, showing more consistent performance. That means the model has become more stable and less overfitted, a worthy trade-off. If performance had dropped significantly instead, you would reconsider the removal and keep the dropped variable.
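A minimal sketch of this comparison, assuming a binary churn column and the usage features named above (all names are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

y = df["churn"]
X_full = df.drop(columns=["churn"])
X_reduced = X_full.drop(columns=["total_data_used_last_month"])

model = RandomForestClassifier(random_state=42)

# Evaluate the same model on the full and the filtered feature sets
for name, X in [("all features", X_full), ("after filter", X_reduced)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")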
Even if two features are highly correlated, one may contribute more to the model’s predictions than the other. Comparing feature importance helps you decide which correlated feature to keep.
How it helps:
Correlation filtering only looks at relationships between features, not their impact on the target variable. A feature might be highly correlated with another but still carry a stronger predictive signal. If you remove it without checking its importance, the model's performance may drop.
This step ensures that you don’t eliminate important features, especially in non-linear models like Random Forest, XGBoost, or Gradient Boosted Trees.
How to do it:
Example: You’re working on a telecom churn model. You find total_outgoing_calls_last_month and total_outgoing_call_minutes_last_month. These features correlate at 0.93. A heatmap tells you they are redundant, but both seem relevant. After training a Random Forest model:
Here, you drop the call-count feature (total_outgoing_calls_last_month), because the importance scores show the model finds it less helpful than call minutes. This avoids accidentally discarding predictive information.
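A minimal sketch of the importance check, assuming a binary churn target and the two call features above (names are illustrative):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["total_outgoing_calls_last_month", "total_outgoing_call_minutes_last_month"]

rf = RandomForestClassifier(random_state=42)
rf.fit(df.drop(columns=["churn"]), df["churn"])

# Compare importances of the correlated pair; drop the weaker one
importances = pd.Series(rf.feature_importances_, index=rf.feature_names_in_)
print(importances[features].sort_values(ascending=False))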
When applying a high correlation filter in ML, it's critical to maintain a clear record of which features were removed, why they were removed, and what impact they had on the model. This is especially important when you're working in regulated industries, collaborating with large teams, or planning to revisit or update the model in the future.
How it helps:
How to do it:
Example: Suppose you're building a fraud detection model and drop last_30_day_transaction_sum because it's 0.91 correlated with last_month_avg_transaction. You document:
Feature Dropped | Correlated With | Correlation Score | Reason
last_30_day_transaction_sum | last_month_avg_transaction | 0.91 | Lower stability and weaker correlation with the target
Later, when your teammate needs to explain the model logic to auditors, this table makes the process fast and foolproof.
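A minimal sketch of such a log, written as a JSON artifact so it can travel with the model (the field names are illustrative):

import json

dropped_log = [{
    "feature_dropped": "last_30_day_transaction_sum",
    "correlated_with": "last_month_avg_transaction",
    "correlation": 0.91,
    "reason": "Lower stability and weaker correlation with the target",
}]

# Persist the log alongside other model artifacts
with open("dropped_features.json", "w") as f:
    json.dump(dropped_log, f, indent=2)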
After performing feature transformations like scaling, standardization, or one-hot encoding, it's critical to re-check correlations. Transformations can change relationships between variables, especially when new dimensions are introduced.
Why it matters:
During preprocessing, you may:
These operations alter the feature space:
If you skip correlation checks after these steps, your model might unknowingly include newly redundant or misleading features.
How to do it:
Example: Imagine you're building a student performance prediction model. One of your features is grade_category, which has values A, B, and C. After one-hot encoding, this becomes grade_A, grade_B, grade_C. These dummy variables are mutually exclusive: if one is 1, the others are 0. With only two categories, the resulting dummies would be perfectly negatively correlated; with three, each pair is still strongly negatively correlated, and the three columns always sum to 1. Including all of them introduces perfect multicollinearity, breaking linear models.
To fix this, you can:
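For instance, a minimal sketch using pandas' drop_first option on the grade_category example (toy data, illustrative):

import pandas as pd

df = pd.DataFrame({"grade_category": ["A", "B", "C", "A", "B"]})

# drop_first=True keeps grade_B and grade_C; grade_A becomes the implicit baseline
encoded = pd.get_dummies(df, columns=["grade_category"], drop_first=True)
print(encoded.head())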
Now that you’ve seen the benefits and limitations of high correlation filters, let’s look at where you can apply them with real-world examples.
In many machine learning datasets, features can be highly correlated, such as age and years of experience, or height in inches and in centimeters. These redundant features provide similar information and can distort model interpretation, inflate variance, and slow down training. A high correlation filter helps streamline the dataset by removing features that offer no new value.
Below are five specific use cases where this filter significantly improves model performance and reliability.
For instance, when stock prices rise, trading volumes also tend to increase. This correlation can introduce redundancy in the model. By applying a high correlation filter to remove one of these correlated features, we can reduce redundancy, improve model performance, increase interpretability, and speed up training. This process helps create a more efficient model focusing on the most informative features without duplication.
Also Read: Top 50 IoT Projects For all Levels in 2025 [With Source Code]
After learning how High Correlation Filters are applied in real-world scenarios, it’s time to test what you have learned through a quiz on high correlation filters in ML.
Test how well you understand using high correlation filters in machine learning. These 10 MCQs will help reinforce your learning and clarify important concepts.
Q1. What is the main purpose of using a high correlation filter in ML?
A) To reduce training time
B) To remove duplicate rows
C) To eliminate features with overlapping information
D) To convert categorical data to numeric
Q2. What correlation value is typically used as a threshold for filtering?
A) 0.2
B) 0.5
C) 0.75
D) 0.85 or higher
Q3. Why should highly correlated features be removed in a model?
A) They decrease model interpretability
B) They increase the dataset size
C) They make target prediction easier
D) They help generate synthetic features
Q4. Which metric is most commonly used in high correlation filters?
A) F1-score
B) Pearson correlation coefficient
C) Cosine similarity
D) Z-score
Q5. In which type of machine learning model is multicollinearity a serious issue?
A) Decision trees
B) K-means clustering
C) Linear regression
D) Naive Bayes
Q6. What happens if you keep highly correlated features in a linear model?
A) Accuracy improves
B) Coefficients become unstable
C) Outliers are removed
D) Data normalization becomes easier
Q7. When applying a high correlation filter, what is typically retained?
A) The feature with the lowest missing values
B) The most interpretable feature
C) Only target variable
D) One of the two highly correlated features
Q8. What is a common drawback of blindly removing features based on correlation?
A) Reduced training time
B) Loss of a significant signal
C) Better model accuracy
D) Lower memory usage
Q9. How does a correlation heatmap assist in applying the filter?
A) It highlights the outliers visually
B) It shows the pairwise correlation between all features
C) It ranks features by importance
D) It visualizes label distribution
Q10. Which step should come before applying a correlation filter?
A) Model deployment
B) Label encoding
C) Handling missing values and outliers
D) Hyperparameter tuning
Also Read: Clustering vs Classification: What is Clustering & Classification
High correlation filters allow you to refine your feature set by eliminating redundant features that provide overlapping information. Removing highly correlated variables can reduce multicollinearity, speed up model training, and improve your model's interpretability.
Applying a high correlation filter during data preprocessing in machine learning workflows helps streamline feature selection and ensures that your model focuses on the most relevant, independent variables. Whether working on predictive maintenance or customer segmentation, using a correlation filter empowers you to create more efficient, accurate, and interpretable models.
If you want to strengthen your practical ML skills, these additional courses may help you stay relevant and job-ready.
Curious which courses can help you gain expertise in ML in 2025? Contact upGrad for personalized counseling and valuable insights. For more details, you can also visit your nearest upGrad offline center.
The most effective libraries are pandas, NumPy, and scikit-learn. pandas' DataFrame.corr() helps compute pairwise correlation, while NumPy supports efficient array operations. scikit-learn's VarianceThreshold or custom transformers can automate correlation-based feature dropping if you're working with pipelines. seaborn's heatmap() is widely used for visual inspection to identify feature relationships before filtering.
It depends on the domain and model sensitivity. Use 0.8 if you want to remove redundancy strictly, as in finance or healthcare. Use 0.9 or 0.95 to be more conservative and retain more features. It's a trade-off: lower thresholds reduce noise but risk losing signal; higher thresholds preserve more data but can leave multicollinearity.
Yes. You can write a custom TransformerMixin class that calculates a correlation matrix and drops columns based on a threshold. This allows seamless integration into ML pipelines using Pipeline() or ColumnTransformer(). It’s handy when working with automated training flows in MLflow or Vertex AI platforms.
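A minimal sketch of such a transformer (the threshold and the downstream estimator are illustrative choices):

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CorrelationFilter(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Absolute pairwise correlations, upper triangle only
        corr = pd.DataFrame(X).corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
        self.to_drop_ = [c for c in upper.columns if (upper[c] > self.threshold).any()]
        return self

    def transform(self, X):
        return pd.DataFrame(X).drop(columns=self.to_drop_)

# Usage inside a pipeline:
# from sklearn.pipeline import Pipeline
# from sklearn.linear_model import LogisticRegression
# pipe = Pipeline([("corr_filter", CorrelationFilter(threshold=0.9)), ("clf", LogisticRegression())])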
Not directly. Correlation filters work on continuous numeric features. For categorical data, use association measures such as Cramér’s V or Theil’s U. If you're dealing with one-hot encoded features, apply correlation filtering only after encoding, and only on columns with enough variability, to avoid removing features with low signal.
Correlated features may represent temporal lag or trend information in time series. Instead of directly dropping them, consider dimensionality reduction (e.g., PCA) or transform correlated features using differencing or rolling statistics. Also, ensure your correlation analysis respects temporal order to avoid leaking data from the future.
Yes. Use mutual information, feature importance from tree models (e.g., XGBoost or Random Forest), Recursive Feature Elimination (RFE), or L1 regularization (Lasso). While correlation filters are fast and interpretable, they only capture linear relationships. These other techniques can capture non-linear patterns and interaction effects.
Linear models like logistic regression, linear regression, or SVMs are sensitive to multicollinearity, which can inflate coefficients and cause unstable predictions. In contrast, tree-based models (like decision trees or random forests) handle correlation better but may still overfit if redundant features dominate. Always check your model type before skipping this step.
Yes. Use pair plots (sns.pairplot) for deeper insight into feature relationships, or network graphs with libraries like networkx to visualize correlation clusters. For large datasets, dendrograms from hierarchical clustering can group similar features based on correlation distance. These visuals help you prune features in bulk while keeping their context in view.
Always maintain a log of dropped features using dictionaries or a tracking sheet. In pipelines, store the list in a .json or push it to MLflow’s artifact store. This helps with model debugging and reproducibility. In production systems, tracking dropped columns ensures transparency and avoids data mismatch issues downstream.
Apply correlation filtering before PCA or scaling. PCA captures variance from all features, including redundant ones, so using it after dropping correlated features gives cleaner principal components. Feature scaling (like StandardScaler or MinMaxScaler) doesn’t affect correlation direction but can amplify noise if used before filtering.
You can wrap the correlation filter logic in a custom Python function or class and deploy it as a preprocessing step in your pipeline DAG. Use a PythonOperator to run the filtering before model training tasks in Airflow. In Kubeflow, you can include it in a pipeline step using a custom container or a Jupyter-based component. For automated pipelines, export the list of dropped features as metadata or artifacts to ensure transparency and reproducibility across runs. This makes your feature selection process auditable and modular within scalable ML workflows.