Lasso Regression in Machine Learning
By Rahul Singh
Updated on Jun 26, 2026 | 10 min read | 3.22K+ views
Lasso regression in machine learning is a regularization technique that improves linear regression models by reducing overfitting and selecting the most important features. It adds a penalty to the model during training, which discourages large coefficient values and can reduce the coefficients of less important features to zero.
As a result, lasso regression builds simpler, more interpretable models while maintaining strong predictive performance. It is widely used in datasets with many input variables, where identifying the most relevant features is as important as making accurate predictions.
This blog covers everything you need to understand lasso regression in machine learning: what it is, how it works mathematically, when to use it, how it compares to ridge regression, and how to implement it in Python.
Think of lasso regression like packing for a trip with limited luggage space. You keep only the items you really need and leave the rest behind. In the same way, lasso regression keeps the most important features and removes those that contribute little to the model.
The key thing that makes lasso regression different is what it does to irrelevant features: it sets their coefficients to exactly zero. That means it does not just reduce the impact of weak features. It removes them completely from the model.
This behavior is called automatic feature selection, and it is one of the biggest reasons lasso regression in machine learning is so widely used.
Standard linear regression minimizes this loss:
RSS = sum of (actual value - predicted value)^2
Lasso regression adds a penalty term to this:
Lasso Loss = RSS + lambda * (sum of |coefficients|)
The added term is the L1 penalty. Lambda (also written as alpha) controls how strong the penalty is:
| Lambda Value | Effect on Model |
| --- | --- |
| Lambda = 0 | No penalty, behaves like ordinary linear regression |
| Small Lambda | Mild shrinkage, most features kept |
| Large Lambda | Strong shrinkage, more coefficients become zero |
| Very large Lambda | Almost all coefficients become zero, underfitting risk |
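For readers who prefer symbols, the loss described in words above can be written as follows. The notation is an assumption chosen for illustration (n samples, p features, intercept \beta_0, coefficients \beta_j); the text itself only gives the verbal form.

```latex
\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^2

\text{Lasso Loss}(\beta) = \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

One practical note: scikit-learn's Lasso minimizes RSS divided by 2n plus alpha times the L1 term, so its alpha corresponds to lambda / (2n) in the formula above rather than to lambda itself. The intercept is not penalized.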
The L1 penalty uses the absolute values of the coefficients, which gives its constraint region a diamond shape with sharp corners on the axes. This geometry is what lets lasso push coefficients to exactly zero, unlike ridge regression, which only shrinks them close to zero.
Think of it this way. When you minimize a function with an L1 constraint, the solution often lands at a corner of the constraint region, where one or more coefficients are exactly zero. This is not the case with L2 (ridge), where the constraint is smooth and rounded, and solutions rarely hit zero.
That sharp corner behavior is what gives lasso regression in machine learning its feature selection property.
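The corner behavior can be seen in a one-coefficient sketch. Assuming a single coefficient with unpenalized estimate b, the L1-penalized problem 0.5*(b - beta)^2 + lam*|beta| has the closed-form "soft-thresholding" solution, while the L2-penalized problem 0.5*(b - beta)^2 + 0.5*lam*beta^2 simply rescales b. The function names and example values below are made up for illustration.

```python
import numpy as np

def lasso_1d(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*|beta| (soft-thresholding)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ridge_1d(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2 (pure rescaling)."""
    return b / (1.0 + lam)

# Illustrative unpenalized estimates: two strong, two weak
unpenalized = np.array([3.0, -2.0, 0.4, -0.1])
lam = 0.5

lasso_coefs = lasso_1d(unpenalized, lam)
ridge_coefs = ridge_1d(unpenalized, lam)

print("L1 result:", lasso_coefs)  # weak entries are cut to zero
print("L2 result:", ridge_coefs)  # every entry shrinks but stays non-zero
```

Any estimate whose absolute value is below lam is set to zero by the L1 rule, while the L2 rule never produces a zero unless the estimate was already zero.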
Lasso, ridge, and ElasticNet are all regularization techniques, but they work differently and suit different problems.
| Feature | Lasso (L1) | Ridge (L2) | ElasticNet |
| --- | --- | --- | --- |
| Penalty type | Sum of absolute values | Sum of squared values | Mix of L1 and L2 |
| Can zero out coefficients | Yes | No | Yes |
| Feature selection | Built-in | No | Partial |
| Handles multicollinearity | Moderate | Strong | Strong |
| Best when | Many irrelevant features | Many correlated features | Both problems present |
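The "Can zero out coefficients" row can be checked with a small experiment. This is a minimal sketch on synthetic data (10 features, only the first 3 useful); the alpha and l1_ratio values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first 3 of 10 features influence y
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_coef = np.array([3, -2, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.standard_normal(200) * 0.5
X_scaled = StandardScaler().fit_transform(X)

models = {
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "elasticnet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Count how many coefficients each model sets to exactly zero
zero_counts = {}
for name, model in models.items():
    model.fit(X_scaled, y)
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(f"{name}: {zero_counts[name]} coefficients are exactly zero")
```

Ridge keeps every coefficient non-zero, while lasso drops some of the irrelevant features outright. How many ElasticNet drops depends on how much of its penalty is L1.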
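Use lasso regression in machine learning when you suspect many of your features are irrelevant and you want feature selection built into the model.
Use ridge regression when many features are correlated and you want to keep all of them in the model.
Use ElasticNet when both problems are present: irrelevant features and correlated features.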
Python's scikit-learn library makes lasso regression straightforward to apply. Here is a step-by-step implementation.
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np
# Sample data: 10 features, only 3 are actually useful
np.random.seed(42)
X = np.random.randn(200, 10)
true_coef = np.array([3, -2, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + np.random.randn(200) * 0.5
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature scaling is important for lasso
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Fit lasso regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
# Check which coefficients became zero
print("Lasso coefficients:", lasso.coef_)
print("Test MSE:", mean_squared_error(y_test, lasso.predict(X_test_scaled)))
Expected output behavior: The coefficients for features 4 through 10 (which were set to zero in the true model) should be shrunk to zero or near zero by lasso.
Choosing the right alpha is critical. Use LassoCV to find the best value automatically:
from sklearn.linear_model import LassoCV
# LassoCV tests multiple alpha values using cross-validation
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0], cv=5)
lasso_cv.fit(X_train_scaled, y_train)
print("Best alpha:", lasso_cv.alpha_)
print("Coefficients:", lasso_cv.coef_)
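Lasso regression penalizes large coefficients. If one feature has values in the thousands and another has values between 0 and 1, the first needs only a tiny coefficient and is barely penalized, while the second needs a large coefficient and is penalized heavily. The penalty therefore depends on the units rather than on how useful each feature is. Standardize your features before fitting a lasso model, for example with StandardScaler fitted on the training data only.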
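One way to make sure scaling is never skipped is to bundle the scaler and the model together. The sketch below assumes scikit-learn's make_pipeline helper and uses synthetic data in which one feature is deliberately measured in the thousands; the coefficients and sizes are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 10))
X[:, 0] *= 1000.0  # feature 0 now has values in the thousands

# Features 0 and 1 both matter; feature 0 has a tiny raw coefficient
y = 0.003 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(200) * 0.5

# The pipeline scales the data, then fits LassoCV on the scaled features
pipeline = make_pipeline(StandardScaler(), LassoCV(cv=5))
pipeline.fit(X, y)

# make_pipeline names each step after its lowercased class name
lasso_cv = pipeline.named_steps["lassocv"]
print("Best alpha:", lasso_cv.alpha_)
print("Coefficients (standardized scale):", lasso_cv.coef_)
```

Calling pipeline.predict on new data applies the scaling statistics learned during fit, so the test set does not need to be transformed by hand.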
Knowing what lasso regression in machine learning is, is one thing. Knowing when to actually apply it is what matters in practice.
| Scenario | Problem with Lasso | Better Choice |
| --- | --- | --- |
| Many correlated features | Picks one arbitrarily, drops others | Ridge or ElasticNet |
| All features matter | Incorrectly zeros out useful ones | Ridge regression |
| Very small dataset | May over-regularize | Use with caution, tune alpha carefully |
| Non-linear relationships | Linear model may underfit | Tree-based models or kernel methods |
Lasso increases bias (it constrains the model) but reduces variance (the model generalizes better to new data). This trade-off is the core idea behind regularization.
With a small alpha, lasso behaves close to ordinary linear regression: low bias, high variance. With a large alpha, you get high bias and low variance. The right alpha sits somewhere in between, and cross-validation helps you find it.
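The trade-off can be made visible by sweeping alpha and watching the training error, the test error, and the number of surviving features. This is a rough sketch on synthetic data; the alpha grid is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first 3 of 10 features influence y
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 10))
true_coef = np.array([3, -2, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.standard_normal(200) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Each entry: (alpha, train MSE, test MSE, number of non-zero coefficients)
results = []
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_train_scaled, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train_scaled))
    test_mse = mean_squared_error(y_test, model.predict(X_test_scaled))
    n_nonzero = int(np.count_nonzero(model.coef_))
    results.append((alpha, train_mse, test_mse, n_nonzero))
    print(f"alpha={alpha:<6} train MSE={train_mse:.3f} "
          f"test MSE={test_mse:.3f} non-zero={n_nonzero}")
```

At the largest alpha every coefficient is forced to zero and the model only predicts the mean, which is the high-bias end of the trade-off.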
Lasso regression in machine learning is not just a textbook algorithm. It sees regular use in fields where selecting the right features is as important as the prediction itself.
Here is a simplified comparison on a dataset with 500 features, only 20 of which are truly predictive:
| Model | Test RMSE | Features Used | Interpretable |
| --- | --- | --- | --- |
| Linear Regression | 4.2 | 500 | No |
| Ridge Regression | 3.1 | 500 | Partially |
| Lasso Regression | 2.8 | 18 | Yes |
| ElasticNet | 2.9 | 22 | Yes |
In this simplified comparison, lasso achieves the lowest error with the simplest model because it removes most of the noisy features, keeping 18 features against the 20 that are truly predictive.
Even with its strengths, lasso regression in machine learning has a few known weaknesses. Being aware of them will save you debugging time.
What is lasso regression in machine learning in simple terms? It is a smarter version of linear regression that learns to ignore what does not matter.
Use it when you have more features than you need, when interpretability matters, or when you suspect your dataset has a lot of noise hiding the true signal. Pair it with cross-validation to tune the alpha parameter, always scale your features, and compare it against ridge and ElasticNet to find what works best for your problem.
Lasso regression is a type of linear regression that adds a penalty to reduce the model's complexity. It automatically shrinks some feature coefficients to exactly zero, which means it selects only the most important features and ignores the rest during training.
Ordinary linear regression tries to fit all features without any constraints, which can lead to overfitting when many features are irrelevant. Lasso adds a penalty on the size of coefficients, discouraging the model from using noisy or redundant features. This makes lasso more reliable on unseen data.
Alpha controls the strength of the regularization penalty. A small alpha applies mild shrinkage and keeps most features. A large alpha applies strong shrinkage and forces more coefficients to zero. You should always tune alpha using cross-validation rather than setting it manually without testing.
Lasso uses an L1 penalty, which is the sum of absolute values of coefficients. Geometrically, this constraint creates sharp corners where the optimal solution often lands, setting some coefficients to exactly zero. Ridge uses an L2 penalty (squared values), which has a smooth boundary and rarely produces exact zeros.
Lasso in its standard form is designed for regression (predicting continuous values). For classification, you would use logistic regression with an L1 penalty, which is available in scikit-learn via LogisticRegression(penalty='l1') combined with a solver that supports L1, such as solver='liblinear' or solver='saga'. The default solver does not support the L1 penalty. The feature selection behavior is similar.
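A minimal sketch of L1-penalized classification follows. It assumes the liblinear solver, which supports the L1 penalty, and uses synthetic data in which only the first two of 20 features affect the label; the value C=0.05 is an arbitrary illustrative choice. Depending on your scikit-learn version, the penalty argument may emit a deprecation warning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary labels driven by features 0 and 1 only
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = (logits + rng.standard_normal(300) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# Smaller C means stronger regularization (C is the inverse of the penalty strength)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
clf.fit(X_scaled, y)

coef = clf.coef_.ravel()
print("Non-zero coefficients at indices:", np.flatnonzero(coef))
print("Number of zeroed features:", int(np.sum(coef == 0)))
```

Note that C works in the opposite direction to alpha in Lasso: lowering C increases the penalty and zeros out more features.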
When two or more features are highly correlated, lasso tends to pick one and set the rest to zero. This arbitrary selection can be a problem if all correlated features carry real information. In such cases, ElasticNet (which combines L1 and L2 penalties) is usually a better choice.
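The behavior is easiest to see in a deliberately extreme case. In the sketch below, two columns are exact copies of each other, which is an assumption made purely for illustration; real correlated features are rarely identical, and the alpha and l1_ratio values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
unrelated = rng.standard_normal(200)

# Extreme, illustrative case: columns 0 and 1 are exact duplicates
X = np.column_stack([x, x, unrelated])
y = 2.0 * x + rng.standard_normal(200) * 0.5

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso coefficients:     ", lasso.coef_)
print("ElasticNet coefficients:", enet.coef_)

# How unevenly each model treats the two duplicate columns
lasso_gap = abs(lasso.coef_[0] - lasso.coef_[1])
enet_gap = abs(enet.coef_[0] - enet.coef_[1])
print("Lasso gap:", lasso_gap, "ElasticNet gap:", enet_gap)
```

With scikit-learn's default coordinate descent settings, lasso puts essentially all of the weight on one of the two copies, while ElasticNet tends to share the weight between them.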
Use cross-validation. In scikit-learn, LassoCV automatically tests a range of alpha values across multiple folds and selects the one that produces the lowest cross-validated error. Avoid picking alpha by hand without validation, as it leads to poorly tuned models.
Yes, feature scaling is essential. Lasso penalizes the magnitude of coefficients, so if features are on different scales, the penalty affects them unequally. Standardizing features using StandardScaler (mean zero, unit variance) ensures fair penalization across all features.
Lasso uses only the L1 penalty, which produces sparse models but struggles with correlated features. ElasticNet combines both L1 and L2 penalties. This lets it zero out irrelevant features like lasso while also handling groups of correlated features more gracefully, like ridge regression.
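Coefficients that are exactly zero mean that, at the chosen alpha and given the other features in the model, lasso found those features added nothing useful for predicting the outcome. A zero does not prove a feature is irrelevant: a feature that is correlated with a selected one can be dropped even though it carries real information. Non-zero coefficients indicate features that contribute to the prediction. The sign (positive or negative) tells you the direction of the relationship, and the magnitude tells you relative importance, though only when features are scaled.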
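Reading the coefficients is easier when they are paired with feature names and sorted by magnitude. The feature names below are hypothetical and the data is synthetic; this is only a sketch of one way to produce such a report.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names, used only for illustration
feature_names = ["area", "rooms", "age", "noise_a", "noise_b"]

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = (3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2]
     + rng.standard_normal(200) * 0.5)

# Scale first so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Sort by absolute coefficient so the strongest features come first
order = np.argsort(-np.abs(lasso.coef_))
report = [(feature_names[i], float(lasso.coef_[i])) for i in order]

for name, coef in report:
    if coef == 0:
        status = "dropped"
    else:
        status = "positive" if coef > 0 else "negative"
    print(f"{name:>8}: {coef:+.3f} ({status})")
```

Because the features were standardized, each non-zero coefficient is the change in the prediction per one standard deviation of that feature.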
Lasso can work on small datasets but requires careful tuning. With a small sample, a high alpha may over-regularize and remove genuinely useful features. Use cross-validation with a fine-grained range of alpha values and consider whether the number of features is much larger than the number of samples before applying lasso.
Rahul Singh is an Associate Content Writer at upGrad, with a strong interest in Data Science, Machine Learning, and Artificial Intelligence. He combines technical development skills with data-driven s...