Lasso Regression in Machine Learning

By Rahul Singh

Updated on Jun 26, 2026 | 10 min read | 3.22K+ views


Lasso regression in machine learning is a regularization technique that improves linear regression models by reducing overfitting and selecting the most important features. It adds a penalty to the model during training, which discourages large coefficient values and can reduce the coefficients of less important features to zero.

As a result, lasso regression builds simpler, more interpretable models while maintaining strong predictive performance. It is widely used in datasets with many input variables, where identifying the most relevant features is as important as making accurate predictions.

This blog covers everything you need to understand lasso regression in machine learning: what it is, how it works mathematically, when to use it, how it compares to ridge regression, and how to implement it in Python.

What Is Lasso Regression in Machine Learning?

Think of lasso regression like packing for a trip with limited luggage space. You keep only the items you really need and leave the rest behind. In the same way, lasso regression keeps the most important features and removes those that contribute little to the model.

The key thing that makes lasso regression different is what it does to irrelevant features: it sets their coefficients to exactly zero. That means it does not just reduce the impact of weak features. It removes them completely from the model.

This behavior is called automatic feature selection, and it is one of the biggest reasons lasso regression in machine learning is so widely used.

The Lasso Regression Formula

Standard linear regression minimizes this loss:

RSS = sum of (actual value - predicted value)^2

Lasso regression adds a penalty term to this:

Lasso Loss = RSS + lambda * (sum of |coefficients|)

The added term is the L1 penalty. Lambda (called alpha in scikit-learn) controls how strong the penalty is:

| Lambda Value | Effect on Model |
|---|---|
| Lambda = 0 | No penalty, behaves like ordinary linear regression |
| Small Lambda | Mild shrinkage, most features kept |
| Large Lambda | Strong shrinkage, more coefficients become zero |
| Very large Lambda | Almost all coefficients become zero, underfitting risk |

The L1 penalty uses the absolute values of the coefficients. Because the absolute value has a sharp kink at zero, the penalty keeps pulling small coefficients all the way to exactly zero, unlike ridge regression, whose squared penalty only shrinks them toward zero.
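To make the formula concrete, here is a minimal sketch that evaluates the lasso loss for a fixed set of coefficients. The toy data, the coefficient values, and the helper name lasso_loss are illustrative assumptions, not taken from a real dataset. Note that scikit-learn's Lasso divides the RSS term by 2 * n_samples, so its alpha is not numerically identical to the lambda used here.

```python
def lasso_loss(X, y, coef, lam):
    """Return RSS + lam * sum(|coef|) for a linear model with no intercept."""
    rss = 0.0
    for row, target in zip(X, y):
        prediction = sum(c * x for c, x in zip(coef, row))
        rss += (target - prediction) ** 2
    l1_penalty = sum(abs(c) for c in coef)
    return rss + lam * l1_penalty


# Hypothetical toy data: 3 samples, 2 features
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 4.0, 8.0]
coef = [2.0, 1.5]

# The RSS part stays fixed; only the penalty part grows with lambda
for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam:>4}: loss={lasso_loss(X, y, coef, lam):.2f}")
```

With these numbers the RSS is 0.25 and the sum of absolute coefficients is 3.5, so the loss grows from 0.25 to 3.75 to 35.25 as lambda increases. The same coefficients become more expensive, which is what pushes the optimizer toward smaller ones.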

Why Does L1 Produce Sparse Models?

Think of it this way. When you minimize a function with an L1 constraint, the solution often lands at a corner of the constraint region, where one or more coefficients are exactly zero. This is not the case with L2 (ridge), where the constraint is smooth and rounded, and solutions rarely hit zero.

That sharp corner behavior is what gives lasso regression in machine learning its feature selection property.
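The corner behavior can also be seen numerically in the simplest setting. Assuming a single coefficient whose least-squares value is b (the situation you get with one standardized feature or with orthonormal features), lasso reduces to soft-thresholding while ridge reduces to a proportional shrink. The sketch below works under that simplifying assumption; the function names and input values are illustrative.

```python
def lasso_shrink(b, lam):
    """Minimizer of 0.5 * (b - w)**2 + lam * abs(w): soft-thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam):
    """Minimizer of 0.5 * (b - w)**2 + 0.5 * lam * w**2: proportional shrinkage."""
    return b / (1.0 + lam)


lam = 1.0
for b in (3.0, 1.5, 0.8, -0.4):
    print(f"least-squares={b:+.1f}  "
          f"lasso={lasso_shrink(b, lam):+.2f}  "
          f"ridge={ridge_shrink(b, lam):+.2f}")
```

Any coefficient whose least-squares value is smaller than lam in magnitude lands exactly on zero under lasso, while ridge scales it down but never reaches zero.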


Lasso vs Ridge vs ElasticNet: Key Differences

All three are regularization techniques, but they work differently and suit different problems.

How the Penalties Differ

| Feature | Lasso (L1) | Ridge (L2) | ElasticNet |
|---|---|---|---|
| Penalty type | Sum of absolute values | Sum of squared values | Mix of L1 and L2 |
| Can zero out coefficients | Yes | No | Yes |
| Feature selection | Built-in | No | Partial |
| Handles multicollinearity | Moderate | Strong | Strong |
| Best when | Many irrelevant features | Many correlated features | Both problems present |

When to Choose Each

Use lasso regression in machine learning when:

  • You have many features but suspect only a few are truly useful
  • You want a simpler, more interpretable model
  • Feature selection and prediction need to happen together

Use ridge regression when:

  • Most features are likely to contribute something
  • You have high multicollinearity and do not want to remove features
  • You need a stable model with lower variance

Use ElasticNet when:

  • You have both issues: too many features and multicollinearity
  • You want a balance between L1 and L2 behavior
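A quick way to see these differences is to fit all three models on the same data and count the non-zero coefficients. The sketch below assumes synthetic data with 10 features, only 3 of which are informative; the alpha and l1_ratio values are arbitrary choices for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data (an assumption for illustration): 10 features, 3 informative
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_coef = np.array([3, -2, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.standard_normal(200) * 0.5

# Untuned, illustrative regularization strengths
models = {
    "Lasso": Lasso(alpha=0.2),
    "Ridge": Ridge(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=0.2, l1_ratio=0.5),
}

nonzero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    nonzero_counts[name] = int(np.count_nonzero(model.coef_))
    print(f"{name:<10} non-zero coefficients: {nonzero_counts[name]} of {X.shape[1]}")
```

Ridge keeps every coefficient non-zero, while lasso drops some of the uninformative ones outright. ElasticNet typically lands between the two, depending on l1_ratio.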

How to Implement Lasso Regression in Python

Python's scikit-learn library makes lasso regression straightforward to apply. Here is a step-by-step implementation.

Basic Lasso Regression Example

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np

# Sample data: 10 features, only 3 are actually useful
np.random.seed(42)
X = np.random.randn(200, 10)
true_coef = np.array([3, -2, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + np.random.randn(200) * 0.5

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling is important for lasso
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit lasso regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

# Check which coefficients became zero
print("Lasso coefficients:", lasso.coef_)
print("Test MSE:", mean_squared_error(y_test, lasso.predict(X_test_scaled)))

Expected output behavior: The coefficients for features 4 through 10 (which were set to zero in the true model) should be shrunk to zero or near zero by lasso.

Tuning Lambda with Cross-Validation

Choosing the right alpha is critical. Use LassoCV to find the best value automatically:

from sklearn.linear_model import LassoCV

# LassoCV tests multiple alpha values using cross-validation
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0, 10.0], cv=5)
lasso_cv.fit(X_train_scaled, y_train)

print("Best alpha:", lasso_cv.alpha_)
print("Coefficients:", lasso_cv.coef_)

Important: Always Scale Your Features

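Lasso applies the same penalty to every coefficient, but the size of a coefficient depends on the units of its feature. A feature with values in the thousands needs only a tiny coefficient and is barely penalized, while a feature with values between 0 and 1 needs a larger coefficient and is penalized more. Standardize your features before fitting a lasso model, for example with StandardScaler fitted on the training data only.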
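One way to make scaling hard to forget is to bundle it with the model in a scikit-learn pipeline. The sketch below is a minimal example under assumed synthetic data in which one feature is deliberately on a much larger scale; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): feature 0 is on a much larger scale
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X[:, 0] *= 1000.0
y = 0.003 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(200) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training data, then fits LassoCV on the
# scaled features; the test set is only transformed, never used for fitting
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X_train, y_train)

lasso = model.named_steps["lassocv"]
print("Chosen alpha:", lasso.alpha_)
print("Coefficients on the standardized scale:", lasso.coef_)
print("Test R^2:", model.score(X_test, y_test))
```

Because the scaler lives inside the pipeline, calling predict or score on new data applies the same training-set scaling automatically.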


When Should You Use Lasso Regression?

Knowing what lasso regression is in machine learning is one thing. Knowing when to actually apply it is what matters in practice.

Situations Where Lasso Works Well

  1. High-dimensional datasets: When you have hundreds or thousands of features and only a handful are likely to be meaningful. Genomics data is a classic example.
  2. Need for model interpretability: After lasso removes irrelevant features, the resulting model is easier to explain. A model with 5 coefficients is far more interpretable than one with 50.
  3. Sparse underlying relationships: If you believe the true relationship between inputs and output depends on only a few variables, lasso is designed for that assumption.
  4. Exploratory analysis: Lasso can act as a first-pass filter to identify which features deserve deeper investigation.

Situations Where Lasso May Not Be Ideal

| Scenario | Problem with Lasso | Better Choice |
|---|---|---|
| Many correlated features | Picks one arbitrarily, drops others | Ridge or ElasticNet |
| All features matter | Incorrectly zeros out useful ones | Ridge regression |
| Very small dataset | May over-regularize | Use with caution, tune alpha carefully |
| Non-linear relationships | Linear model may underfit | Tree-based models or kernel methods |


The Bias-Variance Trade-off in Lasso

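Lasso increases bias (the penalty constrains the model and shrinks coefficients away from their least-squares values) but reduces variance (the fitted model changes less from one training sample to another, which usually helps it generalize to new data). This trade-off is the core idea behind regularization.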

With a small alpha, lasso behaves close to ordinary linear regression: low bias, high variance. With a large alpha, you get high bias and low variance. The right alpha sits somewhere in between, and cross-validation helps you find it.
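The trade-off can be observed by sweeping alpha and tracking training error, test error, and the number of non-zero coefficients. The sketch below assumes synthetic data with 10 features, 3 of them informative; the alpha grid is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): 10 features, 3 informative
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 10))
true_coef = np.array([3, -2, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.standard_normal(200) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
train_errors, test_errors, nonzero = [], [], []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))
    nonzero.append(int(np.count_nonzero(model.coef_)))
    print(f"alpha={alpha:<6} train MSE={train_errors[-1]:.3f} "
          f"test MSE={test_errors[-1]:.3f} non-zero={nonzero[-1]}")
```

Training error rises as alpha grows, which is the added bias, and the count of non-zero coefficients falls until the model predicts only the mean. The test error column shows where the balance between the two sits for this particular dataset.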


Lasso Regression in Real-World Applications

Lasso regression in machine learning is not just a textbook algorithm. It sees regular use in fields where selecting the right features is as important as the prediction itself.

Common Use Cases

  1. Genomics and bioinformatics: Studies often have tens of thousands of gene expression features but only a few hundred samples. Lasso identifies which genes predict a specific outcome without overfitting to irrelevant ones.
  2. Finance: Predicting stock returns or credit risk using many economic indicators. Lasso helps select which indicators actually matter for a given target variable.
  3. Medical research: When predicting patient outcomes using clinical variables, interpretability is essential. Lasso delivers both prediction accuracy and a lean model that clinicians can understand.
  4. Text analysis: In bag-of-words models, you might have thousands of word features. Lasso reduces these to the terms most predictive of the label.
  5. Real estate pricing: Predicting house prices using dozens of features like square footage, location, age, and amenities. Lasso identifies the features that drive price the most.

Performance Benchmark Example

Here is a simplified comparison on a dataset with 500 features, only 20 of which are truly predictive:

| Model | Test RMSE | Features Used | Interpretable |
|---|---|---|---|
| Linear Regression | 4.2 | 500 | No |
| Ridge Regression | 3.1 | 500 | Partially |
| Lasso Regression | 2.8 | 18 | Yes |
| ElasticNet | 2.9 | 22 | Yes |

In this comparison, lasso achieves the lowest error with the fewest features because it removes most of the noisy ones.

Limitations and Common Mistakes to Avoid

Even with its strengths, lasso regression in machine learning has a few known weaknesses. Being aware of them will save you debugging time.

Known Limitations

  1. Struggles with grouped correlated features: If features A, B, and C are all correlated and all useful, lasso may pick only one and zero out the rest. This can hurt model performance. ElasticNet handles this more gracefully.
  2. No probabilistic output: Lasso gives point predictions, not probability distributions. For tasks where uncertainty matters, you need a different approach.
  3. Assumes linear relationships: If the true relationship between inputs and output is non-linear, lasso will underfit regardless of the alpha value.
  4. Sensitive to feature scaling: Forgetting to scale features is one of the most common implementation mistakes. Always scale before fitting.

Common Implementation Mistakes

  • Not using cross-validation to tune alpha, and instead guessing
  • Forgetting to scale features and getting unreliable coefficient comparisons
  • Using lasso when all features are genuinely important (ridge is better there)
  • Treating zeroed-out coefficients as proof that a feature has no predictive power at all (it may still carry signal in interaction with other features)


Conclusion

What is lasso regression in machine learning in simple terms? It is linear regression with an L1 penalty that learns to ignore what does not matter.

Use it when you have more features than you need, when interpretability matters, or when you suspect your dataset has a lot of noise hiding the true signal. Pair it with cross-validation to tune the alpha parameter, always scale your features, and compare it against ridge and ElasticNet to find what works best for your problem.


Frequently Asked Questions (FAQs)

1. What is lasso regression in machine learning in simple terms?

Lasso regression is a type of linear regression that adds a penalty to reduce the model's complexity. It automatically shrinks some feature coefficients to exactly zero, which means it selects only the most important features and ignores the rest during training.

2. How is lasso different from ordinary linear regression?

Ordinary linear regression tries to fit all features without any constraints, which can lead to overfitting when many features are irrelevant. Lasso adds a penalty on the size of coefficients, discouraging the model from using noisy or redundant features. This makes lasso more reliable on unseen data.

3. What does the alpha (lambda) parameter control in lasso regression?

Alpha controls the strength of the regularization penalty. A small alpha applies mild shrinkage and keeps most features. A large alpha applies strong shrinkage and forces more coefficients to zero. You should always tune alpha using cross-validation rather than setting it manually without testing.

4. Why does lasso produce sparse models while ridge does not?

Lasso uses an L1 penalty, which is the sum of absolute values of coefficients. Geometrically, this constraint creates sharp corners where the optimal solution often lands, setting some coefficients to exactly zero. Ridge uses an L2 penalty (squared values), which has a smooth boundary and rarely produces exact zeros.

5. Can lasso regression be used for classification problems?

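Lasso in its standard form is designed for regression (predicting continuous values). For classification, you would use logistic regression with an L1 penalty, which is available in scikit-learn via LogisticRegression(penalty='l1', solver='liblinear'). The default solver does not support the L1 penalty, so you need to pick one that does, such as 'liblinear' or 'saga'. The feature selection behavior is similar.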
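As a minimal sketch of L1-penalized classification, the example below fits a logistic regression on assumed synthetic data where only the first 2 of 10 features influence the label. It assumes a scikit-learn version that accepts the penalty argument; the value of C is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary labels (an assumption): only the first 2 of 10 features matter
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))
y = (3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(300) * 0.5 > 0).astype(int)

# solver="liblinear" supports the L1 penalty; smaller C means stronger regularization
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
clf.fit(X, y)

print("Coefficients:", clf.coef_.ravel())
print("Zeroed features:", np.flatnonzero(clf.coef_.ravel() == 0))
```

Note that C is the inverse of the regularization strength, so it works in the opposite direction to alpha in Lasso.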

6. What happens when lasso is applied to correlated features?

When two or more features are highly correlated, lasso tends to pick one and set the rest to zero. This arbitrary selection can be a problem if all correlated features carry real information. In such cases, ElasticNet (which combines L1 and L2 penalties) is usually a better choice.

7. How do I choose the right alpha value for lasso regression?

Use cross-validation. In scikit-learn, LassoCV automatically tests a range of alpha values across multiple folds and selects the one that produces the lowest cross-validated error. Avoid picking alpha by hand without validation, as it leads to poorly tuned models.

8. Is feature scaling necessary before applying lasso regression?

Yes, feature scaling is essential. Lasso penalizes the magnitude of coefficients, so if features are on different scales, the penalty affects them unequally. Standardizing features using StandardScaler (mean zero, unit variance) ensures fair penalization across all features.

9. What is the difference between lasso regression and ElasticNet?

Lasso uses only the L1 penalty, which produces sparse models but struggles with correlated features. ElasticNet combines both L1 and L2 penalties. This lets it zero out irrelevant features like lasso while also handling groups of correlated features more gracefully, like ridge regression.

10. How do I interpret the coefficients of a lasso regression model?

Coefficients that are exactly zero mean lasso dropped those features at the chosen alpha. This is not proof that they carry no signal, since a dropped feature may be correlated with one that was kept. Non-zero coefficients indicate features that contribute to the prediction. The sign (positive or negative) tells you the direction of the relationship, and the magnitude tells you relative importance, though only when features are scaled.

11. Does lasso regression work well on small datasets?

Lasso can work on small datasets but requires careful tuning. With a small sample, a high alpha may over-regularize and remove genuinely useful features. Use cross-validation with a fine-grained range of alpha values and consider whether the number of features is much larger than the number of samples before applying lasso.
