Introduction
Linear regression and logistic regression are the two most prominent regression analysis techniques in machine learning. But there are many other types of regression analysis, and their usage varies according to the nature of the data involved.
This article will explain the different types of regression in machine learning, and under what condition each of them can be used. If you are new to machine learning, this article will surely help you in understanding the regression modeling concept.
What is Regression Analysis?
Regression analysis is a predictive modelling technique that analyzes the relation between the target (dependent) variable and one or more independent variables in a dataset. The different types of regression analysis techniques apply when the target and independent variables show a linear or non-linear relationship and the target variable contains continuous values. Regression is used mainly to determine predictor strength, forecast trends, model time series, and establish cause-and-effect relationships.
Regression analysis is the primary technique for solving regression problems in machine learning using data modelling. It involves determining the best-fit line, a line drawn through the data points in such a way that its distance from each data point is minimized.
An example of a regression model in data analysis is linear regression, which can be used to predict a company’s future sales based on historical sales data and advertising spend. For instance, it might show that for every $1,000 spent on advertising, sales increase by $5,000.
How does regression analysis work?
When conducting a regression analysis, you’re essentially delving into the relationship between two types of variables: the dependent variable and the independent variable(s). To kick things off, you need to pinpoint your dependent variable, which you believe is influenced by one or more independent variables.
Defining Variables and Gathering Data
Imagine we’re using an example related to event satisfaction and ticket prices. Our dependent variable here is the level of satisfaction with the event, while the independent variable we’re interested in is the price of the event ticket. To get a comprehensive dataset, surveys are an excellent tool. These surveys should cover questions related to both the dependent and independent variables you’ve identified.
For our example, we’d gather data on historical levels of event satisfaction over the past few years and also collect information about ticket prices. We’re particularly keen on exploring how ticket prices might affect satisfaction levels.
Plotting Data
Now, let’s visualize this data. We’ll plot the satisfaction levels (dependent variable) on the y-axis and the ticket prices (independent variable) on the x-axis. By doing so, we can start to see if there’s any correlation between the two variables.
Analyzing Correlations
Looking at the plotted data, we might notice patterns. If, hypothetically, we observe that higher ticket prices correspond to higher levels of event satisfaction, that’s interesting. But we need to delve deeper to understand the degree of influence ticket prices have on satisfaction levels.
Introducing the Regression Line
To do this, we draw a line through the data points. This line, known as the regression line, summarizes the relationship between our independent and dependent variables. It can be calculated using statistical tools such as Excel.
Understanding the Regression Line
The regression line tells us how the independent variable (ticket price) affects the dependent variable (event satisfaction). Excel provides us with a formula for this line, which might look something like this: Y = 100 + 7X + error term.
Interpreting the Formula
Breaking this down: if there’s no change in the ticket price (X), the satisfaction level (Y) would still be 100. The 7X part indicates that for every unit increase in the ticket price, the satisfaction level increases by 7 points. But it’s essential to note that there’s always an error term involved, which acknowledges that factors beyond ticket price also influence event satisfaction.
Considering Error
The presence of an error term reminds us that our regression line is an estimate based on available data. This means the larger the error term, the less certain we can be about the relationship between variables. In short, it’s a reminder that real-world scenarios are complex, and variables interact in ways we might not fully understand.
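To make the worked example concrete, here is a minimal sketch in Python. The intercept of 100 and the slope of 7 are the illustrative values from the formula above, not coefficients fitted from real data:

```python
# Illustrative model from the example: satisfaction = 100 + 7 * ticket_price
def predicted_satisfaction(ticket_price):
    intercept = 100   # baseline satisfaction when the price is zero
    slope = 7         # satisfaction points gained per unit price increase
    return intercept + slope * ticket_price

# A one-unit price increase raises predicted satisfaction by exactly 7 points
print(predicted_satisfaction(10))                              # 170
print(predicted_satisfaction(11) - predicted_satisfaction(10)) # 7
```

The real relationship would also carry an error term, so observed satisfaction scores would scatter around these predictions rather than match them exactly.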
Types of Regression Analysis Techniques
There are many types of regression analysis techniques, and the use of each method depends upon a number of factors. These factors include the type of target variable, the shape of the regression line, and the number of independent variables.
Below are the different regression techniques:
- Linear Regression
- Logistic Regression
- Ridge Regression
- Lasso Regression
- Polynomial Regression
- Bayesian Linear Regression
Let’s look at each of these regression models in detail, and when to use them:
1. Linear Regression
Linear regression is one of the most basic types of regression in machine learning. The linear regression model consists of a predictor variable and a dependent variable related linearly to each other. If the data involves more than one independent variable, the model is called multiple linear regression.
The below-given equation is used to denote the linear regression model:
y = mx + c + e
where m is the slope of the line, c is an intercept, and e represents the error in the model.
The best-fit line is determined by varying the values of m and c. The predictor error is the difference between the observed values and the predicted values. The values of m and c are selected so as to minimize the predictor error. It is important to note that a simple linear regression model is susceptible to outliers, so it should not be used on data containing significant outliers.
There are different types of linear regression. The two major types are simple linear regression and multiple linear regression. The formula for simple linear regression is y = β0 + β1x + ∈, where:
- Here, y is the predicted value of the dependent variable (y) for any value of the independent variable (x)
- β0 is the intercept, i.e., the value of y when x is zero
- β1 is the regression coefficient, i.e., the expected change in y for each one-unit increase in x
- x is the independent variable
- ∈ is the estimated error in the regression
Simple linear regression can be used:
- To find the intensity of dependency between two variables, such as the rate of carbon emission and global warming.
- To find the value of the dependent variable on an explicit value of the independent variable. For example, finding the amount of increase in atmospheric temperature with a certain amount of carbon dioxide emission.
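As a sketch of what simple linear regression looks like in code, here is a minimal Python example using ordinary least squares; the emission and temperature figures are invented purely for illustration:

```python
import numpy as np

# Hypothetical data: CO2 emission level (x) vs. temperature rise (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.1])

# Fit y = b1*x + b0 by least squares; polyfit returns highest degree first
b1, b0 = np.polyfit(x, y, 1)

# Predict the dependent variable at a new value of the independent variable
y_pred = b0 + b1 * 6.0
print(round(b1, 2), round(b0, 2))  # slope and intercept
```

Here b1 plays the role of β1 and b0 the role of β0 from the formula above; the residuals between y and the fitted line correspond to the ∈ term.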
In multiple linear regression, a relationship is established between two or more independent variables and the corresponding dependent variable. The equation for multiple linear regression is y = β0 + β1X1 + β2X2 + … + βnXn + ∈, where:
- Here, y is the predicted value of the dependent variable
- β0 = Value of y when other parameters are zero
- β1X1 = the regression coefficient (β1) of the first independent variable (X1) times its value
- … = the same pattern repeated for however many variables you test
- βnXn= Regression coefficient of the last independent variable
- ∈ = Estimated error in the regression
Multiple linear regression can be used:
- To estimate how strongly two or more independent variables influence the single dependent variable. Such as how location, time, condition, and area can influence the price of a property.
- To find the value of the dependent variables at a definite condition of all the independent variables. For example, finding the price of a property located at a certain place, with a specific area and its condition.
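The property-price example above can be sketched in Python with scikit-learn. All the numbers below are invented, constructed so that price follows an exact linear rule in area and age:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical property data: columns are [area in sq. m, age in years]
X = np.array([[50, 10], [80, 5], [100, 20], [120, 2], [150, 15]])
# Prices (in lakhs), generated so that price = 0.5*area - 1*age + 10
y = np.array([25.0, 45.0, 40.0, 68.0, 70.0])

model = LinearRegression().fit(X, y)

# Predict the price of a 90 sq. m, 8-year-old property
price = model.predict(np.array([[90, 8]]))[0]
print(round(price, 2))  # 47.0
```

Because the data fit the linear rule exactly, the model recovers the coefficients 0.5 and −1 and the intercept 10; with real, noisy data the estimates would only approximate the true relationship.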
2. Logistic Regression
Logistic regression is a type of regression analysis technique used when the dependent variable is discrete, e.g., 0 or 1, true or false. This means the target variable can take only two values, and a sigmoid curve denotes the relation between the target variable and the independent variables.
Logit function is used in Logistic Regression to measure the relationship between the target variable and independent variables. Below is the equation that denotes the logistic regression.
logit(p) = ln(p / (1 − p)) = b0 + b1X1 + b2X2 + b3X3 + … + bkXk
where p is the probability of occurrence of the feature.
When selecting logistic regression as the regression analysis technique, note that it works best when the dataset is large and the two values of the target variable occur in roughly equal proportions. Also, there should be no multicollinearity, meaning no correlation between the independent variables in the dataset.
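A minimal sketch of logistic regression in Python with scikit-learn, using an invented pass/fail dataset where the single feature is hours studied:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# The sigmoid of the linear combination gives a probability between 0 and 1
p_pass = clf.predict_proba(np.array([[8]]))[0, 1]
print(clf.predict(np.array([[2], [8]])))  # low hours -> 0, high hours -> 1
```

The predicted probability p_pass corresponds to the p in the logit equation above: the model fits the coefficients b0 and b1, and the sigmoid maps their linear combination back to a probability.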
3. Ridge Regression
This is another type of regression in machine learning, usually used when there is a high correlation between the independent variables. In the case of multicollinear data, the least-squares estimates remain unbiased, but their variances become very large, making the estimates unreliable. Therefore, a bias term is introduced into the Ridge Regression equation: by accepting a small amount of bias, the variance of the estimates is greatly reduced. This is a powerful regression method in which the model is less susceptible to overfitting.
Below is the equation used to denote the Ridge Regression, where the introduction of λ (lambda) solves the problem of multicollinearity:
β = (X^{T}X + λI)^{-1}X^{T}y
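The closed-form Ridge equation above can be implemented directly in a few lines of NumPy. The two nearly identical predictors below are generated to simulate multicollinearity; all values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two highly correlated predictors (multicollinearity), synthetic data
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)  # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

lam = 1.0                      # the lambda penalty strength
I = np.eye(X.shape[1])

# Ridge closed form: beta = (X^T X + lambda*I)^{-1} X^T y
beta = np.linalg.inv(X.T @ X + lam * I) @ X.T @ y
print(np.round(beta, 2))
```

With plain least squares the near-duplicate columns make X^T X nearly singular, so the coefficients explode in opposite directions; adding λI stabilizes the inverse, and the ridge solution splits the true effect of about 3 roughly evenly between the two correlated predictors.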
4. Lasso Regression
Lasso Regression is a type of regression in machine learning that performs regularization along with feature selection. It penalizes the absolute size of the regression coefficients. As a result, coefficient values get nearer to zero, which does not happen in the case of Ridge Regression.
Due to this, Lasso Regression effectively performs feature selection, allowing a subset of features from the dataset to be used to build the model. Only the required features are kept, and the coefficients of the others are made exactly zero. This helps avoid overfitting in the model. If the independent variables are highly collinear, Lasso regression picks only one of them and shrinks the others to zero.
Below is the equation that represents the Lasso Regression method:
min_{β} N^{-1}Σ^{N}_{i=1}(y_{i} − x_{i}^{T}β)^{2} + λΣ_{j}|β_{j}|
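A short scikit-learn sketch of Lasso’s feature-selection behaviour, using synthetic data in which only two of ten candidate features actually drive the target:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Ten candidate features, but only the first two matter (synthetic data)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha is the lambda penalty on the absolute size of the coefficients
lasso = Lasso(alpha=0.1).fit(X, y)

# Irrelevant features are driven exactly to zero; relevant ones survive
print(np.round(lasso.coef_, 2))
```

Note the contrast with Ridge: Ridge would shrink the eight irrelevant coefficients toward zero but leave them small and nonzero, while Lasso’s absolute-value penalty sets them to exactly zero, effectively selecting features.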
5. Polynomial Regression
Polynomial Regression is another type of regression analysis technique in machine learning; it is the same as Multiple Linear Regression with a little modification. In Polynomial Regression, the relationship between the independent variable X and the dependent variable Y is modeled as an n-th degree polynomial in X.
It is still a linear model in its coefficients, so the least-squares method is used here as well. The best-fit curve in Polynomial Regression is not a straight line but a curve through the data points, whose shape depends on the degree n.
While trying to minimize the mean squared error to get the best fit, the model can be prone to overfitting. It is recommended to analyze the fit towards the ends of the data range, as higher-degree polynomials can give strange results on extrapolation.
The below equation represents Polynomial Regression:
y = β0 + β1x + β2x^2 + … + βnx^n + ε
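A minimal Polynomial Regression sketch in Python; the data points are generated from a known quadratic purely for illustration, so the fit recovers the true coefficients exactly:

```python
import numpy as np

# Synthetic data from a known quadratic: y = 1 + 2x + 3x^2 (no noise)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x + 3 * x**2

# Fit a degree-2 polynomial by least squares; the model is still linear
# in its coefficients even though the curve itself is not a straight line
coeffs = np.polyfit(x, y, 2)  # highest degree first: [b2, b1, b0]
print(np.round(coeffs, 2))    # recovers [3. 2. 1.]
```

Raising the degree far beyond what the data support is exactly the overfitting risk described above: the curve bends to chase individual points and behaves erratically when extrapolated.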
6. Bayesian Linear Regression
Bayesian Regression is a type of regression in machine learning that uses Bayes’ theorem to find the values of the regression coefficients. In this method, the posterior distribution of the coefficients is determined instead of point least-squares estimates. Bayesian Linear Regression shares traits with both Linear Regression and Ridge Regression, and is more stable than simple Linear Regression.
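A brief sketch of Bayesian linear regression using scikit-learn’s BayesianRidge on invented data. Unlike ordinary least squares, the posterior also yields an uncertainty estimate alongside each point prediction:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x + 1 plus Gaussian noise
X = rng.normal(size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(scale=0.5, size=100)

model = BayesianRidge().fit(X, y)

# The posterior gives both a predictive mean and a standard deviation
mean, std = model.predict(np.array([[1.0]]), return_std=True)
print(round(float(mean[0]), 2), round(float(std[0]), 2))
```

The slight shrinkage of the coefficients toward zero is what makes this method resemble Ridge Regression, while the predictive standard deviation is the distinctly Bayesian addition.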
People often wonder “what is regression in AI” or “what is regression in machine learning”. Machine learning is a subset of AI; hence, both questions have the same answer.
In the case of regression in AI, different algorithms are used to make a machine learn the relationship between the provided datasets and make predictions accordingly. Hence, regression in AI is mainly used to add a level of automation to machines.
Regression in AI is often used in sectors like finance and investment, where establishing a relationship between a single dependent variable and multiple independent variables is a common case. A common example is estimating a house’s price based on its location, size, ROI, etc.
Regression plays a vital role in predictive modelling and is found in many machine learning applications. Regression algorithms provide different perspectives on the relationship between variables and their outcomes. These models can then be used as a guideline for fresh input data or to estimate missing data.
As the models are trained to understand a variety of relationships between different variables, they are often extremely helpful in predicting the portfolio performance or stocks and trends. These implementations fall under machine learning in finance.
Common uses of regression in AI include:
- Predicting a company’s sales or marketing success
- Generating continuous outcomes like stock prices
- Forecasting trends or customers’ purchase behaviour
We hope this helped you understand what regression is in AI and in machine learning.
Why do we use Regression Analysis?
Regression analysis is a powerful statistical tool used in various fields to understand the relationship between variables. Let’s find out the main purposes of regression analysis:
Understanding Relationships
First and foremost, regression analysis helps us understand how one variable (the dependent variable) changes with respect to another variable (the independent variable). Imagine you’re investigating how study hours affect exam scores. Regression analysis can tell you if there’s a significant relationship between these two factors.
Predictive Insights
One of the primary reasons we use regression analysis is for prediction. By analyzing historical data, regression models can forecast future outcomes. For instance, if we have data on past sales and advertising spending, regression analysis can predict future sales based on different advertising budgets.
Quantifying Relationships
Regression analysis provides us with coefficients that quantify the relationship between variables. These coefficients indicate the strength and direction of the relationship. For instance, a positive coefficient suggests that as one variable increases, the other also tends to increase.
Identifying Significant Factors
In complex systems with multiple variables, regression analysis helps identify which factors significantly influence the outcome. By analyzing the coefficients and statistical significance, we can determine which variables have a meaningful impact. This information is crucial for decision-making and resource allocation.
Model Validation
Another essential aspect of regression analysis is model validation. Once we develop a regression model, we need to ensure its accuracy and reliability. Through various statistical tests, we assess how well the model fits the data and whether it can be trusted for making predictions.
Risk Assessment
Regression analysis is also valuable in risk assessment. By analyzing historical data and identifying patterns, businesses can assess and mitigate risks more effectively. For example, a financial institution may use regression analysis to predict the likelihood of default based on various financial indicators.
Optimization
In many scenarios, regression analysis helps optimize processes and strategies. By understanding the relationships between variables, organizations can fine-tune their operations for better outcomes. For instance, a manufacturing company may use regression analysis to optimize production processes and minimize costs.
Continuous Improvement
Lastly, regression analysis supports continuous improvement initiatives. By analyzing data over time, organizations can identify trends, detect anomalies, and make necessary adjustments to improve performance. This iterative process helps businesses stay competitive and adapt to changing environments.
What are the Benefits of Regression Analysis?
Quantifying Relationships
Regression analysis allows researchers to quantify the relationship between a dependent variable and one or more independent variables. By providing numerical coefficients, it helps in understanding the strength and direction of these relationships. For instance, in a study examining the relationship between study hours and exam scores, regression analysis can determine how much exam scores change with each additional hour of study.
Prediction and Forecasting
One of the primary benefits of regression analysis is its predictive capability. By establishing a relationship between variables based on historical data, regression models can be used to forecast future outcomes. For instance, in finance, regression analysis is utilized to predict stock prices based on factors like company performance, market trends, and economic indicators.
Identifying Significant Variables
Regression analysis helps in identifying which independent variables have a significant impact on the dependent variable. Through statistical tests such as t-tests or F-tests, researchers can determine the significance of each variable in explaining the variation in the dependent variable. This helps in focusing resources and efforts on the most influential factors.
Model Evaluation
Regression analysis provides tools for assessing the goodness of fit of the model. Metrics like R-squared, adjusted R-squared, and root mean square error (RMSE) measure how well the model fits the data. These evaluations help in determining the reliability and accuracy of the regression model, guiding researchers in decision-making processes.
Control and Optimization
In experimental research or process optimization, regression analysis helps in identifying the optimal settings for independent variables to achieve a desired outcome. By analyzing the relationship between inputs and outputs, regression models assist in controlling and optimizing processes, leading to improved efficiency and performance.
Risk Management
Regression analysis is instrumental in risk management by identifying factors that contribute to risk exposure. For instance, in insurance, regression models help in assessing the relationship between variables such as age, health status, and lifestyle habits with the likelihood of filing a claim. This enables insurers to set premiums and manage risks effectively.
Decision Support
Regression analysis provides valuable insights to support decision-making processes. Whether it’s determining marketing strategies based on consumer behavior, allocating resources efficiently, or assessing the impact of policy changes, regression analysis aids in making informed decisions grounded in empirical evidence.
Conclusion
In addition to the above regression methods, there are many other types of regression in machine learning, including Elastic Net Regression, Jackknife Regression, Stepwise Regression, and Ecological Regression.
These different types of regression analysis techniques can be used to build the model depending upon the kind of data available or the one that gives the maximum accuracy. You can explore these techniques more or can go through the course of supervised learning on our website.