Did You Know? Linear Regression in Machine Learning has a rich history dating back to the late 19th century, when Francis Galton first introduced the idea in 1877 as a method to study the relationship between variables. His work was later expanded by Udny Yule and Karl Pearson, who applied the concept to broader statistical contexts, laying the foundation for what we now know as regression analysis in statistics.
What is linear regression in machine learning? It’s a fundamental algorithm that models the relationship between two variables in a dataset by fitting a straight line through the data points. This line, called the regression line, predicts the dependent variable from the independent variable.
It’s used in tasks like predicting sales from historical data or estimating housing prices based on square footage. Simple linear regression assumes a linear relationship between input and output, making it a powerful tool for both prediction and analysis.
In this article, we will explain the regression line formula and provide practical examples to support it.
Ready to master advanced data visualization and techniques like Linear Regression? upGrad’s online data science courses will help you strengthen your skills in Python, Machine Learning, AI, Tableau, and SQL, while offering practical, hands-on experience to tackle complex real-world problems.
Simple Linear Regression is a statistics-based method used in machine learning and data analysis to model the relationship between two continuous variables. In this technique, we try to establish a relationship between one independent variable (X) and one dependent variable (Y). It answers the common question of what simple linear regression is and defines simple regression in terms of a single straight line.
The goal of simple linear regression is to fit a straight line through the data that best describes the relationship between the two variables.
This line is often referred to as the regression line and can be expressed using the following equation:
Y = β₀ + β₁X + ϵ
Where:
- Y is the dependent variable (the value being predicted),
- X is the independent variable (the predictor),
- β₀ is the intercept of the line,
- β₁ is its slope, and
- ϵ is the error term.
Machine learning professionals with expertise in techniques like Simple Linear Regression are highly sought after for their ability to analyze and interpret complex data. If you’re aiming to enhance your AI and ML skills, here are some top-rated courses to guide you on your journey.
When to Use Simple Regression?
Simple linear regression is useful when you believe that the dependent variable is linearly dependent on one independent variable. It’s typically applied when the relationship between the variables appears to be a straight line.
Use Case Examples: predicting sales from advertising spend, estimating house prices from square footage, or forecasting salary from years of experience.
When Not to Use It?
Simple linear regression may not be appropriate when the association between the independent and dependent variables is non-linear. In such cases, techniques like polynomial regression or more complex models like decision trees or neural networks might be better suited.
Mathematical Understanding of Simple Regression
To find the best-fitting line in simple linear regression, we aim to minimize the sum of squared errors (SSE) between the predicted Y values and the actual values. This is done by estimating the slope β₁ and intercept β₀ using methods like Ordinary Least Squares (OLS). The formulas for the two are as follows:
Slope: β₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
Intercept: β₀ = Ȳ − β₁X̄
Where X̄ and Ȳ are the mean values of X and Y, respectively. Once these parameters are estimated, you can use the equation to predict future values of Y based on new values of X.
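As a quick illustration, here is a minimal NumPy sketch of these OLS formulas, using the small experience-vs-salary sample that appears later in this article (the data is illustrative):
import numpy as np

# Small illustrative dataset: years of experience vs. salary (in thousands)
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([40, 45, 50, 55, 60], dtype=float)

# OLS estimates from the formulas above
X_mean, Y_mean = X.mean(), Y.mean()
beta_1 = np.sum((X - X_mean) * (Y - Y_mean)) / np.sum((X - X_mean) ** 2)
beta_0 = Y_mean - beta_1 * X_mean

print(f"Slope (beta_1): {beta_1:.2f}")      # 5.00 for this data
print(f"Intercept (beta_0): {beta_0:.2f}")  # 35.00 for this data

# Predict Y for a new value of X
x_new = 6
print(f"Predicted Y for X={x_new}: {beta_0 + beta_1 * x_new:.2f}")  # 65.00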
The line of regression is a key concept in simple linear regression and machine learning. It represents the best-fit line that models the relationship between the independent variable (predictor) and the dependent variable (response). This line is drawn to minimize the sum of squared errors between the predicted values and the actual observed values.
Here’s a deeper look at the concept:
1. Best-Fit Line
The best-fit line is the straight line Ŷ = β₀ + β₁X that comes closest to all of the observed data points.
Where:
- Ŷ is the predicted value of the dependent variable,
- β₀ is the intercept, and
- β₁ is the slope.
2. Linear Relationship
Example: In predicting house prices based on square footage, the line of regression would model the relationship between these two variables, with a slope indicating how much the price increases for each additional square foot of space.
3. Equation of the Line
The regression line is typically expressed as:
Y = β₀ + β₁X + ϵ
This equation predicts the value of Y based on the value of X, using the slope (β₁) and intercept (β₀).
4. Minimizing the Residuals
Also Read: Machine Learning Tutorial: Basics, Algorithms, and Examples Explained
5. Use in Prediction
6. Visual Representation
Graphically, the line of regression is plotted on a 2D graph where:
- the independent variable (X) is shown on the horizontal axis,
- the dependent variable (Y) is shown on the vertical axis, and
- the regression line passes through the scatter of data points, summarizing the overall trend.
Ready to dive deeper into simple linear regression? Explore upGrad’s free Linear Regression - Step by Step Guide. This comprehensive course will solidify your foundation in predictive modeling, covering everything from simple and multiple regression to performance metrics and their real-world applications in data science.
In regression analysis, understanding the roles of the independent and dependent variables is crucial for building accurate models and interpreting the relationships between the variables. These two types of variables play distinct roles in the regression process and are fundamental to constructing and understanding regression equations.
The independent variable, often referred to as the predictor variable or explanatory variable, is the variable that you manipulate or use to predict the value of another variable. In regression analysis, the independent variable is typically the input feature that is believed to influence the dependent variable.
Here X is the independent variable, which helps predict the value of the dependent variable 𝑌.
The dependent variable is the variable that you are trying to predict or explain. It "depends" on the independent variable(s) because changes in the independent variable influence its value.
In this equation, Y is the dependent variable which is predicted based on the independent variable X.
How They’re Plotted Together
In a scatter plot used for simple linear regression:
- the independent variable (X) is plotted on the horizontal (x) axis, and
- the dependent variable (Y) is plotted on the vertical (y) axis.
The scatter plot shows the connection between the two variables, and the regression line is drawn to model this relationship. The line represents the predicted values of Y based on the values of X.
Example: Simple Linear Regression
Let's consider an example where we want to predict the price of a car based on its age.
The linear regression equation would look like:
Price = β₀ + β₁ × Age + ϵ
where the slope β₁ is expected to be negative, since a car's price typically decreases as its age increases.
If we plotted this data on a scatter plot:
- age (in years) would be on the x-axis,
- price would be on the y-axis, and
- the regression line would slope downward, reflecting the inverse relationship.
Importance of Regression Models
Also Read: How to Perform Multiple Regression Analysis?
A simple regression model is built to predict one variable from another. Its components, broken down in the next section, define simple regression in machine learning and statistics and set the foundation for any example of linear regression in machine learning, including predictive pricing or salary estimation tasks.
Simple Linear Regression is one of the foundational algorithms in machine learning, commonly used for forecasting the value of a dependent variable based on the value of an independent variable. The model assumes a linear relationship between the two variables. The formula for simple linear regression is:
Y = β₀ + β₁X + ϵ
Let’s break down the components of this equation to better understand what each term represents:
1. Dependent Variable (Y): the outcome you are trying to predict, such as a house price or a salary.
2. Independent Variable (X): the input feature used to make the prediction, such as square footage or years of experience.
3. Intercept (β₀): the predicted value of Y when X is 0; graphically, the point where the regression line crosses the Y-axis.
4. Slope (β₁): the change in Y for every one-unit increase in X.
5. Error Term (ϵ): the difference between the actual and predicted values, capturing factors the model does not account for.
Application Example: Predicting House Prices
Let’s consider a simple example: we want to predict the prices of houses based on their size in square feet. Suppose the intercept β₀ is 50,000 (the base price) and the slope β₁ is 200 (the price increase per additional square foot).
Given these parameters, the formula might look like this:
Y=50,000+200×X+ϵ
If we have a house that is 2,000 square feet, the predicted price would be:
Y=50,000+200×2,000=50,000+400,000=450,000
However, due to the error term ϵ, the actual price may vary from this prediction based on other factors not included in the model.
Also Read: Linear Regression Model in Machine Learning: Concepts, Types, And Challenges in 2025
In machine learning, particularly in supervised learning tasks like regression and classification, the hypothesis function and hypothesis space play crucial roles in shaping how the model makes predictions. These concepts are integral to understanding how machine learning algorithms generate predictions from data.
Let’s dive deeper into these terms to grasp their significance in model building and prediction.
The hypothesis function refers to a mathematical model or function that a machine learning algorithm uses to make predictions. It’s essentially the function that maps input features (e.g., X) to output predictions (e.g., Y).
For simple linear regression, the hypothesis function takes the form h_θ(X) = θ₀ + θ₁X.
Where:
- h_θ(X) is the predicted output for input X,
- θ₀ is the intercept (bias) term, and
- θ₁ is the slope (weight) applied to the input feature.
The hypothesis function represents the best-fit line that the model uses to predict the target values based on the input features.
Example
If you are predicting house prices based on square footage, the hypothesis function could be h_θ(X) = θ₀ + θ₁X, where X is the square footage, θ₀ is the baseline price, and θ₁ is the rate at which the price increases with square footage.
The hypothesis space refers to the set of all possible hypothesis functions that could be used to model the data. It’s the space of all potential functions that a model can learn from the data, each representing a different way to map the input features X to the predicted output Y. The goal of machine learning algorithms is to find the best hypothesis within this space.
In linear regression, the hypothesis space includes all of the possible lines, or straight-line functions, that could be drawn through the data. Each line represents a different combination of the intercept θ₀ and slope θ₁. This space is continuous and can include any line with any possible slope and intercept.
In classification problems, the hypothesis space may consist of all the possible decision boundaries that could be drawn to separate the classes. For instance, in Support Vector Machines (SVM), the hypothesis space is composed of various hyperplanes that can divide the classes in the feature space.
Example
For simple linear regression with one independent variable (e.g., square footage), the hypothesis space is the set of all lines with different slopes and intercepts that could fit the data. The best-fit line found through training minimizes the error between predicted and actual values.
How Do Hypothesis Functions and Hypothesis Space Relate?
Hypothesis functions form the hypothesis space, and a learning algorithm's task is to explore this space and find the hypothesis function that best fits the data. For regression problems, this means finding the line in a 2D feature space or hyperplane, in higher dimensions, that minimizes the difference between the predicted values and the actual target values.
In simple linear regression, the hypothesis function is a line, and the hypothesis space consists of all possible lines that could be drawn based on different values of the slope (θ₁) and intercept (θ₀). The goal is to find the specific line that minimizes the cost function, typically using methods like OLS or gradient descent.
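To make this concrete, here is a minimal sketch that evaluates a few candidate hypotheses from the hypothesis space against the small square-footage/price table used later in this article; the candidate (θ₀, θ₁) pairs are illustrative:
import numpy as np

# Illustrative data: square footage vs. price
X = np.array([1000, 1500, 2000, 2500], dtype=float)
y = np.array([300000, 400000, 500000, 600000], dtype=float)

def h(theta0, theta1, X):
    # Hypothesis function: one straight line, defined by a single (theta0, theta1) pair
    return theta0 + theta1 * X

# A few candidate hypotheses drawn from the (continuous) hypothesis space
candidates = [(0, 250), (50000, 220), (100000, 200)]
for theta0, theta1 in candidates:
    mse = np.mean((h(theta0, theta1, X) - y) ** 2)
    print(f"theta0={theta0}, theta1={theta1} -> MSE={mse:,.0f}")

# The learning algorithm searches this space for the pair with the lowest error;
# for this toy data, theta0=100000 and theta1=200 fit the points exactly (MSE = 0).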
Master the art of hypothesis testing with upGrad’s Hypothesis Testing Course Online, designed by top universities. Learn essential statistical techniques that have real-world applications across industries. Gain hands-on experience and become proficient in analyzing data and making data-driven decisions.
Simple linear regression in machine learning is a method used to model the connection between a dependent variable (Y) and an independent variable (X). The goal of this technique is to find the best-fit line that minimizes the difference between the predicted values of Y and the actual values from the dataset.
This section of the article will explore how the best-fit line is determined using loss functions, optimization techniques like gradient descent, and the model's work to minimize prediction errors.
In simple linear regression, the best-fit line is a straight line that best captures the association between the independent variable (X) and the dependent variable (Y). The equation is represented as follows:
Y = β₀ + β₁X + ϵ
Where:
- Y is the dependent variable,
- X is the independent variable,
- β₀ is the intercept,
- β₁ is the slope, and
- ϵ is the error term (the difference between the predicted and actual value for a data point).
The best-fit line is the line that minimizes the sum of these error terms across all data points, ensuring that the model’s predictions are as close as possible to the actual values.
This explains what simple regression is and what linear regression in machine learning means: a method for fitting the most accurate line to the data points. The best-fit line minimizes the sum of error terms across all data points, improving the model's predictions.
The key to finding the best-fit line is reducing the error between the predicted values and the actual values. This is done through a loss function, which quantifies the mistake. The most commonly used loss function in linear regression is Mean Squared Error (MSE), which calculates the average of the squared differences between the predicted and actual values.
The formula for MSE is:
MSE = (1/n) Σ (Yᵢ − Ŷᵢ)²
Where:
- n is the number of data points,
- Yᵢ is the actual value for the i-th observation, and
- Ŷᵢ is the predicted value for the i-th observation.
The goal is to minimize MSE, which means reducing the average squared error across all data points. The smaller the MSE, the better the model’s predictions are.
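The calculation itself is only a few lines of code. Here is a minimal sketch of MSE in NumPy; the actual and predicted prices below are made-up values for illustration:
import numpy as np

def mean_squared_error_manual(y_true, y_pred):
    # MSE: average of the squared differences between actual and predicted values
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

# Illustrative actual vs. predicted house prices
y_actual = np.array([300000, 400000, 500000, 600000])
y_predicted = np.array([310000, 390000, 505000, 595000])
print(mean_squared_error_manual(y_actual, y_predicted))  # 62500000.0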
To minimize the loss function (MSE), we need to adjust the parameters β₀ and β₁ of the regression line. This is where optimization comes in, and the most common optimization technique used in machine learning is gradient descent.
Gradient descent is an iterative process in which the model adjusts its parameters step-by-step to find the minimum value of the loss function. In each iteration, the parameters are updated in the direction that reduces the error using the gradient (the derivative of the loss function).
The update rule for gradient descent is:
β₁ := β₁ − α · ∂MSE/∂β₁
β₀ := β₀ − α · ∂MSE/∂β₀
Where:
- α is the learning rate, which controls the size of each update step, and
- ∂MSE/∂β₁ and ∂MSE/∂β₀ are the partial derivatives (gradients) of the loss function with respect to the slope and intercept.
The goal of gradient descent is to minimize MSE by adjusting the slope (β₁) and intercept (β₀) until the loss function converges to its smallest possible value.
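As a rough illustration of these update rules, here is a minimal from-scratch sketch in NumPy; the data, learning rate, and iteration count are illustrative choices rather than tuned values:
import numpy as np

# Toy data that follows y = 1 + 2x exactly (values kept small so a simple learning rate works)
X = np.array([1.0, 1.5, 2.0, 2.5])
y = np.array([3.0, 4.0, 5.0, 6.0])

beta_0, beta_1 = 0.0, 0.0   # starting values for the intercept and slope
alpha = 0.1                 # learning rate
n = len(X)

for _ in range(5000):
    y_pred = beta_0 + beta_1 * X
    error = y_pred - y
    # Partial derivatives of MSE with respect to beta_0 and beta_1
    grad_b0 = (2 / n) * np.sum(error)
    grad_b1 = (2 / n) * np.sum(error * X)
    # Gradient descent update rules
    beta_0 -= alpha * grad_b0
    beta_1 -= alpha * grad_b1

print(f"Intercept (beta_0): {beta_0:.3f}")  # approaches 1.000
print(f"Slope (beta_1): {beta_1:.3f}")      # approaches 2.000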
Example: Simple Linear Regression in Action
Imagine you're building a model to predict house prices based on square footage. Your dataset consists of the following:
| Square Footage (X) | Price (Y) |
| --- | --- |
| 1000 | 300,000 |
| 1500 | 400,000 |
| 2000 | 500,000 |
| 2500 | 600,000 |
You would first initialize the slope (β₁) and intercept (β₀) randomly, then use gradient descent to update these parameters iteratively. After several iterations, the model converges to values of β₁ and β₀ that minimize the MSE and produce the best-fit line.
After multiple iterations of gradient descent, the model converges, meaning the parameters no longer change significantly, and the MSE is minimized. The final regression line represents the best relationship between the independent variable X (e.g., square footage) and the dependent variable Y (e.g., house price). This line can now be used to make predictions for new data points.
For example, with a house that has 1800 square feet, you can use the learned model to predict the price by plugging X=1800 into the final equation of the regression line.
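For instance, assuming (purely for illustration) that training converged to an intercept of about 100,000 and a slope of about 200, consistent with the table above, the prediction for an 1,800-square-foot house would be:
beta_0, beta_1 = 100_000, 200   # illustrative converged parameters, consistent with the table above
x_new = 1800                    # square footage of the new house
predicted_price = beta_0 + beta_1 * x_new
print(predicted_price)          # 460000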
Unlock the full potential of neural networks and Gradient Descent with upGrad’s Fundamentals of Deep Learning and Neural Networks course. Gain expert-led training, hands-on experience, and earn a free certification to enhance your deep learning skills and advance your career.
Simple Linear Regression is a widely used machine learning algorithm for modeling the relationship between one independent variable (X) and a dependent variable (Y). However, for the model to provide accurate predictions, certain assumptions need to hold about the data and the relationship between the variables.
Here, we explore the key assumptions of simple linear regression and explain their significance in ensuring the model's validity:
The first and most fundamental assumption in simple linear regression is that there is a linear relationship between the independent variable X and the dependent variable Y.
The linearity assumption forms the basis of the simple linear regression definition: a model in which Y changes linearly with changes in X. This means that changes in X should result in proportional changes in Y. The relationship can be represented by the equation:
Y = β₀ + β₁X + ϵ
Where:
- Y is the dependent variable,
- X is the independent variable,
- β₀ is the intercept,
- β₁ is the slope, and
- ϵ is the error term.
Why It Matters?
Linearity ensures that the model’s predictions are reliable and make sense. If the relationship between the variables is non-linear, simple linear regression will not adequately capture the data's underlying patterns.
Example
In a housing price prediction model, if the price of a house increases with square footage in a non-linear fashion (e.g., exponentially), using simple linear regression will yield poor predictions, and a more complex model, like polynomial regression, may be needed.
Homoscedasticity refers to the assumption that the variance of the error terms (ϵ) is constant across all levels of the independent variable X. In other words, the spread of residuals should remain the same regardless of the value of X.
Why It Matters?
If the error variance is not constant (i.e., heteroscedasticity), it suggests that the model’s predictions are less reliable at certain levels of the independent variable. This violates the assumption of constant error variance, leading to inefficient parameter estimates and potentially biased results.
Example
In a model predicting income based on age, if the error variance increases as age increases (e.g., larger income disparities at older ages), the assumption of homoscedasticity is violated. This would require a different model or data transformation.
Test for Homoscedasticity
A scatter plot of residuals vs. fitted values can help detect heteroscedasticity. If the spread of residuals increases or decreases as the fitted values increase, heteroscedasticity is present.
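A minimal sketch of such a check, using scikit-learn and Matplotlib on synthetic data (the dataset and noise level below are made up for illustration):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear trend with constant noise, so residuals should look homoscedastic
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 100).reshape(-1, 1)
y = 3 * X.ravel() + 5 + rng.normal(0, 2, 100)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Residuals vs. fitted values: a roughly constant vertical spread suggests homoscedasticity,
# while a funnel shape (spread growing with the fitted values) suggests heteroscedasticity
plt.scatter(fitted, residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()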
This assumption states that the residuals (errors) should be independent of each other. In simple linear regression, each data point should provide a unique, non-correlated observation.
Why It Matters?
If the observations are correlated (e.g., time-series data or clustered data), this assumption is violated, leading to inefficiency in the estimated coefficients and incorrect standard errors. This may lead to misleading statistical inference, such as inaccurate p-values.
Example
In stock price prediction, consecutive days' prices are likely correlated, and using simple linear regression without addressing the autocorrelation in residuals can result in unreliable predictions. Time-series models like ARIMA would be more appropriate in such cases.
Test for Independence
Autocorrelation plots and the Durbin-Watson test are commonly used to check for autocorrelation of residuals in time-series data.
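As a sketch, statsmodels provides a durbin_watson function that can be applied to the residuals of a fitted model; the residuals below are synthetic stand-ins:
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# In practice these residuals would come from your fitted regression model;
# here they are synthetic, independent values for illustration
rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 100)

dw_statistic = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw_statistic:.3f}")  # values near 2 suggest no autocorrelation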
The residuals, or the differences between the observed values and the predicted values (Y − Ŷ), should be normally distributed. This assumption is essential for inference, particularly for hypothesis testing and constructing confidence intervals around the regression coefficients.
Why It Matters?
The assumption of normality is critical for hypothesis testing and constructing confidence intervals around the coefficients. When residuals deviate significantly from normality, statistical tests, such as t-tests or F-tests, may not be valid, leading to inaccurate conclusions about the model’s performance.
Example
In a sales prediction model, if the residuals show a skewed distribution (e.g., large positive errors indicating under-prediction or large negative errors indicating over-prediction), the assumption of normality is violated, and alternative methods like generalized least squares or transformations might be necessary.
Test for Normality
A Q-Q plot or Shapiro-Wilk test can assess the normality of residuals. If the residuals deviate significantly from a straight line in the Q-Q plot, this suggests that they are not normally distributed.
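A sketch of both checks using SciPy; the residuals below are synthetic stand-ins for those of a fitted model:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for the residuals of a fitted regression model
rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, 100)

# Shapiro-Wilk test: a small p-value (e.g., below 0.05) suggests the residuals are not normal
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {stat:.3f}, p-value: {p_value:.3f}")

# Q-Q plot: points lying close to the diagonal line suggest approximately normal residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot of Residuals")
plt.show()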
This assumption is inherently satisfied in simple linear regression since there is only one independent variable. However, in multiple linear regression, multicollinearity refers to the condition where two or more independent variables are highly correlated with each other.
Why It Matters?
When multicollinearity exists, it becomes difficult to determine the individual effect of each independent variable on the dependent variable because their effects overlap. This leads to unreliable coefficient estimates and inflated standard errors.
Example
In a model predicting sales based on both advertising expenditure and number of stores, if the two variables are highly correlated (e.g., sales increase with both more ads and more stores), it may be challenging to isolate the effect of each variable.
Test for Multicollinearity
The Variance Inflation Factor (VIF) can be used to check for multicollinearity. A VIF greater than 10 indicates a potential problem with multicollinearity.
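A short sketch of a VIF check with statsmodels; this only becomes relevant with two or more predictors, and the column names below are made up for illustration:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Two illustrative (and deliberately correlated) predictors for a multiple regression
features = pd.DataFrame({
    'ad_spend':   [10, 20, 30, 40, 50, 60],
    'num_stores': [1, 2, 3, 5, 5, 6],
})

X = add_constant(features)  # adds the intercept column that the VIF calculation expects
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # VIF values above roughly 10 signal problematic multicollinearity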
Autocorrelation occurs when residuals from one observation are correlated with residuals from another. In simple linear regression, we assume that the residuals are independent of each other.
Why it Matters?
Autocorrelation can lead to inefficient estimates of regression coefficients and affect the standard errors, which in turn impact hypothesis testing and confidence intervals.
How to Check?
The Durbin-Watson test is commonly used to check for autocorrelation in residuals. A value close to 2 indicates no autocorrelation.
Example
In a dataset of monthly sales, if errors from one month are correlated with errors from another month, it would violate the independence assumption.
If you're interested in learning more about regression in machine learning, explore upGrad’s free Logistic Regression for Beginners course. It dives into both univariate and multivariate models, providing you with practical insights and real-world applications to enhance your data analysis and prediction skills.
In this section of the article, we will demonstrate how Linear Regression can be applied to a real-world example using the Salary_Data.csv dataset. The goal is to use experience as the independent variable to predict the salary of employees using simple linear regression.
Dataset Overview
The Salary_Data.csv dataset contains two key columns:
- Experience (Years): the number of years of work experience (the independent variable), and
- Salary (Thousands): the employee's salary (the dependent variable).
Here’s a sample of what the dataset might look like:
| Experience (Years) | Salary (Thousands) |
| --- | --- |
| 1 | 40 |
| 2 | 45 |
| 3 | 50 |
| 4 | 55 |
| 5 | 60 |
| 6 | 65 |
| 7 | 70 |
The primary goal of this analysis is to map experience to salary. We are trying to find the best-fit line, also known as the regression line, that can predict salary based on years of experience. This is a classic example of simple linear regression, where we have a single independent variable (Experience) and a dependent variable (Salary).
The Equation for Simple Linear Regression
The general equation for simple linear regression is:
Y = β₀ + β₁X + ϵ
Where:
- Y is the predicted salary,
- X is the years of experience,
- β₀ is the intercept (the expected salary at zero years of experience),
- β₁ is the slope (the salary increase per additional year of experience), and
- ϵ is the error term.
In this case, we will fit a line to the data and determine the values of β₀ and β₁, which will allow us to predict salary based on an employee’s years of experience.
Goal: Map Experience to Salary
The objective is to predict an employee's salary given their years of experience. Using the dataset, we want to establish a linear relationship between experience and salary and evaluate how well the regression model fits the data.
By finding the regression line, we will be able to make salary predictions for new data points. For example, if an employee has 8 years of experience, we can predict their salary based on the regression model.
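As a preview of that prediction, here is a minimal sketch using scikit-learn on the sample rows shown above (the full Salary_Data.csv is assumed to follow the same pattern):
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample rows from the table above: years of experience vs. salary (in thousands)
experience = np.array([[1], [2], [3], [4], [5], [6], [7]])
salary = np.array([40, 45, 50, 55, 60, 65, 70])

model = LinearRegression()
model.fit(experience, salary)

print(f"Intercept (beta_0): {model.intercept_:.2f}")                  # 35.00 for this sample
print(f"Slope (beta_1): {model.coef_[0]:.2f}")                        # 5.00 for this sample
print(f"Predicted salary for 8 years: {model.predict([[8]])[0]:.2f}") # 75.00, i.e., 75,000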
Next Steps: Implementing Linear Regression in Python
Now that we understand the dataset and the goal, let’s proceed with implementing simple linear regression in Python using the Salary_Data.csv dataset. In the next section, we’ll walk through the process of loading the dataset, performing data analysis, and applying linear regression using Python’s scikit-learn library.
This example will demonstrate not only how to implement linear regression but also how to evaluate the model's performance. Metrics like mean squared error and R-squared will be used to assess the model’s accuracy in predicting salaries based on experience.
If you're looking to deepen your understanding of Python for machine learning, upGrad’s Learn Basic Python Programming course is the perfect place to start. You’ll gain a solid foundation in Python with hands-on exercises and real-world applications, helping you to master key concepts like linear regression and more.
Here, we will walk through the steps to implement a Simple Linear Regression model from scratch using Python and scikit-learn. Each step will be broken down into detailed instructions, covering everything from data preparation to model evaluation.
The first step in building a regression model is data preparation. You will need to load the dataset, inspect the data for any missing values, and visualize the relationship between the features and the target variable.
Importing Libraries and Datasets
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = load_iris()
X = iris.data[:, 2].reshape(-1, 1) # Petal length (feature)
y = iris.data[:, 3] # Petal width (target)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Visualize a Scatter Plot
A scatter plot helps us understand the relationship between the independent variable (X) and the dependent variable (y). In this case, we will plot petal length against petal width from the Iris dataset loaded above.
This kind of scatter plot illustrates what simple linear regression is: identifying a straight-line relationship between one input variable and one output variable.
# Plotting the data
plt.figure(figsize=(8, 6))
plt.scatter(X_train, y_train, color='blue', label='Train data')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('Petal Length vs Petal Width')
plt.legend()
plt.show()
Expected Output (Visualization)
The scatter plot will display petal length on the x-axis and petal width on the y-axis. If the data has a linear relationship, we should see a clear upward or downward trend in the points. This visualization is crucial because Simple Linear Regression assumes a linear relationship between the input (X) and output (Y).
Before applying any machine learning algorithm, it's important to split the dataset into training and test sets. The training set is used to train the model, while the test set is used to evaluate the model's performance. This ensures that the model does not overfit the data it has seen and can generalize well to unseen data.
Train-Test Split Logic
We will use 80% of the data for training and 20% for testing.
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
# Generating a toy dataset with 1 independent variable (X) and 1 dependent variable (y)
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)
# Splitting the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Display the shapes of the resulting sets
print(f"Training Features: {X_train.shape}, Testing Features: {X_test.shape}")
print(f"Training Labels: {y_train.shape}, Testing Labels: {y_test.shape}")
Output
Training Features: (80, 1), Testing Features: (20, 1)
Training Labels: (80,), Testing Labels: (20,)
Explanation
make_regression generates a synthetic dataset with one feature and a roughly linear target, and train_test_split reserves 20% of the samples for testing (test_size=0.2). Setting random_state=42 makes the split reproducible, and the printed shapes confirm the 80/20 division.
Now, we will create the Linear Regression model and train it on the training data. We use scikit-learn's LinearRegression to fit the model.
Code for Training the Model
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Example dataset (e.g., predicting house prices based on square footage)
# Features (X) - square footage of houses
# Target (y) - price of houses
X = np.array([[500], [1000], [1500], [2000], [2500], [3000], [3500], [4000], [4500], [5000]]) # Square footage
y = np.array([200000, 250000, 300000, 350000, 400000, 450000, 500000, 550000, 600000, 650000]) # House prices
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the Linear Regression model
model = LinearRegression()
# Train the model using the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Output the results
print("Predicted values: ", y_pred)
print("Actual values: ", y_test)
# Evaluate the model's performance using Mean Squared Error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")
# Plotting the regression line
plt.scatter(X, y, color='blue') # Original data points
plt.plot(X, model.predict(X), color='red') # Best-fit regression line
plt.title("Simple Linear Regression: House Prices vs Square Footage")
plt.xlabel("Square Footage")
plt.ylabel("Price")
plt.show()
Explanation of the Code
The model is trained on 70% of the square-footage/price pairs and evaluated on the remaining 30%. Because the sample prices lie exactly on a straight line (Price = 150,000 + 100 × Square Footage), the fitted line reproduces the data essentially perfectly.
Output
The predicted values match the actual test values, the Mean Squared Error is approximately 0, and the R-squared is 1.0 (up to floating-point precision).
Once the model is trained, we can make predictions on the test set. This will help us assess how well the model generalizes to unseen data.
Code for Model Testing and Prediction
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Example: Predicting house prices based on square footage
# Creating a simple dataset
data = {
'Square_Feet': [1500, 1800, 2400, 3000, 3500, 4000, 5000],
'Price': [400000, 450000, 500000, 600000, 650000, 700000, 800000]
}
# Convert the data into a pandas DataFrame
df = pd.DataFrame(data)
# Define independent and dependent variables
X = df[['Square_Feet']] # Independent variable (feature)
y = df['Price'] # Dependent variable (target)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Display the results
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Plotting the results
plt.scatter(X_test, y_test, color='blue', label='Actual Data')
plt.plot(X_test, y_pred, color='red', label='Predicted Line')
plt.title("Simple Linear Regression: House Price Prediction")
plt.xlabel("Square Feet")
plt.ylabel("Price")
plt.legend()
plt.show()
Explanation of the Code
A small DataFrame of square footage and price is split 70/30 into training and test sets, a LinearRegression model is fitted on the training portion, and its predictions on the test portion are scored with MSE and R-squared before being plotted against the actual test points.
Output
For this small, nearly linear dataset the model fits the test points closely: the R-squared is close to 1, and the MSE is small relative to prices in the hundreds of thousands. The exact printed values depend on which rows fall into the 30% test split.
To assess the performance of the model, we use several evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared. The code below computes all four.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from math import sqrt
# Load the Iris dataset and select two features for simplicity
data = load_iris()
X = data.data[:, :1] # Sepal length (as the independent variable)
y = data.data[:, 3] # Petal width (as the dependent variable)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Evaluate the model with evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = sqrt(mse)
r2 = r2_score(y_test, y_pred)
# Display the results
print(f'Mean Absolute Error (MAE): {mae:.4f}')
print(f'Mean Squared Error (MSE): {mse:.4f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.4f}')
print(f'R-squared (R²): {r2:.4f}')
Output Explanation
Here’s what each of the evaluation metrics tells you:
MAE calculates the average of the absolute differences between the predicted and actual values.
Example Output
Mean Absolute Error (MAE): 0.3056
MSE squares the differences between predicted and actual values, giving more weight to large errors. A smaller MSE value indicates a better model fit.
Example Output
Mean Squared Error (MSE): 0.1379
RMSE is the square root of MSE, which brings the error measure back to the dependent variable's original unit. Because the error term is squared, it is more sensitive to outliers.
Example Output
Root Mean Squared Error (RMSE): 0.3712
R² represents the proportion of the variance in the dependent variable that is predictable from the independent variable. It typically ranges from 0 to 1, where 1 indicates a perfect fit.
Example Output
R-squared (R²): 0.6582
Interpretation of Results
With an R-squared of about 0.66, roughly two-thirds of the variance in the dependent variable is explained by the independent variable, while the MAE and RMSE show that predictions are typically off by around 0.3 cm. This is a reasonable, though not perfect, fit for a single-feature model.
Plotting the regression line over the training set helps us visualize how well the model fits the data and understand the trend it has learned.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Step 1: Generate some example data for linear regression
# For simplicity, we'll generate synthetic data for this example
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
# Step 2: Create and train the Simple Linear Regression model
model = LinearRegression()
model.fit(X, y)
# Step 3: Predict the results on the training data
y_pred = model.predict(X)
# Step 4: Visualize the training set results
plt.figure(figsize=(8, 6))
# Plot the actual data points
plt.scatter(X, y, color='blue', label='Actual Data')
# Plot the regression line
plt.plot(X, y_pred, color='red', label='Regression Line')
# Add titles and labels
plt.title('Simple Linear Regression - Training Set Results')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (y)')
plt.legend()
# Show the plot
plt.show()
Output
A scatter plot of the synthetic data points with the fitted regression line running through them.
Explanation
The blue points are the generated samples, and the red line is the model's prediction across the range of X. Because the data was generated with only moderate noise, the line follows the overall trend of the points closely.
Finally, we can visualize the regression line on the test set to see how well the model performs on new, unseen data.
Code for Visualizing Test Set Results
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
# Load the iris dataset
iris = load_iris()
X = iris.data[:, :1] # using only the first feature for simplicity
y = iris.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train the Simple Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Predict on the test set
y_pred = model.predict(X_test_scaled)
# Visualize the Test Set Results
plt.figure(figsize=(10, 6))
# Plot the test set results (sorting by the feature so the regression line is drawn left to right)
order = X_test_scaled[:, 0].argsort()
plt.scatter(X_test_scaled, y_test, color='red', label='Actual Test Set', edgecolor='black')
plt.plot(X_test_scaled[order], y_pred[order], color='blue', label='Regression Line', linewidth=2)
# Adding labels and title
plt.title('Test Set Results vs. Predicted Values (Linear Regression)')
plt.xlabel('Scaled Feature')
plt.ylabel('Target Variable (Iris Species)')
plt.legend()
plt.show()
Explanation of the Code
The first feature (sepal length) is standardized with StandardScaler, a LinearRegression model is trained to predict the species label (encoded as 0, 1, or 2), and the test-set points are plotted alongside the fitted line. Treating a categorical species label as a continuous target is only a toy illustration; as noted in the FAQ below, a classifier such as logistic regression would be the appropriate model for this kind of target.
To deepen your overall understanding of Python integration, explore upGrad’s Learn Python Libraries: NumPy, Matplotlib & Pandas. Gain hands-on experience manipulating data with NumPy, visualizing insights with Matplotlib, and analyzing datasets using Pandas, all while applying these tools to enhance your Simple Linear Regression models.
Now that you’ve gained an understanding of Simple Linear Regression and its application in machine learning, it’s time to test your knowledge. Answer the following multiple-choice questions to see how well you grasp the key concepts discussed in the tutorial. Good luck!
1. What is the primary goal of Simple Linear Regression?
A) To classify data points into different categories
B) To find the relationship between two continuous variables
C) To cluster data points into different groups
D) To reduce the dimensionality of data
2. In the equation Y = β₀ + β₁X + ϵ, what does β₁ represent?
A) The intercept of the regression line
B) The slope of the regression line
C) The error term
D) The predicted value of 𝑌
3. What is the role of the error term 𝜖 in the Simple Linear Regression equation?
A) It represents the total sum of squares
B) It measures the deviation between the predicted and actual values
C) It adjusts the intercept for better fitting
D) It eliminates outliers in the data
4. Which of the following is a key assumption of Simple Linear Regression?
A) The independent and dependent variables are normally distributed
B) The relationship between the variables is linear
C) The data contains no noise
D) The dependent variable is categorical
5. In Simple Linear Regression, what does the coefficient 𝛽0 represent?
A) The slope of the regression line
B) The mean of the dependent variable
C) The intercept of the regression line
D) The total variance of the dataset
6. When is Simple Linear Regression most appropriate to use?
A) When the relationship between variables is non-linear
B) When there are multiple independent variables
C) When there is one independent variable and the relationship is linear
D) When the data is high-dimensional
7. What happens if the features in the dataset are not scaled before applying Simple Linear Regression?
A) The model will still work, but it may take longer to converge
B) The model will fail to converge
C) The model will produce biased predictions
D) The model will automatically scale the features
8. Which of the following is the main method used to estimate the parameters 𝛽0 and 𝛽1 in Simple Linear Regression?
A) Cross-validation
B) Ordinary Least Squares (OLS)
C) K-fold validation
D) Backpropagation
9. What does the slope 𝛽1 indicate in the context of Simple Linear Regression?
A) The predicted value of the dependent variable
B) The change in the dependent variable for a one-unit change in the independent variable
C) The average value of the dependent variable
D) The degree of correlation between the variables
10. Which of the following would indicate a poor fit of the regression model to the data?
A) A low 𝑅2 value
B) A high 𝑅2 value
C) A steep slope value
D) A low error term
Simple linear regression is a foundational concept in machine learning that models the relationship between two continuous variables using a straight line by assuming a linear relationship between the variables 𝑋 and 𝑌. It allows us to make predictions based on the best-fit line that minimizes the error between the predicted and actual values.
To truly master simple linear regression and other advanced machine learning techniques, upGrad’s numerous Machine Learning and AI courses are perfect to equip you with the necessary skills. Through hands-on projects, you’ll gain experience in applying algorithms like linear regression to real-world datasets, building predictive models, and evaluating their performance.
With personalized mentorship and access to industry-relevant case studies, upGrad prepares you for high-demand roles like AI Engineer and Machine Learning Specialist, setting you up for success in the data-driven job market.
In addition to the programs we've discussed, here are some more courses designed to enhance your expertise and accelerate your path to success:
If you're unsure which career path is right for you, upGrad's personalized career guidance can help you find the perfect direction. Plus, visit your nearest upGrad center to begin your hands-on training and gain the skills needed to succeed in today’s competitive job market!
The regression line represents the best-fit line that models the relationship between the independent variable (X) and the dependent variable (Y). Its main purpose is to predict the dependent variable's value for given inputs of the independent variable. Minimizing the sum of squared errors ensures the best possible fit.
Simple Linear Regression is sensitive to outliers because they can significantly affect the slope and intercept of the regression line. Outliers can distort the relationship between X and Y, leading to less accurate predictions. It is essential to identify and handle outliers before fitting the model, often by removing or transforming them.
In Simple Linear Regression, the slope (𝛽1) represents the rate of change in the dependent variable (Y) for each unit change in the independent variable (X). It quantifies the strength and direction of the relationship between the two variables. A positive slope indicates a direct relationship, while a negative slope suggests an inverse relationship.
The performance of a Simple Linear Regression model is typically evaluated using metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared. R-squared indicates how well the regression line fits the data, while MSE and MAE help assess the accuracy of predictions by measuring the average errors.
Simple linear regression involves a single independent variable to predict the dependent variable, whereas multiple linear regression uses two or more independent variables. The latter can model more complex relationships between numerous predictors and the outcome variable, whereas simple linear regression assumes a one-to-one relationship.
Simple Linear Regression assumes: a linear relationship between X and Y, independence of the residuals, constant variance of the residuals (homoscedasticity), and normally distributed residuals.
No. Simple linear regression is primarily used for regression tasks, i.e., predicting continuous numerical values. However, it can be adapted for classification tasks by using techniques like logistic regression (for binary classification) or probit regression. The primary limitation is its inability to predict discrete categories effectively.
In simple linear regression, collinearity is not an issue because only one independent variable is used. However, in multiple linear regression, collinearity can affect the stability and interpretation of the coefficients. For simple linear regression, the focus is on the relationship between a single predictor and the dependent variable, making collinearity less of a concern.
R-squared measures how well the regression line fits the data, representing the proportion of variance in the dependent variable that is explained by the independent variable. An R-squared value of 1 indicates perfect prediction, while a value closer to 0 suggests that the model doesn’t explain much of the variance.
Simple linear regression is widely used in fields like finance (predicting stock prices), economics (modeling supply and demand), and healthcare (predicting patient outcomes based on clinical variables). For example, it can predict sales based on advertising expenditures or house prices based on square footage.
When the relationship between the independent and dependent variables is non-linear, Simple Linear Regression might not perform well. In such cases, transformations such as logarithmic, polynomial, or exponential transformations can be applied to make the relationship linear. Alternatively, more complex models like polynomial regression or decision trees can be used.