Linear regression is one of the most common algorithms for establishing relationships between the variables of a dataset. A mathematical model is a necessary tool for data scientists in performing predictive analysis. This blog will fill you in on the fundamental concept and also discuss a linear regression example.
What are Regression Models?
A regression model describes the relationship between dataset variables by fitting a line to the data observed. It is a mathematical analysis that sorts out which variables have an impact and matter the most. It also determines how certain we are about the factors involved. The two kinds of variables are:
- Dependent: Factor that you are attempting to predict or understand.
- Independent: Factors that you suspect to have an impact on the dependent variable.
Regression models are typically used when the dependent variable is quantitative, although logistic regression handles a binary dependent variable. In this blog, we will mainly focus on the linear regression model, where both variables are quantitative.
Suppose you have data on the monthly sales and average monthly rainfall for the past three years. Let’s say that you plotted this information on a chart. The y-axis represents the number of sales (dependent variable), and the x-axis depicts the average rainfall (independent variable). Each dot on the chart would show how much it rained during a particular month and the corresponding sales numbers.
If you take another glance at the data, you might notice a pattern: sales appear to be higher in the months when it rained more. But it would be tricky to estimate how much you would typically sell when it rained a certain amount, say 3 or 4 inches. You could get some degree of certainty if you drew a line through the middle of all data points on the chart.
Nowadays, Excel and statistics software like SPSS, R, or STATA can help you draw a line that best fits the data at hand. In addition, you can also output a formula explaining the slope of the line.
Consider this formula for the above example: Y = 200 + 3X. It tells you that you would sell 200 units when it doesn’t rain at all (i.e., when X = 0). Assuming that the relationship stays the same going forward, every additional inch of rain would result in three more units sold on average. You would sell 203 units if it rains 1 inch, 206 units if it rains 2 inches, 209 units if it rains 3 inches, and so on.
Typically, the regression line formula also includes an error term (Y = 200 + 3X + error term). It takes into account the reality that independent variables may not always be perfect predictors of dependent variables, and that the line merely gives you an estimate based on the data available. The larger the error term, the less certain your regression line is.
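The arithmetic in the rainfall example can be checked with a few lines of Python. This is only a minimal sketch: the function name predict_sales is hypothetical, the default coefficients are the illustrative 200 and 3 from the formula above, and the error term is not modelled.

```python
def predict_sales(rain_inches, intercept=200.0, slope=3.0):
    """Estimate units sold for a given rainfall in inches.

    The defaults are the illustrative values from Y = 200 + 3X;
    the error term of the regression is deliberately ignored.
    """
    return intercept + slope * rain_inches


# Print the estimate for 0 to 3 inches of rain.
for inches in range(4):
    print(inches, predict_sales(inches))
```

Passing different intercept and slope values lets you reuse the same function for any fitted line.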
Linear Regression Basics
A simple linear regression model uses a straight line to estimate the relationship between two quantitative variables. If you have more than one independent variable, you will use multiple linear regression instead.
Simple linear regression analysis is concerned with two things. First, it tells you the strength of the relationship between the dependent and independent factors of the historical data. Second, it gives you the estimated value of the dependent variable at a certain value of the independent variable.
Consider this linear regression example. A social researcher interested in knowing how individuals’ income affects their happiness levels performs a simple regression analysis to see if a linear relationship occurs. The researcher takes quantitative values of the dependent variable (happiness) and independent variable (income) by surveying people in a particular geographical location.
For instance, the data contains income figures and happiness levels (ranked on a scale from 1 to 10) from 500 people from the Indian state of Maharashtra. The researcher would then plot the data points and fit a regression line to know how much the respondents’ earnings influence their wellbeing.
Linear regression analysis is based on a few assumptions about the data. These are:
- Linearity of the relationship between the dependent and independent variable, i.e., the line of best fit is straight, not curved.
- Homogeneity of variance, meaning the size of the error in the prediction does not change significantly across different values of the independent variable.
- Independence of observations in the dataset, meaning there are no hidden relationships among them.
- Normality of the data distribution for the dependent variable (strictly speaking, of the model’s residuals). You can get a rough check by plotting a histogram with the hist() function in R.
The Math Behind Linear Regression
y = c + ax is a standard equation where y is the output (that we want to estimate), x is the input variable (that we know), a is the slope of the line, and c is the constant.
Here, the output varies linearly based on the input. The slope determines how much y changes for each unit change in x. The constant is the value of y when x is zero.
Let’s understand this through another linear regression example. Imagine that you are employed in an automobile company and want to study India’s passenger vehicle market. Let’s say that the national GDP influences passenger vehicle sales. To plan better for the business, you might want to find the linear equation that relates the number of vehicles sold in the country to the GDP.
For this, you would need sample data for year-wise passenger vehicle sales and the GDP figures for every year. You might discover that the GDP of the current year affects the sales for the next year: whenever the GDP was lower, vehicle sales were lower in the subsequent year.
To prepare this data for Machine Learning analytics, you would need to do a little more work.
- Please start with the equation y = c + ax, where y is the number of vehicles sold in a year and x is the GDP of the prior year.
- To find out c and a in the above problem, you can create a model using Python.
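As a rough illustration of the two steps above, the sketch below estimates c and a with ordinary least squares using only the Python standard library. The helper name fit_line is hypothetical, and the GDP and sales figures are made-up placeholders, not real data; the point is only to show how each year's sales are paired with the GDP of the prior year.

```python
def fit_line(x, y):
    """Return least-squares estimates (c, a) for y = c + a*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    a = sxy / sxx            # slope
    c = y_mean - a * x_mean  # constant (intercept)
    return c, a


# Made-up placeholder figures, NOT real GDP or sales data.
gdp_by_year = {2015: 2.0, 2016: 2.5, 2017: 3.0, 2018: 3.5}
sales_by_year = {2016: 5.0, 2017: 6.0, 2018: 7.0, 2019: 8.0}

# Pair each year's sales (y) with the GDP of the prior year (x).
years = sorted(yr for yr in sales_by_year if yr - 1 in gdp_by_year)
x = [gdp_by_year[yr - 1] for yr in years]
y = [sales_by_year[yr] for yr in years]

c, a = fit_line(x, y)
print(f"c = {c:.3f}, a = {a:.3f}")
```

In practice you would more likely reach for a statistics or machine learning library, but the closed-form calculation makes it clear what such a library computes for a single input variable.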
If you were to perform simple linear regression in R, interpreting and reporting the results becomes much easier.
For the same linear regression example, let us change the equation to y = B0 + B1x + e. Again, y is the dependent variable, and x is the independent or known variable. B0 is the constant or intercept, B1 is the slope or regression coefficient, and e is the error of the estimate.
Statistical software like R can find the line of best fit through the data by searching for the B0 and B1 that minimise the total error of the model.
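For reference, "total error" in this setting is usually taken to mean the sum of squared residuals (ordinary least squares, which is what R's lm() fits). Under that assumption, the objective and its well-known closed-form solution can be written as below. The symbols n (number of observations) and the sample means \bar{x} and \bar{y} are introduced here for illustration and do not appear in the text above.

```latex
\begin{align}
  (\hat{B}_0, \hat{B}_1) &= \arg\min_{B_0,\, B_1} \sum_{i=1}^{n} \left( y_i - B_0 - B_1 x_i \right)^2 \\
  \hat{B}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
  \hat{B}_0 &= \bar{y} - \hat{B}_1 \bar{x}
\end{align}
```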
Follow these steps to begin:
- Load the passenger vehicle sales dataset into the R environment.
- Run the command to generate a linear model that describes the relationship between passenger vehicle sales (the dependent variable) and GDP (the independent variable).
- sales.gdp.lm <- lm(sales ~ gdp, data = sales.data)
- Use the summary() function to view the most important linear model parameters in tabulated form.
Note: The output would contain sections like Call, Residuals, and Coefficients. The ‘Call’ section states the formula used. The ‘Residuals’ section lists the minimum, quartiles, median, and maximum of the residuals to indicate how well the model fits the real data. The first row of the ‘Coefficients’ table estimates the y-intercept, and the second row gives the regression coefficient. The columns of this table have labels like Estimate, Std. Error, t value, and Pr(>|t|) (the p-value).
- Plug the (Intercept) value and the GDP coefficient into the regression equation to predict sales values across the range of GDP numbers.
- Investigate the (Estimate) column to know the effect. The regression coefficient would tell you how much the sales change with the change in GDP.
- Find out the variation in your estimate of the relationship between sales and GDP from the (Std. Error) label.
- Look at the test statistic under (t value) to judge whether the result could have occurred by chance. The larger the absolute t value, the less likely that is.
- Go through the Pr(>|t|) column, i.e. the p-values, to see how likely it would be to observe an estimated effect of GDP on sales at least this large if the null hypothesis of no effect were true.
- Present your results with the estimated effect, standard error, and p-values, clearly communicating what the regression coefficient means.
- Include a graph with the report. A simple linear regression can be shown as a scatter plot with the regression line and function.
- Calculate the error by measuring the distance between the observed and predicted y values at each value of x, squaring those distances, and calculating their mean (the mean squared error).
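To make the quantities in the steps above more concrete, here is a minimal Python sketch that computes the slope estimate, its standard error, its t value, and the mean squared error by hand. The function name slope_summary and the sample numbers are hypothetical, and this is not how R implements summary(); it only uses the standard textbook formulas for simple linear regression.

```python
import math


def slope_summary(x, y):
    """Return intercept, slope, slope std. error, slope t value, and MSE."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # Residual = observed y minus predicted y at each x.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(r ** 2 for r in residuals)
    mse = sse / n                      # mean of the squared distances
    residual_variance = sse / (n - 2)  # two parameters were estimated
    std_error = math.sqrt(residual_variance / sxx)
    # The t value is left undefined (NaN) when the fit is exact.
    t_value = slope / std_error if std_error > 0 else math.nan
    return {
        "intercept": intercept,
        "slope": slope,
        "std_error": std_error,
        "t_value": t_value,
        "mse": mse,
    }


# Made-up illustrative numbers, not real sales or GDP data.
result = slope_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The p-value is omitted because it needs the cumulative distribution of Student's t with n - 2 degrees of freedom, which the Python standard library does not provide; a statistics package would be needed for that last step.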
With the above linear regression example, we have given you an overview of generating a simple linear regression model, finding the regression coefficient, and calculating the error of the estimate. We also touched upon the relevance of Python and R for predictive data analytics and statistics. Practical knowledge of such tools is crucial for pursuing careers in data science and machine learning today.
If you want to hone your programming skills, check out the Advanced Certificate Programme in Machine Learning by IIT Madras and upGrad. The online course also includes case studies, projects, and expert mentorship sessions to bring industry-orientedness to the training process.