Regression in Data Mining: Different Types of Regression Techniques [2021]

Supervised learning is a learning paradigm in which you train a machine learning algorithm on data that is already labelled, meaning the correct answer is known for every training example. After training, the algorithm is given a new set of unseen data, which it analyses to produce an outcome based on what it learnt from the labelled training data.

Unsupervised learning is where the algorithm is trained on information for which the correct labels are not known. Here the machine has to group the data according to patterns or correlations it discovers on its own, without any labelled examples to learn from.

Regression is a supervised machine learning technique that predicts a continuous-valued attribute. It analyses the relationship between a target (dependent) variable and one or more predictor (independent) variables. Regression is an important tool for data analysis and is widely used for time series modelling, forecasting, and similar tasks.

Regression involves fitting a curve or a straight line to the data points in such a way that the distances between the curve and the data points are minimized.

Though linear and logistic regression are the most popular types, there are many other types of regression that can be applied depending on how they perform on a particular dataset. These types differ in the number of independent variables, the type of dependent variable, and the shape of the regression curve that is fitted.


Linear Regression

Linear regression models the relationship between the target (dependent) variable and one or more independent variables using a best-fit straight line.

It is represented by the equation:

Y = a + b*X + e,

where a is the intercept, b is the slope of the regression line, and e is the error term. X and Y are the predictor and target variables respectively. When X is made up of more than one variable (or feature), the model is termed multiple linear regression.

The best-fit line is found using the least-squares method. This method minimizes the sum of the squared deviations from each data point to the regression line. Because the deviations are squared, negative and positive distances do not cancel each other out.
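
As a quick illustration, here is a minimal sketch of fitting a least-squares line with scikit-learn; the toy data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: X is the predictor, y the target (illustrative values only)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# Fit Y = a + b*X by minimizing the sum of squared deviations
model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)
print("slope b:", model.coef_[0])
print("prediction for X=6:", model.predict([[6]])[0])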

Polynomial Regression

In polynomial regression, the power of the independent variable is more than 1 in the regression equation. Below is an example:

Y = a + b*X^2

In this type of regression, the line of best fit is not a straight line as in linear regression. Instead, it is a curve fitted to the data points.

Polynomial regression can over-fit if you keep increasing the degree of the polynomial in an attempt to reduce the error. Hence, always try to fit a curve that generalizes to the problem rather than one that merely memorizes the training points.
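
A minimal sketch of polynomial regression with scikit-learn, where PolynomialFeatures expands X before a linear model is fitted; the degree and toy data here are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data following a roughly quadratic trend (illustrative values only)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 4.5, 9.1, 16.3, 24.8])

# Expand X to [1, X, X^2] and fit a linear model on the expanded features;
# a higher degree fits the training data more closely but risks over-fitting
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("prediction for X=6:", model.predict([[6]])[0])
```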

Logistic Regression

Logistic regression is used when the dependent variable is binary in nature (True or False, 0 or 1, success or failure). Here the model outputs a probability between 0 and 1 for the target (Y), which is why it is popularly used for classification problems. Logistic regression does not require the dependent and independent variables to have a linear relationship, as is the case in linear regression.
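
A minimal sketch of logistic regression on a binary target with scikit-learn; the pass/fail toy data is an assumption made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied vs. pass (1) / fail (0) -- illustrative values only
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

# The model outputs a probability between 0 and 1 via the sigmoid function
clf = LogisticRegression().fit(X, y)
print("P(pass | 3.5 hours):", clf.predict_proba([[3.5]])[0, 1])
print("predicted class:", clf.predict([[3.5]])[0])
```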


Ridge Regression

Ridge regression is a technique used to analyze multiple regression data that suffer from multicollinearity. Multicollinearity is the existence of a near-linear relationship between two or more independent variables.

When multicollinearity occurs, the least-squares estimates have low bias but high variance, so they can be very different from the true values. By adding a degree of bias to the regression estimates, ridge regression greatly reduces their variance and standard errors.
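
A minimal sketch of ridge regression with scikit-learn; alpha controls the strength of the L2 penalty, and the nearly collinear toy features are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data with two nearly collinear features (illustrative values only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

# The L2 penalty (alpha) adds a little bias but stabilizes the coefficients
model = Ridge(alpha=1.0).fit(X, y)
print("coefficients:", model.coef_)
```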

Lasso Regression

The term “LASSO” stands for Least Absolute Shrinkage and Selection Operator.

It is a type of linear regression that uses shrinkage: the coefficient estimates are shrunk towards a central point, typically zero. Because some coefficients can be shrunk exactly to zero, the lasso also performs feature selection, making it best suited to simple, sparse models with comparatively few parameters. Like ridge regression, it works well for models that suffer from multicollinearity.
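
A minimal sketch of lasso regression with scikit-learn, showing how the L1 penalty can zero out irrelevant coefficients; the toy data and alpha value are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: only the first of five features actually matters (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 4 * X[:, 0] + rng.normal(scale=0.1, size=100)

# The L1 penalty shrinks coefficients towards zero; coefficients of
# irrelevant features are typically zeroed out entirely
model = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", model.coef_)
```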

Conclusion

Regression analysis allows you to compare the effects of feature variables measured on widely different scales, such as predicting house prices from total area, locality, age, and furnishing. These results help market researchers and data analysts eliminate useless features and select the best set of features for building accurate predictive models.

If you are curious to learn about data science, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.
