An algorithm is a set of rules or instructions that are followed by a computer programme to implement calculations or perform other problem-solving functions. As data science is all about extracting meaningful information for datasets, there is a myriad of algorithms available to solve the purpose.
Data science algorithms can help in classifying, predicting, analyzing, detecting defaults, etc. The algorithms also make up the foundation of machine learning libraries such as scikit-learn. So, it helps to have a solid understanding of what is going on under the surface.
Commonly Used Data Science Algorithms
It is used for discrete target variables, and the output is in the form of categories. Clustering, association, and decision tree are how the input data can be processed to predict an outcome. For example, a new patient may be labelled as “sick” or “healthy” by using a classification model.
Regression is used to predict a target variable as well as to measure the relationship between target variables, which are continuous in nature. It is a straightforward method of plotting ‘the line of best fit’ on a plot of a single feature or a set of features, say x, and the target variable, y.
Regression may be used to estimate the amount of rainfall based on the previous correlation between the different atmospheric parameters. Another example is predicting the price of a house based on features like area, locality, age, etc.
Let us now understand one of the most fundamental building blocks of data science algorithms – linear regression.
3. Linear Regression
The linear equation for a dataset with N features can be given as: y = b0 + b1.x1 + b2.x2 + b3.x3 + …..bn.xn, where b0 is some constant.
For univariate data (y = b0 + b1.x), the aim is to minimize the loss or error to the smallest value possible for the returned variable. This is the primary purpose of a cost function. If you assume b0 to be zero and input different values for b1, you will find that the linear regression cost function is convex in shape.
Mathematical tools assist in optimizing the two parameters, b0 and b1, and minimize the cost function. One of them is discussed as follows.
4. The least squares method
In the above case, b1 is the weight of x or the slope of the line, and b0 is the intercept. Further, all the predicted values of y lie on the line. And the least squares method seeks to minimize the distance between each point, say (xi, yi), the predicted values.
To calculate the value of b0, find out the mean of all values of xi and multiplying them by b1 . Then, subtract the product from the mean of all yi. Also, you can run a code in Python for the value of b1 . These values would be ready to be plugged into the cost function, and the return value will be minimized for losses and errors. For example, for b0= -34.671 and b1 = 9.102, the cost function would return as 21.801.
5. Gradient descent
When there are multiple features, like in the case of multiple regression, the complex computation is taken care of by methods like gradient descent. It is an iterative optimization algorithm applied for determining the local minimum of a function. The process begins by taking an initial value for b0 and b1 and continuing until the slope of the cost function is zero.
Suppose you have to go to a lake that is located at the lowest point of a mountain. If you have zero visibility and are standing at the top of the mountain, you would begin at a point where the land tends to descend. After taking the first step and following the path of descent, it is likely that you will reach the lake.
While cost function is a tool that allows us to evaluate parameters, gradient descent algorithm can help in updating and training model parameters. Now, let’s overview some other algorithms for data science.
6. Logistic regression
While the predictions of linear regression are continuous values, logistic regression gives discrete or binary predictions. In other words, the results in the output belong to two classes after applying a transformation function. For instance, logistic regression can be used to predict whether a student passed or failed or whether it will rain or not. Read more about logistic regression.
7. K-means clustering
It is an iterative algorithm that assigns similar data points into clusters. To do the same, it calculates the centroids of k clusters and groups the data based on least distance from the centroid. Learn more about cluster analysis in data mining.
8. K-Nearest Neighbor (KNN)
The KNN algorithm goes through the entire data set to find the k-nearest instances when an outcome is required for a new data instance. The user specifies the value of k to be used.
9. Principal Component Analysis (PCA)
The PCA algorithm reduces the number of variables by capturing the maximum variance in the data into a new system of ‘principal components’. This makes it easy to explore and visualize the data.
The knowledge of the data science algorithms explained above can prove immensely useful if you are just starting out in the field. Understanding the nitty-gritty can also come in handy while performing day-to-day data science functions.
If you are curious to learn about data science, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.