Machine learning is one of the most important topics in Artificial Intelligence. It is broadly divided into Supervised and Unsupervised learning, which work with labelled and unlabelled data respectively, for data analysis or prediction. Supervised Learning covers two further types of business problems: Regression and Classification.
Classification is a supervised machine learning task in which we get labelled data as input and need to predict which class a new observation belongs to. If there are two classes, it is called Binary Classification. If there are more than two classes, it is called Multi Class Classification. In real world scenarios we tend to see both types of Classification.
In this article we will look at a few types of Classification Algorithms along with their pros and cons. Many classification algorithms are available, but let us focus on the following 5:
- Logistic Regression
- K Nearest Neighbor
- Decision Trees
- Random Forest
- Support Vector Machines
1. Logistic Regression
Even though the name suggests Regression, it is a Classification Algorithm. Logistic Regression is a statistical method for classifying data in which one or more independent variables (features) determine an outcome, measured with a variable (TARGET) that has two or more classes. Its main goal is to find the best fitting model to describe the relationship between the Target variable and the independent variables.
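Pros:
1) Easy to implement and interpret, efficient to train, and fast at classifying. It makes no assumptions about the distribution of the features, although it does assume a linear relationship between the features and the log-odds of the outcome.
2) Can be used for Multi Class Classification.
3) It is less prone to over-fitting on low dimensional datasets, but it can overfit on high dimensional ones.
Cons:
1) Overfits when there are fewer observations than features.
2) Can only predict discrete outcomes, since the target must be a categorical variable.
3) Cannot solve non-linear problems directly, because its decision boundary is linear.
4) Struggles to learn complex patterns, and more flexible models such as neural networks usually outperform it.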
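The model described in this section can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not a production implementation: the function names, the toy data, the learning rate and the number of epochs are all hypothetical choices, and training uses simple stochastic gradient descent on the log loss for a binary target.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of the positive class for one row of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def train(rows, labels, learning_rate=0.1, epochs=1000):
    """Fit weights and bias with stochastic gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            # Gradient of the log loss with respect to z is (prediction - label).
            error = predict_proba(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical toy data: one feature, class 1 when the feature is large.
rows = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train(rows, labels)

# Turn probabilities into classes with a 0.5 threshold.
predictions = [1 if predict_proba(weights, bias, row) >= 0.5 else 0
               for row in rows]
```

The 0.5 threshold is a convention rather than a requirement; it can be moved when false positives and false negatives have different costs.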
2. K Nearest Neighbor
The K-nearest neighbors (KNN) algorithm uses ‘feature similarity’, or ‘nearest neighbors’, to predict the class that a new data point falls into. The steps below explain how the algorithm works:
Step 1 − To implement any Machine Learning algorithm, we need a cleaned data set ready for modelling. Let’s assume that we already have a cleaned data set which has been split into training and testing data sets.
Step 2 − With the data sets ready, we need to choose the value of K (an integer), which tells us how many nearest data points to take into consideration. Choosing a good value of K is one of the challenges of this algorithm.
Step 3 − This step is iterative and needs to be applied to each data point in the testing data set:
- Calculate the distance between the test data point and each row of the training data using any of these distance metrics:
  - Euclidean distance
  - Manhattan distance
  - Minkowski distance
  - Hamming distance
  Many data scientists tend to use the Euclidean distance.
- Sort the training rows by the distance calculated in the previous step.
- Choose the top K rows from the sorted data.
- Assign a class to the test point based on the most frequent class among these rows.
Step 4 – End
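The steps above can be sketched in plain Python as follows. This is a minimal illustration, not an optimized implementation: the function names and the toy data are hypothetical, the default `p=3` for the Minkowski distance is an arbitrary choice, and ties in the vote are broken by whichever class was seen first.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):
    """Generalization: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    """Number of positions where two (categorical) rows differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_rows, train_labels, test_point, k=3, distance=euclidean):
    """Classify one test point by majority vote among its k nearest rows."""
    # Step 3a: distance from the test point to every training row.
    distances = [(distance(row, test_point), label)
                 for row, label in zip(train_rows, train_labels)]
    # Step 3b: sort by distance, smallest first.
    distances.sort(key=lambda pair: pair[0])
    # Step 3c: keep the labels of the top K rows.
    top_k_labels = [label for _, label in distances[:k]]
    # Step 3d: the most frequent class wins.
    return Counter(top_k_labels).most_common(1)[0][0]

# Hypothetical toy data: two well separated groups.
train_rows = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
train_labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(train_rows, train_labels, [2, 2], k=3)
near_b = knn_predict(train_rows, train_labels, [6, 5], k=3, distance=manhattan)
```

Note that every prediction scans the whole training set, which is why the algorithm needs to keep all training data in memory.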
Pros:
- Easy to use, understand and interpret.
- Quick to train, since there is no explicit training phase.
- No assumptions about data.
- Often gives high prediction accuracy.
- Versatile: can be used for both Classification and Regression business problems.
- Can be used for Multi Class problems as well.
- Only one main hyperparameter (K) to tweak at the hyperparameter tuning step.
Cons:
- Computationally expensive at prediction time and requires high memory, as the algorithm stores all the training data.
- The algorithm gets slower as the number of variables increases.
- Very sensitive to irrelevant features.
- Suffers from the curse of dimensionality.
- Choosing the optimal value of K is difficult.
- Class-imbalanced data sets cause problems.
- Missing values in the data also cause problems.
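One common way to approach the problem of choosing K, which the article does not describe and is shown here only as an assumption, is to try several candidate values and keep the one with the best validation accuracy. The sketch below uses leave-one-out validation on hypothetical toy data; `leave_one_out_accuracy` and `choose_k` are illustrative helper names, not part of any library.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, point, k):
    """Majority vote among the k training rows closest to the point."""
    ranked = sorted(zip(train_rows, train_labels),
                    key=lambda pair: math.dist(pair[0], point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(rows, labels, k):
    """Hold out each row in turn and predict it from all the others."""
    correct = 0
    for i in range(len(rows)):
        other_rows = rows[:i] + rows[i + 1:]
        other_labels = labels[:i] + labels[i + 1:]
        if knn_predict(other_rows, other_labels, rows[i], k) == labels[i]:
            correct += 1
    return correct / len(rows)

def choose_k(rows, labels, candidates):
    """Return the candidate K with the highest leave-one-out accuracy."""
    scores = {k: leave_one_out_accuracy(rows, labels, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Hypothetical toy data: two groups of three points each.
rows = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
labels = ["A", "A", "A", "B", "B", "B"]
best_k, scores = choose_k(rows, labels, candidates=[1, 3, 5])
```

On this toy data a K of 5 fails completely, because once a point is held out its own class is outnumbered among the five remaining neighbors. It illustrates why a K that is too large for the data set hurts accuracy.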
3. Decision Trees
Decision trees can be used for both Classification and Regression, as they can handle both numerical and categorical data. A decision tree breaks the data set down into smaller and smaller subsets (nodes) as the tree is developed. The resulting tree consists of decision nodes and leaf nodes: a decision node has two or more branches, while a leaf node represents a decision. The topmost node, which corresponds to the best predictor, is called the root node.
Pros:
- Simple to understand.
- Easy to visualize.
- Requires little data preparation.
- Handles both numerical and categorical data.
Cons:
- Sometimes does not generalize well.
- Unstable: small changes in the input data can produce a very different tree.
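To make the idea of a "best predictor" at a decision node more concrete, the sketch below scores candidate splits on one numerical feature. It assumes Gini impurity as the splitting criterion, which is one common choice that the article does not specify, and the function names and toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed classes."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2
                     for count in Counter(labels).values())

def best_split(rows, labels, feature_index):
    """Find the threshold on one feature with the lowest weighted impurity."""
    best_threshold, best_score = None, float("inf")
    for threshold in sorted({row[feature_index] for row in rows}):
        left = [lab for row, lab in zip(rows, labels)
                if row[feature_index] <= threshold]
        right = [lab for row, lab in zip(rows, labels)
                 if row[feature_index] > threshold]
        if not left or not right:
            continue  # a split must send rows to both branches
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical toy data: a single numerical feature.
rows = [[2.0], [3.0], [10.0], [11.0]]
labels = ["no", "no", "yes", "yes"]
threshold, impurity = best_split(rows, labels, feature_index=0)
```

A full tree would repeat this search over every feature and then recurse into the left and right subsets until the nodes are pure or a stopping rule is reached.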
4. Random Forests
Random forests are an ensemble learning method that can be used for classification and regression. They work by constructing several decision trees and combining their outputs: the mean of all the trees' predictions in Regression, or a majority vote in Classification problems. As the name itself suggests, a group of trees is called a Forest.
Pros:
- Can handle large datasets.
- Outputs the importance of variables.
- Can handle missing values.
Cons:
- It is a black box algorithm.
- Real-time prediction can be slow and the model is complex, since many trees must be evaluated.
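The way a forest combines its trees can be sketched as below. This is only an illustration of the aggregation step under stated assumptions: the "trees" are hypothetical hand-written functions standing in for trained decision trees, and the feature names are made up for the example.

```python
from collections import Counter
from statistics import mean

def forest_classify(trees, row):
    """Classification: every tree votes, the most frequent class wins."""
    votes = [tree(row) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

def forest_regress(trees, row):
    """Regression: the forest outputs the mean of the trees' predictions."""
    return mean(tree(row) for tree in trees)

# Hypothetical stand-ins for three trained classification trees.
class_trees = [
    lambda row: "spam" if row["links"] > 2 else "ham",
    lambda row: "spam" if row["exclamations"] > 3 else "ham",
    lambda row: "spam" if row["length"] < 20 else "ham",
]
email = {"links": 5, "exclamations": 1, "length": 12}
label = forest_classify(class_trees, email)  # votes: spam, ham, spam

# Hypothetical stand-ins for three trained regression trees.
value_trees = [lambda row: 2.0, lambda row: 4.0, lambda row: 6.0]
estimate = forest_regress(value_trees, email)
```

Because every tree has to be evaluated for each prediction, the cost of a prediction grows with the number of trees, which is the source of the slow real-time prediction listed above.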
5. Support Vector Machines
A support vector machine represents the data set as points in space, separated into categories by a clear gap (the margin) that is as wide as possible. New data points are then mapped into that same space and classified based on which side of the gap they fall.
Pros:
- Works best in high dimensional spaces.
- Uses a subset of the training data points (the support vectors) in the decision function, which makes it a memory efficient algorithm.
Cons:
- Does not directly provide probability estimates.
- Probability estimates can be calculated using cross validation, but this is time consuming.
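The "which side of the line" rule can be sketched for a linear SVM as follows. The weights and bias here are hypothetical values assumed to come from an already trained model, since training itself is beyond this sketch. The margin width formula 2 / ||w|| applies to the standard hard-margin formulation, where the support vectors satisfy |w·x + b| = 1.

```python
import math

def decision_value(weights, bias, point):
    """Signed score w.x + b: its sign tells which side of the line."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def classify(weights, bias, point):
    """Return +1 or -1 depending on the side of the separating line."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1

def margin_width(weights):
    """Width of the gap between the two classes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(w * w for w in weights))

# Hypothetical trained parameters: the line x1 + x2 = 5.
weights = [1.0, 1.0]
bias = -5.0

below = classify(weights, bias, [1.0, 1.0])   # score -3.0, negative side
above = classify(weights, bias, [4.0, 4.0])   # score +3.0, positive side
gap = margin_width(weights)
```

The decision value is a distance-like score rather than a probability, which is why probability estimates need an extra calibration step.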
In this article we have discussed 5 Classification algorithms, with brief definitions and their pros and cons. These are only a few of the algorithms available; there are other valuable ones such as Naïve Bayes, Neural Networks and Ordered Logistic Regression. One cannot tell in advance which algorithm will work well for which problem, so the best practice is to try out a few and select the final model based on evaluation metrics.
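As a minimal sketch of what comparing models on evaluation metrics can look like, the code below computes accuracy, precision and recall for a binary problem. The article does not name specific metrics, so these three are an assumption, and the function names and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Guard against division by zero when a class is never predicted.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical test labels and predictions from one candidate model.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
```

Running the same function on the predictions of each candidate model, using the same held-out test set, gives comparable numbers for selecting the final model.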
What is the main purpose behind using logistic regression?
Logistic regression is mainly used to estimate probabilities. It uses the logistic regression equation to model the relationship between the dependent variable and the independent variables present in the given data, by estimating the probability of each event. A logistic regression model is very similar to a linear regression model; however, its use is preferred where the dependent variable in the data is dichotomous.
How is SVM different from logistic regression?
Though SVM can provide more accuracy than logistic regression models on some problems, it is more complex to use and, thus, less user-friendly. With large amounts of data, the use of SVM is generally not preferred. While SVM is used to solve both regression and classification problems, logistic regression only solves classification problems well. Over-fitting is also a more common occurrence with logistic regression than with SVM, and logistic regression is more vulnerable to outliers than support vector machines.
Is a regression tree a type of decision tree?
Yes, regression trees are decision trees that are used for regression tasks. They model the relationship between the dependent variable and the independent variables by repeatedly splitting the initial data set into smaller subsets. Regression trees can be used only when the target variable is continuous.