KNN Classifier For Machine Learning: Everything You Need to Know

Remember the time when artificial intelligence (AI) was only a concept limited to sci-fi novels and movies? Well, thanks to technological advancement, AI is something that we now live with every day. From Alexa and Siri being there at our beck and call to OTT platforms “handpicking” the movies we’d like to watch, AI has almost become the order of the day and is here to say for the foreseeable future. 

This is all possible thanks to advanced ML algorithms. Today, we’re going to talk about one such useful ML algorithm, the K-NN Classifier.

A branch of AI and computer science, machine learning uses data and algorithms to mimic human understanding while gradually improving the accuracy of the algorithms. Machine learning involves training algorithms to make predictions or classifications and unearthing key insights that drive strategic decision-making within businesses and applications. 

The KNN (k-nearest neighbour) algorithm is a fundamental supervised machine learning algorithm used to solve regression and classification problem statements. So, let’s dive in to know more about K-NN Classifier.

Supervised vs Unsupervised Machine Learning

Supervised and unsupervised learning are two basic data science approaches, and it is pertinent to know the difference before we go into the details of KNN. 

Supervised learning is a machine learning approach that uses labelled datasets to help predict outcomes. Such datasets are designed to “supervise” or train algorithms into predicting outcomes or classifying data accurately. Hence, labelled inputs and outputs enable the model to learn over time while improving its accuracy.

Supervised learning involves two types of problems – classification and regression. In classification problems, algorithms allocate test data into discrete categories, such as separating cats from dogs.

A significant real-life example would be classifying spam mails into a folder separate from your inbox. On the other hand, the regression method of supervised learning trains algorithms to understand the relationship between independent and dependent variables. It uses different data points to predict numerical values, such as projecting the sales revenue for a business.

Unsupervised learning, on the contrary, uses machine learning algorithms for the analysis and clustering of unlabelled datasets. Thus, there is no need for human intervention (“unsupervised”) for the algorithms to identify hidden patterns in data.

Unsupervised learning models have three main applications – association, clustering, and dimensionality reduction. However, we will not go into the details since it’s beyond our scope of discussion.

K-Nearest Neighbour (KNN)

The K-Nearest Neighbour or the KNN algorithm is a machine learning algorithm based on the supervised learning model. The K-NN algorithm works by assuming that similar things exist close to each other. Hence, the K-NN algorithm utilises feature similarity between the new data points and the points in the training set (available cases) to predict the values of the new data points. In essence, the K-NN algorithm assigns a value to the latest data point based on how closely it resembles the points in the training set. K-NN algorithm finds application in both classification and regression problems but is mainly used for classification problems.

Here’s an example to understand K-NN Classifier.

 

Source

In the above image, the input value is a creature with similarities to both a cat and a dog. However, we want to classify it into either a cat or a dog. So, we can use K-NN algorithm for this classification. The K-NN model will find similarities between the new data set (input) to the available cat and dog images (training data set). Subsequently, the model will put the new data point in either the cat or dog category based on the most similar features.

Likewise, category A (green dots) and category B (orange dots) have the above graphical example. We also have a new data point (blue dot) that will fall into either of the categories. We can solve this classification problem using a K-NN algorithm and identify the new data point category. 

Defining Properties of K-NN Algorithm

The following two properties best define the K-NN algorithm:

  • It is a lazy learning algorithm because instead of learning from the training set immediately, the K-NN algorithm stores the dataset and trains from the dataset at the time of classification.
  • K-NN is also a non-parametric algorithm, meaning it does not make any assumptions about the underlying data.

Working of the K-NN Algorithm

Now, let’s take a look at the following steps to understand how K-NN algorithm works.

Step 1: Load the training and test data.

Step 2: Choose the nearest data points, that is, the value of K. 

Step 3: Calculate the distance of K number of neighbours (the distance between each row of training data and test data). The Euclidean method is most commonly used for calculating the distance.

Step 4: Take the K nearest neighbours based on the calculated Euclidean distance.

Step 5: Among the nearest K neighbours, count the number of data points in each category.

Step 6: Allot the new data points to that category for which the number of neighbours is maximum.

Step 7: End. The model is now ready.

Choosing the value of K

K is a critical parameter in the K-NN algorithm. Hence, we need to keep in mind some points before we decide on a value of K. 

Using error curves is a common method to determine the value of K. The image below shows error curves for different K values for test and training data.

Source

In the above graphical example, the train error is zero at K=1 in training data because the nearest neighbour to the point is that point itself. However, the test error is high even at low values of K. This is called high variance or overfitting of data. The test error reduces as we increase the value of K., But after a certain value of K, we see that the test error increases again, called bias or underfitting. Thus, the test data error is initially high due to variance, it subsequently lowers and stabilises, and with further increase in the value of K, the test error again shoots up due to bias. 

Therefore, the value of K at which the test error stabilises and is low is taken as the optimal value of K. Considering the above error curve, K=8 is the optimal value.  

An Example to Understand the Working of K-NN Algorithm

Consider a dataset that has been plotted as follows:

Source

Say there is a new data point (black dot) at (60,60) which we have to classify into either the purple or red class. We will use K=3, meaning that the new data point will find three nearest data points, two in the red class and one in the purple class.

Source 

The nearest neighbours are determined by calculating the Euclidean distance between two points. Here’s an illustration to show how the calculation is done.

Source

Now, since two (out of the three) of the nearest neighbours of the new data point (black dot) lies in the red class, the new data point will also be assigned to the red class.

Join the Machine Learning Course online from the World’s top Universities – Masters, Executive Post Graduate Programs, and Advanced Certificate Program in ML & AI to fast-track your career.

K-NN as Classifier (Implementation in Python)

Now that we’ve had a simplified explanation of the K-NN algorithm, let us go through implementing the K-NN algorithm in Python. We will only focus on K-NN Classifier.

Step 1: Import the necessary Python packages.

Source

Step 2: Download the iris dataset from the UCI Machine Learning Repository. Its weblink is “https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data”

Step 3: Assign column names to the dataset.

Source

Step 4: Read the dataset to Pandas DataFrame.

Source

Step 5: Data preprocessing is done using the following script lines.

Source

Step 6: Divide the dataset into test and train split. The code below will split the dataset into 40% testing data and 60% training data.

Source

Step 7: Data scaling is done as follows:

Source

Step 8: Train the model using KNeighborsClassifier class of sklearn.

Source

Step 9: Make a prediction using the following script:

Source

Step 10: Print the results.

Source

Output:

Source

What Next? Sign-up for The Advanced Certificate Programme in Machine Learning from IIT Madras and upGrad

Suppose you’re aspiring to become a skilled Data Scientist or Machine Learning professional. In that case, the Advanced Certification Course in Machine Learning and Cloud from IIT Madras and upGrad is just for you!

The 12-month online program is specially designed for working professionals looking to master concepts in Machine Learning, Big Data Processing, Data Management, Data Warehousing, Cloud, and deployment of Machine Learning models. 

Here are some course highlights to give you a better idea of what the program offers:

  • Globally accepted prestigious certification from IIT Madras
  • 500+ hours of learning, 20+ case studies and projects, 25+ industry mentorship sessions, 8+ coding assignments
  • Comprehensive coverage of 7 programming languages and tools
  • 4 weeks of industry capstone project
  • Practical hands-on workshops
  • Offline peer-to-peer networking

Sign up today to learn more about the program!

Conclusion

With time, Big Data continues to grow, and artificial intelligence becomes increasingly entwined with our lives. As a result, there is an acute rise in demand for data science professionals who can leverage the power of machine learning models to gather data insights and improve critical business processes and, in general, our world. No doubt, the field of artificial intelligence and machine learning looks indeed promising. With upGrad, you can rest assured that your career in machine learning and cloud is a rewarding one!

Why is K-NN a good classifier?

The primary advantage of K-NN over other machine learning algorithms is that we can conveniently use K-NN for multiclass classification. Thus, K-NN is the best algorithm if we need to classify data into more than two categories or if the data comprises more than two labels. Besides, it is ideal for non-linear data and has relatively high accuracy.

What is the limitation of the K-NN algorithm?

The K-NN algorithm works by calculating the distance between the data points. Hence, it is pretty obvious that it is a relatively more time-consuming algorithm and will take more time to classify in some instances. Therefore, it is best not to use too many data pointswhile using K-NN for multiclass classification. Other limitations include high memory storage and sensitivity to irrelevant features.

What are the real-world applications of K-NN?

K-NN has several real-life use cases in machine learning, such as handwriting detection, speech recognition, video recognition, and image recognition. In banking, K-NN is used to predict if an individual is eligible for a loan based on whether they have characteristics similar to defaulters. In politics, K-NN can be used to classify potential voters into different classes like “will vote to party X” or “will vote to party Y,” etc.

Enhance Your Career in Machine Learning and Artificial Intelligence

0 replies on “KNN Classifier For Machine Learning: Everything You Need to Know”

×