k nearest neighbors

In the previous tutorials, we built a logistic regression model that predicts whether a person survived the Titanic disaster. In this tutorial, we will discuss the k nearest neighbors (KNN) algorithm.

KNN:

K Nearest Neighbors is an algorithm that solves classification problems in a very simple way: it predicts the value of a point based on the values of the points surrounding it. Let us consider an example of the income of people of different ages, labelled by gender.

The plot shows the income of people of different ages. From the plot it could be said that the female employees earn less, and there are also fewer of them. Now suppose a few new points are given (a classification problem) and we are asked to predict their gender based on their age and income.

Here there are three predictions to be made:

Point 3 – This point is most probably male, as it is surrounded entirely by red points.

Point 1 – This point is clearly female, as it is surrounded entirely by blue points.

Point 2 – This is a difficult prediction, as it is surrounded by both blue and red points.

This is basically what the KNN algorithm does: it looks at the points nearest to the new point and assigns the majority class among them.
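Here is a minimal sketch of that idea with scikit-learn. The age/income numbers and labels below are made up for illustration; they are not the tutorial's dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Each row is (age, income); the labels play the role of the colours in the plot.
X = np.array([[25, 30000], [30, 42000], [35, 50000], [40, 61000],
              [22, 18000], [28, 21000], [33, 24000], [45, 26000]])
y = np.array(['male', 'male', 'male', 'male',
              'female', 'female', 'female', 'female'])

# Classify a new point by a majority vote among its 3 nearest neighbours.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
print(model.predict([[29, 40000]]))
```

In practice the features should be scaled first: age and income live on very different scales, so without scaling, income alone decides who the neighbours are.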

What is K and how to choose it:

We predict a value based on a number of surrounding values: sometimes we consider the 20 nearest points and sometimes the 40 nearest. That count is “k”, and changing the value of k can really affect the prediction.

In the above figure, the prediction was to be made for the black dot. When we consider k = 3, the prediction comes out to be ‘red’, but when we increase k to 13, the prediction changes to ‘blue’. This shows that selecting the correct value of k is very important.

There is a method known as the elbow method to find the value of k that best suits the data, sketched below. We will use it properly while building our model.
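As a preview, here is a sketch of the elbow method: fit one KNN model per value of k and look for the point where the test error stops dropping sharply. The iris dataset is used here only as a convenient built-in stand-in, not as the tutorial's dataset.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Any labelled dataset works here; iris is just a stand-in.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

error_rate = []
for k in range(1, 40):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    # Fraction of test points this model classifies wrongly.
    error_rate.append((model.predict(X_test) != y_test).mean())

plt.plot(range(1, 40), error_rate, marker='o')
plt.xlabel('k')
plt.ylabel('error rate')
plt.show()  # pick the k at the "elbow", where the curve flattens out
```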

Advantages and disadvantages of the knn method:

Advantages:

  • It’s very easy
  • There is very little training (the model just stores the data)
  • It works with any number of classes
  • It is easy to add extra data
  • It has very few parameters: k and the distance metric

Disadvantages:

  • High prediction cost
  • Not good with high-dimensional data
  • Doesn’t work well with categorical data

There are a few distance metrics used in KNN predictions:

  • Euclidean 
  • Manhattan
  • Minkowski

Euclidean distance:

This is the ordinary straight-line distance between two coordinate points.
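For two points x = (x₁, …, xₙ) and y = (y₁, …, yₙ), it is:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$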

Manhattan distance:
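This is the distance between two points when you can only move along the coordinate axes, like walking along the blocks of a city grid. It is the sum of the absolute differences of the coordinates:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$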

Minkowski distance:
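This is a generalisation of the previous two, with an extra parameter p: setting p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

As a quick check of the three formulas, here is how they can be computed on two made-up points with NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # Minkowski with p = 2
manhattan = np.sum(np.abs(x - y))          # Minkowski with p = 1
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)

print(euclidean, manhattan, minkowski)
```

Note that scikit-learn’s KNeighborsClassifier uses the Minkowski metric with p = 2 (i.e. Euclidean distance) by default.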

In the upcoming tutorial, we will build our first KNN model using the data science concepts learnt so far. This is the end of this tutorial on k nearest neighbors. For more information, visit the Data Section.
