AI Explained Using The K-Nearest Algorithm

Comfort Mba
3 min read · Jun 22, 2018

The K-nearest algorithm, or K-Nearest Neighbors (KNN), is a non-parametric (that is, distribution-free) method used for classification and regression. It can also be described as a lazy learning method, where the function is only approximated locally and all computation is deferred until classification.

There is no explicit training phase before classification. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.

A quick summary of KNN

The algorithm can be summarized as:

1) A positive integer k is specified, along with a new sample

2) We select the k entries in our database which are closest to the new sample

3) We find the most common classification of these entries

4) This is the classification we give to the new sample
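To make those four steps concrete, here is a minimal sketch in plain Python. The feature vectors, labels, and value of k are made up for illustration, and Euclidean distance is assumed as the metric (metrics are discussed further below).

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training_data, new_point, k):
    # training_data is a list of (feature_vector, label) pairs
    # 1) k and the new sample are given as arguments
    # 2) select the k entries closest to the new sample
    neighbors = sorted(training_data, key=lambda pair: euclidean(pair[0], new_point))[:k]
    # 3) find the most common classification among those entries
    labels = [label for _, label in neighbors]
    # 4) that majority label is the classification we give the new sample
    return Counter(labels).most_common(1)[0][0]

# Toy example with made-up data
training_data = [([1.0, 1.1], "A"), ([1.2, 0.9], "A"), ([8.0, 8.2], "B"), ([7.9, 8.1], "B")]
print(knn_classify(training_data, [1.1, 1.0], k=3))  # -> "A"
```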

KNN is one of the simplest methods in machine learning. It classifies a new point by finding the most similar data points in the training data and making a guess based on their classifications. What this means is that it does not use the training data points to do any generalization. In other words, there is no explicit training phase, or it is very minimal.

This also means that the training phase is pretty fast. You do, however, need a way to represent each data point as a feature vector (a mathematical representation of the data). Since the data we care about may not be inherently numerical, preprocessing and feature engineering may be required in order to create these vectors.
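As a rough sketch of what that preprocessing can look like, here is one common approach, one-hot encoding, applied to an invented record with one numeric and one categorical field (the field names and categories are purely illustrative):

```python
def one_hot(value, categories):
    # Represent a categorical value as a vector of 0s and 1s
    return [1.0 if value == c else 0.0 for c in categories]

# A made-up record with one numeric field and one categorical field
record = {"age": 35, "employment": "salaried"}
employment_types = ["salaried", "self-employed", "unemployed"]

# Concatenate numeric values and one-hot encodings into a single feature vector
feature_vector = [float(record["age"])] + one_hot(record["employment"], employment_types)
print(feature_vector)  # [35.0, 1.0, 0.0, 0.0]
```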

Although we can immediately begin classifying once we have our data, this type of algorithm comes with some problems. We must be able to keep the entire training set in memory unless we apply some kind of reduction to the data set, and classification can be computationally expensive because the algorithm parses through all data points for each prediction. For these reasons, KNN works best on smaller data sets.

The training data set can be represented as an M x N matrix, where M is the number of data points and N is the number of features; then we can start our classification. Before making your classification, two requirements must be met. One is the value of k, which can be chosen arbitrarily or found through cross-validation. The second is the distance metric to be used, which is the more complex choice.
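One way to pick k by cross-validation is sketched below using scikit-learn. The candidate values of k are arbitrary, and the iris data set simply stands in for an M x N feature matrix X and a label vector y.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Example M x N feature matrix and label vector (here the iris data set)
X, y = load_iris(return_X_y=True)

# Try a few candidate values of k and keep the one with the best cross-validated accuracy
best_k, best_score = None, 0.0
for k in [1, 3, 5, 7, 9, 11]:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)
```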

Distance is a fairly ambiguous notion; the proper metric to use is always determined by the data set and the classification task. Two common choices are Euclidean distance and cosine similarity.

Euclidean distance is the magnitude of the vector obtained by subtracting the training data point from the point to be classified.

Another common metric is cosine similarity, which uses the difference in direction between two vectors. It's always best to use cross-validation when deciding between metrics.
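A small sketch of both metrics on plain Python lists (the vectors are arbitrary) shows how they can disagree:

```python
import math

def euclidean_distance(a, b):
    # Magnitude of the vector obtained by subtracting one point from the other
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Compares the direction of two vectors, ignoring their magnitude
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean_distance(a, b))  # ~3.74: the points are far apart in magnitude
print(cosine_similarity(a, b))   # 1.0: but they point in exactly the same direction
```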

The KNN algorithm can also be used for regression tasks, operating in a manner very similar to the classifier but averaging the values of the nearest neighbors instead of taking a majority vote.

For instance, consider credit ratings: we collect a person's financial characteristics and compare them against people with similar financial features in a database. By the very nature of a credit rating, people who have similar financial details would be given similar credit ratings.

Therefore, a lender could use this existing database to predict a new customer's credit rating, without having to perform all the calculations from scratch.
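Here is a minimal sketch of that regression variant, using made-up financial features and credit scores purely for illustration:

```python
import math

def knn_regress(training_data, new_point, k):
    # training_data is a list of (feature_vector, numeric_target) pairs
    distance = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(training_data, key=lambda pair: distance(pair[0], new_point))[:k]
    # Average the targets of the k nearest neighbors instead of taking a majority vote
    return sum(target for _, target in neighbors) / k

# Made-up example: predicting a credit score from two financial features (age, income)
training_data = [([30.0, 40000.0], 650), ([32.0, 42000.0], 660), ([55.0, 90000.0], 780)]
print(knn_regress(training_data, [31.0, 41000.0], k=2))  # 655.0
```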
