K-Means Clustering
K-means is an unsupervised machine learning algorithm (meaning it learns from unlabeled data) that attempts to group similar data points into clusters. A few common clustering use cases:
- Clustering similar documents
- Clustering customers based on features
- Market segmentation
- Identifying similar physical groups
In short, the clustering process divides the data into several groups such that each group contains similar data points.
The K-means algorithm:
Step 1: Choose the number of clusters, K (somewhat like choosing K in KNN).
Step 2: Assign each point to one of the K clusters at random (equivalently, pick K initial centroids).
Step 3: Repeat until the clusters stop changing: recompute each cluster's centroid as the mean vector of the points assigned to it, then reassign each data point to the cluster whose centroid is closest.
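To make the steps concrete, here is a minimal from-scratch sketch of this loop in Python. The synthetic two-blob data, the random-point initialization, and the centroid-based convergence check are illustrative assumptions, not part of the original text.

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Plain K-means: assign points to the nearest centroid, recompute means."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: initialize by picking K distinct data points as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the cluster whose centroid is closest.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        # (Empty clusters are not handled in this sketch.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids (and hence the clusters) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative data: two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
labels, centroids = k_means(X, k=2)
print(centroids)  # should land near [0, 0] and [5, 5]
```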
(Figure: a pictorial representation of the steps of K-means clustering.)
How to choose the K value:
There is no easy, foolproof way to predict the best value of K. A common approach is the "elbow method", similar to how K is chosen for KNN.
To use the elbow method, we plot an elbow graph of SSE against a range of K values.
SSE (sum of squared errors): the sum of the squared distances between each data point and the centroid of its cluster.
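In symbols, writing μ_k for the centroid of cluster C_k (standard notation, not from the original text), the quantity plotted is:

```latex
\mathrm{SSE} \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```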
If we plot SSE against K, SSE decreases as K increases: with more clusters, each cluster gets smaller, so points lie closer to their centroids and the distortion shrinks.
SSE keeps falling all the way down to one point per cluster, so there is no natural stopping point as K grows. Instead, we choose the K at the "elbow" of the curve, where SSE stops dropping sharply and further increases in K bring only small gains.
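As a sketch of the elbow method, the snippet below uses scikit-learn, whose KMeans estimator exposes exactly this SSE as its inertia_ attribute. The synthetic three-blob data and the K range 1 to 9 are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Illustrative data: three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, (60, 2)) for c in (0, 5, 10)])

# Fit K-means for each K and record the SSE (inertia).
k_values = range(1, 10)
sse = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # sum of squared distances to closest centroid

# The "elbow" in this curve (here, around K = 3) suggests a good K.
plt.plot(k_values, sse, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("SSE (inertia)")
plt.title("Elbow method")
plt.show()
```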