Clustering is a technique for grouping objects so that similar objects end up in the same cluster and dissimilar objects end up in different clusters. Clustering is widely used in industry. For example, if we have a large collection of documents and want to organize them by domain, we can use clustering to group similar documents together.

Let me give you a real-life example. You have 10 apples, 10 oranges and 10 bananas, all mixed together into a pile of 30 fruits. You want to separate them into apples, oranges and bananas. What would you do? Based on color and shape, you would recognize each fruit and easily separate them. In other words, you used features like the shape and color of the fruit to separate them. Similarly, we create features for documents and place two documents in the same cluster when their features are similar.

This article assumes that you have already extracted features from your objects and are ready to cluster them. In this article, I will discuss a simple and popular clustering algorithm, "K-means clustering". Let's get started.


**K-means algorithm:**

Let’s say we have 1000 data points and each data point is an m-dimensional vector. For easy visualization, assume m is 2: you can then picture 1000 data points plotted on a 2D plane.

- First, select the number of clusters (K) that you want. K can be chosen based on your domain knowledge, or a good K can be estimated automatically. We will discuss this at the end.
- After selecting K, randomly select K data points from the data. These are your initial cluster centroids. We will update the centroids repeatedly until we get a good set of clusters; hold on for a minute and you will see how to update them.
- Take a data point and calculate its distance from each cluster centroid, giving K distances. Assign the point to the cluster whose centroid is closest. Do the same for every point, so that each point is assigned to exactly one cluster. A common distance metric is Euclidean distance.
- Now you have K clusters and each point is assigned to one of them. But these are not the final clusters. Using the current assignment, recompute each cluster centroid as the mean of the data points assigned to that cluster. The centroids have now been updated.
- Repeat the previous two steps (assignment and centroid update) until the centroids barely change. After some iterations, you will observe that the centroids become stable.
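The steps above can be sketched in Python with NumPy. This is a minimal illustration, not a production implementation: the function name, the seed, and the iteration cap are my own choices, and for brevity it does not handle the rare case of a cluster becoming empty.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Cluster the rows of X into k groups; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example: two well-separated groups of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
# labels places the first three points in one cluster and the last three in the other.
```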

We have now separated the data points into K clusters. Sometimes the random initial centroids chosen in the second step are poor, and the result depends entirely on that initialization. It is therefore better to run the entire process multiple times and keep the best clustering.

One of the main disadvantages of the K-means algorithm is K itself: we have to select the number of clusters beforehand, which is often difficult. To find a good K, we can use the Elbow method, which helps us identify the number of clusters.

**How to choose the number of clusters (k)?**

- We apply the K-means algorithm for different values of k. But how do we compare the results? We need a single metric to decide which number of clusters is best.
- For each k, we calculate the sum of squared distances within each cluster and add them up. To explain it better: take all the data points in a cluster, compute each point's squared distance from that cluster's centroid, and sum them; then do the same for every cluster and add the totals. Finally, for each k, we have one **sum of squared distances** value.
- Plot the sum of squared distances against k. At some k, you will observe a bend (an "elbow") in the plot; that k is the optimal value you are searching for. Please refer to the plot below.
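One way to compute these sums is sketched below, assuming scikit-learn is installed (it calls this quantity `inertia_`). The toy data here is synthetic, with three blobs I made up for illustration; the range of k values is also an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic toy data: three well-separated 2D blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

inertias = []
for k in range(1, 8):
    # n_init=10 reruns K-means from different random initializations
    # and keeps the best result, guarding against unlucky starts.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of points to their centroids.
    inertias.append(km.inertia_)

# Plotting inertias against k should show a clear bend (the "elbow") at k = 3,
# since the data was generated from three blobs.
```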

[Plot: sum of squared distances vs. k, showing the elbow]

Now you have understood how the K-means clustering algorithm works. As a next step, you can read about the DBSCAN clustering algorithm, which is also one of the most popular clustering algorithms.

Thank you so much for reading my blog and supporting me. Stay tuned for my next article.

**If you want to receive email updates, don’t forget to subscribe to my blog.** If you have any queries, please do comment in the comment section below. I will be more than happy to help you. Keep learning and sharing!!
Follow me here:

GitHub: https://github.com/Abhishekmamidi123

LinkedIn: https://www.linkedin.com/in/abhishekmamidi/

Kaggle: https://www.kaggle.com/abhishekmamidi

If you are looking for any specific blog, please do comment in the comment section below.

