K-Means Clustering Algorithm

Zanfina Svirca
Published in DataDrivenInvestor
3 min read · Feb 19, 2020


K-means is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters, K, each represented by its centroid. It is also known as one of the simplest clustering algorithms.
In this article I will walk through the algorithm's key points and try to simplify them a bit.

A description of k-means

The algorithm starts by choosing K representative points, which are taken as the initial centroids. A centroid is the center of gravity of the points in a cluster. Each point is then assigned to the closest centroid according to a chosen proximity measure. Proximity measures characterize the similarity or dissimilarity between objects, so we need one that quantifies the notion of "closest": Euclidean distance is often used for data points in Euclidean space, while cosine similarity is more suitable for documents. Once the clusters are formed, the centroid of each cluster is updated. The algorithm then iteratively repeats these two steps, assigning points to the nearest centroids and updating the centroids, until the centroids no longer change. The proximity measures used for K-means are usually kept simple, because the algorithm repeatedly computes the proximity of every point to every centroid.
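The two proximity measures mentioned above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the article; the function names are my own.

```python
import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance; suited to points in Euclidean space.
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # Angle-based similarity (1.0 = same direction); suited to
    # document vectors, where length matters less than orientation.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
print(euclidean_distance(p, q))  # 5.0
print(cosine_similarity(p, q))
```

Note that Euclidean distance grows with separation (smaller is "closer"), while cosine similarity grows with alignment (larger is "closer"), so the assignment step must compare them in opposite directions.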

A realistic illustration of the K-means algorithm

The gif above shows how the algorithm works; I couldn't find a better way to illustrate it. Briefly: in the first iteration, random initial centroids are assigned (three of them here, so K is 3), and the points are assigned to those centroids. In subsequent iterations the centroids change position until convergence, in other words, until the centroids do not change anymore. The centroids are marked with an 'X', and all points belonging to the same cluster share a color.

So, to conclude, the K-means algorithm operates by following these steps:

1. Select k, the number of clusters.
2. Calculate the centroids of the k clusters.
3. Compute the Euclidean distance of each object in the dataset from each of the centroids.
4. Assign each object to the cluster whose centroid is nearest, based on the distances from step 3.
5. Recalculate the centroid of each cluster.
6. Reassign the objects to clusters by least distance; if no assignment changed, go to step 7, otherwise go to step 5.
7. Stop.
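The steps above can be sketched as a compact NumPy implementation. This is a minimal illustration under the assumptions that the data fits in an (n, d) array and that no cluster ever becomes empty; the function name and signature are my own, not from the article.

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Minimal sketch of the K-means steps. `points` is an (n, d)
    array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: Euclidean distance of every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its cluster
        # (assumes no cluster is empty, a known K-means edge case).
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Steps 6-7: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

For example, `k_means(data, 3)` on an (n, 2) array returns three centroid coordinates and a length-n array of cluster indices.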

The k-means algorithm is known to have a time complexity of O(n·k·l), where n is the number of patterns, k is the number of clusters, and l is the number of iterations the algorithm takes to converge.

Note: In practice the initial centroids are usually chosen at random, and a single random run is often not very accurate. So it is really important to be careful when choosing the number of clusters and the initial centroids, because these two factors have the biggest impact on the performance of the K-means algorithm.
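One common way to reduce the sensitivity to initialization is to run the algorithm several times from different random starting centroids and keep the run with the lowest within-cluster sum of squared errors (SSE). The sketch below illustrates that idea in plain NumPy; the function names and the restart strategy shown are my own illustration, not something prescribed by the article, and it again assumes no cluster becomes empty.

```python
import numpy as np

def lloyd(points, k, rng):
    """One K-means run from a random initialization.
    Returns (centroids, labels, sse); lower SSE = tighter clusters."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(100):
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Final assignment against the final centroids.
    labels = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = ((points - centroids[labels]) ** 2).sum()
    return centroids, labels, sse

def k_means_restarts(points, k, n_init=10, seed=0):
    # Run several random initializations and keep the lowest-SSE result.
    rng = np.random.default_rng(seed)
    return min((lloyd(points, k, rng) for _ in range(n_init)), key=lambda r: r[2])
```

The same multi-restart idea also helps when comparing candidate values of k: plotting the best SSE for each k and looking for the "elbow" is a common way to pick the number of clusters.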


