K-Means Clustering

Let's get the bigger picture of the k-means clustering algorithm.


Published: 13 December 2018

K-Means Clustering: Problem Statement

K-means tries to partition n data points into a set of k clusters, where each data point is assigned to its closest cluster. The method is defined by an objective function that tries to minimize the sum of all squared distances within a cluster, for all clusters.

The objective function is defined as:

J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2

where x_j is a data point in the data set, S_i is a cluster (a set of data points), and μ_i is the cluster mean (the center of cluster S_i).


K-Means Clustering Algorithm:

1. Choose a value of k, the number of clusters to be formed.

2. Randomly select k data points from the data set as the initial cluster centroids/centers.

3. For each data point:

a. Compute the distance between the data point and each cluster centroid.

b. Assign the data point to the closest centroid.

4. For each cluster, calculate the new mean based on the data points in the cluster.

5. Repeat steps 3 and 4 until the means of the clusters stop changing or the maximum number of iterations is reached (see the code sketch below).
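To make these steps concrete, here is a minimal k-means sketch in plain Python for 1-dimensional data. The function name kmeans and its parameters are invented for this illustration, not taken from any library:

```python
import random

def kmeans(points, k, max_iterations=100):
    """Cluster 1-D points into k groups; a minimal illustrative sketch."""
    # Step 2: randomly pick k data points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iterations):
        # Step 3: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            distances = [abs(x - c) for c in centroids]  # Euclidean distance in 1-D
            clusters[distances.index(min(distances))].append(x)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 5: stop once the means no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means, clusters = kmeans(data, k=2)
print(means, clusters)
```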

 
[Figure: Flowchart of K-Means Clustering]

Understanding with a simple example

We will apply k-means to the following 1-dimensional data set for K = 2.

Data set: {2, 4, 10, 12, 3, 20, 30, 11, 25}

Iteration 1

M1, M2 are the two randomly selected centroids/means where

M1= 4, M2=11

and the initial clusters are

C1= {4}, C2= {11}

Calculate the Euclidean distance as

D(x, a) = √((x - a)²) = |x - a|

D1 is the distance from M1

D2 is the distance from M2
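As a quick way to verify the table below, this small Python snippet (written only for this example) computes D1, D2 and the cluster assignment for each data point:

```python
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 4, 11  # the two initial means for iteration 1

for x in data:
    d1, d2 = abs(x - m1), abs(x - m2)  # Euclidean distance in 1-D
    cluster = "C1" if d1 <= d2 else "C2"
    print(f"x={x:>2}  D1={d1:>2}  D2={d2:>2}  ->  {cluster}")
```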

 
Iteration 1

xj    D1 (M1 = 4)    D2 (M2 = 11)    Cluster
 2         2               9            C1
 4         0               7            C1
10         6               1            C2
12         8               1            C2
 3         1               8            C1
20        16               9            C2
30        26              19            C2
11         7               0            C2
25        21              14            C2

As we can see in the above table, the data points 2 and 3 join cluster C1, while the remaining data points are assigned to cluster C2.

Therefore

C1= {2, 4, 3}

C2= {10, 12, 20, 30, 11, 25}

Iteration 2

Calculate the new mean of the data points in C1 and C2.

Therefore

M1= (2+3+4)/3= 3

M2= (10+12+20+30+11+25)/6= 18

Distances are recalculated and the clusters updated as shown in the table below.

 
Iteration 2

xj    D1 (M1 = 3)    D2 (M2 = 18)    Cluster
 2         1              16            C1
 4         1              14            C1
10         7               8            C1
12         9               6            C2
 3         0              15            C1
20        17               2            C2
30        27              12            C2
11         8               7            C2
25        22               7            C2

New Clusters

C1= {2, 3, 4, 10}

C2= {12, 20, 30, 11, 25}

Iteration 3

Calculate the new mean of the data points in C1 and C2.

Therefore

M1= (2+3+4+10)/4= 4.75

M2= (12+20+30+11+25)/5= 19.6

Distances are recalculated and the clusters updated as shown in the table below.

 
Iteration 3

xj    D1 (M1 = 4.75)    D2 (M2 = 19.6)    Cluster
 2         2.75              17.6            C1
 4         0.75              15.6            C1
10         5.25               9.6            C1
12         7.25               7.6            C1
 3         1.75              16.6            C1
20        15.25               0.4            C2
30        25.25              10.4            C2
11         6.25               8.6            C1
25        20.25               5.4            C2

New Clusters

C1= {2, 3, 4, 10, 12, 11}

C2= {20, 30, 25}

Iteration 4

Calculate the new mean of the data points in C1 and C2.

Therefore

M1= (2+3+4+10+12+11)/6=7

M2= (20+30+25)/3= 25

Distances are recalculated and the clusters updated as shown in the table below.

 
Iteration 4

xj    D1 (M1 = 7)    D2 (M2 = 25)    Cluster
 2         5              23            C1
 4         3              21            C1
10         3              15            C1
12         5              13            C1
 3         4              22            C1
20        13               5            C2
30        23               5            C2
11         4              14            C1
25        18               0            C2

New Clusters

C1= {2, 3, 4, 10, 12, 11}

C2= {20, 30, 25}

As we can see, the data points in clusters C1 and C2 in iteration 4 are the same as the data points in clusters C1 and C2 in iteration 3.

This means that none of the data points has moved to another cluster, and the means/centroids of the clusters have stopped changing. This is the stopping condition for our algorithm.

How many clusters?

Selecting a proper value of K is difficult unless we have good knowledge of our data set.

Therefore we need some method to determine and validate whether we are using the right number of clusters. The fundamental aim of partitioning a data set is to minimize the intra-cluster variation, also called the SSE (sum of squared errors). To calculate the SSE, take the difference between each data point and its centroid, square it, and add up all of the squared differences.
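For example, for the final clustering found above, with C1 = {2, 3, 4, 10, 11, 12} (M1 = 7) and C2 = {20, 25, 30} (M2 = 25):

SSE = (2-7)² + (3-7)² + (4-7)² + (10-7)² + (11-7)² + (12-7)² + (20-25)² + (25-25)² + (30-25)²
    = 25 + 16 + 9 + 9 + 16 + 25 + 25 + 0 + 25 = 150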

So, to find the optimal number of clusters:

Run k-means for different values of K, for example with K varying from 1 to 10, and for each value of K compute the SSE.

Plot a line chart with the K values on the x-axis and the corresponding SSE values on the y-axis, as shown below.

 
[Figure: Elbow Method (SSE vs. K)]

SSE = 0 when K equals the number of data points, because each data point then forms its own cluster and is its own centroid.

As we can see in the graph, there is a rapid drop in SSE as we move from K = 2 to K = 3, and the curve becomes almost flat as K is increased further.

Because of this sudden drop, the graph shows an ‘elbow’, so the value to choose for K is 3. This technique is known as the elbow method.
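Here is a sketch of the elbow method in code, assuming scikit-learn and matplotlib are installed; scikit-learn's KMeans exposes the final SSE of a fit as its inertia_ attribute:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# The 1-D example data set from above, reshaped to a column for scikit-learn.
X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25]).reshape(-1, 1)

sse = []
ks = range(1, 10)  # K from 1 to 9 (the data set has only 9 points)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # inertia_ is the SSE of the fitted clustering

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("SSE")
plt.title("Elbow Method")
plt.show()
```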

There are also many other techniques which are used to determine the value of K.

K-means is the ‘go-to’ clustering algorithm because it is fast and easy to understand.

Some drawbacks of K-Means

1. The result might not be globally optimal: we cannot guarantee that the algorithm will find the best global solution. Selecting different random seeds at the start can lead to different final results.

2. The value of K needs to be specified beforehand: we can anticipate this value only if we have a good idea of our data set; when working with a new data set, the elbow method can be used to determine a value of K.

3. Works well only for linear boundaries: k-means assumes that cluster boundaries are always linear, so it fails when the true boundaries are complicated and non-linear.

4. Slow for a large number of samples: because the algorithm accesses every point of the data set in each iteration, it becomes slow as the sample size grows.

So this was all about K-Means. I hope you now have a better understanding of how k-means actually works. There are many other algorithms used for clustering in the industry.