Monday, 27 April 2015

k-Means: Step-By-Step Example


As a simple illustration of the k-means algorithm, consider the following data set consisting of the scores of two variables, A and B, for each of seven individuals:

Subject    A      B
1          1.0    1.0
2          1.5    2.0
3          3.0    4.0
4          5.0    7.0
5          3.5    5.0
6          4.5    5.0
7          3.5    4.5

This data set is to be grouped into two clusters.  As a first step in finding a sensible initial partition, let the A and B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:
           Individual    Mean Vector (centroid)
Group 1    1             (1.0, 1.0)
Group 2    4             (5.0, 7.0)

The remaining individuals are now examined in sequence and allocated to the cluster to which they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a new member is added. This leads to the following series of steps:
        Cluster 1                                Cluster 2
Step    Individual    Mean Vector (centroid)     Individual    Mean Vector (centroid)
1       1             (1.0, 1.0)                 4             (5.0, 7.0)
2       1, 2          (1.2, 1.5)                 4             (5.0, 7.0)
3       1, 2, 3       (1.8, 2.3)                 4             (5.0, 7.0)
4       1, 2, 3       (1.8, 2.3)                 4, 5          (4.2, 6.0)
5       1, 2, 3       (1.8, 2.3)                 4, 5, 6       (4.3, 5.7)
6       1, 2, 3       (1.8, 2.3)                 4, 5, 6, 7    (4.1, 5.4)
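The sequential allocation above can be sketched in Python (not from the original post; the names `members` and `centroids` are illustrative):

```python
import math

# Scores for the seven subjects (A, B), as in the data table above.
points = {1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
          5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5)}

# Seed the two clusters with the furthest-apart pair, subjects 1 and 4.
members = {1: [1], 2: [4]}
centroids = {1: points[1], 2: points[4]}

for subject in (2, 3, 5, 6, 7):
    # Allocate the subject to the nearest current centroid ...
    nearest = min(centroids,
                  key=lambda c: math.dist(points[subject], centroids[c]))
    members[nearest].append(subject)
    # ... and immediately recompute that cluster's mean vector.
    coords = [points[s] for s in members[nearest]]
    centroids[nearest] = tuple(sum(v) / len(coords) for v in zip(*coords))

print(members)  # {1: [1, 2, 3], 2: [4, 5, 6, 7]}
```

Running this reproduces the step table: cluster 1 ends with individuals 1, 2, 3 and cluster 2 with 4, 5, 6, 7.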

Now the initial partition has changed, and the two clusters at this stage have the following characteristics:
             Individual    Mean Vector (centroid)
Cluster 1    1, 2, 3       (1.8, 2.3)
Cluster 2    4, 5, 6, 7    (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster.  So we compare each individual's distance to its own cluster mean with its distance to the mean of the opposite cluster. We find:
Individual    Distance to mean (centroid) of Cluster 1    Distance to mean (centroid) of Cluster 2
1             1.5                                         5.4
2             0.4                                         4.3
3             2.1                                         1.8
4             5.7                                         1.8
5             3.2                                         0.7
6             3.8                                         0.6
7             2.8                                         1.1
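The comparison in the table can be reproduced with a short check (a sketch, not from the original post; the name `misplaced` is illustrative, and the centroids are taken at the one-decimal rounding tabulated above):

```python
import math

# Scores for the seven subjects and the partition reached above.
points = {1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
          5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5)}
clusters = {1: [1, 2, 3], 2: [4, 5, 6, 7]}
# Centroids as tabulated above, rounded to one decimal place.
centroids = {1: (1.8, 2.3), 2: (4.1, 5.4)}

# Flag every individual that sits closer to the other cluster's centroid
# than to its own.
misplaced = [s for own, subjects in clusters.items() for s in subjects
             if any(math.dist(points[s], centroids[other]) <
                    math.dist(points[s], centroids[own])
                    for other in centroids if other != own)]
print(misplaced)  # [3]
```

Only individual 3 is flagged, in agreement with the distance table.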

Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to that of its own (Cluster 1).  In other words, each individual's distance to its own cluster mean should be smaller than its distance to the other cluster's mean (which is not the case for individual 3).  Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:
             Individual       Mean Vector (centroid)
Cluster 1    1, 2             (1.3, 1.5)
Cluster 2    3, 4, 5, 6, 7    (3.9, 5.1)

The iterative relocation would now continue from this new partition until no more relocations occur.  However, in this example each individual is now nearer its own cluster mean than that of the other cluster and the iteration stops, choosing the latest partitioning as the final cluster solution.
It is also possible that the k-means algorithm never settles on a final solution.  In that case it is a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
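The whole iterative-relocation procedure, including such an iteration cap, can be sketched as follows (a minimal illustration, not the post's own code; the function name `kmeans` and the `max_iter` parameter are illustrative):

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Iterative relocation: assign every point to its nearest centroid,
    recompute the cluster means, and repeat until no relocations occur
    or a pre-chosen maximum number of iterations is reached."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:      # no point moved: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            cluster = [p for p, l in zip(points, labels) if l == c]
            if cluster:               # keep the old centroid if a cluster empties
                centroids[c] = [sum(v) / len(cluster) for v in zip(*cluster)]
    return labels, centroids

data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
        (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
labels, cents = kmeans(data, seeds=[data[0], data[3]])
print(labels)  # [0, 0, 1, 1, 1, 1, 1]
```

Seeded with subjects 1 and 4 as above, this converges to the same final partition: {1, 2} with centroid (1.3, 1.5) and {3, 4, 5, 6, 7} with centroid (3.9, 5.1).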
