Friday, 1 May 2015

Data Clustering: K-means and Hierarchical Clustering

What is Data Clustering?

* Data Clustering is an unsupervised learning problem

* Given: N unlabeled examples {x1, . . . , xN }; the number of partitions K

* Goal: Group the examples into K partitions



 * The only information clustering uses is the similarity between examples
 * Clustering groups examples based on their mutual similarities
 * A good clustering is one that achieves (one common way to quantify this is sketched below):
    * High within-cluster similarity
    * Low inter-cluster similarity
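
One common measure that captures exactly this trade-off is the silhouette score, which is high when examples are close to their own cluster and far from the other clusters. A minimal scikit-learn sketch on a toy dataset made up purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: three well-separated blobs in 2-D (purely illustrative)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Cluster with K-means, then score the resulting partition
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Silhouette lies in [-1, 1]; higher = tighter, better-separated clusters
print("silhouette:", silhouette_score(X, labels))
```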

Notions of Similarity

* Choice of the similarity measure is very important for clustering
* Similarity is inversely related to distance
* Different ways exist to measure distances. Some examples:
      * Euclidean distance: d(x, z) = ||x − z||
      * Kernelized (non-linear) distance: d(x, z) = ||φ(x) − φ(z)||, where φ is the feature map induced by a kernel
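
A minimal NumPy sketch of both distance measures; the RBF kernel and its gamma parameter below are just one example choice of kernel, not something fixed by the notes:

```python
import numpy as np

def euclidean_distance(x, z):
    # d(x, z) = ||x - z||
    return np.linalg.norm(x - z)

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2), one common choice of kernel
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

def kernelized_distance(x, z, kernel=rbf_kernel):
    # d(x, z) = ||phi(x) - phi(z)||, computed via the kernel trick:
    # ||phi(x) - phi(z)||^2 = k(x, x) - 2 k(x, z) + k(z, z)
    sq = kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)
    return np.sqrt(max(sq, 0.0))  # guard against tiny negative values from rounding

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(euclidean_distance(x, z), kernelized_distance(x, z))
```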


Types of Clustering

Flat or Partitional clustering (K-means, Gaussian mixture models, etc.)
 * Partitions are independent of each other

Hierarchical clustering (e.g., agglomerative clustering, divisive clustering)

* Partitions can be visualized using a tree structure (a dendrogram)
* Does not need the number of clusters as input
* Possible to view partitions at different levels of granularity (i.e., can
    refine/coarsen clusters) by cutting the dendrogram at different levels to get different K

Flat Clustering: K-means algorithm (Lloyd, 1957)

* Input: N examples {x1, . . . , xN} (xn ∈ RD); the number of partitions K
* Initialize: K cluster centers μ1, . . . , μK . Several initialization options:
    * Randomly initialized anywhere in RD
    * Choose any K examples as the cluster centers
* Iterate:
    * Assign each example xn to its closest cluster center:
        Ck = {n : k = arg min_k' ||xn − μk'||²}
      (Ck is the set of examples whose closest center is μk)
    * Recompute each cluster center μk as the mean/centroid of the set Ck
    * Repeat while not converged
* A possible convergence criterion: the cluster centers do not change anymore
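
A minimal NumPy sketch of the algorithm as described above; the function and variable names are my own, and in practice a library implementation such as sklearn.cluster.KMeans would typically be used instead:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Lloyd's algorithm: X is an (N, D) array, K is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize: choose K examples as the initial cluster centers
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each example goes to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its assigned points
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        # Convergence: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Example usage on random 2-D data with two clumps
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
labels, centers = kmeans(X, K=2)
print(centers)
```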

K-means: Limitations

* Makes hard assignments of points to clusters
    * A point either belongs to a cluster completely or does not belong to it at all
    * No notion of a soft assignment (i.e., a probability of being assigned to each
      cluster: say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1)
    * Gaussian mixture models and Fuzzy K-means allow soft assignments
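
For instance, a Gaussian mixture model in scikit-learn returns exactly such per-cluster probabilities; a minimal sketch on made-up 1-D data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data drawn from three clumps (purely illustrative)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100),
                    rng.normal(5, 1, 100),
                    rng.normal(10, 1, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft assignment: one probability per cluster for each point, summing to 1
probs = gmm.predict_proba(X[:1])
print(probs)  # e.g. something like [[0.98, 0.02, 0.00]]
```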

* Sensitive to outlier examples (such examples can affect the mean by a lot)
    * The K-medians algorithm is a more robust alternative for data with outliers
    * Reason: the median is more robust than the mean in the presence of outliers

* Works well only for clusters that are roughly round-shaped and of roughly equal size/density

* Does badly if the clusters have non-convex shapes
    * Spectral clustering or kernelized K-means are alternatives in this case
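
A minimal scikit-learn sketch of this failure mode on the classic two-moons dataset (the specific parameter values are just illustrative choices):

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: non-convex clusters
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               random_state=0).fit_predict(X)

# Agreement with the true grouping (1.0 = perfect); K-means typically does poorly here
print("K-means ARI: ", adjusted_rand_score(y_true, km_labels))
print("Spectral ARI:", adjusted_rand_score(y_true, sc_labels))
```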


Hierarchical Clustering

* Agglomerative (bottom-up) Clustering
     1 Start with each example in its own singleton cluster
     2 At each time-step, greedily merge the 2 most similar clusters
     3 Stop when there is a single cluster of all examples, else go to 2

*Divisive (top-down) Clustering
     1 Start with all examples in the same cluster
     2 At each time-step, remove the “outsiders” from the least cohesive cluster
     3 Stop when each example is in its own singleton cluster, else go to 2


* Agglomerative clustering is simpler and more popular than divisive (but can be less accurate)
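
A minimal sketch of agglomerative clustering with SciPy, where cutting the merge tree (dendrogram) at different levels yields partitions at different granularities; Ward linkage and the toy data are just example choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two clumps (purely illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(6, 1, size=(20, 2))])

# Agglomerative clustering: start from singletons, greedily merge the closest clusters
Z = linkage(X, method="ward")  # Z encodes the full merge tree (dendrogram)

# Cut the tree to get flat partitions at different granularities
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # 2 clusters (coarser)
labels_4 = fcluster(Z, t=4, criterion="maxclust")  # 4 clusters (finer)
print(labels_2)
print(labels_4)
```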



Flat vs Hierarchical Clustering

*Flat clustering produces a single partitioning
* Hierarchical clustering can give different partitionings depending on the
  level of resolution we are looking at
* Flat clustering needs the number of clusters to be specified
* Hierarchical clustering doesn’t need the number of clusters to be specified
* Flat clustering is usually more efficient run-time wise
* Hierarchical clustering can be slow (has to make several merge/split
    decisions)

* No clear consensus on which of the two produces better clustering

Regards
Saurabh Khandelwal
