Friday, 1 May 2015

Data Clustering: K-means and Hierarchical Clustering

What is Data Clustering?

* Data Clustering is an unsupervised learning problem

* Given: N unlabeled examples {x1, . . . , xN }; the number of partitions K

* Goal: Group the examples into K partitions



 * The only information clustering uses is the similarity between examples
 * Clustering groups examples based on their mutual similarities
 * A good clustering is one that achieves (one common way to quantify this is sketched below):
    * High within-cluster similarity
    * Low inter-cluster similarity
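
One common measure that captures exactly this trade-off is the silhouette score, which is high when examples are close to their own cluster and far from the other clusters. A minimal scikit-learn sketch on a toy dataset made up purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: three well-separated blobs in 2-D (purely illustrative)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Cluster with K-means, then score the resulting partition
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Silhouette lies in [-1, 1]; higher = tighter, better-separated clusters
print("silhouette:", silhouette_score(X, labels))
```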

Notions of Similarity

* Choice of the similarity measure is very important for clustering
* Similarity is inversely related to distance
* Different ways exist to measure distances. Some examples:
      * Euclidean distance: d(x, z) = ||x − z||
      * Kernelized (non-linear) distance: d(x, z) = ||φ(x) − φ(z)||, where φ is the feature map induced by a kernel
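
A minimal NumPy sketch of both distance measures; the RBF kernel and its gamma parameter below are just one example choice of kernel, not something fixed by the notes:

```python
import numpy as np

def euclidean_distance(x, z):
    # d(x, z) = ||x - z||
    return np.linalg.norm(x - z)

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2), one common choice of kernel
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

def kernelized_distance(x, z, kernel=rbf_kernel):
    # d(x, z) = ||phi(x) - phi(z)||, computed via the kernel trick:
    # ||phi(x) - phi(z)||^2 = k(x, x) - 2 k(x, z) + k(z, z)
    sq = kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)
    return np.sqrt(max(sq, 0.0))  # guard against tiny negative values from rounding

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(euclidean_distance(x, z), kernelized_distance(x, z))
```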


Types of Clustering

Flat or Partitional clustering (K-means, Gaussian mixture models, etc.)
 * Partitions are independent of each other

Hierarchical clustering (e.g., agglomerative clustering, divisive clustering)

* Partitions can be visualized using a tree structure (a dendrogram)
* Does not need the number of clusters as input
* Possible to view partitions at different levels of granularity (i.e., can
    refine/coarsen clusters) by cutting the dendrogram at different levels to get different K

Flat Clustering: K-means algorithm (Lloyd, 1957)

* Input: N examples {x1, . . . , xN} (xn ∈ RD); the number of partitions K
* Initialize: K cluster centers μ1, . . . , μK . Several initialization options:
    * Randomly initialized anywhere in RD
    * Choose any K examples as the cluster centers
* Iterate:
    * Assign each example xn to its closest cluster center:
        Ck = {n : k = arg min_k' ||xn − μk'||²}
      (Ck is the set of examples whose closest center is μk)
    * Recompute each cluster center μk as the mean/centroid of the set Ck
    * Repeat while not converged
* A possible convergence criterion: the cluster centers do not change anymore
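
A minimal NumPy sketch of the algorithm as described above; the function and variable names are my own, and in practice a library implementation such as sklearn.cluster.KMeans would typically be used instead:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Lloyd's algorithm: X is an (N, D) array, K is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize: choose K examples as the initial cluster centers
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each example goes to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its assigned points
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        # Convergence: stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Example usage on random 2-D data with two clumps
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
labels, centers = kmeans(X, K=2)
print(centers)
```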

K-means: Limitations

* Makes hard assignments of points to clusters
    * A point either belongs to a cluster completely or does not belong to it at all
    * No notion of a soft assignment (i.e., a probability of being assigned to each
      cluster: say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1)
    * Gaussian mixture models and Fuzzy K-means allow soft assignments
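
For instance, a Gaussian mixture model in scikit-learn returns exactly such per-cluster probabilities; a minimal sketch on made-up 1-D data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data drawn from three clumps (purely illustrative)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 100),
                    rng.normal(5, 1, 100),
                    rng.normal(10, 1, 100)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft assignment: one probability per cluster for each point, summing to 1
probs = gmm.predict_proba(X[:1])
print(probs)  # e.g. something like [[0.98, 0.02, 0.00]]
```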

* Sensitive to outlier examples (such examples can affect the mean by a lot)
    * The K-medians algorithm is a more robust alternative for data with outliers
    * Reason: the median is more robust than the mean in the presence of outliers

* Works well only for clusters that are roughly round-shaped and of roughly equal size/density

* Does badly if the clusters have non-convex shapes
    * Spectral clustering or kernelized K-means are alternatives in this case
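
A minimal scikit-learn sketch of this failure mode on the classic two-moons dataset (the specific parameter values are just illustrative choices):

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: non-convex clusters
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               random_state=0).fit_predict(X)

# Agreement with the true grouping (1.0 = perfect); K-means typically does poorly here
print("K-means ARI: ", adjusted_rand_score(y_true, km_labels))
print("Spectral ARI:", adjusted_rand_score(y_true, sc_labels))
```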


Hierarchical Clustering

* Agglomerative (bottom-up) Clustering
     1 Start with each example in its own singleton cluster
     2 At each time-step, greedily merge the 2 most similar clusters
     3 Stop when there is a single cluster of all examples, else go to 2

*Divisive (top-down) Clustering
     1 Start with all examples in the same cluster
     2 At each time-step, remove the “outsiders” from the least cohesive cluster
     3 Stop when each example is in its own singleton cluster, else go to 2


* Agglomerative clustering is simpler and more popular than divisive (but can be less accurate)
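
A minimal sketch of agglomerative clustering with SciPy, where cutting the merge tree (dendrogram) at different levels yields partitions at different granularities; Ward linkage and the toy data are just example choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two clumps (purely illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(6, 1, size=(20, 2))])

# Agglomerative clustering: start from singletons, greedily merge the closest clusters
Z = linkage(X, method="ward")  # Z encodes the full merge tree (dendrogram)

# Cut the tree to get flat partitions at different granularities
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # 2 clusters (coarser)
labels_4 = fcluster(Z, t=4, criterion="maxclust")  # 4 clusters (finer)
print(labels_2)
print(labels_4)
```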



Flat vs Hierarchical Clustering

*Flat clustering produces a single partitioning
* Hierarchical clustering can give different partitionings depending on the
  level of resolution we are looking at
* Flat clustering needs the number of clusters to be specified
* Hierarchical clustering doesn’t need the number of clusters to be specified
* Flat clustering is usually more efficient run-time wise
* Hierarchical clustering can be slow (has to make several merge/split
    decisions)

* No clear consensus on which of the two produces better clustering

Regards
Saurabh Khandelwal
