What is Data Clustering?
* Data Clustering is an unsupervised learning problem
* Given: N unlabeled examples {x1, . . . , xN }; the number of partitions K
* Goal: Group the examples into K partitions
* The only information clustering uses is the similarity between examples
* Clustering groups examples based on their mutual similarities
* A good clustering is one that achieves:
* High within-cluster similarity
* Low inter-cluster similarity
Notions of Similarity
* Choice of the similarity measure is very important for clustering
* Similarity is inversely related to distance
* Different ways exist to measure distances. Some examples:
* Euclidean distance: d(x, z) = ||x − z||
* Kernelized (non-linear) distance: d(x, z) = ||φ(x) − φ(z)||, where φ is the feature map induced by a kernel
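To make these two notions concrete, here is a minimal NumPy sketch (the RBF kernel and its gamma parameter are illustrative choices, not fixed by the notes). The kernelized distance is computed with the kernel trick, k(x, x) − 2 k(x, z) + k(z, z), so the feature map φ never has to be formed explicitly.

```python
import numpy as np

def euclidean_distance(x, z):
    """Euclidean distance d(x, z) = ||x - z||."""
    return np.linalg.norm(x - z)

def rbf_kernel(x, z, gamma=1.0):
    """RBF (Gaussian) kernel k(x, z) = exp(-gamma * ||x - z||^2); gamma is illustrative."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernelized_distance(x, z, kernel=rbf_kernel):
    """Distance in feature space, ||phi(x) - phi(z)||, via the kernel trick
    (no explicit phi needed): sqrt(k(x,x) - 2 k(x,z) + k(z,z))."""
    return np.sqrt(kernel(x, x) - 2 * kernel(x, z) + kernel(z, z))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(euclidean_distance(x, z))   # ~2.236
print(kernelized_distance(x, z))  # distance in the RBF-induced feature space
```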
Types of Clustering
Flat or Partitional clustering (K-means, Gaussian mixture models, etc.)
* Partitions are independent of each other
Hierarchical clustering (e.g., agglomerative clustering, divisive clustering)
* Partitions can be visualized using a tree structure (a dendrogram)
* Does not need the number of clusters as input
* Possible to view partitions at different levels of granularity (i.e., can
refine/coarsen clusters) using different K
Flat Clustering: K-means algorithm (Lloyd, 1957)
* Input: N examples {x1, . . . , xN} (xn ∈ R^D); the number of partitions K
* Initialize: K cluster centers μ1, . . . , μK. Several initialization options:
* Randomly initialize them anywhere in R^D
* Choose any K examples as the cluster centers
* Iterate:
* Assign each example xn to its closest cluster center
Ck = {n : k = arg min_j ||xn − μj||^2}
(Ck is the set of examples closest to μk)
* Recompute the new cluster centers μk (mean/centroid of the set Ck)
* Repeat while not converged
* A possible convergence criterion: the cluster centers do not change anymore (a minimal sketch follows below)
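A minimal NumPy sketch of Lloyd's algorithm as described above; the function name, the iteration cap, and the handling of empty clusters are my own illustrative choices.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm: X is an (N, D) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize: choose K examples as the initial cluster centers
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each example goes to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its assigned examples
        # (keep the old center if a cluster happens to become empty)
        new_centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])
        # Converged when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```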
K-means: Limitations
* Makes hard assignments of points to clusters
* A point either completely belongs to a cluster or does not belong at all
* No notion of a soft assignment (i.e., a probability of being assigned to each
cluster: say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1)
* Gaussian mixture models and Fuzzy K-means allow soft assignments (see the sketch after this list)
* Sensitive to outlier examples (such examples can affect the mean by a lot)
* The K-medians algorithm is a more robust alternative for data with outliers
* Reason: the median is more robust than the mean in the presence of outliers
* Works well only for clusters that are roughly round-shaped and of roughly equal size/density
* Does badly if the clusters have non-convex shapes
* Spectral clustering or kernelized K-means can be an alternative
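As a concrete illustration of soft assignments, here is a short sketch using scikit-learn's GaussianMixture; the synthetic data and all parameter values are made up purely for illustration and are not part of the notes.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three synthetic blobs, purely illustrative data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=2.5, scale=0.5, size=(50, 2))])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft assignments: one probability per cluster for each point,
# e.g. something like [0.7, 0.2, 0.1] for a point near a cluster boundary
probs = gmm.predict_proba(X)
print(probs[0])        # probabilities over the 3 components
print(probs[0].sum())  # sums to 1.0
```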
Hierarchical Clustering
* Agglomerative (bottom-up) Clustering
1 Start with each example in its own singleton cluster
2 At each time-step, greedily merge 2 most similar clusters
3 Stop when there is a single cluster of all examples, else go to 2
* Divisive (top-down) Clustering
1 Start with all examples in the same cluster
2 At each time-step, remove the “outsiders” from the least cohesive cluster
3 Stop when each example is in its own singleton cluster, else go to 2
* Agglomerative is simpler and more popular than divisive (but can be less accurate); a minimal agglomerative sketch follows below
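A minimal agglomerative sketch using SciPy's hierarchical-clustering routines; the synthetic data, the 'average' linkage choice, and the cut levels are illustrative assumptions, not part of the notes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two synthetic blobs, purely illustrative data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])

# Agglomerative clustering: start from singletons and repeatedly merge the
# two most similar clusters; 'average' linkage measures cluster similarity
# as the average pairwise distance (one of several possible choices)
Z = linkage(X, method='average')

# The same tree (dendrogram) can be cut at different levels of granularity
labels_coarse = fcluster(Z, t=2, criterion='maxclust')  # 2 clusters
labels_fine = fcluster(Z, t=4, criterion='maxclust')    # 4 clusters
print(labels_coarse[:5], labels_fine[:5])
```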
Flat vs Hierarchical Clustering
* Flat clustering produces a single partitioning
* Hierarchical clustering can give different partitionings depending on the
level of resolution we are looking at
* Flat clustering needs the number of clusters to be specified
* Hierarchical clustering doesn’t need the number of clusters to be specified
* Flat clustering is usually more efficient in terms of run time
* Hierarchical clustering can be slow (it has to make several merge/split
decisions)
* No clear consensus on which of the two produces better clustering
Regards
Saurabh Khandelwal