Most naturally, perhaps, the model can resemble the one for single point estimation: there are some number k of actual data points, and for each observation the data model chooses one of them; the noise model then adds an error to the chosen point. We should estimate the locations of the k points by partitioning the observations somehow and then applying a single point estimation method to each subset.
Alternatively, the data model can be that there are k different overlayed random distributions of points (such as Gaussians), the noise model merely samples random points from each distribution, and the clustering task is to infer the distributions' parameters. The difference with this second model is that it no longer makes sense to sharply classify which point is part of which cluster; the best one can do is estimate the likelihood that a given point belongs to a given cluster. For instance, if the data form two crossed Gaussians, a point in the central crossing region might be equally likely to belong to either distribution.
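To make this soft classification concrete (writing f_1, ..., f_k for the component densities and w_1, ..., w_k for their mixing weights, notation of my own choosing rather than anything fixed by the discussion above), the probability that a data point x belongs to cluster i is

    \Pr[\mathrm{cluster}\ i \mid x] = \frac{w_i f_i(x)}{\sum_{j=1}^{k} w_j f_j(x)}.

In the crossed-Gaussian example, a point near the crossing has f_1(x) roughly equal to f_2(x), so both membership probabilities are close to 1/2.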
Despite these principles, however, many clustering algorithms use ad hoc ideas of choosing a partition that optimizes some functional or other, without regard for how it fits into a data model. Even principled approaches such as K-means (which can be viewed as based on max likelihood ideas) can't be proven to converge to the correct clusters or even to find the global max likelihood solution.
The basic idea is to maintain two estimates: an estimate of the center locations for each cluster, and a separate estimate of the partition of the data points according to which one goes into which cluster. Then, one estimate can be used to refine the other.
If we have an estimate of the center locations, then (with reasonable prior assumptions) the max likelihood solution is that each data point should belong to the cluster with the nearest center. Here "nearest" should be measured according to a distance determined by the noise model, as in single point estimation, but in practice is almost always taken to be Euclidean distance. Therefore, from a set of center locations we can compute a new partition: form the Voronoi diagram of the centers, that is, the partition of space into the regions nearest each center, and make a cluster from the set of data points in each Voronoi cell.
Conversely, if we have a partition of the data points into clusters, the max likelihood estimate of the center locations reduces to k independent single point estimation problems; if the log-likelihood is proportional to the sum of squared distances (as for Gaussian noise), the max likelihood estimator is just the centroid of each cluster.
Therefore the K-means algorithm proceeds by a sequence of phases in which it alternates between moving each data point to the cluster of its nearest center, and moving each center to the centroid of its cluster. There are variants depending on whether points are moved one at a time or in batches; Kanungo et al. [KMNPSW99] have looked at applying computational geometry data structures to speed up each iteration of the algorithm.
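A minimal sketch of the batch variant (Lloyd's algorithm) in Python; the random initialization, the convergence test, and the treatment of empty clusters are arbitrary choices of this sketch rather than part of any particular published version:

    import numpy as np

    def k_means(points, k, iterations=100, seed=0):
        """Batch K-means: alternate nearest-center assignment and centroid moves."""
        points = np.asarray(points, dtype=float)
        rng = np.random.default_rng(seed)
        n, _ = points.shape
        # Arbitrary initialization: k distinct data points chosen at random.
        centers = points[rng.choice(n, size=k, replace=False)]
        for _ in range(iterations):
            # Assignment step: each point joins the cluster of its nearest
            # center (Euclidean distance), i.e. its cell of the Voronoi diagram.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each center to the centroid of its cluster.
            new_centers = centers.copy()
            for i in range(k):
                members = points[labels == i]
                if len(members) > 0:      # leave a center with no points alone
                    new_centers[i] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break                     # no center moved: assignments are stable
            centers = new_centers
        return centers, labels

Calling k_means(points, 3) on a two-dimensional array of observations returns the final centers together with the cluster label of each point.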
For example, the two-center problem seeks to find a partition into two subsets that minimizes the maximum circumradius. One can come up with models for which this is a max-likelihood solution (e.g. the noise model is that the points may be moved arbitrarily within a bounded but unknown radius, with larger radii being less likely than smaller ones) but this seems to be working in the wrong direction: one should start with a noise model and derive algorithms, not vice versa. Since each point may safely be clustered with the nearer of the two circumcenters, the optimal partition is formed by the Voronoi diagram of those two centers, which is just a line. Early two-center algorithms found the optimal partition by testing all O(n^2) ways of dividing the data points by a line. More recently, it has been discovered that the problem can be solved in time O(n polylog(n)) by more sophisticated geometric searching algorithms [E97].
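A naive sketch of the line-testing approach in Python (nothing like the sophisticated O(n polylog(n)) algorithms of [E97]; the candidate-direction enumeration and the brute-force smallest-enclosing-circle routine are simplifications of my own):

    from itertools import combinations
    import math

    def circumradius(pts):
        """Radius of the smallest enclosing circle of pts, by brute force over
        the candidate circles determined by two or three of the points."""
        if len(pts) < 2:
            return 0.0
        candidates = []
        for (ax, ay), (bx, by) in combinations(pts, 2):
            candidates.append((((ax + bx) / 2, (ay + by) / 2),
                               math.dist((ax, ay), (bx, by)) / 2))
        for (ax, ay), (bx, by), (cx, cy) in combinations(pts, 3):
            d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
            if abs(d) < 1e-12:
                continue  # collinear triple: no circumcircle
            ux = ((ax*ax + ay*ay)*(by - cy) + (bx*bx + by*by)*(cy - ay)
                  + (cx*cx + cy*cy)*(ay - by)) / d
            uy = ((ax*ax + ay*ay)*(cx - bx) + (bx*bx + by*by)*(ax - cx)
                  + (cx*cx + cy*cy)*(bx - ax)) / d
            candidates.append(((ux, uy), math.dist((ux, uy), (ax, ay))))
        return min(r for c, r in candidates
                   if all(math.dist(c, p) <= r + 1e-9 for p in pts))

    def two_center(pts):
        """Minimize the larger of the two circumradii over all partitions of
        pts by a line, testing prefixes of the points sorted along slightly
        perturbed directions normal to lines through pairs of points."""
        best_r, best_partition = float('inf'), None
        for (px, py), (qx, qy) in combinations(pts, 2):
            base = math.atan2(qy - py, qx - px) + math.pi / 2
            for angle in (base - 1e-7, base + 1e-7):
                dx, dy = math.cos(angle), math.sin(angle)
                order = sorted(pts, key=lambda p: p[0] * dx + p[1] * dy)
                for cut in range(1, len(order)):
                    r = max(circumradius(order[:cut]), circumradius(order[cut:]))
                    if r < best_r:
                        best_r, best_partition = r, (order[:cut], order[cut:])
        return best_r, best_partition

The enumeration tries every direction obtained by slightly perturbing the normal to a line through two data points, and every prefix of the points sorted along that direction; this covers every partition of the points by a line, at a cost polynomial in n but far above the bounds quoted above.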
The problem of minimizing the sum of distances from each point to its center is known as the 2-median, since it generalizes the one-dimensional median problem (which minimizes the sum of distances to the single center). Again, the optimal partition is by a line, so a fast algorithm exists. Other problems for which such a line partition works, and for which fast exact algorithms are therefore known, include minimizing the sum (or any monotone combination) of the circumradii [E92].
The problem of finding a partition that minimizes the sum of intra-cluster distances (or, equivalently, maximizes the sum of inter-cluster distances) is known as the "Euclidean max cut". The partition is always formed by a circle [Schulman?] and so the optimal solution can be found in polynomial time by testing all O(n^3) ways of dividing the data points by a circle.
Other two-center-like problems for which efficient algorithms are known include finding partitions that minimize the maximum radius, enclosing square size, or enclosing rectangle area, perimeter, or diameter [HS91], that minimize the sum of the two cluster diameters [H92], and that minimize the maximum width of the two clusters [AS94, GKS98]. This last problem can be thought of as one of clustering the data along two lines rather than two points (with the L-infinity criterion used within each cluster).
I vaguely recall that Smyth has done some work on the problem of inferring the parameters of a set of overlayed Gaussians, using some kind of Bayesian approach. References and details?
There are several ways of defining the problem formally. From the Bayesian point of view, one should attempt to find the set D of distributions that maximizes the log-likelihood of the observed data.
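One standard way to write this objective (assuming the data points x_1, ..., x_n are independent samples from a mixture of the distributions D_1, ..., D_k in D, with mixing weights w_1, ..., w_k included among the parameters to be estimated; the weights are my addition) is

    \log L(D) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} w_j \Pr[x_i \mid D_j].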
One important application of the overlayed distribution model is in the belief-propagation approach to coding theory: one assumes that message bits that are sent out as signals with two discrete values (0 and 1) come back as values drawn from two random distributions. One needs to estimate these distributions in order to derive beliefs (Bayesian probabilities) about which measurements are 0's and 1's; the beliefs are then modified by a belief network to perform error correction. This problem fits very well into the discrete algorithmic approach, since the number of data points (message length) is high, and (at least in the simplest cases) one only needs to find two clusters in one-dimensional data. However, I don't know of any theoretical-CS work on this sort of problem.
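As a sketch of the belief computation in Python (assuming, purely for illustration, that the two received-value distributions are Gaussians whose parameters have already been estimated, and that 0 and 1 are equally likely a priori):

    import math

    def bit_beliefs(values, mu0, sigma0, mu1, sigma1):
        """For each received value, the posterior probability that a 1 was sent,
        assuming Gaussian channel distributions and equal priors on 0 and 1."""
        def density(x, mu, sigma):
            return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        return [density(x, mu1, sigma1) / (density(x, mu0, sigma0) + density(x, mu1, sigma1))
                for x in values]

These per-bit beliefs are what the belief network then refines; the clustering step proper is the preceding problem of estimating the two distributions' parameters from the received values alone.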
The work mentioned above on finding a clustering that minimizes the maximum cluster width can also be viewed as a problem in which the answer consists of overlayed distributions (two infinite strips that together cover the point set).
Define a range space to consist of a "universe" U of objects (often but not always points in R^d) together with a family F of subsets of U (for instance, halfspaces of R^d). We'll call a set R in F a "range". For any finite set S (not necessarily in F), and any positive value epsilon, define an epsilon-net for S to be a subset N of S such that any range R containing at least an epsilon fraction of the points of S contains at least one point of N. Similarly, define an epsilon-approximation for S to be a subset A of S such that, for every range R, the fraction of the points of A that lie in R differs from the fraction of the points of S that lie in R by at most epsilon.
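To make the two definitions concrete, a small Python checker, with ranges represented as explicit sets (an illustration only; actual range spaces such as halfspaces are infinite families that must be handled combinatorially):

    def is_epsilon_net(S, N, ranges, epsilon):
        """N hits every range containing at least an epsilon fraction of S."""
        return all(R & N for R in ranges if len(R & S) >= epsilon * len(S))

    def is_epsilon_approximation(S, A, ranges, epsilon):
        """Each range gets nearly the same fraction of A as of S."""
        return all(abs(len(R & A) / len(A) - len(R & S) / len(S)) <= epsilon
                   for R in ranges)

Here S, N, A, and each range are Python sets, with N and A subsets of S.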
Specifically, let F_S denote the family of sets formed by intersecting ranges with S, and define the "scaffold dimension" of a range space to be the maximum, over finite subsets S of the universe with at least two elements, of log |F_S| / log |S|; equivalently, it is the smallest d for which every such S satisfies |F_S| <= |S|^d.
Theorem: For any set S in a range space with scaffold dimension d, there exists an epsilon-net with size O(d/epsilon log(d/epsilon)) and an epsilon-approximation with size O(d/epsilon^2 log(d/epsilon)).
The proof begins by observing that a random sample of size O(1/epsilon log |F_S|) or O(1/epsilon^2 log |F_S|) is with constant probability an epsilon-net or epsilon-approximation respectively. Also, an epsilon/2-approximation of an epsilon/2-approximation is itself an epsilon-approximation. So, by induction, one can assume that there is a small epsilon/2-approximation, take a random sample of it, and get (with constant probability) an even smaller epsilon-approximation. Then once the result for epsilon-approximations is proven, one can take a random sample of a small epsilon-approximation to get a small epsilon-net. Although the above construction is randomized, there exist deterministic algorithms for constructing epsilon-nets and epsilon-approximations, with time bounds of the form O(n) whenever d and 1/epsilon are bounded by O(1) [citation to be filled in later].
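A sketch of the random-sampling step in the simplest setting, points on the real line with intervals as ranges (so that |F_S| = O(|S|^2) and the scaffold dimension is at most 2); the constant 20 in the sample size and the brute-force verification over all data-determined intervals are choices made only for this sketch:

    import math, random

    def random_epsilon_approximation(points, epsilon, d=2, seed=0):
        """Random sample of the size given by the theorem; with constant
        probability it is an epsilon-approximation for interval ranges."""
        rng = random.Random(seed)
        size = int(20 * (d / epsilon**2) * math.log(d / epsilon))
        return rng.sample(points, min(len(points), max(1, size)))

    def check_intervals(S, A, epsilon):
        """Brute-force check of the epsilon-approximation property, using only
        intervals with endpoints at data points (since A is a subset of S,
        these realize every distinct pair of intersections)."""
        pts = sorted(S)
        for i in range(len(pts)):
            for j in range(i, len(pts)):
                lo, hi = pts[i], pts[j]
                frac_S = sum(lo <= x <= hi for x in S) / len(S)
                frac_A = sum(lo <= x <= hi for x in A) / len(A)
                if abs(frac_S - frac_A) > epsilon:
                    return False
        return True

The checker is quadratic in the number of data-determined intervals and is meant only for small examples.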
It would be reasonable to use an epsilon-net for the range space of disks (or ellipses) as a form of clustering: the definitions above would imply that every disk-shaped or elliptical cluster of more than epsilon*n points would have a representative.
As I mentioned above, epsilon-nets and epsilon-approximations have been used extensively in geometric algorithms. As an example of a statistical application, consider the regression depth problem: the appropriate range space is the family of double wedges bounded by one vertical and one non-vertical hyperplane. The regression depth of a given hyperplane (measured as a fraction of the size of the overall data set) is within epsilon of its regression depth relative to an epsilon-approximation of the data. Therefore, after linear-time preprocessing (computing an epsilon-approximation), we can compute good approximations to the regression depth of any query hyperplane in constant time. Further, the deepest hyperplane with respect to the approximation must have depth within epsilon of that of the deepest hyperplane for the overall data set, so we can approximate the depth of the deepest hyperplane to within any (1-epsilon) factor in linear total time [SW98].
Another clustering-like problem is to find a single cluster among a set of points, ignoring other less tightly clustered points. Essentially, we discussed this already under robust single-point estimation, but the emphasis is different (here we might allow many more than n/2 outliers) so algorithmically one can optimize for the case when the number of points in the cluster is small.