For a nontrivial elaboration of this model, consider the problem of subpixel-resolution imaging: if you take one digital image of a scene, you're pretty much stuck with the pixels of the image itself (well, you can use Bayesian methods to get some subpixel resolution, but only if you already have a good idea what you're looking at). But if you take several images, you can imagine their pixelations as being different applications of some kind of noise to the real image, and average them to get a higher-resolution image.
A related problem is consensus sequence estimation in molecular biology. In this problem, each observation is a long sequence of characters drawn from a small alphabet; the noise model forms the observed DNA sequences from a common prototype by sequences of mutations, and that prototype is what we want to estimate. Another DNA single-point estimation problem is reconstruction of a whole sequence from fragments, as used in gene mapping systems.
Anyway, those examples have a very high dimension. I'm more interested in very low-dimensional cases, where the parameter values really are geometric coordinates.
In one dimension, there are basically two choices for an estimator: the mean and the median. The median is obviously much less sensitive to outliers. As far as I can tell, even for the most well-behaved error model (Gaussian noise), both the median and the mean have the same asymptotic accuracy (expected distance of the estimate from the true value): O(s/sqrt(n)), where s is the standard deviation of the Gaussian (the median's constant factor is somewhat larger, but the growth rate is the same). Further, the median has better invariance properties (it is distance-free). But it is not hard to come up with noise models where the mean is the better choice of the two. For instance, if you know nothing about the noise other than an a priori assumption that lower variances are more likely, then the mean is the max likelihood estimator (it minimizes the variance of the differences between the observations and the estimate, i.e. their mean squared difference, which is a convex quadratic function of the estimate, as can easily be seen by computing its derivative).
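A quick way to see both claims empirically (a minimal NumPy sketch; the true value is taken to be 0 and the noise scale is an assumed parameter, not something fixed by the discussion above):

    import numpy as np

    rng = np.random.default_rng(0)

    def avg_error(estimator, n, trials=2000, sigma=1.0):
        # average absolute error of the estimator when the true value is 0 and
        # each of the n observations carries Gaussian noise of s.d. sigma
        samples = rng.normal(0.0, sigma, size=(trials, n))
        return float(np.abs(estimator(samples, axis=1)).mean())

    for n in (10, 100, 1000):
        print(n, avg_error(np.mean, n), avg_error(np.median, n))
    # both errors shrink roughly like sigma/sqrt(n); the median's constant is larger

    # a single wild outlier can drag the mean arbitrarily far, but barely moves the median
    data = np.concatenate([rng.normal(0.0, 1.0, 99), [1e6]])
    print(np.mean(data), np.median(data))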
There are various ways of generalizing these ideas to higher dimensions:
Centroid
The most direct generalization of the mean is the centroid, the point whose coordinates are the averages of the corresponding coordinates of the observations; applying the one-dimensional argument in each coordinate shows that it minimizes the sum of squared distances to the observations. Therefore the centroid is the max likelihood estimator for settings with unknown noise and an a priori assumption that lower variance is more likely. It also inherits the accuracy of the one-dimensional mean for a Gaussian noise model (because Gaussian noise has the nice property that it can be generated by treating each coordinate independently).
There is nothing interesting to say about algorithms for computing centroids.
Fermat-Weber Point
More generally, one can define an Lp estimate to be one that minimizes the sum of pth powers of distances of observations from the estimate. The centroid is the L2 estimate; the next most commonly used estimate is the L1 estimate, which minimizes the sum of (unsquared) distances and is known as the Fermat-Weber point.
The uniqueness of any Lp estimate for p>1 follows since the pth power of distance from each observation is a strictly convex function, the sum of strictly convex functions is strictly convex, and a strictly convex function has at most one minimum. Unfortunately this argument doesn't quite work for p=1: if all the points are collinear, and there is an even number of points, then any point on the line segment between the two middle points has an equal sum of distances. But this is the only exception.
I'm not sure of a noise model for
which this is optimal, but since it uses a lower power of distance
it's less sensitive than the centroid to outliers.
It's also commonly used in management science for facility location, e.g. choosing a central store location that minimizes the total travel distance to a population of customers.
There is no good combinatorial algorithm for the Fermat-Weber point,
since its exact location is the solution to a high-degree polynomial
in the observations' coordinates
[B88, CM69].
However, gradient descent methods
converge very rapidly and give good numerical approximations to its
location [W37].
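A minimal sketch of one such iterative scheme, in the spirit of the classical reweighting iteration usually attributed to Weiszfeld (the starting point, iteration count, and tolerance below are arbitrary choices, not prescribed by anything above):

    import numpy as np

    def fermat_weber(points, iters=1000, eps=1e-12):
        # Weiszfeld-style iteration: starting from the centroid, repeatedly replace
        # the current estimate by an average of the observations weighted by
        # inverse distance.  The clamp avoids division by zero when the estimate
        # lands exactly on an observation.
        pts = np.asarray(points, dtype=float)
        x = pts.mean(axis=0)
        for _ in range(iters):
            d = np.maximum(np.linalg.norm(pts - x, axis=1), eps)
            w = 1.0 / d
            x_new = (w[:, None] * pts).sum(axis=0) / w.sum()
            if np.linalg.norm(x_new - x) < eps:
                break
            x = x_new
        return x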
Circumcenter
The limit as p goes to infinity of the pth root of the sum of pth powers of distances is just the maximum of the distances. And so, the limit of the Lp estimators is the Linfinity estimator, which minimizes the maximum distance from the estimate to any data point. If one draws a circle or sphere around the estimate, with radius equal to that distance, it contains all the data points, and no smaller circle or sphere contains the same points, so this estimate is also called the circumcenter.
This is extremely sensitive to outliers and usually does not provide a good estimator. But one can imagine conditions under which it would be the appropriate choice; e.g. if we know nothing about the noise other than that each noisy observation is within some (unknown) bounded radius of the true value, then the circumcenter is the most accurate estimator: its maximum distance from the true value is at most the noise radius, and no estimator can guarantee a better worst-case bound.
There exist several linear time algorithms for computing the circumcenter
of a collection of data points, based on linear-programming type techniques
(the problem is not itself a linear program, but can be solved with
much the same algorithms [G95]).
The current best time bounds have the form
O(d^2 n + f(d) log n) where f is some
subexponential function of the dimension and does not affect the main
term of the running time.
Bernd Gärtner
has a good and well-documented C++ implementation available
[G99]
which can solve problems with hundreds of thousands of points,
up to dimension 20 or 30.
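As a contrast to those exact methods, here is a much simpler and only approximate scheme (a sketch, unrelated to the LP-type algorithms or Gärtner's code): repeatedly move part of the way toward the farthest observation, with a step size that shrinks over time, so that the maximum distance to the observations approaches the optimal circumradius.

    import numpy as np

    def approx_circumcenter(points, steps=1000):
        # Simple iterative approximation of the center of the smallest enclosing
        # ball: start at an arbitrary observation and repeatedly move a 1/(k+1)
        # fraction of the way toward the farthest observation.  The maximum
        # distance from the returned point approaches the optimal circumradius as
        # the number of steps grows.
        pts = np.asarray(points, dtype=float)
        c = pts[0].copy()
        for k in range(1, steps + 1):
            farthest = pts[np.argmax(np.linalg.norm(pts - c, axis=1))]
            c = c + (farthest - c) / (k + 1)
        return c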
Centerpoint
Any point set has a center point, that is, a point x such that any halfspace that doesn't contain x contains at most a fraction d/(d+1) of the observations (that is, at most dn/(d+1) of them). Equivalently, any halfspace containing x contains at least n/(d+1) observations. For a proof, see [ABET98, section 2.5].
This provides a robust distance-free method of estimation: if at most n/(d+1)-1 of the observations are outliers chosen by an adversary to fool our algorithm, then its estimate must be within the convex hull of the remaining points and is therefore not strongly affected by those outliers.
Center points also have applications within computational geometry,
e.g. in finding graph separators, which have further
applications in mesh partitioning, intersection graph construction,
etc [EMT95].
For points in the plane, a centerpoint can be computed in linear
time, and for points in three dimensions the time is O(n polylog(n))
[Citations to be filled in later].
For higher dimensions, the best known time bound is much slower:
O(n^(d+1)). However, there exists a fast approximation
algorithm [Citation to be filled in later].
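None of those algorithms fits in a few lines, but the definition itself does. Here is a brute-force sketch for the planar case (points assumed in general position; the tie-breaking rotation angle is an arbitrary small constant) that computes the Tukey depth of a candidate point, i.e. the minimum number of observations in any closed halfplane containing it, and so can check whether a claimed centerpoint really meets the n/(d+1) bound:

    import numpy as np

    def tukey_depth_2d(x, points, tiny=1e-7):
        # Tukey (halfspace) depth of x: the minimum number of observations in any
        # closed halfplane containing x.  The minimizing halfplane can be assumed
        # to have x on its boundary, and only which observations it separates
        # matters, so it suffices to test directions perpendicular to each
        # observation-to-x vector, rotated by a tiny angle both ways to break ties.
        x = np.asarray(x, dtype=float)
        v = np.asarray(points, dtype=float) - x
        angles = np.arctan2(v[:, 1], v[:, 0])
        depth = len(v)
        for a in angles:
            for da in (tiny, -tiny):
                u = np.array([np.cos(a + np.pi / 2 + da), np.sin(a + np.pi / 2 + da)])
                for s in (1.0, -1.0):
                    depth = min(depth, int(np.sum(v @ (s * u) >= 0.0)))
        return depth

    def is_centerpoint_2d(x, points):
        # in the plane (d = 2), x is a centerpoint iff every closed halfplane
        # containing x contains at least n/3 of the observations
        return tukey_depth_2d(x, points) >= len(points) / 3.0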
Smallest subset methods (least median of squares)
If one takes as the primary aim robustness, one can do even better than the centerpoint. The maximum number of outliers one could hope to detect is floor((n-1)/2), because if more than half the observations could be outliers, there wouldn't be any way in principle to distinguish the outlying half of the data from the unperturbed half. But one can tolerate exactly this many outliers, if one uses a method that finds the subset of ceiling((n+1)/2) points that minimizes any of various functionals: e.g. the circumradius, variance, or convex hull area. The non-outlying observations form a small subset under each of these functionals, and any other subset must include at least one non-outlier and so can only be as small if it is also near the unperturbed data.
A good question for discussion: what functional is most appropriate to optimize (for what noise models)?
The version in which one minimizes the circumradius is also called the least median of squares, because it minimizes the median of the (squared) distances of the observations from the estimate.
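In one dimension this estimator is easy to compute exactly (a small NumPy sketch, using the subset size ceiling((n+1)/2) discussed above): sort the observations, find the shortest window of that many consecutive values, and return its midpoint.

    import numpy as np

    def lms_location_1d(data):
        # Least median of squares location estimate in one dimension: the midpoint
        # of the shortest interval containing ceil((n+1)/2) of the observations.
        # Half the length of that interval is the minimized "median" radius.
        x = np.sort(np.asarray(data, dtype=float))
        n = len(x)
        h = (n + 2) // 2                        # ceil((n+1)/2) points per window
        widths = x[h - 1:] - x[:n - h + 1]      # width of each window of h points
        i = int(np.argmin(widths))
        return (x[i] + x[i + h - 1]) / 2.0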
Various algorithms are known for these problems.
Most such work has gone into variants where the cardinality of the
selected subset is very small (in which case this is better viewed
as a form of clustering than of robust estimation)
[EE94]
or very large (only tolerant of a small number of outliers).
It is often possible to show that, for problems like this, the optimal solution consists of the k nearest neighbors of some estimated center point. For instance, in the least median of squares problem, the optimal subset consists of the k nearest neighbors of that subset's circumcenter. Similarly, the k-element subset minimizing the variance consists of the k points nearest that subset's centroid. Obviously, we don't know this center point in advance, or we wouldn't need to use an estimation algorithm. But one can use this idea to limit the number of subsets to examine.
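As a small illustration of this idea (a heuristic sketch only, not the exact method: here the candidate centers are restricted to the observations themselves, whereas the exact algorithms below enumerate the candidate neighborhoods via the order-k Voronoi diagram):

    import numpy as np

    def min_variance_subset(points, k):
        # Heuristic: by the structural fact above, the optimal subset is the set of
        # k nearest neighbors of some center point.  Trying only the observations
        # themselves as candidate centers may miss the optimum, but limits the
        # number of subsets examined to n.
        pts = np.asarray(points, dtype=float)
        best_subset, best_var = None, np.inf
        for c in pts:
            idx = np.argsort(np.linalg.norm(pts - c, axis=1))[:k]
            subset = pts[idx]
            var = np.mean(np.sum((subset - subset.mean(axis=0)) ** 2, axis=1))
            if var < best_var:
                best_var, best_subset = var, subset
        return best_subset, best_var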
The "order-k Voronoi diagram" essentially encapsulates the family of
possible subsets: it is defined as a partition of space into cells,
such that any two points in the same cell have the same set of k nearest
neighbors. In the plane, this diagram is known to have O(kn) cells,
and can be constructed in time
O(n log n + kn 2^O(log* n))
[R99].
The variance of any one cell can be updated from its neighbor in
constant time, so the minimum variance k-element subset can
be found in the same time bound -- nearly quadratic for the
most statistically relevant case, when k is roughly n/2.
By only computing these diagrams for certain subsets of points,
one can remove the n 2^O(log* n) factor
when k is a little smaller than n [EE94].
The same idea works for least median of squares, but is slower
because it's more difficult to update the circumradius as one moves
from cell to cell
[E92].
Instead, the fastest known algorithms
use parametric search (a complicated variant of binary search)
based on a decision algorithm which tests a given circle radius
by finding the largest subset of points that fit within that radius.
The decision problem can be solved by drawing a circle of the given
radius around each point, building the arrangement of these circles,
and computing how many circles surround each cell
(in constant time per cell, by adding or subtracting one from the value
of a neighboring cell).
The best time bounds (again, for k roughly n/2, with some improvement
possible for smaller k)
are O(n^2 log n) for d=2
and O(n^d log^2 n) for d>2
[EE94].
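For intuition, the decision subroutine can also be written as a simple brute force, at the cost of a worse running time than the arrangement-based method just described: if some circle of radius r contains at least two of the points, it can be shifted so that two of them lie on its boundary, so it suffices to try the candidate centers determined by pairs of points. A rough sketch:

    import numpy as np

    def max_points_in_radius(points, r):
        # Largest number of observations that fit in some circle of radius r
        # (brute force, roughly O(n^3)).  If the best circle contains >= 2 points,
        # it can be shifted so two of them lie on its boundary, so it suffices to
        # try, for each pair at distance <= 2r, the (at most two) centers of
        # radius-r circles passing through both points.
        pts = np.asarray(points, dtype=float)
        n = len(pts)
        best = 1 if n > 0 else 0
        for i in range(n):
            for j in range(i + 1, n):
                d2 = float(np.sum((pts[i] - pts[j]) ** 2))
                if d2 > 4.0 * r * r:
                    continue
                mid = (pts[i] + pts[j]) / 2.0
                h = np.sqrt(max(r * r - d2 / 4.0, 0.0))   # center offset from midpoint
                perp = np.array([pts[j][1] - pts[i][1], pts[i][0] - pts[j][0]])
                norm = np.linalg.norm(perp)
                if norm > 0:
                    perp = perp / norm
                for c in (mid + h * perp, mid - h * perp):
                    inside = int(np.sum(np.linalg.norm(pts - c, axis=1) <= r + 1e-9))
                    best = max(best, inside)
        return best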
What distance function to use?
All this has assumed that the distance between two points should be measured using the normal Euclidean distance, in which we square the difference in each coordinate, sum the squares, and take the square root of the sum. But it seems that this assumption should be based on our choice of noise model.
If we assume a priori some centrally symmetric noise distribution, or if we assume that the noise in each coordinate is an independent Gaussian (in which case the overall distribution ends up being centrally symmetric), then Euclidean distance makes sense.
Otherwise, it seems that we should choose the distance function
from p to q
to be determined by the probability that a data point actually at p
could be perturbed so that we measure it at q.
There is no reason to expect such a distance function to obey
the axioms of a metric or in any other way be well behaved.
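One way to make this concrete (a sketch only; the noise model's log density is an assumed input, not something specified here): define the "distance" from a hypothesized true point p to an observation q as the negative log likelihood of the perturbation q - p, so that minimizing the summed distances over the observations is exactly maximum-likelihood estimation. Isotropic Gaussian noise then recovers squared Euclidean distance.

    import numpy as np

    def noise_model_distance(p, q, log_density):
        # "distance" from a hypothesized true point p to an observation q under a
        # given noise model: the negative log density of the perturbation q - p.
        # Summing these over the observations and minimizing over p is then
        # exactly maximum-likelihood point estimation.
        return -log_density(np.asarray(q, dtype=float) - np.asarray(p, dtype=float))

    def gaussian_log_density(v, sigma=1.0):
        # isotropic Gaussian noise: up to a factor 1/(2 sigma^2) and an additive
        # constant, the resulting "distance" is the squared Euclidean distance
        v = np.asarray(v, dtype=float)
        return -np.dot(v, v) / (2.0 * sigma ** 2) - len(v) * np.log(sigma * np.sqrt(2.0 * np.pi))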
Perhaps the fact that they avoid having to choose a distance
function is one of the things that makes distance-free methods
such as the centerpoint attractive.
David Eppstein, Theory Group, Dept. Information & Computer Science, UC Irvine.
NEXT: Linear regression
Last update: