For a nontrivial elaboration of this model, consider the problem of subpixel-resolution imaging: if you take one digital image of a scene, you're pretty much stuck with the pixels of the image itself (well, you can use Bayesian methods to get some subpixel resolution, but only if you already have a good idea what you're looking at). But if you take several images, you can imagine their pixelations as being different applications of some kind of noise to the real image, and average them to get a higher-resolution image.
A related problem is consensus sequence estimation in molecular biology. In this problem, each observation is a long sequence of characters drawn from a small alphabet; the noise model forms the observed DNA sequences from a common prototype by sequences of mutations, and that prototype is what we want to estimate. Another DNA single-point estimation problem is reconstruction of a whole sequence from fragments, as used in gene mapping systems.
Anyway, those examples have a very high dimension. I'm more interested in very low-dimensional cases, where the parameter values really are geometric coordinates.
In one dimension, there are basically two choices for an estimator: the mean and the median. The median is obviously much less sensitive to outliers. As far as I can tell, even for the most well-behaved error model (Gaussian noise), both the median and the mean have the same asymptotic accuracy (expected distance of the estimate from the true value): O(s/sqrt(n)), where s is the standard deviation of the Gaussian (the median's constant factor is somewhat larger, but the growth rate is the same). Further, the median has better invariance properties (it is distance-free). But it is not hard to come up with noise models where the mean is the better choice of the two. For instance, if you know nothing about the noise other than an a priori assumption that lower variances are more likely, then the mean is the max likelihood estimator (it minimizes the variance of the differences between the observations and the estimate, i.e. their mean squared difference, which is a convex quadratic function of the estimate, as can easily be seen by computing its derivative).
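A quick way to see both claims empirically (a minimal NumPy sketch; the true value is taken to be 0 and the noise scale is an assumed parameter, not something fixed by the discussion above):

    import numpy as np

    rng = np.random.default_rng(0)

    def avg_error(estimator, n, trials=2000, sigma=1.0):
        # average absolute error of the estimator when the true value is 0 and
        # each of the n observations carries Gaussian noise of s.d. sigma
        samples = rng.normal(0.0, sigma, size=(trials, n))
        return float(np.abs(estimator(samples, axis=1)).mean())

    for n in (10, 100, 1000):
        print(n, avg_error(np.mean, n), avg_error(np.median, n))
    # both errors shrink roughly like sigma/sqrt(n); the median's constant is larger

    # a single wild outlier can drag the mean arbitrarily far, but barely moves the median
    data = np.concatenate([rng.normal(0.0, 1.0, 99), [1e6]])
    print(np.mean(data), np.median(data))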
There are various ways of generalizing these ideas to higher dimensions:
Centroid
The most direct generalization of the mean is the centroid, the point whose coordinates are the averages of the corresponding coordinates of the observations; applying the one-dimensional argument in each coordinate shows that it minimizes the sum of squared distances to the observations. Therefore the centroid is the max likelihood estimator for settings with unknown noise and an a priori assumption that lower variance is more likely. It also inherits the accuracy of the one-dimensional mean for a Gaussian noise model (because Gaussian noise has the nice property that it can be generated by treating each coordinate independently).
There is nothing interesting to say about algorithms for computing centroids.
Fermat-Weber Point
More generally, one can define an Lp estimate to be one that minimizes the sum of pth powers of distances of observations from the estimate. The centroid is the L2 estimate; the next most commonly used estimate is the L1 estimate, which minimizes the sum of (unsquared) distances and is known as the Fermat-Weber point.
The uniqueness of any Lp estimate for p>1 follows since the pth power of distance from each observation is a strictly convex function, the sum of strictly convex functions is strictly convex, and a strictly convex function has at most one minimum. Unfortunately this argument doesn't quite work for p=1: if all the points are collinear, and there is an even number of points, then any point on the line segment between the two middle points has an equal sum of distances. But this is the only exception.
I'm not sure of a noise model for
which this is optimal, but since it uses a lower power of distance
it's less sensitive than the centroid to outliers.
It's also commonly used in management science for facility location, e.g. choosing a central store location that minimizes the total travel distance to a population of customers.
There is no good combinatorial algorithm for the Fermat-Weber point,
since its exact location is the solution to a high-degree polynomial
in the observations' coordinates
[B88, CM69].
However, gradient descent methods
converge very rapidly and give good numerical approximations to its
location [W37].
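A minimal sketch of one such iterative scheme, in the spirit of the classical reweighting iteration usually attributed to Weiszfeld (the starting point, iteration count, and tolerance below are arbitrary choices, not prescribed by anything above):

    import numpy as np

    def fermat_weber(points, iters=1000, eps=1e-12):
        # Weiszfeld-style iteration: starting from the centroid, repeatedly replace
        # the current estimate by an average of the observations weighted by
        # inverse distance.  The clamp avoids division by zero when the estimate
        # lands exactly on an observation.
        pts = np.asarray(points, dtype=float)
        x = pts.mean(axis=0)
        for _ in range(iters):
            d = np.maximum(np.linalg.norm(pts - x, axis=1), eps)
            w = 1.0 / d
            x_new = (w[:, None] * pts).sum(axis=0) / w.sum()
            if np.linalg.norm(x_new - x) < eps:
                break
            x = x_new
        return x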
Circumcenter
The limit as p goes to infinity of the pth root of the sum of pth powers of distances is just the maximum of the distances. And so, the limit of the Lp estimators is the Linfinity estimator, which minimizes the maximum distance from the estimate to any data point. If one draws a circle or sphere around the estimate, with radius equal to that distance, it contains all the data points, and no smaller circle or sphere contains the same points, so this estimate is also called the circumcenter.
This is extremely sensitive to outliers and usually does not provide a good estimator. But one can imagine conditions under which it would be the appropriate choice; e.g. if we know nothing about the noise other than that each noisy observation is within some (unknown) bounded radius of the true value, then the circumcenter is the most accurate estimator: its maximum distance from the true value is at most the noise radius, and no estimator can guarantee a better worst-case bound.
There exist several linear time algorithms for computing the circumcenter
of a collection of data points, based on linear-programming type techniques
(the problem is not itself a linear program, but can be solved with
much the same algorithms [G95]).
The current best time bounds have the form
O(d^2 n + f(d) log n) where f is some
subexponential function of the dimension and does not affect the main
term of the running time.
Bernd Gärtner
has a good and well-documented C++ implementation available
[G99]
which can solve problems with hundreds of thousands of points,
up to dimension 20 or 30.
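As a contrast to those exact methods, here is a much simpler and only approximate scheme (a sketch, unrelated to the LP-type algorithms or Gärtner's code): repeatedly move part of the way toward the farthest observation, with a step size that shrinks over time, so that the maximum distance to the observations approaches the optimal circumradius.

    import numpy as np

    def approx_circumcenter(points, steps=1000):
        # Simple iterative approximation of the center of the smallest enclosing
        # ball: start at an arbitrary observation and repeatedly move a 1/(k+1)
        # fraction of the way toward the farthest observation.  The maximum
        # distance from the returned point approaches the optimal circumradius as
        # the number of steps grows.
        pts = np.asarray(points, dtype=float)
        c = pts[0].copy()
        for k in range(1, steps + 1):
            farthest = pts[np.argmax(np.linalg.norm(pts - c, axis=1))]
            c = c + (farthest - c) / (k + 1)
        return c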
Centerpoint
Any point set has a center point, that is, a point x such that any halfspace that doesn't contain x contains at most a fraction d/(d+1) of the observations (that is, at most dn/(d+1) of them). Equivalently, any halfspace containing x contains at least n/(d+1) observations. For a proof, see [ABET98, section 2.5].
This provides a robust distance-free method of estimation: if at most n/(d+1)-1 of the observations are outliers chosen by an adversary to fool our algorithm, then its estimate must be within the convex hull of the remaining points and is therefore not strongly affected by those outliers.
Center points also have applications within computational geometry,
e.g. in finding graph separators, which have further
applications in mesh partitioning, intersection graph construction,
etc [EMT95].
For points in the plane, a centerpoint can be computed in linear
time, and for points in three dimensions the time is O(n polylog(n))
[Citations to be filled in later].
For higher dimensions, the best known time bound is much slower:
O(n^(d+1)). However, there exists a fast approximation
algorithm [Citation to be filled in later].
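None of those algorithms fits in a few lines, but the definition itself does. Here is a brute-force sketch for the planar case (points assumed in general position; the tie-breaking rotation angle is an arbitrary small constant) that computes the Tukey depth of a candidate point, i.e. the minimum number of observations in any closed halfplane containing it, and so can check whether a claimed centerpoint really meets the n/(d+1) bound:

    import numpy as np

    def tukey_depth_2d(x, points, tiny=1e-7):
        # Tukey (halfspace) depth of x: the minimum number of observations in any
        # closed halfplane containing x.  The minimizing halfplane can be assumed
        # to have x on its boundary, and only which observations it separates
        # matters, so it suffices to test directions perpendicular to each
        # observation-to-x vector, rotated by a tiny angle both ways to break ties.
        x = np.asarray(x, dtype=float)
        v = np.asarray(points, dtype=float) - x
        angles = np.arctan2(v[:, 1], v[:, 0])
        depth = len(v)
        for a in angles:
            for da in (tiny, -tiny):
                u = np.array([np.cos(a + np.pi / 2 + da), np.sin(a + np.pi / 2 + da)])
                for s in (1.0, -1.0):
                    depth = min(depth, int(np.sum(v @ (s * u) >= 0.0)))
        return depth

    def is_centerpoint_2d(x, points):
        # in the plane (d = 2), x is a centerpoint iff every closed halfplane
        # containing x contains at least n/3 of the observations
        return tukey_depth_2d(x, points) >= len(points) / 3.0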
Smallest subset methods (least median of squares)
If one takes as the primary aim robustness, one can do even better than the centerpoint. The maximum number of outliers one could hope to detect is floor((n-1)/2), because if more than half the observations could be outliers, there wouldn't be any way in principle to distinguish the outlying half of the data from the unperturbed half. But one can tolerate exactly this many outliers, if one uses a method that finds the subset of ceiling((n+1)/2) points that minimizes any of various functionals: e.g. the circumradius, variance, or convex hull area. The non-outlying observations form a small subset under each of these functionals, and any other subset must include at least one non-outlier and so can only be as small if it is also near the unperturbed data.
A good question for discussion: what functional is most appropriate to optimize (for what noise models)?
The version in which one minimizes the circumradius is also called the least median of squares, because it minimizes the median of the (squared) distances of the observations from the estimate.
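In one dimension this estimator is easy to compute exactly (a small NumPy sketch, using the subset size ceiling((n+1)/2) discussed above): sort the observations, find the shortest window of that many consecutive values, and return its midpoint.

    import numpy as np

    def lms_location_1d(data):
        # Least median of squares location estimate in one dimension: the midpoint
        # of the shortest interval containing ceil((n+1)/2) of the observations.
        # Half the length of that interval is the minimized "median" radius.
        x = np.sort(np.asarray(data, dtype=float))
        n = len(x)
        h = (n + 2) // 2                        # ceil((n+1)/2) points per window
        widths = x[h - 1:] - x[:n - h + 1]      # width of each window of h points
        i = int(np.argmin(widths))
        return (x[i] + x[i + h - 1]) / 2.0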
Various algorithms are known for these problems.
Most such work has gone into variants where the cardinality of the
selected subset is very small (in which case this is better viewed
as a form of clustering than of robust estimation)
[EE94]
or very large (only tolerant of a small number of outliers).
It is often possible to show that, for problems like this, the optimal solution consists of the k nearest neighbors of some estimated center point. For instance, in the least median of squares problem, the optimal subset consists of the k nearest neighbors of that subset's circumcenter. Similarly, the k-element subset minimizing the variance consists of the k points nearest that subset's centroid. Obviously, we don't know this center point in advance, or we wouldn't need to use an estimation algorithm. But one can use this idea to limit the number of subsets to examine.
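As a small illustration of this idea (a heuristic sketch only, not the exact method: here the candidate centers are restricted to the observations themselves, whereas the exact algorithms below enumerate the candidate neighborhoods via the order-k Voronoi diagram):

    import numpy as np

    def min_variance_subset(points, k):
        # Heuristic: by the structural fact above, the optimal subset is the set of
        # k nearest neighbors of some center point.  Trying only the observations
        # themselves as candidate centers may miss the optimum, but limits the
        # number of subsets examined to n.
        pts = np.asarray(points, dtype=float)
        best_subset, best_var = None, np.inf
        for c in pts:
            idx = np.argsort(np.linalg.norm(pts - c, axis=1))[:k]
            subset = pts[idx]
            var = np.mean(np.sum((subset - subset.mean(axis=0)) ** 2, axis=1))
            if var < best_var:
                best_var, best_subset = var, subset
        return best_subset, best_var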
The "order-k Voronoi diagram" essentially encapsulates the family of
possible subsets: it is defined as a partition of space into cells,
such that any two points in the same cell have the same set of k nearest
neighbors. In the plane, this diagram is known to have O(kn) cells,
and can be constructed in time
O(n log n + kn 2^O(log* n))
[R99].
The variance of any one cell can be updated from its neighbor in
constant time, so the minimum variance k-element subset can
be found in the same time bound -- nearly quadratic for the
most statistically relevant case, when k is roughly n/2.
By only computing these diagrams for certain subsets of points,
one can remove the n 2^O(log* n) factor
when k is a little smaller than n [EE94].
The same idea works for least median of squares, but is slower
because it's more difficult to update the circumradius as one moves
from cell to cell
[E92].
Instead, the fastest known algorithms
use parametric search (a complicated variant of binary search)
based on a decision algorithm which tests a given circle radius
by finding the largest subset of points that fit within that radius.
The decision problem can be solved by drawing a circle of the given
radius around each point, building the arrangement of these circles,
and computing how many circles surround each cell
(in constant time per cell, by adding or subtracting one from the value
of a neighboring cell).
The best time bounds (again, for k roughly n/2, with some improvement
possible for smaller k)
are O(n^2 log n) for d=2
and O(n^d log^2 n) for d>2
[EE94].
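For intuition, the decision subroutine can also be written as a simple brute force, at the cost of a worse running time than the arrangement-based method just described: if some circle of radius r contains at least two of the points, it can be shifted so that two of them lie on its boundary, so it suffices to try the candidate centers determined by pairs of points. A rough sketch:

    import numpy as np

    def max_points_in_radius(points, r):
        # Largest number of observations that fit in some circle of radius r
        # (brute force, roughly O(n^3)).  If the best circle contains >= 2 points,
        # it can be shifted so two of them lie on its boundary, so it suffices to
        # try, for each pair at distance <= 2r, the (at most two) centers of
        # radius-r circles passing through both points.
        pts = np.asarray(points, dtype=float)
        n = len(pts)
        best = 1 if n > 0 else 0
        for i in range(n):
            for j in range(i + 1, n):
                d2 = float(np.sum((pts[i] - pts[j]) ** 2))
                if d2 > 4.0 * r * r:
                    continue
                mid = (pts[i] + pts[j]) / 2.0
                h = np.sqrt(max(r * r - d2 / 4.0, 0.0))   # center offset from midpoint
                perp = np.array([pts[j][1] - pts[i][1], pts[i][0] - pts[j][0]])
                norm = np.linalg.norm(perp)
                if norm > 0:
                    perp = perp / norm
                for c in (mid + h * perp, mid - h * perp):
                    inside = int(np.sum(np.linalg.norm(pts - c, axis=1) <= r + 1e-9))
                    best = max(best, inside)
        return best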
What distance function to use?
All this has assumed that the distance between two points should be measured using the normal Euclidean distance, in which we square the difference in each coordinate, sum the squares, and take the square root of the sum. But it seems that this assumption should be based on our choice of noise model.
If we assume a priori some centrally symmetric noise distribution, or if we assume that the noise in each coordinate is an independent Gaussian (in which case the overall distribution ends up being centrally symmetric), then Euclidean distance makes sense.
Otherwise, it seems that we should choose the distance function
from p to q
to be determined by the probability that a data point actually at p
could be perturbed so that we measure it at q.
There is no reason to expect such a distance function to obey
the axioms of a metric or in any other way be well behaved.
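One way to make this concrete (a sketch only; the noise model's log density is an assumed input, not something specified here): define the "distance" from a hypothesized true point p to an observation q as the negative log likelihood of the perturbation q - p, so that minimizing the summed distances over the observations is exactly maximum-likelihood estimation. Isotropic Gaussian noise then recovers squared Euclidean distance.

    import numpy as np

    def noise_model_distance(p, q, log_density):
        # "distance" from a hypothesized true point p to an observation q under a
        # given noise model: the negative log density of the perturbation q - p.
        # Summing these over the observations and minimizing over p is then
        # exactly maximum-likelihood point estimation.
        return -log_density(np.asarray(q, dtype=float) - np.asarray(p, dtype=float))

    def gaussian_log_density(v, sigma=1.0):
        # isotropic Gaussian noise: up to a factor 1/(2 sigma^2) and an additive
        # constant, the resulting "distance" is the squared Euclidean distance
        v = np.asarray(v, dtype=float)
        return -np.dot(v, v) / (2.0 * sigma ** 2) - len(v) * np.log(sigma * np.sqrt(2.0 * np.pi))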
Perhaps the fact that they avoid having to choose a distance
function is one of the things that makes distance-free methods
such as the centerpoint attractive.
David Eppstein, Theory Group, Dept. Information & Computer Science, UC Irvine.
NEXT: Linear regression
Last update: