Closest Pair Data Structures: Methods

We have designed, analyzed, and implemented several different closest pair data structures. See the papers or source code for more detail. We analyze the times in terms of two parameters: n, the number of objects in the set, and Q, the time per operation to perform insertions, deletions, and nearest neighbor queries in a dynamic set of objects. Our implementations use trivial nearest neighbor searching (just examine the distances to each objects), for which Q=O(n).

Previously known were:

Brute force. This simply maintains a list of objects, and finds closest pairs by examining all pairs of objects. Insertions and deletions are easy, and space is linear, but queries take time O(nQ) each.
Neighbor heuristic. We store the nearest neighbor to each object. Each insertion takes linear time, but deletions require recomputing neighbors for any object having the deleted object as its neighbor, and may take as much as O(nQ) time. Space is linear. In many applications, deleted objects are unlikely to have high degree, and the empirically observed time per operation is linear or close to linear.
Priority queue (not implemented). We store a priority queue data structure (such as a binary heap) containing all the entries in the distance matrix. Space is quadratic, but worst case time per update is O(n log n).

New closest pair data structures:

Quadtree. We group the entries of the adjacency matrix into 2x2 squares, compute the maximum in each square, interpret these maxima as the distances for a smaller set of n/2 objects, and continue recursively. Space is quadratic, but the worst-case time per operation is linear (optimal unless one assumes some further knowledge about the distance function). This would be the likely method of choice for relatively few objects with very expensive distance computations.
Conga line. We partition the objects into O(log n) subsets and maintain a graph in each subset, such that the closest pair is guaranteed to correspond to an edge in the graph. Each insertion creates a new subset for the new object; each deletion may move an object from each existing subset to a new subset. In each case, if necessary some pair of subsets is merged to maintain the desired number of subsets. Amortized time per insertion is O(Q log n); amortized time per deletion is O(Q log² n). Space is linear.
MultiConga. This is a simplification of the conga line data structure in which we allow the number of subsets to grow arbitrarily, and never merge pairs of subsets. The time per insertion is O(Q); amortized time per deletion is O(Q sqrt n). In practice this is faster than conga lines by a factor of three or more, usually roughly comparible to the neighbor heuristic, and fast even for problems that cause the neighbor heuristic to blow up.
FastPair. We further simplify conga lines by making separate singleton subsets for the objects moved to new subsets by a deletion. This can alternately be viewed as a modification to the neighbor heuristic, in which the initial construction of all nearest neighbors is replaced by a conga line computation, and in which each insertion does not update previously computed neighbors. Its time both theoretically and in practice is qualitatively similar to the neighbor heuristic, but it typically runs 30% or so faster.

David Eppstein, Information & Computer Science, UC Irvine, .