
An Algorithm for Finding Best Matches in Logarithmic Expected Time


Summary

Analysis of the Performance

  • The storage required for file organization is proportional to the file size, N. The discriminating key number and partition value must be stored for each nonterminal node of the k-d tree.
  • The computation required to build the k-d tree is easily derived.
  • At each level of the tree, the entire set of key values must be scanned.
  • The expected time performance of the search is not so easily derived.

Implementation

  • The above discussion has centered on the expected number of records examined as the sole criterion for performance evaluation of the algorithm.
  • The results in Figure 5 show that in two dimensions near-asymptotic behavior occurs even for files as small as 128 records.
  • The logarithmic behavior of the overall computation as the file size increases is illustrated for the k-d tree algorithm in Figure 5, except that for eight dimensions the increase is slightly faster; comparison of Figure 3 to Figure 5 isolates the computation involved in building the tree.
  • The fraction of computation spent on preprocessing decreases with increasing dimensionality.


SLAC-PUB-1549 (Rev.)
STAN-CS-75-482
February 1975; Revised December 1975; Revised July 1976
AN ALGORITHM FOR FINDING BEST MATCHES
IN LOGARITHMIC EXPECTED TIME
Jerome H. Friedman
Stanford Linear Accelerator Center
Stanford University, Stanford, Ca. 94305
Jon Louis Bentley
Department of Computer Science
University of North Carolina at Chapel Hill
Chapel Hill, N.C. 27514
Raphael Ari Finkel
Department of Computer Science
Stanford University, Stanford, Ca. 94305
ABSTRACT
An algorithm and data structure are presented for searching a file containing N records, each described by k real valued keys, for the m closest matches or nearest neighbors to a given query record. The computation required to organize the file is proportional to kNlogN. The expected number of records examined in each search is independent of the file size. The expected computation to perform each search is proportional to logN. Empirical evidence suggests that except for very small files, this algorithm is considerably faster than other methods.
(Submitted to ACM Transactions on Mathematical Software)
Work supported in part by the U.S. Energy Research and Development Administration under contract E(O43)515.

The Best Match or Nearest Neighbor Problem
The best match or nearest neighbor problem applies to data files that store records with several real valued keys or attributes. The problem is to find those records in the file most similar to a query record according to some dissimilarity or distance measure. Formally, given a file of N records (each of which is described by k real valued attributes) and a dissimilarity measure D, find the m closest records to a query record (possibly not in the file) with specified attribute values.
A data file, for example, might contain information on all cities with post offices. Associated with each city is its longitude and latitude. If a letter is addressed to a town without a post office, the closest town that has a post office might be chosen as the destination.
The solution to this problem is of use in many applications. Information retrieval might involve searching a catalog for those items most similar to a given query item; each item in the file would be cataloged by numerical attributes that describe its characteristics. Classification decisions can be made by selecting prototype features from each category and finding which of these prototypes is closest to the record to be classified. Multivariate density estimation can be performed by calculating the volume about a given point containing the closest m neighbors.
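The density-estimation use just mentioned can be made concrete: with V_m(x) the volume of the smallest ball about x containing its m nearest records, the estimate is f(x) ≈ m / (N · V_m(x)). A minimal 2-D sketch; the function name and the brute-force distance scan are illustrative choices, not the paper's:

```python
import math

def knn_density_estimate(points, query, m):
    """m-nearest-neighbor density estimate in 2-D:
    f(x) ~= m / (N * V_m(x)), where V_m(x) is the area of the
    smallest disc about the query containing its m nearest records."""
    n = len(points)
    # Distance from the query to every record (brute force for clarity).
    dists = sorted(math.dist(query, p) for p in points)
    r_m = dists[m - 1]                  # radius of the enclosing disc
    volume = math.pi * r_m ** 2         # area of a 2-D disc
    return m / (n * volume)
```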
Structures Used for Associative Searching
One straightforward technique for solving the best match or nearest neighbor problem is the cell method. The k-dimensional key space is divided into small, identically sized cells. A spiral search of the cells from any query record will find the best matches of that record. Although this procedure minimizes the number of records examined, it is extremely costly in space and time, especially when the dimensionality of the space is large.
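The cell method can be sketched as follows in two dimensions; the ring-by-ring scan here stands in for the spiral search, and all names are illustrative rather than from any published implementation:

```python
import math
from collections import defaultdict

def build_cells(points, cell_size):
    """Hash 2-D records into identically sized square cells."""
    cells = defaultdict(list)
    for p in points:
        key = (int(p[0] // cell_size), int(p[1] // cell_size))
        cells[key].append(p)
    return cells

def cell_search(cells, cell_size, query, m):
    """Examine cells in expanding rings about the query's cell until the
    m best matches found so far cannot be beaten by an unexamined ring."""
    qi, qj = int(query[0] // cell_size), int(query[1] // cell_size)
    best = []                        # (distance, point), kept sorted
    r = 0
    while True:
        # Visit every cell on the ring at Chebyshev radius r.
        for i in range(qi - r, qi + r + 1):
            for j in range(qj - r, qj + r + 1):
                if max(abs(i - qi), abs(j - qj)) != r:
                    continue
                for p in cells.get((i, j), ()):
                    best.append((math.dist(query, p), p))
        best.sort()
        best = best[:m]
        # Any record in ring r+1 lies at least r * cell_size away.
        if len(best) == m and best[-1][0] <= r * cell_size:
            return [p for _, p in best]
        r += 1
        if r > 10_000:               # safety bound for the sketch
            return [p for _, p in best]
```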

Burkhard and Keller [1] and later Fukunaga and Narendra [2] describe heuristic strategies based on clustering techniques. These strategies use the triangle inequality to eliminate some of the records from consideration while searching the file. Although no calculations of expected performance are presented, simulation experiments indicate that these techniques permit a substantial fraction of the records to be eliminated from consideration.
Friedman, Baskett, and Shustek [3] describe another strategy for solving the nearest neighbor problem. It involves forming a projection of the records onto one or more keys, keeping a linear list on those keys, and searching only those records that match closely enough on one of the keys. The method is applicable to a wide variety of dissimilarity measures and does not require that they satisfy the triangle inequality. They were able to show that the expected computation required to search the file with this method is proportional to km^(1/k)N^(1-1/k).
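The projection strategy can be sketched under the Euclidean measure, where the difference on the sorted key alone lower-bounds the full distance; this is a simplified illustration in the spirit of [3], not the authors' code:

```python
import bisect
import math

def projection_search(points, query, m):
    """Keep the records sorted on one key, scan outward from the query's
    position in the linear list, and stop once the key difference alone
    exceeds the current m-th best distance (a valid lower bound on the
    Euclidean distance)."""
    pts = sorted(points, key=lambda p: p[0])       # linear list on key 1
    keys = [p[0] for p in pts]
    i = bisect.bisect_left(keys, query[0])         # start near the query
    lo, hi = i - 1, i
    best = []                                      # (distance, point)
    while lo >= 0 or hi < len(pts):
        # Take whichever unscanned side is closer on the sorted key.
        if hi >= len(pts) or (lo >= 0 and
                              query[0] - keys[lo] <= keys[hi] - query[0]):
            p, lo = pts[lo], lo - 1
        else:
            p, hi = pts[hi], hi + 1
        best.append((math.dist(query, p), p))
        best.sort()
        best = best[:m]
        # Every remaining record differs from the query on key 1 by at
        # least next_gap, so its full distance is at least next_gap.
        next_gap = min(
            query[0] - keys[lo] if lo >= 0 else math.inf,
            keys[hi] - query[0] if hi < len(pts) else math.inf)
        if len(best) == m and best[-1][0] <= next_gap:
            break
    return [p for _, p in best]
```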
Rivest [4] shows the optimality of an algorithm due to Elias which deals with binary keys. That is, each key takes on only two values; the distance function applied is the Hamming distance.
Shamos [5] applies the Voronoi diagram (a general structure for searching the plane) to the best match problem for the special case of two keys per record (two dimensions) and the Euclidean distance measure. He presents two algorithms. One can search for best matches in worst case O[(logN)^2] time, after a file organization that requires storage proportional to N and computation proportional to NlogN. The other algorithm can perform the search in worst case O[logN] time, after a file organization that requires both storage and computation proportional to N^2. Unfortunately, these methods have not yet been generalized to higher dimensionalities or more general dissimilarity measures.
Finkel and Bentley [6] describe a tree structure, called the quad tree, for the storage of composite keys. It is a generalization of the binary tree for storing data on single keys. Bentley [7] develops a different generalization of the same one-dimensional structure; it is termed the k-d tree. In his article, Bentley suggests that k-d trees could be applied to the best match problem.
This paper introduces an optimized k-d tree algorithm for the problem of finding best matches. This data structure is very effective in partitioning the records in the file so that the average number of record examinations (1) involved in searching the file for best matches is quite small. This method can be applied with a wide variety of dissimilarity measures and does not require that they obey the triangle inequality. The storage required for file organization is proportional to N, while computation is proportional to kNlogN. For large files, the expected number of record examinations required for the search is shown to be independent of the file size, N. The time spent in descending the tree during the search is proportional to logN, so that the expected time required to search for best matches with this method is proportional to logN.
Definition of the k-d Tree
The k-d tree is a generalization of the simple binary tree used for sorting and searching. The k-d tree is a binary tree in which each node represents a subfile of the records in the file and a partitioning of that subfile. The root of the tree represents the entire file. Each nonterminal node has two sons or successor nodes. These successor nodes represent the two subfiles defined by the partitioning. The terminal nodes represent mutually exclusive small subsets of the data records, which collectively form a partition of the record space. These terminal subsets of records are called buckets.
In the case of one-dimensional searching, a record is represented by a single key and a partition is defined by some value of that key. All records in a subfile with key values less than or equal to the partition value belong to the left son, while those with a larger value belong to the right son. The key variable thus becomes a discriminator for assigning records to the two subfiles.
In k dimensions, a record is represented by k keys. Any one of these can serve as the discriminator for partitioning the subfile represented by a particular node in the tree; that is, the discriminating key number can range from 1 to k. The original k-d tree proposed by Bentley [7] chooses the discriminator for each node on the basis of its level in the tree; the discriminator for each level is obtained by cycling through the keys in order. That is,
D = L mod k + 1
where D is the discriminating key number for level L and the root node is defined to be at level zero. The partition values are chosen to be random key values in each particular subfile.
This paper deals with choosing both the discriminator and partition value for each subfile, as well as the bucket size, to minimize the expected cost of searching for nearest neighbors. This process yields what is termed an optimized k-d tree.
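The structure just defined, with the cyclic discriminator D = L mod k + 1 and terminal buckets, can be sketched as follows, together with a standard nearest-neighbor descent of the tree. The median partition value and all names here are illustrative choices for the sketch; the paper's optimized tree instead selects both the discriminator and the partition value to minimize expected search cost:

```python
import math

class Node:
    """One nonterminal node: a discriminating key number, a partition
    value, and the two subfiles (left holds keys <= value)."""
    def __init__(self, disc, value, left, right):
        self.disc, self.value, self.left, self.right = disc, value, left, right

def build(points, k, level=0, bucket_size=4):
    """Build a k-d tree with the cyclic discriminator D = L mod k + 1
    (stored 0-based here) and a median partition value."""
    if len(points) <= bucket_size:
        return points                          # terminal bucket
    disc = level % k                           # D = L mod k + 1, 0-based
    pts = sorted(points, key=lambda p: p[disc])
    mid = len(pts) // 2
    value = pts[mid - 1][disc]                 # records <= value go left
    left = [p for p in pts if p[disc] <= value]
    right = [p for p in pts if p[disc] > value]
    if not right:                              # all key values equal
        return pts
    return Node(disc, value, build(left, k, level + 1, bucket_size),
                build(right, k, level + 1, bucket_size))

def nearest(node, query, best=None):
    """Descend to the bucket containing the query, then unwind, visiting
    the far son only when the partition plane lies closer to the query
    than the best match found so far."""
    if isinstance(node, list):                 # a bucket: scan its records
        for p in node:
            d = math.dist(query, p)
            if best is None or d < best[0]:
                best = (d, p)
        return best
    near, far = ((node.left, node.right)
                 if query[node.disc] <= node.value
                 else (node.right, node.left))
    best = nearest(near, query, best)
    if abs(query[node.disc] - node.value) < best[0]:
        best = nearest(far, query, best)
    return best
```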


References

Journal ArticleDOI
TL;DR: The multidimensional binary search tree (or k-d tree) as a data structure for storage of information to be retrieved by associative searches is developed and it is shown to be quite efficient in its storage requirements.
Abstract: This paper develops the multidimensional binary search tree (or k-d tree, where k is the dimensionality of the search space) as a data structure for storage of information to be retrieved by associative searches. The k-d tree is defined and examples are given. It is shown to be quite efficient in its storage requirements. A significant advantage of this structure is that a single data structure can handle many types of queries very efficiently. Various utility algorithms are developed; their proven average running times in an n record file are: insertion, O(log n); deletion of the root, O(n(k-1)/k); deletion of a random node, O(log n); and optimization (guarantees logarithmic performance of searches), O(n log n). Search algorithms are given for partial match queries with t keys specified [proven maximum running time of O(n(k-t)/k)] and for nearest neighbor queries [empirically observed average running time of O(log n).] These performances far surpass the best currently known algorithms for these tasks. An algorithm is presented to handle any general intersection query. The main focus of this paper is theoretical. It is felt, however, that k-d trees could be quite useful in many applications, and examples of potential uses are given.

7,159 citations

Journal ArticleDOI
TL;DR: An optimized tree is defined and an algorithm to accomplish optimization in n log n time is presented; searching is guaranteed to be fast in optimized trees.
Abstract: The quad tree is a data structure appropriate for storing information to be retrieved on composite keys. We discuss the specific case of two-dimensional retrieval, although the structure is easily generalised to arbitrary dimensions. Algorithms are given both for straightforward insertion and for a type of balanced insertion into quad trees. Empirical analyses show that the average time for insertion is logarithmic with the tree size. An algorithm for retrieval within regions is presented along with data from empirical studies which imply that searching is reasonably efficient. We define an optimized tree and present an algorithm to accomplish optimization in n log n time. Searching is guaranteed to be fast in optimized trees. Remaining problems include those of deletion from quad trees and merging of quad trees, which seem to be inherently difficult operations.

2,048 citations

Journal ArticleDOI
TL;DR: The proof to be given is relatively simple, and the importance of this result can be measured in terms of the large amount of effort that has been put into finding efficient algorithms for constructing optimal binary decision trees.

1,014 citations


"An Algorithm for Finding Best Match..." refers background in this paper

  • ...Such an optimization is known to be NP-complete [7] and thus very likely of nonpolynomial time complexity....

    [...]

  • ...HYAYIL, L., Am) RZVEST, R.L. Constructing optimal binary decision trees is NP-complete....

    [...]

  • ...Such an optimization is known to be NP-complete [7] and thus very likely of nonpolynomial time complexity....

    [...]

Journal ArticleDOI
TL;DR: The method of branch and bound is implemented in the present algorithm to facilitate rapid calculation of the k-nearest neighbors, by eliminating the necesssity of calculating many distances.
Abstract: Computation of the k-nearest neighbors generally requires a large number of expensive distance computations. The method of branch and bound is implemented in the present algorithm to facilitate rapid calculation of the k-nearest neighbors, by eliminating the necessity of calculating many distances. Experimental results demonstrate the efficiency of the algorithm. Typically, an average of only 61 distance computations were made to find the nearest neighbor of a test sample among 1000 design samples.

776 citations