
An Algorithm for Finding Best Matches in Logarithmic Expected Time


Summary

Analysis of the Performance

  • The storage required for file organization is proportional to the file size, N. The discriminating key number and partition value must be stored for each nonterminal node of the k-d tree.
  • The computation required to build the k-d tree is easily derived.
  • At each level of the tree, the entire set of key values must be scanned.
  • The expected time performance of the search is not so easily derived.

Implementation

  • The above discussion has centered on the expected number of records examined as the sole criterion for performance evaluation of the algorithm.
  • The results in Figure 5 show that in two dimensions near-asymptotic behavior occurs even for files as small as 128 records.
  • The logarithmic behavior of the overall computation as the file size increases is illustrated for the k-d tree algorithm in Figure 5, except that for eight dimensions the increase is slightly faster; comparison of Figure 3 to Figure 5 isolates the computation involved in building the tree.
  • The fraction of computation spent on preprocessing decreases with increasing dimensionality.


SLAC-PUB-1549 (Rev.)
STAN-CS-75-482
February 1975; Revised December 1975; Revised July 1976
AN ALGORITHM FOR FINDING BEST MATCHES
IN LOGARITHMIC EXPECTED TIME
Jerome H. Friedman
Stanford Linear Accelerator Center
Stanford University, Stanford, Ca. 94305
Jon Louis Bentley
Department of Computer Science
University of North Carolina at Chapel Hill
Chapel Hill, N.C. 27514
Raphael Ari Finkel
Department of Computer Science
Stanford University, Stanford, Ca. 94305
ABSTRACT
An algorithm and data structure are presented for searching a file containing N records, each described by k real valued keys, for the m closest matches or nearest neighbors to a given query record. The computation required to organize the file is proportional to kNlogN. The expected number of records examined in each search is independent of the file size. The expected computation to perform each search is proportional to logN. Empirical evidence suggests that except for very small files, this algorithm is considerably faster than other methods.
(Submitted to ACM Transactions on Mathematical Software)
Work supported in part by the U.S. Energy Research and Development Administration under contract E(O43)515.

The Best Match or Nearest Neighbor Problem
The best match or nearest neighbor problem applies to data files that store records with several real valued keys or attributes. The problem is to find those records in the file most similar to a query record according to some dissimilarity or distance measure. Formally, given a file of N records (each of which is described by k real valued attributes) and a dissimilarity measure D, find the m closest records to a query record (possibly not in the file) with specified attribute values.
A data file, for example, might contain information on all cities with post offices. Associated with each city is its longitude and latitude. If a letter is addressed to a town without a post office, the closest town that has a post office might be chosen as the destination.
The solution to this problem is of use in many applications. Information retrieval might involve searching a catalog for those items most similar to a given query item; each item in the file would be cataloged by numerical attributes that describe its characteristics. Classification decisions can be made by selecting prototype features from each category and finding which of these prototypes is closest to the record to be classified. Multivariate density estimation can be performed by calculating the volume about a given point containing the closest m neighbors.
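The density-estimation use just mentioned can be made concrete: with V_m(x) the volume of the smallest ball about x containing its m nearest records, the estimate is f(x) ≈ m / (N · V_m(x)). A minimal 2-D sketch; the function name and the brute-force distance scan are illustrative choices, not the paper's:

```python
import math

def knn_density_estimate(points, query, m):
    """m-nearest-neighbor density estimate in 2-D:
    f(x) ~= m / (N * V_m(x)), where V_m(x) is the area of the
    smallest disc about the query containing its m nearest records."""
    n = len(points)
    # Distance from the query to every record (brute force for clarity).
    dists = sorted(math.dist(query, p) for p in points)
    r_m = dists[m - 1]                  # radius of the enclosing disc
    volume = math.pi * r_m ** 2         # area of a 2-D disc
    return m / (n * volume)
```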
Structures Used for Associative Searching
One straightforward technique for solving the best match or nearest neighbor problem is the cell method. The k-dimensional key space is divided into small, identically sized cells. A spiral search of the cells from any query record will find the best matches of that record. Although this procedure minimizes the number of records examined, it is extremely costly in space and time, especially when the dimensionality of the space is large.
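The cell method can be sketched as follows in two dimensions; the ring-by-ring scan here stands in for the spiral search, and all names are illustrative rather than from any published implementation:

```python
import math
from collections import defaultdict

def build_cells(points, cell_size):
    """Hash 2-D records into identically sized square cells."""
    cells = defaultdict(list)
    for p in points:
        key = (int(p[0] // cell_size), int(p[1] // cell_size))
        cells[key].append(p)
    return cells

def cell_search(cells, cell_size, query, m):
    """Examine cells in expanding rings about the query's cell until the
    m best matches found so far cannot be beaten by an unexamined ring."""
    qi, qj = int(query[0] // cell_size), int(query[1] // cell_size)
    best = []                        # (distance, point), kept sorted
    r = 0
    while True:
        # Visit every cell on the ring at Chebyshev radius r.
        for i in range(qi - r, qi + r + 1):
            for j in range(qj - r, qj + r + 1):
                if max(abs(i - qi), abs(j - qj)) != r:
                    continue
                for p in cells.get((i, j), ()):
                    best.append((math.dist(query, p), p))
        best.sort()
        best = best[:m]
        # Any record in ring r+1 lies at least r * cell_size away.
        if len(best) == m and best[-1][0] <= r * cell_size:
            return [p for _, p in best]
        r += 1
        if r > 10_000:               # safety bound for the sketch
            return [p for _, p in best]
```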

Burkhard and Keller [1] and later Fukunaga and Narendra [2] describe heuristic strategies based on clustering techniques. These strategies use the triangle inequality to eliminate some of the records from consideration while searching the file. Although no calculations of expected performance are presented, simulation experiments indicate that these techniques permit a substantial fraction of the records to be eliminated from consideration.
Friedman, Baskett, and Shustek [3] describe another strategy for solving the nearest neighbor problem. It involves forming a projection of the records onto one or more keys, keeping a linear list on those keys, and searching only those records that match closely enough on one of the keys. The method is applicable to a wide variety of dissimilarity measures and does not require that they satisfy the triangle inequality. They were able to show that the expected computation required to search the file with this method is proportional to km^(1/k)N^(1-1/k).
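The projection strategy can be sketched under the Euclidean measure, where the difference on the sorted key alone lower-bounds the full distance; this is a simplified illustration in the spirit of [3], not the authors' code:

```python
import bisect
import math

def projection_search(points, query, m):
    """Keep the records sorted on one key, scan outward from the query's
    position in the linear list, and stop once the key difference alone
    exceeds the current m-th best distance (a valid lower bound on the
    Euclidean distance)."""
    pts = sorted(points, key=lambda p: p[0])       # linear list on key 1
    keys = [p[0] for p in pts]
    i = bisect.bisect_left(keys, query[0])         # start near the query
    lo, hi = i - 1, i
    best = []                                      # (distance, point)
    while lo >= 0 or hi < len(pts):
        # Take whichever unscanned side is closer on the sorted key.
        if hi >= len(pts) or (lo >= 0 and
                              query[0] - keys[lo] <= keys[hi] - query[0]):
            p, lo = pts[lo], lo - 1
        else:
            p, hi = pts[hi], hi + 1
        best.append((math.dist(query, p), p))
        best.sort()
        best = best[:m]
        # Every remaining record differs from the query on key 1 by at
        # least next_gap, so its full distance is at least next_gap.
        next_gap = min(
            query[0] - keys[lo] if lo >= 0 else math.inf,
            keys[hi] - query[0] if hi < len(pts) else math.inf)
        if len(best) == m and best[-1][0] <= next_gap:
            break
    return [p for _, p in best]
```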
Rivest [4] shows the optimality of an algorithm due to Elias which deals with binary keys. That is, each key takes on only two values; the distance function applied is the Hamming distance.
Shamos [5] applies the Voronoi diagram (a general structure for searching the plane) to the best match problem for the special case of two keys per record (two dimensions) and the Euclidean distance measure. He presents two algorithms. One can search for best matches in worst case O[(logN)^2] time, after a file organization that requires storage proportional to N and computation proportional to NlogN. The other algorithm can perform the search in worst case O[logN] time, after a file organization that requires both storage and computation proportional to N^2. Unfortunately, these methods have not yet been generalized to higher dimensionalities or more general dissimilarity measures.
Finkel and Bentley [6] describe a tree structure, called the quad tree, for the storage of composite keys. It is a generalization of the binary tree for storing data on single keys. Bentley [7] develops a different generalization of the same one-dimensional structure; it is termed the k-d tree. In his article, Bentley suggests that k-d trees could be applied to the best match problem.
This paper introduces an optimized k-d tree algorithm for the problem of finding best matches. This data structure is very effective in partitioning the records in the file so that the average number of record examinations (1) involved in searching the file for best matches is quite small. This method can be applied with a wide variety of dissimilarity measures and does not require that they obey the triangle inequality. The storage required for file organization is proportional to N, while computation is proportional to kNlogN. For large files, the expected number of record examinations required for the search is shown to be independent of the file size, N. The time spent in descending the tree during the search is proportional to logN, so that the expected time required to search for best matches with this method is proportional to logN.
Definition of the k-d Tree
The k-d tree is a generalization of the simple binary tree used for sorting and searching. The k-d tree is a binary tree in which each node represents a subfile of the records in the file and a partitioning of that subfile. The root of the tree represents the entire file. Each nonterminal node has two sons or successor nodes. These successor nodes represent the two subfiles defined by the partitioning. The terminal nodes represent mutually exclusive small subsets of the data records, which collectively form a partition of the record space. These terminal subsets of records are called buckets.
In the case of one-dimensional searching, a record is represented by a single key and a partition is defined by some value of that key. All records in a subfile with key values less than or equal to the partition value belong to the left son, while those with a larger value belong to the right son. The key variable thus becomes a discriminator for assigning records to the two subfiles.
In k dimensions, a record is represented by k keys. Any one of these can serve as the discriminator for partitioning the subfile represented by a particular node in the tree; that is, the discriminating key number can range from 1 to k. The original k-d tree proposed by Bentley [7] chooses the discriminator for each node on the basis of its level in the tree; the discriminator for each level is obtained by cycling through the keys in order. That is,
D = L mod k + 1
where D is the discriminating key number for level L and the root node is defined to be at level zero. The partition values are chosen to be random key values in each particular subfile.
This paper deals with choosing both the discriminator and partition value for each subfile, as well as the bucket size, to minimize the expected cost of searching for nearest neighbors. This process yields what is termed an optimized k-d tree.
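The structure just defined, with the cyclic discriminator D = L mod k + 1 and terminal buckets, can be sketched as follows, together with a standard nearest-neighbor descent of the tree. The median partition value and all names here are illustrative choices for the sketch; the paper's optimized tree instead selects both the discriminator and the partition value to minimize expected search cost:

```python
import math

class Node:
    """One nonterminal node: a discriminating key number, a partition
    value, and the two subfiles (left holds keys <= value)."""
    def __init__(self, disc, value, left, right):
        self.disc, self.value, self.left, self.right = disc, value, left, right

def build(points, k, level=0, bucket_size=4):
    """Build a k-d tree with the cyclic discriminator D = L mod k + 1
    (stored 0-based here) and a median partition value."""
    if len(points) <= bucket_size:
        return points                          # terminal bucket
    disc = level % k                           # D = L mod k + 1, 0-based
    pts = sorted(points, key=lambda p: p[disc])
    mid = len(pts) // 2
    value = pts[mid - 1][disc]                 # records <= value go left
    left = [p for p in pts if p[disc] <= value]
    right = [p for p in pts if p[disc] > value]
    if not right:                              # all key values equal
        return pts
    return Node(disc, value, build(left, k, level + 1, bucket_size),
                build(right, k, level + 1, bucket_size))

def nearest(node, query, best=None):
    """Descend to the bucket containing the query, then unwind, visiting
    the far son only when the partition plane lies closer to the query
    than the best match found so far."""
    if isinstance(node, list):                 # a bucket: scan its records
        for p in node:
            d = math.dist(query, p)
            if best is None or d < best[0]:
                best = (d, p)
        return best
    near, far = ((node.left, node.right)
                 if query[node.disc] <= node.value
                 else (node.right, node.left))
    best = nearest(near, query, best)
    if abs(query[node.disc] - node.value) < best[0]:
        best = nearest(far, query, best)
    return best
```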


References

Journal ArticleDOI
TL;DR: The multidimensional binary search tree (or k-d tree) as a data structure for storage of information to be retrieved by associative searches is developed and it is shown to be quite efficient in its storage requirements.
Abstract: This paper develops the multidimensional binary search tree (or k-d tree, where k is the dimensionality of the search space) as a data structure for storage of information to be retrieved by associative searches. The k-d tree is defined and examples are given. It is shown to be quite efficient in its storage requirements. A significant advantage of this structure is that a single data structure can handle many types of queries very efficiently. Various utility algorithms are developed; their proven average running times in an n record file are: insertion, O(log n); deletion of the root, O(n(k-1)/k); deletion of a random node, O(log n); and optimization (guarantees logarithmic performance of searches), O(n log n). Search algorithms are given for partial match queries with t keys specified [proven maximum running time of O(n(k-t)/k)] and for nearest neighbor queries [empirically observed average running time of O(log n).] These performances far surpass the best currently known algorithms for these tasks. An algorithm is presented to handle any general intersection query. The main focus of this paper is theoretical. It is felt, however, that k-d trees could be quite useful in many applications, and examples of potential uses are given.

7,159 citations

Journal ArticleDOI
TL;DR: An optimized tree is defined and an algorithm to accomplish optimization in n log n time is presented; searching is guaranteed to be fast in optimized trees.
Abstract: The quad tree is a data structure appropriate for storing information to be retrieved on composite keys. We discuss the specific case of two-dimensional retrieval, although the structure is easily generalised to arbitrary dimensions. Algorithms are given both for straightforward insertion and for a type of balanced insertion into quad trees. Empirical analyses show that the average time for insertion is logarithmic with the tree size. An algorithm for retrieval within regions is presented along with data from empirical studies which imply that searching is reasonably efficient. We define an optimized tree and present an algorithm to accomplish optimization in n log n time. Searching is guaranteed to be fast in optimized trees. Remaining problems include those of deletion from quad trees and merging of quad trees, which seem to be inherently difficult operations.

2,048 citations

Journal ArticleDOI
TL;DR: The proof to be given is relatively simple, and the importance of this result can be measured in terms of the large amount of effort that has been put into finding efficient algorithms for constructing optimal binary decision trees.

1,014 citations


"An Algorithm for Finding Best Match..." refers background in this paper

  • ...Such an optimization is known to be NP-complete [7] and thus very likely of nonpolynomial time complexity....

    [...]

  • ...HYAYIL, L., Am) RZVEST, R.L. Constructing optimal binary decision trees is NP-complete....

    [...]

  • ...Such an optimization is known to be NP-complete [7] and thus very likely of nonpolynomial time complexity....

    [...]

Journal ArticleDOI
TL;DR: The method of branch and bound is implemented in the present algorithm to facilitate rapid calculation of the k-nearest neighbors, by eliminating the necesssity of calculating many distances.
Abstract: Computation of the k-nearest neighbors generally requires a large number of expensive distance computations. The method of branch and bound is implemented in the present algorithm to facilitate rapid calculation of the k-nearest neighbors, by eliminating the necessity of calculating many distances. Experimental results demonstrate the efficiency of the algorithm. Typically, an average of only 61 distance computations were made to find the nearest neighbor of a test sample among 1000 design samples.

776 citations