Journal Article

Some approaches to best-match file searching

TL;DR: Three file structures are presented together with their corresponding search algorithms, which are intended to reduce the number of comparisons required to achieve the desired result.
Abstract: The problem of searching the set of keys in a file to find a key which is closest to a given query key is discussed. After “closest,” in terms of a metric on the key space, is suitably defined, three file structures are presented together with their corresponding search algorithms, which are intended to reduce the number of comparisons required to achieve the desired result. These methods are derived using certain inequalities satisfied by metrics and by graph-theoretic concepts. Some empirical results are presented which compare the efficiency of the methods.
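As an illustration of how a metric inequality can cut down comparisons, here is a minimal Python sketch of a tree in the style that later literature calls a BK-tree (the "BKT" of the citing excerpts on this page). It is a sketch under stated assumptions, not the authors' own formulation: the Hamming metric, the names `BKNode` and `BKTree`, and the `comparisons` counter are all hypothetical choices made for the example. Each child edge is labelled with an integer distance to the parent key, and the triangle inequality is used to skip subtrees that cannot contain a closer key.

```python
from typing import Callable, Dict, Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Integer-valued metric on equal-length strings (assumed example metric)."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length keys")
    return sum(x != y for x, y in zip(a, b))


class BKNode:
    def __init__(self, key: str) -> None:
        self.key = key
        # children[i] is the subtree of keys at distance exactly i from self.key
        self.children: Dict[int, "BKNode"] = {}


class BKTree:
    def __init__(self, metric: Callable[[str, str], int]) -> None:
        self.metric = metric
        self.root: Optional[BKNode] = None
        self.comparisons = 0  # distance evaluations made during searches

    def insert(self, key: str) -> None:
        if self.root is None:
            self.root = BKNode(key)
            return
        node = self.root
        while True:
            d = self.metric(key, node.key)
            if d == 0:
                return  # key already stored
            child = node.children.get(d)
            if child is None:
                node.children[d] = BKNode(key)
                return
            node = child

    def best_match(self, query: str) -> Tuple[Optional[str], float]:
        """Return (a closest key, its distance to the query)."""
        best_key, best_d = None, float("inf")
        stack = [self.root] if self.root is not None else []
        while stack:
            node = stack.pop()
            d = self.metric(query, node.key)
            self.comparisons += 1
            if d < best_d:
                best_key, best_d = node.key, d
            for i, child in node.children.items():
                # Every key x under this child has d(x, node.key) == i, so the
                # triangle inequality gives d(query, x) >= |d - i|. Skip the
                # subtree when that bound cannot beat the best distance so far.
                if abs(d - i) < best_d:
                    stack.append(child)
        return best_key, best_d
```

After inserting a handful of bit-string keys, `best_match` returns a stored key at minimum Hamming distance from the query, and `comparisons` shows how many keys were actually compared rather than skipped by the bound.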
Citations
Proceedings Article
23 May 1998
TL;DR: In this paper, the authors present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces, for data sets of size n living in R^d, which require space that is only polynomial in n and d.
Abstract: We present two algorithms for the approximate nearest neighbor problem in high-dimensional spaces. For data sets of size n living in R^d, the algorithms require space that is only polynomial in n and d, while achieving query times that are sub-linear in n and polynomial in d. We also show applications to other high-dimensional geometric problems, such as the approximate minimum spanning tree. The article is based on the material from the authors' STOC'98 and FOCS'01 papers. It unifies, generalizes and simplifies the results from those papers.

4,478 citations

Journal Article
TL;DR: An algorithm and data structure are presented for searching a file containing N records, each described by k real valued keys, for the m closest matches or nearest neighbors to a given query record.
Abstract: An algorithm and data structure are presented for searching a file containing N records, each described by k real valued keys, for the m closest matches or nearest neighbors to a given query record. The computation required to organize the file is proportional to kN log N. The expected number of records examined in each search is independent of the file size. The expected computation to perform each search is proportional to log N. Empirical evidence suggests that except for very small files, this algorithm is considerably faster than other methods.

2,910 citations


Cites background from "Some approaches to best-match file ..."

  • ...Burkhard and Keller [2] and later Fukunaga and Narendra [6] described heuristic strategies based on clustering techniques....

    [...]

Journal Article
TL;DR: A novel neural network architecture is proposed that integrates feature extraction, sequence modeling and transcription into a unified framework, and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks.
Abstract: Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.

2,184 citations

Journal Article
TL;DR: A unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework, and presents a quantitative definition of the elusive concept of "intrinsic dimensionality".
Abstract: The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of "intrinsic dimensionality." We also present a unified view of all the known proposals to organize metric spaces, so as to be able to understand them under a common framework. Most approaches turn out to be variations on a few different concepts. We organize those works in a taxonomy that allows us to devise new algorithms from combinations of concepts not noticed before because of the lack of communication between different communities. We present experiments validating our results and comparing the existing approaches. We finish with recommendations for practitioners and open questions for future development.

1,337 citations


Cites background or methods or result from "Some approaches to best-match file ..."

  • ...For example, in BKTs and FQTs we can begin at the root and measure i = d(p, q)....

    [...]

  • ...On the right, the first level of a BKT with u11 as root....

    [...]

  • ...BKT....

    [...]

  • ...The same effect would be obtained if we had a mixture between BKTs and FQTs, so that for k levels we had fixed keys per level, and then we allowed a different key per node of the level k + 1, continuing the process recursively on each subtree of the level k + 1....

    [...]

  • ...Note that, historically, FQTs and FHQTs are an evolution over BKTs. 8.2....

    [...]

Proceedings Article
01 Jan 1993
TL;DR: The vp-tree (vantage point tree) is introduced in several forms, together with associated algorithms, as an improved method for these difficult search problems in general metric spaces.
Abstract: We consider the computational problem of finding nearest neighbors in general metric spaces. Of particular interest are spaces that may not be conveniently embedded or approximated in Euclidian space, or where the dimensionality of a Euclidian representation is very high. Also relevant are high-dimensional Euclidian settings in which the distribution of data is in some sense of lower dimension and embedded in the space. The vp-tree (vantage point tree) is introduced in several forms, together with associated algorithms, as an improved method for these difficult search problems. Tree construction executes in O(n log(n)) time, and search is under certain circumstances and in the limit, O(log(n)) expected time. The theoretical basis for this approach is developed and the results of several experiments are reported. In Euclidian cases, kd-tree performance is compared.
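The vantage point idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's algorithm: the vantage point is picked at random, the split radius is the median distance to it, and the names `VPNode`, `build`, and `nearest` are hypothetical. Only the metric's triangle inequality is used to decide whether the far side of a split still needs to be visited.

```python
import math
import random
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Metric = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    return math.dist(a, b)


class VPNode:
    def __init__(self, point: Point, radius: float,
                 inside: Optional["VPNode"], outside: Optional["VPNode"]) -> None:
        self.point = point      # the vantage point stored at this node
        self.radius = radius    # split distance from the vantage point
        self.inside = inside    # points with distance <  radius
        self.outside = outside  # points with distance >= radius


def build(points: Sequence[Point], metric: Metric,
          rng: random.Random) -> Optional[VPNode]:
    if not points:
        return None
    rest = list(points)
    vantage = rest.pop(rng.randrange(len(rest)))  # assumed: random vantage point
    if not rest:
        return VPNode(vantage, 0.0, None, None)
    dists = [metric(vantage, p) for p in rest]
    radius = sorted(dists)[len(dists) // 2]  # median distance to the vantage point
    inside = [p for p, d in zip(rest, dists) if d < radius]
    outside = [p for p, d in zip(rest, dists) if d >= radius]
    return VPNode(vantage, radius,
                  build(inside, metric, rng), build(outside, metric, rng))


def nearest(node: Optional[VPNode], query: Point, metric: Metric,
            best: Tuple[Optional[Point], float] = (None, math.inf)
            ) -> Tuple[Optional[Point], float]:
    """Return (nearest stored point, distance), pruning with the triangle inequality."""
    if node is None:
        return best
    d = metric(query, node.point)
    if d < best[1]:
        best = (node.point, d)
    if d < node.radius:
        best = nearest(node.inside, query, metric, best)
        # Outside points are at least radius - d away from the query.
        if d + best[1] >= node.radius:
            best = nearest(node.outside, query, metric, best)
    else:
        best = nearest(node.outside, query, metric, best)
        # Inside points are more than d - radius away from the query.
        if d - best[1] < node.radius:
            best = nearest(node.inside, query, metric, best)
    return best
```

A call such as `nearest(build(points, euclidean, random.Random(0)), query, euclidean)` returns the same distance as a linear scan; the recursive form is kept for readability and is only suitable for small inputs.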

1,145 citations


Cites background from "Some approaches to best-match file ..."

  • ...The ZPS distribution restriction is key to achieving them; and our overall outlook in which finite cases are imagined to be drawn from a larger more continuous space, distinguishes in part this work from the discrete distance setting of [7, 11]....

    [...]

  • ...This work is thus highly related to the constructions of [7]....

    [...]

  • ...Burkhard and Keller in [7] present three file structures for nearest neighbor retrieval....

    [...]

References
Book
01 Jan 1969
TL;DR: The aim of this book is to seek general results from the close study of abstract versions of devices known as perceptrons.
Abstract: Cambridge, Mass.: MIT Press, 1972. 2nd ed. The book's aim is to seek general results from the close study of abstract versions of devices known as perceptrons.

3,004 citations


"Some approaches to best-match file ..." refers background in this paper

  • ...The problem has been discussed in [3], but no solutions proposed....

    [...]

  • ...Minsky and Papert [3] refer to this as the "best match" problem and comment on its...

    [...]

  • ...One use concerns keys which are possible outcomes of tests in large switching networks, such as the Bell System No. 1 ESS [3]....

    [...]

Book
01 Jun 1981
TL;DR: The Revised Edition of Shift Register Sequences contains a comprehensive bibliography of some 400 entries which cover the literature concerning the theory and applications of shift register sequences.
Abstract: From the Publisher: Shift register sequences are used in a broad range of applications, particularly in random number generation, multiple access and polling techniques, secure and privacy communication systems, error detecting and correcting codes, and synchronization pattern generation, as well as in modern cryptographic systems. The first edition of Shift Register Sequences, published in 1967, has been for many years the definitive work on this subject. In the revised edition, Dr. Golomb has added valuable supplemental material. The Revised Edition contains a comprehensive bibliography of some 400 entries which cover the literature concerning the theory and applications of shift register sequences. Written in a clear and lucid style, Dr. Golomb's approach is completely mathematical with rigorous proofs of all assertions. The proofs, however, may be omitted without loss of continuity by the reader who is interested only in results. Dr. Golomb is considered one of the foremost experts in the world with respect to combinatorial and geometrical aspects of coded communications.

2,501 citations

Journal Article
TL;DR: Several graph theoretic cluster techniques aimed at the automatic generation of thesauri for information retrieval systems are explored and two algorithms have been tested that find maximal complete subgraphs.
Abstract: Several graph theoretic cluster techniques aimed at the automatic generation of thesauri for information retrieval systems are explored. Experimental cluster analysis is performed on a sample corpus of 2267 documents. A term-term similarity matrix is constructed for the 3950 unique terms used to index the documents. Various threshold values, T, are applied to the similarity matrix to provide a series of binary threshold matrices. The corresponding graph of each binary threshold matrix is used to obtain the term clusters. Three definitions of a cluster are analyzed: (1) the connected components of the threshold matrix; (2) the maximal complete subgraphs of the connected components of the threshold matrix; (3) clusters of the maximal complete subgraphs of the threshold matrix, as described by Gotlieb and Kumar. Algorithms are described and analyzed for obtaining each cluster type. The algorithms are designed to be useful for large document and index collections. Two algorithms have been tested that find maximal complete subgraphs. An algorithm developed by Bierstone offers a significant time improvement over one suggested by Bonner. For threshold levels T ≥ 0.6, basically the same clusters are developed regardless of the cluster definition used. In such situations one need only find the connected components of the graph to develop the clusters.
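The first cluster definition, connected components of a thresholded similarity matrix, can be sketched as follows. This is an illustrative Python sketch, not the paper's algorithm: the function name `threshold_components`, the convention that an edge exists when the similarity is at least T, and the small example matrix are all assumptions made for the example.

```python
from typing import List, Set


def threshold_components(sim: List[List[float]], t: float) -> List[Set[int]]:
    """Connected components of the graph with an edge i-j wherever sim[i][j] >= t.

    `sim` is assumed to be a symmetric term-term similarity matrix.
    """
    n = len(sim)
    seen = [False] * n
    components: List[Set[int]] = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        component = {start}
        stack = [start]
        while stack:  # depth-first walk over the implicit threshold graph
            i = stack.pop()
            for j in range(n):
                if j != i and not seen[j] and sim[i][j] >= t:
                    seen[j] = True
                    component.add(j)
                    stack.append(j)
        components.append(component)
    return components


# Hypothetical 4-term similarity matrix, for illustration only.
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
clusters = threshold_components(sim, 0.6)  # terms {0, 1} and {2, 3}
```

Raising the threshold splits clusters apart and lowering it merges them, which is the effect of sweeping T over the similarity matrix.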

241 citations


"Some approaches to best-match file ..." refers methods in this paper

  • ...An algorithm known as the Bierstone algorithm for computing the set of all cliques of an undirected graph is given in [8, 9]....

    [...]

Journal Article
Robert Morris
TL;DR: The article gives a tutorial presentation of the known methods used by those who write assemblers and compilers to reduce search times in symbol tables.
Abstract: From time to time one encounters an article that summarizes a new field of research, highlights its main results, and makes them clearer. Morris's article is of this type. The article gives a tutorial presentation of the known methods used by those who write assemblers and compilers to reduce search times in symbol tables.

218 citations

Journal Article
TL;DR: Counterexamples to Augustson and Minker's version of the Bierstone algorithm for finding the set of cliques of a finite undirected linear graph are presented, together with a modified version of the algorithm.
Abstract: Recently Augustson and Minker presented a version of the Bierstone algorithm for finding the set of cliques of a finite undirected linear graph. Their version contains two errors. In this paper the counterexamples to their version and the modified version of the Bierstone algorithm are presented.

75 citations


"Some approaches to best-match file ..." refers methods in this paper

  • ...An algorithm known as the Bierstone algorithm for computing the set of all cliques of an undirected graph is given in [8, 9]....

    [...]